Phase 8 · Chapter 8.02

Canary Deployment

In coal mines, canary birds gave an early warning of poisonous gas. In software, a canary means giving the new version to a small percentage of users so the "poison" is caught early.

Concept

Big bang ≠ deploy

If v2 goes straight to 100% of users, a bug means a full outage. With a canary, v2 first gets 1% of traffic; if the metrics hold, it moves to 10%, then 50%, and finally 100%. Every step is guarded by an automatic analysis gate.
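The staged rollout described above can be sketched in a few lines of Python. This is only an illustration of the control flow: `run_canary`, `set_weight`, and `gate_passes` are hypothetical names, and a real setup would delegate both callbacks to the ingress controller and the metrics backend.

```python
from typing import Callable, Iterable


def run_canary(
    steps: Iterable[int],
    set_weight: Callable[[int], None],
    gate_passes: Callable[[int], bool],
) -> bool:
    """Walk through the traffic steps and roll back on the first failed gate.

    Returns True if every step passed, False if the rollout was rolled back.
    """
    for weight in steps:
        set_weight(weight)           # send this % of traffic to v2
        if not gate_passes(weight):  # automatic analysis gate for this step
            set_weight(0)            # roll back: all traffic returns to v1
            return False
    return True


if __name__ == "__main__":
    history = []
    # Hypothetical gate: metrics stay healthy until v2 receives 50% of traffic.
    ok = run_canary([1, 10, 50, 100], history.append, lambda w: w < 50)
    print(ok, history)
```

In this run the gate fails at 50%, so the weight history ends with 0 and only half of the users ever saw the bad version.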

Canary vs A/B

They are not the same

A/B test: the goal is to prove which variant is better. Long-running and statistical.
Canary: the goal is a safe rollout. Short, with an operational gate (latency, error rate).
The same system can serve both: same routing, different decision.

NGINX Canary

Simple weight-based

yaml · production

# stable Ingress (existing)
# + canary Ingress with same host
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: iris-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"      # 5%
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /predict
            pathType: Prefix
            backend:
              service: { name: iris-v2, port: { number: 80 } }

Raise the weight in steps, 5 → 25 → 50 → 100, either manually or through automation.
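One way to script the manual path is to overwrite the `canary-weight` annotation with `kubectl annotate`. The helper below only builds the command lines and does not run them; `annotate_command` is a hypothetical name, and it assumes the canary Ingress is called `iris-canary` and lives in the current namespace.

```python
import shlex
from typing import List

CANARY_WEIGHT_ANNOTATION = "nginx.ingress.kubernetes.io/canary-weight"


def annotate_command(ingress: str, weight: int) -> List[str]:
    """Build the kubectl argv that sets the canary weight on an Ingress."""
    if not 0 <= weight <= 100:
        raise ValueError(f"weight must be between 0 and 100, got {weight}")
    return [
        "kubectl", "annotate", "ingress", ingress,
        f"{CANARY_WEIGHT_ANNOTATION}={weight}",
        "--overwrite",  # the annotation already exists, so replace its value
    ]


if __name__ == "__main__":
    # Print one command per step; run each only after checking the metrics.
    for weight in (5, 25, 50, 100):
        print(shlex.join(annotate_command("iris-canary", weight)))
```

Returning an argv list rather than a shell string keeps the helper safe to pass to `subprocess.run` later without quoting issues.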

Argo Rollouts

Automated progressive delivery

yaml · production

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: iris-api }
spec:
  replicas: 5
  strategy:
    canary:
      # Background analysis, started at step index 1 (the first pause)
      analysis:
        templates: [{ templateName: success-rate }]
        startingStep: 1
      # Without a trafficRouting block, weights are approximated by pod counts
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
  selector:
    matchLabels: { app: iris-api }
  template:
    metadata:
      labels: { app: iris-api }   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: ghcr.io/me/iris-api:v2.0.0

Analysis Template

Auto-rollback gate

yaml · production

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: success-rate }
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{app="iris-api",status!~"5.."}[2m]))
              / sum(rate(http_requests_total{app="iris-api"}[2m]))

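The success rate is measured every minute. Once it has fallen below 99% on more than three measurements (failureLimit: 3), the rollout aborts automatically and the stable version is restored.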
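The gate's decision rule can be modelled in plain Python. This is a simplified sketch, not Argo Rollouts code: `analysis_outcome` is a hypothetical function, and it assumes a metric is considered failed once the number of failed measurements exceeds the failure limit.

```python
from typing import Iterable


def analysis_outcome(
    success_rates: Iterable[float],
    threshold: float = 0.99,
    failure_limit: int = 3,
) -> str:
    """Return "abort" once failed measurements exceed failure_limit, else "continue"."""
    failures = 0
    for rate in success_rates:
        if rate < threshold:          # this measurement violates the condition
            failures += 1
            if failures > failure_limit:
                return "abort"        # rollout stops, stable version is restored
    return "continue"


if __name__ == "__main__":
    # Three bad minutes are tolerated; the fourth one aborts the rollout.
    print(analysis_outcome([0.995, 0.98, 0.97, 0.96]))
    print(analysis_outcome([0.98, 0.98, 0.98, 0.98]))
```

A non-zero failure limit trades reaction time for robustness: a single noisy measurement does not kill the rollout, but a sustained regression still does.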

ML-Specific Gates

Error rate alone is not enough

Prediction distribution shift < 10% (vs stable).
Average confidence > 0.7.
p99 latency < 200ms.
Business KPI (CTR, conversion) within ±5% of stable.
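The four gates above could be evaluated together as in the sketch below. The text does not say how the distribution shift is measured, so this example assumes total variation distance between the predicted-class distributions; all function names are hypothetical, and every input is assumed to be non-empty.

```python
from collections import Counter
from typing import Dict, Sequence


def class_shares(predictions: Sequence[str]) -> Dict[str, float]:
    """Fraction of predictions that fall into each class."""
    total = len(predictions)
    return {label: n / total for label, n in Counter(predictions).items()}


def distribution_shift(stable: Sequence[str], canary: Sequence[str]) -> float:
    """Total variation distance between two class distributions (0 to 1)."""
    p, q = class_shares(stable), class_shares(canary)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def p99(values: Sequence[float]) -> float:
    """Nearest-rank 99th percentile."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def ml_gates(
    stable_preds: Sequence[str],
    canary_preds: Sequence[str],
    confidences: Sequence[float],
    latencies_ms: Sequence[float],
    stable_kpi: float,
    canary_kpi: float,
) -> Dict[str, bool]:
    """Evaluate each gate; the rollout should proceed only if all are True."""
    return {
        "distribution_shift": distribution_shift(stable_preds, canary_preds) < 0.10,
        "avg_confidence": sum(confidences) / len(confidences) > 0.7,
        "p99_latency": p99(latencies_ms) < 200,
        "business_kpi": abs(canary_kpi - stable_kpi) / stable_kpi <= 0.05,
    }


if __name__ == "__main__":
    gates = ml_gates(
        stable_preds=["setosa"] * 50 + ["versicolor"] * 50,
        canary_preds=["setosa"] * 55 + ["versicolor"] * 45,
        confidences=[0.9, 0.8, 0.6],
        latencies_ms=list(range(1, 101)),
        stable_kpi=0.040,
        canary_kpi=0.041,
    )
    print(gates, all(gates.values()))
```

Returning a dictionary of named results instead of a single boolean makes it easy to log which gate blocked a rollout.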

Patterns

Practical playbook

Shadow traffic: send real requests to v2 but throw away its responses, so validation carries zero user-facing risk.
Dark launch: the feature ships off by default and is enabled slowly, cohort by cohort.
Region-based canary: roll out to a small region first (e.g. internal), then to tier-1 regions.

Pitfalls

What ruins a canary

At 1% canary, traffic is so low that the metrics are noisy; enforce a minimum sample size.
Without sticky sessions, a user bounces back and forth between versions, which makes for a confusing UX.
Database migrations coupled to the canary break rollback.
Manual approval blocks the rollout; automate the analysis instead.
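Two of the pitfalls above lend themselves to a quick calculation. The sketch below estimates a minimum sample size with the standard normal approximation for a proportion, and assigns users to the canary by hashing their ID so that the same user always sees the same version. Function names and example numbers are hypothetical.

```python
import hashlib
import math


def min_samples(expected_rate: float, margin: float, z: float = 1.96) -> int:
    """Requests needed to estimate a rate within ±margin (z = 1.96 for ~95% confidence)."""
    return math.ceil(z * z * expected_rate * (1 - expected_rate) / (margin * margin))


def minutes_to_collect(samples: int, total_rps: float, canary_percent: float) -> float:
    """Minutes the canary must run to see `samples` requests at the given traffic share."""
    canary_rps = total_rps * canary_percent / 100
    return samples / canary_rps / 60


def in_canary(user_id: str, percent: int) -> bool:
    """Sticky assignment: hash the user ID into one of 100 stable buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent  # raising percent only ever adds users


if __name__ == "__main__":
    n = min_samples(expected_rate=0.01, margin=0.005)
    print(n, round(minutes_to_collect(n, total_rps=100, canary_percent=1), 1))
    print(in_canary("user-42", 5) == in_canary("user-42", 5))
```

With a 1% expected error rate, a ±0.5% margin, and 100 requests per second overall, a 1% canary needs roughly 25 minutes to collect enough requests, so a 5-minute pause at that step would decide on noise.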

Mini Project

Argo Rollouts demo

Install Argo Rollouts on a kind cluster.
Deploy Iris v1 as the stable rollout.
Push the v2 image and trigger the rollout.
Watch the progress with `kubectl argo rollouts get rollout iris-api -w`.

Takeaway

Remember

Canary = small blast radius + automated gate. Risk goes down, confidence goes up.
