Phase 8 · Chapter 8.02

Canary Deployment

In coal mines, canaries gave early warning of poisonous gas. In software, canary = serve the new version to a small % of users and catch the "poison" early.

Concept

Big bang ≠ deploy

If v2 goes straight to 100% of users, a bug means a full outage. Canary = 1% first, then 10% if the metrics hold, then 50%, and finally 100%. Every step has an automatic analysis gate.
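
The step-and-gate loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular tool implements it: `run_canary`, `set_weight`, and `gate_passes` are hypothetical names, and the gate in the demo is a stand-in for a real metric check.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    set_weight: Callable[[int], None],
    gate_passes: Callable[[int], bool],
) -> bool:
    """Walk through the traffic steps; roll back to 0% on the first failed gate."""
    for weight in steps:
        set_weight(weight)            # shift this share of traffic to v2
        if not gate_passes(weight):   # automatic analysis gate for this step
            set_weight(0)             # rollback: all traffic back to stable
            return False
    return True                       # last step reached, v2 is fully live


if __name__ == "__main__":
    history: list[int] = []
    # Hypothetical gate: pretend metrics degrade once v2 gets 50% of traffic.
    ok = run_canary([1, 10, 50, 100], history.append, lambda w: w < 50)
    print(ok, history)  # False [1, 10, 50, 0]
```

The demo run stops at 50% and sets the weight back to 0, so the blast radius is never larger than the step that failed.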

Canary vs A/B

The two are not the same

  • A/B test: the goal is to prove which variant is better. Long-running and statistical.
  • Canary: the goal is a safe rollout. Short, with operational gates (latency, error rate).
  • One system can serve both: the routing is the same, the decision differs.
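
To make "same routing" concrete, here is a small sketch of percentage-based routing that could sit under either a canary or an A/B test. It assumes users are identified by a string ID; `bucket`, `route`, and the `salt` value are hypothetical names chosen for illustration.

```python
import hashlib


def bucket(user_id: str, salt: str = "iris-v2") -> int:
    """Map a user to a fixed bucket in [0, 100) so repeat requests route the same way."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def route(user_id: str, new_version_percent: int) -> str:
    """Send the user to v2 if their bucket falls under the current weight."""
    return "v2" if bucket(user_id) < new_version_percent else "v1"


if __name__ == "__main__":
    for uid in ("alice", "bob", "carol"):
        print(uid, route(uid, 5), route(uid, 50))
```

Because the bucket depends only on the user ID and salt, a user who sees v2 at 5% keeps seeing v2 as the weight grows. A canary would read operational metrics per version over minutes; an A/B test would read a business metric per version over a much longer window.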
NGINX Canary

Simple weight-based

yamlproduction
# stable Ingress (existing)
# + canary Ingress with same host
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: iris-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"      # 5%
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /predict
            pathType: Prefix
            backend:
              service: { name: iris-v2, port: { number: 80 } }

Raise the weight in steps, 5 → 25 → 50 → 100, either manually or through automation.
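
One way to script the weight change is to overwrite the annotation with `kubectl annotate`. The helper below only builds the command line and does not call the cluster; `weight_command` is a hypothetical name, and the Ingress name `iris-canary` is taken from the manifest above.

```python
import shlex

WEIGHT_ANNOTATION = "nginx.ingress.kubernetes.io/canary-weight"


def weight_command(weight: int, ingress: str = "iris-canary") -> list[str]:
    """Build the kubectl call that sets the canary Ingress to the given weight."""
    if not 0 <= weight <= 100:
        raise ValueError("canary weight must be between 0 and 100")
    return [
        "kubectl", "annotate", "ingress", ingress,
        f"{WEIGHT_ANNOTATION}={weight}", "--overwrite",
    ]


if __name__ == "__main__":
    # Print the commands for each step instead of running them.
    for weight in (5, 25, 50, 100):
        print(shlex.join(weight_command(weight)))
```

Passing the returned list to `subprocess.run` would apply the change; keeping the builder separate makes it easy to review or log the exact command first.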

Argo Rollouts

Automated progressive delivery

yamlproduction
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: iris-api }
spec:
  replicas: 5
  strategy:
    canary:
      analysis:
        templates: [{ templateName: success-rate }]
        startingStep: 1
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
  selector:
    matchLabels: { app: iris-api }
  template:
    metadata:
      labels: { app: iris-api }   # must match spec.selector.matchLabels
    spec:
      containers:
        - name: api
          image: ghcr.io/me/iris-api:v2.0.0

Analysis Template

Auto-rollback gate

yamlproduction
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: success-rate }
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{app="iris-api",status!~"5.."}[2m]))
              / sum(rate(http_requests_total{app="iris-api"}[2m]))

When the success rate falls below 99% on more measurements than failureLimit allows, the rollout aborts automatically and traffic returns to the stable version.
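
The gate logic can be replayed offline to see how the threshold and failureLimit interact. This sketch assumes the analysis fails once the number of failed measurements exceeds failureLimit; `success_rate` and `analysis_passes` are hypothetical helpers, not part of Argo Rollouts.

```python
def success_rate(total: float, server_errors: float) -> float:
    """Share of requests that did not end in a 5xx (assumes total > 0)."""
    return (total - server_errors) / total


def analysis_passes(
    measurements: list[float],
    threshold: float = 0.99,
    failure_limit: int = 3,
) -> bool:
    """Return False as soon as failed measurements exceed failure_limit."""
    failures = 0
    for value in measurements:
        if value < threshold:          # mirrors successCondition: result[0] >= 0.99
            failures += 1
            if failures > failure_limit:
                return False           # the rollout would abort here
    return True


if __name__ == "__main__":
    print(success_rate(1000, 5))                        # 0.995
    print(analysis_passes([0.995, 0.98, 0.97, 0.96]))   # True: 3 failures, limit is 3
    print(analysis_passes([0.98, 0.97, 0.96, 0.95]))    # False: 4th failure exceeds the limit
```

With a 1m interval, a limit of 3 tolerates a few bad minutes before aborting. A lower limit reacts faster but is more sensitive to noise.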

ML-Specific Gates

Error rate alone is not enough

  • Prediction distribution shift < 10% (vs stable).
  • Average confidence > 0.7.
  • p99 latency < 200ms.
  • Business KPI (CTR, conversion) within ±5% of stable.
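
These four gates can be evaluated together once the metrics are collected. The text does not say how "distribution shift" is measured, so the sketch below assumes total variation distance between predicted-class shares. All function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def class_shares(predictions: Sequence[str]) -> dict[str, float]:
    """Fraction of predictions per class label."""
    counts = Counter(predictions)
    return {label: n / len(predictions) for label, n in counts.items()}


def distribution_shift(stable: Sequence[str], canary: Sequence[str]) -> float:
    """Total variation distance: 0.0 for identical shares, 1.0 for disjoint ones."""
    p, q = class_shares(stable), class_shares(canary)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in p.keys() | q.keys())


def ml_gates(shift: float, avg_confidence: float, p99_ms: float,
             kpi_canary: float, kpi_stable: float) -> dict[str, bool]:
    """Evaluate each gate from the list above; the canary passes only if all are True."""
    return {
        "shift": shift < 0.10,
        "confidence": avg_confidence > 0.7,
        "latency": p99_ms < 200,
        "kpi": abs(kpi_canary - kpi_stable) <= 0.05 * kpi_stable,
    }


if __name__ == "__main__":
    stable = ["setosa"] * 50 + ["versicolor"] * 50
    canary = ["setosa"] * 70 + ["versicolor"] * 30
    shift = distribution_shift(stable, canary)
    print(ml_gates(shift, avg_confidence=0.82, p99_ms=150,
                   kpi_canary=0.031, kpi_stable=0.030))
```

Returning one boolean per gate, rather than a single verdict, makes it clear which check blocked a promotion.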
Patterns

Practical playbook

  • Shadow traffic: send real requests to v2 but discard its responses. Users never see v2 output, so validation carries no user-facing risk.
  • Dark launch: ship the feature off by default, then enable it slowly, cohort by cohort.
  • Region-based canary: roll out to a small region (e.g. internal) first, then tier-1.
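
Shadow traffic, the first pattern above, is sketched below at the application level. It assumes both versions are plain callables and that a `record` callback stores results for offline comparison; all names are hypothetical.

```python
from typing import Any, Callable


def handle_with_shadow(
    request: Any,
    stable: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
    record: Callable[[Any, Any, Any], None],
) -> Any:
    """Serve from stable, mirror the request to v2, and keep v2 out of the response."""
    response = stable(request)
    try:
        shadow_result = shadow(request)
    except Exception as exc:           # a v2 crash must not reach the user
        shadow_result = exc
    record(request, response, shadow_result)   # stored for later comparison
    return response                    # the user only ever sees the stable answer


if __name__ == "__main__":
    log: list[tuple] = []
    out = handle_with_shadow(2, lambda x: x + 1, lambda x: x * 10,
                             lambda *row: log.append(row))
    print(out, log)  # 3 [(2, 3, 20)]
```

This version calls v2 inline for clarity, which adds its latency to the request. A production setup would more likely mirror asynchronously or at the proxy layer.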
Pitfalls

What ruins a canary

  • At 1% canary, traffic is so low that metrics are noisy. Set a minimum sample size.
  • Without sticky sessions, users bounce back and forth between versions, which makes for a confusing UX.
  • A database migration coupled to the canary breaks rollback.
  • Manual approval blocks the rollout. Automate the analysis.
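
For the first pitfall, a rough minimum sample size can be estimated with the standard normal-approximation formula for a proportion, n = z² · p(1 − p) / e². The traffic numbers in the demo are hypothetical, as are the helper names.

```python
import math


def min_requests(expected_error_rate: float, margin: float, z: float = 1.96) -> int:
    """Requests needed to estimate an error rate within +/- margin (z=1.96 is ~95% confidence)."""
    p = expected_error_rate
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


def minutes_to_collect(requests_needed: int, total_rps: float, canary_percent: float) -> float:
    """How long the canary must run at its traffic share to see that many requests."""
    canary_rps = total_rps * canary_percent / 100
    return requests_needed / canary_rps / 60


if __name__ == "__main__":
    n = min_requests(expected_error_rate=0.01, margin=0.005)
    print(n)  # 1522
    # Hypothetical service: 100 requests/s overall, 1% routed to the canary.
    print(round(minutes_to_collect(n, total_rps=100, canary_percent=1), 1))
```

If the required time is longer than the pause at a step, either lengthen the pause or start at a higher weight so each gate sees enough requests.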
Mini Project

Argo Rollouts demo

  1. Install Argo Rollouts on a kind cluster.
  2. Deploy Iris v1 as the stable rollout.
  3. Push the v2 image and trigger the rollout.
  4. Watch it with `kubectl argo rollouts get rollout iris-api -w`.
Takeaway

Remember

Canary = small blast radius + automated gate. Risk goes down, confidence goes up.