Phase 8 · Chapter 8.02
Canary Deployment
In coal mines, canary birds gave early warning of poisonous gas. In software, a canary release sends the new version to a small percentage of users so the "poison" is caught early.
Concept
Big bang ≠ deploy
Shipping v2 straight to 100% of users means a bug becomes a full outage. A canary rollout starts at 1%; if the metrics hold, it moves to 10%, then 50%, and finally 100%. Each step is guarded by an automatic analysis gate.
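The step-and-gate idea above can be sketched in a few lines of Python. This is a minimal illustration, not a real deployment tool: `run_canary` and the `gate` callback are hypothetical names, and the gate stands in for whatever metric analysis is actually run at each step.

```python
from typing import Callable, Sequence, Tuple


def run_canary(steps: Sequence[int], gate: Callable[[int], bool]) -> Tuple[bool, int]:
    """Walk the traffic steps in order and stop at the first failing gate.

    Returns (promoted, last_healthy_weight). When promoted is False the
    caller is expected to route all traffic back to the stable version.
    """
    last_healthy = 0
    for weight in steps:
        # The gate represents the automatic analysis at this traffic weight.
        if not gate(weight):
            return False, last_healthy
        last_healthy = weight
    return True, last_healthy


# Every gate passes: the rollout reaches 100%.
print(run_canary([1, 10, 50, 100], lambda w: True))    # (True, 100)
# The gate fails at 50%: the rollout stops, 10% was the last healthy step.
print(run_canary([1, 10, 50, 100], lambda w: w < 50))  # (False, 10)
```

The point of the design is that a bad release is stopped at the smallest weight where the problem becomes visible, so the blast radius stays at that step's share of users.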
Canary vs A/B
The two are not the same
- A/B test: the goal is to prove which variant is better. Long-running and statistical.
- Canary: the goal is a safe rollout. Short-lived, with operational gates (latency, error rate).
- One system can serve both: the routing is the same, the decision is different.
NGINX Canary
Simple weight-based
```yaml
# Stable Ingress (existing) stays as-is.
# Add a canary Ingress with the same host:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: iris-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"  # 5% of traffic
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /predict
            pathType: Prefix
            backend:
              service: { name: iris-v2, port: { number: 80 } }
```
Raise the weight in steps, 5 → 25 → 50 → 100, either manually or through automation.
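One way to script the manual weight bumps is to rewrite the `canary-weight` annotation with `kubectl annotate ... --overwrite`. The sketch below only builds and prints the commands; `weight_command` is a hypothetical helper, and the Ingress name is taken from the manifest above. Running the commands (for example via `subprocess.run`) and checking metrics between steps is left out.

```python
from typing import List

CANARY_INGRESS = "iris-canary"  # Ingress name from the manifest above
WEIGHT_ANNOTATION = "nginx.ingress.kubernetes.io/canary-weight"


def weight_command(weight: int, ingress: str = CANARY_INGRESS) -> List[str]:
    """Build the kubectl command that sets the canary weight on the Ingress."""
    if not 0 <= weight <= 100:
        raise ValueError("weight must be between 0 and 100")
    return [
        "kubectl", "annotate", "ingress", ingress,
        f"{WEIGHT_ANNOTATION}={weight}",
        "--overwrite",  # the annotation already exists, so replace its value
    ]


# Print one command per rollout step; nothing is executed here.
for step in (5, 25, 50, 100):
    print(" ".join(weight_command(step)))
```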
Argo Rollouts
Automated progressive delivery
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata: { name: iris-api }
spec:
  replicas: 5
  strategy:
    canary:
      # Background analysis; steps are zero-indexed, so it starts at the first pause
      analysis:
        templates: [{ templateName: success-rate }]
        startingStep: 1
      # Without a trafficRouting block, weights are approximated by replica count
      steps:
        - setWeight: 5
        - pause: { duration: 5m }
        - setWeight: 25
        - pause: { duration: 10m }
        - setWeight: 50
        - pause: { duration: 15m }
        - setWeight: 100
  selector:
    matchLabels: { app: iris-api }
  template:
    metadata:
      labels: { app: iris-api }  # must match spec.selector
    spec:
      containers:
        - name: api
          image: ghcr.io/me/iris-api:v2.0.0
```
Analysis Template
Auto-rollback gate
```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata: { name: success-rate }
spec:
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          # Share of requests that did not return a 5xx status
          query: |
            sum(rate(http_requests_total{app="iris-api",status!~"5.."}[2m]))
            / sum(rate(http_requests_total{app="iris-api"}[2m]))
```
If the success rate falls below 99% often enough to exceed `failureLimit`, the rollout auto-aborts and the stable version is restored.
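To make the gate logic concrete, here is a small Python simulation of how a series of measurements could be judged. It assumes the rule "abort once the number of failed measurements exceeds `failureLimit`"; `analysis_passes` is a hypothetical function, not part of Argo Rollouts, and the exact semantics should be checked against the Argo Rollouts documentation.

```python
from typing import Iterable


def analysis_passes(
    measurements: Iterable[float],
    threshold: float = 0.99,  # mirrors successCondition: result[0] >= 0.99
    failure_limit: int = 3,   # mirrors failureLimit: 3
) -> bool:
    """Return False as soon as failed measurements exceed failure_limit."""
    failures = 0
    for value in measurements:
        if value < threshold:
            failures += 1
            if failures > failure_limit:
                return False  # assumed abort point: rollout would roll back
    return True


# Three bad measurements are tolerated under the assumed rule.
print(analysis_passes([0.999, 0.98, 0.995, 0.97, 0.96]))  # True
# A fourth bad measurement exceeds the limit.
print(analysis_passes([0.98, 0.98, 0.98, 0.98]))          # False
```

Tolerating a few failed measurements keeps a single noisy minute from aborting an otherwise healthy rollout.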
ML-Specific Gates
Error rate alone is not enough
- Prediction distribution shift < 10% (vs stable).
- Average confidence > 0.7.
- p99 latency < 200ms.
- Business KPI (CTR, conversion) within ±5% of stable.
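The four gates above could be checked together as in the sketch below. Several things here are assumptions for illustration: the `Snapshot` structure, the function names, and in particular the use of total variation distance between class shares as the "distribution shift" measure, since the text does not name a specific metric.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Snapshot:
    """Aggregated metrics for one version over the analysis window (hypothetical)."""
    class_share: Dict[str, float]  # fraction of predictions per class, sums to 1
    avg_confidence: float
    p99_latency_ms: float
    kpi: float                     # e.g. click-through rate


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two class distributions (0 = identical)."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


def failed_gates(stable: Snapshot, canary: Snapshot) -> List[str]:
    """Return the names of the gates the canary fails; an empty list means pass."""
    failures = []
    if total_variation(stable.class_share, canary.class_share) >= 0.10:
        failures.append("distribution_shift")
    if canary.avg_confidence <= 0.7:
        failures.append("confidence")
    if canary.p99_latency_ms >= 200:
        failures.append("latency")
    if abs(canary.kpi - stable.kpi) > 0.05 * abs(stable.kpi):
        failures.append("kpi")
    return failures


stable = Snapshot({"setosa": 0.34, "versicolor": 0.33, "virginica": 0.33}, 0.90, 120, 0.040)
healthy = Snapshot({"setosa": 0.36, "versicolor": 0.32, "virginica": 0.32}, 0.88, 150, 0.041)
skewed = Snapshot({"setosa": 0.60, "versicolor": 0.20, "virginica": 0.20}, 0.65, 250, 0.030)

print(failed_gates(stable, healthy))  # []
print(failed_gates(stable, skewed))   # ['distribution_shift', 'confidence', 'latency', 'kpi']
```

A model can return HTTP 200 on every request and still fail all four of these checks, which is why the error-rate gate is not sufficient on its own.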
Patterns
Practical playbook
- Shadow traffic: send real requests to v2 but discard its responses, so v2 is validated with no user-facing impact.
- Dark launch: ship the feature off by default, then enable it gradually per cohort.
- Region-based canary: roll out to a small region first (e.g. internal), then to tier-1.
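The shadow-traffic pattern from the list above can be sketched as a request handler that always answers from the stable version. All names here (`handle_with_shadow`, the callbacks) are hypothetical, and the shadow call is synchronous only to keep the example short; in practice mirroring is usually done asynchronously or at the proxy layer.

```python
from typing import Any, Callable, Optional


def handle_with_shadow(
    request: Any,
    stable: Callable[[Any], Any],
    shadow: Callable[[Any], Any],
    on_shadow_result: Optional[Callable[[Any, Any, Any], None]] = None,
) -> Any:
    """Serve the stable response; call the shadow version only for comparison."""
    response = stable(request)
    try:
        shadow_response = shadow(request)
        if on_shadow_result is not None:
            # Record (request, stable, shadow) for offline comparison.
            on_shadow_result(request, response, shadow_response)
    except Exception:
        # A failing shadow version must never affect the user-facing response.
        pass
    return response


def broken_v2(request: Any) -> Any:
    raise RuntimeError("v2 crashed")


comparisons = []
result = handle_with_shadow(
    {"x": 1},
    stable=lambda r: "v1",
    shadow=lambda r: "v2",
    on_shadow_result=lambda req, a, b: comparisons.append((a, b)),
)
print(result, comparisons)  # v1 [('v1', 'v2')]
print(handle_with_shadow({"x": 1}, stable=lambda r: "v1", shadow=broken_v2))  # v1
```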
Pitfalls
What ruins a canary
- At 1%, canary traffic can be so low that the metrics are noisy; enforce a minimum sample size.
- Without sticky sessions, users bounce between versions, which makes for a confusing UX.
- A database migration coupled to the canary breaks rollback.
- Manual approvals block the rollout; automate the analysis.
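The sticky-session pitfall can be avoided by deriving the routing decision from a stable hash of the user ID instead of a random draw per request. The sketch below is one possible approach, assuming a string user ID is available; `bucket` and `routes_to_canary` are hypothetical names.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def routes_to_canary(user_id: str, weight: int) -> bool:
    """A user is on the canary when their bucket falls below the weight (in %)."""
    return bucket(user_id) < weight


users = [f"user-{i}" for i in range(1000)]
at_5 = {u for u in users if routes_to_canary(u, 5)}
at_25 = {u for u in users if routes_to_canary(u, 25)}

# Users who saw the canary at 5% still see it at 25%: no flip-flopping.
print(at_5 <= at_25)  # True
```

Because each user's bucket never changes, raising the weight only adds users to the canary; nobody is moved back to stable mid-rollout.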
Mini Project
Argo Rollouts demo
- Install Argo Rollouts on a kind cluster.
- Deploy Iris v1 as the stable rollout.
- Push the v2 image and trigger the rollout.
- Watch progress with `kubectl argo rollouts get rollout iris-api -w`.
Takeaway
Remember
Canary = small blast radius + automated gate. Risk goes down, confidence goes up.