Phase 5 · Chapter 5.03

Load Balancing AI Services

If you have 10 replicas but all traffic goes to one of them, scaling is pointless. A load balancer distributes traffic, runs health checks, and handles failover.

Layers

L4 vs L7 load balancing

L4 (Transport): works at the TCP/UDP level; fast, but does not inspect request content. Example: AWS NLB.
L7 (Application): routes based on the HTTP path and headers. Examples: NGINX, Envoy, AWS ALB.

ML APIs usually need L7, because requests to /predict/v1 and /predict/v2 have to be routed to different backends.
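To make the L7 idea concrete, here is a minimal Python sketch of path-based routing. The route table and the service name iris-svc-v1 are assumptions for illustration (only iris-svc-v2 appears in this chapter); a real proxy such as NGINX or Envoy does this matching for you.

```python
from typing import Dict, Optional

# Hypothetical route table: path prefix -> backend service name.
ROUTES: Dict[str, str] = {
    "/predict/v1": "iris-svc-v1",
    "/predict/v2": "iris-svc-v2",
}


def pick_backend(path: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the backend for the longest matching path prefix, or None."""
    # Match whole path segments so "/predict/v10" does not hit "/predict/v1".
    matches = [p for p in routes if path == p or path.startswith(p + "/")]
    if not matches:
        return None
    return routes[max(matches, key=len)]


if __name__ == "__main__":
    for path in ("/predict/v1", "/predict/v2/batch", "/metrics"):
        print(path, "->", pick_backend(path))
```

An L4 balancer cannot make this decision, because it never parses the HTTP request line.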

Algorithms

Which request goes to which pod?

Round-robin: takes pods in turn. The default, and simple.
Least-connection: picks the pod with the fewest active requests. Good for long-running inference.
Latency-based: sends more traffic to the fastest pods.
Consistent hash: the same user always reaches the same pod (more cache hits, sticky).
Weighted: send 10% of traffic to the v2 model, for a canary rollout.
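The selection rules above can be sketched in a few lines of Python. This is a toy illustration, not how any particular proxy implements them: the pod names are hypothetical, the hash ring uses SHA-256 with 100 virtual nodes as an arbitrary choice, and latency-based selection is left out because it needs live measurements.

```python
import bisect
import hashlib
import itertools
import random

PODS = ["pod-a", "pod-b", "pod-c"]  # hypothetical pod names


def round_robin(pods):
    """Return an endless iterator that visits pods in a fixed rotation."""
    return itertools.cycle(pods)


def least_connection(active):
    """Pick the pod with the fewest in-flight requests.

    `active` maps pod name -> current number of active requests.
    """
    return min(active, key=active.get)


def weighted_choice(weights, rng=random):
    """Pick a backend with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


def _hash(text):
    return int(hashlib.sha256(text.encode()).hexdigest(), 16)


class HashRing:
    """Consistent hashing: each pod owns many points on a ring."""

    def __init__(self, pods, vnodes=100):
        self._ring = sorted(
            (_hash(f"{pod}#{i}"), pod) for pod in pods for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def pick(self, key):
        """Return the pod that owns the first ring point after hash(key)."""
        idx = bisect.bisect(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


if __name__ == "__main__":
    rr = round_robin(PODS)
    print([next(rr) for _ in range(4)])
    print(least_connection({"pod-a": 5, "pod-b": 1, "pod-c": 3}))
    print(weighted_choice({"v1": 90, "v2": 10}))
    print(HashRing(PODS).pick("user-42"))
```

The ring is what makes the hash "consistent": removing one pod only remaps the keys that pod owned, whereas a plain `hash(key) % len(pods)` would reshuffle most keys.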

NGINX Ingress

Weighted canary example

YAML (production)

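apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: iris-canary
  annotations:
    # Marks this Ingress as the canary for an existing primary Ingress
    # that serves the same host and path (the v1 backend).
    nginx.ingress.kubernetes.io/canary: "true"
    # Share of requests, in percent, sent to the canary backend.
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /predict
            pathType: Prefix
            backend:
              service:
                name: iris-svc-v2   # the canary (v2) Service
                port: { number: 80 }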

Envoy

Least-request + outlier detection

YAML (production)

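# Fragment of an Envoy bootstrap config; this block sits under static_resources.
clusters:
  - name: iris_cluster
    type: STRICT_DNS
    connect_timeout: 1s
    lb_policy: LEAST_REQUEST         # prefer hosts with fewer active requests
    # Assumption: the backends resolve through the DNS name iris-svc on port 80.
    load_assignment:
      cluster_name: iris_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: iris-svc, port_value: 80 }
    outlier_detection:               # passive checks based on live traffic
      consecutive_5xx: 3             # eject a host after 3 consecutive 5xx responses
      base_ejection_time: 30s        # base length of an ejection
      max_ejection_percent: 50       # cap on the share of hosts ejected at once
    health_checks:                   # active checks sent by Envoy itself
      - timeout: 1s
        interval: 5s
        unhealthy_threshold: 2       # failures before a host is marked unhealthy
        healthy_threshold: 2         # successes before it is marked healthy again
        http_health_check:
          path: /health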
ML-Specific Tips

Concerns that generic APIs don't have

Warm-up: a pod can report ready before the model has finished loading, so include a dummy prediction in the readinessProbe.
GPU affinity: use consistent hashing to send requests that should be batched together to the same GPU pod.
Large payload: for images and video, raise the proxy's request body limit (NGINX client_max_body_size 50m; the default is 1m).
Streaming response: for LLM token streams, turn proxy buffering off (NGINX proxy_buffering off).
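The warm-up tip can be sketched as a readiness check that actually runs a prediction. This is a minimal illustration under assumptions: ModelServer, health_response, and the sample input are hypothetical names, and the "model" is any callable; in a real service this logic would sit behind the HTTP path that the readinessProbe calls.

```python
class ModelServer:
    """Hypothetical wrapper that separates liveness from readiness."""

    def __init__(self, loader, sample_input):
        self._loader = loader        # callable that loads and returns a model
        self._sample = sample_input  # small input used for the warm-up call
        self._model = None

    def load(self):
        """Load the model; this can take a long time for large weights."""
        self._model = self._loader()

    def is_alive(self):
        """Liveness: the process is up and able to answer."""
        return True

    def is_ready(self):
        """Readiness: the model is loaded and a dummy prediction succeeds."""
        if self._model is None:
            return False
        try:
            self._model(self._sample)
            return True
        except Exception:
            return False


def health_response(server):
    """Return (status_code, body) for a readiness endpoint."""
    if server.is_ready():
        return 200, "ready"
    return 503, "not ready"


if __name__ == "__main__":
    server = ModelServer(loader=lambda: sum, sample_input=[5.1, 3.5, 1.4, 0.2])
    print(health_response(server))  # before load()
    server.load()
    print(health_response(server))  # after load()
```

Returning 503 until the dummy prediction passes keeps the load balancer from sending traffic to a pod that is alive but cannot serve yet.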

Pitfalls

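What breaks

A /health check that only reports that the process is alive is not enough; also check that the model is loaded.
Default proxy timeouts are often 60s (for example NGINX proxy_read_timeout), so slow inference returns a 504 unless you raise them.
Without sticky sessions, WebSocket / SSE clients can break when a reconnect lands on a pod that lacks their session state.
A load balancer in a single AZ goes down when that AZ fails; spread it across multiple AZs.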

Mini Project

Canary deployment

Run Iris v1 and v2 as separate deployments.
Use the NGINX canary annotation to send 10% of traffic to v2.
Watch v2's latency and error rate in Prometheus.
If it looks OK, raise the weight to 50%, then 100%.
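The "if it looks OK" decision in the last step can be written down as an explicit gate. The sketch below is one possible shape for it: the Stats fields, the thresholds (1 percentage point of extra errors, 20% slower p95, at least 100 requests), and the function name are all illustrative assumptions, and the numbers would come from your Prometheus queries.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    requests: int
    errors: int
    p95_latency_ms: float

    @property
    def error_rate(self):
        return self.errors / self.requests if self.requests else 0.0


STEPS = [10, 50, 100]  # canary weights from the project steps


def next_weight(current, baseline, canary,
                max_error_delta=0.01, max_latency_ratio=1.2, min_requests=100):
    """Return the next canary weight, or 0 to roll back.

    All thresholds are illustrative assumptions; tune them per service.
    """
    if canary.requests < min_requests:
        return current  # not enough data yet, hold the current weight
    worse_errors = canary.error_rate > baseline.error_rate + max_error_delta
    slower = canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio
    if worse_errors or slower:
        return 0
    later = [w for w in STEPS if w > current]
    return later[0] if later else current


if __name__ == "__main__":
    v1 = Stats(requests=9000, errors=45, p95_latency_ms=120.0)
    v2 = Stats(requests=1000, errors=6, p95_latency_ms=130.0)
    print(next_weight(10, v1, v2))
```

Comparing v2 against v1 over the same time window, rather than against a fixed number, keeps the gate meaningful when overall traffic or input mix changes.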

Takeaway

Remember

Load balancer = traffic director + health watchdog + canary tool. Pick the algorithm to match the workload.
