Phase 5 · Chapter 5.04

High Availability Systems

HA means that even if a pod, a node, an AZ, or a whole region dies, users do not notice. It comes from redundancy, automation, and practice.

The Numbers

The hierarchy of nines

SLO         Downtime/year     Downtime/month
99%         3.65 days         7.2 hours
99.9%       8.76 hours        43.2 minutes
99.95%      4.38 hours        21.6 minutes
99.99%      52.6 minutes      4.32 minutes
99.999%     5.26 minutes      25.9 seconds

Each extra nine multiplies the cost several times over. Set the target according to business need.
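
The table above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes a 365-day year and a 30-day month; the function name `allowed_downtime_minutes` is made up for this example.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float) -> float:
    """Minutes of downtime permitted over period_days at the given availability SLO."""
    unavailable_fraction = 1.0 - slo_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


for slo in (99.0, 99.9, 99.95, 99.99, 99.999):
    per_year = allowed_downtime_minutes(slo, 365)   # assumes a 365-day year
    per_month = allowed_downtime_minutes(slo, 30)   # assumes a 30-day month
    print(f"{slo}%: {per_year:.1f} min/year, {per_month:.2f} min/month")
```

Each added nine divides the allowed downtime by ten, which is why the last rows leave almost no room for manual recovery.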

HA Layers

Four levels of redundancy

Pod: replicas ≥ 2, plus a PodDisruptionBudget.
Node: topology spread, so that all pods do not land on a single node.
AZ: multi-AZ cluster, AZ-aware load balancer.
Region: multi-region active-active, or active-passive with DNS failover.

PodDisruptionBudget

Protection from voluntary disruptions

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: iris-pdb }
spec:
  minAvailable: 2          # during a node drain, keep at least 2 pods up
  selector:
    matchLabels: { app: iris-api }
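
To see what `minAvailable: 2` means during a drain, here is a rough sketch of the arithmetic behind the `disruptionsAllowed` value that a PDB reports in its status. It only covers an integer `minAvailable` (percentages and `maxUnavailable` are left out), and the function name is hypothetical.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """How many pods may be evicted voluntarily right now without dropping
    below min_available healthy pods (simplified, integer minAvailable only)."""
    return max(0, healthy_pods - min_available)


# 3 healthy replicas, minAvailable 2: one pod may be evicted at a time.
print(allowed_disruptions(healthy_pods=3, min_available=2))
# Only 2 healthy replicas left: further evictions are blocked until a pod recovers.
print(allowed_disruptions(healthy_pods=2, min_available=2))
```

With exactly 2 replicas and `minAvailable: 2`, a node drain could never evict a pod, so keep the replica count above the budget.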

Topology Spread

AZ-aware placement

spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels: { app: iris-api }

3 replicas are spread across 3 AZs (assuming the cluster has nodes in three zones). Even if one AZ goes down, 2 replicas stay live.
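
The effect of `maxSkew: 1` can be illustrated with a simplified model of the skew check. This is a sketch, not the real scheduler: it ignores node capacity, affinity, taints, and other constraints, and the zone names are hypothetical.

```python
def placement_allowed(pods_per_zone: dict[str, int], target_zone: str, max_skew: int = 1) -> bool:
    """Simplified skew rule: pods in the target zone after placement, minus the
    current minimum across all zones, must not exceed max_skew."""
    global_min = min(pods_per_zone.values())
    skew_after = pods_per_zone[target_zone] + 1 - global_min
    return skew_after <= max_skew


def spread(replicas: int, zones: list[str], max_skew: int = 1) -> dict[str, int]:
    """Place replicas one by one into the least-loaded zone that passes the check."""
    counts = {zone: 0 for zone in zones}
    for _ in range(replicas):
        zone = min(counts, key=counts.get)  # least-loaded zone, first one wins ties
        if not placement_allowed(counts, zone, max_skew):
            raise RuntimeError("no zone satisfies the constraint")  # like DoNotSchedule
        counts[zone] += 1
    return counts


print(spread(3, ["az-a", "az-b", "az-c"]))
```

In this model a second pod cannot go into a zone while another zone is still empty, which is what forces one replica per AZ.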

Active-Active vs Active-Passive

Multi-region pattern

Active-Active: both regions serve traffic, with Route53 latency-based routing. RTO ~0, roughly 2x cost.
Active-Passive: DNS fails over when the primary goes down. RTO 5 to 15 minutes, lower cost.
Pilot-light: keep the standby region minimal and scale it up on failover.
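
The routing difference between the first two patterns can be sketched as a small decision function. This is only an illustration of the logic, not how Route53 is configured; the region names and the `serving_regions` helper are hypothetical.

```python
def serving_regions(health: dict[str, bool], mode: str, primary: str) -> list[str]:
    """Return the regions that should receive traffic, given health-check results."""
    healthy = [region for region, ok in health.items() if ok]
    if mode == "active-active":
        # Every healthy region serves traffic all the time.
        return healthy
    if mode == "active-passive":
        # Only the primary serves; a standby takes over when the primary is unhealthy.
        if health.get(primary, False):
            return [primary]
        return healthy[:1]
    raise ValueError(f"unknown mode: {mode}")


health = {"us-east-1": False, "eu-west-1": True}  # hypothetical: primary is down
print(serving_regions(health, "active-active", primary="us-east-1"))
print(serving_regions(health, "active-passive", primary="us-east-1"))
```

In active-passive the switch happens only after health checks fail and DNS caches expire, which is where the minutes of RTO come from.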

SLO + Error Budget

Build it into engineering culture

SLI:  fraction of /predict requests with status 2xx AND latency < 300ms
SLO:  99.9% over 30 days
Error budget: 0.1% of requests (equivalent to 43.2 minutes of full outage per 30 days)

Rule: budget burn rate > 2x → freeze releases, focus on reliability.
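
The freeze rule can be checked mechanically. A minimal sketch, assuming burn rate is defined as the observed error rate divided by the error rate the SLO allows, and that a "bad" request is one that misses the SLI (non-2xx or 300ms or slower); the function names are made up for this example.

```python
def burn_rate(total_requests: int, bad_requests: int, slo: float = 0.999) -> float:
    """Observed error rate divided by the error rate the SLO allows.
    1.0 means the budget is being spent exactly on schedule."""
    if total_requests == 0:
        return 0.0
    error_budget = 1.0 - slo
    observed_error_rate = bad_requests / total_requests
    return observed_error_rate / error_budget


def should_freeze_releases(rate: float, threshold: float = 2.0) -> bool:
    """Apply the rule above: freeze releases when the burn rate exceeds 2x."""
    return rate > threshold


rate = burn_rate(total_requests=1_000_000, bad_requests=3_000)
print(f"burn rate: {rate:.1f}x, freeze releases: {should_freeze_releases(rate)}")
```

At a sustained burn rate of 2x, the 30-day budget is gone in about 15 days, so the freeze has to kick in well before the window ends.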

ML-Specific HA

Redundancy at the model layer too

Model fallback: if v2 fails, auto-route to v1.
Graceful degradation: if personalized recommendations are unavailable, return popular items.
Cached predictions: Redis fallback, so a stale answer is still served when the model is down.
Circuit breaker: if the upstream model service keeps failing, stop sending it requests and fail fast.
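
The fallback chain and the circuit breaker can be combined in one small sketch. Everything here is illustrative: the class and function names are hypothetical, a plain dict stands in for Redis, and the thresholds are arbitrary.

```python
import time


class CircuitBreaker:
    """Opens after repeated failures; lets a trial call through after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has passed, allow a trial request.
        return self.clock() - self.opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def predict_with_fallback(features, primary, fallback, cache, breaker):
    """Try v2 (primary), then v1 (fallback), then a cached, possibly stale, answer."""
    key = tuple(features)
    if breaker.allow():
        try:
            result = primary(features)
            breaker.record_success()
            cache[key] = result
            return result, "v2"
        except Exception:
            breaker.record_failure()
    try:
        result = fallback(features)
        cache[key] = result
        return result, "v1"
    except Exception:
        if key in cache:
            return cache[key], "cache"
        raise  # nothing left to degrade to
```

While the breaker is open, the primary is not called at all, so a struggling v2 service gets time to recover and callers skip the timeout.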

Anti-patterns

What breaks HA

Single-instance database: even if the app is HA, the DB is a SPOF.
Shared cache or shared queue without replication.
Manual failover: if it is never practiced, it will break during an incident.
Untested backups: you find out at restore time that they do not work.
No runbook: what is the on-call engineer supposed to do in the middle of the night?

Mini Project

Chaos test

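Apply the Iris deployment with 3 replicas, a PDB, and topology spread.
Keep a load generator (hey) running.
Cordon and drain one node, and watch the error rate.
Verify that every AZ has a pod: kubectl get pod -o wide shows each pod's node, and kubectl get nodes -L topology.kubernetes.io/zone maps nodes to zones.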
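
The last verification step amounts to joining two mappings: pod to node and node to zone. Below is a small sketch that does the join on hand-entered data; the pod and node names are hypothetical, and in practice you would copy them from the kubectl output.

```python
from collections import Counter


def pods_per_zone(pod_to_node: dict[str, str], node_to_zone: dict[str, str]) -> Counter:
    """Count pods in each zone by looking up the zone of the node each pod runs on."""
    return Counter(node_to_zone[node] for node in pod_to_node.values())


def survives_zone_loss(pod_to_node: dict[str, str], node_to_zone: dict[str, str],
                       min_available: int = 2) -> bool:
    """True if losing any single zone still leaves at least min_available pods."""
    counts = pods_per_zone(pod_to_node, node_to_zone)
    total = sum(counts.values())
    return all(total - in_zone >= min_available for in_zone in counts.values())


# Hypothetical values, as they might be read from the kubectl output.
pods = {"iris-api-1": "node-a", "iris-api-2": "node-b", "iris-api-3": "node-c"}
zones = {"node-a": "az-1", "node-b": "az-2", "node-c": "az-3"}
print(dict(pods_per_zone(pods, zones)))
print(survives_zone_loss(pods, zones, min_available=2))
```

If two of the three pods share a zone, the second check fails, which is the situation the topology spread constraint is meant to prevent.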

Phase 5 Complete

What you learned

Kubernetes basics, scaling strategies, load balancing, and high availability: the full toolbox for running a production-grade ML service. Next phase: Data & Pipeline Management.
