Phase 10 · Chapter 10.03

Scalable AI Infrastructure Design

Going from 100 req/sec to 100k req/sec, the same architecture will not hold up. How to design a system so that it can scale is the focus of this chapter.

Scaling Axes

Which dimension are you scaling along?

Vertical: a bigger GPU/CPU machine. There is a hard ceiling, and it is costly.
Horizontal: many replicas behind a load balancer. This is the default in ML serving.
Sharding: partition data by user or region.
Caching: result/embedding cache. Reduces compute.
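
As a minimal sketch of the sharding axis, the function below maps a user id to a shard index with a stable hash. The name `shard_for` and the choice of SHA-256 are assumptions made for illustration, not something this chapter prescribes.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user id to a shard index in [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be >= 1")
    # Use a stable hash: Python's built-in hash() for str is salted per process,
    # so it would send the same user to different shards after a restart.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo hashing moves most keys when `num_shards` changes. If shards are added often, a consistent-hashing scheme is the usual alternative.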

Kubernetes HPA

Auto-scale replicas on latency / QPS

yaml · production

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: inference-hpa }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference
  minReplicas: 3
  maxReplicas: 50
  metrics:
    - type: Pods
      pods:
        metric: { name: inference_qps }
        target: { type: AverageValue, averageValue: "100" }
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
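
To make the manifest's behaviour concrete, here is a rough sketch of the replica calculation the Kubernetes documentation describes for the HPA: `ceil(currentReplicas * currentMetric / targetMetric)` per metric, taking the largest proposal and clamping it to the min/max bounds. The function names are hypothetical and the 10% tolerance is the documented default. The real controller also handles stabilization windows, missing metrics, and unready pods, which this sketch ignores. Note that a `Pods` metric such as `inference_qps` only works if a custom-metrics adapter exposes it to the cluster.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Replica count one metric asks for, per the documented HPA formula."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * current_value / target_value)


def hpa_decision(current_replicas, metrics, min_replicas=3, max_replicas=50):
    """metrics is a list of (current_value, target_value) pairs.

    With several metrics, the largest proposal wins, then bounds are applied.
    """
    proposals = [desired_replicas(current_replicas, cur, tgt) for cur, tgt in metrics]
    return max(min_replicas, min(max_replicas, max(proposals)))
```

For example, with 10 replicas averaging 200 QPS per pod against a target of 100, and CPU at 60% against a target of 70%, the QPS metric asks for 20 replicas and the CPU metric for 9, so the decision is 20.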

GPU Pool

Multi-tenant GPU sharing (MIG / Triton)

yaml · production

# NVIDIA MIG: split an A100 (80GB) into 7 slices
apiVersion: v1
kind: Pod
metadata: { name: inference-mig }   # hypothetical name; a Pod needs one to be valid
spec:
  containers:
    - name: inference
      image: nvcr.io/nvidia/tritonserver:24.05-py3
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1   # 1/7 of an A100 80GB
      args: ["tritonserver", "--model-repository=/models"]

Triton serves multiple models on the same GPU, and its dynamic batching can raise throughput by roughly 3–5x, depending on the model and the traffic pattern.
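
The idea behind dynamic batching can be sketched without a GPU. The function below is not Triton's implementation. It is a simplified illustration of the policy, assuming a batch is sent either when it is full or when a new request arrives after the oldest queued request has waited past a delay limit. The names `form_batches`, `max_batch_size`, and `max_queue_delay` are made up for this sketch. In Triton, the comparable knobs live in the model configuration's `dynamic_batching` section.

```python
def form_batches(arrival_times, max_batch_size, max_queue_delay):
    """Group sorted request arrival times (seconds) into batches.

    A batch is dispatched when it reaches max_batch_size, or when the next
    request arrives more than max_queue_delay after the batch's first request.
    """
    batches, current = [], []
    for t in arrival_times:
        # The oldest queued request has waited too long: flush before adding.
        if current and t - current[0] > max_queue_delay:
            batches.append(current)
            current = []
        current.append(t)
        # The batch is full: dispatch immediately.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches
```

A larger delay limit produces fuller batches and higher GPU throughput, at the cost of extra queueing latency for the first request in each batch.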

Cache Layer

The cheapest way to reduce compute

python · production

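import hashlib
import json

import redis

r = redis.Redis()  # defaults to localhost:6379

def cached_predict(text: str):
    # Hash the input so arbitrary-length text maps to a fixed-size key.
    key = "pred:" + hashlib.sha256(text.encode()).hexdigest()
    if (hit := r.get(key)) is not None:
        return json.loads(hit)
    out = model.predict(text)  # `model` is assumed to be loaded elsewhere; output must be JSON-serializable
    r.setex(key, 3600, json.dumps(out))  # expire after 1 hour
    return out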

Embedding cache: 60–80% hit rate in semantic search.
Result cache: the top-N most popular queries.
Feature cache: hot keys from the feature store.
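
To see why hit rate matters for sizing, the small helper below estimates how much traffic still reaches the model after the cache. It assumes every hit is served entirely from cache and every miss costs one model call. The function name is hypothetical.

```python
def model_qps_after_cache(total_qps: float, hit_rate: float) -> float:
    """Traffic that still reaches the model, given a cache hit rate in [0, 1]."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return total_qps * (1.0 - hit_rate)
```

At 10,000 QPS, a 70% hit rate leaves about 3,000 QPS for the model replicas to absorb.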

Multi-Region

Latency + DR

Replicate model artifacts (S3 cross-region replication).
Geo-DNS (Route53, Cloudflare): route each user to the nearest region.
Active-active vs active-passive: the choice depends on your RTO/RPO requirements.
Vector DB sync: eventual consistency is fine for many recsys workloads.
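
The routing decision that latency-based Geo-DNS with health checks makes can be sketched as a small function. This is not the Route53 or Cloudflare API. The region names and the `pick_region` helper are placeholders used only to show the logic of picking the nearest healthy region and failing over to the next.

```python
def pick_region(latency_ms: dict, healthy: set) -> str:
    """Return the healthy region with the lowest measured latency."""
    candidates = {r: ms for r, ms in latency_ms.items() if r in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


# Hypothetical measurements from one user's vantage point.
latencies = {"ap-south-1": 20, "eu-west-1": 140, "us-east-1": 220}
nearest = pick_region(latencies, {"ap-south-1", "eu-west-1", "us-east-1"})
failover = pick_region(latencies, {"eu-west-1", "us-east-1"})
```

In an active-passive setup the `healthy` set would normally contain only the primary region, with the standby added when the primary fails its health check.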

Capacity Planning

How many replicas you need: the calculation

text · production

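replicas = ceil( (peak_qps * p99_latency_sec) / (target_utilization * batch_size) )

Example:
  peak QPS = 2000
  p99 latency = 0.15s
  target util = 0.7
  in-flight slots = ceil( (2000 * 0.15) / 0.7 ) = 429
  replicas = ceil( 429 / batch_size )   (batch_size = requests one replica serves concurrently)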
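
The calculation above is an application of Little's law (requests in flight = arrival rate × latency). A minimal sketch in Python follows. It treats `batch_size` as the number of requests one replica processes concurrently, which is an assumption about how the example's final division is meant. Using p99 rather than mean latency makes the estimate conservative.

```python
import math


def required_replicas(peak_qps, p99_latency_sec, target_utilization, batch_size=1):
    """Estimate replica count from peak load, latency, and target utilization."""
    in_flight = peak_qps * p99_latency_sec               # concurrent requests at peak
    slots = math.ceil(in_flight / target_utilization)    # add headroom for the utilization target
    return math.ceil(slots / batch_size)                 # each replica serves batch_size at once


print(required_replicas(2000, 0.15, 0.7))                # 429
print(required_replicas(2000, 0.15, 0.7, batch_size=8))  # 54
```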

Pitfalls

What breaks when you scale

Cold start: loading the model takes 30s, so autoscaling falls behind the traffic. Keep a pre-warmed pool.
Thundering herd: when a cache entry expires, every caller recomputes at once. Use a lock plus stale-while-revalidate.
DB bottleneck: the feature store gets overloaded. Use read replicas plus a cache.
GPU idle waste: without batching, GPU utilization sits around 20%.
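
The lock plus stale-while-revalidate remedy for the thundering herd can be sketched in-process as below. This is a simplified illustration and not a production cache. The class name, the `ttl`/`stale_ttl` parameters, and the inline (not background) refresh are all assumptions of the sketch. With Redis, the per-key lock would typically be a key set with `SET ... NX EX`.

```python
import threading
import time


class StaleWhileRevalidateCache:
    """In-process cache that serves stale values while one caller refreshes."""

    def __init__(self, compute, ttl, stale_ttl):
        self._compute = compute      # function key -> value (the expensive call)
        self._ttl = ttl              # seconds an entry counts as fresh
        self._stale_ttl = stale_ttl  # extra seconds a stale entry may still be served
        self._data = {}              # key -> (value, stored_at)
        self._locks = {}             # key -> per-key refresh lock
        self._guard = threading.Lock()

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _refresh(self, key, now):
        self._data[key] = (self._compute(key), now)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        lock = self._lock_for(key)
        entry = self._data.get(key)
        if entry is not None:
            value, stored_at = entry
            age = now - stored_at
            if age <= self._ttl:
                return value  # fresh hit
            if age <= self._ttl + self._stale_ttl:
                # Stale but usable: only the caller that wins the lock recomputes.
                if lock.acquire(blocking=False):
                    try:
                        self._refresh(key, now)
                    finally:
                        lock.release()
                    return self._data[key][0]
                return value  # another caller is refreshing; serve the stale value
        # Missing or too old to serve: wait on the lock so only one caller computes.
        with lock:
            entry = self._data.get(key)
            if entry is None or now - entry[1] > self._ttl:
                self._refresh(key, now)
            return self._data[key][0]
```

The `now` argument exists only to make the behaviour easy to test. Normal callers would omit it and let the cache read the monotonic clock.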

Mini Project

Plan a 10k QPS system

BERT-base sentiment model, p99 target 100ms.
Work out the batch size, GPU type, replica count, and cache hit rate.
Estimate the monthly cost (GPU + cache + network egress).

Takeaway

The bottom line

Scaling is not just adding replicas. Caching, batching, sharding, and region distribution all have to be considered together.
