Model Scaling Strategies

When do you need 5 replicas? How do you share a GPU? When is batch inference the right call? Scaling is not one thing; it is a collection of techniques.

Two Axes

Vertical vs Horizontal

Vertical: give the same pod more CPU/RAM/GPU. The limit is the hardware of a single node.
Horizontal: add more replicas. The limit is that the service must be stateless.

ML inference is usually stateless, so horizontal scaling is the default.

HPA

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: iris-hpa }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: iris-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
    - type: Resource
      resource:
        name: memory
        target: { type: Utilization, averageUtilization: 80 }
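
To make the manifest above less of a black box, here is a minimal sketch of the rule the HPA controller applies: desired replicas = ceil(current replicas × current value / target value), with a default tolerance of 10% and the largest proposal winning when several metrics are configured. The function names are hypothetical, and the sketch ignores pod readiness, missing metrics, and stabilization.

```python
import math

def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Scale proportionally to current/target, rounding up."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

def hpa_decision(current_replicas: int, metrics, min_replicas: int = 2,
                 max_replicas: int = 20) -> int:
    """metrics: list of (current_utilization, target_utilization) pairs."""
    # With several metrics, the largest proposal wins.
    proposal = max(desired_replicas(current_replicas, cur, tgt)
                   for cur, tgt in metrics)
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, proposal))

# 4 replicas, CPU at 90% (target 70), memory at 60% (target 80)
print(hpa_decision(4, [(90, 70), (60, 80)]))
```

In this example CPU asks for more replicas while memory asks for fewer, and the CPU proposal decides the outcome.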

KEDA

Custom metric scaling (Prometheus, queue length)

Out of the box, HPA only understands CPU and memory; anything else needs a metrics adapter. KEDA lets you scale on a Prometheus query, Kafka lag, SQS depth, or almost any other metric.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: iris-scaler }
spec:
  scaleTargetRef:
    name: iris-api
  minReplicaCount: 1
  maxReplicaCount: 30
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        threshold: "100"  # requests per second per pod
        query: sum(rate(http_requests_total{app="iris-api"}[1m]))
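
The query above returns the total request rate across all pods, while the threshold is a per-pod target. A rough sketch of the resulting arithmetic, assuming the scaler's default average-value behaviour and ignoring HPA tolerance and stabilization (the function name is hypothetical):

```python
import math

def keda_target_replicas(query_value: float, threshold: float,
                         min_replicas: int = 1, max_replicas: int = 30) -> int:
    """Replicas needed so that query_value / replicas stays at or below threshold."""
    wanted = math.ceil(query_value / threshold)
    # Clamp to minReplicaCount/maxReplicaCount from the ScaledObject.
    return max(min_replicas, min(max_replicas, wanted))

for total_rps in (40, 250, 1200, 5000):
    replicas = keda_target_replicas(total_rps, threshold=100)
    print(total_rps, "rps ->", replicas, "replicas")
```

This is why the query uses `sum(...)`: a per-pod average in the query would be divided by the replica count a second time.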

GPU Strategies

Put expensive GPUs to work

Time-slicing: the NVIDIA device plugin lets multiple pods share one GPU.
MIG (Multi-Instance GPU): hardware partitioning on A100/H100.
Triton Inference Server: multiple models in one process, dynamic batching, GPU saturation.
Spot/preemptible nodes: 60–80% cheaper for training; add checkpointing.
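
The checkpointing advice for spot nodes can be sketched as follows. This is an illustration only: the training step is a placeholder, `stop_at` simulates a preemption, and all function names are hypothetical. The one design point worth copying is the atomic write, so a node reclaimed mid-write cannot leave a corrupt checkpoint behind.

```python
import json
import os
import tempfile

def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so the checkpoint is never half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename on the same filesystem

def load_checkpoint(path: str) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "loss": None}

def train(path: str, total_steps: int, checkpoint_every: int = 100,
          stop_at=None) -> dict:
    """Hypothetical training loop that resumes from the last checkpoint."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if stop_at is not None and step > stop_at:
            return state  # node reclaimed: work since the last checkpoint is lost
        state = {"step": step, "loss": 1.0 / step}  # placeholder for a real update
        if step % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state

with tempfile.TemporaryDirectory() as d:
    ckpt = os.path.join(d, "ckpt.json")
    train(ckpt, total_steps=1000, stop_at=250)   # preempted after step 250
    print(load_checkpoint(ckpt)["step"])          # last durable step
    print(train(ckpt, total_steps=1000)["step"])  # resumes and finishes
```

The checkpoint interval is the trade-off: a shorter interval loses less work on preemption but spends more time writing.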

Dynamic Batching

Latency rises slightly, throughput multiplies

# Triton config.pbtxt snippet
dynamic_batching {
  preferred_batch_size: [4, 8, 16]
  max_queue_delay_microseconds: 5000  # 5ms wait
}

If 8 requests accumulate during the 5 ms wait, all 8 are served in a single GPU pass, which raises GPU utilization.
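
The batching behaviour described here can be approximated with a toy model. The sketch below is not Triton's scheduler: it ignores preferred batch sizes and whether a model instance is free, and only shows how a maximum batch size and a maximum queue delay group incoming requests. The function name and arrival times are made up for illustration.

```python
def form_batches(arrivals_us, max_batch_size=16, max_queue_delay_us=5000):
    """Group sorted arrival timestamps (microseconds) into batches.

    A batch closes when it is full, or when the next request arrives after
    the oldest queued request has already waited max_queue_delay_us.
    """
    batches, current = [], []
    for t in arrivals_us:
        waited_too_long = bool(current) and t - current[0] > max_queue_delay_us
        full = len(current) == max_batch_size
        if waited_too_long or full:
            batches.append(current)
            current = []
        current.append(t)
    if current:
        batches.append(current)
    return batches

# 8 requests arriving 600 us apart, then one straggler much later
arrivals = [i * 600 for i in range(8)] + [20000]
print([len(b) for b in form_batches(arrivals)])
print([len(b) for b in form_batches(arrivals, max_batch_size=4)])
```

The first eight requests all fall inside one 5 ms window, so they share a batch; the straggler runs alone, which is the low-traffic case where batching adds delay without adding throughput.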

Patterns

Patterns by workload

Real-time small model: CPU pod + HPA (CPU 70%).
Real-time large model: GPU pod + Triton + dynamic batching.
Async / batch: queue (SQS/Kafka) + KEDA scaler + spot GPU.
Bursty traffic: Cluster Autoscaler + minimum replicas of 0 (if cold starts are tolerable). Scale-to-zero generally needs KEDA, since a plain HPA keeps at least one replica by default.

Pitfalls

What people forget

By default, HPA's scale-down stabilization window is 5 minutes, so replicas (and the bill) stay high for that long after traffic drops.
Adding replicas does not raise throughput if the DB or cache is the bottleneck.
Not measuring cold start: spinning up the first replica can take 30s or more.
Without a node autoscaler, the pods HPA adds get stuck in Pending.
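
The first pitfall is easier to see with numbers. During scale-down, HPA acts on the highest recommendation it has seen inside the stabilization window (300 seconds by default, tunable through `behavior.scaleDown.stabilizationWindowSeconds`). A minimal sketch with a hypothetical function name and made-up recommendations:

```python
def stabilized_scale_down(recommendations, now_s, window_s=300):
    """recommendations: list of (timestamp_s, desired_replicas) pairs.

    Returns the highest recommendation made within the last window_s seconds,
    so a brief dip in load does not immediately remove replicas.
    """
    recent = [r for ts, r in recommendations if 0 <= now_s - ts <= window_s]
    return max(recent)

# Traffic drops right after t=0: the raw recommendation falls from 10 to 2.
history = [(0, 10), (60, 2), (120, 2), (180, 2), (240, 2), (300, 2), (360, 2)]
for now in (240, 360):
    print(now, "s ->", stabilized_scale_down(history, now), "replicas")
```

Four minutes after the drop the deployment still runs at the old size; only once the last high recommendation leaves the window does the replica count fall.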

Mini Project

HPA load test

Apply an HPA to the Iris deployment (CPU 60%, min 1, max 10).
Run hey -z 60s -c 50 http://api/predict.
Watch the replica count grow with kubectl get hpa -w.

Takeaway

Remember

Scaling = right metric × right strategy × right hardware. For CPU-bound inference HPA is enough; for GPU workloads use Triton + batching.
