Phase 10 · Chapter 10.04

Low Latency AI Systems

Ad bidding in 10 ms, fraud detection in 50 ms, autonomous vehicles in 5 ms: in some systems a missed latency target is a missed business goal. This chapter covers the techniques of that world.

Latency Budget

Where each millisecond goes

Total budget: 50ms (p99)
─────────────────────────
Network (client → LB)      5ms
LB → service hop           2ms
Auth + parse               3ms
Feature fetch (Redis)      5ms
Model inference            25ms   ← optimization target
Post-process + response    5ms
Network (return)           5ms
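
The budget above can also be kept as data and checked against measurements. The following is a minimal sketch, not a prescribed tool: the names `BUDGET_MS` and `over_budget`, and the measured value of 31 ms, are illustrative assumptions.

```python
# Latency budget from the table above, in milliseconds (p99 target: 50 ms).
BUDGET_MS = {
    "network_in": 5,
    "lb_hop": 2,
    "auth_parse": 3,
    "feature_fetch_redis": 5,
    "model_inference": 25,
    "postprocess_response": 5,
    "network_return": 5,
}
TOTAL_BUDGET_MS = 50


def over_budget(measured_ms: dict) -> dict:
    """Return each stage that exceeds its allocation, and by how many ms."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0) > limit
    }


# The per-stage allocations must add up to the total budget.
assert sum(BUDGET_MS.values()) == TOTAL_BUDGET_MS

# Inference takes half of the budget, which is why it is the optimization target.
inference_share = BUDGET_MS["model_inference"] / TOTAL_BUDGET_MS

# Hypothetical measurement: inference is 6 ms over its allocation.
measured = dict(BUDGET_MS, model_inference=31)
print(over_budget(measured))  # {'model_inference': 6}
```

Tracking the budget per stage makes it obvious which hop to attack first when p99 drifts over the total.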

Model Optimization

ONNX + TensorRT pipeline

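# 1. PyTorch → ONNX
import torch

model.eval()  # `model` is your trained torch.nn.Module
dummy = torch.randn(1, 3, 224, 224)  # example input; only the batch axis is dynamic
torch.onnx.export(model, dummy, "model.onnx",
    opset_version=17,
    input_names=["input"], output_names=["output"],  # names used by dynamic_axes and trtexec
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})

# 2. ONNX → TensorRT FP16 engine
# (newer TensorRT releases replace --workspace with --memPoolSize=workspace:4096)
# trtexec --onnx=model.onnx --saveEngine=model.plan \
#   --fp16 --workspace=4096 --minShapes=input:1x3x224x224 \
#   --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224

# 3. Serve with Triton
# /models/resnet/config.pbtxt (the engine goes in /models/resnet/1/model.plan)
# platform: "tensorrt_plan"
# max_batch_size: 32
# dynamic_batching { preferred_batch_size: [8, 16, 32] max_queue_delay_microseconds: 2000 }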

ResNet50, FP32 PyTorch → FP16 TensorRT: 35 ms → 4 ms (single image). Exact numbers depend on the GPU.

Quantization

INT8, with an accuracy trade-off

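# Post-training INT8 calibration (static quantization)
from onnxruntime.quantization import quantize_static, QuantFormat, QuantType

# calib_reader: a CalibrationDataReader subclass that yields representative inputs
quantize_static(
    "model.onnx", "model_int8.onnx",
    calibration_data_reader=calib_reader,
    quant_format=QuantFormat.QDQ,      # graph format, not a data type
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)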
  • INT8 vs FP32: about 4x smaller, typically 2–3x faster, around a 1% accuracy drop.
  • QAT (Quantization-Aware Training) recovers the lost accuracy.
  • A representative calibration dataset is critical.
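
To see why the calibration data matters, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is a simplified illustration, not what ONNX Runtime does internally; the function names and the sample values are assumptions chosen for the example.

```python
def calibrate_scale(calibration_values):
    """Symmetric INT8 scale: map the largest observed magnitude to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0


def quantize(x, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return max(-127, min(127, round(x / scale)))


def dequantize(q, scale):
    """Map an INT8 code back to an approximate float value."""
    return q * scale


# Calibration data that covers the real activation range [-4, 4].
good_scale = calibrate_scale([-4.0, -1.0, 0.5, 4.0])
# Calibration data that only saw small values, so the range is too narrow.
bad_scale = calibrate_scale([-1.0, 0.25, 0.5, 1.0])

x = 3.0  # a value that shows up in production traffic
err_good = abs(dequantize(quantize(x, good_scale), good_scale) - x)
err_bad = abs(dequantize(quantize(x, bad_scale), bad_scale) - x)
print(f"representative: {err_good:.4f}, unrepresentative: {err_bad:.4f}")
```

With the narrow calibration set, the value 3.0 is clamped to the top of the INT8 range and comes back as roughly 1.0, while the representative set keeps the error below one quantization step.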

Distillation

Large teacher → small student

  • BERT-base (110M) → DistilBERT (66M): 60% faster while retaining about 97% of the teacher's accuracy.
  • Train on soft labels (teacher logits) and hard labels together.
  • Use the distilled model on the latency-critical path and the full model on the batch path.
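
The combined soft-label and hard-label objective can be sketched in plain Python as follows. This is a minimal illustration of the commonly used formulation (temperature-scaled KL divergence plus cross-entropy); the temperature of 2.0, the weight `alpha` of 0.5, and the example logits are assumptions, not values from this chapter.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft term: how far the student's softened distribution is from the teacher's.
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    soft = (temperature ** 2) * kl
    # Hard term: ordinary cross-entropy against the ground-truth class index.
    hard = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft + (1 - alpha) * hard


teacher = [4.0, 1.0, 0.5]  # hypothetical teacher logits
student = [2.5, 1.5, 0.5]  # hypothetical student logits
print(round(distillation_loss(student, teacher, hard_label=0), 4))
```

The `T^2` factor keeps the soft term's gradient magnitude comparable to the hard term when the temperature changes. In a real training loop the same formula would be expressed with the framework's tensor operations.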

Edge Serving

Cut the round trip: run on the device

// Browser-side inference with ONNX Runtime Web
import * as ort from 'onnxruntime-web';

// Try WebGPU first, fall back to WebAssembly
const session = await ort.InferenceSession.create('/model.onnx', {
  executionProviders: ['webgpu', 'wasm'],
});
// `data` is a Float32Array of length 1*3*224*224; the key must match the model's input name
const out = await session.run({ input: new ort.Tensor('float32', data, [1,3,224,224]) });
  • Mobile: Core ML (iOS), NNAPI (Android), TFLite.
  • Browser: ONNX Runtime Web + WebGPU.
  • IoT: Jetson, Coral TPU.

Infra Techniques

System-level latency cuts

  • Connection pooling: avoid paying the TCP handshake on every request.
  • HTTP/2 + gRPC: multiplexing and header compression.
  • Co-locate: keep the model and the feature store in the same AZ; a cross-AZ hop costs 1–3 ms.
  • Warm pool: idle replicas keep the model pre-loaded.
  • Speculative execution: call two models in parallel and take whichever responds first.
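
The parallel-call idea from the last bullet can be sketched with `asyncio`. This is a minimal illustration under stated assumptions: `call_model` is a hypothetical stand-in that only sleeps, and the replica names and latencies are made up.

```python
import asyncio


async def call_model(name, latency_s):
    """Stand-in for a real inference call; sleeps to simulate latency."""
    await asyncio.sleep(latency_s)
    return name


async def first_response(*calls):
    """Run all calls concurrently, return the first result, cancel the rest."""
    tasks = [asyncio.ensure_future(c) for c in calls]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Let the cancelled tasks finish unwinding before returning.
    await asyncio.gather(*pending, return_exceptions=True)
    # If the fastest call raised, result() re-raises that exception here.
    return done.pop().result()


winner = asyncio.run(first_response(
    call_model("replica-a", 0.05),
    call_model("replica-b", 0.01),
))
print(winner)  # replica-b
```

The trade-off is load: every request now costs two inference calls, so this usually makes sense only for the tail of slow requests or when spare capacity exists.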

Pitfalls

Traps in latency optimization

  • Being satisfied with average latency: measure p99 instead.
  • GC pauses and JIT warmup: the first request can be 10x slower.
  • Tokenizing in a Python loop: use a Rust tokenizer.
  • Synchronous logging: switch to an async logger.
  • Over-optimization: losing accuracy to shave 5 ms down to 4 ms.
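
The gap between average and tail latency is easy to show with numbers. The sample below is hypothetical (98 fast requests plus two slow ones, such as a cold start and a GC pause), and `percentile` is an illustrative nearest-rank helper rather than a library function.

```python
import math
import statistics

# Hypothetical latencies in ms: 98 fast requests and 2 slow ones.
latencies_ms = [5.0] * 98 + [60.0, 80.0]


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms:.1f} ms, p99={p99_ms:.1f} ms")
# mean=6.3 ms, p50=5.0 ms, p99=60.0 ms
```

The mean sits comfortably inside a 50 ms budget while the p99 violates it, which is exactly the case the first pitfall warns about.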

Mini Project

50ms → 10ms challenge

  1. Build a BERT sentiment service with FastAPI and measure the baseline p99.
  2. Export to ONNX, quantize to INT8, and deploy on Triton.
  3. Add a Redis feature cache and a gRPC client.
  4. Target: cut p99 latency by 5x.
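
For steps 1 and 4, a small timing harness is enough to get started. This is a sketch under stated assumptions: `measure_ms`, `p99`, and `meets_target` are hypothetical helper names, and the lambda stands in for a real call to your service.

```python
import math
import time


def measure_ms(fn, warmup=10, iterations=200):
    """Time fn() per call in milliseconds, discarding the warmup calls."""
    for _ in range(warmup):
        fn()  # warm caches, JIT, and connections before measuring
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p99(samples):
    """Nearest-rank 99th percentile."""
    ordered = sorted(samples)
    return ordered[math.ceil(99 * len(ordered) / 100) - 1]


def meets_target(baseline_p99_ms, optimized_p99_ms, factor=5.0):
    """True if the optimized p99 is at least `factor` times lower than the baseline."""
    return baseline_p99_ms / optimized_p99_ms >= factor


# Stand-in workload; replace the lambda with a real request to the service.
baseline = measure_ms(lambda: sum(range(10_000)))
print(f"baseline p99: {p99(baseline):.3f} ms")
print(meets_target(50.0, 10.0))  # True
```

Measure the baseline and the optimized service with the same harness, the same inputs, and the same machine, otherwise the 5x comparison is not meaningful.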

Phase 10 Complete

What you learned

Microservices, event-driven design, scalable infra, and low latency: the four pillars of production-grade AI architecture. Next phase: Industry Projects, real-world builds from beginner to advanced.