Phase 10 · Chapter 10.04
Low Latency AI Systems
Ad bidding in 10 ms, fraud detection in 50 ms, autonomous vehicles in 5 ms: in some systems a missed latency target is a missed business outcome. This chapter covers the techniques of that world.
Latency Budget
Where every millisecond goes
Total budget: 50ms (p99)
─────────────────────────
Network (client → LB)      5ms
LB → service hop           2ms
Auth + parse               3ms
Feature fetch (Redis)      5ms
Model inference           25ms  ← optimization target
Post-process + response    5ms
Network (return)           5ms
Model Optimization
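Before optimizing the model, it can help to encode the latency budget above and confirm that inference really is the dominant stage. The following is a minimal sketch: the stage keys and the `measured` values are hypothetical names and numbers chosen for illustration; only the budget figures come from the table.

```python
# Stage budgets in milliseconds, copied from the 50 ms p99 budget above.
# The dictionary keys are illustrative names, not part of any real API.
BUDGET_MS = {
    "network_in": 5,
    "lb_hop": 2,
    "auth_parse": 3,
    "feature_fetch": 5,
    "inference": 25,
    "post_process": 5,
    "network_out": 5,
}


def dominant_stage(budget_ms=BUDGET_MS):
    """Return the stage that holds the largest share of the budget."""
    return max(budget_ms, key=budget_ms.get)


def over_budget(measured_ms, budget_ms=BUDGET_MS):
    """Return {stage: overshoot_ms} for every stage that exceeds its budget."""
    return {
        stage: measured_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if measured_ms.get(stage, 0) > limit
    }


total_ms = sum(BUDGET_MS.values())
inference_share = BUDGET_MS["inference"] / total_ms

# Hypothetical p99 measurements per stage, for illustration only.
measured = dict(BUDGET_MS, feature_fetch=9, inference=31)

if __name__ == "__main__":
    print(total_ms, dominant_stage(), inference_share)
    print(over_budget(measured))
```

Inference takes half of the total budget here, which is why it is marked as the optimization target; a per-stage overshoot report like `over_budget` also shows when a smaller stage (such as the feature fetch) is the one that actually broke the budget.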
ONNX + TensorRT pipeline
# 1. PyTorch → ONNX
# `model` is assumed to be a trained torch.nn.Module in eval mode and `dummy`
# a sample input tensor, e.g. torch.randn(1, 3, 224, 224).
import torch
torch.onnx.export(model, dummy, "model.onnx",
                  opset_version=17,
                  input_names=["input"],  # names the graph input so dynamic_axes and trtexec can refer to it
                  dynamic_axes={"input": {0: "batch"}})
# 2. ONNX → TensorRT FP16 engine
# (newer TensorRT releases replace --workspace with --memPoolSize=workspace:4096)
# trtexec --onnx=model.onnx --saveEngine=model.plan \
#   --fp16 --workspace=4096 --minShapes=input:1x3x224x224 \
#   --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224
# 3. Serve with Triton
# /models/resnet/config.pbtxt
# platform: "tensorrt_plan"
# max_batch_size: 32
# dynamic_batching { preferred_batch_size: [8, 16, 32] max_queue_delay_microseconds: 2000 }
ResNet50, FP32 PyTorch → FP16 TensorRT: 35ms → 4ms (single image).
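The `dynamic_batching` setting in the Triton config trades a small queue delay (2000 µs there) for larger batches. The idea can be sketched in plain asyncio as below. This is a simplified illustration, not Triton's implementation: `MicroBatcher`, `infer_batch`, and the demo model are hypothetical names, `preferred_batch_size` is not modeled, and errors raised by the model are not propagated to callers.

```python
import asyncio


class MicroBatcher:
    """Collect requests until max_batch is reached or max_delay_s has elapsed."""

    def __init__(self, infer_batch, max_batch=8, max_delay_s=0.002):
        self.infer_batch = infer_batch      # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_delay_s = max_delay_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one request and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Worker loop: build a batch, run the model once, fan results back out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]           # block until the first request
            deadline = loop.time() + self.max_delay_s  # the delay clock starts here
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            outputs = self.infer_batch([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


async def demo():
    sizes = []  # batch sizes the fake model actually saw

    def fake_model(inputs):
        sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = MicroBatcher(fake_model, max_batch=4, max_delay_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Six concurrent requests with `max_batch=4` would typically be served as a full batch followed by a smaller one that waits out the delay. That wait is the cost: under light load every request pays up to `max_delay_s` extra, so the delay has to fit inside the inference slice of the latency budget.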
Quantization
INT8, with an accuracy trade-off
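# Post-training INT8 calibration
# `calib_reader` is assumed to be an instance of a CalibrationDataReader subclass
# that yields representative input batches.
from onnxruntime.quantization import quantize_static, QuantFormat, QuantType
quantize_static(
    "model.onnx", "model_int8.onnx",
    calibration_data_reader=calib_reader,
    quant_format=QuantFormat.QDQ,      # insert QuantizeLinear/DequantizeLinear node pairs
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
- INT8 vs FP32: roughly 4x smaller, 2–3x faster, ~1% accuracy drop.
- QAT (Quantization-Aware Training) recovers accuracy.
- A representative calibration dataset is critical.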
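To see where the accuracy drop comes from and why the calibration range matters, here is a small worked example of symmetric per-tensor INT8 quantization in plain Python. It is a simplified illustration, not what ONNX Runtime does internally (which also calibrates activations and can quantize per channel); the weight values are made up.

```python
def quantize_int8(values):
    """Symmetric INT8: q = round(x / scale), clamped to [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map INT8 codes back to approximate float values."""
    return [qi * scale for qi in q]


# Made-up weights whose range is about [-1.27, 1.27], so scale is about 0.01.
weights = [0.02, -0.5, 0.31, 1.27, -1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))

# One outlier stretches the range 10x, so every other value loses resolution.
q_outlier, scale_outlier = quantize_int8(weights + [12.7])

if __name__ == "__main__":
    print(q, scale, max_err)
    print(q_outlier, scale_outlier)
```

The rounding error per value is bounded by half the scale, and the scale is set by the largest magnitude seen. With the outlier included, the small weight 0.02 collapses to code 0, which is the same failure an unrepresentative calibration set causes for activations.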
Distillation
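Large teacher → small student
- BERT-base (110M) → DistilBERT (66M): 40% smaller, 60% faster, retaining about 97% of the teacher's language-understanding performance.
- Train on soft labels (teacher logits) and hard labels together.
- Use the distilled model on the latency-critical path and the full model on the batch path.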
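Training on soft and hard labels together is commonly written as a weighted sum of a temperature-softened KL term and ordinary cross-entropy. The sketch below shows that loss for a single example in plain Python; the `temperature` and `alpha` values and the logits are assumptions for illustration, and a real training loop would use a tensor library instead.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """alpha * T^2 * KL(teacher_T || student_T) + (1 - alpha) * CE(student, label)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # Soft-label term: how far the student's softened distribution is from the teacher's.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Hard-label term: standard cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * ce


teacher = [4.0, 1.0, 0.0]   # made-up teacher logits; true class is index 0
loss_matching = distillation_loss([4.0, 1.0, 0.0], teacher, label=0)
loss_wrong = distillation_loss([0.0, 1.0, 4.0], teacher, label=0)

if __name__ == "__main__":
    print(loss_matching, loss_wrong)
```

The `T^2` factor keeps the gradient magnitude of the soft term comparable across temperatures. When the student reproduces the teacher's logits exactly, the KL term vanishes and only the hard-label term remains.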
Edge Serving
Cut the round trip: run on the device
// Browser-side inference with ONNX Runtime Web
// `data` is assumed to be a Float32Array of length 1*3*224*224 holding the preprocessed image.
import * as ort from 'onnxruntime-web';
const session = await ort.InferenceSession.create('/model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // tried in order; wasm is the fallback
});
const out = await session.run({ input: new ort.Tensor('float32', data, [1, 3, 224, 224]) });
- Mobile: CoreML (iOS), NNAPI (Android), TFLite.
- Browser: ONNX Runtime Web + WebGPU.
- IoT: Jetson, Coral TPU.
Infra Techniques
System-level latency cuts
- Connection pooling: reuse connections to avoid repeated TCP handshakes.
- HTTP/2 + gRPC: multiplexing, header compression.
- Co-locate: keep the model and the feature store in the same AZ; a cross-AZ hop adds 1–3ms.
- Warm pool: keep idle replicas with the model pre-loaded.
- Speculative execution: call two models in parallel and take whichever responds first.
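Speculative execution can be sketched with asyncio: start both calls, return the first result, and cancel the loser so it stops consuming resources. The helper and the fake models below are hypothetical, and the sketch does not handle the case where the fastest call fails.

```python
import asyncio


async def first_response(*calls):
    """Run the awaitables concurrently, return the first result, cancel the rest."""
    tasks = [asyncio.ensure_future(c) for c in calls]
    try:
        done, _pending = await asyncio.wait(
            tasks, return_when=asyncio.FIRST_COMPLETED
        )
        # If several finish at once, any one of them is an acceptable answer.
        return next(iter(done)).result()
    finally:
        for task in tasks:
            if not task.done():
                task.cancel()


async def fake_model(name, delay_s):
    """Stand-in for a remote model call with a fixed latency."""
    await asyncio.sleep(delay_s)
    return name


async def demo():
    return await first_response(
        fake_model("small", 0.01),
        fake_model("large", 0.2),
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The trade-off is cost: every request now consumes capacity on both models, so this pattern tends to be reserved for the tail (for example, firing the second call only after the first has exceeded its typical latency).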
Pitfalls
Latency optimization traps
- Being satisfied with average latency: measure p99.
- GC pauses and JIT warmup: the first request can be 10x slower.
- Tokenizing in a Python loop: use a Rust-backed tokenizer.
- Synchronous logging: switch to an async logger.
- Over-optimization: losing accuracy to shave 5ms down to 4ms.
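For the synchronous-logging trap, one option in Python is the standard library's `QueueHandler` and `QueueListener`: the request path only enqueues the record, and a background thread does the slow I/O. This is a minimal sketch; `make_async_logger` and the logger name are hypothetical, and the in-memory stream stands in for a slow file or network handler.

```python
import io
import logging
import logging.handlers
import queue


def make_async_logger(name, slow_handler):
    """Return (logger, listener); the logger only enqueues, the listener does the I/O."""
    log_queue = queue.Queue(-1)  # unbounded, so the request path never blocks on a full queue
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False     # keep records away from any synchronous root handlers
    listener.start()             # background thread drains the queue
    return logger, listener


def demo():
    buffer = io.StringIO()       # stands in for a slow file or network sink
    logger, listener = make_async_logger(
        "latency_demo", logging.StreamHandler(buffer)
    )
    logger.info("scored request %s", 42)   # returns as soon as the record is queued
    listener.stop()              # flushes remaining records and joins the thread
    return buffer.getvalue()


if __name__ == "__main__":
    print(demo())
```

Calling `listener.stop()` at shutdown matters: records still sitting in the queue are otherwise lost when the process exits.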
Mini Project
50ms → 10ms challenge
- Build a BERT sentiment service with FastAPI and measure the baseline p99.
- Convert to ONNX, quantize to INT8, and deploy on Triton.
- Add a Redis feature cache and a gRPC client.
- Target: cut p99 latency by 5x.
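Measuring the baseline and checking the 5x target both need a repeatable p99 measurement that excludes warm-up requests. A minimal harness might look like the sketch below; `fn` is a hypothetical zero-argument callable that issues one request to the service, and the synthetic sample is made up to show how the mean hides the tail.

```python
import math
import time


def percentile(samples, q):
    """Nearest-rank percentile: smallest value with at least q% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark(fn, n=200, warmup=20):
    """Call fn() warmup + n times; return latencies in ms for the last n calls only."""
    for _ in range(warmup):          # absorb cold-start effects such as JIT and cache fills
        fn()
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def meets_target(baseline_p99_ms, optimized_p99_ms, factor=5.0):
    """True if the optimized p99 is at least `factor` times lower than the baseline."""
    return baseline_p99_ms / optimized_p99_ms >= factor


# Synthetic sample: 98 fast requests and 2 slow ones (made-up numbers).
synthetic = [10.0] * 98 + [200.0] * 2
mean_ms = sum(synthetic) / len(synthetic)
p50_ms = percentile(synthetic, 50)
p99_ms = percentile(synthetic, 99)

if __name__ == "__main__":
    print(mean_ms, p50_ms, p99_ms)
    print(meets_target(50.0, 10.0))
```

In the synthetic sample the mean and median both look healthy while the p99 is the slow value, which is why the project tracks p99 rather than the average. A sequential loop like this measures unloaded latency only; measuring under concurrent load would need a proper load generator.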
Phase 10 Complete
What you learned
Microservices, event-driven design, scalable infrastructure, and low latency: the four pillars of production-grade AI architecture. Next phase: Industry Projects, real-world builds from beginner to advanced.