Phase 10 · Chapter 10.04

Low Latency AI Systems

Ad bidding at 10 ms, fraud detection at 50 ms, autonomous vehicles at 5 ms: in some systems, missing the latency target means missing the business goal. This chapter covers the techniques used in that world.

Latency Budget

Where every millisecond goes


Total budget: 50ms (p99)
─────────────────────────
Network (client → LB)      5ms
LB → service hop           2ms
Auth + parse               3ms
Feature fetch (Redis)      5ms
Model inference            25ms   ← optimization target
Post-process + response    5ms
Network (return)           5ms
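The budget above only helps if each stage is measured against it in the running service. A minimal sketch of that check follows, assuming hypothetical stage names and a hypothetical `StageTimer` helper (neither comes from a specific library):

```python
import time
from contextlib import contextmanager

# Per-stage p99 budget in milliseconds, copied from the table above.
# The stage names are illustrative assumptions.
BUDGET_MS = {
    "network_in": 5,
    "lb_hop": 2,
    "auth_parse": 3,
    "feature_fetch": 5,
    "inference": 25,
    "postprocess": 5,
    "network_out": 5,
}
TOTAL_BUDGET_MS = 50

# The stages must add up to the total, otherwise the budget is inconsistent.
assert sum(BUDGET_MS.values()) == TOTAL_BUDGET_MS


class StageTimer:
    """Records wall-clock time per stage for a single request."""

    def __init__(self):
        self.spent_ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spent_ms[name] = (time.perf_counter() - start) * 1000.0

    def over_budget(self):
        """Return the stages that exceeded their share of the budget."""
        return {n: ms for n, ms in self.spent_ms.items() if ms > BUDGET_MS[n]}
```

Wrapping each step of a request handler in `with timer.stage("inference"):` and logging `timer.over_budget()` shows which stage is consuming the headroom, instead of only reporting that the total was exceeded.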

Model Optimization

ONNX + TensorRT pipeline


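# 1. PyTorch → ONNX
# `model` is a torch.nn.Module; `dummy` is a sample input, e.g. torch.randn(1, 3, 224, 224)
import torch
model.eval()  # export in inference mode (dropout off, batch-norm statistics frozen)
torch.onnx.export(model, dummy, "model.onnx",
    opset_version=17,
    input_names=["input"], output_names=["output"],  # names that dynamic_axes and trtexec refer to
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})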

# 2. ONNX → TensorRT FP16 engine
# (recent TensorRT releases use --memPoolSize; older ones used --workspace=4096)
# trtexec --onnx=model.onnx --saveEngine=model.plan \
#   --fp16 --memPoolSize=workspace:4096 --minShapes=input:1x3x224x224 \
#   --optShapes=input:8x3x224x224 --maxShapes=input:32x3x224x224

# 3. Serve with Triton
# /models/resnet/config.pbtxt
# platform: "tensorrt_plan"
# max_batch_size: 32
# dynamic_batching { preferred_batch_size: [8, 16, 32] max_queue_delay_microseconds: 2000 }

ResNet50, FP32 PyTorch → FP16 TensorRT: roughly 35 ms → 4 ms for a single image; exact numbers depend on the GPU.
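The `dynamic_batching` setting in step 3 trades a bounded queue wait (here 2000 µs) for larger batches. The sketch below is a simplified model of that idea, not Triton's actual scheduler: it ignores `preferred_batch_size` and assumes a batch is dispatched either when it is full or when its oldest request has waited the maximum delay. The function name `form_batches` is hypothetical.

```python
def form_batches(arrivals_us, max_batch_size=32, max_queue_delay_us=2000):
    """Group sorted request arrival times (microseconds) into batches.

    A batch is dispatched when it reaches max_batch_size, or when its
    oldest request has waited max_queue_delay_us.
    Returns a list of (dispatch_time_us, [arrival_times]).
    """
    batches = []
    current = []
    for t in arrivals_us:
        # The deadline of the queued batch passed before this request arrived.
        if current and t > current[0] + max_queue_delay_us:
            batches.append((current[0] + max_queue_delay_us, current))
            current = []
        current.append(t)
        # Batch is full: dispatch immediately, no further waiting.
        if len(current) == max_batch_size:
            batches.append((t, current))
            current = []
    if current:
        # Remaining requests go out when the oldest one hits the deadline.
        batches.append((current[0] + max_queue_delay_us, current))
    return batches


if __name__ == "__main__":
    arrivals = [0, 500, 1000, 1500, 5000, 5100]
    for dispatch, batch in form_batches(arrivals, max_batch_size=4):
        worst_wait = dispatch - batch[0]
        print(f"dispatch at {dispatch} us, size {len(batch)}, worst wait {worst_wait} us")
```

In this model no request waits in the queue longer than `max_queue_delay_us`, so the delay can be charged directly against the inference line of the latency budget.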

Quantization

INT8, with an accuracy trade-off


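# Post-training INT8 calibration
# `calib_reader` is a CalibrationDataReader that yields representative input batches
from onnxruntime.quantization import quantize_static, QuantFormat, QuantType

quantize_static(
    "model.onnx", "model_int8.onnx",
    calibration_data_reader=calib_reader,
    quant_format=QuantFormat.QDQ,      # quant_format takes a QuantFormat, not a QuantType
    activation_type=QuantType.QInt8,   # INT8 activations
    weight_type=QuantType.QInt8,       # INT8 weights
    per_channel=True,
)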

INT8: about 4x smaller than FP32, typically 2–3x faster, with roughly a 1% accuracy drop.
QAT (Quantization-Aware Training) recovers most of the lost accuracy.
A representative calibration dataset is critical.
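To see why the calibration set matters, here is a minimal pure-Python sketch of symmetric per-tensor INT8 quantization. It is a simplification for illustration, not what ONNX Runtime does internally, and the helper names are hypothetical. The scale comes from the calibration data, so values outside the calibrated range get clipped.

```python
def calibrate_scale(calibration_values):
    """Symmetric per-tensor scale: map the largest observed magnitude to 127."""
    max_abs = max(abs(v) for v in calibration_values)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Round to the nearest INT8 step and clamp to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Map INT8 codes back to approximate real values."""
    return [q * scale for q in q_values]


def max_abs_error(values, scale):
    """Worst-case round-trip error for the given values and scale."""
    restored = dequantize(quantize(values, scale), scale)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    production = [-2.0, -0.5, 0.1, 1.2, 3.8]
    good = calibrate_scale([-4.0, -1.0, 0.5, 4.0])   # covers the production range
    narrow = calibrate_scale([-1.0, 0.2, 1.0])       # misses the large activations
    print("representative calibration error:", max_abs_error(production, good))
    print("narrow calibration error:", max_abs_error(production, narrow))
```

With the representative set, the round-trip error stays within half a quantization step. With the narrow set, the value 3.8 is clipped to about 1.0, which is the kind of silent accuracy loss an unrepresentative calibration set causes.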

Distillation

Large teacher → small student

BERT-base (110M) → DistilBERT (66M): 60% faster while retaining about 97% of the teacher's accuracy.
Train on soft labels (teacher logits) and hard labels together.
Use the distilled model on the latency-critical path and the full model on the batch path.
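The combined soft-label and hard-label objective can be sketched for a single example as below. This assumes the commonly used formulation (KL divergence between temperature-softened teacher and student distributions, plus cross-entropy on the true label); the `temperature` and `alpha` values are illustrative assumptions, not values from this chapter.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    """alpha * soft-label loss + (1 - alpha) * hard-label loss."""
    # Soft term: KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = sum(
        pt * math.log(pt / max(ps, 1e-12))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Scaling by T^2 keeps the soft term's gradient magnitude comparable
    # across temperatures.
    soft *= temperature ** 2
    # Hard term: ordinary cross-entropy against the ground-truth class.
    hard = -math.log(max(softmax(student_logits)[hard_label], 1e-12))
    return alpha * soft + (1 - alpha) * hard
```

A student whose logits match the teacher's gets a soft term of zero, so only the hard-label term remains. In a real training loop the same computation would be done on batched tensors in the training framework.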

Edge Serving

Cut the round trip: run on the device


// Browser-side inference with ONNX Runtime Web
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('/model.onnx', {
  executionProviders: ['webgpu', 'wasm'],
});
// `data` is a Float32Array of length 1*3*224*224 holding the preprocessed image
const out = await session.run({ input: new ort.Tensor('float32', data, [1,3,224,224]) });

Mobile: Core ML (iOS), NNAPI (Android), TFLite.
Browser: ONNX Runtime Web + WebGPU.
IoT: Jetson, Coral TPU.

Infra Techniques

System-level latency cuts

Connection pooling: reuse connections to avoid repeated TCP handshakes.
HTTP/2 + gRPC: multiplexing and header compression.
Co-locate: keep the model and the feature store in the same AZ; a cross-AZ hop adds 1–3 ms.
Warm pool: keep idle replicas with the model already loaded.
Speculative execution: call two models in parallel and take whichever responds first.
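The speculative-execution item can be sketched with `asyncio`: start both calls, return the first successful result, and cancel the loser. `fake_model` is a hypothetical stand-in for a network call to a model replica, and `first_result` is an illustrative helper name.

```python
import asyncio


async def first_result(*coros):
    """Run coroutines concurrently and return the first successful result.

    Remaining tasks are cancelled so they stop consuming resources.
    """
    tasks = [asyncio.ensure_future(c) for c in coros]
    try:
        pending = set(tasks)
        while pending:
            done, pending = await asyncio.wait(
                pending, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                # Skip failed replicas and keep waiting for the others.
                if task.exception() is None:
                    return task.result()
        raise RuntimeError("all replicas failed")
    finally:
        for task in tasks:
            task.cancel()  # no-op for tasks that already finished


async def fake_model(name, delay_s):
    # Hypothetical stand-in for a remote inference call.
    await asyncio.sleep(delay_s)
    return name


if __name__ == "__main__":
    winner = asyncio.run(
        first_result(fake_model("replica-a", 0.01), fake_model("replica-b", 0.2))
    )
    print("served by", winner)
```

This cuts tail latency at the cost of roughly doubling the inference load, so it is usually reserved for the most latency-critical requests.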

Pitfalls

Latency optimization traps

Being satisfied with average latency: measure p99 instead.
GC pauses and JIT warmup: the first request can be 10x slower.
Tokenizer running in a Python loop: use a Rust tokenizer.
Synchronous logging: switch to an async logger.
Over-optimization: losing accuracy to shave 5 ms down to 4 ms.
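For the synchronous-logging pitfall, Python's standard library offers `QueueHandler` and `QueueListener`: the request path only puts a record on an in-memory queue, and a background thread does the slow I/O. A minimal sketch, where `build_async_logger` is a hypothetical helper name:

```python
import logging
import logging.handlers
import queue


def build_async_logger(name, slow_handler):
    """Return (logger, listener); log calls only enqueue, I/O runs in a thread."""
    log_queue = queue.Queue(-1)  # unbounded, so the request path never blocks on put
    listener = logging.handlers.QueueListener(log_queue, slow_handler)
    listener.start()  # background thread drains the queue into slow_handler

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False  # avoid duplicate synchronous output via the root logger
    return logger, listener


if __name__ == "__main__":
    logger, listener = build_async_logger("inference", logging.StreamHandler())
    logger.info("scored request %s in %.1f ms", "abc123", 3.7)
    listener.stop()  # flushes queued records; call this on service shutdown
```

The handler passed in can be a file or network handler; whatever latency it has is paid by the listener thread rather than by the request.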

Mini Project

50ms → 10ms challenge

Build a BERT sentiment FastAPI service and measure the baseline p99.
Convert to ONNX, quantize to INT8, and deploy on Triton.
Add a Redis feature cache and a gRPC client.
Target: cut p99 latency by 5x.
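For the baseline and the final comparison, a small harness that discards warmup calls and reports p99 alongside the mean is enough to start with. This is a single-threaded sketch using the nearest-rank percentile definition. `benchmark` and `percentile` are hypothetical helper names, and `fn` stands for whatever issues one request to the service.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with >= pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def benchmark(fn, warmup=20, iterations=500):
    """Time fn() after a warmup phase; return mean, p50 and p99 in milliseconds."""
    for _ in range(warmup):
        fn()  # discarded: covers JIT warmup, lazy loading, cold caches

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": sum(latencies_ms) / len(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


if __name__ == "__main__":
    # 98 fast requests and 2 slow ones: the mean looks fine, the tail does not.
    samples = [1.0] * 98 + [50.0, 100.0]
    print("mean:", sum(samples) / len(samples), "p99:", percentile(samples, 99))
```

Running the same harness against the baseline and the optimized service gives the ratio to compare with the 5x target. A sequential loop like this does not capture queueing under concurrent load, so a proper load generator is still needed for the final numbers.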

Phase 10 Complete

What you learned

Microservices, event-driven design, scalable infra, and low latency: the four pillars of production-grade AI architecture. Next Phase: Industry Projects, real-world builds from beginner to advanced.
