Phase 9 · Chapter 9.03
Computer Vision Production System
Images are large, models are large, and GPUs are expensive. In a CV system, pipeline efficiency is what makes the difference.
The Pipeline
Request → Result
textproduction
Upload (S3 / multipart)
└─> Validation (size, format, MIME)
└─> Preprocessing (decode, resize, normalize)
└─> Inference (GPU, batched)
└─> Postprocess (NMS, decode mask, draw box)
└─> Response (JSON + signed URL to output)
Sync vs Async
Workload pattern
- Sync: small images with a latency budget under 500 ms, served by FastAPI on a GPU pod.
- Async: video or batch jobs, flowing as upload → SQS/Kafka → worker pool → callback or S3.
- Edge: mobile and IoT devices, running quantized ONNX, TFLite, or CoreML models.
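The split between the sync and async paths can be sketched as a small routing function. This is only an illustration: `choose_route`, the MIME sets, and the size threshold are assumptions made for this example (the 10 MB value mirrors the cap used in the sync endpoint), not part of any framework.

```python
# Assumed thresholds for illustration; tune them against your own latency budget.
SYNC_MAX_BYTES = 10 * 1024 * 1024
IMAGE_TYPES = {"image/jpeg", "image/png"}
VIDEO_TYPES = {"video/mp4"}


def choose_route(content_type: str, size_bytes: int) -> str:
    """Return "sync", "async", or "reject" for an incoming upload."""
    if content_type in IMAGE_TYPES and size_bytes <= SYNC_MAX_BYTES:
        return "sync"    # small image: answer inside the HTTP request
    if content_type in VIDEO_TYPES or content_type in IMAGE_TYPES:
        return "async"   # video or oversized image: enqueue and notify later
    return "reject"      # anything else is not a supported input


if __name__ == "__main__":
    print(choose_route("image/png", 200_000))
    print(choose_route("video/mp4", 50_000_000))
```

Edge deployments do not appear here because that decision is made when the model is shipped to the device, not per request.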
FastAPI Endpoint
Sync image classify
pythonproduction
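from fastapi import FastAPI, UploadFile, HTTPException
from PIL import Image, UnidentifiedImageError
import io, torch
from torchvision import transforms

app = FastAPI()
model = torch.jit.load("resnet50.ts").eval().cuda()

# LABELS is not defined in the original snippet. Assumed here: one class name
# per line in a file called imagenet_classes.txt (hypothetical path).
with open("imagenet_classes.txt") as f:
    LABELS = [line.strip() for line in f]

# Standard ImageNet preprocessing: resize, center crop, scale to [0, 1], normalize.
tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

@app.post("/classify")
async def classify(file: UploadFile):
    # content_type is supplied by the client, so treat it as a first filter only.
    if file.content_type not in {"image/jpeg", "image/png"}:
        raise HTTPException(415, "Unsupported type")
    raw = await file.read()
    if len(raw) > 10 * 1024 * 1024:
        raise HTTPException(413, "Too large")
    try:
        img = Image.open(io.BytesIO(raw)).convert("RGB")
    except (UnidentifiedImageError, OSError):
        raise HTTPException(400, "Not a decodable image")
    x = tfm(img).unsqueeze(0).cuda()
    # Note: this call blocks the event loop while the GPU works.
    with torch.inference_mode():
        logits = model(x)
    probs = logits.softmax(-1)[0]
    top5 = torch.topk(probs, 5)
    return {
        "predictions": [
            {"label": LABELS[i], "prob": p}
            # tolist() turns tensors into plain floats and ints for JSON and indexing
            for p, i in zip(top5.values.tolist(), top5.indices.tolist())
        ]
    }
Optimization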
Saturate the GPU
- torch.compile / TorchScript / ONNX / TensorRT: roughly 2–10x speedup, depending on the model and hardware.
- FP16 / INT8 quantization: lower memory use and lower latency.
- Dynamic batching: Triton Inference Server.
- Pinned memory + async transfer: hides the CPU→GPU copy.
- Preprocess on the GPU: DALI or Kornia.
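To make the idea of dynamic batching concrete, here is a minimal in-process sketch using only `asyncio`. It is not how Triton implements it; `MicroBatcher`, `max_batch`, and `max_wait_s` are names invented for this example, and `fake_model` stands in for a real batched GPU call.

```python
import asyncio
from typing import Any, Callable, List


class MicroBatcher:
    """Collects single requests and runs them through batch_fn together."""

    def __init__(self, batch_fn: Callable[[List[Any]], List[Any]],
                 max_batch: int = 8, max_wait_s: float = 0.01):
        self.batch_fn = batch_fn
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, item: Any) -> Any:
        """Enqueue one item and wait for its individual result."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            item, fut = await self.queue.get()      # wait for the first request
            items, futs = [item], [fut]
            deadline = loop.time() + self.max_wait_s
            # Keep filling the batch until it is full or the wait budget is spent.
            while len(items) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    item, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                items.append(item)
                futs.append(fut)
            try:
                outputs = self.batch_fn(items)       # one call for the whole batch
            except Exception as exc:
                for f in futs:
                    if not f.done():
                        f.set_exception(exc)
                continue
            for f, out in zip(futs, outputs):
                if not f.done():
                    f.set_result(out)


async def demo():
    calls = []                                       # records each batch size

    def fake_model(batch):
        calls.append(len(batch))
        return [x * 2 for x in batch]

    batcher = MicroBatcher(fake_model, max_batch=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs trade latency for throughput: a larger `max_wait_s` fills batches better but adds waiting time to every request.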
Async Video Pipeline
Long-running workload
pythonproduction
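import json
from uuid import uuid4
import boto3

# app, UploadFile, torch, model, BUCKET and Q (the queue URL) are assumed to be
# defined as in the sync service.
s3, sqs = boto3.client("s3"), boto3.client("sqs")

# producer (API)
@app.post("/video/analyze")
async def submit(file: UploadFile):
    key = f"jobs/{uuid4()}.mp4"
    s3.upload_fileobj(file.file, BUCKET, key)
    sqs.send_message(QueueUrl=Q, MessageBody=json.dumps({"key": key}))
    return {"job_id": key}

# worker (k8s deployment, GPU)
# decode_video, batched and save_results are project helpers, not library calls.
while True:
    resp = sqs.receive_message(QueueUrl=Q, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        frames = decode_video(job["key"])
        results = []
        with torch.inference_mode():
            for batch in batched(frames, 16):
                # Assumes the model returns a tensor; collect every batch, not just the last.
                results.extend(model(batch.cuda()).cpu().tolist())
        save_results(job["key"], results)
        s3.put_object(Bucket=BUCKET, Key=f"{job['key']}.json", Body=json.dumps(results))
        # Delete only after the results are stored, so a failed job is redelivered.
        sqs.delete_message(QueueUrl=Q, ReceiptHandle=msg["ReceiptHandle"])
Edge Deployment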
Run on the device
- Mobile: CoreML (iOS), TFLite / NNAPI (Android).
- Browser: ONNX Runtime Web, TF.js, WebGPU.
- Embedded: Jetson + TensorRT, Coral Edge TPU.
- Trade-off: latency ↓, privacy ↑, accuracy ↓, and updates are harder to ship.
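The accuracy cost of quantized edge models comes from rounding. The sketch below shows plain per-tensor affine INT8 quantization in pure Python; real toolchains such as TFLite or ONNX Runtime add calibration and per-channel scales, so treat this only as an illustration of the arithmetic. The function names are invented for this example.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float, int]:
    """Affine-quantize floats to the int8 range; returns (q, scale, zero_point)."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # keep 0.0 inside the range
    scale = (hi - lo) / 255.0 or 1.0          # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)     # integer that lo maps to -128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 codes back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


if __name__ == "__main__":
    weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
    q, scale, zp = quantize_int8(weights)
    restored = dequantize_int8(q, scale, zp)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(q, scale, zp, worst)
```

Each value can move by about one quantization step (`scale`), which is where the small accuracy loss on device comes from.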
Pitfalls
CV-specific
- Ignoring EXIF orientation: inference runs on a rotated or flipped image.
- Color space mismatch (BGR vs RGB): silent accuracy drop.
- Mixed image sizes: batching breaks unless you pad.
- Memory leaks: PIL/OpenCV handles that are never closed.
- Adversarial input: small perturbation, large error. Run robustness tests.
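For the mixed-size pitfall, one common approach is letterboxing: scale each image to fit a square target while keeping its aspect ratio, then pad the rest. The sketch below only computes the geometry. `letterbox_params` and the 640-pixel default are assumptions for this example, and the actual resize and pad would be done with your image library.

```python
from typing import Dict


def letterbox_params(width: int, height: int, target: int = 640) -> Dict[str, float]:
    """Compute resize and padding so a width x height image fits target x target."""
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = min(target / width, target / height)   # keep the aspect ratio
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_x, pad_y = target - new_w, target - new_h
    return {
        "scale": scale,
        "new_w": new_w,
        "new_h": new_h,
        "pad_left": pad_x // 2,
        "pad_right": pad_x - pad_x // 2,
        "pad_top": pad_y // 2,
        "pad_bottom": pad_y - pad_y // 2,
    }


if __name__ == "__main__":
    print(letterbox_params(1280, 720))
    print(letterbox_params(480, 640))
```

Keeping `scale` and the padding offsets around matters for detection models, because predicted boxes have to be mapped back to the original image coordinates.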
Mini Project
YOLO object detection API
- Export Ultralytics YOLOv8 to ONNX.
- FastAPI /detect endpoint: image upload → boxes JSON.
- Deploy on Triton and enable dynamic batching.
- Measure p99 latency (CPU vs GPU).
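For the last step, a p99 measurement can be sketched with the standard library alone. The helpers below are illustrative names, not part of any benchmarking tool, and the nearest-rank percentile is one of several common definitions.

```python
import math
import time
from typing import Callable, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency_ms(fn: Callable[[], object], runs: int = 200,
                       warmup: int = 10) -> List[float]:
    """Call fn repeatedly and return per-call wall-clock latency in milliseconds."""
    for _ in range(warmup):          # warm caches and lazy initialisation first
        fn()
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


if __name__ == "__main__":
    # Placeholder workload; swap in a call to your /detect endpoint or model.
    lat = measure_latency_ms(lambda: sum(range(10_000)))
    print(f"p50={percentile(lat, 50):.3f} ms  p99={percentile(lat, 99):.3f} ms")
```

When timing a PyTorch model on GPU, call `torch.cuda.synchronize()` inside `fn` before it returns, because CUDA kernels launch asynchronously and the clock would otherwise stop too early.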
Takeaway
Remember
CV production = preprocessing + batching + optimized runtime. Model accuracy alone is not enough; you need throughput too.