Phase 9 · Chapter 9.03

Computer Vision Production System

Images are large, models are large, and GPUs are expensive. In a CV system, pipeline efficiency is what makes the difference.

The Pipeline

Request → Result


Upload (S3 / multipart)
  └─> Validation (size, format, MIME)
        └─> Preprocessing (decode, resize, normalize)
              └─> Inference (GPU, batched)
                    └─> Postprocess (NMS, decode mask, draw box)
                          └─> Response (JSON + signed URL to output)
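
The postprocess stage above names NMS (non-maximum suppression). As a rough illustration, here is a minimal pure-Python sketch of greedy NMS. The function names `iou` and `nms`, the `(x1, y1, x2, y2)` box layout, and the 0.5 threshold are assumptions made for this example; a production system would more likely use a vectorized implementation such as `torchvision.ops.nms`.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already-kept box by more than iou_threshold.

    Returns the indices of the kept boxes, ordered by descending score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # indices of the surviving boxes
```

In this example the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the distant third box survives.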

Sync vs Async

Workload pattern

Sync (small image, < 500 ms): FastAPI + GPU pod.
Async (video / batch): upload → SQS/Kafka → worker pool → callback / S3.
Edge (mobile / IoT): quantized ONNX / TFLite / CoreML.

FastAPI Endpoint

Sync image classify


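import io

import torch
from fastapi import FastAPI, HTTPException, UploadFile
from PIL import Image, ImageOps, UnidentifiedImageError
from torchvision import transforms

MAX_BYTES = 10 * 1024 * 1024  # 10 MiB upload limit

app = FastAPI()
model = torch.jit.load("resnet50.ts").eval().cuda()

# Assumption: labels.txt (hypothetical path) holds one class name per line,
# in the same order as the model's output logits.
with open("labels.txt") as f:
    LABELS = [line.strip() for line in f]

# Standard ImageNet preprocessing: resize, center-crop, scale to [0, 1], normalize.
tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])

@app.post("/classify")
async def classify(file: UploadFile):
    # content_type is set by the client, so this is only a cheap first filter.
    if file.content_type not in {"image/jpeg", "image/png"}:
        raise HTTPException(415, "Unsupported type")
    # Read at most one byte past the limit instead of loading the whole upload.
    raw = await file.read(MAX_BYTES + 1)
    if len(raw) > MAX_BYTES:
        raise HTTPException(413, "Too large")

    try:
        img = Image.open(io.BytesIO(raw))
        # Apply the EXIF orientation before converting, so the model sees the image upright.
        img = ImageOps.exif_transpose(img).convert("RGB")
    except (UnidentifiedImageError, OSError):
        raise HTTPException(400, "Invalid image")

    x = tfm(img).unsqueeze(0).cuda()
    with torch.inference_mode():
        logits = model(x)
        probs = logits.softmax(-1)[0]
    top5 = torch.topk(probs, 5)
    # .tolist() turns tensors into plain Python floats/ints for JSON and list indexing.
    return {
        "predictions": [
            {"label": LABELS[i], "prob": p}
            for p, i in zip(top5.values.tolist(), top5.indices.tolist())
        ]
    }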
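
The `content_type` checked in the endpoint above is supplied by the client, so it can be wrong or spoofed. One common complement is to sniff the file's leading bytes. The sketch below is a minimal illustration using only the standard library; the helper name `sniff_image_type` is hypothetical, and only the JPEG and PNG signatures are covered.

```python
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # fixed 8-byte PNG file signature
JPEG_PREFIX = b"\xff\xd8\xff"         # JPEG start-of-image marker plus the next marker's first byte


def sniff_image_type(raw: bytes):
    """Return 'image/png', 'image/jpeg', or None based on the leading bytes."""
    if raw.startswith(PNG_SIGNATURE):
        return "image/png"
    if raw.startswith(JPEG_PREFIX):
        return "image/jpeg"
    return None
```

In the endpoint, a mismatch between `sniff_image_type(raw)` and `file.content_type` could be rejected with the same 415 response. A matching signature does not guarantee that the file decodes, so the decode step still needs its own error handling.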

Optimization

Saturate the GPU

torch.compile / TorchScript / ONNX / TensorRT: roughly 2–10x speedup, depending on the model and hardware.
FP16 / INT8 quantization: lower memory use and latency.
Dynamic batching: Triton Inference Server.
Pinned memory + async transfer: hides the CPU→GPU copy.
Preprocess on GPU: DALI / Kornia.
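
To make the dynamic batching idea concrete, here is a toy asyncio sketch: requests that arrive within a short window are grouped and passed to the model in a single call. It illustrates the concept only and is not how Triton is implemented; `MicroBatcher`, `max_batch`, `max_wait_s`, and the stand-in `fake_model` are all hypothetical names.

```python
import asyncio


class MicroBatcher:
    """Collects single requests into batches of up to `max_batch` items."""

    def __init__(self, batch_fn, max_batch=8, max_wait_s=0.01):
        self.batch_fn = batch_fn      # callable: list of inputs -> list of outputs
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def infer(self, item):
        """Called once per request; resolves when the item's batch has run."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        """Background loop: wait for one item, then fill the batch until the deadline."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # One model call for the whole batch, then fan results back out.
            outputs = self.batch_fn([item for item, _ in batch])
            for (_, fut), out in zip(batch, outputs):
                fut.set_result(out)


batch_sizes = []  # records how many items each model call received


def fake_model(items):
    """Stand-in for a batched GPU forward pass."""
    batch_sizes.append(len(items))
    return [x * 2 for x in items]


async def demo():
    batcher = MicroBatcher(fake_model, max_batch=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.infer(i) for i in range(6)))
    worker.cancel()
    return results


results = asyncio.run(demo())
```

Six concurrent requests end up in fewer than six model calls, which is the whole point: the GPU sees larger batches at the cost of up to `max_wait_s` of extra latency per request. With a real model, the blocking `batch_fn` call would need to run in an executor so it does not stall the event loop.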

Async Video Pipeline

Long-running workload


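import json
from uuid import uuid4

import boto3
import torch

# Assumed to exist elsewhere: app, UploadFile, model, BUCKET, Q (the queue URL),
# decode_video, and batched (yields frame tensors of shape [N, C, H, W]).
s3 = boto3.client("s3")
sqs = boto3.client("sqs")

# producer (API)
@app.post("/video/analyze")
def submit(file: UploadFile):  # plain def: FastAPI runs the blocking boto3 calls in a threadpool
    key = f"jobs/{uuid4()}.mp4"
    s3.upload_fileobj(file.file, BUCKET, key)
    sqs.send_message(QueueUrl=Q, MessageBody=json.dumps({"key": key}))
    return {"job_id": key}

# worker (k8s deployment, GPU)
while True:
    resp = sqs.receive_message(QueueUrl=Q, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        frames = decode_video(job["key"])
        results = []  # accumulate every batch, not just the last one
        with torch.inference_mode():
            for batch in batched(frames, 16):
                # Assumption: the model returns one tensor per batch.
                results.extend(model(batch.cuda()).cpu().tolist())
        s3.put_object(Bucket=BUCKET, Key=f"{job['key']}.json", Body=json.dumps(results))
        # Delete only after the results are stored, so a crashed job is redelivered.
        sqs.delete_message(QueueUrl=Q, ReceiptHandle=msg["ReceiptHandle"])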

Edge Deployment

Run on the device

Mobile: CoreML (iOS), TFLite / NNAPI (Android).
Browser: ONNX Runtime Web, TF.js, WebGPU.
Embedded: Jetson + TensorRT, Coral Edge TPU.
Trade-off: lower latency, better privacy, lower accuracy, harder updates.
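
Part of the accuracy cost of quantized edge models comes from rounding values onto a small integer grid. The following is a minimal worked sketch of affine (asymmetric) INT8 quantization in plain Python. The helper names and sample weights are made up for illustration, and real toolchains typically add steps such as calibration that are not shown here.

```python
def quant_params(values, qmin=-128, qmax=127):
    """Scale and zero-point mapping the value range (extended to include 0)
    onto the integer range [qmin, qmax]."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to clamped integer codes: q = round(x / scale) + zero_point."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Recover approximate floats: x is roughly (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in codes]


weights = [-1.0, -0.25, 0.0, 0.4, 1.55]  # made-up sample values
scale, zp = quant_params(weights)
codes = quantize(weights, scale, zp)
restored = dequantize(codes, scale, zp)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

For values inside the calibrated range, the round-trip error stays within about half a quantization step (`scale / 2`). Values outside that range are clamped and lose more, which is one reason the choice of range matters.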

Pitfalls

CV-specific

Ignoring EXIF orientation: inference runs on a rotated or flipped image.
Color space mismatch (BGR vs RGB): silent accuracy drop.
Mixed image sizes: batching breaks unless images are padded to a common shape.
Memory leaks: unclosed PIL / OpenCV handles accumulate, so close them explicitly.
Adversarial input: a small perturbation can cause a large error. Run robustness tests.
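
For the mixed-size pitfall, one common approach is letterboxing: scale each image by a single factor so it fits a fixed target while keeping its aspect ratio, then pad the remaining border. The sketch below only computes the geometry in plain Python; `letterbox_geometry` is a hypothetical helper, and the 640×640 target is an assumption (a size commonly used with YOLO models).

```python
def letterbox_geometry(width, height, target_w=640, target_h=640):
    """Return (new_w, new_h, pad_left, pad_top, pad_right, pad_bottom).

    The image is scaled by one factor so it fits inside the target,
    then centered; the pads fill the remaining border.
    """
    scale = min(target_w / width, target_h / height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    pad_w, pad_h = target_w - new_w, target_h - new_h
    pad_left, pad_top = pad_w // 2, pad_h // 2
    return new_w, new_h, pad_left, pad_top, pad_w - pad_left, pad_h - pad_top


geom_landscape = letterbox_geometry(1920, 1080)  # wide image: padded top and bottom
geom_portrait = letterbox_geometry(480, 960)     # tall image: padded left and right
```

Because every image ends up with the same target shape, the results can be stacked into one batch tensor. Boxes predicted on the padded image then have to be mapped back by subtracting the pads and dividing by the scale factor.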

Mini Project

YOLO object detection API

Export Ultralytics YOLOv8 to ONNX.
Build a FastAPI /detect endpoint: image upload → boxes as JSON.
Deploy to Triton and enable dynamic batching.
Measure p99 latency (CPU vs GPU).
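
For the p99 measurement step, a simple harness can time repeated calls and report percentiles. This is a minimal sketch using only the standard library; `measure_latency`, `percentile`, and the warmup count are hypothetical choices, and the nearest-rank percentile definition is one of several in use. For GPU code, the timed function has to block until the result is actually ready (for example by copying the output to the CPU), otherwise mostly the kernel launch is timed.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of the samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def measure_latency(fn, n=200, warmup=20):
    """Call fn() repeatedly and return p50/p95/p99 latency in milliseconds."""
    for _ in range(warmup):  # untimed calls to absorb lazy initialization
        fn()
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}


# Stand-in workload; replace the lambda with a call to the /detect handler or model.
stats = measure_latency(lambda: sum(range(1000)), n=100, warmup=5)
```

Running the same harness against the CPU and GPU variants of the model gives directly comparable numbers, provided the input image and batch size are held fixed.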

Takeaway

Remember

CV production = preprocessing + batching + optimized runtime. Model accuracy alone is not enough; you also need throughput.
