Phase 4 · Chapter 4.03

Serverless AI Deployment

No servers to manage, no scaling to configure, no idle bill: code runs only when a request arrives. Is this the golden goose for ML inference?

What is Serverless

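Core idea

You write only a function. When a request arrives, the cloud provider spins up a container, runs your code, and returns the response; an idle container is eventually shut down. Billing is roughly execution time × allocated memory, usually plus a small per-request fee.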

AWS Lambda: 15 min max execution time, up to 10 GB memory, container image support.
Google Cloud Run: full HTTP container, scale-to-zero, request timeout up to 60 min.
Azure Functions: event-driven, multi-runtime.
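
To make the billing rule above concrete, here is a rough sketch of a monthly cost estimate. The function name and both prices are placeholder assumptions, not any provider's actual rates; check the current pricing page before relying on the numbers.

```python
def monthly_cost(requests, avg_seconds, memory_gb,
                 price_per_gb_second, price_per_million_requests):
    """Estimate a month's bill: execution time x memory, plus a per-request fee.

    Both prices are inputs on purpose: plug in your provider's current rates.
    """
    gb_seconds = requests * avg_seconds * memory_gb
    compute = gb_seconds * price_per_gb_second
    request_fee = requests / 1_000_000 * price_per_million_requests
    return compute + request_fee


# Hypothetical workload: 300 requests/day, 200 ms each, 1 GB of memory.
# The two prices below are made-up placeholders for illustration only.
cost = monthly_cost(
    requests=300 * 30,
    avg_seconds=0.2,
    memory_gb=1.0,
    price_per_gb_second=0.00002,
    price_per_million_requests=0.20,
)
print(f"estimated monthly cost: ${cost:.4f}")
```

With low traffic the estimate stays in the range of cents, which is why scale-to-zero is attractive for small workloads.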

Lambda Example

Iris inference as Lambda (container)


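# app.py
import json

import joblib

# Loaded once per container at import time, then reused by every warm invocation.
model = joblib.load("/var/task/model.joblib")

FEATURES = ("sepal_length", "sepal_width", "petal_length", "petal_width")


def handler(event, context):
    # API Gateway proxy integration delivers the request body as a JSON string.
    try:
        body = json.loads(event.get("body") or "{}")
        features = [[float(body[name]) for name in FEATURES]]
    except (KeyError, TypeError, ValueError) as exc:
        # Malformed JSON or a missing/non-numeric field is a client error, not a crash.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}

    pred = model.predict(features)[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": str(pred), "version": "v1"}),
    }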
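
Before building the image, it can help to call a handler locally with an event shaped like the one API Gateway's proxy integration sends. The sketch below is not the chapter's app.py: it swaps the joblib model for a hypothetical `StubModel` so it runs without a model file or any dependency, and `make_event` is an illustrative helper.

```python
import json


class StubModel:
    """Hypothetical stand-in for the joblib model; always predicts class 0."""

    def predict(self, rows):
        return [0 for _ in rows]


model = StubModel()


def handler(event, context):
    # Mirrors the chapter's handler logic, with the stub model swapped in.
    body = json.loads(event["body"])
    features = [[
        body["sepal_length"], body["sepal_width"],
        body["petal_length"], body["petal_width"],
    ]]
    pred = model.predict(features)[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": str(pred), "version": "v1"}),
    }


def make_event(payload):
    # Proxy-style events carry the HTTP body as a JSON *string*, not a dict.
    return {"body": json.dumps(payload)}


event = make_event({
    "sepal_length": 5.1, "sepal_width": 3.5,
    "petal_length": 1.4, "petal_width": 0.2,
})
response = handler(event, context=None)
print(response["statusCode"], response["body"])
```

The same `make_event` payload can later be reused as the test event in the Lambda console.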

Dockerfile

Lambda container image


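FROM public.ecr.aws/lambda/python:3.11

# Install dependencies first so this layer is cached across code changes.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# The base image's working directory is /var/task, where app.py expects the model.
COPY app.py model.joblib ./

# module.function that the Lambda runtime invokes
CMD ["app.handler"]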

Cloud Run

Full FastAPI container, scale-to-zero


# build + push
gcloud builds submit --tag gcr.io/PROJECT/iris-api

# deploy
gcloud run deploy iris-api \
  --image gcr.io/PROJECT/iris-api \
  --region asia-south1 \
  --memory 1Gi \
  --cpu 1 \
  --min-instances 0 \
  --max-instances 10 \
  --allow-unauthenticated
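
The deploy command above assumes the image already contains an HTTP server. Cloud Run expects the container to listen on the port given in the `PORT` environment variable (8080 by default). The chapter's service is a FastAPI app; the sketch below deliberately swaps in the standard library's `http.server` so it runs with no dependencies, and the `/predict` path and the `predict` stub are illustrative assumptions.

```python
import http.client
import json
import os
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def predict(features):
    # Placeholder for the real model call; returns a fixed label.
    return "setosa"


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps({"prediction": predict(payload)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def make_server(port=None):
    # Cloud Run injects PORT; fall back to 8080 for local runs.
    if port is None:
        port = int(os.environ.get("PORT", "8080"))
    return ThreadingHTTPServer(("0.0.0.0", port), PredictHandler)


# Local smoke test on an ephemeral port (port 0 lets the OS pick one).
# In the container entrypoint you would call make_server().serve_forever() instead.
server = make_server(port=0)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("POST", "/predict", body=json.dumps({"sepal_length": 5.1}),
             headers={"Content-Type": "application/json"})
resp = conn.getresponse()
status = resp.status
result = json.loads(resp.read())
conn.close()

server.shutdown()
server.server_close()
print(status, result)
```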

When to use

When serverless wins

Low or spiky traffic (a few hundred requests per day).
Internal tools, webhook handlers, async batch jobs.
Prototype / MVP: infra cost starts at zero.

Limits

When serverless loses

Cold start: loading a large model can take 5–30 s, which hurts user-facing latency.
No GPU on Lambda; GPU support on Cloud Run is limited.
Package size limit: Lambda zip 250 MB (unzipped); container image 10 GB.
Sustained high traffic: per-request pricing ends up more expensive than a dedicated VM.
Long-lived connections (WebSocket, streaming) are hard to support.
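
The sustained-traffic point above can be checked with a back-of-the-envelope break-even calculation. In this sketch the function name and every price are hypothetical placeholders; substitute real quotes for your region and instance type.

```python
def break_even_requests(vm_monthly_price, cost_per_request):
    """Requests per month above which a flat-rate VM is cheaper than pay-per-request."""
    if cost_per_request <= 0:
        raise ValueError("cost_per_request must be positive")
    return vm_monthly_price / cost_per_request


# Placeholder numbers, for illustration only.
vm_price = 30.0          # flat monthly price of a small VM
per_request = 0.000004   # serverless cost of one inference request
threshold = break_even_requests(vm_price, per_request)
print(f"break-even at about {threshold:,.0f} requests/month")
```

The comparison is deliberately crude: it ignores the VM's capacity ceiling and the operational work of running it, so treat the threshold as a first signal rather than a decision rule.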

Cold Start Mitigation

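Techniques to reduce latency

Set provisioned concurrency (Lambda) or min-instances (Cloud Run) to 1 so one instance stays warm, at the cost of paying for idle time.
Don't lazy-load the model; load it in global scope.
Use a lightweight runtime: ONNX or a quantized model.
Keep the dependency tree small: don't import all of sklearn, only the submodules you use.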
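
The global-scope advice above can be illustrated with a small simulation. This is a sketch, not Lambda itself: `slow_load` is a hypothetical stand-in for `joblib.load`, and the two classes model an instance whose init phase finishes before the first request arrives, as it would with provisioned concurrency or min-instances.

```python
import time


def slow_load():
    # Hypothetical stand-in for joblib.load on a large model.
    time.sleep(0.2)
    return lambda features: 0


class EagerContainer:
    """Loads the model during init, before any request is served."""

    def __init__(self):
        self.model = slow_load()

    def invoke(self, features):
        return self.model(features)


class LazyContainer:
    """Defers loading until the first request needs the model."""

    def __init__(self):
        self.model = None

    def invoke(self, features):
        if self.model is None:
            self.model = slow_load()
        return self.model(features)


def first_request_seconds(container):
    start = time.perf_counter()
    container.invoke([5.1, 3.5, 1.4, 0.2])
    return time.perf_counter() - start


# Init runs here, outside the timed request, as it would for a pre-warmed instance.
eager_latency = first_request_seconds(EagerContainer())
lazy_latency = first_request_seconds(LazyContainer())
print(f"eager first request: {eager_latency:.3f}s, lazy first request: {lazy_latency:.3f}s")
```

On a true cold start the first request waits for init either way; the eager version pays off when the platform can run init ahead of traffic, while the lazy version makes the first caller on every instance pay the load time.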

Mini Project

Iris on Lambda

Build the container image and push it to ECR.
Create the Lambda function with the image as its source.
Expose a public URL through API Gateway.
Measure cold vs warm latency (10 requests).
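
For the measurement step, one possible helper is sketched below. `measure` and `fake_endpoint` are illustrative names; in the real project you would replace `fake_endpoint` with a function that POSTs to your API Gateway URL. The first call is treated as the cold start and the median of the rest as warm latency, which assumes the function was idle before the run.

```python
import statistics
import time


def measure(send, n=10, clock=time.perf_counter):
    """Call send() n times; return (cold_seconds, warm_median_seconds)."""
    if n < 2:
        raise ValueError("need at least 2 requests to compare cold vs warm")
    latencies = []
    for _ in range(n):
        start = clock()
        send()
        latencies.append(clock() - start)
    return latencies[0], statistics.median(latencies[1:])


# Stand-in endpoint: the first call is slow (cold), later calls are fast (warm).
calls = {"count": 0}


def fake_endpoint():
    calls["count"] += 1
    time.sleep(0.1 if calls["count"] == 1 else 0.005)


cold, warm = measure(fake_endpoint, n=10)
print(f"cold: {cold * 1000:.0f} ms, warm median: {warm * 1000:.0f} ms")
```

Using the median for the warm figure keeps a single slow request from skewing the comparison.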

Phase 4 Complete

What you learned

Cloud basics, managed model hosting, and serverless inference: the tradeoffs among the three approaches. Next phase: Kubernetes & Scaling, production-grade orchestration.
