Phase 7 · Chapter 7.01

Model Monitoring

Deploy and forget? The model quietly rots. Monitoring means continuously taking the model's pulse in production.

Why

Software vs ML monitoring

When software fails it throws an error, so alerting is easy. A model fails silently: the response is the same 200 OK, but the prediction is wrong. That is why an extra monitoring layer is needed.

3 Layers

What to measure

Operational: latency, QPS, error rate, CPU/memory. This is the DevOps world.
Statistical: input drift, output drift, prediction distribution.
Business / Outcome: conversion, revenue per prediction, user feedback.

Instrumentation

FastAPI + Prometheus

python · production

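import time

import joblib
from fastapi import FastAPI
from prometheus_client import Counter, Histogram, make_asgi_app
from pydantic import BaseModel

MODEL_VERSION = "v1"

app = FastAPI()
# Expose the default metrics registry at /metrics for Prometheus to scrape.
app.mount("/metrics", make_asgi_app())

PRED_COUNT = Counter(
    "predictions_total", "Total predictions",
    ["model_version", "predicted_class"],
)
PRED_LATENCY = Histogram(
    "prediction_latency_seconds", "Inference latency",
    ["model_version"],
)
INPUT_FEATURE = Histogram(
    "input_petal_length", "Petal length distribution",
    buckets=[1, 2, 3, 4, 5, 6, 7],
)


class IrisFeatures(BaseModel):
    # Assumed request schema; only petal_length and to_array() are used below.
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float

    def to_array(self) -> list[float]:
        return [self.sepal_length, self.sepal_width,
                self.petal_length, self.petal_width]


# Assumption: a scikit-learn classifier saved earlier; the path is hypothetical.
model = joblib.load("model.joblib")


@app.post("/predict")
def predict(req: IrisFeatures):
    INPUT_FEATURE.observe(req.petal_length)
    start = time.perf_counter()
    # Cast to str so the value is JSON-serializable even for a NumPy scalar.
    label = str(model.predict([req.to_array()])[0])
    PRED_LATENCY.labels(MODEL_VERSION).observe(time.perf_counter() - start)
    PRED_COUNT.labels(MODEL_VERSION, label).inc()
    return {"prediction": label, "model_version": MODEL_VERSION}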

What to Track

Per-layer checklist

text · production

Operational
  - request rate, error rate, p50/p95/p99 latency
  - container CPU/mem, GPU utilization

Statistical
  - input feature mean/std/histogram per hour
  - prediction class distribution
  - confidence score histogram
  - drift score (PSI / KS)

Business
  - click-through, conversion
  - delayed label accuracy (when truth arrives)
  - user override / thumbs-down rate
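
The drift score in the Statistical list can be made concrete with a small calculation. Below is a minimal sketch of the Population Stability Index (PSI) in plain Python, assuming fixed bin edges chosen by hand. The function name `psi`, the `eps` floor for empty bins, and the sample values are illustrative assumptions, not part of the chapter's API.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between a reference and a live sample.

    `edges` are sorted inner bin boundaries; values below the first edge
    and at or above the last edge fall into open-ended outer bins.
    """
    def proportions(values: Sequence[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The number of edges that v has reached is its bin index.
            counts[sum(1 for e in edges if v >= e)] += 1
        total = len(values)
        # Floor each share at eps so empty bins do not cause log(0).
        return [max(c / total, eps) for c in counts]

    p_ref = proportions(expected)
    p_live = proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(p_ref, p_live))


# Illustrative petal lengths: a reference sample and a shifted live sample.
reference = [1.4, 1.5, 1.3, 4.5, 4.7, 5.1, 5.9, 6.0]
shifted = [v + 1.5 for v in reference]
edges = [2, 3, 4, 5, 6]

print(psi(reference, reference, edges))  # 0.0, identical distributions
print(psi(reference, shifted, edges))    # clearly positive, distribution moved
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a significant shift, but the threshold and the binning should be tuned per feature.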

The Stack

Tooling combinations

Open-source: Prometheus + Grafana + Loki + Evidently.
Managed: Datadog, New Relic. Excellent for the operational layer.
ML-specific: Arize, WhyLabs, Fiddler, Aporia.
DIY: log to S3 → batch job → BigQuery → dashboard.
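
The batch-job step of the DIY route can be sketched as follows. It assumes each prediction was logged as one JSON line with a Unix timestamp `ts` and a `features` object. That record layout and the function name `hourly_feature_stats` are hypothetical. Reading from S3 and loading into BigQuery are left out.

```python
import json
import statistics
from collections import defaultdict
from datetime import datetime, timezone


def hourly_feature_stats(jsonl_lines, feature):
    """Group logged prediction records by UTC hour and summarise one feature."""
    buckets = defaultdict(list)
    for line in jsonl_lines:
        record = json.loads(line)
        ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
        buckets[ts.strftime("%Y-%m-%dT%H:00Z")].append(record["features"][feature])
    return {
        hour: {
            "count": len(values),
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),  # population standard deviation
        }
        for hour, values in sorted(buckets.items())
    }


# Hypothetical log lines, one JSON object per prediction.
log_lines = [
    json.dumps({"ts": 0, "features": {"petal_length": 1.0}, "prediction": "setosa"}),
    json.dumps({"ts": 1800, "features": {"petal_length": 3.0}, "prediction": "versicolor"}),
    json.dumps({"ts": 3600, "features": {"petal_length": 5.0}, "prediction": "virginica"}),
]
stats = hourly_feature_stats(log_lines, "petal_length")
print(stats)
```

The resulting per-hour rows are small enough to load into any warehouse table and chart on a dashboard.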

Anti-patterns

What goes wrong

Monitoring only latency: model accuracy quietly degrades unnoticed.
Waiting for ground truth: look at a proxy metric (input drift) first.
Putting a high-cardinality label (user_id) in Prometheus: the number of time series explodes.
Alert fatigue: alerting on every metric means the important alerts get missed.
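
To see why a high-cardinality label is a problem, it helps to count label combinations, since Prometheus stores one time series per distinct combination. The sketch below uses simulated traffic. The `plan` label, the user counts, and the helper `series_count` are made up for illustration.

```python
def series_count(events, label_names):
    """Number of distinct label combinations, one time series each."""
    return len({tuple(event[name] for name in label_names) for event in events})


PLANS = ["free", "pro", "enterprise"]

# Hypothetical traffic: 10,000 requests spread over 5,000 users and 3 plans.
events = [
    {"user_id": f"u{i % 5000}", "plan": PLANS[i % 3], "model_version": "v1"}
    for i in range(10_000)
]

print(series_count(events, ["model_version", "plan"]))     # 3
print(series_count(events, ["model_version", "user_id"]))  # 5000
```

Per-user detail usually belongs in logs or a warehouse table, while metric labels stay limited to a small, bounded set of values.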

Mini Project

Iris monitoring

Add the three metrics above to the Iris API.
Start Prometheus + Grafana with Docker Compose.
Send 1000 dummy requests and build a Grafana dashboard.
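
One possible way to generate the dummy traffic, using only the standard library. The JSON field names other than `petal_length`, the value ranges (taken from the classic Iris dataset), and the local URL in the usage note are assumptions. Adjust them to match your own API.

```python
import json
import random
import urllib.request


def make_payload(rng: random.Random) -> dict:
    """Random but plausible Iris measurements in centimetres."""
    return {
        "sepal_length": round(rng.uniform(4.3, 7.9), 1),
        "sepal_width": round(rng.uniform(2.0, 4.4), 1),
        "petal_length": round(rng.uniform(1.0, 6.9), 1),
        "petal_width": round(rng.uniform(0.1, 2.5), 1),
    }


def send_requests(url: str, n: int = 1000, seed: int = 0) -> int:
    """POST n payloads to the API and return how many got HTTP 200."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(n):
        body = json.dumps(make_payload(rng)).encode("utf-8")
        req = urllib.request.Request(
            url, data=body,
            headers={"Content-Type": "application/json"}, method="POST",
        )
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                if resp.status == 200:
                    ok += 1
        except OSError:
            # URLError, HTTPError and timeouts are OSError subclasses.
            pass
    return ok
```

With the API running locally, `send_requests("http://localhost:8000/predict")` should return 1000 if every call succeeds. The counters and histograms can then be checked at `/metrics` before building the dashboard.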

Takeaway

Remember

Monitoring = operational + statistical + business. Do not skip any of the three layers.
