Phase 7 · Chapter 7.04

Logging & Alert Systems

Logs are your flight recorder; alerts are your fire alarm. Without both, debugging production means fumbling in the dark.

Three Pillars

Observability stack

  • Metrics: numeric time series (Prometheus).
  • Logs: discrete events with context (Loki / ELK).
  • Traces: a request's journey across services (Jaeger / Tempo).
Structured Logging

JSON, not plain text

Grepping plain-text logs does not work at scale. JSON logs can be queried by field.

pythonproduction
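import structlog

# Render each event as one JSON object with an ISO-8601 UTC timestamp
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),
        structlog.processors.JSONRenderer(),
    ]
)

log = structlog.get_logger()

# req, label, prob, and elapsed are assumed to be defined by the surrounding request handler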

log.info(
    "prediction",
    request_id=req.id,
    user_id=req.user_id,
    model_version="v1",
    predicted_class=label,
    confidence=float(prob),
    latency_ms=round(elapsed * 1000, 2),
)
jsonproduction
{
  "event": "prediction",
  "request_id": "abc-123",
  "user_id": 42,
  "model_version": "v1",
  "predicted_class": "setosa",
  "confidence": 0.97,
  "latency_ms": 38.2,
  "timestamp": "2026-05-24T12:34:56Z"
}
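To illustrate what "queryable" means in practice, here is a minimal sketch that filters log lines on a numeric field, which grep cannot do reliably. The sample lines and the low_confidence helper are hypothetical and exist only for illustration; the field names are taken from the event above.

```python
import json

# Hypothetical sample log lines, shaped like the JSON event above
LOG_LINES = [
    '{"event": "prediction", "request_id": "abc-123", "predicted_class": "setosa", "confidence": 0.97}',
    '{"event": "prediction", "request_id": "abc-124", "predicted_class": "virginica", "confidence": 0.41}',
    "legacy plain-text line that is not JSON",
]


def low_confidence(lines, threshold=0.5):
    """Return parsed prediction events whose confidence is below threshold."""
    hits = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        if record.get("event") == "prediction" and record.get("confidence", 1.0) < threshold:
            hits.append(record)
    return hits


matches = low_confidence(LOG_LINES)
print([m["request_id"] for m in matches])  # ['abc-124']
```

In a real deployment this filtering would normally happen in the log backend rather than in application code; the sketch only shows why a typed field is easier to query than free text.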
Log Levels

Usage rules

  • DEBUG: dev only, off in prod.
  • INFO: normal events (predict, deploy).
  • WARN: recoverable anomaly (fallback triggered).
  • ERROR: request failed but the service is alive.
  • CRITICAL: service-wide problem.
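The "DEBUG off in prod" rule can be sketched with Python's standard logging module. This is an illustration under assumptions: the environment names "dev" and "prod" and the helpers level_for and make_logger are hypothetical, and note that the standard library spells the WARN level as WARNING.

```python
import logging


def level_for(environment):
    """Pick the minimum log level: DEBUG in dev, INFO everywhere else."""
    return logging.DEBUG if environment == "dev" else logging.INFO


def make_logger(environment):
    """Create a logger whose threshold depends on the (hypothetical) environment name."""
    logger = logging.getLogger(f"iris-api.{environment}")
    logger.setLevel(level_for(environment))
    return logger


prod = make_logger("prod")
dev = make_logger("dev")

print(prod.isEnabledFor(logging.DEBUG))    # False: DEBUG is dropped in prod
print(prod.isEnabledFor(logging.WARNING))  # True
print(dev.isEnabledFor(logging.DEBUG))     # True
```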
Stack

Loki + Promtail + Grafana

yamlproduction
# docker-compose.yml snippet
services:
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yml:/etc/promtail/config.yml

  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
logqlproduction
# LogQL queries in Grafana
{app="iris-api"} |= "error"
{app="iris-api"} | json | confidence < 0.5
sum by (predicted_class) (count_over_time({app="iris-api"} | json [5m]))
Alertmanager

Prometheus alert rule

yamlproduction
groups:
  - name: iris-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "Iris API 5xx rate > 5% for 5m"

      - alert: P99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(predict_latency_seconds_bucket[5m]))
          ) > 0.5
        for: 10m
        labels: { severity: warn }

      - alert: ModelDrift
        expr: model_psi_score > 0.25
        for: 1h
        labels: { severity: warn }
        annotations:
          summary: "Feature drift detected — review retrain"
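The HighErrorRate expression divides the 5xx request rate by the total request rate. Because both rates cover the same window, the ratio reduces to a ratio of counter increases. The sketch below shows that arithmetic in plain Python; it is a simplification (PromQL's rate() also handles counter resets and extrapolation), and the counter readings are hypothetical.

```python
def error_ratio(total_start, total_end, errors_start, errors_end):
    """Fraction of requests that returned 5xx between two counter readings."""
    total = total_end - total_start
    errors = errors_end - errors_start
    if total <= 0:
        return 0.0  # no traffic in the window, so nothing to alert on
    return errors / total


# Hypothetical counter readings taken 5 minutes apart
ratio = error_ratio(total_start=10_000, total_end=12_000,
                    errors_start=100, errors_end=260)
print(ratio, ratio > 0.05)  # 0.08 True
```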
Alert Hygiene

SRE wisdom

  • Put a runbook link on every alert.
  • Severity: page (wake on-call) vs warn (ticket).
  • for: 5m stops flapping alerts (the condition must hold for 5 minutes before the alert fires).
  • Alert on symptoms, not causes (user impact is what matters).
  • If alert fatigue sets in, tune the thresholds; alerts that get ignored are the bigger danger.
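To see how a hold time like for: 5m suppresses flapping, here is a minimal sketch of the pending-then-firing logic. It is an approximation written for illustration, not Prometheus's implementation; the firing_times helper and the sample values are hypothetical.

```python
def firing_times(samples, threshold, hold_for):
    """Return the timestamps at which an alert would be firing.

    samples: list of (timestamp_seconds, value) pairs in time order.
    The alert fires only once value > threshold has held continuously
    for at least hold_for seconds.
    """
    pending_since = None
    firing = []
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= hold_for:
                firing.append(ts)
        else:
            pending_since = None  # condition cleared: reset the clock
    return firing


# Hypothetical error-ratio samples, one per minute
samples = [(0, 0.08), (60, 0.02), (120, 0.09), (180, 0.07),
           (240, 0.06), (300, 0.08), (360, 0.10), (420, 0.09)]
print(firing_times(samples, threshold=0.05, hold_for=300))  # [420]
```

The spike at t=0 never fires because the condition clears at t=60; only the sustained breach starting at t=120 reaches the 300-second hold.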
ML-Specific Alerts

Alerts that only ML systems need

  • Feature drift PSI > 0.25.
  • Prediction class skew > baseline ± 20%.
  • Average confidence drop > 15%.
  • Rolling accuracy (when labels are available) < SLO.
  • Model file size / hash mismatch on rollout.
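The PSI threshold in the first item can be made concrete with a short sketch. It uses the common definition PSI = Σ (a_i − e_i) · ln(a_i / e_i) over bins, where e_i and a_i are the baseline and live proportions. The bin counts and the eps floor are assumptions chosen for illustration.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index between two binned distributions."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, eps)  # floor avoids log(0) on empty bins
        a_pct = max(a / a_total, eps)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score


# Hypothetical per-bin counts for one feature: training data vs live traffic
baseline = [250, 250, 250, 250]
live_same = [240, 260, 255, 245]
live_shifted = [100, 150, 250, 500]

print(psi(baseline, live_same) > 0.25)     # False: distributions nearly match
print(psi(baseline, live_shifted) > 0.25)  # True: drift alert would trigger
```

A batch job could compute this per feature and export it as a gauge such as model_psi_score, which is what the ModelDrift rule compares against 0.25.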
Pitfalls

What breaks logging and alerting

  • Logging PII: a GDPR/audit violation.
  • Everything at INFO: disks fill up, search slows down.
  • Alerts with no owner: nobody looks at them.
  • Outdated runbooks: the alert fires and nobody knows what to do.
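One way to guard against the PII pitfall is to mask sensitive fields before an event is rendered. The sketch below is written in the shape of a structlog processor, a callable taking (logger, method_name, event_dict), so it could be placed in a processors list ahead of the renderer. The PII_KEYS set and the field names are hypothetical; adjust them to your own schema.

```python
# Hypothetical set of keys treated as PII; adjust to your own schema
PII_KEYS = {"email", "phone", "ip_address"}


def redact_pii(logger, method_name, event_dict):
    """Processor-style function: mask PII values before the event is rendered."""
    for key in PII_KEYS.intersection(event_dict):
        event_dict[key] = "[REDACTED]"
    return event_dict


event = {"event": "prediction", "confidence": 0.97, "email": "a@example.com"}
print(redact_pii(None, "info", event))
```

Masking at the logging layer is a safety net, not a substitute for keeping PII out of log calls in the first place.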
Mini Project

Full observability stack

  1. Add structlog to the Iris API.
  2. Bring up Loki + Promtail + Grafana with Compose.
  3. Define the HighErrorRate rule in Prometheus and route it through Alertmanager.
  4. Test: deliberately throw 500s to make the alert fire.
Phase 7 Complete

What you learned

Monitoring, drift detection, performance tracking, logging and alerting: the eyes, ears, and hands of production ML. Next phase: Advanced MLOps, covering A/B tests, canary, blue-green, and AutoML.
