Phase 7 · Chapter 7.04
Logging & Alert Systems
Logs are your flight recorder; alerts are your fire alarm. Without both, debugging production means fumbling in the dark.
Three Pillars
Observability stack
- Metrics: numeric time series (Prometheus).
- Logs: discrete events with context (Loki / ELK).
- Traces: request journey across services (Jaeger / Tempo).
Structured Logging
JSON, not plain text
Grepping plain-text logs does not work at scale. JSON logs are queryable: every field can be filtered and aggregated.
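To make "queryable" concrete, here is a minimal sketch using only the standard library. The sample lines and the helper names `parse_lines` and `low_confidence` are hypothetical; the field names follow the prediction event used in this chapter.

```python
import json

# Hypothetical sample: one JSON object per line, shaped like the chapter's prediction event.
RAW_LOGS = """\
{"event": "prediction", "request_id": "abc-123", "predicted_class": "setosa", "confidence": 0.97}
{"event": "prediction", "request_id": "abc-124", "predicted_class": "virginica", "confidence": 0.41}
{"event": "prediction", "request_id": "abc-125", "predicted_class": "setosa", "confidence": 0.88}
"""


def parse_lines(raw: str) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def low_confidence(events: list[dict], threshold: float = 0.5) -> list[dict]:
    """Keep only events whose confidence is below the threshold."""
    return [e for e in events if e.get("confidence", 1.0) < threshold]


events = parse_lines(RAW_LOGS)
flagged = low_confidence(events)
print([e["request_id"] for e in flagged])
```

The same filter on plain-text lines would need a regular expression per field; with JSON the field is addressed by name.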
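python · production
import structlog

# structlog's default renderer prints console-style lines; configure JSON output explicitly.
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso", utc=True),  # adds "timestamp"
        structlog.processors.JSONRenderer(),  # one JSON object per event
    ]
)
log = structlog.get_logger()

# req, label, prob, and elapsed come from the surrounding request handler.
log.info(
    "prediction",
    request_id=req.id,
    user_id=req.user_id,
    model_version="v1",
    predicted_class=label,
    confidence=float(prob),
    latency_ms=round(elapsed * 1000, 2),
)
json · production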
{
  "event": "prediction",
  "request_id": "abc-123",
  "user_id": 42,
  "model_version": "v1",
  "predicted_class": "setosa",
  "confidence": 0.97,
  "latency_ms": 38.2,
  "timestamp": "2026-05-24T12:34:56Z"
}
Log Levels
Usage rules
- DEBUG: development only; off in production.
- INFO: normal events (predict, deploy).
- WARN: recoverable anomaly (fallback triggered).
- ERROR: the request failed but the service is alive.
- CRITICAL: service-wide problem.
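The level rules above can be sketched with Python's standard `logging` module (swapped in here for structlog so the example runs without extra dependencies). The logger name and messages are hypothetical. Note that the standard library names the level WARNING rather than WARN.

```python
import io
import logging


def make_logger(name: str, level: int, stream: io.StringIO) -> logging.Logger:
    """Build an isolated logger that writes 'LEVEL message' lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
# Production-style setting: INFO and above, so DEBUG is dropped.
log = make_logger("iris-api-demo", logging.INFO, buffer)

log.debug("feature vector dump")                        # dropped in production
log.info("prediction served")                           # normal event
log.warning("primary model timed out, fallback used")   # recoverable anomaly
log.error("request failed: invalid payload")            # one request failed
log.critical("model file missing, cannot predict")      # service-wide problem

emitted = buffer.getvalue().splitlines()
print(emitted)
```

Switching the level to `logging.DEBUG` in development would let the first message through without touching any call site.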
Stack
Loki + Promtail + Grafana
yaml · production
# docker-compose.yml snippet
services:
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - ./promtail-config.yml:/etc/promtail/config.yml
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
logql · production
# LogQL queries in Grafana
{app="iris-api"} |= "error"
{app="iris-api"} | json | confidence < 0.5
sum by (predicted_class) (count_over_time({app="iris-api"} | json [5m]))
Alertmanager
Prometheus alert rule
yaml · production
groups:
  - name: iris-api
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels: { severity: page }
        annotations:
          summary: "Iris API 5xx rate > 5% for 5m"
      - alert: P99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(predict_latency_seconds_bucket[5m]))
          ) > 0.5
        for: 10m
        labels: { severity: warn }
      - alert: ModelDrift
        expr: model_psi_score > 0.25
        for: 1h
        labels: { severity: warn }
        annotations:
          summary: "Feature drift detected; review retrain"
Alert Hygiene
SRE wisdom
- Give every alert a runbook link.
- Severity: page (wake the on-call) vs warn (open a ticket).
- for: 5m stops flapping alerts.
- Alert on symptoms, not causes (user impact is what matters).
- If alert fatigue sets in, tune the thresholds; alerts that get ignored are the bigger danger.
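Why `for:` suppresses flapping can be sketched as below. This is a simplified model of the behaviour, not Prometheus's actual implementation: the function names, the one-minute evaluation interval, and the sample values are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def error_ratio(errors: int, total: int) -> float:
    """Share of failed requests; 0.0 when there was no traffic."""
    return errors / total if total else 0.0


def firing_times(
    samples: Iterable[Tuple[int, float]],
    threshold: float = 0.05,
    hold_seconds: int = 300,
) -> List[int]:
    """Return the timestamps at which the alert is firing.

    samples: (timestamp_seconds, ratio) pairs in time order.
    The condition must hold continuously for hold_seconds before firing;
    any sample at or below the threshold resets the pending timer.
    """
    pending_since = None
    firing: List[int] = []
    for ts, ratio in samples:
        if ratio > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true: pending
            if ts - pending_since >= hold_seconds:
                firing.append(ts)
        else:
            pending_since = None  # condition cleared: reset
    return firing


# Hypothetical evaluations, one per minute.
spike = [(0, 0.10), (60, 0.10), (120, 0.01), (180, 0.02)]
sustained = [(t, 0.08) for t in range(0, 420, 60)]

print(firing_times(spike))
print(firing_times(sustained))
```

The two-minute spike never pages anyone, while the sustained 8% error ratio starts firing once it has held for five minutes.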
ML-Specific Alerts
Alerts only ML systems need
- Feature drift PSI > 0.25.
- Prediction class skew > baseline ± 20%.
- Average confidence drop > 15%.
- Rolling accuracy (when labels are available) < SLO.
- Model file size / hash mismatch on rollout.
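For the first item, a minimal sketch of the Population Stability Index over pre-binned proportions. The bin values are made up, the `eps` floor for empty bins is an assumption, and how features are binned is left to your pipeline; only the 0.25 threshold comes from the list above.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Both inputs are per-bin proportions over the same bins.
    eps floors empty bins so the logarithm stays defined.
    """
    if len(expected) != len(actual):
        raise ValueError("expected and actual must have the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical per-bin proportions for one feature.
baseline = [0.25, 0.25, 0.25, 0.25]  # training-time distribution
stable = [0.24, 0.26, 0.25, 0.25]    # production traffic, little change
shifted = [0.05, 0.15, 0.30, 0.50]   # production traffic, strong shift

for name, dist in [("stable", stable), ("shifted", shifted)]:
    score = psi(baseline, dist)
    print(name, round(score, 4), "ALERT" if score > 0.25 else "ok")
```

Exporting a score like this as a gauge is one way to feed a rule such as `model_psi_score > 0.25`.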
Pitfalls
What breaks logging and alerting
- Logging PII: a GDPR/audit violation.
- Everything at INFO: disks fill up and search slows down.
- Alerts with no owner: nobody looks at them.
- Outdated runbooks: the alert fires and nobody knows what to do.
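One way to guard against the PII pitfall is to redact sensitive fields before an event is rendered. The sketch below is a plain function written with the three-argument shape that structlog processors take (`logger`, `method_name`, `event_dict`), so it could plausibly sit in the processors list ahead of the renderer. The key names in `PII_KEYS` are illustrative assumptions, not a complete PII policy.

```python
from typing import Any, Dict

# Assumption: illustrative key names; list the fields your own schema treats as PII.
PII_KEYS = {"email", "phone", "ip_address"}


def redact_pii(logger: Any, method_name: str, event_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Replace the values of known PII keys with a placeholder."""
    for key in PII_KEYS.intersection(event_dict):
        event_dict[key] = "[REDACTED]"
    return event_dict


# Hypothetical event; a copy is passed so the original stays untouched.
event = {"event": "prediction", "user_id": 42, "email": "someone@example.com"}
clean = redact_pii(None, "info", dict(event))
print(clean)
```

A denylist like this only catches the keys it knows about; logging an explicit allowlist of fields is the stricter alternative.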
Mini Project
Full observability stack
- Add structlog to the Iris API.
- Bring up Loki + Promtail + Grafana with Compose.
- Define the HighErrorRate rule in Prometheus and route it through Alertmanager.
- Test: deliberately throw 500s to make the alert fire.
Phase 7 Complete
What you learned
Monitoring, drift detection, performance tracking, logging + alerting: the eyes, ears, and hands of production ML. Next phase: Advanced MLOps (A/B tests, canary, blue-green, AutoML).