Phase 12 · Chapter 12.03

Production AI Best Practices

Principles you won't find in tutorials, the kind you only learn by getting burned in production.

Reliability

Keep the system from breaking

  • Define SLOs (99.5% uptime, p99 < 200 ms), not vague targets.
  • Graceful degradation: when the model is down, fall back to cached or rule-based responses.
  • Circuit breaker: fail fast when a downstream service is slow.
  • Idempotency keys: avoid duplicate writes on retry.
  • Chaos engineering: test by killing pods in production.
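
The circuit breaker and graceful degradation bullets above can be combined in one small wrapper. What follows is a minimal sketch, not a production library: the class name `CircuitBreaker`, its parameters (`max_failures`, `reset_after`), and the `flaky_model` / `rule_based` functions are all hypothetical names chosen for illustration.

```python
import time


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow a trial call after `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable so the cool-down can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, primary, fallback, *args):
        # While open, skip the primary entirely (fail fast) until the cool-down ends.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback(*args)
            self.opened_at = None   # half-open: let one trial call through
        try:
            result = primary(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback(*args)  # graceful degradation instead of an error
        self.failures = 0
        return result


def flaky_model(text):
    # Hypothetical stand-in for a model endpoint that is currently down.
    raise TimeoutError("model endpoint unavailable")


def rule_based(text):
    # Hypothetical cheap fallback.
    return "neutral"


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
print([breaker.call(flaky_model, rule_based, "hi") for _ in range(3)])
```

After the second failure the circuit opens, so the third call never touches the failing endpoint. A real deployment would also want per-call timeouts and metrics on how often the fallback is served.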
Cost

80% of ML cost optimization

Layer            Technique                              Saving
──────────────────────────────────────────────────────────────────────────────────────
GPU inference    Batching + quantization (FP16/INT8)    5x
Embedding        Cache keyed by hash(input) → vector    60-80% cache hit rate
LLM tokens       Model routing (small → big fallback)   70%
Storage          S3 Intelligent-Tiering                 30-50%
Training jobs    Spot instances + checkpointing         70%
Data egress      Keep traffic in the same region        cross-region transfer is expensive
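
The embedding row of the table can be illustrated with a small in-memory cache. This is a sketch under assumptions: `EmbeddingCache` and `fake_embed` are hypothetical names, the store is a plain dict (a shared store such as Redis would be more typical in production), and lowercasing the input before hashing is only safe if the embedding model is not case-sensitive.

```python
import hashlib


class EmbeddingCache:
    """Cache embeddings under a hash of the normalized input text."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(text):
        # Normalize so trivially different inputs share one entry (assumption: case-insensitive model).
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text):
        k = self.key(text)
        if k in self.store:
            self.hits += 1
            return self.store[k]
        self.misses += 1
        vec = self.embed_fn(text)   # the expensive call happens only on a miss
        self.store[k] = vec
        return vec

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_embed(text):
    # Hypothetical placeholder for a real (paid) embedding call.
    return [float(len(text)), float(text.count(" "))]


cache = EmbeddingCache(fake_embed)
for query in ["reset password", "Reset  Password", "refund policy", "reset password"]:
    cache.get(query)
print(f"hit rate: {cache.hit_rate():.0%}")
```

Tracking `hit_rate()` is what lets you check whether your own traffic actually reaches the hit rates quoted in the table; it depends heavily on how repetitive the inputs are.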
Security

AI-specific threats

  • Prompt injection: user input can override the system prompt. Sanitize input and filter output.
  • PII leak: training data can surface in model output. Use a redaction pipeline and a DLP scanner.
  • Model theft: rate limiting plus query-pattern detection.
  • Adversarial examples: imperceptible perturbations in CV. Use robust training.
  • Supply chain: a Hugging Face model can execute untrusted code. Sandbox it and use signed models.
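
As a rough illustration of the redaction step in the PII bullet above, the sketch below masks emails and phone-like numbers with regular expressions. The patterns and the `redact` function are assumptions made for this example; they are deliberately simple and will miss many real-world formats, so they are not a substitute for a proper DLP scanner.

```python
import re

# Deliberately simple, illustrative patterns (assumption: not exhaustive).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{8,}\d"),
}


def redact(text):
    """Replace each match with a bracketed label such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


print(redact("Contact rahim@example.com or +880 1712-345678 for access."))
# prints: Contact [EMAIL] or [PHONE] for access.
```

The same function can run on both sides of the model: on logs and training data before they are stored, and on model output before it reaches the user.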
Governance

Audit, compliance, fairness

  • Model card: document training data, metrics, intended use, and limitations.
  • Lineage: be able to trace data → feature → model → prediction.
  • Bias audit: group-wise metrics across protected attributes (gender, race, age).
  • Right to explanation: explain individual decisions with SHAP / LIME.
  • Data retention: GDPR/CCPA, honor deletion requests.
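
The bias audit bullet boils down to computing the same metric separately per group and looking at the gap. Below is a minimal sketch using accuracy; the function name `groupwise_accuracy`, the toy labels, and the group names `"a"` / `"b"` are all made up for illustration, and a real audit would usually look at several metrics (for example false positive rate) rather than accuracy alone.

```python
from collections import defaultdict


def groupwise_accuracy(y_true, y_pred, groups):
    """Return accuracy computed separately for each group label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


# Toy data: six predictions, three per (hypothetical) group.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]

per_group = groupwise_accuracy(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"gap={gap:.2f}")
```

Here group "a" gets 2 of 3 right and group "b" gets 1 of 3, so the gap is about 0.33. What gap counts as acceptable is a policy decision, not something the code can answer.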
On-call

What to do on a 3 AM page

1. Acknowledge (5 min): silence the page
2. Triage: how much user-facing impact? what scope?
3. Mitigate: roll back first, root cause later
4. Communicate: status page + Slack incident channel
5. Resolve: confirm the service is healthy
6. Postmortem: blameless, with action items and a timeline

"Fix first, understand later": that is the rule in production.

Team

Soft skills of a senior MLOps engineer

  • RFC culture: for big changes, write it up, get review, then ship.
  • Runbooks: step-by-step docs for common incidents, so a junior on-call engineer can handle them.
  • Mentor: teach during code review; don't just approve.
  • Push back: demand ROI before agreeing to "add AI".
  • Boring tech: prefer a proven stack over hyped tools.
Pitfalls

Mistakes even seniors make

  • "Works on my machine": staging-prod environment parity is broken.
  • Manual deploys: a recipe for disaster at 5 PM on a Friday.
  • Alert fatigue from monitoring: everyone mutes the alerts.
  • No model retraining schedule: silent drift.
  • Documentation written only for code review, not for real users.
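
One way to make the "silent drift" pitfall above less silent is to compare live inputs or scores against the training distribution on a schedule. The sketch below uses the Population Stability Index (PSI), which is a technique chosen for this example rather than something the list prescribes. The function name `psi`, the bin count, and the sample data are assumptions, and the 0.2 alert threshold is only a common rule of thumb.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # avoid zero width when all reference values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp out-of-range values to the edge bins
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical model scores: training reference vs. two live windows.
train_scores = [i / 100 for i in range(100)]
live_same = [i / 100 for i in range(100)]
live_shifted = [min(i / 100 + 0.3, 0.99) for i in range(100)]

print("unchanged window drifted:", psi(train_scores, live_same) > 0.2)
print("shifted window drifted:  ", psi(train_scores, live_shifted) > 0.2)
```

Wiring a check like this into a scheduled job, with an alert that triggers retraining or at least a review, turns drift from something discovered by users into something discovered by monitoring.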
Checklist

Before production launch

  1. SLOs and alerts defined? Is there an on-call schedule?
  2. Rollback strategy tested?
  3. Cost dashboard ready?
  4. Security scans (Snyk, Trivy) clean?
  5. Model card and runbook published?
  6. Load test passed (at 2x peak)?
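
The SLO and load-test items above can be turned into an automated gate: run the load test, collect latencies, and compare p99 against the budget. The sketch below assumes a 200 ms p99 budget and uses the nearest-rank percentile definition; `percentile`, `meets_slo`, and the sample latencies are hypothetical and stand in for whatever your load-testing tool actually reports.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_slo(latencies_ms, p99_budget_ms=200.0):
    """True if the observed p99 latency is under the budget."""
    return percentile(latencies_ms, 99) < p99_budget_ms


# Hypothetical load-test result: 1000 requests, a few slow outliers.
latencies_ms = [50.0] * 980 + [180.0] * 15 + [450.0] * 5
print(percentile(latencies_ms, 99), meets_slo(latencies_ms))
# prints: 180.0 True
```

Only 0.5% of these requests exceed 200 ms, so p99 stays at 180 ms and the gate passes. If 3% were slow, the same check would fail, which is the signal you want before launch rather than after.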
Takeaway

The bottom line

Production AI = boring engineering + careful experimentation. The most reliable system wins, not the fanciest model.
