Production AI Best Practices

Tutorial-এ পাওয়া যায় না — এমন principle যা শুধু production-এ পুড়ে শেখা যায়।

Reliability

System যেন না ভাঙে

SLO define করো (99.5% uptime, p99 < 200ms) — vague target নয়।
Graceful degradation — model down হলে cached / rule-based fallback।
Circuit breaker — downstream slow হলে fail fast।
Idempotency key — retry-এ duplicate write এড়াও।
Chaos engineering — production-এ pod kill করে test।

Cost

ML cost-এর 80% optimization

textproduction

Layer            Saving Technique
─────────────────────────────────────────────────────
GPU inference    Batching + quantization (FP16/INT8) → 5x
Embedding        Cache hash(input) → vec               → 60-80% hit
LLM token        Model routing (small → big fallback)  → 70%
Storage          S3 Intelligent-Tiering                → 30-50%
Train job        Spot instance + checkpoint            → 70%
Data egress      Same-region keep                      → expensive cross

Security

AI-specific threat

Prompt injection: user input system prompt override করতে পারে। Input sanitize + output filter।
PII leak: training data মুখে আসে। Redact pipeline + DLP scanner।
Model theft: rate limit + query pattern detection।
Adversarial: CV-তে imperceptible perturbation। Robust training।
Supply chain: HF model untrusted code execute। Sandbox + signed model।

Governance

Audit, compliance, fairness

Model card: training data, metric, intended use, limitation document।
Lineage: data → feature → model → prediction trace করতে পারো।
Bias audit: protected attribute (gender, race, age) group-wise metric।
Right-to-explanation: SHAP / LIME দিয়ে individual decision explain।
Data retention: GDPR/CCPA — delete request honor।

On-call

3 AM page-এ যা করতে হয়

textproduction

1. Acknowledge (5 min) — page-কে quiet করো
2. Triage         — user-facing impact কতটুকু? scope?
3. Mitigate       — rollback আগে, root cause পরে
4. Communicate    — status page + Slack incident channel
5. Resolve        — service healthy confirm
6. Postmortem     — blameless, action item, timeline

"Fix first, understand later" — production-এ এটাই rule।

Team

Senior MLOps-এর soft skill

RFC culture: বড় change-এ লেখো, review পাও, then ship।
Runbook: common incident-এর step-by-step doc — junior on-call পারবে।
Mentor: code review সময় শেখাও, just approve করো না।
Push back: "AI add করো" এর আগে ROI demand করো।
Boring tech: hype tool না, proven stack পছন্দ করো।

Pitfalls

Senior-রাও যা ভুল করে

"Works on my machine" — staging-prod environment parity ভাঙা।
Manual deploy — Friday 5 PM-এ disaster recipe।
Monitoring alert fatigue — সবাই mute করে দেয়।
Model retrain schedule নেই — silent drift।
Documentation শুধু code review-এর জন্য, real user-এর জন্য না।

Checklist

Production launch-এর আগে

SLO + alert defined? on-call schedule আছে?
Rollback strategy tested?
Cost dashboard ready?
Security scan (Snyk, Trivy) clean?
Model card + runbook published?
Load test passed (2x peak)?

Takeaway

মূল কথা

Production AI = boring engineering + careful experimentation। সবচেয়ে fancy model না, সবচেয়ে reliable system জিতে।

← Roadmap-এ ফিরুন

পরবর্তী: Open Source Contributionশীঘ্রই