Phase 12 · Chapter 12.02

System Design Interviews for AI

Designing an AI system end to end in 45-60 minutes is the highest-signal round of the interview loop. Attempting it without a framework usually means failing.

Framework

6-step ML system design

1. Clarify       — problem, scale, constraint (5 min)
2. Metric        — offline + online, business goal (5 min)
3. Data          — source, labeling, freshness (5 min)
4. Model         — baseline → advanced, trade-off (10 min)
5. System        — train pipeline + serving + monitoring (15 min)
6. Scale + risk  — bottleneck, failure mode, future (10 min)

Step 1: Clarify

Questions to ask the interviewer

QPS? Daily active users? Geographic scope?
Latency requirement (p99)?
Personalization level (per-user vs cohort)?
Cold start expected? New users / new items?
Privacy / regulatory constraint (GDPR, HIPAA)?
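
Once the interviewer gives a daily-active-user figure, it helps to turn it into QPS right away. The following is a minimal back-of-envelope sketch; the function name `estimate_qps`, the example traffic numbers, and the peak factor of 3 are illustrative assumptions, not values from any real system.

```python
def estimate_qps(daily_active_users, requests_per_user_per_day, peak_factor=3.0):
    """Return (average_qps, peak_qps) from daily traffic assumptions.

    peak_factor is an assumed ratio of peak to average load; ask the
    interviewer for a real value if one is available.
    """
    seconds_per_day = 24 * 60 * 60  # 86,400
    average_qps = daily_active_users * requests_per_user_per_day / seconds_per_day
    return average_qps, average_qps * peak_factor


# Hypothetical scenario: 10M DAU, 20 requests per user per day.
avg, peak = estimate_qps(10_000_000, 20)
print(f"average ~ {avg:,.0f} QPS, peak ~ {peak:,.0f} QPS")
# prints: average ~ 2,315 QPS, peak ~ 6,944 QPS
```

Stating the peak figure out loud gives every later sizing decision a number to refer back to.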

Step 2: Metric

Offline vs Online: state both

Recsys: offline NDCG@10, online CTR + dwell time.
Search: offline MRR, online click-through, zero-result rate.
Fraud: offline PR-AUC, online $ saved / false-positive cost.
LLM: offline BLEU/eval suite, online thumbs-up rate.
Always tie to business: revenue, retention, cost।
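
Being able to define the offline ranking metrics precisely is useful when the interviewer probes. Below is a minimal sketch of NDCG@k and MRR in plain Python. It assumes linear gain (the relevance grade itself, not `2**rel - 1`) and that the ranked list passed in contains every judged item, so the ideal ordering can be derived by sorting it. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(ranked_hits):
    """ranked_hits: one list of 0/1 relevance flags per query, in ranked order."""
    if not ranked_hits:
        return 0.0
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits, start=1):
            if hit:
                total += 1.0 / rank  # only the first relevant result counts
                break
    return total / len(ranked_hits)


print(ndcg_at_k([3, 2, 1]))                                   # perfect order -> 1.0
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0], [0, 0, 0]]))  # (1/2 + 1 + 0) / 3 -> 0.5
```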

Step 5: Architecture

Reference diagram — always draw this

┌────────────┐   ┌──────────────┐   ┌──────────────┐
│  Client    │──▶│ API Gateway  │──▶│  Inference   │
└────────────┘   │ (auth, rate) │   │  Service     │
                 └──────────────┘   │  (GPU pool)  │
                                    └──────┬───────┘
                                           │
                  ┌────────────────────────┼───────────────┐
                  ▼                        ▼               ▼
            ┌──────────┐          ┌──────────────┐  ┌──────────┐
            │ Feature  │          │ Model Store  │  │ Cache    │
            │ Store    │          │ (MLflow/S3)  │  │ (Redis)  │
            └─────┬────┘          └──────┬───────┘  └──────────┘
                  │                      │
       ┌──────────┴──────────┐    ┌──────┴──────┐
       │ Online (Redis)      │    │ CI/CD       │
       │ Offline (Parquet/BQ)│    │ retrain DAG │
       └─────────────────────┘    │ (Airflow)   │
                                  └─────────────┘
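
The request path in the diagram (cache, then feature store, then model) can be sketched as a cache-aside lookup. This is a toy illustration only: plain dicts stand in for Redis and the online feature store, a lambda stands in for a model loaded from the model store, and the class name `InferenceService` and the cold-start default of `0.0` are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class InferenceService:
    feature_store: Dict[str, List[float]]      # stand-in for the online feature store
    model: Callable[[List[float]], float]      # stand-in for a model from the model store
    cache: Dict[str, float] = field(default_factory=dict)  # stand-in for Redis

    def predict(self, user_id: str) -> Tuple[float, str]:
        """Return (score, source) where source says which path served the request."""
        if user_id in self.cache:              # 1. cache hit: skip features and model
            return self.cache[user_id], "cache"
        features = self.feature_store.get(user_id)
        if features is None:                   # 2. cold start: no features for this user
            return 0.0, "fallback"
        score = self.model(features)           # 3. cache miss: run the model
        self.cache[user_id] = score            #    and store the result for next time
        return score, "model"


service = InferenceService(
    feature_store={"u1": [0.25, 0.75]},
    model=lambda feats: sum(feats) / len(feats),
)
print(service.predict("u1"))  # (0.5, 'model')
print(service.predict("u1"))  # (0.5, 'cache')
print(service.predict("u2"))  # (0.0, 'fallback')
```

A real cache would also need a TTL or invalidation on model rollout, which is worth mentioning as a trade-off when drawing the diagram.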

Common Questions

Top 10 most-asked questions

Design YouTube recommendation system.
Design Twitter/X timeline ranker.
Design Uber ETA prediction.
Design credit card fraud detection.
Design ChatGPT serving infrastructure.
Design Google search autocomplete.
Design Amazon "people also bought".
Design Instagram explore page.
Design ad click-through prediction.
Design self-driving perception pipeline.

Scoring Rubric

What the interviewer scores

Clarification (10%): did you clarify assumptions up front?
Metric (15%): do you understand the offline-online gap?
Data (15%): did you handle labeling, freshness, and leakage?
Model (15%): baseline → advanced, with trade-offs.
System (25%): train + serve + monitor wiring.
Scale + risk (20%): bottleneck identification + mitigation.

Pitfalls

What makes candidates fail the interview

Starting to write a solution without clarifying first.
Starting with the fanciest model and never mentioning a baseline.
Skipping monitoring and retraining, as if deploying the model were the end.
Making capacity claims without numbers (just saying "scalable" is not enough).
Imposing decisions without stating the trade-offs.
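
One way to back a capacity claim with a number is to estimate the replica count from peak QPS and per-batch latency. The sketch below is a rough model under loud assumptions: each replica processes one batch at a time, latency is constant per batch, and replicas are run at a target utilization below 100% to leave headroom. The function name `replicas_needed` and all example values are hypothetical.

```python
import math


def replicas_needed(peak_qps, latency_s, batch_size=1, target_utilization=0.7):
    """Estimate how many serving replicas are needed to absorb peak_qps.

    Assumes one batch in flight per replica, so a replica's maximum
    throughput is batch_size / latency_s requests per second.
    """
    per_replica_qps = batch_size / latency_s
    usable_qps = per_replica_qps * target_utilization
    return math.ceil(peak_qps / usable_qps)


# Hypothetical numbers: 7,000 peak QPS, 50 ms per batch of 8, 70% utilization.
print(replicas_needed(7_000, 0.05, batch_size=8))  # 63
```

In an interview, the point is less the exact figure than showing which inputs drive it and how batching or lower latency would change the answer.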

Practice Plan

Interview-ready in 4 weeks

Week 1: memorize the framework and write out 3 mock designs on your own.
Week 2: read reference solutions for the top 10 questions (Alex Xu, Chip Huyen).
Week 3: do mock interviews with a peer; record and review them.
Week 4: take actual interviews and write notes after each one.

Takeaway

The bottom line

System design is a communication test. With a framework, explicit trade-offs, and concrete numbers, you can clear the interview; a perfect solution is not required.
