Phase 8 · Chapter 8.01
A/B Testing for Models
v2 can beat v1 on offline accuracy and still reduce revenue in production. An A/B test is the scientific judge.
Why
Offline vs online gap
- Offline data is historical and was collected without the treatment; online, real users react to what the model shows them.
- Better accuracy ≠ better CTR / revenue / retention.
- Only a randomized live experiment establishes causality.
The Recipe
A 6-step A/B framework
- Hypothesis: "v2 will raise click-through by 5%."
- Primary metric: CTR (the one metric you intend to move).
- Guardrail metrics: latency, error rate, revenue (these must not regress).
- Sample size: calculate it with a power analysis.
- Randomize: hash the user_id so assignment stays sticky across sessions.
- Decide: require both statistical significance and a practically meaningful effect.
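The "Decide" step can be sketched as a small rule that combines the three checks. This is only an illustration: the `ABResult` fields, the `decide` function, the thresholds, and the "ship" / "hold" / "reject" labels are hypothetical names chosen here, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class ABResult:
    p_value: float        # from the primary-metric test
    relative_lift: float  # e.g. 0.03 means +3% on the primary metric
    guardrails_ok: bool   # True if latency / error rate / revenue did not regress


def decide(result: ABResult, alpha: float = 0.05, min_lift: float = 0.01) -> str:
    """Return 'ship', 'hold', or 'reject' (hypothetical labels)."""
    # A guardrail regression blocks the launch regardless of the primary metric.
    if not result.guardrails_ok:
        return "reject"
    significant = result.p_value < alpha
    practical = result.relative_lift >= min_lift
    # Ship only when the effect is both statistically and practically meaningful.
    if significant and practical:
        return "ship"
    return "hold"


print(decide(ABResult(p_value=0.01, relative_lift=0.03, guardrails_ok=True)))
```

In practice `alpha` and `min_lift` should be fixed before the experiment starts, alongside the hypothesis.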
Routing
User-level deterministic split
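python · production
import hashlib

def variant_for(user_id: str, exp: str, split: float = 0.5) -> str:
    # Hash experiment name + user id: the same user always lands in the same arm,
    # and different experiments get independent splits.
    h = hashlib.md5(f"{exp}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 2**32  # uniform in [0, 1); never exactly 1.0
    return "treatment" if bucket < split else "control"

# app, Req, model_v1, model_v2 and log are assumed to be defined elsewhere in the service.
@app.post("/predict")
def predict(req: Req):
    v = variant_for(req.user_id, exp="iris_v2_2026Q2")
    model = model_v2 if v == "treatment" else model_v1
    pred = model.predict(req.features)
    # Log the assignment with every prediction so outcomes can be joined by variant later.
    log.info("ab", user=req.user_id, variant=v, pred=pred)
    return {"prediction": pred, "variant": v}
Sample Size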
Power analysis basics
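python · production
from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Detect a 2-point absolute CTR lift (10% -> 12%) with power 0.8 at alpha 0.05.
baseline = 0.10
mde = 0.02  # minimum detectable effect, absolute
# Cohen's h: the standardized effect size for comparing two proportions.
effect = proportion_effectsize(baseline + mde, baseline)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1, alternative="two-sided"
)
print(f"≈ {ceil(n_per_arm)} users per arm")  # round up so the target power is met
Analysis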
Frequentist test (chi-square)
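python · production
from scipy.stats import chi2_contingency

# Illustrative counts only; replace with the totals from your experiment logs.
clicks_a, n_a = 1_000, 10_000  # control: clicks, users
clicks_b, n_b = 1_100, 10_000  # treatment: clicks, users
# 2x2 table: variant × (click / no-click)
table = [[clicks_a, n_a - clicks_a],
         [clicks_b, n_b - clicks_b]]
# For 2x2 tables scipy applies Yates' continuity correction by default.
chi2, p, _, _ = chi2_contingency(table)
lift = (clicks_b / n_b) / (clicks_a / n_a) - 1  # relative lift of B over A
print(f"lift={lift:.2%} p={p:.4f}")
Beyond Basics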
Advanced patterns
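- CUPED: variance reduction using pre-experiment data; at the same power it can cut the required sample by roughly 50%, depending on how strongly the pre-experiment covariate correlates with the metric.
- Multi-armed bandit: shifts traffic automatically between exploration and exploitation (e.g. Thompson sampling).
- Sequential testing: safe against peeking (mSPRT, always-valid p-values).
- Holdout: keep 5% of users permanently in control to observe long-term effects.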
Pitfalls
A/B testing traps
- Peeking: deciding from daily p-value checks inflates the false-positive rate.
- Sample ratio mismatch (SRM): if a planned 50/50 split comes out 48/52 at a realistic sample size, there is a bug and the analysis is invalid.
- Network effect: in social/marketplace products, users in A influence the behavior of users in B.
- Novelty effect: a new UI gets a boost in the first week, then normalizes.
- Underpowered test: "no significant difference" ≠ "no difference".
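An SRM check is a chi-square goodness-of-fit test on the arm sizes. The sketch below uses only the standard library. `srm_p_value` is a hypothetical helper name, and the 0.001 alert threshold in the usage lines is a common convention assumed here, not a rule from this chapter.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = n_control + n_treatment
    exp_t = total * expected_treatment_share
    exp_c = total - exp_t
    chi2 = (n_control - exp_c) ** 2 / exp_c + (n_treatment - exp_t) ** 2 / exp_t
    # Survival function of chi-square with 1 df: P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# The same 48/52 ratio is unremarkable at 100 users but alarming at 10,000.
for n_c, n_t in [(48, 52), (4_800, 5_200)]:
    p = srm_p_value(n_c, n_t)
    print(n_c, n_t, f"p={p:.6f}", "SRM suspected" if p < 0.001 else "ok")
```

Run the check before looking at any metric: if it fires, debug assignment and logging first, because the outcome comparison cannot be trusted.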
Mini Project
Synthetic A/B
- Add v1/v2 split logic to the Iris API.
- Generate click events for 1000 simulated users.
- Run a chi-square test and report the p-value.
- Calculate the sample size required to detect a 1% lift.
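One possible starting point for the simulation step, assuming a hash-based split like the one in the Routing section. The experiment name `iris_v2_sim`, the user id format, and the "true" click probabilities are made up for the simulation. The chi-square test and the sample-size calculation are left as the exercise.

```python
import hashlib
import random


def variant_for(user_id: str, exp: str, split: float = 0.5) -> str:
    # Deterministic, sticky assignment from a hash of experiment name + user id.
    h = hashlib.md5(f"{exp}:{user_id}".encode()).hexdigest()
    return "treatment" if int(h[:8], 16) / 2**32 < split else "control"


# Assumed click probabilities, used only to generate synthetic events.
TRUE_CTR = {"control": 0.10, "treatment": 0.12}

rng = random.Random(42)  # fixed seed so the simulation is repeatable
counts = {"control": [0, 0], "treatment": [0, 0]}  # [clicks, users] per arm

for i in range(1000):
    arm = variant_for(f"user_{i}", exp="iris_v2_sim")
    counts[arm][1] += 1
    # One click / no-click event per simulated user.
    counts[arm][0] += int(rng.random() < TRUE_CTR[arm])

clicks_a, n_a = counts["control"]
clicks_b, n_b = counts["treatment"]
print(f"control: {clicks_a}/{n_a}  treatment: {clicks_b}/{n_b}")
```

The four resulting counts plug directly into the 2x2 table used in the Analysis section. With only 1000 users the test is likely underpowered for a 2-point lift, which is part of what the project is meant to show.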
Takeaway
Remember
A/B = hypothesis × power × guardrail. Skip any one of the three and the whole test is meaningless.