Phase 8 · Chapter 8.01

A/B Testing for Models

Even when v2 beats v1 on offline accuracy, it can still reduce revenue in production. An A/B test is the scientific judge.

Why

Offline vs online gap

  • Offline data is historical and treatment-free; online, real users react to the model's output.
  • Better accuracy ≠ better CTR / revenue / retention.
  • Only a randomized live experiment establishes causality.
The Recipe

A six-step A/B framework

  1. Hypothesis: "v2 will increase click-through by 5%."
  2. Primary metric: CTR (the one metric that moves the needle).
  3. Guardrail metrics: latency, error rate, revenue.
  4. Sample size: calculate it with a power analysis.
  5. Randomize: hash the user_id so assignment stays sticky across sessions.
  6. Decide: statistical significance plus practical effect.
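Step 6 can be made concrete with a small decision rule. The sketch below is illustrative only: `ExperimentPlan`, `decide`, and the threshold values are hypothetical names and numbers chosen for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical container for steps 1-3 plus the decision thresholds."""
    hypothesis: str
    primary_metric: str
    guardrail_metrics: tuple[str, ...]
    alpha: float = 0.05               # significance threshold
    min_practical_lift: float = 0.01  # smallest relative lift worth shipping (assumed)


def decide(plan: ExperimentPlan, p_value: float, observed_lift: float,
           guardrails_ok: dict[str, bool]) -> str:
    """Combine guardrails, statistical significance, and practical effect."""
    # A guardrail that is missing from the report counts as not OK.
    if not all(guardrails_ok.get(m, False) for m in plan.guardrail_metrics):
        return "stop: guardrail breached"
    if p_value >= plan.alpha:
        return "inconclusive"
    if observed_lift < plan.min_practical_lift:
        return "significant but too small to matter"
    return "ship"


plan = ExperimentPlan(
    hypothesis="v2 will increase click-through by 5%",
    primary_metric="ctr",
    guardrail_metrics=("latency", "error_rate", "revenue"),
)
all_ok = {"latency": True, "error_rate": True, "revenue": True}
print(decide(plan, p_value=0.01, observed_lift=0.04, guardrails_ok=all_ok))
```

Checking guardrails first means a statistically significant win on CTR cannot override a latency or revenue regression.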
Routing

User-level deterministic split

pythonproduction
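import hashlib

# Assumed to be defined elsewhere in the service: `app` (the web app, e.g. FastAPI),
# `Req` (request schema), `log` (structured logger), `model_v1`, and `model_v2`.

def variant_for(user_id: str, exp: str, split: float = 0.5) -> str:
    # Salting with the experiment name decorrelates assignments across experiments.
    h = hashlib.md5(f"{exp}:{user_id}".encode()).hexdigest()
    # Map the first 32 bits to [0, 1); dividing by 2**32 (not 0xffffffff) keeps 1.0 out.
    bucket = int(h[:8], 16) / 2**32
    return "treatment" if bucket < split else "control"

@app.post("/predict")
def predict(req: Req):
    v = variant_for(req.user_id, exp="iris_v2_2026Q2")
    model = model_v2 if v == "treatment" else model_v1
    pred = model.predict(req.features)
    # Log the variant with every prediction so outcomes can be joined per arm.
    log.info("ab", user=req.user_id, variant=v, pred=pred)
    return {"prediction": pred, "variant": v}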
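Before sending live traffic through a hash-based split, it is worth checking that it behaves as intended. The sketch below re-declares `variant_for` so it runs on its own, then checks stickiness and rough balance on synthetic IDs; the `user-{i}` IDs and the experiment name `demo_exp` are made up for the check.

```python
import hashlib
from collections import Counter


def variant_for(user_id: str, exp: str, split: float = 0.5) -> str:
    digest = hashlib.md5(f"{exp}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # first 32 bits mapped to [0, 1)
    return "treatment" if bucket < split else "control"


users = [f"user-{i}" for i in range(10_000)]  # synthetic IDs

# Stickiness: the same user and experiment always map to the same arm.
sticky = all(
    variant_for(u, "demo_exp") == variant_for(u, "demo_exp") for u in users[:100]
)

# Balance: the observed share should be close to the requested split.
counts = Counter(variant_for(u, "demo_exp") for u in users)
treatment_share = counts["treatment"] / len(users)

print(f"sticky={sticky}  treatment_share={treatment_share:.3f}")
```

Because assignment depends only on the experiment name and the user ID, no assignment table has to be stored, and any service can recompute a user's arm.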
Sample Size

Power analysis basics

pythonproduction
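from math import ceil

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# detect a 2-percentage-point absolute CTR lift, baseline 10%, power 0.8, alpha 0.05
baseline = 0.10
mde = 0.02
# Cohen's h: the standardized effect size for comparing two proportions
effect = proportion_effectsize(baseline + mde, baseline)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1, alternative="two-sided"
)
# round up: truncating would leave the test slightly underpowered
print(f"≈ {ceil(n_per_arm)} users per arm")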
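As a cross-check, the textbook normal-approximation formula for two proportions, n = (z_{1-α/2} + z_{power})² · [p1(1-p1) + p2(1-p2)] / δ², should land in the same ballpark as the statsmodels result. The sketch below is one way to write it; the helper name `n_per_arm_closed_form` is hypothetical.

```python
from math import ceil

from scipy.stats import norm


def n_per_arm_closed_form(baseline: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users per arm for a two-sided test on the difference of two proportions."""
    p1, p2 = baseline, baseline + mde
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / mde ** 2)


n_2pt = n_per_arm_closed_form(0.10, 0.02)  # 2-point absolute lift
n_1pt = n_per_arm_closed_form(0.10, 0.01)  # 1-point absolute lift
print(n_2pt, n_1pt, round(n_1pt / n_2pt, 2))
```

The required sample grows with 1/δ², so halving the detectable lift roughly quadruples the users needed per arm.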
Analysis

Frequentist test (chi-square)

pythonproduction
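from scipy.stats import chi2_contingency

# clicks_a / n_a: clicks and users in one arm; clicks_b / n_b: the same for the other.
# They are assumed to come from the logged experiment data; here a is taken as
# control and b as treatment.

# 2x2 table: variant × (click / no-click)
table = [[clicks_a, n_a - clicks_a],
         [clicks_b, n_b - clicks_b]]

# SciPy applies Yates' continuity correction to 2x2 tables by default
chi2, p, _, _ = chi2_contingency(table)
# relative lift of b over a
lift = (clicks_b / n_b) / (clicks_a / n_a) - 1
print(f"lift={lift:.2%}  p={p:.4f}")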
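A p-value says whether an effect is distinguishable from zero, not how large it plausibly is. A confidence interval on the absolute CTR difference covers the "practical effect" half of the decision. The sketch below uses a simple Wald interval; the helper `diff_ci` and the counts are made up for illustration.

```python
from math import sqrt

from scipy.stats import norm


def diff_ci(clicks_a: int, n_a: int, clicks_b: int, n_b: int,
            alpha: float = 0.05) -> tuple[float, float]:
    """Wald confidence interval for CTR_b - CTR_a (normal approximation)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = norm.ppf(1 - alpha / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


# Made-up counts, for illustration only
lo, hi = diff_ci(clicks_a=100, n_a=1000, clicks_b=130, n_b=1000)
print(f"absolute CTR difference, 95% CI: [{lo:.4f}, {hi:.4f}]")
```

If the whole interval sits above the smallest lift worth shipping, the result is both statistically and practically significant; if it straddles that threshold, more data is needed.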
Beyond Basics

Advanced patterns

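  • CUPED: variance reduction using pre-experiment data. Variance shrinks by a factor of (1 − ρ²), where ρ is the correlation between the pre-experiment covariate and the metric, so with ρ ≈ 0.7 the same power needs roughly 50% of the sample.
  • Multi-armed bandit: automatically shifts traffic between exploration and exploitation (e.g. Thompson sampling).
  • Sequential testing: safe under peeking (mSPRT, always-valid p-values).
  • Holdout: always keep 5% of users in control to observe long-term effects.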
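The CUPED adjustment from the list above can be sketched in a few lines of NumPy. The data here is synthetic and built so that the pre-experiment metric `x` correlates with the experiment metric `y` at about 0.7; in a real experiment `x` must be measured before treatment starts so it cannot be affected by the variant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Synthetic data (assumption): y shares a component with x, so corr(x, y) ≈ 0.71
x = rng.normal(size=n)        # pre-experiment metric per user
y = x + rng.normal(size=n)    # experiment-period metric per user

# theta is the regression slope of y on x
theta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Subtract the part of y explained by x; centering x keeps the mean unchanged
y_cuped = y - theta * (x - x.mean())

rho = np.corrcoef(x, y)[0, 1]
ratio = np.var(y_cuped, ddof=1) / np.var(y, ddof=1)
print(f"rho={rho:.3f}  variance ratio={ratio:.3f}  1 - rho^2={1 - rho**2:.3f}")
```

The adjusted metric keeps the same mean but has a variance of about (1 − ρ²) times the original, which is where the sample-size saving comes from.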
Pitfalls

A/B testing traps

  • Peeking: deciding from a daily look at the p-value inflates the false-positive rate.
  • Sample ratio mismatch (SRM): a statistically significant deviation from the intended split (e.g. 48/52 on a 50/50 split at large n) signals a bug, and the analysis is invalid.
  • Network effect: in social or marketplace products, the behavior of arm A influences arm B.
  • Novelty effect: a new UI gets a boost in the first week, then normalizes.
  • Underpowered test: "no significant difference" ≠ "no difference".
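An SRM check is a chi-square goodness-of-fit test of the observed arm sizes against the intended split. The sketch below uses `scipy.stats.chisquare`; the helper name `srm_p_value` and the counts are illustrative.

```python
from scipy.stats import chisquare


def srm_p_value(n_control: int, n_treatment: int,
                treatment_share: float = 0.5) -> float:
    """P-value for 'observed arm sizes match the intended split'."""
    total = n_control + n_treatment
    expected = [total * (1 - treatment_share), total * treatment_share]
    return chisquare([n_control, n_treatment], f_exp=expected).pvalue


# Same 48/52 ratio, very different evidence (made-up counts)
print(f"n=100:    p={srm_p_value(48, 52):.3f}")
print(f"n=10000:  p={srm_p_value(4_800, 5_200):.6f}")
```

With 100 users a 48/52 split is ordinary noise, while with 10,000 users it is very unlikely under a true 50/50 assignment. SRM alarms usually use a strict threshold such as p < 0.001 so that they fire only on real assignment or logging bugs.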
Mini Project

Synthetic A/B

  1. Add v1/v2 split logic to the Iris API.
  2. Generate click events for 1,000 simulated users.
  3. Run a chi-square test and report the p-value.
  4. Calculate the sample size required to detect a 1% lift.
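A possible starting point for steps 2 and 3 is sketched below. The true CTRs (10% and 12%) and the random seed are assumptions made purely for the simulation, and arms are assigned at random here rather than through the API's hash split.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_users = 1_000
true_ctr = {"control": 0.10, "treatment": 0.12}  # assumed, for simulation only

# Step 2: assign each simulated user to an arm, then draw a click per user
in_treatment = rng.random(n_users) < 0.5
click_prob = np.where(in_treatment, true_ctr["treatment"], true_ctr["control"])
clicked = rng.random(n_users) < click_prob

n_b = int(in_treatment.sum())
n_a = n_users - n_b
clicks_b = int(clicked[in_treatment].sum())
clicks_a = int(clicked[~in_treatment].sum())

# Step 3: chi-square test on the 2x2 table (variant × click / no-click)
table = [[clicks_a, n_a - clicks_a],
         [clicks_b, n_b - clicks_b]]
chi2, p, _, _ = chi2_contingency(table)
print(f"control {clicks_a}/{n_a}  treatment {clicks_b}/{n_b}  p={p:.4f}")
```

With only 1,000 users, a true 2-point lift will often fail to reach significance. Rerunning with different seeds makes the underpowered-test pitfall visible and motivates step 4.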
Takeaway

Remember

A/B = hypothesis × power × guardrail. Skip any one of the three and the whole test is meaningless.