A/B Testing for Models

v2 can beat v1 on offline accuracy and still reduce revenue in production. An A/B test is the scientific judge.

Why

Offline vs online gap

Offline data is historical and was collected without the treatment; online, real users react to what the model shows them.
Better accuracy ≠ better CTR / revenue / retention.
Only a randomized live experiment establishes causality.

The Recipe

A 6-step A/B framework

Hypothesis: "v2 will increase click-through by 5%."
Primary metric: CTR (the one metric that has to move).
Guardrail metrics: latency, error rate, revenue.
Sample size: calculate it with a power analysis.
Randomize: hash the user_id so assignment stays sticky across sessions.
Decide: statistical significance plus practical effect size.
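
The six steps can be written down before the experiment starts so the decision rule is not changed after seeing data. The sketch below is one possible shape, not a standard API: `ExperimentPlan`, `decide`, the guardrail limits and the `min_practical_lift` threshold are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical container for the pre-registered plan (steps 1 to 5)."""
    name: str
    hypothesis: str
    primary_metric: str
    # guardrail metric -> largest tolerated relative regression (assumed limits)
    guardrails: Dict[str, float] = field(default_factory=dict)
    alpha: float = 0.05
    power: float = 0.8
    min_practical_lift: float = 0.0  # smallest relative lift worth shipping


def decide(plan: ExperimentPlan, p_value: float, relative_lift: float,
           guardrail_regressions: Dict[str, float]) -> str:
    """Step 6: guardrails first, then significance, then practical effect."""
    breached = [m for m, limit in plan.guardrails.items()
                if guardrail_regressions.get(m, 0.0) > limit]
    if breached:
        return "stop: guardrail breached (" + ", ".join(breached) + ")"
    if p_value >= plan.alpha:
        return "no ship: not statistically significant"
    if relative_lift < plan.min_practical_lift:
        return "no ship: significant but below the practical threshold"
    return "ship"


plan = ExperimentPlan(
    name="iris_v2_2026Q2",
    hypothesis="v2 will increase click-through by 5%",
    primary_metric="ctr",
    guardrails={"latency_p95": 0.10, "error_rate": 0.0},
    min_practical_lift=0.02,
)
print(decide(plan, p_value=0.01, relative_lift=0.05,
             guardrail_regressions={"latency_p95": 0.02}))
```

Checking guardrails before the primary metric is a design choice: a significant CTR win should not override a latency or error-rate regression.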

Routing

User-level deterministic split

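import hashlib

def variant_for(user_id: str, exp: str, split: float = 0.5) -> str:
    """Deterministically map a user to an arm; same user + experiment gives the same arm."""
    h = hashlib.md5(f"{exp}:{user_id}".encode()).hexdigest()
    # First 32 bits of the hash, scaled to [0, 1). Dividing by 2**32 (not 0xffffffff)
    # keeps bucket strictly below 1, so split=1.0 sends everyone to treatment.
    bucket = int(h[:8], 16) / 2**32
    return "treatment" if bucket < split else "control"

# app, Req, model_v1, model_v2 and log are assumed to be defined elsewhere
# (e.g. a FastAPI app, a request schema, two loaded models, a structlog-style logger).
@app.post("/predict")
def predict(req: Req):
    v = variant_for(req.user_id, exp="iris_v2_2026Q2")
    model = model_v2 if v == "treatment" else model_v1
    pred = model.predict(req.features)
    # Log the assignment with every prediction so outcomes can be joined per arm later.
    log.info("ab", user=req.user_id, variant=v, pred=pred)
    return {"prediction": pred, "variant": v}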
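
Before trusting the router, it helps to check two properties: the assignment is sticky, and the split is close to the target. The sketch below repeats `variant_for` so it runs standalone (with the bucket normalized by 2**32); the synthetic `user-<i>` ids and the second experiment name `other_exp` are made up for illustration.

```python
import hashlib
from collections import Counter


def variant_for(user_id: str, exp: str, split: float = 0.5) -> str:
    h = hashlib.md5(f"{exp}:{user_id}".encode()).hexdigest()
    bucket = int(h[:8], 16) / 2**32  # in [0, 1)
    return "treatment" if bucket < split else "control"


users = [f"user-{i}" for i in range(10_000)]  # synthetic ids

# Sticky: calling twice gives identical assignments.
first = [variant_for(u, "iris_v2_2026Q2") for u in users]
second = [variant_for(u, "iris_v2_2026Q2") for u in users]
sticky = first == second

# Balanced: the treatment share should sit near the 0.5 target.
treatment_share = Counter(first)["treatment"] / len(users)

# Salting with the experiment name reshuffles users between experiments,
# so the same people are not always in treatment.
other = [variant_for(u, "other_exp") for u in users]
reshuffled = sum(a != b for a, b in zip(first, other)) / len(users)

print(sticky, round(treatment_share, 3), round(reshuffled, 3))
```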

Sample Size

Power analysis basics

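import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Detect a 2-point absolute CTR lift (10% -> 12%) with power 0.8 at alpha 0.05.
baseline = 0.10
mde = 0.02  # minimum detectable effect, absolute
# Cohen's h for two proportions (arcsine transform), the effect size NormalIndPower expects.
effect = proportion_effectsize(baseline + mde, baseline)

n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1, alternative="two-sided"
)
print(f"≈ {math.ceil(n_per_arm)} users per arm")  # round up, never down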
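
As a cross-check that needs only the standard library, the usual normal-approximation formula for two proportions can be evaluated by hand: n per arm = (z_{1-α/2} + z_{power})² · (p₁(1-p₁) + p₂(1-p₂)) / (p₂ - p₁)². The helper name `n_per_arm_two_proportions` is hypothetical, and this unpooled-variance form is one common variant, so expect it to differ slightly from other calculators.

```python
import math
from statistics import NormalDist


def n_per_arm_two_proportions(p_control: float, p_treatment: float,
                              alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate users per arm for a two-sided test of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = (p_control * (1 - p_control)
                    + p_treatment * (1 - p_treatment))
    delta = p_treatment - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance_sum / delta ** 2)


# 10% -> 12% (2 points absolute): a few thousand users per arm.
n_abs = n_per_arm_two_proportions(0.10, 0.12)

# 10% -> 10.1% (1% relative): the requirement grows with 1 / delta**2.
n_rel = n_per_arm_two_proportions(0.10, 0.101)

print(n_abs, n_rel)
```

The second call shows why the minimum detectable effect must be fixed up front: shrinking the effect by a factor of 20 multiplies the required sample by roughly 400.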

Analysis

Frequentist test (chi-square)

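from scipy.stats import chi2_contingency

# Illustrative placeholder counts; replace with per-arm aggregates from the experiment logs.
clicks_a, n_a = 400, 4000   # control: clicks, users
clicks_b, n_b = 480, 4000   # treatment: clicks, users

# 2x2 table: variant × (click / no-click)
table = [[clicks_a, n_a - clicks_a],
         [clicks_b, n_b - clicks_b]]

# For 2x2 tables SciPy applies Yates' continuity correction by default (correction=True).
chi2, p, _, _ = chi2_contingency(table)
# Relative lift of treatment (b) over control (a).
lift = (clicks_b / n_b) / (clicks_a / n_a) - 1
print(f"lift={lift:.2%}  p={p:.4f}")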
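
A p-value alone does not say whether the effect is large enough to matter. One way to judge practical effect is a confidence interval on the absolute CTR difference. The sketch below uses a plain Wald interval with the standard library; `diff_ci` is a hypothetical helper and the counts are illustrative placeholders, not real results.

```python
import math
from statistics import NormalDist


def diff_ci(clicks_a: int, n_a: int, clicks_b: int, n_b: int,
            confidence: float = 0.95):
    """Wald confidence interval for p_b - p_a (treatment minus control)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    diff = p_b - p_a
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Placeholder counts: 400/4000 clicks in control, 480/4000 in treatment.
diff, low, high = diff_ci(400, 4000, 480, 4000)
print(f"diff={diff:.4f}  95% CI=({low:.4f}, {high:.4f})")
```

If the whole interval sits above the smallest lift worth shipping, the result is both significant and practically relevant; if it only barely excludes zero, the effect may be real but too small to matter.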

Beyond Basics

Advanced patterns

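CUPED: variance reduction using pre-experiment data. Variance falls by a factor of 1 − ρ², where ρ is the correlation between the pre-period covariate and the metric, so a strongly correlated covariate can roughly halve the sample needed for the same power.
Multi-armed bandit: automatically shifts traffic toward the better arm while balancing exploration and exploitation (e.g. Thompson sampling).
Sequential testing: safe under peeking (mSPRT, always-valid p-values).
Holdout: keep 5% of users permanently on control to observe long-term effects.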
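
To make the CUPED item concrete, here is a minimal sketch on simulated data. The adjustment is y_adj = y − θ·(x − mean(x)) with θ = cov(x, y) / var(x), where x is the same user's pre-experiment metric. The function name `cuped_adjust` and the simulated distributions are assumptions for illustration; in a real experiment θ is typically estimated on both arms pooled.

```python
import random
from statistics import mean, variance


def cuped_adjust(y, x):
    """Return y adjusted by the pre-experiment covariate x (same length)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar)
                 for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


rng = random.Random(0)
# Simulated per-user metric before the experiment ...
pre = [rng.gauss(10, 3) for _ in range(5000)]
# ... and during it, correlated with the pre-period value.
post = [0.8 * p + rng.gauss(0, 2) for p in pre]

adjusted = cuped_adjust(post, pre)
reduction = 1 - variance(adjusted) / variance(post)

# The mean is unchanged; only the variance shrinks (by about rho**2).
print(round(mean(post), 3), round(mean(adjusted), 3), round(reduction, 2))
```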

Pitfalls

A/B traps

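Peeking: checking the p-value daily and deciding on it inflates the false-positive rate.
Sample ratio mismatch (SRM): if a 50/50 split comes out 48/52 at a realistic sample size, suspect an assignment or logging bug; the analysis is invalid until the cause is found.
Network effect: in social or marketplace products, the behavior of users in A influences users in B, so the arms are not independent.
Novelty effect: a new UI gets a boost in the first week, then normalizes.
Underpowered test: "no significant difference" ≠ "no difference".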
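
The SRM item can be checked mechanically with a chi-square goodness-of-fit test on the arm sizes. The sketch below uses only the standard library (for 1 degree of freedom the p-value is erfc(sqrt(chi2 / 2))); `srm_p_value` is a hypothetical helper and the counts are illustrative.

```python
import math


def srm_p_value(n_control: int, n_treatment: int,
                expected_treatment_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for the observed arm sizes (1 dof)."""
    total = n_control + n_treatment
    exp_treatment = total * expected_treatment_share
    exp_control = total - exp_treatment
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    return math.erfc(math.sqrt(chi2 / 2))


# The same 48/52 ratio at two very different sample sizes.
p_small = srm_p_value(48, 52)            # 100 users: ordinary noise
p_large = srm_p_value(48_000, 52_000)    # 100k users: almost surely a bug

print(f"{p_small:.3f}  {p_large:.3g}")
```

Teams often alert on a strict threshold for this check, since a failed SRM test points to broken randomization or logging rather than to a treatment effect.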

Mini Project

Synthetic A/B

Add v1/v2 split logic to the Iris API.
Generate click events for 1000 simulated users.
Run a chi-square test and report the p-value.
Calculate the sample size required to detect a 1% lift.
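
A possible starting point for the simulation step is sketched below. The true click rates (10% and 12%), the experiment name `synthetic_ab` and the `user-<i>` ids are made-up assumptions; the resulting table has the 2x2 shape that the chi-square step expects.

```python
import hashlib
import random


def assign(user_id: str, exp: str = "synthetic_ab") -> str:
    """Hash-based 50/50 split, mirroring the routing logic."""
    h = hashlib.md5(f"{exp}:{user_id}".encode()).hexdigest()
    return "treatment" if int(h[:8], 16) / 2**32 < 0.5 else "control"


def simulate(n_users: int = 1000, ctr=None, seed: int = 7):
    """Simulate one click / no-click event per user with assumed true CTRs."""
    ctr = ctr or {"control": 0.10, "treatment": 0.12}  # assumed rates
    rng = random.Random(seed)
    users = {"control": 0, "treatment": 0}
    clicks = {"control": 0, "treatment": 0}
    for i in range(n_users):
        arm = assign(f"user-{i}")
        users[arm] += 1
        clicks[arm] += rng.random() < ctr[arm]  # True counts as 1
    return users, clicks


users, clicks = simulate()
# Rows: control, treatment. Columns: click, no-click.
table = [[clicks[a], users[a] - clicks[a]] for a in ("control", "treatment")]
print(users, clicks, table)
```

With only 1000 users the test will usually come out non-significant even though the simulated rates differ, which is the underpowered-test trap in miniature.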

Takeaway

Remember

A/B = hypothesis × power × guardrail. Skip any one of the three and the whole test is meaningless.
