Phase 6 · Chapter 6.04

Data Versioning (DVC)

Git for code, DVC for data. Check out any commit, sync the data with dvc checkout or dvc pull, and you get exactly the same data and model: true reproducibility.

Why

Git breaks down on large files

Git is slow with files over 100 MB and unusable at GB scale.
"Why did the model from 3 months ago give this accuracy?" Without data versions there is no way to answer.
Sharing data in a team: pull it from the repo instead of passing Drive links around.

DVC Mental Model

Git + remote storage

DVC puts a small .dvc metadata file (hash + path) in place of each large file. The actual data lives in remote storage such as S3, GCS, Azure Blob, or an SSH server. Git tracks the metadata; DVC tracks the actual bytes.

repo/
├── .git/
├── .dvc/                 # config + cache
├── data/
│   ├── raw.csv           # gitignored
│   └── raw.csv.dvc       # tiny metadata, in git
└── dvc.yaml              # pipeline definition
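
To make the "hash + path" idea concrete, here is a minimal Python sketch of content-addressed storage. It is an illustration only, not DVC's implementation: the `track` and `restore` helpers, the `.pointer.json` file name, and the cache layout are assumptions made for this example, and DVC's real .dvc format and cache structure differ in detail.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never load fully into memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy the bytes into a hash-addressed cache and write a small pointer file."""
    md5 = file_md5(data_file)
    cached = cache_dir / md5[:2] / md5[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        shutil.copy2(data_file, cached)
    # Hypothetical pointer format; this small file is what would go into Git.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"md5": md5, "path": data_file.name}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> Path:
    """Rebuild the data file from the cache using only the pointer's hash."""
    meta = json.loads(pointer.read_text())
    target = pointer.parent / meta["path"]
    shutil.copy2(cache_dir / meta["md5"][:2] / meta["md5"][2:], target)
    return target


def demo() -> bool:
    """Track a file, delete it, and restore it from the cache."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "raw.csv"
        data.write_text("sepal_length,species\n5.1,setosa\n")
        original = data.read_bytes()
        pointer = track(data, root / "cache")
        data.unlink()  # simulate a fresh clone that has no data yet
        restored = restore(pointer, root / "cache")
        return restored.read_bytes() == original


if __name__ == "__main__":
    print(demo())
```

The pointer file stays tiny no matter how large the dataset is, which is why it is cheap to version in Git while the bytes live elsewhere.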

Quick Start

Init + track

pip install "dvc[s3]"   # quotes keep shells like zsh from expanding the brackets

dvc init
git add .dvc .dvcignore
git commit -m "init dvc"

# remote storage
dvc remote add -d origin s3://my-ml-bucket/dvc-store
git add .dvc/config && git commit -m "add remote"

# track a dataset
dvc add data/raw.csv
git add data/raw.csv.dvc data/.gitignore
git commit -m "add raw dataset v1"

dvc push        # upload bytes to S3
git push        # push metadata

Pipelines

dvc.yaml = reproducible DAG

stages:
  prepare:
    cmd: python src/prepare.py data/raw.csv data/prepared.csv
    deps:
      - data/raw.csv
      - src/prepare.py
    outs:
      - data/prepared.csv

  train:
    cmd: python src/train.py data/prepared.csv models/iris.pkl
    deps:
      - data/prepared.csv
      - src/train.py
    params:
      - train.n_estimators
      - train.max_depth
    outs:
      - models/iris.pkl
    metrics:
      - reports/metrics.json:
          cache: false
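
The chapter does not show the stage scripts themselves, so the following is only a hypothetical sketch of what src/prepare.py could look like. The cleaning rule (drop rows with an empty or missing field) is an assumption chosen for illustration. The one thing DVC needs is that the script reads its declared deps and writes its declared outs.

```python
import csv
import sys
from pathlib import Path


def prepare(src: Path, dst: Path) -> int:
    """Copy rows from src to dst, dropping rows with an empty or missing field.

    Returns the number of data rows written (the header is not counted).
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    written = 0
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to write
            return 0
        writer.writerow(header)
        for row in reader:
            if len(row) == len(header) and all(cell.strip() for cell in row):
                writer.writerow(row)
                written += 1
    return written


# Mirrors the stage command: python src/prepare.py data/raw.csv data/prepared.csv
if __name__ == "__main__" and len(sys.argv) == 3:
    count = prepare(Path(sys.argv[1]), Path(sys.argv[2]))
    print(f"wrote {count} rows")
```

Taking the input and output paths as arguments keeps the script in line with the cmd, deps, and outs entries of the stage.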

dvc repro             # only re-runs changed stages
dvc metrics show      # show current metrics (dvc metrics diff compares across commits)
dvc params diff
dvc exp run -S train.max_depth=10
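
How can dvc repro know which stages changed? DVC records hashes of each stage's dependencies and outputs in dvc.lock and compares them with the workspace. The Python sketch below is a simplified model of that comparison, not DVC's code: the JSON lock file and the `stale_stages` and `record` helpers are assumptions for illustration, and it checks only direct dependencies without propagating changes to downstream stages.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Hash a small file in one read (fine for a demo)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def stale_stages(stages: dict, lock_file: Path, root: Path) -> list:
    """Return stages whose dependency hashes differ from the recorded ones."""
    recorded = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    stale = []
    for name, deps in stages.items():
        current = {dep: file_md5(root / dep) for dep in deps}
        if recorded.get(name) != current:
            stale.append(name)
    return stale


def record(stages: dict, lock_file: Path, root: Path) -> None:
    """Snapshot current dependency hashes, as if every stage had just run."""
    snapshot = {
        name: {dep: file_md5(root / dep) for dep in deps}
        for name, deps in stages.items()
    }
    lock_file.write_text(json.dumps(snapshot))


def demo() -> list:
    """Show stale stages before any run, after a run, and after one edit."""
    stages = {
        "prepare": ["data/raw.csv", "src/prepare.py"],
        "train": ["src/train.py"],
    }
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for rel in ("data/raw.csv", "src/prepare.py", "src/train.py"):
            (root / rel).parent.mkdir(parents=True, exist_ok=True)
            (root / rel).write_text("v1")
        lock = root / "lock.json"
        first = stale_stages(stages, lock, root)   # nothing recorded yet
        record(stages, lock, root)
        second = stale_stages(stages, lock, root)  # everything up to date
        (root / "src/train.py").write_text("v2")   # edit one dependency
        third = stale_stages(stages, lock, root)
        return [first, second, third]


if __name__ == "__main__":
    print(demo())
```

Only the stage whose dependency was edited is reported as stale in the last check, which is the behaviour that lets a pipeline skip unchanged work.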

Reproducibility Workflow

Useful for the long run

# If someone wants to reproduce an old result:
git checkout <commit-sha>
dvc pull              # downloads the data + model for that commit
dvc repro             # re-runs the same pipeline; stages already up to date are skipped

Alternatives

Ecosystem

DVC: Git-native, widely used.
lakeFS: Git-like operations on object storage.
Pachyderm: Container-based data pipelines with versioning.
Delta Lake / Iceberg: Time travel on warehouse tables.
Hugging Face Datasets / Hub: Versioned dataset hosting.

Pitfalls

What often goes wrong

Forgetting dvc push: a teammate runs dvc pull and the data is missing from the remote.
Accidentally committing the data folder to Git: the repo becomes huge.
Remote credentials missing in CI: the pipeline fails.
Not caching pipeline stage outputs: the same work gets repeated.
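
One way to guard against the second pitfall is a small check that runs before committing. The Python sketch below is a hypothetical helper, not part of DVC: it flags large files that have no sibling .dvc pointer. The sibling-pointer test is a heuristic that covers files added with dvc add; pipeline outputs recorded in dvc.lock would need a separate check. The size limit is an arbitrary assumption.

```python
import os
import tempfile
from pathlib import Path

SKIP_DIRS = {".git", ".dvc"}  # internal directories, never scanned


def oversized_untracked(root: Path, limit_bytes: int) -> list:
    """List files larger than limit_bytes that have no sibling .dvc pointer."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune internal directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            path = Path(dirpath) / name
            has_pointer = path.with_name(name + ".dvc").exists()
            if not has_pointer and path.stat().st_size > limit_bytes:
                found.append(path.relative_to(root).as_posix())
    return sorted(found)


def demo() -> list:
    """Build a tiny fake repo and report large files that lack a pointer."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "data").mkdir()
        (root / ".dvc" / "cache").mkdir(parents=True)
        (root / "data" / "raw.csv").write_bytes(b"x" * 2000)
        (root / "data" / "raw.csv.dvc").write_text("pointer")  # tracked via dvc add
        (root / "data" / "extra.bin").write_bytes(b"x" * 2000)  # forgotten file
        (root / ".dvc" / "cache" / "blob").write_bytes(b"x" * 2000)
        (root / "README.md").write_text("hello")
        return oversized_untracked(root, limit_bytes=1000)


if __name__ == "__main__":
    print(demo())
```

A check like this could be called from a pre-commit hook and fail the commit when the returned list is not empty.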

Mini Project

Iris with DVC

Run dvc init and add a local remote (/tmp/dvc-store).
Track the Iris CSV.
Write a dvc.yaml with prepare + train stages.
Run 3 experiments (different max_depth) and compare them with dvc exp show.

Phase 6 Complete

What you learned

Pipelines, ETL/ELT, feature stores, data versioning: the data backbone of an ML system. Next phase: Monitoring & Observability, keeping watch over models running in production.
