
DVC

Getting started with DVC on ExCL systems.

DVC (Data Version Control) is a content-addressable data management system that integrates with Git. On ExCL, DVC enables shared, deduplicated storage across projects using a common cache.

On ExCL, data can be stored in a shared cache in the project folder. Outside of ExCL, the same cache can serve as the DVC remote for data storage.


🚀 What You Get on ExCL with a Shared Data Cache

  • Shared cache across repositories → no duplicate datasets

  • Content-addressed storage → automatic change detection

  • Reproducible pipelines (optional)

  • Git-based versioning for data pointers

  • Flexible remote syncing (for off-cluster use)


⚡ Quick Start (New Repository)

1. Install DVC

uv tool install "dvc[ssh]"

2. Initialize a Repository
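A minimal sketch of this step, assuming a brand-new project. The directory name `my-project` is a hypothetical placeholder, and `git commit` assumes your Git identity is already configured.

```shell
# "my-project" is a hypothetical directory name.
PROJECT_DIR=my-project

# DVC keeps its metadata next to Git's, so start from a Git repository.
git init "$PROJECT_DIR"
cd "$PROJECT_DIR"

# Create the .dvc/ directory and stage DVC's config files in Git.
dvc init

# Record the initialization in Git history.
git commit -m "Initialize DVC"
```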


3. Configure Shared Cache (ExCL)
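One possible configuration, shown as a sketch rather than the official ExCL setup. `CACHE_DIR` is a hypothetical placeholder; substitute your project's shared cache directory. The `cache.shared group` and `cache.type symlink` settings are standard DVC options that suit a cache used by several people.

```shell
# Hypothetical placeholder: replace with your project's shared cache
# directory on ExCL.
CACHE_DIR=/path/to/shared/dvc-cache

# Point this repository at the shared cache instead of .dvc/cache.
dvc cache dir "$CACHE_DIR"

# Make cache files group-writable so teammates can share them.
dvc config cache.shared group

# Link workspace files into the cache rather than copying them.
dvc config cache.type symlink

# Commit the settings so collaborators inherit them.
git add .dvc/config
git commit -m "Configure shared DVC cache"
```

Committing `.dvc/config` is a team decision. Adding `--local` to the `dvc cache dir` and `dvc config` commands keeps the settings on your machine instead.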


📦 Adding Data
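A typical sequence looks like the sketch below. `data/my-dataset` is a hypothetical file or directory inside the repository.

```shell
# "data/my-dataset" is a hypothetical path inside the repository.
DATA_PATH=data/my-dataset

# Hash the data, store it in the cache, and write data/my-dataset.dvc.
dvc add "$DATA_PATH"

# Commit the small pointer file and the .gitignore entry DVC generated.
git add "$DATA_PATH.dvc" "$(dirname "$DATA_PATH")/.gitignore"
git commit -m "Track my-dataset with DVC"
```

Only the `.dvc` pointer file goes into Git. The data itself stays in the DVC cache.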


🔄 Getting Data
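As a sketch, assuming the repository already points at the shared cache and the Git checkout is up to date:

```shell
# Link the data referenced by the .dvc pointer files from the cache
# into the workspace.
dvc checkout
```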

If cache settings changed:
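For example, after switching `cache.type` or moving the cache directory, existing workspace links can be rebuilt with:

```shell
# Recreate workspace links so they match the current cache settings.
dvc checkout --relink
```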


🧰 Helpful Commands
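A short, non-exhaustive selection of standard DVC commands that tend to be useful for inspecting a repository:

```shell
dvc status         # compare workspace data against the .dvc pointer files
dvc diff           # show data changes between HEAD and the workspace
dvc cache dir      # print the cache directory currently in use
dvc config --list  # show the DVC configuration in effect
dvc doctor         # report DVC version, platform, and supported link types
```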


🧑‍🤝‍🧑 Cloning a DVC Repository and Post Clone Steps

On ExCL (shared cache)
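A sketch of the post-clone steps, assuming the cache settings were committed in `.dvc/config`. If they were stored with `--local`, they need to be set again in the new clone first.

```shell
# Run inside the freshly cloned repository.

# Confirm the repository points at the shared cache.
dvc cache dir

# Link the tracked data from the shared cache into the workspace.
dvc checkout
```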


With ExCL’s shared cache, dvc push and dvc pull are not needed. Use dvc checkout to link data into the workspace, and dvc checkout --relink after cache settings change.


Outside ExCL (remote usage)
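A sketch of one way to set this up over SSH. The user, host, and path values are all hypothetical placeholders, as is the remote name `excl`; substitute your own.

```shell
# Hypothetical placeholders: your ExCL username, an ExCL host reachable
# over SSH, and the shared cache directory.
EXCL_USER=your-username
EXCL_HOST=excl-login.example
EXCL_CACHE=/path/to/shared/dvc-cache
REMOTE_URL="ssh://${EXCL_USER}@${EXCL_HOST}${EXCL_CACHE}"

# Register ExCL as the default remote in .dvc/config.local, so the
# setting stays on this machine and is not committed.
dvc remote add --default --local excl "$REMOTE_URL"

dvc pull   # download tracked data from ExCL into the local cache and workspace
dvc push   # upload locally added data back to ExCL
```

If the committed `.dvc/config` sets `cache.dir` to a path that only exists on ExCL, a local override such as `dvc cache dir --local <path>` may also be needed on the outside machine.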


On systems outside of ExCL, treat ExCL as a DVC remote and use dvc push and dvc pull to synchronize files to and from your local cache.


🧠 Mental Model

  • Git tracks small metadata files (.dvc)

  • DVC cache stores actual data (by hash)

  • Workspace files are symlinks into the cache (with cache.type symlink)

  • Multiple repos → same shared cache → no duplication


🗒️ Best Practices on ExCL

  • Use a shared project cache (/auto/project/…)

  • Always run dvc checkout after cloning

  • Prefer cache.type symlink for data sharing and clarity

  • Avoid unnecessary pipeline re-runs (be careful with dependencies)

  • Periodically review cache usage (dvc gc with care)


📊 When to Use DVC vs. an Existing Workflow

Use DVC when:

  • You want reproducibility and versioned datasets

  • Multiple repos need to share large datasets

  • You want content-based tracking instead of paths

Use existing workflow when:

  • You have large exploratory runs (e.g., DEFFE)

  • Data is expensive to recompute, or you have bespoke processes

  • You prefer manual control and simpler tooling


🔁 Key Change from Simple Data Storage Approach

Instead of:

You reference:
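As a sketch of that contrast, with hypothetical paths: a script that used to hard-code an absolute location on shared storage can instead use a repository-relative path backed by a `.dvc` pointer file.

```shell
# Before: an absolute location on shared storage (hypothetical path).
DATASET=/path/to/shared/storage/my-dataset

# After: a path inside the repository. The matching
# data/my-dataset.dvc pointer is committed to Git, and `dvc checkout`
# links the data into place.
DATASET=data/my-dataset
```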

→ Data becomes portable, versioned, and reproducible


🧹 Notes on Cache Management

  • The cache grows over time: it is content-addressed, so every version of a file is kept until it is garbage-collected

  • Use:
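A cautious sketch of garbage collection on a shared cache. The two repository paths are hypothetical placeholders, and the `--dry` preview flag is assumed to be available in your DVC release. Because the cache is shared, anything not referenced by the listed repositories is a candidate for deletion.

```shell
# Hypothetical paths to other repositories that share this cache.
REPO_A=/path/to/repo-a
REPO_B=/path/to/repo-b

# Preview what would be deleted without removing anything.
# --workspace keeps data referenced by the current workspace;
# --projects also keeps data referenced by the listed repositories.
dvc gc --workspace --projects "$REPO_A" "$REPO_B" --dry

# Run the collection for real once the preview looks right.
dvc gc --workspace --projects "$REPO_A" "$REPO_B"
```

With `--workspace` alone, data referenced only by older commits or other branches is removed. Options such as `--all-branches`, `--all-tags`, or `--all-commits` widen what is kept.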

