# DVC

[DVC](https://dvc.org/) (Data Version Control) is a **content-addressable data management system** that integrates with Git. On ExCL, DVC enables **shared, deduplicated storage** across projects using a common cache.

DVC works well with ExCL. On ExCL, data can be stored in a shared cache in the project folder. Outside of ExCL, this same cache, can be used as the remote data storage.

***

## 🚀 What You Get on ExCL with shared data cache

* **Shared cache across repositories** → no duplicate datasets
* **Content-addressed storage** → automatic change detection
* **Reproducible pipelines (optional)**
* **Git-based versioning for data pointers**
* **Flexible remote syncing (for off-cluster use)**

***

## ⚡ Quick Start (New Repository)

### 1. Install DVC

```bash
uv tool install dvc[ssh,s3]
```

***

### 2. Initialize a Repository

```bash
git init
dvc init

# Auto-stage DVC changes into git
dvc config core.autostage true
```

***

### 3. Configure Shared Cache (ExCL)

```bash
# Set cache directory path to shared project cache location
dvc cache dir /auto/project/<project>/dvc-cache

# Enable group sharing of cache
dvc config cache.shared group

# Use symlinks (recommended on ExCL)
dvc config cache.type symlink
```

{% hint style="warning" %}
This can also be skipped and done instead at the instance level to better support using the repo on non-ExCL systems. See [Post Clone Steps on ExCL | ExCL User Docs](https://docs.excl.ornl.gov/quick-start-guides/dvc#on-excl-shared-cache).
{% endhint %}

***

## 📦 Adding Data

```bash
dvc add data/my-dataset

git add data/my-dataset.dvc .gitignore # Not needed if autostage is true.
git commit -m "Track dataset with DVC"

dvc push
git push
```

***

## 🔄 Getting Data

```bash
dvc checkout
```

If cache settings changed:

```bash
dvc checkout --relink
```

***

## 🧰 Helpful Commands

```bash
# Show configuration (with source)
dvc config -l --show-origin

# Check data status vs cache/remote
dvc status -c

# Fix workspace links after config change
dvc checkout --relink

# Make git repo group-writable (shared repos)
# This is useful if you want to share a git DVC repo in project space.
git config core.sharedRepository group
```

***

## 🧑‍🤝‍🧑 Cloning a DVC Repository and Post Clone Steps

### On ExCL (shared cache)

```bash
git clone <repo>
cd <repo>
```

```bash
# Set cache directory path to shared location
dvc config --local cache.dir /auto/project/<project>/dvc-cache
# Share cache with your group
dvc config --local cache.shared group
# Use symlinks to map files from the cache.
dvc config --local cache.type symlink
```

```bash
dvc checkout
```

{% hint style="info" %}
With ExCL’s shared cache, `dvc push` and `dvc pull` are not needed, only `dvc checkout` and `dvc checkout --relink`.
{% endhint %}

***

### Outside ExCL (remote usage)

```bash
git clone <repo>
cd <repo>
```

```bash
# Set the remote to be ExCL's DVC cache
dvc remote add -d --local excl ssh://login.excl.ornl.gov/auto/project/<project>/dvc-cache

# Default is reflink,copy so if reflink is not supported, you will want to change it. Check for supported types with `dvc version`. Symlink is recommended if reflink is not an option.
dvc config --local cache.type symlink
```

```bash
# Sync data
dvc pull
```

{% hint style="info" %}
For systems outside of ExCL’s, treat ExCL as a DVC remote and use `dvc push` and `dvc pull` to synchronize files to and from your local cache.
{% endhint %}

***

## 🧠 Mental Model

* Git tracks **small metadata files** (`.dvc`)
* DVC cache stores **actual data (by hash)**
* Workspace files are **symlinks into the cache**
* Multiple repos → **same shared cache → no duplication**

***

## 🗒️ Best Practices on ExCL

* Use a **shared project cache** (`/auto/project/…`)
* Always run `dvc checkout` after cloning
* Prefer `cache.type symlink` for data sharing and clarity
* Avoid unnecessary pipeline re-runs (be careful with dependencies)
* Periodically review cache usage (`dvc gc` with care)

***

## 📊 When to Use DVC vs. \<Insert Existing Workflow>

### Use DVC when:

* You want **reproducibility and versioned datasets**
* Multiple repos need to **share large datasets**
* You want **content-based tracking instead of paths**

### Use existing workflow when:

* You have **large exploratory runs (e.g., DEFFE)**
* Data is **expensive to recompute**, or you have bespoke processes.
* You prefer **manual control and simpler tooling**

***

## 🔁 Key Change from Simple Data Storage Approach

Instead of:

```
/path/to/data/file
```

You reference:

```
(repo, git commit, dvc-tracked file)
```

→ Data becomes **portable, versioned, and reproducible**

***

## 🧹 Notes on Cache Management

* Cache grows over time (content-addressed)
* Use:

```bash
dvc gc
```

{% hint style="warning" %}
Only run with care in shared environments (can remove needed data)
{% endhint %}

{% hint style="warning" %}
Note that enabling soft/hard links causes DVC to protect the linked data because editing them in-place would corrupt the cache. See [`dvc unprotect`](https://doc.dvc.org/command-reference/unprotect).
{% endhint %}

{% hint style="warning" %}
Using [`dvc gc`](https://doc.dvc.org/command-reference/gc) with a shared cache may delete data needed in another project! See more info about [cleaning a shared cache](https://doc.dvc.org/command-reference/gc#cleaning-shared-cache-or-remote) safely.
{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.excl.ornl.gov/quick-start-guides/dvc.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
