5 Code persistence
When we talk about storing the model, we quickly realise that there is more to this task than simply calling pickle.dump. The same is true for persisting the model, data, and code together.
Luckily, we already have the perfect tool for controlling the code in use: git1. We can make sure to commit our source files, and when using a proper package manager like pdm2 we can hope to reproduce our environment for the formats that require these features.
To illustrate this, we move the code from Listing 4.1 into a project. In this process we go from the dense script to a proper file structure by splitting the source up into several parts. Furthermore, we rework some of the details, like only downloading the data once.
The following example refers to the GitHub repository kandolfp/MECH-M-DUAL-2-MLB-DATA. Please use it as a reference and move along the corresponding commit SHAs to see certain reference points, as we continuously rework the code to adapt to the new requirements and situation.
This also means some of the code included below is not interactively created but static with regard to the execution time and state.
To emphasise that we are in the context of this reference repository, the terminal prompt will always start with MLB-DATA$.
The code is called with respect to the root directory of the project, as seen in the next output block.
If you decide not to write your own code but rather use the reference repository for the following exercise, it is advisable to implement the exercise at the same SHA mentioned prior to the exercise, in order to avoid getting ahead of yourself.
After the redesign we get a structure looking something like the following, see 22788a6
for reference and file content:
MLB-DATA$ tree
.
├── pdm.lock
├── pyproject.toml
└── src
└── MECH-M-DUAL-2-MLB-DATA
├── data.py
├── inference.py
├── model.py
├── myio.py
└── train.py
3 directories, 7 files
We can train the model by calling train.py (in this case with a logger3 on DEBUG enabled).
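The DEBUG lines in the following output come from Python's standard logging module. A minimal sketch of how such output can be enabled (the actual setup in the project may differ):

import logging

# emit records of level DEBUG and above; the default handler formats
# them as "LEVEL:logger-name: message", matching the output below
logging.basicConfig(level=logging.DEBUG)
logging.debug(" Train classifier")  # prints: DEBUG:root: Train classifier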
MLB-DATA$ pdm run src/MECH-M-DUAL-2-MLB-DATA/train.py
DEBUG:root: Loaded the data with Split of 60 to 20 per category.
DEBUG:root: Create classifier
DEBUG:root: Train classifier
DEBUG:root: Score classifier
INFO:root: We have a hard voting score of 0.8
DEBUG:root: Save clf to skops file models/model.skops
Of course we can also load the model again and do inference with it by calling inference.py.
MLB-DATA$ pdm run src/MECH-M-DUAL-2-MLB-DATA/inference.py
DEBUG:root: Loaded the data with Split of 60 to 20 per category.
DEBUG:root: Load classifier
DEBUG:root: Load clf from skops file models/model.skops
WARNING:root: Unknown type at 0 is sklearn.utils._bunch.Bunch.
DEBUG:root: Score classifier
INFO:root: We have a hard voting score of 0.8
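Under the hood the model is serialized with skops. A minimal sketch of save/load helpers in the spirit of myio.py (the function names here are assumptions, not the repository's actual code):

from skops.io import dump, load, get_untrusted_types

def save(clf, path="models/model.skops"):
    dump(clf, path)  # serialize the fitted classifier to disk

def load_clf(path="models/model.skops"):
    # list the types skops does not trust by default; this is where the
    # WARNING about sklearn.utils._bunch.Bunch above originates
    unknown = get_untrusted_types(file=path)
    # in a real project, inspect this list before trusting it blindly
    return load(path, trusted=unknown)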
Now we can start connecting the model and the code. Our model was created at commit 22788a6 and stored in the directory models. There are several things we can do to make sure this is reflected within our little project.
- We can share experiments and models with other people and they can reproduce them.
- Adopt the convention to never train and store an experiment for later use as long as there are uncommitted changes in our code (easier said than done).
- Make sure to note somewhere which commit SHA is the current HEAD when storing the model (a minimal sketch of this follows after the list). This allows us to reproduce the model in case of data corruption or loss, and to compare it with other models.
- When running inference we can check for the commit SHA and switch to it in case the project dependencies have changed and we get problems loading the file.
- Every time we change a parameter we get a new commit, which is not very nice for our code history but can be used to note the intent of the experiment in the commit message.
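A minimal sketch for recording the current HEAD, as promised in the list above (where and how to store the SHA is left to the project):

import subprocess

def current_head_sha() -> str:
    # ask git for the full SHA of HEAD, e.g. to store it next to the model
    return subprocess.check_output(
        ["git", "rev-parse", "HEAD"], text=True
    ).strip()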
This is rather cumbersome and requires a lot of discipline. It also becomes tricky if several people work on the same project and run experiments with different parameters.
5.1 Externalize the parameters/configuration
The first thing we do is externalize the configuration, so it is no longer part of our main source code. This way we can distinguish between actual changes to the structure and source, and a commit for a simple experiment.
yaml is the format of choice for these aspects, see Wikipedia. It is a human-readable data serialization language. One possible interpretation of the config (among many others) can be seen in the next code block.
PCA:
  type: sklearn.decomposition.PCA
  init_args:
    n_components: 41
VotingClassifier:
  type: sklearn.ensemble.VotingClassifier
  init_args:
    flatten_transform: False
  estimators:
    - LinearDiscriminantAnalysis
    - RandomForestClassifier
    - SVC
LinearDiscriminantAnalysis:
  type: sklearn.discriminant_analysis.LinearDiscriminantAnalysis
  init_args:
    solver: svd
RandomForestClassifier:
  type: sklearn.ensemble.RandomForestClassifier
  init_args:
    n_estimators: 500
    max_leaf_nodes: 2
    random_state: 6020
SVC:
  type: sklearn.svm.SVC
  init_args:
    kernel: linear
    probability: True
    random_state: 6020
We can use the Python package omegaconf to load and use it. One feature of the OmegaConf class is that we can use unpacking4 and therefore write a line like PCA(**params["PCA"].init_args), dramatically simplifying our code. Have a look at commit 9f7dead to see this implemented and in action. As our default parameter template we include the above parameters as params.yaml in the main directory of our project.
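A minimal sketch of how loading and unpacking fit together, assuming the file layout from above (the actual code in the repository may differ):

from omegaconf import OmegaConf
from sklearn.decomposition import PCA

# parse the YAML template into a nested, dict-like DictConfig
params = OmegaConf.load("params.yaml")
# a DictConfig behaves like a mapping, so ** unpacking works directly
pca = PCA(**params["PCA"].init_args)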
Now that the config is externalized we can continue on our quest to persist all aspects of this project together in a useful way.
5.2 Data persistence
Our model depends on the code and the configuration, but crucially also on the input data itself. In order to make sure that we can reliably reproduce a model, we also need to make sure our data is reproducible.
In our reference project we use some files from GitHub (see Brunton and Kutz (2022) for a reference), but let us still make sure they are tracked within our system.
We use a simple tool for the next paragraphs to illustrate the concepts; there are plenty of alternatives that can be and are used.
The selection of dvc is purely motivated by the following features: it can easily be used in a lecture such as ours, has some basic features that illustrate the requirements on such systems, integrates nicely with Python, and has only limited dependencies outside of the Python ecosystem.
As always, the best platform depends on the project and the available infrastructure.
One tool for data version control is dvc. As it is written in Python, we can even add and track its version via our package manager (pdm add dvc). Once installed, we can initialize it in our project,
MLB-DATA$ pdm run dvc init
Initialized DVC repository.
You can now commit the changes to git.
+---------------------------------------------------------------------+
| |
| DVC has enabled anonymous aggregate usage analytics. |
| Read the analytics documentation (and how to opt-out) here: |
| <https://dvc.org/doc/user-guide/analytics> |
| |
+---------------------------------------------------------------------+
What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>
To make the changes permanent we commit the directory .dvc
(with all the files included) to git
and therefore our project now runs with dvc
, see commit df38086 for reference.
To add the data
directory simply run
MLB-DATA$ pdm run dvc add data
100% Adding...|███████████████████████████████████████|1/1 [00:00, 5.67file/s]
To track the changes with git, run:
git add data.dvc
To enable auto staging, run:
dvc config core.autostage true
Again, to make it permanent in the project we need to add data.dvc
to git
(as suggested).
At this point we have the two files catData_w.mat
and dogData_w.mat
in this directory and they are under version control from dvc
.
If we take a look into data.dvc we can see that it tracks the files via MD5 hashes and includes some additional useful information.
MLB-DATA$ cat data.dvc
outs:
- md5: 5987e80830fc2caf6d475da3deca1dfe.dir
  size: 111165
  nfiles: 2
  hash: md5
  path: data
As mentioned, dvc works similarly to git, so eventually we will need to include a remote that we push data to. For now we just work locally, as we could also do for a git repository.
Other than that, we can now change the files, run dvc add data, and as soon as we commit the corresponding change of data.dvc to git we know exactly what data is used and we can also restore it.
To do these operations dvc uses a cache (the default location is .dvc/cache) and the files in the workspace are links into the cache.
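Tracked data can also be read programmatically. A minimal sketch with the dvc.api module (the path and revision here are just examples):

import dvc.api

# open a dvc-tracked file in the state it had at a given git revision
with dvc.api.open("data/catData_w.mat", rev="HEAD", mode="rb") as f:
    raw = f.read()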
The most important dvc commands are (we link the docs for an extended reference):
- dvc add to add a file or directory.
- dvc checkout brings your workspace up to date, according to the current state of the .dvc files.
- dvc commit updates the .dvc files and stores the content in the cache; most of the time called implicitly.
- dvc config to view and change the config for the repo, or globally for dvc on the system.
- dvc data status shows changes to the files in the workspace with respect to the git HEAD.
- dvc destroy removes all files and dvc structures for the current project, including the cache. The symlinks will be replaced by the actual data so the current state is preserved.
- dvc exp has multiple subcommands and is used to handle experiments; we will use this command later.
- dvc fetch downloads files from the remote repository to the cache.
- dvc pull downloads files from the remote and makes them visible in the working space.
- dvc push uploads the tracked files to the remote.
For the other commands run dvc --help
or look at the docs.
As can be seen from the commands, dvc was built with git in mind and feels quite similar. This means it uses the same commands for the same (or almost the same) operations. Unfortunately or luckily (depending on our preferences), it also brings in the sometimes confusing command structure and concepts like a working space.
Recall the introduction to git for some of these5. We will use this to also recall some details about git and, hopefully, further foster the understanding.
Now our files are tracked, but as you probably realised we did not add the models folder to dvc. This is because we can use the dvc exp feature to allow for more fine-grained control and even parameter overviews. Furthermore, we can use logging features to integrate with this system even better.
dvc also allows for nice, advanced pipelines (we look at a small example later) and automatic computation as well as monitoring. In all its facets this is quite advanced and can be introduced when our project grows.
5.3 dvclive for experiment management
dvclive works best with the big ML frameworks like keras or pytorch, but we can also utilize it for our example project. The introduction to experiment management from the dvc perspective can be found in the docs.
To show some of the dvclive features we reworked the code, see commit 519a85 for the changes (including pdm add dvclive). Now, when we run our job once more, it will create the dvclive directory with a couple of subdirectories containing our metrics, looking like the following output.
MLB-DATA$ pdm run src/MECH-M-DUAL-2-MLB-DATA/train.py
INFO:root: We have a hard voting train-score of 1.0
INFO:root: We have a hard voting test-score of 0.8
100% Adding...|███████████████████████████████████████|1/1 [00:00, 7.64file/s]
And the resulting files are stored in dvclive
.
MLB-DATA$ tree dvclive/
dvclive/
├── artifacts
│ └── model.skops
├── metrics.json
├── params.yaml
└── plots
├── metrics
│ ├── testscore.tsv
│ └── trainscore.tsv
└── sklearn
└── confusion_matrix.json
5 directories, 6 files
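A minimal sketch of what the dvclive integration in train.py could look like (the placeholder values and the exact set of calls are assumptions, not the repository's code):

from dvclive import Live

# placeholder values standing in for the real training results
trainscore, testscore = 1.0, 0.8
y_true, y_pred = [0, 1, 1, 0], [0, 1, 0, 0]

with Live() as live:  # creates the dvclive/ directory
    live.log_metric("trainscore", trainscore)  # -> plots/metrics/trainscore.tsv
    live.log_metric("testscore", testscore)    # -> plots/metrics/testscore.tsv
    # -> plots/sklearn/confusion_matrix.json
    live.log_sklearn_plot("confusion_matrix", y_true, y_pred)
    # the stored model itself would be registered along the lines of
    # live.log_artifact("dvclive/artifacts/model.skops", type="model")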
The experiment is automatically added. We can check this with dvc exp show
MLB-DATA$ pdm run dvc exp show
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Experiment Created trainscore testscore PCA/n_components LinearDiscriminantAnalysis/solver RandomForestClassifier/n_estimators RandomForestClassifier/max_leaf_nodes RandomForestClassifier/random_state SVC/kernel SVC/probability SVC/random_state PCA.type PCA.init_args.n_components VotingClassifier.type VotingClassifier.init_args.flatten_transform VotingClassifier.estimators LinearDiscriminantAnalysis.type LinearDiscriminantAnalysis.init_args.solver RandomForestClassifier.type RandomForestClassifier.init_args.n_estimators RandomForestClassifier.init_args.max_leaf_nodes RandomForestClassifier.init_args.random_state SVC.type SVC.init_args.kernel SVC.init_args.probability SVC.init_args.random_state
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
workspace - 1 0.8 41 svd 500 2 6020 linear True 6020 sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020
main Mar 19, 2025 - - - - - - - - - - sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020
└── 9e83c73 [sural-cyma] 10:03 AM 1 0.8 41 svd 500 2 6020 linear True 6020 sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
dvc exp show provides an overview of the experiments based on the git commit SHAs. The output can be counter-intuitive. Check out the options (dvc exp show --help), especially --num and -A, to control where to search for experiments.
Each experiment has a unique name; as we did not specify anything, a random name is created, in our case sural-cyma (they are often fun, but it can be hard to infer meaning from them, and speaking names are most of the time more useful, see the -n option).
It is also possible to provide a commit message to give more details about the experiment.
The overview also allows us to see the score for our model and the parameters, which allows us to quickly compare models.
We can see that the parameters from params.yaml are automatically added, and that our code somewhat duplicated them.
dvclive relies on git to do its magic for files, and on the dvc cache for large files. This works by creating a reference inside git for each experiment and storing the changes there. We can see this in the above logger output or via git.
1MLB-DATA$ git lg
| * 1f2bee5 - (29 hours ago) dvc: commit experiment 44136fa35 - Authors
|/
* 519a854 - (29 hours ago) feat: include dvc-live - Authors
* f264314 - (2 days ago) feat: add data to dvc - Authors
* df38086 - (2 days ago) feat: init dvc - Authors
* 073fb93 - (2 days ago) fixup! feat: externalize params via omegaconfig - Authors
* 9f7dead - (2 days ago) feat: externalize params via omegaconfig - Authors
* 22788a6 - (2 days ago) init: project - Authors
- 1 This is a shorthand, see6.
Furthermore, we did not commit our changes to git (bad practice!) but they are stored alongside the experiment, so no information is lost (we can find them in .git/refs, see below).
To clean up, we commit our changes as we know they work and then rerun the code for a new experiment, see 519a854. We can remove the previous version with dvc exp rm <name>
.
By default, this is not moved to the git remote; to do so we need to run dvc exp push <git remote>.
MLB-DATA$ pdm run dvc exp push origin
Collecting |0.00 [00:00, ?entry/s]
Pushing
Experiment sural-cyma is up to date on Git remote 'origin'.
We can also see this in the git reflog
(compare above output).
MLB-DATA$ git reflog
519a854 (HEAD -> main) HEAD@{0}: dvc: Restore HEAD to 'main'
1f2bee5 HEAD@{1}: commit: dvc: commit experiment 44136fa355b3678
519a854 (HEAD -> main) HEAD@{2}: checkout: moving from main to 519a8544c82667
519a854 (HEAD -> main) HEAD@{3}: dvc: Restore HEAD to 'main'
519a854 (HEAD -> main) HEAD@{4}: checkout: moving from main to 519a8544c82667
519a854 (HEAD -> main) HEAD@{5}: commit: feat: include dvc-live
f264314 HEAD@{6}: dvc: Restore HEAD to 'main'
9e83c73 HEAD@{7}: commit: dvc: commit experiment 44136fa355b3678
f264314 HEAD@{8}: checkout: moving from main to f264314e5442315
f264314 HEAD@{9}: dvc: Restore HEAD to 'main'
f264314 HEAD@{10}: checkout: moving from main to f264314e5442315
f264314 HEAD@{11}: dvc: Restore HEAD to 'main'
4afc6af HEAD@{12}: commit: dvc: commit experiment 44136fa355b3678a
f264314 HEAD@{13}: checkout: moving from main to f264314e5442315
f264314 HEAD@{14}: dvc: Restore HEAD to 'main'
f264314 HEAD@{15}: checkout: moving from main to f264314e5442315
In the file system we can look inside (via ls
on Linux).
MLB-DATA$ ls .git/refs/exps/51/9a8544c82667cec5356f92ddde77993f0a0e76/
sural-cyma
We can also see that a new file has appeared in our root directory, dvc.yaml, with the following content.
params:
- dvclive/params.yaml
metrics:
- dvclive/metrics.json
plots:
- dvclive/plots/metrics:
    x: step
- dvclive/plots/sklearn/confusion_matrix.json:
    template: confusion
    x: actual
    y: predicted
    title: Confusion Matrix
    x_label: True Label
    y_label: Predicted Label
artifacts:
  model:
    path: dvclive/artifacts/model.skops
    type: model
The content reflects our call and integration with dvclive
from the dvc
perspective and is called a stage. Furthermore, the file dvclive/artifacts/model.skops.dvc
keeps track of the model itself.
In order to show the next feature we need to remove the dvc.yaml and dvclive/artifacts/model.skops.dvc files again, as the current configuration would produce a conflict with the stage we want to introduce next. Please delete these files if you type along.
5.4 dvc pipeline
The pipeline features we are after are part of the dvc stage command, and we add our training call in the following fashion.
1MLB-DATA$ pdm run dvc stage add --name train \
2    --deps data --deps src/MECH-M-DUAL-2-MLB-DATA/ --deps params.yaml \
3    --outs dvclive \
4    pdm run python src/MECH-M-DUAL-2-MLB-DATA/train.py
Added stage 'train' in 'dvc.yaml'
To track the changes with git, run:
    git add dvc.yaml dvclive/artifacts/.gitignore
- 1 Define the name of the stage.
- 2 Include the dependencies; dvc will keep track of these files and only rerun the code if there are changes in these files/directories.
- 3 Define the output directory to keep track of all the files.
- 4 Command to run to execute the stage.
Before we follow the instructions for the git commit, we make a dvc commit.
MLB-DATA$ pdm run dvc commit
This creates the dvc.lock
file associated with the stage. It keeps track of the dependencies (see above explanations):
schema: '2.0'
stages:
  train:
    cmd: pdm run python src/MECH-M-DUAL-2-MLB-DATA/train.py
    deps:
    - path: data
      hash: md5
      md5: 5987e80830fc2caf6d475da3deca1dfe.dir
      size: 111165
      nfiles: 2
    - path: params.yaml
      hash: md5
      md5: cb73b44317c559fce7c5e035ba5be854
      size: 644
    - path: src/MECH-M-DUAL-2-MLB-DATA/
      hash: md5
      md5: c93360b2cf461f6b2f8e9656882331a7.dir
      size: 14538
      nfiles: 8
    outs:
    - path: dvclive
      hash: md5
      md5: c2bbfd7cb23c3aa8700bc24287b56fee.dir
      size: 5569537
      nfiles: 6
Have a look at 1282a7e9 to see how this is reflected in our reference project.
Now we can put it into action and execute (all) stages, thereby creating a new experiment.
MLB-DATA$ pdm run dvc exp run
Reproducing experiment 'mesic-beep'
Building workspace index |16.0 [00:00, 1.23kentry/s]
Comparing indexes |15.0 [00:00, 1.22kentry/s]
Applying changes |0.00 [00:00, ?file/s]
'data.dvc' didn't change, skipping
Stage 'train' didn't change, skipping
Ran experiment(s): mesic-beep
Experiment results have been applied to your workspace.
As we did not change anything in our configuration (see the dependencies above), dvc is smart enough to basically just copy the experiment. But we can also change the parameters, either in the file directly or interactively, as seen in the next command block.
MLB-DATA$ pdm run dvc exp run --set-param 'PCA.init_args.n_components=5'
Reproducing experiment 'sappy-corm'
1Building workspace index |4.00 [00:00, 378entry/s]
Comparing indexes |15.0 [00:00, 1.20kentry/s]
Applying changes |6.00 [00:00, 391file/s]
'data.dvc' didn't change, skipping
Running stage 'train':
2> pdm run python src/MECH-M-DUAL-2-MLB-DATA/train.py
DEBUG:root: Loaded the data with Split of 60 to 20 per category.
DEBUG:root: Load config
DEBUG:root: Create classifier
DEBUG:root: Train classifier
DEBUG:root: Score classifier
INFO:root: We have a hard voting train-score of 0.9916666666666667
INFO:root: We have a hard voting test-score of 0.6
DEBUG:root: Save clf to skops file dvclive/artifacts/model.skops
Updating lock file 'dvc.lock'
Ran experiment(s): sappy-corm
Experiment results have been applied to your workspace.
- 1 dvc checks the dependencies and applies the changes.
- 2 Run the command specified in the stage.
The result can be seen in the experiment list, of course under a new commit SHA.
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Experiment Created trainscore testscore PCA/n_components LinearDiscriminantAnalysis/solver RandomForestClassifier/n_estimators RandomForestClassifier/max_leaf_nodes RandomForestClassifier/random_state SVC/kernel SVC/probability SVC/random_state PCA.type PCA.init_args.n_components VotingClassifier.type VotingClassifier.init_args.flatten_transform VotingClassifier.estimators LinearDiscriminantAnalysis.type LinearDiscriminantAnalysis.init_args.solver RandomForestClassifier.type RandomForestClassifier.init_args.n_estimators RandomForestClassifier.init_args.max_leaf_nodes RandomForestClassifier.init_args.random_state SVC.type SVC.init_args.kernel SVC.init_args.probability SVC.init_args.random_state data params.yaml src/MECH-M-DUAL-2-MLB-DATA
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
workspace - 0.99167 0.6 5 svd 500 2 6020 linear True 6020 sklearn.decomposition.PCA 5 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 5987e80830fc2caf6d475da3deca1dfe.dir 0ad678c7c338214916d88a106b4fe90a c93360b2cf461f6b2f8e9656882331a7.dir
main 01:36 PM - - - - - - - - - - sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 5987e80830fc2caf6d475da3deca1dfe.dir cb73b44317c559fce7c5e035ba5be854 c93360b2cf461f6b2f8e9656882331a7.dir
└── 9c326d3 [sappy-corm] 01:36 PM 0.99167 0.6 5 svd 500 2 6020 linear True 6020 sklearn.decomposition.PCA 5 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 5987e80830fc2caf6d475da3deca1dfe.dir 0ad678c7c338214916d88a106b4fe90a c93360b2cf461f6b2f8e9656882331a7.dir
519a854 10:15 AM - - - - - - - - - - sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 - - -
├── 1b92871 [mesic-beep] 01:26 PM - - - - - - - - - - sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 5987e80830fc2caf6d475da3deca1dfe.dir cb73b44317c559fce7c5e035ba5be854 c93360b2cf461f6b2f8e9656882331a7.dir
└── 1f2bee5 [sural-cyma] 10:15 AM 1 0.8 41 svd 500 2 6020 linear True 6020 sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 - - -
f264314 Mar 19, 2025 - - - - - - - - - - sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 - - -
df38086 Mar 19, 2025 - - - - - - - - - - sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 - - -
073fb93 Mar 19, 2025 - - - - - - - - - - sklearn.decomposition.PCA 41 sklearn.ensemble.VotingClassifier False ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC'] sklearn.discriminant_analysis.LinearDiscriminantAnalysis svd sklearn.ensemble.RandomForestClassifier 500 2 6020 sklearn.svm.SVC linear True 6020 - - -
9f7dead Mar 19, 2025 - - - - - - - - - - - - - - - - - - - - - - - - - - - -
22788a6 Mar 19, 2025 - - - - - - - - - - - - - - - - - - - - - - - - - - - -
─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
If we want to restore a specific experiment we can use dvc exp apply; note that this will also restore the code and the data.
In order to share experiments we need to push them to the remotes, both git and dvc. This is done via
MLB-DATA$ pdm run dvc exp push origin -A
Experiment sural-cyma is up to date on Git remote 'origin'.
Pushed experiment sappy-corm and mesic-beep to Git remote 'origin'.
git SHAs
As we can see, dvc attaches experiments to git SHAs. This is an excellent idea as they identify a code state uniquely. Nevertheless, this can backfire.
Some commands and actions can change a git SHA; this includes squash merges or a rebase. Be careful when using such actions in your dvc repository together with experiment storage.
5.5 Add a remote for dvc
The biggest thing missing from our example project with dvc is a remote to store our data and make it available for cooperation.
For this lecture we use the storage on Sakai; this also means that this part is tricky to follow if you are not part of the lecture.
If a storage that can be accessed via WebDAV7 is available to you, use it. Alternatively, use local storage, see File systems (local remotes).
We can add the remote with the dvc remote add
command.
MLB-DATA$ pdm run dvc remote add -d myremote \
webdavs://sakai.mci4me.at/dav/Course-ID-SLVA-46549/MECH-M-DUAL-2-MLB-DATA
Setting 'myremote' as a default remote.
This creates the file .dvc/config with the following content (a TOML file).
[core]
    remote = myremote
['remote "myremote"']
    url = webdavs://sakai.mci4me.at/dav/Course-ID-SLVA-46549/MECH-M-DUAL-2-MLB-DATA
Due to the server structure we need to limit the number of parallel processes for synchronizing the content. In order to do so, we limit the default value for the number of jobs:
MLB-DATA$ pdm run dvc remote modify myremote jobs 4
This will result in the following TOML file:
[core]
    remote = myremote
['remote "myremote"']
    url = webdavs://sakai.mci4me.at/dav/Course-ID-SLVA-46549/MECH-M-DUAL-2-MLB-DATA
    jobs = 4
As this issue only occurred late in the project, the last line will not show up in commit afe0848 but only after 0a98a09.
To nicely separate the sensitive access information, there also exists a .dvc/config.local file that is covered by the .gitignore and will not be committed. We add our user information to this file, either directly or via commands in the terminal:
MLB-DATA$ pdm run dvc remote modify --local myremote user ***
MLB-DATA$ pdm run dvc remote modify --local myremote password ***
(the user name is given without the @mci4me.at part).
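The resulting .dvc/config.local might then look like this (values elided; the exact layout is an assumption based on the config file shown above):

['remote "myremote"']
    user = ***
    password = ***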
To handle WebDAV, dvc requires the package dvc-webdav; install it via pdm. See afe0848 on how this is reflected in our reference project.
Now we can run dvc push
and our data is stored remotely,
MLB-DATA$ pdm run dvc push
Collecting |14.0 [00:00, 381entry/s]
Pushing
10 files pushed
And of course, we can run dvc pull to get the files on another computer.
see the lecture MECH-M-DUAL-1-SWD, Chapter 3 or follow the direct link↩︎
see the lecture MECH-M-DUAL-1-SWD, Chapter 2 or follow the direct link↩︎
see the lecture MECH-M-DUAL-1-SWD, Chapter 11 or follow the direct link↩︎
see the lecture MECH-M-DUAL-1-SWD, Chapter 3 or follow the direct link↩︎
The long version uses some options for git log, precisely: git log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(bold yellow)%d%C(reset)' --all↩︎