5  Code persistence

When we talk about storing a model, we quickly realise that there is more to the task than simply calling pickle.dump. The same is true for persisting the model, data, and code together.

Luckily, we already have the perfect tool for controlling the code in use: git. We can make sure to commit our source files, and when using a proper package manager like pdm we can hope to reproduce our environment for the formats that require these features.

To illustrate this, we move the code from Listing 4.1 into a project. In this process we move from the dense script to a proper file structure by splitting up the source script into several parts. Furthermore, we rework some of the details, like only downloading the data once.

Reference repository

The following example refers to the GitHub repository kandolfp/MECH-M-DUAL-2-MLB-DATA.

Please use it as a reference and move along the corresponding commit SHAs to see certain reference points, as we continuously rework the code to adapt to new requirements and situations.

This also means some of the code included below is not created interactively but is static with regard to execution time and state.

To emphasise that we are in the context of this reference repository, the prompt for the terminal will always start with MLB-DATA$.

The code is called with respect to the root directory of the project, as seen in the next output block.

If you decide not to write your own code but rather use the reference repository for the following exercises, it is advisable to implement each exercise at the SHA mentioned before it, in order to avoid getting ahead of yourself.

After the redesign we get a structure looking something like the following, see 22788a6 for reference and file content:

MLB-DATA$ tree

.
├── pdm.lock
├── pyproject.toml
└── src
    └── MECH-M-DUAL-2-MLB-DATA
        ├── data.py
        ├── inference.py
        ├── model.py
        ├── myio.py
        └── train.py

3 directories, 7 files

We can train the model by calling train.py (in this case with a logger set to DEBUG).

MLB-DATA$ pdm run src/MECH-M-DUAL-2-MLB-DATA/train.py 

DEBUG:root: Loaded the data with Split of 60 to 20 per category.
DEBUG:root: Create classifier
DEBUG:root: Train classifier
DEBUG:root: Score classifier
INFO:root: We have a hard voting score of 0.8
DEBUG:root: Save clf to skops file models/model.skops

Of course we can also load the model again and do inference with it by calling inference.py.

MLB-DATA$ pdm run src/MECH-M-DUAL-2-MLB-DATA/inference.py 

DEBUG:root: Loaded the data with Split of 60 to 20 per category.
DEBUG:root: Load classifier
DEBUG:root: Load clf from skops file models/model.skops
WARNING:root: Unknown type at 0 is sklearn.utils._bunch.Bunch.
DEBUG:root: Score classifier
INFO:root: We have a hard voting score of 0.8

Now we can start connecting the model and the code. Our model was created at commit 22788a6 and stored in the directory models.

Now there are several things we can do to make sure this is reflected within our little project.

  1. We can share experiments and models with other people so they can reproduce them.
  2. We can make it a convention to never train and store an experiment for later use as long as we have uncommitted changes in our code (easier said than done).
  3. We can make sure to note somewhere which commit SHA is the current HEAD when storing the model. This allows us to reproduce the model in case of data corruption or loss, and to compare it with other models.
  4. When running inference we can check for the commit SHA and switch to it in case the project dependencies have changed and we get problems loading the file.
  5. Every time we change a parameter we get a new commit, which is not very nice for our git history but can be used to note the intent of the experiment in the commit message.

This is rather cumbersome and requires a lot of discipline. It also becomes tricky if several people work on the same project and run experiments with different parameters.
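
Point 3, for instance, can be sketched in a few lines. The following helper is hypothetical and not part of the reference repository; it assumes the git binary is available and simply stores the HEAD SHA next to the model:

```python
import json
import subprocess

def current_head(repo_dir: str = ".") -> str:
    # `git rev-parse HEAD` prints the SHA of the commit HEAD points to
    result = subprocess.run(
        ["git", "-C", repo_dir, "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

def save_model_metadata(path: str, sha: str) -> None:
    # store the commit SHA alongside the model so the exact code state
    # can be checked out again later
    with open(path, "w") as fh:
        json.dump({"git_head": sha}, fh)

# in a training script one could call, for example:
# save_model_metadata("models/model.json", current_head())
```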

Exercise 5.1 (Check for a dirty git repository) Implement a function (or a decorator) in Python that uses GitPython (alternatives or the plain shell can also be used) to introduce a safeguard such that training can only be called if the repository is not dirty.

Optional: Extend the implementation and check if the local repository is not behind the remote.
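
One way to start: a minimal sketch of such a safeguard, here using the plain-shell variant (git status --porcelain prints nothing for a clean work tree); with GitPython the check would be repo.is_dirty() instead. The check parameter is our own addition so the guard can be tested without a repository:

```python
import functools
import subprocess

def require_clean_repo(check=None):
    """Only allow the wrapped function to run if the repository is clean."""
    def default_check() -> bool:
        # `git status --porcelain` is empty for a clean work tree
        out = subprocess.run(
            ["git", "status", "--porcelain"],
            capture_output=True, text=True, check=True,
        ).stdout
        return bool(out.strip())  # any output means uncommitted changes

    is_dirty = check or default_check

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if is_dirty():
                raise RuntimeError("Repository is dirty, refusing to train.")
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Decorating the training entry point with @require_clean_repo() would then refuse to run while uncommitted changes exist.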

5.1 Externalize the parameters/configuration

The first thing we do is externalize the configuration to make sure it is no longer part of our main source code. This way we can distinguish between actual changes to the structure and source, and a commit for a simple experiment.

yaml is the format of choice for these aspects, see Wikipedia. It is a human-readable data serialization language. One possible interpretation of the config (among many others) can be seen in the next code block.

PCA:
  type: sklearn.decomposition.PCA
  init_args:
    n_components: 41

VotingClassifier:
  type: sklearn.ensemble.VotingClassifier
  init_args:
    flatten_transform: False
  estimators:
    - LinearDiscriminantAnalysis
    - RandomForestClassifier
    - SVC

LinearDiscriminantAnalysis:
  type: sklearn.discriminant_analysis.LinearDiscriminantAnalysis
  init_args:
    solver: svd

RandomForestClassifier:
  type: sklearn.ensemble.RandomForestClassifier
  init_args:
    n_estimators: 500
    max_leaf_nodes: 2
    random_state: 6020

SVC:
  type: sklearn.svm.SVC
  init_args:
    kernel: linear
    probability: True
    random_state: 6020

We can use the Python package omegaconf to load and use it.

One feature of the OmegaConf class is that we can use unpacking and therefore write a line like:

PCA(**params["PCA"].init_args)

dramatically simplifying our code. Have a look at commit 9f7dead to see this implemented and in action. As our default parameter template we include the above parameters as params.yaml in the main directory of our project.

Exercise 5.2 (Externalize params) We can generate the entire model from the params, even the different classes. By using the function from importlib import import_module we can dynamically load a class with the following snippet:

module = import_module(params["PCA"].type.rsplit(".", 1)[0])
PCA = getattr(module, params["PCA"].type.rsplit(".", 1)[-1])(
    **params["PCA"].init_args
)

Use this to make the model creation more and more dynamic.

  1. Replace the array estimators=[] by dynamically loading the different estimators.
  2. Use the same approach for the pipeline.

Note: If you see an advantage in rewriting the config structure to make your code easier feel free to do so.

Now that the config is externalized we can continue on our quest to persist all aspects of this project together in a useful way.

5.2 Data persistence

Our model depends on the code and the configuration, but crucially also on the input data itself. In order to make sure that we can reliably reproduce a model, we also need to make sure our data is reproducible.

In our reference project we use some files from GitHub (see Brunton and Kutz (2022) for reference) but let us still make sure they are tracked within our system.

Important

We use a simple tool in the next paragraphs to illustrate the concepts. There are plenty of alternatives that can be and are used.

The selection of dvc is purely motivated by the following features: it can easily be used in a lecture such as ours, has some basic features that illustrate the requirements on such systems, integrates nicely with Python, and has only limited dependencies outside of the Python ecosystem.


Like always, the best platform depends on the project and the available infrastructure.

One tool for data version control is dvc. As it is written in Python, we can even add and track its version via our package manager: pdm add dvc. Once installed, we can initialize it in our project.

MLB-DATA$ pdm run dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

To make the changes permanent we commit the directory .dvc (with all the files included) to git and therefore our project now runs with dvc, see commit df38086 for reference.

To add the data directory simply run

MLB-DATA$ pdm run dvc add data

100% Adding...|███████████████████████████████████████|1/1 [00:00,  5.67file/s]

To track the changes with git, run:

        git add data.dvc

To enable auto staging, run:

        dvc config core.autostage true

Again, to make it permanent in the project we need to add data.dvc to git (as suggested).

Note

At this point we have the two files catData_w.mat and dogData_w.mat in this directory and they are under version control from dvc.

If we take a look into data.dvc we can see that it tracks the files via md5 hashes and includes some additional useful information.

MLB-DATA$ cat data.dvc

outs:
- md5: 5987e80830fc2caf6d475da3deca1dfe.dir
  size: 111165
  nfiles: 2
  hash: md5
  path: data
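
The md5 hashing used here can be illustrated in a couple of lines. This is only the general idea; dvc's actual directory hashes (the .dir entries) are computed over a manifest of the contained files:

```python
import hashlib

def md5_of_bytes(data: bytes) -> str:
    # dvc identifies file contents by their md5 digest, conceptually like this
    return hashlib.md5(data).hexdigest()

print(md5_of_bytes(b"hello"))  # 5d41402abc4b2a76b9719d911017c592
```

Identical content always yields the identical hash, which is what allows dvc to detect changes and deduplicate files in its cache.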

As mentioned, dvc works similarly to git, so eventually we will need to include a remote that we push data to. For now we just work locally, as we could also do for a git repository.

Other than that, we can now change the files, run dvc add data again, and as soon as we commit the corresponding change of data.dvc to git, we know exactly what data is used and we can also restore it.

To do these operations dvc uses a cache (default location .dvc/cache) and the files in the work space are links into the cache.

The most important dvc commands are (we link the docs for an extended reference):

  • dvc add to add a file or directory.
  • dvc checkout brings your work space up to date, according to the current state of the .dvc files.
  • dvc commit updates the .dvc files and stores the content in the cache; most of the time it is called implicitly.
  • dvc config to view and change the config for the repo, or globally for dvc on the system.
  • dvc data status shows changes to the files in the work space with respect to the git HEAD.
  • dvc destroy removes all dvc files and structures from the current project, including the cache. The symlinks are replaced by the actual data, so the current state is preserved.
  • dvc exp has multiple subcommands and is used to handle experiments; we will use this command later.
  • dvc fetch downloads files from the remote repository into the cache.
  • dvc pull downloads files from the remote and makes them visible in the work space.
  • dvc push uploads the tracked files to the remote.

For the other commands run dvc --help or look at the docs.

Note

As can be seen from the commands, dvc was built with git in mind and feels quite similar. This means it uses the same commands for the same (or almost the same) operations. Unfortunately or luckily (depending on your preferences), it also brings in the sometimes confusing command structure and concepts like the work space.

Recall the introduction to git for some of these concepts. We will use this to also recall some details about git and, hopefully, further foster the understanding.

Now our files are tracked, but as you probably realised we did not add the models folder to dvc. This is because we can use the dvc exp feature to allow for more fine-grained control and even parameter overviews. Furthermore, we can use logging features to integrate with this system even better.

dvc also allows for advanced pipelines (we look at a small example later) and automatic computation as well as monitoring. In all its facets this is quite advanced and can be introduced when our project grows.

5.3 dvclive for experiment management

dvclive works best with the big ML frameworks like keras or pytorch, but we can also utilize it for our example project. An introduction to experiment management from the dvc perspective can be found in the docs.

To show some of the dvclive features we reworked the code, see commit 519a854 for the changes (including pdm add dvclive). Now, when we run our job once more, it creates the dvclive directory with a couple of subdirectories containing our metrics, as shown in the following output.

MLB-DATA$ pdm run src/MECH-M-DUAL-2-MLB-DATA/train.py

INFO:root: We have a hard voting train-score of 1.0
INFO:root: We have a hard voting test-score of 0.8
100% Adding...|███████████████████████████████████████|1/1 [00:00,  7.64file/s]
MLB-DATA$ pdm run python src/MECH-M-DUAL-2-MLB-DATA/train.py

DEBUG:root: Loaded the data with Split of 60 to 20 per category.
DEBUG:root: Load config
DEBUG:root: Create classifier
DEBUG:root: Train classifier
DEBUG:root: Score classifier
INFO:root: We have a hard voting train-score of 1.0
INFO:root: We have a hard voting test-score of 0.8
DEBUG:root: Save clf to skops file dvclive/artifacts/model.skops
100% Adding...|███████████████████████████████████████|1/1 [00:00,  9.35file/s]
DEBUG:scmrepo.git: Stashing workspace
DEBUG:scmrepo.git.stash: Stashing changes in 'refs/stash'
DEBUG:scmrepo.git: Detaching HEAD at 'HEAD'
DEBUG:scmrepo.git.stash: Applying stash commit '7d0806908'
Collecting files and computing hashes in data        |0.00 [00:00,     ?file/s]
DEBUG:fsspec.memoryfs: open file /.UFCO7_lfUrOd1sAFVpO-vw.tmp
DEBUG:fsspec.memoryfs: info: memory://.UFCO7_lfUrOd1sAFVpO-vw.tmp
Collecting files and computing hashes in data        |0.00 [00:00,     ?file/s]
DEBUG:fsspec.memoryfs: open file /.4cDC9-sZfddrOSDnK8EeJA.tmp
DEBUG:fsspec.memoryfs: info: memory://.4cDC9-sZfddrOSDnK8EeJA.tmp
DEBUG:scmrepo.git.stash: Stashing changes in 'refs/exps/stash'
DEBUG:scmrepo.git: Restore HEAD to 'main'
DEBUG:scmrepo.git: Restoring stashed workspace
DEBUG:scmrepo.git.stash: Popping from stash 'refs/stash'
DEBUG:scmrepo.git.stash: Applying stash commit '7d0806908'
DEBUG:scmrepo.git.stash: Dropping 'refs/stash@{0}'
DEBUG:scmrepo.git.stash: Dropping 'refs/exps/stash@{0}'
DEBUG:scmrepo.git: Detaching HEAD at 'f264314e5442315c74c395a89c2c73a3b7269f90'
DEBUG:scmrepo.git.stash: Applying stash commit 'adac86009'
DEBUG:fsspec.memoryfs: open file /.cSw5cWuYdyH1AkprvFGF8g.tmp
DEBUG:fsspec.memoryfs: info: memory://.cSw5cWuYdyH1AkprvFGF8g.tmp
Collecting files and computing hashes in data         0.00 [00:00,     ?file/s]
DEBUG:fsspec.memoryfs: open file /.-QV8fKiPGOS3tR1JS0SE2w.tmp
DEBUG:fsspec.memoryfs: info: memory://.-QV8fKiPGOS3tR1JS0SE2w.tmp
Collecting files and computing hashes in data        |0.00 [00:00,     ?file/s]
DEBUG:fsspec.memoryfs: open file /.nJTqHs5Fi91gZyK82VYbvA.tmp
DEBUG:fsspec.memoryfs: info: memory://.nJTqHs5Fi91gZyK82VYbvA.tmp
DEBUG:scmrepo.git: Restore HEAD to 'main'

Here we can also see how dvc interacts with git to store the files.

And the resulting files are stored in dvclive.

MLB-DATA$ tree dvclive/

dvclive/
├── artifacts
│   └── model.skops
├── metrics.json
├── params.yaml
└── plots
    ├── metrics
    │   ├── testscore.tsv
    │   └── trainscore.tsv
    └── sklearn
        └── confusion_matrix.json

5 directories, 6 files

The experiment is automatically added. We can check this with dvc exp show.

MLB-DATA$ pdm run dvc exp show

 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  Experiment                 Created        trainscore   testscore   PCA/n_components   LinearDiscriminantAnalysis/solver   RandomForestClassifier/n_estimators   RandomForestClassifier/max_leaf_nodes   RandomForestClassifier/random_state   SVC/kernel   SVC/probability   SVC/random_state   PCA.type                    PCA.init_args.n_components   VotingClassifier.type               VotingClassifier.init_args.flatten_transform   VotingClassifier.estimators                                       LinearDiscriminantAnalysis.type                            LinearDiscriminantAnalysis.init_args.solver   RandomForestClassifier.type               RandomForestClassifier.init_args.n_estimators   RandomForestClassifier.init_args.max_leaf_nodes   RandomForestClassifier.init_args.random_state   SVC.type          SVC.init_args.kernel   SVC.init_args.probability   SVC.init_args.random_state  
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  workspace                  -                       1         0.8   41                 svd                                 500                                   2                                       6020                                  linear       True              6020               sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                        
  main                       Mar 19, 2025            -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                        
  └── 9e83c73 [sural-cyma]   10:03 AM                1         0.8   41                 svd                                 500                                   2                                       6020                                  linear       True              6020               sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                        
 ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
Note

dvc exp show provides an overview of the experiments based on the git commit SHAs. The output can be counter-intuitive. Check out the options (dvc exp show --help), especially --num and -A, to control where to search for experiments.

Each experiment has a unique name; as we did not specify anything, a random name is created, in our case sural-cyma. These names are often fun, but it can be hard to infer meaning from them, and speaking names are most of the time more useful, see the -n option.

It is also possible to provide a commit message to give more details about the experiment.

The overview also shows the scores for our model and the parameters, which lets us quickly compare models.

Note

We can see that the parameters from params.yaml are automatically added, and that our code somewhat duplicated them.

dvclive relies on git to do its magic for files and on the dvc cache for large files. For each experiment a reference inside git is created and the changes are stored there. We can see this in the above logger output or via git.

MLB-DATA$ git lg

| * 1f2bee5 - (29 hours ago) dvc: commit experiment 44136fa35 - Authors
|/  
* 519a854 - (29 hours ago) feat: include dvc-live - Authors
* f264314 - (2 days ago) feat: add data to dvc - Authors
* df38086 - (2 days ago) feat: init dvc - Authors
* 073fb93 - (2 days ago) fixup! feat: externalize params via omegaconfig - Authors
* 9f7dead - (2 days ago) feat: externalize params via omegaconfig - Authors
* 22788a6 - (2 days ago) init: project - Authors
Note that git lg is a shorthand alias, see the introduction to git.

Furthermore, we did not commit our changes to git (bad practice!), but they are stored alongside the experiment, so no information is lost (we can find them in .git/refs, see below).

To clean up, we commit our changes, as we know they work, and then rerun the code for a new experiment, see 519a854. We can remove the previous version with dvc exp rm <name>.

By default, experiments are not moved to the git remote; to do so we need to run dvc exp push <git remote>.

MLB-DATA$ pdm run dvc exp push origin

Collecting                                           |0.00 [00:00,    ?entry/s]
Pushing
Experiment sural-cyma is up to date on Git remote 'origin'.

We can also see this in the git reflog (compare above output).

MLB-DATA$ git reflog

519a854 (HEAD -> main) HEAD@{0}: dvc: Restore HEAD to 'main'
1f2bee5 HEAD@{1}: commit: dvc: commit experiment 44136fa355b3678
519a854 (HEAD -> main) HEAD@{2}: checkout: moving from main to 519a8544c82667
519a854 (HEAD -> main) HEAD@{3}: dvc: Restore HEAD to 'main'
519a854 (HEAD -> main) HEAD@{4}: checkout: moving from main to 519a8544c82667
519a854 (HEAD -> main) HEAD@{5}: commit: feat: include dvc-live
f264314 HEAD@{6}: dvc: Restore HEAD to 'main'
9e83c73 HEAD@{7}: commit: dvc: commit experiment 44136fa355b3678
f264314 HEAD@{8}: checkout: moving from main to f264314e5442315
f264314 HEAD@{9}: dvc: Restore HEAD to 'main'
f264314 HEAD@{10}: checkout: moving from main to f264314e5442315
f264314 HEAD@{11}: dvc: Restore HEAD to 'main'
4afc6af HEAD@{12}: commit: dvc: commit experiment 44136fa355b3678a
f264314 HEAD@{13}: checkout: moving from main to f264314e5442315
f264314 HEAD@{14}: dvc: Restore HEAD to 'main'
f264314 HEAD@{15}: checkout: moving from main to f264314e5442315

In the file system we can look inside (via ls on Linux).

MLB-DATA$ ls .git/refs/exps/51/9a8544c82667cec5356f92ddde77993f0a0e76/

sural-cyma

We can also see that a new file has appeared in our root directory, dvc.yaml, with the following content.

params:
- dvclive/params.yaml
metrics:
- dvclive/metrics.json
plots:
- dvclive/plots/metrics:
    x: step
- dvclive/plots/sklearn/confusion_matrix.json:
    template: confusion
    x: actual
    y: predicted
    title: Confusion Matrix
    x_label: True Label
    y_label: Predicted Label
artifacts:
  model:
    path: dvclive/artifacts/model.skops
    type: model

The content reflects our call and integration with dvclive from the dvc perspective and is called a stage. Furthermore, the file dvclive/artifacts/model.skops.dvc keeps track of the model itself.

Important

In order to show the next feature we need to remove the dvc.yaml and dvclive/artifacts/model.skops.dvc files again, as the current configuration would produce a conflict with the stage we want to introduce next. Please delete these files if you type along.

5.4 dvc pipeline

The pipeline features we are after are part of the dvc stage command, and we add our training call in the following fashion.

MLB-DATA$ pdm run dvc stage add --name train \
          --deps data --deps src/MECH-M-DUAL-2-MLB-DATA/ --deps params.yaml \
          --outs dvclive \
          pdm run python src/MECH-M-DUAL-2-MLB-DATA/train.py

Added stage 'train' in 'dvc.yaml'                                     

To track the changes with git, run:

        git add dvc.yaml dvclive/artifacts/.gitignore
  1. --name defines the name of the stage.
  2. --deps includes the dependencies; dvc keeps track of these files and only reruns the code if there are changes in these files/directories.
  3. --outs defines the output directory to keep track of all the files.
  4. The last line is the command to run to execute the stage.

Before we follow the instructions for the git commit, we make a dvc commit.

MLB-DATA$ pdm run dvc commit

This creates the dvc.lock file associated with the stage. It keeps track of the dependencies (see the explanations above):

schema: '2.0'
stages:
  train:
    cmd: pdm run python src/MECH-M-DUAL-2-MLB-DATA/train.py
    deps:
    - path: data
      hash: md5
      md5: 5987e80830fc2caf6d475da3deca1dfe.dir
      size: 111165
      nfiles: 2
    - path: params.yaml
      hash: md5
      md5: cb73b44317c559fce7c5e035ba5be854
      size: 644
    - path: src/MECH-M-DUAL-2-MLB-DATA/
      hash: md5
      md5: c93360b2cf461f6b2f8e9656882331a7.dir
      size: 14538
      nfiles: 8
    outs:
    - path: dvclive
      hash: md5
      md5: c2bbfd7cb23c3aa8700bc24287b56fee.dir
      size: 5569537
      nfiles: 6
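
These hashes are what allow dvc to decide whether a stage needs to rerun: if the fingerprint of all dependencies still matches the one recorded in dvc.lock, nothing needs to be recomputed. A toy illustration of the idea (not dvc's actual implementation):

```python
import hashlib

def deps_fingerprint(deps: dict) -> str:
    # hash all dependency contents in a stable order, similar in spirit
    # to the md5 entries stored in dvc.lock
    h = hashlib.md5()
    for name in sorted(deps):
        h.update(name.encode())
        h.update(deps[name])
    return h.hexdigest()

# pretend this entry was recorded in dvc.lock after the last run
lock = {"train": deps_fingerprint({"params.yaml": b"n_components: 41"})}

def needs_rerun(stage: str, deps: dict) -> bool:
    return lock.get(stage) != deps_fingerprint(deps)

print(needs_rerun("train", {"params.yaml": b"n_components: 41"}))  # False
print(needs_rerun("train", {"params.yaml": b"n_components: 5"}))   # True
```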

Have a look at 1282a7e9 to see how this is reflected in our reference project.

Now we can put it to action and execute (all) stages and therefore create a new experiment.

MLB-DATA$ pdm run dvc exp run

Reproducing experiment 'mesic-beep'  
Building workspace index                            |16.0 [00:00, 1.23kentry/s]
Comparing indexes                                   |15.0 [00:00, 1.22kentry/s]
Applying changes                                    |0.00 [00:00,     ?file/s]
'data.dvc' didn't change, skipping                 
Stage 'train' didn't change, skipping
                                 
Ran experiment(s): mesic-beep
Experiment results have been applied to your workspace.

As we did not change anything in our configuration (see the dependencies above), dvc is smart enough to basically just copy the experiment. But we can also change the parameters, either in the file directly or interactively, as seen in the next command block.

MLB-DATA$ pdm run dvc exp run --set-param 'PCA.init_args.n_components=5'

Reproducing experiment 'sappy-corm'
Building workspace index                             |4.00 [00:00,  378entry/s]
Comparing indexes                                   |15.0 [00:00, 1.20kentry/s]
Applying changes                                     |6.00 [00:00,   391file/s]
'data.dvc' didn't change, skipping
Running stage 'train':
> pdm run python src/MECH-M-DUAL-2-MLB-DATA/train.py
DEBUG:root: Loaded the data with Split of 60 to 20 per category.
DEBUG:root: Load config
DEBUG:root: Create classifier
DEBUG:root: Train classifier
DEBUG:root: Score classifier
INFO:root: We have a hard voting train-score of 0.9916666666666667
INFO:root: We have a hard voting test-score of 0.6
DEBUG:root: Save clf to skops file dvclive/artifacts/model.skops
Updating lock file 'dvc.lock'

Ran experiment(s): sappy-corm
Experiment results have been applied to your workspace.
Here dvc checks the dependencies, applies the changes, and then runs the command specified in the stage.

The result can be seen in the experiment list, of course under a new commit SHA.

 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  Experiment                 Created        trainscore   testscore   PCA/n_components   LinearDiscriminantAnalysis/solver   RandomForestClassifier/n_estimators   RandomForestClassifier/max_leaf_nodes   RandomForestClassifier/random_state   SVC/kernel   SVC/probability   SVC/random_state   PCA.type                    PCA.init_args.n_components   VotingClassifier.type               VotingClassifier.init_args.flatten_transform   VotingClassifier.estimators                                       LinearDiscriminantAnalysis.type                            LinearDiscriminantAnalysis.init_args.solver   RandomForestClassifier.type               RandomForestClassifier.init_args.n_estimators   RandomForestClassifier.init_args.max_leaf_nodes   RandomForestClassifier.init_args.random_state   SVC.type          SVC.init_args.kernel   SVC.init_args.probability   SVC.init_args.random_state   data                                   params.yaml                        src/MECH-M-DUAL-2-MLB-DATA            
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 
  workspace                  -                 0.99167         0.6   5                  svd                                 500                                   2                                       6020                                  linear       True              6020               sklearn.decomposition.PCA   5                            sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         5987e80830fc2caf6d475da3deca1dfe.dir   0ad678c7c338214916d88a106b4fe90a   c93360b2cf461f6b2f8e9656882331a7.dir  
  main                       01:36 PM                -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         5987e80830fc2caf6d475da3deca1dfe.dir   cb73b44317c559fce7c5e035ba5be854   c93360b2cf461f6b2f8e9656882331a7.dir  
  └── 9c326d3 [sappy-corm]   01:36 PM          0.99167         0.6   5                  svd                                 500                                   2                                       6020                                  linear       True              6020               sklearn.decomposition.PCA   5                            sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         5987e80830fc2caf6d475da3deca1dfe.dir   0ad678c7c338214916d88a106b4fe90a   c93360b2cf461f6b2f8e9656882331a7.dir  
  519a854                    10:15 AM                -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         -                                      -                                  -                                     
  ├── 1b92871 [mesic-beep]   01:26 PM                -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         5987e80830fc2caf6d475da3deca1dfe.dir   cb73b44317c559fce7c5e035ba5be854   c93360b2cf461f6b2f8e9656882331a7.dir  
  └── 1f2bee5 [sural-cyma]   10:15 AM                1         0.8   41                 svd                                 500                                   2                                       6020                                  linear       True              6020               sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         -                                      -                                  -                                     
  f264314                    Mar 19, 2025            -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         -                                      -                                  -                                     
  df38086                    Mar 19, 2025            -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         -                                      -                                  -                                     
  073fb93                    Mar 19, 2025            -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  sklearn.decomposition.PCA   41                           sklearn.ensemble.VotingClassifier   False                                          ['LinearDiscriminantAnalysis', 'RandomForestClassifier', 'SVC']   sklearn.discriminant_analysis.LinearDiscriminantAnalysis   svd                                           sklearn.ensemble.RandomForestClassifier   500                                             2                                                 6020                                            sklearn.svm.SVC   linear                 True                        6020                         -                                      -                                  -                                     
  9f7dead                    Mar 19, 2025            -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  -                           -                            -                                   -                                              -                                                                 -                                                          -                                             -                                         -                                               -                                                 -                                               -                 -                      -                           -                            -                                      -                                  -                                     
  22788a6                    Mar 19, 2025            -           -   -                  -                                   -                                     -                                       -                                     -            -                 -                  -                           -                            -                                   -                                              -                                                                 -                                                          -                                             -                                         -                                               -                                                 -                                               -                 -                      -                           -                            -                                      -                                  -                                     
 ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── 

If we want to restore a specific experiment we can use dvc exp apply; note that this also restores the corresponding code and data.

In order to share experiments we need to push them to both remotes, git and dvc. This is done via:

MLB-DATA$ pdm run dvc exp push origin -A

Experiment sural-cyma is up to date on Git remote 'origin'.
Pushed experiment sappy-corm and mesic-beep to Git remote 'origin'.
Warning: Be careful with changed git SHAs

As we can see, dvc attaches experiments to git SHAs. This is an excellent idea, as a SHA identifies a code state uniquely. Nevertheless, this can backfire.

Some commands and actions change a git SHA; this includes squash merges and rebases. Be careful when using such actions in your dvc repository together with experiment storage.
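Why a SHA pins down a state exactly can be sketched with git's object hashing: a blob's id is the SHA-1 of a short header plus the file content, so any change to the content necessarily yields a different id. A minimal illustration using only hashlib, without invoking git itself:

```python
import hashlib

def git_blob_sha(content: bytes) -> str:
    """Compute the id git would assign to a blob with this content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

original = git_blob_sha(b"n_components: 5\n")
changed = git_blob_sha(b"n_components: 6\n")
print(original != changed)  # any change in content yields a new id
```

Commits hash their metadata (tree, parents, author, timestamps) the same way, which is why a rebase or squash merge produces new commit SHAs even when the file contents are unchanged.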

Exercise 5.3 (dvc queue) Check out the option --queue for dvc exp run. Use it to plan various experiments with different parameters from the params.yaml file.

Run them via dvc queue start.

Exercise 5.4 (dvc visualization) dvc can help us visualize our experiments. Have a look at dvc plots to visualize the experiments and their differences.

5.5 Add a remote for dvc

The biggest piece missing from our example project with dvc is a remote to store our data and make it available for collaboration.

Important

For this lecture we use the storage on Sakai; this also means that this part is tricky to follow if you are not part of the lecture.

If storage that can be accessed via WebDAV7 is available to you, use it. Alternatively, use local storage, see File systems (local remotes).

We can add the remote with the dvc remote add command.

MLB-DATA$ pdm run dvc remote add -d myremote \
    webdavs://sakai.mci4me.at/dav/Course-ID-SLVA-46549/MECH-M-DUAL-2-MLB-DATA

Setting 'myremote' as a default remote.

This creates the file .dvc/config with the following content (an INI-style file, similar to git's own config):

[core]
    remote = myremote
['remote "myremote"']
    url = webdavs://sakai.mci4me.at/dav/Course-ID-SLVA-46549/MECH-M-DUAL-2-MLB-DATA
Important

Due to the server structure we need to limit the number of parallel processes for synchronizing the content. To do so, we lower the default number of jobs:

MLB-DATA$ pdm run dvc remote modify myremote jobs 4

This results in the following config file:

[core]
    remote = myremote
['remote "myremote"']
    url = webdavs://sakai.mci4me.at/dav/Course-ID-SLVA-46549/MECH-M-DUAL-2-MLB-DATA
    jobs = 4

As this issue only occurred late in the project, the last line will not show up in commit afe0848 but only after 0a98a09.

To keep the sensitive access information separate, there also exists a .dvc/config.local file that is listed in .gitignore and will not be committed. We add our user information to this file directly or via commands in the terminal:

MLB-DATA$ pdm run dvc remote modify --local myremote user ***

MLB-DATA$ pdm run dvc remote modify --local myremote password ***

(the user name is given without the @mci4me.at suffix)
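Conceptually, dvc layers these two files: the committed .dvc/config is read first, and the git-ignored .dvc/config.local overrides or extends it, so credentials never enter version control. A rough sketch of that layering with plain dictionaries (an illustration of the principle, not dvc's actual implementation; user and password values are hypothetical):

```python
# Illustration only: layering a shared config with a local, git-ignored one
# so that credentials stay out of version control.
shared = {  # committed in .dvc/config
    "url": "webdavs://sakai.mci4me.at/dav/Course-ID-SLVA-46549/MECH-M-DUAL-2-MLB-DATA",
    "jobs": 4,
}
local = {  # kept only in .dvc/config.local (hypothetical credentials)
    "user": "jane.doe",
    "password": "secret",
}
effective = {**shared, **local}  # local settings win on any conflict
print(effective["user"], effective["jobs"])
```

dvc applies the same precedence across all its config levels (system, global, repo, local), with the local level taking priority.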

To handle WebDAV, dvc requires the package dvc-webdav; install it via pdm. See afe0848 for how this is reflected in our reference project.

Now we can run dvc push and our data is stored remotely:

MLB-DATA$ pdm run dvc push

Collecting                                           |14.0 [00:00,  381entry/s]
Pushing                                   
10 files pushed

And of course, dvc pull gets the files onto another computer.
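What dvc push uploads mirrors the local cache layout: each file is stored content-addressed under its md5 hash, with the first two hex characters used as a directory level; tracked directories additionally get a .dir listing, which is why hashes such as 5987e80830fc2caf6d475da3deca1dfe.dir show up in the experiment table above. A small sketch of the addressing scheme, assuming the dvc 3.x files/md5/ layout:

```python
import hashlib

def cache_path(content: bytes) -> str:
    """Content-addressed path in the style of the dvc 3.x cache/remote layout."""
    digest = hashlib.md5(content).hexdigest()
    # the first two hex characters become a directory, the rest the file name
    return f"files/md5/{digest[:2]}/{digest[2:]}"

print(cache_path(b"some training data"))
```

Because the path depends only on the content, identical files are deduplicated automatically and dvc can verify integrity by re-hashing on pull. This is also a useful starting point for item 3 of Exercise 5.5.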

Exercise 5.5 (dvc remote)  

  1. Sync the data from the remote described here.

  2. On the course Sakai cloud you have write rights in the subfolder students_remote; create a folder with your name and use it as a remote.

    • Synchronize the performed experiments with the remote (dvc and git).
  3. Use a local directory as remote and see if you can make sense of the content, i.e., how it is structured.


  1. see the lecture MECH-M-DUAL-1-SWD, Chapter 3 or follow the direct link↩︎

  2. see the lecture MECH-M-DUAL-1-SWD, Chapter 2 or follow the direct link↩︎

  3. see the lecture MECH-M-DUAL-1-SWD, Chapter 11 or follow the direct link↩︎

  4. see the Python documentation or follow the direct link↩︎

  5. see the lecture MECH-M-DUAL-1-SWD, Chapter 3 or follow the direct link↩︎

  6. The long version uses some options for git log precisely git log --graph --abbrev-commit --decorate --format=format:'%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(bold yellow)%d%C(reset)' --all↩︎

  7. See Wikipedia for a quick reference, accessed on 21.03.2025↩︎