4  Model persistence

So far, we have either loaded a dataset or generated it on the fly. Now we turn to ways to persist the models we generate.

The general idea is simple: store the object we generate and load it again at some later time. Nevertheless, this can be quite tricky, as we will see in the following.

For example, we might do our training in a different environment than the inference or prediction. We might even switch programming languages between these tasks to extract the best performance.

As we mainly work with scikit-learn, we introduce the concepts with it. Our first step is to check the documentation, docs - model persistence, which serves as a reference for this introduction. We cover several possibilities; all of them have strengths and weaknesses, and unfortunately there is no gold standard.

Let us use the following toy example, see Listing 4.1, with our cats and dogs.

Listing 4.1: Code for the toy example
import numpy as np
import scipy.io
import requests
import io
import sklearn.metrics
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier, VotingClassifier 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/catData_w.mat")
cats_w = scipy.io.loadmat(io.BytesIO(response.content))["cat_wave"]

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/dogData_w.mat")
dogs_w = scipy.io.loadmat(io.BytesIO(response.content))["dog_wave"]

X_train = np.concatenate((cats_w[:60, :], dogs_w[:60, :]))
y_train = np.repeat(np.array([1, -1]), 60)
X_test = np.concatenate((cats_w[60:80, :], dogs_w[60:80, :]))
y_test = np.repeat(np.array([1, -1]), 20)

voting_clf = make_pipeline(
    PCA(n_components=41),
    VotingClassifier(
        estimators=[
            ("lda", LinearDiscriminantAnalysis()),
            ("rf", RandomForestClassifier(
                n_estimators=500,
                max_leaf_nodes=2,
                random_state=6020)),
            ("svc", SVC(
                kernel="linear",
                probability=True,
                random_state=6020)),
        ],
        flatten_transform=False,
    )
)

voting_clf.fit(X_train, y_train)
score = voting_clf.score(X_test, y_test)
print(f"We have a hard voting score of {score}")
We have a hard voting score of 0.8
Note

In the next couple of exercises we create different versions of our model and persist them to storage. Try to keep track of which model version corresponds to which exercise/code block.

Important

We need to install several packages for the following exercises.

All of them can be installed via:

pdm add skl2onnx onnxruntime skops
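If you are not using pdm, installing the same packages via pip should work as well:

pip install skl2onnx onnxruntime skops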

4.1 Open Neural Network Exchange - ONNX

ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers.

Source: https://onnx.ai/, accessed 07.03.2025.

The use case for ONNX is to persist a model without necessarily keeping the Python object itself. This is especially useful when the runtime for distributing the model is not Python, or is a very restricted Python environment.

Now let us see how we can persist the model of Listing 4.1 via ONNX.

from skl2onnx import to_onnx
onx = to_onnx(voting_clf, X_train[:1].astype(np.int64))  # (1)
with open("model.onnx", "wb") as f:
    f.write(onx.SerializeToString())

(1) Not all data types are supported, so we need to convert our reference training sample to int64 (potentially increasing storage demands).

The file format is binary, so it does not make much sense to read the resulting file as plain text, but we can have a look at its size.
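On a Unix-like system, a command like the following would show it (presumably how the number below was obtained):

du -h model.onnx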

80K model.onnx

Unfortunately, there is no method to convert the ONNX model back into our scikit-learn object. What we can do is use the onnxruntime and see if we still get the same score as before.

import onnxruntime as ort

model = ort.InferenceSession("model.onnx")
input_name = model.get_inputs()[0].name
predictions = model.run(None, {input_name: X_test.astype(np.int64)})

score = sklearn.metrics.accuracy_score(y_test, predictions[0])
print(f"We have a test_score of for {score} for the recovered model.")
predictions = model.run(None, {input_name: X_train.astype(np.int64)})
score = sklearn.metrics.accuracy_score(y_train, predictions[0])
print(f"We have a train_score of for {score} for the recovered model.")
We have a test_score of for 0.825 for the recovered model.
We have a train_score of for 1.0 for the recovered model.

As we can see, the test score is actually better than before, which is odd and definitely not intended.

Important

This is due to the fact that skl2onnx is not able to convert all scikit-learn models exactly. This is especially true for the SVC class included in our composite model. Therefore, the class is stored with the same weights but slightly different parameters.

Furthermore, if we inspect our predictions output from above a bit more, it looks like we have switched to soft voting; a quick check follows below.
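To see this, we can inspect the outputs the ONNX session exposes; a quick sketch (output names and the exact probability format depend on the converter version):

for out in model.get_outputs():
    print(out.name, out.shape)

# the second output typically carries per-class scores; with hard voting in the
# original model we would not expect such probabilities to drive the label
label, proba = model.run(None, {input_name: X_test.astype(np.int64)})
print(proba[:2])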

Overall, we can see that ONNX is a way to persist a model such that we can make predictions with it, but we no longer have the Python object. Of course, it is possible to write our own converter and persist our required models in a better state, see the sklearn-onnx docs. Regarding file size, we can already say it is efficient, and it provides some independence from our training environment.

Exercise 4.1 (Test how the recovery works for SVC) Try to rewrite the model and compare the resulting score after recovery with the original score for the following modifications (a starter sketch for the third one follows below).

  1. Remove the probability=True for SVC.
  2. Replace SVC by LinearSVC.
  3. Remove the SVC altogether and replace it with a LogisticRegression classifier.
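One possible starting point for the third modification, a sketch reusing the setup from Listing 4.1 (names like voting_clf_lr and model_lr.onnx are our own choices):

from sklearn.linear_model import LogisticRegression

# same pipeline as Listing 4.1, with SVC replaced by LogisticRegression
voting_clf_lr = make_pipeline(
    PCA(n_components=41),
    VotingClassifier(
        estimators=[
            ("lda", LinearDiscriminantAnalysis()),
            ("rf", RandomForestClassifier(
                n_estimators=500, max_leaf_nodes=2, random_state=6020)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        flatten_transform=False,
    ),
)
voting_clf_lr.fit(X_train, y_train)
print(f"Original score: {voting_clf_lr.score(X_test, y_test)}")

onx_lr = to_onnx(voting_clf_lr, X_train[:1].astype(np.int64))
with open("model_lr.onnx", "wb") as f:
    f.write(onx_lr.SerializeToString())

sess = ort.InferenceSession("model_lr.onnx")
pred = sess.run(None, {sess.get_inputs()[0].name: X_test.astype(np.int64)})[0]
print(f"Recovered score: {sklearn.metrics.accuracy_score(y_test, pred)}")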

4.2 pickle - Python object serialization

We can also swing the pendulum in the other direction and use the Python standard library pickle to persist our model.

Before we go into more details, we should emphasise the potential security problem we introduce with pickle as stated in its own docs:

The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

Consider signing data with hmac if you need to ensure that it has not been tampered with.

Safer serialization formats such as json may be more appropriate if you are processing untrusted data, see Comparison with json.
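As an illustration of the hmac advice above, a minimal sketch of signing a pickled payload with a shared secret before storing it, and verifying the signature before unpickling (the key handling here is purely illustrative):

import hmac
import hashlib
from pickle import dumps, loads

SECRET_KEY = b"replace-with-a-real-secret"  # illustrative only, manage keys properly

payload = dumps(voting_clf, protocol=5)
signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()

# later, before unpickling: recompute the digest and compare in constant time
expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()
if hmac.compare_digest(signature, expected):
    clf = loads(payload)
else:
    raise ValueError("payload integrity check failed, refusing to unpickle")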

pickle is the native serialization mechanism in Python. It is easy to use and works for (almost) all models and configurations. The downside is that we need to absolutely trust the source of our model, i.e. every step from the storage provider, through the network, to our own infrastructure.

Furthermore, the environment we load the model into needs to be the same as the one we stored it from. As we have already seen how the dependency hell1 influences our development, we import these issues together with the pickle file.

Notably, this implies that a model is not guaranteed to load with a different scikit-learn version, let alone a different numpy version, which is only a sub-dependency of scikit-learn. Furthermore, if different hardware is involved, there might be problems as well, e.g. a different integer or float architecture. As a consequence, if we use pickle, thorough version control with package management is key!
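A simple (hypothetical) safeguard is to store the relevant version numbers next to the artefact and compare them before loading:

import json
import platform

versions = {
    "python": platform.python_version(),
    "numpy": np.__version__,
    "scikit-learn": sklearn.__version__,
}
with open("model.versions.json", "w") as f:
    json.dump(versions, f, indent=2)

# before loading the pickle elsewhere, read this file and compare it
# against the versions installed in the target environment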

If a model moves between different processes via the disc, or is restored frequently from storage, the performance of loading and storing becomes interesting; in this case we can use joblib as a more performant alternative.
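A minimal sketch of the joblib round trip; the mmap_mode argument on load enables memory mapping, so multiple processes can share the array data:

from joblib import dump, load

dump(voting_clf, "model.joblib")
clf_joblib = load("model.joblib", mmap_mode="r")
print(clf_joblib.score(X_test, y_test))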

Now let us see how we can persist the model of Listing 4.1 as a pickle file.

from pickle import dump
with open("model.pkl", "wb") as f:
    dump(voting_clf, f, protocol=5)

Again, the file format is binary, so it does not make much sense to read the file as plain text, but we can have a look at the size

304K    model.pkl

and we can see that the storage demands are higher than for ONNX.

We restore the model via

from pickle import load
with open("model.pkl", "rb") as f:
    clf = load(f)
score = clf.score(X_test, y_test)
print(f"We have test_score of {score} after loading the object again.")
score = clf.score(X_train, y_train)
print(f"We have train_score of {score} after loading the object again.")
We have test_score of 0.8 after loading the object again.
We have train_score of 1.0 after loading the object again.

As we can see, the score stays the same and we can deal with the loaded object in the same way as with the original.

Exercise 4.2 (Further investigations for pickle)  

  1. For the loaded model, switch to soft voting by calling

    clf[1].voting = "soft"
    clf[1].named_estimators["svc"].probability = True
    clf.fit(X_train, y_train)
    clf.score(X_test, y_test)
  2. Use joblib to persist and load the model; also check the file size.

  3. Switch the SVC to an "rbf" kernel and see if you can fully recover the object.

  4. Some user-defined functions can cause problems for pickle. Try persisting the model with cloudpickle and test with the user-defined kernel function rbf = lambda x, y: np.exp(1e-2 * np.abs(x@y.T)) (the sketch below shows the basic round trip).
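For the last item, a sketch of the basic cloudpickle round trip with the given kernel (assuming cloudpickle is installed; fitting details and the resulting scores are left to the exercise):

import cloudpickle

rbf = lambda x, y: np.exp(1e-2 * np.abs(x @ y.T))
svc_custom = SVC(kernel=rbf).fit(X_train, y_train)

# plain pickle cannot serialize the lambda; cloudpickle serializes it by value
with open("model_custom.pkl", "wb") as f:
    cloudpickle.dump(svc_custom, f)
with open("model_custom.pkl", "rb") as f:
    restored = cloudpickle.load(f)
print(restored.score(X_test, y_test))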

4.3 skops.io - the more secure Python alternative

As an alternative to pickle we can use skops.io. It is developed as a secure alternative to pickle and still supports a wide range of objects. The main idea is that only trusted functions are loaded, not everything included in the file. It is also possible to verify our data before loading it into our program, increasing the security further. Still, it returns the Python object, if it can be loaded, and we can manipulate it in the same fashion as with pickle.

As a downside, the process is slower, and some user-defined functions/objects might not work as desired. Furthermore, we need the same environment for loading as we had for storing the Python object, similar to pickle.

The interface itself is simple and follows pickle.

import skops.io as sio
sio.dump(voting_clf, "model.skops")

For comparison, we show the size of the file

26M model.skops

and we can see that this format has a significantly higher overhead than the other formats.

Retrieving the model is a two-step process: first we load the untrusted types, then we load the verified object.

unknown_types = sio.get_untrusted_types(file="model.skops")  # (1)
for i, a in enumerate(unknown_types):
    print(f"Unknown type at {i} is {a}.")

clf = sio.load("model.skops", trusted=unknown_types)
score = clf.score(X_test, y_test)
print(f"We have a test_score of {score} after loading the object again.")
score = clf.score(X_train, y_train)
print(f"We have a train_score of {score} after loading the object again.")

(1) We should investigate the contents of unknown_types, and only load if we trust everything we see.

Unknown type at 0 is sklearn.utils._bunch.Bunch.
We have a test_score of 0.8 after loading the object again.
We have a train_score of 1.0 after loading the object again.

Exercise 4.3 (Further investigations for skops.io)  

  1. The included PCA works with float64; this is not necessary. Can we reduce the file size by switching to float16? Hint: look at voting_clf[0].components_ (a starting point is sketched below).

  2. Apply the self-defined kernel from Exercise 4.2 and test the load/recover cycle.
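For the first item, a possible starting point (hypothetical; whether the downstream estimators cope with float16 is part of the exercise):

import copy

clf16 = copy.deepcopy(voting_clf)
# cast the fitted PCA attributes to float16 before persisting
clf16[0].components_ = clf16[0].components_.astype(np.float16)
clf16[0].mean_ = clf16[0].mean_.astype(np.float16)
sio.dump(clf16, "model_f16.skops")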

4.4 Comparison of the different approaches

The docs are doing an excellent job in summarizing the key differences.

Based on the different approaches for model persistence, the key points for each approach can be summarized as follows:

  1. ONNX: It provides a uniform format for persisting any machine learning or deep learning model (other than scikit-learn) and is useful for model inference (predictions). It can however, result in compatibility issues with different frameworks.

  2. skops.io: Trained scikit-learn models can be easily shared and put into production using skops.io. It is more secure compared to alternate approaches based on pickle because it does not load arbitrary code unless explicitly asked for by the user. Such code needs to be packaged and importable in the target Python environment.

  3. joblib: Efficient memory mapping techniques make it faster when using the same persisted model in multiple Python processes when using mmap_mode="r". It also gives easy shortcuts to compress and decompress the persisted object without the need for extra code. However, it may trigger the execution of malicious code when loading a model from an untrusted source as any other pickle-based persistence mechanism.

  4. pickle: It is native to Python and most Python objects can be serialized and deserialized using pickle, including custom Python classes and functions as long as they are defined in a package that can be imported in the target environment. While pickle can be used to easily save and load scikit-learn models, it may trigger the execution of malicious code while loading a model from an untrusted source. pickle can also be very efficient memorywise if the model was persisted with protocol=5 but it does not support memory mapping.

  5. cloudpickle: It has comparable loading efficiency as pickle and joblib (without memory mapping), but offers additional flexibility to serialize custom Python code such as lambda expressions and interactively defined functions and classes. It might be a last resort to persist pipelines with custom Python components such as a sklearn.preprocessing.FunctionTransformer that wraps a function defined in the training script itself or more generally outside of any importable Python package. Note that cloudpickle offers no forward compatibility guarantees and you might need the same version of cloudpickle to load the persisted model along with the same version of all the libraries used to define the model. As the other pickle-based persistence mechanisms, it may trigger the execution of malicious code while loading a model from an untrusted source.

Source: scikit-learn.org, accessed 07.03.2025.
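To make the size trade-offs from this list concrete, a small sketch that persists the model from Listing 4.1 with each mechanism and prints the resulting file sizes (the file names are our own; exact numbers will vary with library versions):

import os
import pickle
import joblib
import cloudpickle
import skops.io as sio
from skl2onnx import to_onnx

with open("cmp.pkl", "wb") as f:
    pickle.dump(voting_clf, f, protocol=5)
joblib.dump(voting_clf, "cmp.joblib")
with open("cmp.cloudpkl", "wb") as f:
    cloudpickle.dump(voting_clf, f)
sio.dump(voting_clf, "cmp.skops")
with open("cmp.onnx", "wb") as f:
    f.write(to_onnx(voting_clf, X_train[:1].astype(np.int64)).SerializeToString())

# print the size of each artefact in KiB
for path in ["cmp.pkl", "cmp.joblib", "cmp.cloudpkl", "cmp.skops", "cmp.onnx"]:
    print(f"{path:14s} {os.path.getsize(path) / 1024:10.1f} KiB")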

4.5 Further considerations

Now that we know how to persist our models, or at least hope to do so, we need to talk about how we keep track of our different model versions (parameters, training data, random seeds, etc.).

In the previous exercises we created multiple versions of our model and stored them to disc. If we now look at the different files, do we still know which version corresponds to which code block?

As we experiment with different parameters for our composite method - in pursuit of better results - we'll likely generate even more model variations. To ensure reproducibility, we need a way to track our models alongside the code and parameters that produce them. This is what we are going to look into in the next section.
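As a teaser for that, a hypothetical minimal bookkeeping scheme: record a content hash and the hyperparameters next to each persisted artefact (the file name model_registry.json and the helper register_model are our own choices):

import hashlib
import json

def register_model(path, estimator, registry="model_registry.json"):
    """Append a metadata record for a persisted model (illustrative only)."""
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    try:
        with open(registry) as f:
            records = json.load(f)
    except FileNotFoundError:
        records = []
    records.append({
        "file": path,
        "sha256": digest,
        "params": {k: repr(v) for k, v in estimator.get_params().items()},
    })
    with open(registry, "w") as f:
        json.dump(records, f, indent=2)

register_model("model.pkl", voting_clf)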


  1. see MECH-M-DUAL-1-SWD, Section 4.1↩︎