6 Data persistence
In our example the data is downloaded from a remote and added to dvc. This already covers the basic data persistence problem as we described it: we can always verify the content of the data and we know its state. If we run an experiment we see this in the output:
Applying changes |0.00 [00:00, ?file/s]
'data.dvc' didn't change, skipping
If we change the data, we know that we need to update the training as well. Nevertheless, we want to highlight a few additional aspects in this context.
It is quite common, if not the norm, that the model we want to train does not use the raw data but rather processed data. One common pattern to get from the raw data to the processed data is called ETL (extract, transform, load).
In Figure 6.1 we can see an illustration of the pattern and Reis (2022) covers this concept as well.
The main idea is that, in order to create a report, train a model, or perform any other ML task, the raw data (illustrated via the Data sources) needs some processing: sources may be combined, cumulative data computed, images cropped and transformed, formats converted, and so on. At the end of the process the new data is stored again, in what is often called a Data warehouse in the context of data science. From this storage our tasks can load the data for direct processing. A generic sketch of this pattern follows the list of advantages below.
Separating these tasks has some advantages.
- The processes can run asynchronously.
- A state of the processed data can be frozen for processing.
- There is a well defined and reproducible way to come from the raw data to our input data (the code is under version control).
- Unstructured data can be transformed into structured data that is easier to process.
- If a format changes this can be incorporated into the ETL to allow backwards or forwards compatibility.
- Depending on our use-case we can extend this list.
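To make the pattern concrete, here is a minimal, generic sketch of the three stages in Python. It is not code from our project; the source files, the warehouse path, and the derived cumulative column are purely illustrative.

from pathlib import Path

import numpy as np

# Hypothetical raw sources and warehouse location, only for illustration.
RAW_SOURCES = [Path("source_a.csv"), Path("source_b.csv")]
WAREHOUSE = Path("warehouse/combined.npy")


def extract(sources: list[Path]) -> list[np.ndarray]:
    # read every raw source into memory
    return [np.loadtxt(src, delimiter=",") for src in sources]


def transform(tables: list[np.ndarray]) -> np.ndarray:
    # combine the sources and add a derived (here: cumulative) column
    combined = np.vstack(tables)
    return np.column_stack([combined, np.cumsum(combined[:, -1])])


def load(data: np.ndarray, target: Path) -> None:
    # persist the processed data in the "warehouse"
    target.parent.mkdir(parents=True, exist_ok=True)
    np.save(target, data)


if __name__ == "__main__":
    load(transform(extract(RAW_SOURCES)), WAREHOUSE)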
We illustrate this task in the context of our example project. This means, instead of directly using catData_w.mat, we start from the raw data catData.mat; see b097ba6 for the implementation details.
We introduce a new module called etl with three files plus the module file to accomplish this task.
MLB-DATA$ tree src/etl
etl
├── extract.py
├── __init__.py
├── load.py
└── transform.py
1 directory, 4 files
Now a simple call to etl.load("catData_w.mat") will run the following chain (a sketch of a possible implementation follows the list):
- If the file does not exist:
  - extract:
    - access catData.mat and extract the content as a np.array
  - transform:
    - transform the image in the same fashion as described in Section A.1
    - store the data in the .mat format in catData_w.mat
- load:
  - return the content as a np.array
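Below is a minimal sketch of how load.py could chain these steps. It is not the actual implementation (see b097ba6 for that); the scipy.io helpers, the helper function names, and the key under which the array is stored are assumptions.

# src/etl/load.py -- sketch only, see b097ba6 for the real implementation
from pathlib import Path

import numpy as np
from scipy.io import loadmat, savemat

from . import extract, transform


def load(filename: str) -> np.ndarray:
    path = Path(filename)
    if not path.exists():
        # extract: access catData.mat and pull the content as a np.array
        raw = extract.extract("catData.mat")
        # transform: process the images as described in Section A.1 ...
        processed = transform.transform(raw)
        # ... and store the result in the .mat format
        savemat(path, {"data": processed})  # the key "data" is an assumption
    # load: return the processed content as a np.array
    return loadmat(path)["data"]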
We try to store only the processed data and not the raw data itself, as this would be a duplication. In this case we therefore do not store catData.mat locally but just the final result. Depending on your ETL, you might build up a local buffer to optimize processing time over access time.
While for image processing computational resources are often called out as the bottleneck, the influence of storage and the correct structure of the data should not be ignored.
Modern GPU architectures are designed for high data throughput, which means we need to provide the data as quickly as it gets processed. Consequently, high storage throughput is required, or the performance will not be optimal.
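A rough way to check whether the storage keeps up is to time how fast the processed file can be read. The snippet below is only a sketch and assumes the array of interest is stored under the key "data" in catData_w.mat.

import time

from scipy.io import loadmat

start = time.perf_counter()
content = loadmat("catData_w.mat")
elapsed = time.perf_counter() - start

# assumes the array of interest is stored under the key "data"
size_mb = content["data"].nbytes / 1e6
print(f"read {size_mb:.1f} MB in {elapsed:.3f} s ({size_mb / elapsed:.1f} MB/s)")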
Note: as dvc uses a cache and links the data, the performance of the cache location is important.