Appendix B — Our first Neural Network in tensorflow.keras

As mentioned, there are multiple frameworks for working with neural networks (NNs), and we focus on PyTorch. Nevertheless, tensorflow.keras is often used in industry, and to give an idea of how it works, we also implement our first model in this framework.

In this section we focus on Keras with TensorFlow as the backend. This means we find our main modules under tensorflow.keras. We will, in general, reference the documentation via TensorFlow; the content of the keras.io API docs is mostly equivalent, although some tutorials and examples might differ.

Show the code for loading, splitting and scaling the images.
import numpy as np
import scipy.io
import requests
import io
import matplotlib.pyplot as plt
%config InlineBackend.figure_formats = ["svg"]


# Download the cat and dog image data (as .mat files) from GitHub.
response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/catData_w.mat")
cats_w = scipy.io.loadmat(io.BytesIO(response.content))["cat_wave"]

response = requests.get(
    "https://github.com/dynamicslab/databook_python/"
    "raw/refs/heads/master/DATA/dogData_w.mat")
dogs_w = scipy.io.loadmat(io.BytesIO(response.content))["dog_wave"]

# First 40 images per class for training, the rest for testing; scale
# the pixel values to [0, 1]. Labels: 0 = dog, 1 = cat.
X_train = np.concatenate((dogs_w[:, :40], cats_w[:, :40]), axis=1).T / 255.0
y_train = np.repeat(np.array([0, 1]), 40)
X_test = np.concatenate((dogs_w[:, 40:], cats_w[:, 40:]), axis=1).T / 255.0
y_test = np.repeat(np.array([0, 1]), 40)


def myplot(y):
    # Bar plot of the predictions: dogs (first half) up, cats (second
    # half) down, separated by a red dash-dotted line in the middle.
    plt.figure()
    n = y.shape[0]
    plt.bar(range(n), y)
    plt.plot([-0.5, n - 0.5], [0, 0], "k", linewidth=1.0)
    plt.plot([n // 2 - 0.5, n // 2 - 0.5], [-1.1, 1.1], "r-.", linewidth=3)
    plt.yticks([-0.5, 0.5], ["cats", "dogs"], rotation=90, va="center")
    plt.text(n // 4, 1.05, "dogs")
    plt.text(n // 4 * 3, 1.05, "cats")
    plt.gca().set_aspect(n / (2 * 3))

If we translate it into a tensorflow.keras model, we get the following code (already including a summary at the end).

import tensorflow as tf

# Initialize everything for reproducibility
tf.keras.utils.set_random_seed(6020)
tf.keras.backend.clear_session(free_memory=True)


model = tf.keras.Sequential([                                   # (1)
    tf.keras.layers.InputLayer(shape=[X_train.shape[1]], name="Input"),  # (2)
    tf.keras.layers.Dense(2, activation="tanh", name="H01"),    # (3)
    tf.keras.layers.Dense(2,
                          # activation="softmax",               # (4)
                          name="H02")
])

model.summary(line_length=60)                                   # (5)
  1. Generate a sequential model, the Keras way of creating a linear stack of layers.
  2. Our first layer defines the input size; in our case we have flat images of size \(1024\).
  3. Use the activation function \(\tanh\) and reduce down to \(2\) neurons.
  4. The activation function softmax would compute the probability per class; including it directly in the model is not recommended, so we apply it separately later.
  5. Print the summary of the model.
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Layer (type)             ┃ Output Shape      ┃   Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ H01 (Dense)              │ (None, 2)         │     2,050 │
├──────────────────────────┼───────────────────┼───────────┤
│ H02 (Dense)              │ (None, 2)         │         6 │
└──────────────────────────┴───────────────────┴───────────┘
 Total params: 2,056 (8.03 KB)
 Trainable params: 2,056 (8.03 KB)
 Non-trainable params: 0 (0.00 B)

It is often also quite useful to export the model as an image. In our case we can again work with a dot file, rendered as Figure B.1.

# Export the model graph as a Graphviz dot file.
dot = tf.keras.utils.model_to_dot(model, show_shapes=True,
                                  show_layer_activations=True, dpi=False)
with open("model.dot", "w") as f:
    f.write(dot.to_string())
[Model graph: H01 (Dense, activation: tanh, input shape: (None, 1024), output shape: (None, 2)) → H02 (Dense, activation: linear, input shape: (None, 2), output shape: (None, 2))]
Figure B.1: Model with all its parameters visualized.
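
Alternatively, assuming pydot and the Graphviz binaries are installed, tf.keras.utils.plot_model renders the graph directly to an image file; a minimal sketch:

# Render the model graph straight to a PNG file instead of writing a
# dot file first (requires pydot and graphviz).
tf.keras.utils.plot_model(model, to_file="model.png",
                          show_shapes=True, show_layer_activations=True)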

Exercise B.1 (Explain the output of .summary()) For the (extended) output of .summary()

model.summary(line_length=60, 
              expand_nested=True, show_trainable=True)
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━┓
┃ Layer (type)          ┃ Output Shape     ┃ Param # ┃ Tr… ┃
┡━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━┩
│ H01 (Dense)           │ (None, 2)        │   2,050 │  Y  │
├───────────────────────┼──────────────────┼─────────┼─────┤
│ H02 (Dense)           │ (None, 2)        │       6 │  Y  │
└───────────────────────┴──────────────────┴─────────┴─────┘
 Total params: 2,056 (8.03 KB)
 Trainable params: 2,056 (8.03 KB)
 Non-trainable params: 0 (0.00 B)

answer the following questions.

  1. Explain how the Param # column is computed and retrace the details (see the sketch after this list).
  2. What does the memory usage of 8.03 KB imply about the number format used for the parameters?
  3. What does the (None, 2) imply for the shape?
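
To retrace the numbers, it can help to inspect the weight tensors programmatically; a minimal sketch:

# Each Dense layer stores a kernel and a bias; their sizes add up to
# the Param # column of the summary.
for layer in model.layers:
    print(layer.name, [tuple(w.shape) for w in layer.weights],
          layer.count_params())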

Next, we need to compile the model to prepare it for training. In this step we define the optimizer used to solve the underlying optimization problem, as well as how the loss is computed.

loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)  # (1)
model.compile(loss=loss,
              optimizer=tf.keras.optimizers.SGD(),  # (2)
              metrics=["accuracy"])
  1. Function to compute the loss; the name corresponds to sparse_categorical_crossentropy from the tf.keras.losses module. Since we left the softmax out of the model, its outputs are logits, hence from_logits=True.
  2. Our optimizer, the Stochastic Gradient Descent method; more details later.
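
To see what this loss computes, a small hedged check on toy values (the demo_* numbers are made up for illustration):

# Toy logits for two samples with true classes 0 and 1; with
# from_logits=True the loss applies the softmax internally before
# computing the (mean) cross-entropy.
demo_logits = np.array([[2.0, -1.0], [0.5, 1.5]])
demo_labels = np.array([0, 1])
print(loss(demo_labels, demo_logits).numpy())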

After we have compiled the model, we can train it using our data.

history = model.fit(X_train,               # (1)
                    y_train,
                    epochs=130,            # (2)
                    batch_size=8,          # (3)
                    validation_split=0.1,  # (4)
                    verbose=0,             # (5)
                    shuffle=True,          # (6)
                    )
  1. Our entire training set; Keras holds out the last fraction for validation and shuffles the remaining training samples.
  2. Number of epochs, i.e. full passes of the optimization over the provided data.
  3. The batch_size describes the number of samples per gradient update; it can be interpreted as filling the spot of None in the above summary output.
  4. We can define how large the validation portion of the provided data is.
  5. Deactivate the output so as not to clutter the notes.
  6. Shuffle the training data before each epoch to make sure the samples are provided in different orders.
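
Before plotting anything, it can help to check which metrics were recorded; a minimal sketch (the names follow from the compile step and the validation split):

# Each entry holds one value per epoch; validation_split adds the
# "val_" counterparts of "loss" and "accuracy".
for name, values in history.history.items():
    print(f"{name}: {values[-1]:.4f}")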

Now that training is complete we can have a look at the performance.

Show the code for the figure
# Wrap the trained model with a softmax layer to turn the logits into
# class probabilities.
probability_model = tf.keras.Sequential([
    model,
    tf.keras.layers.Softmax(axis=1),
])
y_proba = probability_model.predict(X_test, verbose=0)
y_predict = y_proba.argmax(axis=-1)
# Map the labels 0/1 to +1/-1 for the bar plot helper.
myplot(y_predict * (-2) + 1)

n = y_proba.shape[0]
plt.figure()
plt.bar(range(n), y_proba[:, 0], color='y', label=r"p(dog)")
plt.bar(range(n), y_proba[:, 1], color='b', label=r"p(cat)", 
        bottom=y_proba[:, 0])
plt.plot([-0.5, n - 0.5], [0.5, 0.5], "r", linewidth=1.0)
plt.gca().set_aspect(n / (1 * 3))
plt.legend()

plt.figure()

epochs = range(len(history.history["accuracy"]))
plt_list = [["accuracy", "-"], ["loss", ":"],
            ["val_accuracy", "--"], ["val_loss", "-."]]

for name, style in plt_list:
    plt.plot(epochs, history.history[name], style, label=name)
plt.legend(loc="lower right")
plt.gca().set_aspect(130 / (1 * 3))
plt.show()
(a) Classification for our test set.
(b) Probabilities of the two classes - our actual output of the model.
(c) Summary of the key metrics of the model training.
Figure B.2: Performance of our model.

Overall, we have an accuracy of 83.75% for our test set, better than with our linear models.
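
The reported accuracy should be reproducible from the predictions above, or directly via model.evaluate; a minimal sketch:

# Fraction of correctly classified test images.
print((y_predict == y_test).mean())

# Alternatively, let Keras compute loss and accuracy on the test set;
# the accuracy only depends on the argmax, so logits are fine here.
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(test_acc)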

Exercise B.2 (Switch to one_hot representation) In the compile step we used sparse_categorical_crossentropy to allow for the computation of the loss function. Often a so-called one_hot vector is more appropriate; this is supported via categorical_crossentropy as the loss function.

  1. Change to categorical_crossentropy.

  2. Use the function tensorflow.keras.utils.to_categorical to switch to a one_hot encoding (see the sketch after this list).

  3. To switch back we can use np.argmax as in the code above.
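
As a starting point, a minimal sketch of what to_categorical produces for our integer labels:

# One-hot encoding: label 0 becomes [1., 0.], label 1 becomes [0., 1.].
y_train_onehot = tf.keras.utils.to_categorical(y_train)
print(y_train_onehot.shape, y_train_onehot[0])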

Exercise B.3 (Change the optimizer) As we have seen so often before, the optimizer used for the problem can change how fast convergence is reached (if at all).

  1. Have a look at the possibilities provided in the framework Overview - Optimizers - they are basically all improvements on Stochastic Gradient Descent.

  2. Test different approaches, especially one of the Adam (Adam, AdamW, Adamax) and Ada (Adadelta, Adafactor, Adagrad) implementations, and record the performance (final accuracy); a starting point is sketched below.
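
A minimal sketch for such a swap, re-compiling the model with Adam and its default learning rate before calling fit again:

# Re-compile with a different optimizer; loss and metrics stay the
# same, then model.fit can be called again as above.
model.compile(loss=loss,
              optimizer=tf.keras.optimizers.Adam(),
              metrics=["accuracy"])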

B.1 How to save a keras model

Of course we can save a Keras model with the methods discussed in Chapter 4, but it is more convenient to use the dedicated functionality from Keras, see docs.

In short, we only need to call .save(file) to save in the default .keras format, which is also the recommended format to use. There also exist a tf and an h5 format; note that for the latter we need to install the Python package h5py 1. The tf version might execute arbitrary code while loading the model, so be careful when using it, see Section 4.2.

model.save("model.keras")
loaded_model = tf.keras.saving.load_model("model.keras")
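
As a quick sanity check, the reloaded model should reproduce the predictions of the original one:

# Both models should produce identical outputs on the same inputs.
assert np.allclose(model.predict(X_test, verbose=0),
                   loaded_model.predict(X_test, verbose=0))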

It is also possible to implement continuous checkpoints during training. This allows us to resume training after an interrupt or simply store the model at the end of the training.
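
One built-in option for such checkpoints is the ModelCheckpoint callback; a minimal sketch, where the file name and the monitored metric are our own choices:

# Store the model after each epoch, keeping only the best version
# according to the validation accuracy; "checkpoint.keras" is an
# arbitrary file name.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "checkpoint.keras", monitor="val_accuracy", save_best_only=True)
model.fit(X_train, y_train, epochs=130, batch_size=8,
          validation_split=0.1, verbose=0, callbacks=[checkpoint])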

All of this can also be done via the already discussed dvclive interface; the implementation is left as an exercise.

Exercise B.4 (dvclive integration into the keras training) Implement the DVCLive integration to track the metrics.


  1. HDF5 is a file format for storing large complex heterogeneous data often used for large simulations in High Performance Computing, see details on the project page hdfgroup.org.↩︎