Drawing Loss Curves for Deep Neural Network Training in PyTorch

6 min read · Nov 24, 2021

Typical Shape of Loss and Accuracy Graph. Source: https://stackoverflow.com/questions/47817424/loss-accuracy-are-these-reasonable-learning-curves/47819022

This post builds on the neural network model and training code described in the following post [3].

Training a Neural Network in PyTorch for a Computer Vision Task — Person Re-Identification

We will see how we can plot the loss curve for each epoch and how to find the best model and save it for future inference usage.

Plotting Loss Curve

First, let’s import the additional libraries required as follows.

import matplotlib.pyplot as plt

Next, we need to initialize the required variables as follows to store the loss and error values during training.

y_loss = {}  # loss history
y_loss['train'] = []
y_loss['val'] = []
y_err = {}
y_err['train'] = []
y_err['val'] = []
x_epoch = []

We will also initialize the required plots as follows. Note that we are actually drawing two graphs, one for the epoch loss and one for the rank 1 error.

Loss: Training a neural network (NN) is an optimization problem. For optimization problems, we define an objective function and search for a solution that maximizes or minimizes its output. In the case of an NN, we define a loss/cost function and try to minimize it [2]. We plot the loss value obtained at the end of each epoch.
Rank 1 error: In person re-identification, we need the ID of the person in a query image. If there are 10 query images and our model predicts the ID correctly for 9 of them, then we have a rank 1 error of 10%.

fig = plt.figure()
ax0 = fig.add_subplot(121, title="loss")
ax1 = fig.add_subplot(122, title="top1err")
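
The arithmetic behind the rank 1 error defined above can be sketched in plain Python. The ID lists below are made-up values for illustration only, and rank1_error is a hypothetical helper that is not part of the training code.

```python
def rank1_error(predicted_ids, true_ids):
    """Fraction of query images whose top prediction is the wrong ID."""
    if len(predicted_ids) != len(true_ids):
        raise ValueError("both lists must have the same length")
    if not true_ids:
        raise ValueError("need at least one query image")
    correct = sum(p == t for p, t in zip(predicted_ids, true_ids))
    return 1.0 - correct / len(true_ids)


# Hypothetical example: 10 query images, 9 predicted correctly.
true_ids = [3, 7, 7, 1, 4, 9, 2, 2, 5, 8]
predicted_ids = [3, 7, 7, 1, 4, 9, 2, 2, 5, 6]  # only the last one is wrong

error = rank1_error(predicted_ids, true_ids)
print(f"rank 1 error: {error:.0%}")
```

The training loop reaches the same quantity through 1.0 - epoch_acc, accumulated batch by batch rather than from two complete lists.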

Now, we will write a function named draw_curve which we will call at the end of each epoch to draw and save the graph.

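def draw_curve(current_epoch):
    # Record the epoch index for the x-axis
    x_epoch.append(current_epoch)
    # Plot the full history collected so far on both graphs
    ax0.plot(x_epoch, y_loss['train'], 'bo-', label='train')
    ax0.plot(x_epoch, y_loss['val'], 'ro-', label='val')
    ax1.plot(x_epoch, y_err['train'], 'bo-', label='train')
    ax1.plot(x_epoch, y_err['val'], 'ro-', label='val')
    # Add the legends only once; later calls would list duplicate entries
    if current_epoch == 0:
        ax0.legend()
        ax1.legend()
    # Overwrite the saved image with the updated graphs (requires import os)
    fig.savefig(os.path.join('./lossGraphs', 'train.jpg'))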

Note that we are keeping the y_loss and y_err as global variables and adding to them at the end of each epoch.

Don’t forget to create a directory named lossGraphs in your project directory, as it is the directory set in the draw_curve function above.
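
Alternatively, the directory can be created from the script itself. This is a minimal sketch, assuming the same relative path ./lossGraphs that draw_curve saves into; graph_dir is a name introduced here for illustration.

```python
import os

# Same relative path that draw_curve saves into (assumed here).
graph_dir = './lossGraphs'

# exist_ok=True makes this safe to run on every start:
# it creates the directory if it is missing and does nothing otherwise.
os.makedirs(graph_dir, exist_ok=True)

print(os.path.isdir(graph_dir))
```

Running this once before the training loop avoids a failed savefig call on a fresh checkout.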

Lastly, we need to calculate the loss. Loss is calculated per epoch, and each epoch has a training phase and a validation phase. So, at the start of each phase, we initialize two variables as follows to accumulate the loss and the number of correct predictions.

running_loss = 0.0
running_corrects = 0.0

We need to update both running_loss and running_corrects after every batch, in both the training and validation phases of each epoch. running_loss can be updated as follows.

running_loss += loss.item() * now_batch_size

Note that we multiply by now_batch_size, which is the size of the current batch. This is because the criterion returns the average loss over the batch, and loss.item() returns that average as a plain Python number [4].
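
To see why the multiplication matters, consider batches of unequal size. This is a minimal sketch in plain Python with made-up per-sample losses (no PyTorch involved); it only illustrates the weighting and is not taken from the training code.

```python
# Made-up per-sample losses, split into batches of unequal size.
batches = [
    [0.9, 1.1, 1.0, 1.0],  # 4 samples, batch mean about 1.0
    [0.2, 0.4],            # 2 samples, batch mean about 0.3
]

running_loss = 0.0
num_samples = 0
batch_means = []
for batch in batches:
    batch_mean = sum(batch) / len(batch)      # what a mean-reduced loss reports
    batch_means.append(batch_mean)
    running_loss += batch_mean * len(batch)   # undo the averaging
    num_samples += len(batch)

weighted = running_loss / num_samples             # true per-sample mean
unweighted = sum(batch_means) / len(batch_means)  # mean of the batch means

print(f"weighted:   {weighted:.4f}")
print(f"unweighted: {unweighted:.4f}")
```

The two numbers differ because the unweighted version gives the small batch the same influence as the large one. When every batch has the same size, both approaches agree.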

running_corrects can be updated as follows.

running_corrects += float(torch.sum(preds == labels.data))

Here, we are comparing the actual ID of each image in the batch and the predicted ID. running_corrects will be equal to the number of predicted IDs that match the actual ID label.

At the end of each phase in an epoch, we need to calculate epoch_loss and epoch_acc as follows.

epoch_loss denotes the average loss per image in the epoch
epoch_acc denotes the fraction of correct ID predictions from the dataset in the epoch

epoch_loss = running_loss / dataset_sizes[phase]
epoch_acc = running_corrects / dataset_sizes[phase]

Now, we need to add these values to the y_loss and y_err variables that we initialized earlier.

y_loss[phase].append(epoch_loss)
y_err[phase].append(1.0 - epoch_acc)

Here, phase will have either ‘train’ or ‘val’ as its value to denote the training and validation phases. One final step remains: calling the draw_curve function to draw the graph and store it as an image in the lossGraphs directory.

if phase == 'val':
    draw_curve(epoch)

We draw only after the validation phase because it is the final step in each epoch, so both the training and validation values for that epoch have already been recorded.
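
Once the history lists are filled, they can also be used to find the epoch with the lowest validation error. This is a minimal sketch with made-up values; best_epoch and best_val_err are names introduced for illustration and are not part of the original training code.

```python
# Made-up history in the same layout as y_err in the text.
y_err = {
    'train': [0.80, 0.55, 0.40, 0.30, 0.22],
    'val':   [0.82, 0.60, 0.48, 0.50, 0.53],
}

# Index of the smallest validation error; ties resolve to the earliest epoch.
best_epoch = min(range(len(y_err['val'])), key=lambda i: y_err['val'][i])
best_val_err = y_err['val'][best_epoch]

print(f"best epoch: {best_epoch}, val error: {best_val_err:.2f}")
```

In this made-up history the training error keeps falling while the validation error starts rising again, which is the usual reason to pick a model by validation error rather than by the last epoch.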

Testing our Code

In order to test our code, we will reduce the batch size and the number of images handled in each epoch, so that we can quickly go through a few epochs on our local machine and see if we get the required graphs. To this end, I will set the following variables.

batchsize = 2
num_epochs = 10

To limit the number of images from the dataset processed in each epoch, I will just add a counter and stop the epoch after processing 10 batches.

count = 0
# Iterate over data.
for data in dataloaders[phase]:
    if count >= 10:  # stop after 10 batches
        break
    count += 1
    ....
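
An alternative to a manual counter is itertools.islice, which stops any iterable after a fixed number of items. This is a minimal sketch, where fake_loader is a stand-in list for dataloaders[phase] and max_batches is a name introduced here for illustration.

```python
from itertools import islice

# Stand-in for dataloaders[phase]: any iterable of (inputs, labels) batches.
fake_loader = [(f"inputs_{i}", f"labels_{i}") for i in range(25)]

max_batches = 10  # hypothetical cap for a quick local test
processed = 0

# islice yields at most max_batches items and then ends the loop.
for inputs, labels in islice(fake_loader, max_batches):
    processed += 1  # the real forward/backward pass would go here

print(processed)
```

This keeps the loop body free of counter bookkeeping, at the cost of one extra import.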

Finally, your code should look similar to the following.

#!/usr/bin/python
# -*- encoding: utf-8 -*-

import os
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable
from simple_model import ft_net
import matplotlib.pyplot as plt

h, w = 256, 128
data_dir = '/home/niruhan/Personal/paper/Market-1501-v15.09.15/pytorch'
batchsize = 2
num_epochs = 10
use_gpu = torch.cuda.is_available()

transform_train_list = [
    transforms.Resize((h, w), interpolation=3),
    transforms.Pad(10),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
]

transform_val_list = [
    transforms.Resize(size=(h, w), interpolation=3),  # Image.BICUBIC
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
]

data_transforms = {
    'train': transforms.Compose(transform_train_list),
    'val': transforms.Compose(transform_val_list),
}

image_datasets = {}
image_datasets['train'] = datasets.ImageFolder(os.path.join(data_dir, 'train'),
                                               data_transforms['train'])
image_datasets['val'] = datasets.ImageFolder(os.path.join(data_dir, 'val'),
                                             data_transforms['val'])

dataloaders = {x: torch.utils.data.DataLoader(image_datasets[x], batch_size=batchsize, shuffle=True, num_workers=8)
               for x in ['train', 'val']}

class_names = image_datasets['train'].classes
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}

model = ft_net(len(class_names))
criterion = nn.CrossEntropyLoss()

lr = 0.05
optim_name = optim.SGD
ignored_params = list(map(id, model.classifier.parameters()))
base_params = filter(lambda p: id(p) not in ignored_params, model.parameters())
classifier_params = model.classifier.parameters()
optimizer = optim_name([
    {'params': base_params, 'lr': 0.1 * lr},
    {'params': classifier_params, 'lr': lr}
], weight_decay=5e-4, momentum=0.9, nesterov=True)

y_loss = {}  # loss history
y_loss['train'] = []
y_loss['val'] = []
y_err = {}
y_err['train'] = []
y_err['val'] = []

x_epoch = []
fig = plt.figure()
ax0 = fig.add_subplot(121, title="loss")
ax1 = fig.add_subplot(122, title="top1err")


def draw_curve(current_epoch):
    x_epoch.append(current_epoch)
    ax0.plot(x_epoch, y_loss['train'], 'bo-', label='train')
    ax0.plot(x_epoch, y_loss['val'], 'ro-', label='val')
    ax1.plot(x_epoch, y_err['train'], 'bo-', label='train')
    ax1.plot(x_epoch, y_err['val'], 'ro-', label='val')
    if current_epoch == 0:
        ax0.legend()
        ax1.legend()
    fig.savefig(os.path.join('./lossGraphs', 'train.jpg'))


for epoch in range(num_epochs):
    print('Epoch {}/{}'.format(epoch, num_epochs - 1))
    print('-' * 10)
    # Each epoch has a training and validation phase
    for phase in ['train', 'val']:
        if phase == 'train':
            model.train(True)  # Set model to training mode
        else:
            model.train(False)  # Set model to evaluate mode

        running_loss = 0.0
        running_corrects = 0.0

        count = 0
        # Iterate over data.
        for data in dataloaders[phase]:
            if count >= 10:  # stop after 10 batches
                break

            count = count + 1
            # get a batch of inputs
            inputs, labels = data
            now_batch_size, c, h, w = inputs.shape
            if now_batch_size < batchsize:  # skip the last batch
                continue
            # print(inputs.shape)
            # wrap them in Variable, if gpu is used, we transform the data to cuda.
            if use_gpu:
                inputs = Variable(inputs.cuda())
                labels = Variable(labels.cuda())
            else:
                inputs, labels = Variable(inputs), Variable(labels)

            # zero the parameter gradients
            optimizer.zero_grad()

            # -------- forward --------
            outputs = model(inputs)
            _, preds = torch.max(outputs.data, 1)
            loss = criterion(outputs, labels)

            del inputs

            # -------- backward + optimize --------
            # only if in training phase
            if phase == 'train':
                loss.backward()
                optimizer.step()

            # statistics
            running_loss += loss.item() * now_batch_size
            del loss
            running_corrects += float(torch.sum(preds == labels.data))

        epoch_loss = running_loss / dataset_sizes[phase]
        epoch_acc = running_corrects / dataset_sizes[phase]

        print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))

        y_loss[phase].append(epoch_loss)
        y_err[phase].append(1.0 - epoch_acc)

        # draw the curves once both phases of this epoch are recorded
        if phase == 'val':
            draw_curve(epoch)

Make sure to change the data_dir to the location that contains the preprocessed Market1501 dataset. Read my previous blog at [5] to learn how to download and preprocess the dataset for PyTorch.

You will need to install PyTorch and the other required libraries in a Python environment in order to run this code. I use Anaconda to set up an environment and install libraries. At the end of the run, you should find the graph saved as train.jpg in the lossGraphs directory.

Don’t worry about the values in the graphs. They are not meaningful here because the loss and error calculations divide by the size of the entire dataset, while we limited each epoch to only 10 batches. When we train with the entire dataset, we will get meaningful values. In my future blogs, I will discuss how we can save the trained model, and how we can train the model in a cloud instance with GPU.
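
One way to keep the numbers meaningful during such a shortened run is to divide by the number of images actually processed instead of the full dataset size. This is a minimal sketch in plain Python with made-up batch results; samples_seen is a name introduced here, and this is not what the listing above does.

```python
# Made-up (mean_loss, num_correct, batch_size) results for three batches.
batch_results = [
    (2.0, 1, 2),
    (1.5, 1, 2),
    (1.0, 2, 2),
]
dataset_size = 1000  # hypothetical size of the full dataset

running_loss = 0.0
running_corrects = 0
samples_seen = 0
for mean_loss, num_correct, batch_size in batch_results:
    running_loss += mean_loss * batch_size
    running_corrects += num_correct
    samples_seen += batch_size

# Dividing by the full dataset size understates both loss and accuracy ...
loss_full = running_loss / dataset_size
acc_full = running_corrects / dataset_size
# ... while dividing by the images actually seen gives per-image values.
loss_seen = running_loss / samples_seen
acc_seen = running_corrects / samples_seen

print(f"by dataset size: loss={loss_full:.4f} acc={acc_full:.4f}")
print(f"by samples seen: loss={loss_seen:.4f} acc={acc_seen:.4f}")
```

On a full pass over the data, samples_seen differs from the dataset size only by any batches that were skipped, so the two variants give nearly the same values.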

References

[1] https://github.com/layumi/Person_reID_baseline_pytorch
[2] https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/
[3] https://niruhan.medium.com/training-a-neural-network-in-pytorch-for-a-computer-vision-task-person-re-identification-b2b23d2cc8d0
[4] https://discuss.pytorch.org/t/what-is-loss-item/61218
[5] https://niruhan.medium.com/pre-processing-market1501-person-reid-dataset-for-pytorch-fbb4912f4cc5