
Images from the Convolutional World


We take neural networks one step further from our previous post, towards classifying images with a process called Convolution.

A CNN (Convolutional Neural Network) is a type of artificial neural network that processes its layers in a way inspired by the human visual system, making it possible for it to identify objects and "see".

A CNN contains several specialized hidden layers organized in a hierarchy: the first layers can detect lines, curves, and edges, and the deeper layers specialize progressively until they can recognize complex shapes such as a face or a part of the human body.

On this journey, we first need to understand how to work with this new type of input data:


Images

A digital image has three attributes: width and height in pixels (the resolution), and color, stored in RGB (red, green, blue) format as channels.


In our example we will work with a dataset of tiny images called CIFAR-10 that serves our learning purposes just fine. It consists of 60,000 32×32 color (RGB) images, labeled with an integer corresponding to 1 of 10 classes: airplane (0), automobile (1), bird (2), cat (3), deer (4), dog (5), frog (6), horse (7), ship (8), and truck (9).

We will continue using Pytorch for our project, so we will import this dataset from its torchvision library.

from torchvision import datasets
data_path = './dlpytorch/'
cifar10 = datasets.CIFAR10(data_path, train=True, download=True)
cifar10_val = datasets.CIFAR10(data_path, train=False, download=True)
len(cifar10)

We set the download parameter to True, indicating the data_path, and train to True to request the training set. For the validation data, train will be False. Finally we verify the size of the dataset obtained: 50,000 images.
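Similarly (a quick check I'm adding, not shown in the original post), the validation split of CIFAR-10 holds 10,000 images:

len(cifar10_val)

10000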

Ok, each image of the dataset is an instance of an RGB PIL image, so let's visualize some of them:

# plot ex. image
import matplotlib.pyplot as plt

# CIFAR-10 class names, indexed by label
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
img, label = cifar10[79]
plt.imshow(img)
plt.show()

print(f"Label:{label} - Class:{class_names[label]}")

Label:1 - Class:automobile


Now, we will need to convert our image to a tensor before we can do anything with Pytorch:

from torchvision import transforms
to_tensor = transforms.ToTensor()
img_t = to_tensor(img)
img_t.shape

torch.Size([3, 32, 32])

So, the image has been turned into a 3 (RGB channels) × 32 × 32 tensor. Ok, let's transfer all our image dataset to Pytorch Tensors using the dataset transform parameter:

tensor_c10 = datasets.CIFAR10(data_path, train=True, download=False,                                                
                             transform=transforms.ToTensor())
                             
# load the same image and view the tensor shape and type
img_t, _ = tensor_c10[79]
img_t.shape, img_t.dtype

(torch.Size([3, 32, 32]), torch.float32)


The ToTensor transform turns the data into 32-bit floating point per channel, scaling the values down from the 0–255 range to 0.0–1.0. Also keep in mind that we must change the order of the axes to view our image again with matplotlib (from RGB channels × H × W to H × W × RGB channels).

# Permute columns and visualize image again
plt.imshow(img_t.permute(1, 2, 0))
plt.show()
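As a quick sanity check (my own addition, not in the original post), we can confirm that ToTensor has scaled the values into the 0.0–1.0 range:

img_t.min().item(), img_t.max().item()   # both values fall within [0.0, 1.0]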

Normalize image data

It is good practice to normalize the data so that each channel has zero mean and unit standard deviation. By choosing activation functions that are linear around 0 plus or minus 1 (or 2), keeping the data in that range makes it more likely that neurons have non-zero gradients, and therefore that they learn earlier. Also, normalizing each channel to have the same distribution ensures that channel information can be mixed and updated through gradient descent using the same learning rate.

So, we have to compute the mean value and the standard deviation of each channel across the dataset and apply the following transform:

v_norm[c] = (v[c] - mean[c]) / stdev[c]


Let’s compute them for the CIFAR-10 training set:

# Normalizing data
# 1. stack the dataset tensors along an extra dimension
imgs = torch.stack([img_t for img_t, _ in tensor_c10], dim=3)
imgs.shape

torch.Size([3, 32, 32, 50000])

# 2. compute mean per channel
imgs.view(3, -1).mean(dim=1)

tensor([0.4914, 0.4822, 0.4465])


view(3, -1) keeps the 3 channels and merges all the remaining dimensions into one, so the 3 × 32 × 32 × 50,000 stacked tensor becomes a 3 × 51,200,000 matrix; the mean is then taken over all the pixels of each channel across the whole dataset.

# 3. compute std deviation
imgs.view(3, -1).std(dim=1)

tensor([0.2470, 0.2435, 0.2616])

# 4 normalize the data 
normalized_c10 = datasets.CIFAR10(data_path, train=True, download=False,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.4915, 0.4823, 0.4468),(0.2470, 0.2435, 0.2616)) ]))
# view normalized image
img_t, _ = normalized_c10[79]
plt.imshow(img_t.permute(1, 2, 0))
plt.show()

From Fully Connected to Convolutional Neural Network

Suppose we want to build a fully connected NN (as explained in the previous post), and this time we will make it deep by adding new hidden layers. Our input layer will take 3,072 values (3 * 32 * 32); the 1st hidden layer outputs 1,024 values, the 2nd hidden layer 512, the 3rd hidden layer 128, and the output layer produces 10 values (one score per class).

# fully connected NN
fc_model = nn.Sequential(
          nn.Linear(3072, 1024),
          nn.Tanh(),
          nn.Linear(1024, 512),
          nn.Tanh(),
          nn.Linear(512, 128),
          nn.Tanh(),
          nn.Linear(128, 10))

Let's see how many parameters (weights and biases) we would have in our network:

# how many parameters have our fully connected NN ?
param_list = [p.numel()
              for p in fc_model.parameters()]
sum(param_list), param_list

(3738506, [3145728, 1024, 524288, 512, 65536, 128, 1280, 10])


Wow!! 3,738,506 parameters, why so many?

Remember that a linear layer computes y = weight * x + bias, and if x has length 3,072, and y must have length 1,024, then the weight tensor needs to be of size 1,024 × 3,072 and the bias size must be 1,024. So 1,024 * 3,072 + 1,024 = 3,146,752 parameters for the 1st layer.
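As a quick check (a sketch I'm adding, not code from the original post), we can confirm that first layer's parameter count directly:

from torch import nn

fc1 = nn.Linear(3072, 1024)
fc1.weight.shape, fc1.bias.shape            # (torch.Size([1024, 3072]), torch.Size([1024]))
sum(p.numel() for p in fc1.parameters())    # 3146752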


This tells us that our NN does not scale well with the number of pixels: imagine what would happen if we had 1024x1024 RGB images, that is over 3.1 million input values and more than 3 billion parameters in the first layer alone!!


Convolutions to the rescue

If we want to recognize patterns corresponding to objects, such as a car on a road, we will probably need to look at how nearby pixels are arranged, and we will be less interested in how pixels that are far from each other appear in combination: the important feature combinations tend to come from pixels that are close together. If we wanted to detect our Ford car in an image, it would not matter whether there is a tree or a cloud in the corner.

To translate this mathematically, we could calculate the weighted sum of a pixel and its immediate neighbors, rather than of all other pixels in the image. This would be equivalent to constructing weight matrices, one per output feature and output pixel location, in which all weights beyond a certain distance from a central pixel are zero.

For these localized patterns to have an effect on the output regardless of their location in the image, we also need translation invariance.

Fortunately, we have available a linear, local, translation-invariant operation on images:


A Convolution

We can define the convolution for a 2D image as the dot product of a weight matrix, the kernel, with each neighborhood of the input, generating a new output matrix.

The kernel moves from left to right and from top to bottom across the input image, as if a patch were being slid over it.
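To make this sliding dot product concrete, here is a minimal sketch (my own illustration, not code from the post) that computes a small convolution by hand and compares it with Pytorch's built-in operation:

import torch
import torch.nn.functional as F

img = torch.randn(1, 1, 6, 6)        # batch of 1, 1 channel, 6x6 image
kernel = torch.randn(1, 1, 3, 3)     # 1 output channel, 1 input channel, 3x3 kernel

# manual convolution (no padding, stride 1): the output is 4x4
out_manual = torch.zeros(4, 4)
for i in range(4):
    for j in range(4):
        neighborhood = img[0, 0, i:i+3, j:j+3]
        out_manual[i, j] = (neighborhood * kernel[0, 0]).sum()

# the same result with Pytorch's functional convolution
out_torch = F.conv2d(img, kernel)
torch.allclose(out_manual, out_torch[0, 0])   # True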


In summary the advantages are:

  • Local operations in neighborhoods

  • Translation invariance

  • Models with much fewer parameters



Kernels (also called convolution matrices) are generally square and small (3x3, 5x5), and are usually initialized with random values. Of course there is a tradeoff when choosing the kernel size, which we will talk about later.

Let's start to see some code. Pytorch provides convolutions for 1, 2, and 3 dimensions: nn.Conv1d for time series, nn.Conv2d for images and nn.Conv3d for volumes/videos. We will create a 2D convolution for an image:

# create Conv2d
conv = nn.Conv2d(3, 16, kernel_size=3)
conv

Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1))


Parameters: 3 is the number of input features (or channels in this case) and 16 is the number of output channels; this is an arbitrary number, but more channels means more capacity for the network to detect different types of features. kernel_size is 3 (Pytorch assumes 3x3). Finally, stride is 1 by default, that is, the step size of the kernel when sliding over the image.

So, in this case we have a weight tensor of 16 (out_ch) x 3 (in_ch) x 3 x 3, and the bias will have size 16. Let's verify that:

conv.weight.shape, conv.bias.shape

(torch.Size([16, 3, 3, 3]), torch.Size([16]))


Ok, let's apply the convolution to our example image:

# Apply convolution
output = conv(img_t.unsqueeze(0))
img_t.unsqueeze(0).shape, output.shape

(torch.Size([1, 3, 32, 32]), torch.Size([1, 16, 30, 30]))

The unsqueeze adds a new dim 0 to the input, because Conv2d expects a tensor in the form B(atch) × C × H × W; in this case the batch contains only 1 image.

And show the convolved image:


plt.imshow(output[0, 0].detach())
plt.show()

Note that the shape of the output is 30x30 and not 32x32, so after the convolution we’re missing two pixels in each dimension. To solve this Pytorch gives us the possibility of padding the image by creating ghost pixels around the border that have value zero.

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
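If we apply this padded convolution to the same image (a quick check I'm adding), the spatial size is now preserved:

output = conv(img_t.unsqueeze(0))
output.shape

torch.Size([1, 16, 32, 32])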

Now let's say we want our kernel to perform an edge detection. How could we assign the weights to this new kernel?

# create new conv to perform edge detection
conv = nn.Conv2d(3, 1, kernel_size=3, padding=1)
# disable gradient calculation and set the weights
with torch.no_grad():
    conv.weight[:] = torch.tensor([[-1.0, 0.0, 1.0],
                                   [-1.0, 0.0, 1.0],
                                   [-1.0, 0.0, 1.0]])
    conv.bias.zero_()
output = conv(img_t.unsqueeze(0))
plt.imshow(output[0, 0].detach())
plt.show()

In this way we can build many more elaborate filters. The job of a CNN is to estimate the kernels of a set of filter banks, in successive layers that transform a multichannel image into another multichannel image, where different channels correspond to different features.

We will have one output channel per kernel (for example, one channel for an average filter, another channel for vertical edges, etc.).


Kernel size tradeoff

A smaller kernel captures a lot of fine detail, but it may need more layers to cover a large region of the image and can lead to overfitting.

A larger kernel summarizes coarser structure and loses fine detail, which can lead to underfitting, and each layer carries more weights to store and compute.

So, you should tune your model to find the best size. It's very common to use odd-sized kernels, with 3x3 and 5x5 being the most used.
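To get a feel for this (a sketch I'm adding, not from the original post), here is how the number of parameters of a single 3-to-16-channel convolution grows with the kernel size:

from torch import nn

for k in (3, 5, 7):
    c = nn.Conv2d(3, 16, kernel_size=k)
    print(k, sum(p.numel() for p in c.parameters()))

3 448
5 1216
7 2368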


Downsampling

Downsampling, or pooling, aims to reduce the spatial dimensions of the image using simple mathematical operations such as averaging or taking the maximum. Combining convolutions and downsampling can help us recognize larger structures.

For example scaling an image in half is the equivalent of taking 4 neighboring pixels (locality) as input and producing one pixel as output. This downsampling of the image could be done by applying:

  • Average Pooling: the average of the 4 pixels; it was the first method used, but it has fallen into disuse

  • Max Pooling: take the maximum of the 4 pixels; currently the most common


Downsampling helps capture essential structural features of rendered images without fussing with fine details and generally acts as a noise suppressant.
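As a quick illustration (my own sketch, assuming a 16-channel 32x32 feature map), both pooling operations with a 2x2 window halve the spatial dimensions:

import torch
from torch import nn

x = torch.randn(1, 16, 32, 32)
nn.MaxPool2d(2)(x).shape   # torch.Size([1, 16, 16, 16])
nn.AvgPool2d(2)(x).shape   # torch.Size([1, 16, 16, 16])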


Advantages of combining convolutions and poolings


In the example above, the first set of kernels operates on small neighborhoods and low-level features, while the second set of kernels effectively operates on larger neighborhoods, producing features that are compositions of the previous ones.

This combination gives CNNs the ability to see very complex scenes.

(example of convolution+pooling layers)


Feature mapping


A feature map is the output of one filter applied to the previous layer. So, if our 1st convolution has 16 kernels, we will get 16 output matrices (feature maps).




Ok, it's time to rebuild our neural network with convolutions and pooling and then check whether we get an acceptable number of parameters, so that training is faster and computationally less expensive than the fully connected NN.

# CNN combining convolutions and pooling
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.Tanh(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 8, kernel_size=3, padding=1),
    nn.Tanh(),
    nn.MaxPool2d(2),
    nn.Linear(8 * 8 * 8, 32),
    nn.Tanh(),
    nn.Linear(32, 10))

The first convolution takes 3 channels to 16, so it generates 16 independent features that will serve to discriminate the low-level characteristics of the image; then we apply a Tanh activation function, and finally the 16-channel 32 × 32 image is pooled to a 16-channel 16 × 16 image (MaxPool2d).

The same process applies to a second convolution, Tanh and pooling; finally we pass an 8-channel 8x8 image to a linear module that outputs 32 elements, and then to a final linear layer that outputs 10 elements (10 scores, one per class of image in the Cifar10 dataset).
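To see where the 8 * 8 * 8 = 512 figure comes from, here is a small shape trace I'm adding (not from the original post), run only through the convolution and pooling part of the model above:

import torch

x = torch.randn(1, 3, 32, 32)          # a dummy batch with one CIFAR-10-sized image
for layer in model[:6]:                # conv, tanh, pool, conv, tanh, pool
    x = layer(x)
    print(type(layer).__name__, tuple(x.shape))

# Conv2d (1, 16, 32, 32)
# Tanh (1, 16, 32, 32)
# MaxPool2d (1, 16, 16, 16)
# Conv2d (1, 8, 16, 16)
# Tanh (1, 8, 16, 16)
# MaxPool2d (1, 8, 8, 8)   -> flattened, this is 8 * 8 * 8 = 512 features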


Now we obtain the number of parameters that this network needs to compare with the fully connected:

numel_list = [p.numel() for p in model.parameters()]
sum(numel_list), numel_list

(18354, [432, 16, 1152, 8, 16384, 32, 320, 10])

18,354 vs 3,738,506, that's a great reduction!!


If we try to apply an image to our model to make a prediction, it will give an error. This is because after the last pooling we must reshape the 8-channel 8x8 image into a 512-element 1D vector before the linear layer.

But unfortunately, we don’t have any explicit visibility of the output of each module when we use nn.Sequential in Pytorch, so we must subclass nn.Module:


# The solution: make our own nn.module subclass
class Net(nn.Module):
  def __init__(self):
    super().__init__()
    self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
    self.act1 = nn.Tanh()
    self.pool1 = nn.MaxPool2d(2)
    self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
    self.act2 = nn.Tanh()
    self.pool2 = nn.MaxPool2d(2)
    self.fc1 = nn.Linear(8 * 8 * 8, 32)
    self.act3 = nn.Tanh()
    self.fc2 = nn.Linear(32, 10)

  # That takes the inputs to the module and returns the output
  def forward(self, x):
    out = self.pool1(self.act1(self.conv1(x)))
    out = self.pool2(self.act2(self.conv2(out)))
    out = out.view(-1, 8 * 8 * 8)  # the reshape to 1D 512 elements
    out = self.act3(self.fc1(out))
    out = self.fc2(out)
    return out

We will do a new refactoring, since some modules like nn.Tanh and nn.MaxPool2d do not have parameters and it is not necessary to register them in the new subclass. For this Pytorch has a functional API (torch.nn.functional) that we will use to perform this task.

# Refactor the Net subclass to use the functional API for activation functions and pooling

import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
      super().__init__()
      self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
      self.conv2 = nn.Conv2d(16, 8, kernel_size=3, padding=1)
      self.fc1 = nn.Linear(8 * 8 * 8, 32)
      self.fc2 = nn.Linear(32, 10)

    def forward(self, x):
      out = F.max_pool2d(torch.tanh(self.conv1(x)), 2)
      out = F.max_pool2d(torch.tanh(self.conv2(out)), 2)
      out = out.view(-1, 8 * 8 * 8)
      out = torch.tanh(self.fc1(out))
      out = self.fc2(out)
      return out

Apply an image to a Net model:

# we obtain 10 raw scores (logits), 1 per class in the Cifar10 Dataset
model = Net()
model(img_t.unsqueeze(0))

tensor([[ 0.1109, 0.1352, 0.1018, -0.0142, -0.1172, 0.0679, 0.0349, 0.1117, 0.0222, 0.1687]], grad_fn=<AddmmBackward>)


The loss function to use in CNN classification model : Softmax Cross Entropy Loss

In our previous post about NNs we used MSE as a loss function, but here we are facing a classification problem, so we should use a function that better interprets the output values as probabilities. That is, each value must be between 0 and 1, and the vector must sum to 1 for each sample.

The Softmax Cross Entropy loss exploits these characteristics, producing gradients steeper than MSE for the same input. It has 2 components:

  • a) Softmax function: strongly amplifies the maximum value in relation to the others, forcing the NN to be less neutral about which prediction it believes to be correct. For example:

normalize(np.array([10, 6, 4])) ==> array([0.5, 0.3, 0.2])

softmax(np.array([10, 6, 4])) ==> array([0.98, 0.018, 0.002])

  • b) Cross Entropy loss: the penalties are much higher than with MSE over the interval [0, 1], and they get steeper (approaching ∞) as the difference between the prediction (p) and the target (y) approaches 1. A small code sketch of both components follows below.


Example: Cross entropy loss versus MSE when y = 0
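Here is a minimal sketch (my own numbers, reusing the example above) of both components in Pytorch:

import torch
import torch.nn.functional as F

scores = torch.tensor([[10.0, 6.0, 4.0]])   # raw model outputs (logits) for one sample
probs = F.softmax(scores, dim=1)
probs                                       # tensor([[0.9796, 0.0179, 0.0024]])

target = torch.tensor([0])                  # the correct class is index 0
loss = F.cross_entropy(scores, target)      # what nn.CrossEntropyLoss computes
loss                                        # tensor(0.0206) -- small: confident and correct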






Training the Convolutional NN model

Ok, now we have to train our model, so let's get to work. To execute our training more quickly we will use a reduced dataset derived from Cifar10, called Cifar3 :), that only has images of 3 classes instead of the original 10 (airplanes, cars and ships). You can see the code in the github repo.


We create a function to train the network in a loop of n epochs. We will use the DataLoader provided by Pytorch to feed the network with batches of images (64 in our example).

# Training the Net CNN
import datetime

def training_net(n_epochs, optimizer, model, loss_fn, train_loader):
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0

        # The DataLoader (train_loader) create batches of 64 imgs (batch_size)
        for imgs, labels in train_loader:
            outputs = model(imgs)
            # compute the loss to minimize
            loss = loss_fn(outputs, labels)
            # zero the grads from the last round, compute new grads and update the model
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # sums the losses we saw over the epoch
            # item() to transform loss to a python number
            loss_train += loss.item()

        if epoch == 1 or epoch % 10 == 0:
            print('{} Epoch {}, Training loss {}'.format(
            datetime.datetime.now(), epoch,
            loss_train / len(train_loader)))

Then we will create an instance of the CNN model, an SGD optimizer and a cross-entropy loss, and pass them to the training function.

from torch import optim

train_loader = torch.utils.data.DataLoader(cifar3, batch_size=64,
                shuffle=True)
model = Net()
optimizer = optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
training_net(
              n_epochs = 100,
              optimizer = optimizer,
              model = model,
              loss_fn = loss_fn,
              train_loader = train_loader,
              )

2021-08-27 15:35:11.218508 Epoch 1, Training loss 1.2309487659880456

2021-08-27 15:36:10.194867 Epoch 10, Training loss 0.46954463948594766

2021-08-27 15:37:15.662480 Epoch 20, Training loss 0.38312686726133877

2021-08-27 15:38:21.163712 Epoch 30, Training loss 0.3325024703081618

2021-08-27 15:39:26.853854 Epoch 40, Training loss 0.2940514350825168

2021-08-27 15:40:32.674722 Epoch 50, Training loss 0.265239320631991

2021-08-27 15:41:38.467511 Epoch 60, Training loss 0.24000025144282808

2021-08-27 15:42:44.337462 Epoch 70, Training loss 0.21803398712518368

2021-08-27 15:43:50.985756 Epoch 80, Training loss 0.197745847099639

2021-08-27 15:44:56.936097 Epoch 90, Training loss 0.18055609463060157

2021-08-27 15:46:02.644376 Epoch 100, Training loss 0.16271316761032065


Now we will measure the accuracy of our model against the validation data:

# build the data loaders for train and validation
train_loader = torch.utils.data.DataLoader(cifar3,     
                     batch_size=64,shuffle=False)
val_loader = torch.utils.data.DataLoader(cifar3_val, 
                     batch_size=64,shuffle=False)

# measure the accuracy using the validation dataset
def validate_net(model, train_loader, val_loader):
    for name, loader in [("train", train_loader), ("val", val_loader)]:
        correct = 0
        total = 0
        # don't want grads here
        with torch.no_grad():
          for imgs, labels in loader:
              outputs = model(imgs)
              # gives the index of the highest value of outputs
              _, predicted = torch.max(outputs, dim=1)
              # increased total with batch size
              total += labels.shape[0]
              # compare the predicted class (max score) with the correct label; the sum counts the matches
              correct += int((predicted == labels).sum())
        print(f"Accuracy {name}: {correct / total}")

validate_net(model, train_loader, val_loader)

Accuracy train: 0.9295

Accuracy val: 0.8713


Ok folks, that was a lot of information for one post. In the next one we will talk a little about:

how to store and retrieve the trained parameters of our neural network, methods to regularize our network to fight overfitting, and how to run our training on GPUs.

As always, the github link with the complete jupyter notebook for this neural network is attached, so that you can verify the code for yourself.


Your comments are appreciated, thanks.
