
Neural Networks from the Beginning

Updated: Aug 1, 2021

After reading several books that cover neural networks at different levels of complexity, I decided to put together this post for those who want to understand the topic from scratch, or rather starting from an understanding of a linear regression model (complete reading here), which we will quickly review below.


Reviewing linear regression

X = our dataset

x1, x2, x3, ... = the features of the dataset

w1, w2, w3, ... = the weights (variable values that multiply each x)

b = the bias (a constant value added to Σ w*x)



The output of the function is a value y, our prediction of the target, which we must compare with the real value. These differences are measured by a loss function, and training aims to minimize that error.
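As a refresher, the prediction for a single sample is just the weighted sum of its features plus the bias. A minimal sketch in Python (the feature and weight values below are made up purely for illustration):

import numpy as np

x = np.array([2.0, 3.0, 5.0])    # one sample with 3 features (x1, x2, x3)
w = np.array([0.4, -0.2, 0.1])   # one weight per feature (w1, w2, w3)
b = 1.5                          # bias

y_pred = np.dot(w, x) + b        # y = w1*x1 + w2*x2 + w3*x3 + b
print(y_pred)                    # ≈ 2.2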



In linear regression we use Least Squares to find the best-fitting line, the one that minimizes the prediction error for the observed data.




ok, but what does this have to do with neural networks?


Neural Network fundamentals


A neural network is a series of algorithms that seeks to identify relationships in a dataset through a process that mimics the way the human brain works.




We use a linear regression function inside the artificial neuron (called a perceptron)




Some features of neural networks:

  • they are built from layers made up of a series of artificial neurons, with a minimum of 3 layers (input, hidden and output)

  • a network is a computational graph through which multidimensional arrays flow

  • a network is a universal function approximator that can represent the solution of any supervised learning problem

ok, so each neuron of our hidden layer applies a linear regression function to the input values, producing a prediction value (P) as output.

up to this point we have made a forward pass.


Then these predictions (P), together with the real target values (y), become the input of a function that measures the prediction error, which we want to minimize. We will call this the loss function.


In this case we use MSE (mean squared error) instead of LSE (least squares error) as the loss function.

MSE works fine when errors are small, helping the model converge to the minimum efficiently, but it is sensitive to outliers (large errors), giving them a relatively higher weight (penalty). Other metrics, such as RMSE or MAE, are also available.
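To make the difference concrete, here is a small sketch of the three metrics on made-up predictions (the numbers are only for illustration):

import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 6.0])

errors = y_pred - y_true
mse = np.mean(errors ** 2)       # squares each error, so large errors (outliers) weigh more
rmse = np.sqrt(mse)              # same units as the target
mae = np.mean(np.abs(errors))    # linear penalty, more robust to outliers
print(mse, rmse, mae)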


(for a better understanding of the loss functions that can be used, I recommend this post)


From the value of the loss function and the current weights, we compute gradients that tell us how to adjust each weight in order to move toward a global minimum.

Updating the weights with these gradients is what we will call a backward pass.


To visualize gradient descent, imagine an MSE loss function (y) with a target of 0 and a single weight (w). We compute the derivative of y with respect to w (dy/dw), which gives us the slope.

For example, dy/dw is positive when w is greater than 0; a positive dy/dw means that a positive step in w will result in a positive change in y.


To lead to a decrease in loss, a negative step in the w direction is needed: w' <------- negative step ------- w


w' = w - (+ (dy/dw) )


And when w is smaller than 0, dy/dw is negative, so the same rule takes a positive step => w' = w - ( - (dy/dw) )


Now that our weights have been updated by the backward pass, we can make a new forward pass, and so on, until we obtain a model that fits our optimum. Basically we train the regression with iterative, gradient-based methods, going down the slope (descent) toward some global minimum of the error.
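As a toy illustration of those sign rules, here is a one-weight gradient descent on the loss y = w² (MSE with a target of 0); the step size of 0.1 is an arbitrary choice:

# toy example: minimize y = w**2, whose derivative is dy/dw = 2*w
w = 3.0
step_size = 0.1                 # a.k.a. learning rate (arbitrary small value)
for i in range(20):
    grad = 2 * w                # dy/dw at the current w
    w = w - step_size * grad    # negative step when grad > 0, positive step when grad < 0
print(w)                        # close to 0, the minimum of y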


Summary of steps to train the model

  1. Select a batch of data (x, w)

  2. Execute the forward pass of the model

  3. Execute the backward pass of the model, using the information computed in the forward pass

  4. Use the gradient calculated in the backward pass to update the weights (also called parameters in deep learning)

Hyperparameters for the learning algorithm (tuning the model)

  • Batch: one or more samples considered by the model within an epoch before the weights are updated

  • Epoch: comprises one or more batches. One epoch means that every sample of the dataset has had the opportunity to update the internal parameters (weights) of the model

  • Optimizer: the optimization algorithm used to train the network. It must find a set of internal model parameters that performs well against some performance metric (MSE, RMSE, etc.). This algorithm, gradient descent, is iterative and is made up of the steps explained above. In a NN, backpropagation is used to compute the gradients that update the weights. The simplest optimizer to use is SGD (stochastic gradient descent)

  • Learning rate: the value that determines how large each step taken by the optimization algorithm is. If it's too small, the updates can get stuck in a local minimum, while if it's too large we risk repeatedly overshooting the global minimum


Batches and optimizers

A training dataset can be divided into 1 or more batches.

In the case of the Stochastic GD optimizer, the batch size is 1 sample.

If the batch size were the entire dataset, we would be using Batch GD, and if it were greater than 1 and smaller than the whole dataset, we would be using Mini-Batch GD, with typical values of 32, 64 or 128.
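For example, with the 354 training samples used later in this post, the number of weight updates per epoch depends directly on the batch size; a quick sketch:

import math

n_samples = 354
for batch_size in (1, 32, 354):              # SGD, mini-batch GD, batch GD
    updates_per_epoch = math.ceil(n_samples / batch_size)
    print(batch_size, updates_per_epoch)     # 354, 12 and 1 updates per epoch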


Non linear relationships


So far we have seen that our neural network is composed only of linear regression functions, but what if one or more of our most important features for prediction has a non-linear relationship with the target?

Suppose we fit this linear regression model on our data, and the RMSE of this model is 5.05.

If we divide this number by the mean value of the target, we get a measure of how far off a prediction is (on average) from its actual value.

The mean of y_test is 24.08, so 5.05 / 24.08 ≈ 21%
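That percentage is simply the RMSE divided by the mean of the test targets, using the numbers quoted above:

rmse = 5.05
y_test_mean = 24.08
print(rmse / y_test_mean)   # ~0.21, i.e. predictions are off by about 21% on average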


It's important to scale each feature of our data to have mean 0 and standard deviation 1 (StandardScaler). A benefit of doing this is that we can interpret the absolute values of the coefficients (weights) as indicating the importance of the different features (larger = more important).

So, suppose that we have 10 features and the most important has the following graph:


This feature is strongly related to the target, but has a non-linear relationship.

So if we keep only the linear regression in our model, we will be losing very important details in its learning cycle.

For this reason we must build a more complex model that includes these non-linear relationships of our features with the target.


Our first Neural Network


In this section we will explain the neural network with a single hidden layer that we will create for our sample data obtained from the Boston Houses dataset (included in sklearn).


In our case there are 13 features in the dataset, so our Input layer will have 13 inputs (one per feature), the Hidden layer (13 neurons) will have 13 inputs and, therefore, 13 outputs, and finally the Output layer will have 13 inputs and a single output value.
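As a quick sanity check of this architecture, we can count the trainable parameters (weights and biases) implied by those layer sizes:

hidden_params = 13 * 13 + 13           # hidden layer: 13x13 weights + 13 biases = 182
output_params = 13 * 1 + 1             # output layer: 13 weights + 1 bias = 14
print(hidden_params + output_params)   # 196 trainable parameters in total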






The main idea is that we will perform many linear regressions, then send the results through a non-linear function, and finally do one last linear regression that produces the final predictions.


Step 1 - A bunch of linear regressions

Our input data has a shape of [batch_size (our samples), num_features (13)].

Here we multiply our input by a matrix of weights with dimensions [num_features (13), num_outputs], resulting in an output of dimensions [batch_size, num_outputs].

Now for each sample we have num_outputs different weighted sums of the original features; think of each weighted sum as a learned feature.
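In code, that matrix multiplication looks roughly like this (random data, shapes only, using the same dimensions as above):

import torch

batch_size, num_features, num_outputs = 32, 13, 13
X = torch.randn(batch_size, num_features)    # [32, 13] input batch
W = torch.randn(num_features, num_outputs)   # [13, 13] weight matrix
B = torch.randn(num_outputs)                 # one bias per output

weighted_sums = X @ W + B                    # [32, 13], one "learned feature" per output
print(weighted_sums.shape)                   # torch.Size([32, 13])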


Step 2 - A non-linear function (activation function)

Now we feed each of these weighted sums through a non-linear function, in our case a sigmoid function.


Sigmoid is one of the most common activation functions used in NN.

It squashes any input into the range 0 to 1, where large positive values converge to 1 and large negative values converge to 0.

There are various activation functions such as ReLU or Tanh.
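The sigmoid itself is just σ(x) = 1 / (1 + e^(-x)); a quick check of its squashing behaviour:

import torch

x = torch.tensor([-10.0, -1.0, 0.0, 1.0, 10.0])
print(torch.sigmoid(x))   # ~[0.0000, 0.2689, 0.5000, 0.7311, 1.0000]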


In this step we enable our NN to model non-linear relationships between the features and the target.

A nice view of what a neuron looks like with its linear regression (step 1) and its non-linear activation function (step 2) is shown below:


Step 3 - Another linear regression

Now we take the 13 values emitted by the layer built in steps 1 and 2 (linear regression + activation function) and feed them into one last linear regression that gives us a single value as a result.

Finally we will train our first neural network and evaluate the metric to see if it improves its value with respect to the linear regression model.


OK, we check the RMSE metric and get 3.67!, better than the 5.05 of the previous model.

But what are the reasons for this improvement?

1st. By adding the non-linear function, we allow our model to better learn the most important feature.



This relation is now non-linear and closer to the target.





2nd. The NN can learn relationships between combinations of our features and the target, unlike the linear regression model, which only considers individual features. Our NN performs a matrix multiplication and creates 13 learned features (combinations of the original features).

So, we have 2 improvements:

  1. Learning nonlinear relationships

  2. Learning relationships between combinations of individual features against the target

Using Pytorch to create our first real NN


Pytorch is a great python library for neural networks and deep learning.

Obviously here we will not explain in detail the characteristics of this library, so the task of understanding the main details of how this library works, as well as a better understanding of Tensors, is left to the reader.

I highly recommend the book Deep Learning with PyTorch (Manning), especially part 1 (chapters 1 to 8).


we start by loading the data and scaling it

# Data load
from sklearn.datasets import load_boston
houses = load_boston()
data = houses.data
features = houses.feature_names
target = houses.target

# Data preparation
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
data = s.fit_transform(data)

then we split the data into train and test datasets

# Build train, test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3)
# reshape y to 1 column
y_train, y_test = y_train.reshape(-1, 1), y_test.reshape(-1, 1)

Ok, let's start the code to build our model with the help of Pytorch. PyTorch has the ability to define models and layers as easy-to-use objects that handle sending gradients backward and storing parameters automatically, simply by having them inherit from the torch.nn.Module class.

We'll create a base class to model our neural network with a forward method, then we'll create a subclass to apply our house price prediction model, which will inherit from our base class.


from torch import nn, Tensor, sigmoid
from torch.optim.optimizer import Optimizer
import torch.optim as optim
from torch.nn.modules.loss import _Loss
import numpy as np
from typing import Tuple

# helper that puts a module (and its children) in evaluation mode
def inference_mode(m: nn.Module):
    m.eval()

# PyTorch base model class
class PyTorchModel(nn.Module):
    def __init__(self) -> None:
        super().__init__()

    def forward(self, x: Tensor, inference: bool = False) -> Tensor:
        # subclasses must override this; the base class only handles
        # switching the network to evaluation mode for inference
        if inference:
            self.apply(inference_mode)
        raise NotImplementedError()

One minute please, what does the inference_mode method do?

A neural network can be run in training mode or in evaluation mode on an already trained model (inference), so PyTorch gives us the method nn.Module.eval(). Inference consists of putting into practice what the network has learned during training. In that mode we also turn off layers that behave differently at train time, such as dropout or batch normalization.

In our case we code a simple model that doesn't have any of these layers.
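As a minimal illustration of that switch, assuming a hypothetical model that does include a dropout layer (ours does not):

from torch import nn

m = nn.Sequential(nn.Linear(13, 13), nn.Dropout(p=0.5), nn.Linear(13, 1))
m.eval()    # inference mode: dropout is disabled, outputs become deterministic
m.train()   # back to training mode: dropout is active again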


Ok, next our house price prediction NN:

# NN model to predict house prices - subclass of the base model
class HousePricesModel(PyTorchModel):
    # default of 13 features for the output of the hidden layer
    # and the input of the output layer
    def __init__(self, hidden_size: int = 13):
        super().__init__()
        self.fc1 = nn.Linear(13, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x: Tensor) -> Tensor:
        # expect a 2D batch of shape [batch_size, 13]
        assert x.dim() == 2
        assert x.shape[1] == 13

        x = self.fc1(x)
        x = sigmoid(x)
        return self.fc2(x)

So, we pass our X data tensor (with 13 features) to the first linear regression (self.fc1), then apply a sigmoid non-linear (activation) function to its output. Finally, we pass the result to a second linear regression (the output layer).

we create an object of our new class and see its result:

pt_houses_model = HousePricesModel()
pt_houses_model

HousePricesModel(
  (fc1): Linear(in_features=13, out_features=13, bias=True)
  (fc2): Linear(in_features=13, out_features=1, bias=True)
)


Now we create a class that will train our model, with a fit method where the learning of the NN is executed for n iterations (epochs). Another point to clarify is that we use batches of 32 samples (_generate_batches method), so since our training set has 354 samples we will have 11 batches of 32 (plus 1 batch of only 2 samples).

Those batches train the model in each epoch. Also, before each epoch the data is shuffled (the permute_data helper, sketched just below).
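The permute_data helper isn't shown in the original snippet; a minimal sketch, assuming we shuffle both tensors with the same random permutation via torch.randperm, could be:

import torch
from torch import Tensor
from typing import Tuple

def permute_data(X: Tensor, y: Tensor) -> Tuple[Tensor, Tensor]:
    # reorder X and y with the same random permutation so pairs stay aligned
    perm = torch.randperm(X.shape[0])
    return X[perm], y[perm]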



# Build a trainer class
class PyTorchTrainer(object):
    def __init__(self, model: PyTorchModel, optim: Optimizer, criterion: _Loss):
        self.model = model
        self.optim = optim
        self.loss = criterion
        self._check_optim_net_aligned()

    # check that the parameters the Optimizer refers to
    # are in fact the same as the model's parameters
    def _check_optim_net_aligned(self):
        assert self.optim.param_groups[0]['params'] \
               == list(self.model.parameters())

    def _generate_batches(self,
                          X: Tensor,
                          y: Tensor,
                          size: int = 32) -> Tuple[Tensor, Tensor]:
        N = X.shape[0]
        for ii in range(0, N, size):
            X_batch, y_batch = X[ii:ii+size], y[ii:ii+size]
            yield X_batch, y_batch

    def fit(self, X_train: Tensor, y_train: Tensor,
            X_test: Tensor, y_test: Tensor,
            epochs: int = 50,
            batch_size: int = 32):

        for e in range(epochs):

            # On each epoch we shuffle the array elements randomly
            X_train, y_train = permute_data(X_train, y_train)
            # Generate the batches of data
            batch_generator = self._generate_batches(X_train, y_train,
                                                     batch_size)

            # On each batch of data, backpropagate and perform the gradient descent
            for ii, (X_batch, y_batch) in enumerate(batch_generator):

                # Because the Optimizer retains the parameter gradients (param_grads)
                # after each iteration, set the previously calculated gradients to 0
                self.optim.zero_grad()
                # Run the forward pass on the batch
                output = self.model(X_batch)
                # Feed the outputs and targets into the loss function, compute the loss
                loss = self.loss(output, y_batch)
                # Compute the loss gradient with respect to all of the parameters (backpropagation)
                loss.backward()
                # Use the Optimizer to update the parameters according to its rule
                self.optim.step()

            # Evaluate on the test set and print the loss of the epoch
            output = self.model(X_test)
            loss = self.loss(output, y_test)
            print(e, loss)

Before running our trainer, we must create:

  • An instance of our NN model (HousePricesModel)

  • An instance of Optimizer

  • An instance of Loss function

  • Transform our train and test data to Tensors

# Pass our data to Tensors
X_train, X_test, y_train, y_test = (Tensor(X_train), Tensor(X_test),
                                    Tensor(y_train), Tensor(y_test))
# build an object model
nn_obj = HousePricesModel()
# build an SGD optimizer with learning rate=0.001
optimizer = optim.SGD(nn_obj.parameters(), lr=0.001)
# Loss function MSE
criterion = nn.MSELoss()

We will print the results of the loss function in each epoch. Let's try our trainer and see what results it gives us after executing 200 epochs:

trainer = PyTorchTrainer(nn_obj, optimizer, criterion)
trainer.fit(X_train, y_train, X_test, y_test, epochs=200)

First 10 epochs

...

Last 10 epochs.

We obtained a better MSE metric, 21.30, against the 25.48 of our model with a single linear regression. It doesn't seem like much, but we still have room to optimize our NN by fine-tuning the model, or perhaps adding new layers.


Ok folks, I hope this is a good starting point for your understanding of a neural network. Let's keep in mind that when we talk about deep learning we need at least 2 hidden layers, and we have only seen the construction of a model with one; of course, this was done to keep the difficulty level as low as possible.

Neural networks and deep learning are tricky, so understanding each topic well, and slowly, is essential; take time to practice each step of your learning with code.


As always, the GitHub link with the complete neural network Jupyter notebook is attached, so that you can verify the code for yourself.

Your comments are appreciated, thanks.







