Neural Networks are an important area of Artificial Intelligence. They have wide application in areas like Computer Vision, Speech Recognition and Natural Language Processing. Deep Neural Networks have been observed to work extremely well in applications such as converting speech to text, recognising objects in images, self-driving cars and time series analysis. Their applications are increasing by the day.
From my recent studies, I have observed that programming a Neural Network follows the pattern below.
- Define the neural network structure (number of input units, number of hidden units, etc.)
- Initialise the model’s parameters
- Implement forward propagation
- Compute loss
- Implement backward propagation to get the gradients
- Update parameters (gradient descent)
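The six steps above can be sketched for the simplest possible network, a single sigmoid unit (logistic regression). This is a minimal illustration using NumPy, not a definitive implementation: the function name `train_logistic_unit`, its default arguments and the data layout (one example per column) are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_unit(X, y, learning_rate=0.1, num_iterations=1000, seed=0):
    """X has shape (n_features, m); y has shape (1, m) with 0/1 labels."""
    # 1. Define the structure: n_x input units feeding one sigmoid output unit.
    n_x, m = X.shape

    # 2. Initialise the parameters (small random weights, zero bias).
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((1, n_x)) * 0.01
    b = 0.0

    costs = []
    for _ in range(num_iterations):
        # 3. Forward propagation.
        A = sigmoid(W @ X + b)

        # 4. Compute the loss (cross-entropy); clip to avoid log(0).
        A_safe = np.clip(A, 1e-12, 1.0 - 1e-12)
        cost = -np.mean(y * np.log(A_safe) + (1 - y) * np.log(1 - A_safe))
        costs.append(float(cost))

        # 5. Backward propagation to get the gradients.
        dZ = A - y
        dW = (dZ @ X.T) / m
        db = float(np.sum(dZ)) / m

        # 6. Update the parameters (gradient descent).
        W -= learning_rate * dW
        b -= learning_rate * db

    return W, b, costs
```

Calling `train_logistic_unit(X, y)` returns the learned weights and bias together with the cost at every iteration, which is useful for checking that the cost is actually falling.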
Though this looks very simple, there is a considerable amount of work involved in programming Neural Networks, and it involves making some very critical decisions. Here are some guidelines I have learnt for making these decisions.
The first decision is the structure of the Neural Network.
Suppose we have a problem of recognising images. We can start with the simple problem of figuring out whether an image is that of a dog or not. Even for this simple problem, we have to consider whether to use a 2-layer neural network or a network with 3 or more layers.
A few decisions can be made based on the problem. For example, in this problem, we have to decide whether the image is that of a Dog or not. So, we want a single output from our neural network with two possible values: 1 if the image is that of a Dog and 0 if it is not. For making a binary decision like this, Logistic Regression is one of the best-suited algorithms. So, we could create a neural network as follows.
A useful guideline is to make a gut-feel decision and create an initial neural network.
To start training the model, we must set initial values for its parameters and Hyperparameters.
We see in the picture above that so far we have 3 Hyperparameters. To start training, we must first set the weights and bias. One common practice is to initialise the weights and bias to zero.
However, it is better to initialise the weights to small random values. With zero weights, every unit in a hidden layer computes the same output and receives the same gradient, so the units never learn different features.
In my observation, the model generally reaches a minimum much faster if the weights and bias are set to random values.
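To illustrate the difference between zero and random initialisation, here is a small sketch for a network with one hidden layer. The function name `initialise_parameters`, the scale factor of 0.01 and the choice of zero biases are assumptions made for this example; the text above does not prescribe them.

```python
import numpy as np

def initialise_parameters(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights and zero biases for a network with one hidden layer."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }

# Toy input: 3 features, 2 examples (one example per column).
X = np.array([[0.5, -1.0],
              [2.0, 0.3],
              [-0.7, 1.5]])

params = initialise_parameters(n_x=3, n_h=4, n_y=1)

# With random weights, each hidden unit produces a different activation.
hidden_random = np.tanh(params["W1"] @ X + params["b1"])

# With all-zero weights, every hidden unit produces exactly the same activation.
hidden_zero = np.tanh(np.zeros((4, 3)) @ X + np.zeros((4, 1)))
```

Every row of `hidden_zero` is identical, so the hidden units would also receive identical gradient updates. The rows of `hidden_random` differ, which lets each unit learn something different.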
The next decision is the value of the Learning Rate.
To train a Model, we will have to experiment with many values of the Learning Rate. I find it useful to start by setting the Learning Rate to a very small value like 0.001 and gradually increasing it in subsequent training runs. One point to note is that a small Learning Rate also implies that the Model must be trained over a large number of iterations.
It may be a good idea to start by setting the Learning Rate to 0.001 and the number of iterations to 25,000.
We must study the rate at which the cost is falling during gradient descent. If the cost is still decreasing steadily but has not reached a minimum within the chosen number of iterations, we can increase the number of iterations.
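The trade-off between Learning Rate and number of iterations can be seen on a toy problem. The sketch below runs gradient descent on the cost (w - 3)^2, which is a stand-in chosen purely for illustration and not a neural network cost. The function name `gradient_descent` and the learning rates tried are assumptions.

```python
def gradient_descent(learning_rate, num_iterations, w0=0.0):
    """Minimise the toy cost (w - 3)^2 and record the cost after every step."""
    w = w0
    costs = []
    for _ in range(num_iterations):
        grad = 2.0 * (w - 3.0)        # derivative of (w - 3)^2
        w -= learning_rate * grad     # gradient descent update
        costs.append((w - 3.0) ** 2)
    return w, costs

# Same iteration budget, different learning rates.
for lr in (0.001, 0.01, 0.1):
    w, costs = gradient_descent(lr, num_iterations=1000)
    print(f"learning rate {lr}: final cost {costs[-1]:.6f}")
```

With the smallest Learning Rate, the cost is still falling at the end of the run, which is the signal to increase the number of iterations or the Learning Rate.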
Another technique to consider here is that of Regularisation.
In regularisation, we calculate the cost function as follows.
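The formula itself is not reproduced in this text. Assuming L2 regularisation is meant (an assumption on my part, though it is the most common form that uses a lambda term), the regularised cost is usually written as below. Here m is the number of training examples, L the number of layers, W^{[l]} the weight matrix of layer l, and \mathcal{L} the per-example loss; these symbols are introduced for this illustration.

```latex
J_{\text{regularised}}
  = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\left(\hat{y}^{(i)}, y^{(i)}\right)
  + \frac{\lambda}{2m} \sum_{l=1}^{L} \left\lVert W^{[l]} \right\rVert_F^{2}
```

The second term penalises large weights, and lambda controls how strongly they are penalised.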
So, we now have another hyperparameter: lambda. We have to experiment with different values of lambda so that the model generalises well while gradient descent still moves towards a minimum. Some of the other tuning can be eased by established techniques, such as Xavier Initialisation for the weights and the Adam optimiser for the parameter updates.
So far, I have found it useful to start with a value of 0.1 for lambda.
Now, we may find that the initial model is not predicting very accurately. One of the obvious decisions could be to change the model. However, before we do that, we should do the following.
- Determine the accuracy with which humans could perform the same task. Let us say that a human can detect images of a Dog with a 0.5% error rate (humans can also make a few mistakes if the image is very blurred).
- Determine the accuracy with which the Model is performing on the Training Set.
- Determine the accuracy with which the Model is performing on the Test Set.
Take a case where the Model is predicting with an 8% error rate on the Training Set and a 10% error rate on the Test Set. In this case, we would say that the model has high bias and low variance. In simple English, the Model is underfitting the Training Set but generalises fairly well. In this case, we may consider a deeper Neural Network.
Take a case where the Model is predicting with a 2% error rate on the Training Set and a 12% error rate on the Test Set. In this case, we would say that the model has low bias and high variance. In simple English, the Model is overfitting the Training Set and does not generalise. In this case, we must consider training the network with more data and/or using regularisation.
Take a case where the Model is predicting with a 6% error rate on the Training Set and a 12% error rate on the Test Set. In this case, we would say that the model has high bias and high variance. In simple English, the Model is underfitting the Training Set and does not generalise. This could be a case of data mismatch. In this case, we should look at relabelling our test data: while preparing the test data, we may have labelled some examples incorrectly, and this needs correcting. The same can be true of the training data. However, relabelling the training data can be very expensive, as the training set is normally very large.
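The three cases above can be turned into a simple check. In this sketch, bias is judged by the gap between training error and human error, and variance by the gap between test error and training error. The function name `diagnose` and the 3% threshold are hypothetical, chosen only so that the three cases above come out as described.

```python
def diagnose(train_error, test_error, human_error=0.005, threshold=0.03):
    """Classify bias and variance from error rates given as fractions (0.08 = 8%)."""
    # Bias: how far the training error is above human-level error.
    bias_gap = train_error - human_error
    # Variance: how much worse the model does on unseen data.
    variance_gap = test_error - train_error

    bias = "high bias" if bias_gap > threshold else "low bias"
    variance = "high variance" if variance_gap > threshold else "low variance"
    return bias, variance

# The three cases from the text, with human error at 0.5%.
for train, test in [(0.08, 0.10), (0.02, 0.12), (0.06, 0.12)]:
    print(train, test, diagnose(train, test))
```

In practice the threshold is a judgement call that depends on the task; the point is to compare the two gaps before deciding whether to change the model, add data or add regularisation.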
Though this can be a good starting point, we cannot expect this neural network to perform very well. So, we may want to create a deeper neural network. Suppose we consider creating the L-layer neural network below.
We now need to decide the activation function for each layer.
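One common choice, shown here only as an example and not as the choice made for the network above, is ReLU for the hidden layers and sigmoid for the output layer of a binary classifier. The sketch below runs forward propagation through an L-layer network with that choice. The function name `forward` and the layer sizes in the demo are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, weights, biases):
    """Forward pass: ReLU on every hidden layer, sigmoid on the output layer."""
    A = X
    last = len(weights) - 1
    for layer, (W, b) in enumerate(zip(weights, biases)):
        Z = W @ A + b
        A = sigmoid(Z) if layer == last else relu(Z)
    return A

# Demo: 3 input units, hidden layers of 4 and 2 units, 1 output unit.
rng = np.random.default_rng(0)
sizes = [3, 4, 2, 1]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) * 0.1 for i in range(3)]
biases = [np.zeros((sizes[i + 1], 1)) for i in range(3)]
X = rng.standard_normal((3, 5))      # 5 examples, one per column
probabilities = forward(X, weights, biases)
```

The sigmoid output lies strictly between 0 and 1, so it can be read as the probability that the image is a Dog.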
Another aspect of decision making is using the available data.
In the modern context, the data available for training Neural Networks normally runs into millions of examples. So, instead of dividing this data in the traditional 60% for Training, 20% for Test and 20% for Validation, we may devote 98% of the data to Training and 1% each to Test and Validation. With this distribution, we will still have 10,000 data points for Test and 10,000 data points for Validation (provided we started with 1 million data points).
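The arithmetic of such a split can be sketched as follows. With 1% each for Validation and Test, 1 million data points leave 980,000 for Training. The function names `split_sizes` and `split_indices` are assumptions for this example, and the shuffle uses Python's standard `random` module.

```python
import random

def split_sizes(n_examples, val_fraction=0.01, test_fraction=0.01):
    """Return (n_train, n_val, n_test) for the given fractions."""
    n_val = round(n_examples * val_fraction)
    n_test = round(n_examples * test_fraction)
    n_train = n_examples - n_val - n_test
    return n_train, n_val, n_test

def split_indices(n_examples, val_fraction=0.01, test_fraction=0.01, seed=0):
    """Shuffle the example indices and cut them into train, validation and test."""
    n_train, n_val, _ = split_sizes(n_examples, val_fraction, test_fraction)
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)   # shuffle so each set is a random sample
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

print(split_sizes(1_000_000))
```

Shuffling before cutting matters: if the data is ordered (for example, all Dog images first), an unshuffled split would give Test and Validation sets that do not resemble the Training Set.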
Now, when so much data is available for training the model, processing all of it in one iteration needs a very powerful machine, and even then the processing time will be enormous.
A useful technique is to create mini-batches of the training data and train the Model on the mini-batches instead of on a single batch. The mini-batch size is therefore also a hyperparameter.
Mini-batch sizes are usually chosen as powers of two; 64 or 128 examples per mini-batch are common starting points.
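Creating mini-batches can be sketched as below, assuming the training data is stored with one example per column. The function name `make_mini_batches` and the default `batch_size` of 64 are assumptions for this example.

```python
import numpy as np

def make_mini_batches(X, y, batch_size=64, seed=0):
    """Shuffle the examples (columns) and cut them into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)

    # Apply the same shuffle to inputs and labels so they stay aligned.
    X_shuffled = X[:, permutation]
    y_shuffled = y[:, permutation]

    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size     # the last batch may be smaller
        batches.append((X_shuffled[:, start:end], y_shuffled[:, start:end]))
    return batches

# Demo: 150 examples with 3 features each.
X = np.arange(450, dtype=float).reshape(3, 150)
y = (np.arange(150) % 2).reshape(1, 150)
batches = make_mini_batches(X, y, batch_size=64)
```

One pass over all the mini-batches is one epoch. The parameters are updated once per mini-batch, so the model starts improving long before it has seen the whole Training Set.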