AlexNet is a landmark neural network that first demonstrated that a CNN (Convolutional Neural Network) can outperform other types of neural networks at image classification. You can think of it as the ancestor of most of the CNN-based networks in use today, so it is worth understanding in detail.
Highlights of Architecture
Why ReLU ?
Why ReLU (a non-saturating function) rather than tanh (a saturating function) ? Because ReLU is observed to learn several times faster than tanh (shown in Figure 1 of Ref ).
ReLUs do not require input normalization to prevent them from saturating.
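The difference between a saturating and a non-saturating function can be seen by comparing gradients. The following is a minimal sketch in plain Python, not code from the paper; the function names and the sample inputs are chosen here for illustration only. For large positive inputs the tanh gradient shrinks toward zero, while the ReLU gradient stays at 1.

```python
import math

def relu(x):
    """ReLU activation: passes positive inputs through, clips the rest to 0."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: constant 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Gradient of tanh: 1 - tanh(x)^2, which approaches 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2

# Illustrative inputs (an assumption of this sketch, not values from the paper).
for x in [0.5, 2.0, 5.0, 10.0]:
    print(f"x={x:5.1f}  tanh'={tanh_grad(x):.8f}  relu'={relu_grad(x):.1f}")
```

A vanishing gradient means a vanishing weight update, which is one way to read why a saturating unit learns more slowly.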
Why use Dropout ?
It is to reduce overfitting in the fully-connected layers.
How to process the training images to fit the input dimension ?
The training images do not all have the same size as the network input. So the authors rescaled each image so that the shorter side has length 256, and then cropped out the central 256x256 patch from the rescaled image. They trained the network on the raw RGB values of the pixels.
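The rescale-then-crop step comes down to a little arithmetic on the image dimensions. The sketch below only computes the target size and the crop box; the helper names `resize_dims` and `center_crop_box` are hypothetical, and the actual pixel resampling would be done by whatever image library is in use.

```python
def resize_dims(width, height, target=256):
    """Return (new_width, new_height) so the shorter side equals `target`,
    keeping the aspect ratio."""
    scale = target / min(width, height)
    # max() guards against floating-point rounding pushing a side below target.
    return max(target, round(width * scale)), max(target, round(height * scale))

def center_crop_box(width, height, size=256):
    """Return (left, top, right, bottom) of the central size x size patch."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size

# Example with an assumed 500x375 landscape image.
new_w, new_h = resize_dims(500, 375)
print(new_w, new_h)                    # 341 256
print(center_crop_box(new_w, new_h))   # (42, 0, 298, 256)
```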
Why CNN rather than a standard feedforward network ?
Theoretically, a standard feedforward network can solve any type of classification problem if enough neurons are provided. In practice, however, we do not know exactly how many neurons are enough for our application, and we do not know whether that many neurons can be trained with reasonable, practical computing power.
The paper (Ref ) puts it as follows :
Compared to standard feedforward neural networks with similarly-sized layers, CNNs have much fewer connections and parameters and so they are easier to train, while their theoretically-best performance is likely to be only slightly worse.
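To get a feel for "much fewer connections and parameters", the parameter count of one convolutional layer can be compared with that of a fully-connected layer mapping the same input to the same number of outputs. This is a back-of-the-envelope sketch; the layer sizes (a 224x224x3 input, 96 kernels of size 11x11, a 55x55 output map) are assumptions chosen here to resemble a first convolutional layer, not figures taken from the text above.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weights plus one bias per kernel; kernels are shared across positions."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels

def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair, plus one bias per output."""
    return (n_inputs + 1) * n_outputs

# Assumed, illustrative layer sizes.
n_inputs = 224 * 224 * 3      # flattened input image
n_outputs = 55 * 55 * 96      # flattened output feature maps

conv = conv_params(11, 11, 3, 96)
dense = dense_params(n_inputs, n_outputs)

print(f"convolutional layer : {conv:,} parameters")
print(f"fully-connected     : {dense:,} parameters")
print(f"ratio               : about {dense // conv:,}x")
```

Under these assumptions the convolutional layer needs tens of thousands of parameters, while the equivalent fully-connected layer would need tens of billions.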
Fighting against Overfitting
As you may know, overfitting is a recurring problem in most neural-network-based machine learning. According to the paper (Ref ), a few common techniques are used to tackle overfitting, as summarized below.
Data Augmentation : This transforms each image so that it becomes slightly different from the original, but not so different that it falls out of the category it is labeled with. The transformed images are then added to the training data, so the training set becomes larger than the original one. This paper (Ref ) uses two types of data augmentation: generating image translations and horizontal reflections, and altering the intensities of the RGB channels in the training images.
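A label-preserving transform can be as simple as a random crop combined with a horizontal flip. The following is a minimal sketch in plain Python that treats an image as a list of rows of pixels; the function names, the tiny 8x8 image and the 6x6 patch size are assumptions made for illustration and are not the paper's implementation.

```python
import random

def random_crop(image, size, rng):
    """Cut a size x size patch at a random position (a random translation)."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)    # randint is inclusive on both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]

def horizontal_flip(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]

def augment(image, size, rng):
    """Random crop, then a horizontal flip with probability 0.5."""
    patch = random_crop(image, size, rng)
    if rng.random() < 0.5:
        patch = horizontal_flip(patch)
    return patch

# Toy 8x8 "image" whose pixels record their own (row, column) position.
rng = random.Random(0)
image = [[(r, c) for c in range(8)] for r in range(8)]
patch = augment(image, 6, rng)
print(len(patch), len(patch[0]))   # 6 6
```

Each call to `augment` can yield a different patch from the same source image, which is how the effective training set grows without any new labels.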
Dropout : This technique removes each hidden-layer neuron with a certain probability. This does not mean that we physically remove those neurons. We make them act as if they were removed by setting their outputs to zero. According to this paper (Ref ), the neurons which are "dropped out" in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights.
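The dropout mechanism can be sketched as a random 0/1 mask applied in the forward pass and reused in the backward pass. This is a minimal illustration in plain Python, not the authors' code. The function names are hypothetical, and scaling the outputs by the keep probability at test time is one common convention assumed here.

```python
import random

def dropout_forward(activations, drop_prob, rng, training=True):
    """Zero each activation with probability drop_prob during training.

    Returns (outputs, mask). At test time no neuron is dropped; outputs are
    scaled by the keep probability instead (an assumed convention).
    """
    if not training:
        keep = 1.0 - drop_prob
        return [a * keep for a in activations], None
    mask = [0.0 if rng.random() < drop_prob else 1.0 for _ in activations]
    return [a * m for a, m in zip(activations, mask)], mask

def dropout_backward(upstream_grads, mask):
    """Dropped neurons receive zero gradient, so they skip backpropagation."""
    return [g * m for g, m in zip(upstream_grads, mask)]

rng = random.Random(0)
acts = [1.0, 2.0, 3.0, 4.0]
out, mask = dropout_forward(acts, 0.5, rng)
grads = dropout_backward([1.0, 1.0, 1.0, 1.0], mask)
print(out, grads)
```

Because a fresh mask is drawn on every call, each input effectively sees a different sub-network, while the underlying weights stay shared.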
by Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton