APTOS
In this competition, we used computer vision and deep learning techniques to diagnose diabetic retinopathy from fundus photographs of patients who may or may not be diabetic. Below are some of the important high-level aspects of convolutional neural networks needed to accomplish this task.
Convolution
- In convolution, we apply something called a "filter" (or kernel), which is essentially a small matrix; a classic example is a 3 X 3 vertical edge detector with ones in the first column, zeros in the middle column, and negative ones in the third column. The convolution operation places the filter "on top" of the input matrix that represents the image, multiplies each number in the input matrix by the filter number that lines up with it, takes the sum of all those products (order does not matter here), and puts the resulting number in a new matrix, which you can think of as another image.
- We repeat this multiply-and-sum operation by "shifting" the filter across the input image, one cell at a time.
- The filter we use in convolution operations could be one that we come up with by hand or one that computer vision researchers have commonly used, such as Sobel or Scharr filters, but it is probably more effective to treat the numbers in the filter as parameters to be learned through backpropagation so that the neural network can learn the optimal filter to be used for the task.
- When performing convolution on volumes (e.g. RGB color values instead of grayscale), we use the same technique, but each step of the process now spans several channels. In other words, an RGB image gives us 3 separate 2D input matrices, and the filter has an f X f slice for each of those 3 channels. We perform the elementwise multiplication and sum for each of the 3 input matrices, add the 3 resulting values together, and store that value in the corresponding output cell. The output matrix, unlike the input and the filter, is only 2-dimensional: for an RGB image that is 6 by 6 pixels, we have 3 6X6 input matrices and a 3X3X3 filter, but only 1 4X4 output matrix (assuming no padding and a stride of 1). A small NumPy sketch after this list illustrates the operation.
- Convolution on volumes allows you to not only include more channels, as in the case of RGB images, but it also lets you apply as many filters as you wish, where each filter will contribute another layer of depth to the output. In other words, you could apply both a horizontal edge detector and a vertical edge detector to a 6X6 RGB image and you would have 3 6X6 matrices for input, 2 separate 3X3X3 filters, and your output would be 2 4X4 matrices instead of just 1 4X4 matrix.
- Convolution has two mechanisms that make it much easier to train and less prone to overfitting compared to fully connected implementations: parameter sharing and sparsity of connections. Parameter sharing refers to the fact that a feature detector (such as a vertical edge detector) that is useful in one part of the image is probably useful in another part of the image as well. Sparsity of connections refers to the fact that in each layer, the output depends only on a small number of inputs because our calculations use the filter instead of the entire input image. Convolution also tends to capture translation invariance, such that the model will recognize an object even when it is shifted by a few pixels.
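To make the arithmetic above concrete, here is a minimal NumPy sketch of convolution over a volume with no padding and a stride of 1 (strictly speaking this is cross-correlation, which is what deep learning frameworks implement under the name "convolution"). The array names and sizes are illustrative, not from the competition code.

```python
import numpy as np

def conv2d_volume(image, filters, stride=1):
    """Naive convolution over a volume.

    image:   (H, W, C) input, e.g. a 6x6x3 RGB image
    filters: (f, f, C, K) stack of K filters, e.g. 3x3x3x2
    returns: (H_out, W_out, K) output volume
    """
    H, W, C = image.shape
    f, _, _, K = filters.shape
    H_out = (H - f) // stride + 1
    W_out = (W - f) // stride + 1
    out = np.zeros((H_out, W_out, K))
    for k in range(K):                      # one output channel per filter
        for i in range(H_out):
            for j in range(W_out):
                patch = image[i*stride:i*stride+f, j*stride:j*stride+f, :]
                # elementwise multiply and sum across height, width, and channels
                out[i, j, k] = np.sum(patch * filters[:, :, :, k])
    return out

# 6x6 RGB image with a vertical and a horizontal edge detector -> 4x4x2 output
image = np.random.rand(6, 6, 3)
vertical = np.repeat(np.array([[1., 0., -1.]] * 3)[:, :, None], 3, axis=2)
horizontal = np.transpose(vertical, (1, 0, 2))
filters = np.stack([vertical, horizontal], axis=-1)   # shape (3, 3, 3, 2)
print(conv2d_volume(image, filters).shape)            # (4, 4, 2)
```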
Padding
- By using padding, we can overcome two of the more severe limitations of the convolution operation, namely that 1) the input image shrinks through successive convolutional layers and 2) the pixels at the edges or corners of the image are used less often and therefore have less influence, even if they are important aspects of the image.
- Padding adds a border of extra pixels around the edges and corners of the image so that the result of the convolution operation can keep the same dimensions as the input image, and so that the pixels at the edges are used in the convolution calculations about as frequently as those in the center of the image.
- By convention, we pad with zeros so that the additional pixels added to the image will all have a value of zero.
- To get the output to have the "same" dimensionality as the input, we have to use padding p = (f-1)/2, where f is the dimensionality of the filter being used in the convolution operation. For a 3X3 filter, padding p = (3-1)/2 = 1; for a 5X5 filter, padding p = (5-1)/2 = 2.
- By convention in computer vision (according to Andrew Ng in his deeplearning.ai course on convolutional neural networks on Coursera), f, the filter dimension, is always ODD, so the padding will never be a half-pixel value. For example, if we used a 4X4 filter, padding p = (4-1)/2 = 1.5, which does not make sense in terms of adding 1.5 pixels around the input image. We could pad asymmetrically (more on one side than the other) to get around the half-pixel problem, but this can create problems of its own. Odd-dimensioned filters also have the nice property that there is a central pixel.
Strides
- Stride refers to the number of pixels or cells that we shift the filter by as we perform the convolution calculations. Together with the filter size and padding, the stride determines the output dimensions, as shown in the helper below.
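For reference, an n X n input convolved with an f X f filter using padding p and stride s produces an output of size floor((n + 2p - f) / s) + 1. A tiny helper (purely illustrative) to sanity-check layer sizes:

```python
from math import floor

def conv_output_size(n, f, p=0, s=1):
    """Output height/width for an n x n input, f x f filter, padding p, stride s."""
    return floor((n + 2 * p - f) / s) + 1

print(conv_output_size(6, 3))            # 4  (no padding, stride 1)
print(conv_output_size(6, 3, p=1))       # 6  ("same" padding: p = (f - 1) / 2)
print(conv_output_size(7, 3, p=0, s=2))  # 3
```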
Pooling
- Pooling divides the input matrix or volume into regions and applies either an average or a maximum operation to the elements in each region. Max pooling has the effect of retaining a large value for a region whenever the feature detected by the preceding filters is present somewhere within it. Max pooling is typically favored over average pooling; average pooling is usually reserved for deep inside a network when you want to collapse a feature map.
- Max pooling typically does not use padding.
- Pooling works remarkably well in practice, but Geoffrey Hinton has argued that it is a mistake because it throws away precise spatial information about where features sit relative to one another, so the network does not represent object pose and orientation the way the human visual system appears to. He has proposed capsule networks to overcome some of these limitations of pooling in computer vision / object recognition tasks in artificial intelligence and machine learning. As of this writing, capsule networks are not available as built-in layers in the popular deep learning frameworks such as TensorFlow, Keras, and PyTorch, although research implementations exist.
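Here is a minimal NumPy sketch of 2 X 2 max pooling with a stride of 2 (the input values are made up for illustration):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2D feature map (no padding)."""
    H, W = x.shape
    H_out = (H - size) // stride + 1
    W_out = (W - size) // stride + 1
    out = np.zeros((H_out, W_out))
    for i in range(H_out):
        for j in range(W_out):
            region = x[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = region.max()   # keep the largest activation in the region
    return out

x = np.array([[1., 3., 2., 1.],
              [4., 6., 1., 0.],
              [2., 1., 9., 8.],
              [0., 2., 7., 5.]])
print(max_pool2d(x))
# [[6. 2.]
#  [2. 9.]]
```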
Fully-connected (Dense) layers
Logistic vs. Softmax
- A logistic (sigmoid) classifier can be used as the output of a convnet that makes binary predictions such as hotdog / not hotdog or cancer / not cancer. For more than 2 classes, we use a softmax classifier, which outputs probabilities that sum to 1 and tell us which of the various class labels is most likely for that particular data instance.
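A brief NumPy sketch contrasting the two output types; the logits used here are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Binary output: a single probability, e.g. P(positive class)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Multi-class output: probabilities over all classes that sum to 1."""
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

print(sigmoid(0.8))                        # ~0.69, probability of the positive class
print(softmax(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10], sums to 1
```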
Regularization
- Dropout
- Dropout is a process where we randomly zero out a certain percentage of a layer's activations at each training step, so that the dropped neurons do not contribute to prediction or to the weight updates for that step. By randomly varying which neurons participate in the network, we prevent any particular neurons from co-adapting and dominating the model, which reduces overfitting.
- Data Augmentation
- We can augment the data by applying transformations such as shifting (translating), blurring, and so on. The main idea is that a good model should recognize that an object is the same regardless of the resolution of the image, its position within the frame, or the angle from which it is viewed. By feeding the same data instance to the model several times with different augmentations applied, we do a better job of teaching the model to identify the same object in a variety of contexts.
- Batch normalization
- Normalization is a general technique to make data instances more similar to each other so that models can generalize better to new data instances. Even if we normalize data before feeding it into a neural network, there is no reason to believe that the data will remain normalized after we apply transformations at different stages of training. Batch normalization was introduced in 2015 by Ioffe and Szegedy and it allows us to adaptively re-normalize the data through different stages of training. It acts as a layer, and in Keras this is called the BatchNormalization layer. Batch normalization helps with gradient propagation, and much like the residual connections in the ResNet architecture, it can help us to train much deeper models than we would otherwise be able to. Batch normalization is typically applied after a convolutional or dense layer.
- In 2017, Ioffe introduced batch renormalization, which appears to be an improvement over the original batch normalization.
- Weight decay (L2 Regularization)
- L2 regularization adds a penalty proportional to the sum of the squared weights to the loss function, which discourages individual weights from growing too large and helps the model generalize. The Keras sketch after this list shows dropout, batch normalization, and L2 weight decay used together.
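A minimal Keras sketch (hypothetical layer sizes, not the competition model) showing how dropout, batch normalization, and L2 weight decay are typically wired into a small convnet:

```python
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Conv2D(32, (3, 3), padding="same", use_bias=False,
                  kernel_regularizer=regularizers.l2(1e-4),   # L2 weight decay
                  input_shape=(224, 224, 3)),
    layers.BatchNormalization(),        # re-normalize activations after the conv
    layers.Activation("relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(1e-4)),
    layers.Dropout(0.5),                # randomly zero 50% of activations while training
    layers.Dense(5, activation="softmax"),  # e.g. the 5 diabetic retinopathy severity grades
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```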
Depthwise separable convolution
- This technique makes for a more lightweight model (fewer parameters) that trains more quickly and generalizes better than alternatives. It accomplishes these improvements by separating spatial features from channel-wise features. These SeparableConv2D layers can act as a drop-in replacement for Conv2D layers, and they are particularly helpful for smaller models built with smaller training datasets.
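A brief Keras sketch of the drop-in replacement (the layer sizes are illustrative), which also shows the parameter savings:

```python
from tensorflow.keras import layers, models

regular = models.Sequential([
    layers.Conv2D(64, (3, 3), activation="relu", input_shape=(64, 64, 32)),
])
separable = models.Sequential([
    layers.SeparableConv2D(64, (3, 3), activation="relu", input_shape=(64, 64, 32)),
])
print(regular.count_params())    # 18,496 parameters (3*3*32*64 weights + 64 biases)
print(separable.count_params())  # 2,400 parameters (depthwise 3*3*32 + pointwise 1*1*32*64 + 64 biases)
```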
Case Studies
- LeNet - 5 (1998)
- This network appeared relatively early in the deep learning saga, but it was an important development for the application of handwritten digit recognition. It uses a common pattern of conv - pool - conv - pool - fc - fc, and has about 60,000 parameters, far fewer than modern networks use. It employed average pooling instead of max pooling, and no padding.
- AlexNet (2012)
- AlexNet showed for the first time that deep learning could be a viable tool for large-scale computer vision tasks. It also demonstrated that really large networks (approx. 60 million parameters) could be trained. AlexNet popularized the Rectified Linear Unit (ReLU) as the activation function in place of sigmoid or hyperbolic tangent (tanh). The network was also split across two GPUs, with a scheme that allowed some information to be shared between the two devices. Something called local response normalization was used in this approach as well, although it is not very common now.
- VGG (2015)
- VGG showed that you can achieve good performance using a relatively simple, uniform architecture: repeated stacks of 3X3 convolutions followed by 2X2 max pooling.
- ResNet (2015)
- ResNet showed a way to train really deep networks by stacking residual blocks, which use shortcut ("skip") connections that carry an activation forward a few layers and add it to the output of a layer deeper in the network. A minimal sketch of a residual block appears after this list.
- Inception
- Inception uses 1 X 1 convolutions (the "network-in-network" idea) to project a multi-channel input down to a lower-dimensional representation, which reduces the number of parameters and the amount of computation, making the network easier to train. The architecture stacks inception modules one after another; within each module, parallel branches produce outputs of the same height and width, which are concatenated along the channel dimension, with 1 X 1 convolutions keeping the channel counts manageable. Another aspect of Inception is that intermediate softmax classifiers, preceded by fully connected layers, branch off partway through the network; they encourage the earlier layers to learn features that are already useful for classification and appear to have a regularizing effect that helps avoid overfitting. Sketches of a residual block and an inception-style module follow this list.
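To make the last two architectural ideas concrete, here are two minimal Keras sketches. The layer sizes and filter counts are illustrative assumptions, not the published configurations or the competition code.

A residual block: the block's input skips ahead and is added to the output of the convolutions (real ResNet blocks also include batch normalization, omitted here for brevity).

```python
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Identity residual block: the input is carried forward and added to the block output."""
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([shortcut, y])      # skip connection: add the earlier activation
    return layers.Activation("relu")(y)

inputs = layers.Input(shape=(32, 32, 64))
outputs = residual_block(inputs)         # same shape as the input: (32, 32, 64)
```

An inception-style module: parallel branches with 1 X 1 bottlenecks, concatenated along the channel dimension.

```python
from tensorflow.keras import layers

def inception_module(x):
    """Parallel branches whose outputs share height and width and are concatenated on channels."""
    b1 = layers.Conv2D(64, (1, 1), padding="same", activation="relu")(x)
    b2 = layers.Conv2D(96, (1, 1), padding="same", activation="relu")(x)    # 1x1 bottleneck
    b2 = layers.Conv2D(128, (3, 3), padding="same", activation="relu")(b2)
    b3 = layers.Conv2D(16, (1, 1), padding="same", activation="relu")(x)    # 1x1 bottleneck
    b3 = layers.Conv2D(32, (5, 5), padding="same", activation="relu")(b3)
    b4 = layers.MaxPooling2D((3, 3), strides=1, padding="same")(x)
    b4 = layers.Conv2D(32, (1, 1), padding="same", activation="relu")(b4)
    return layers.Concatenate()([b1, b2, b3, b4])   # stack along the channel dimension

inputs = layers.Input(shape=(28, 28, 192))
outputs = inception_module(inputs)   # shape (28, 28, 256), i.e. 64 + 128 + 32 + 32 channels
```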