Overview

Image Recognition With Deep Learning

In image recognition, the programmer can specify at the pixel level what defines each class, but that is extremely difficult and maybe even impossible. The programmer can also provide no specification whatsoever, and instead combine a lot of data, a neural network with intermediate layers, and some badass learning algorithms, so that the classifier learns the specification of each class from the ground up. That's deep learning.

A multi-layered neural network is a good architecture for learning features

Deep Convolutional Neural Nets

Deep CNNs (often simply referred to as CNNs) are a type of deep neural network. They were used by the winning teams at ImageNet 2012 and ImageNet 2013; 2012 was the first time they were used at ImageNet, and they thrashed competing classifiers built by world-class computer vision teams.

Dimension Hopping

CNNs seem to be the architecture of choice for image recognition **with supervised learning**, because they are good at tackling dimension hopping in the data. With images, each pixel defines a dimension, and since a leaf (or any feature) can appear anywhere in the photo, a feature can appear in any group of dimensions. This is very difficult to deal with. For example, consider the classification task of diagnosing illnesses for medical patients based on symptom data. Imagine if the heartbeat data were to sometimes be written under the "height" section! It would be a total mess.

Replicated Feature Detection

CNN architecture is a way of replicating feature detection across the whole image. This is done by modifying the backpropagation algorithm to force certain edge weights to be the same. Because any one feature potentially has to be picked up anywhere on the image, and there are lots of features to detect, hidden layers (especially the lower ones) have lots of neurons. But since weights are replicated, the number of parameters to optimise isn't as high, so it doesn't make the required computation impossible.
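As a rough illustration of what replicated weights mean for backpropagation, here is a minimal numpy sketch (a toy 1-D example, not the project's code): the same 3-weight kernel is applied at every position of the input, and the gradients from all positions are summed into a single update, which is what keeps the replicated copies identical.

```python
import numpy as np

def conv1d_forward(x, w):
    """Slide the shared kernel w over input x (valid positions only)."""
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) for i in range(len(x) - k + 1)])

def conv1d_backward(x, w, grad_out):
    """Sum the per-position gradients into one gradient for the shared kernel."""
    k = len(w)
    grad_w = np.zeros_like(w)
    for i, g in enumerate(grad_out):
        grad_w += g * x[i:i + k]   # every position contributes to the same weights
    return grad_w

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.1, -0.2, 0.3])           # one kernel, replicated across positions
out = conv1d_forward(x, w)
grad_w = conv1d_backward(x, w, grad_out=np.ones_like(out))
w -= 0.01 * grad_w                       # a single update keeps all copies equal
```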

Our Network's Architecture

Given a training set, the performance of a neural network can be improved by adopting a specific architecture, ie choosing values for the following parameters:

  • number of hidden layers
  • number of neurons on each layer
  • number of connections (ie edges) between neurons in adjacent layers (by default, fully connected)
  • a neuron model for each neuron (ie an operation to perform on the neuron's input which determines its output)
  • an error function for the output layer
  • a learning algorithm (including the learning rate)
  • techniques for preventing overfitting
  • a distribution from which to sample initial weights (usually Gaussian with mean zero, but the optimal std dev can vary)
  • momentum
  • ...

To give you an idea, the CNN used by Krizhevsky, Sutskever and Hinton at ImageNet 2012 (aka cuda-convnet) is described in Krizhevsky et al 2012 as follows: it "has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called 'dropout'."
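To make that description concrete, here is a minimal sketch of a network in the same spirit (five convolutional layers, max-pooling after some of them, three fully-connected layers, dropout, 1000-way output). PyTorch is used purely for illustration, and the layer sizes are assumptions rather than the exact cuda-convnet configuration.

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Sketch of an AlexNet-style CNN: 5 conv layers, pooling, 3 FC layers, dropout."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),   # non-saturating neurons
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),                        # the "dropout" regularisation
            nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),           # 1000-way output (softmax applied in the loss)
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
logits = model(torch.randn(1, 3, 227, 227))         # one 227x227 RGB image
```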

Ideas

Output layer: softmax, taxon for target values

It's likely that the output layer will have softmax neurons. By definition, the target output values will sum to 1, with each value contained in [0,1]. The simplest approach is to set the target value for the correct class to 1, and 0 for all others. But it might be interesting to exploit the taxonomic tree of species to assign nonzero values to classes/species which are similar to the true class/species. This might help the network to find "good" features, ie features relating to the biology of the plants (phenotypes). It might also prevent the network from punishing the use of features which are useful in many cases but were not useful in that specific case. For example: feature A is great for distinguishing species u from species v and w because only u has it, but it isn't great for distinguishing species u from x because they both have it. That's ok, u and x are taxonomically similar, so the target output value for class x when classifying an image of a plant from class u should not be 0, it should be 0.3.

You might say that 0.3 is an arbitrary value. What should we use? We could use some normalised function of the distance between species u and x in the taxonomic tree. If the distance is zero, the target value is 1. If the distance is twice the height H of the tree (the maximum possible), the target value is 0. If the distance is n, with 0<n<2H, the target value is 1 - n/2H. Or, if we think this metric doesn't leave enough of a gap between the correct answer and all others, we could take the square of the metric. We wouldn't be able to use these values directly as labels for a softmax, because the output values need to sum to 1. So we'd have to normalise the labels by dividing each species' value by the sum of the values over all of the classifier's species.
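A minimal sketch of how such soft targets might be computed, assuming we already have a function giving the tree distance between two species and the tree height H. The species names, distances and helper names below are purely illustrative.

```python
import numpy as np

def soft_targets(true_species, all_species, tree_distance, tree_height, square=False):
    """similarity = 1 - d/(2H), optionally squared, then normalised to sum to 1."""
    sims = []
    for s in all_species:
        d = tree_distance(true_species, s)              # 0 for the true class itself
        sim = max(0.0, 1.0 - d / (2.0 * tree_height))   # 1 at distance 0, 0 at distance 2H
        sims.append(sim ** 2 if square else sim)
    sims = np.array(sims)
    return sims / sims.sum()                            # softmax targets must sum to 1

# Example with a hypothetical four-species classifier and hand-picked distances.
distances = {("u", "u"): 0, ("u", "v"): 6, ("u", "w"): 6, ("u", "x"): 2}
targets = soft_targets("u", ["u", "v", "w", "x"],
                       tree_distance=lambda a, b: distances[(a, b)],
                       tree_height=4)
print(targets)   # class u gets the largest target, x gets more than v and w
```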

Backpropagation Initial Weights: discretely explore error function

Performance-wise, backpropagation's big weakness is the naivety of gradient descent. But backprop is fine if the initial weights are on the slope of the deepest crater. So instead of initialising weights by randomly sampling them from a normal distribution, might it do any good to evaluate the error for, say, 1 billion sparse weight vectors (scattered across the surface), and take as initial weights the ones with the lowest error? The underlying assumption is that there is a high probability of being on the slope of the deepest crater given that the weights' error is the lowest out of a billion. This won't be the case if the surface is crazy bumpy (ie if the error function is a polynomial of high order).
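A minimal sketch of the idea, with assumed helper names: `error_fn` is whatever function evaluates the network's error on (a subset of) the training set, and the candidate count is kept small because evaluating anything near a billion candidates for a 60M-parameter net would be far too expensive in practice.

```python
import numpy as np

def best_random_init(error_fn, n_weights, n_candidates=10_000,
                     sparsity=0.05, scale=0.01, rng=np.random.default_rng(0)):
    """Scatter sparse random weight vectors and keep the one with the lowest error."""
    best_w, best_err = None, np.inf
    for _ in range(n_candidates):
        w = rng.normal(0.0, scale, size=n_weights)
        w *= rng.random(n_weights) < sparsity       # zero out most entries -> sparse vector
        err = error_fn(w)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# Usage: w0, e0 = best_random_init(error_fn=my_network_error, n_weights=60_000_000)
```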

Backpropagation Initial Weights: estimate polynomial order of error function

Is there a way of efficiently estimating the order of an unknown polynomial? Eg by randomly sampling 1 billion points of the error function (ie evaluating the error for 1 billion sparse weight vectors). We could use that to calculate the probability that the local minimum found by gradient descent is the global minimum.
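A minimal 1-D illustration of the idea (the real error surface is a function of millions of weights, so this is only a sketch): sample the unknown function at random points, fit polynomials of increasing degree to half the samples, and take the smallest degree whose error on the held-out half stops improving.

```python
import numpy as np

def estimate_order(f, n_samples=1000, max_degree=15, rng=np.random.default_rng(0)):
    x = rng.uniform(-1, 1, n_samples)
    y = f(x)
    half = n_samples // 2
    errors = []
    for d in range(1, max_degree + 1):
        coeffs = np.polyfit(x[:half], y[:half], d)                     # fit on first half
        errors.append(np.mean((np.polyval(coeffs, x[half:]) - y[half:]) ** 2))
    best = min(errors)
    for d, e in enumerate(errors, start=1):   # smallest degree close to the best error
        if e <= best * 1.01 + 1e-12:
            return d, errors

order, _ = estimate_order(lambda x: x**5 - 3 * x**3 + x)   # pretend this order is unknown
print(order)   # should come out as 5
```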

Backpropagation Initial Weights: unsupervised pretraining

I read on the deep learning Google+ community that unsupervised pretraining can help. No idea what this consists of yet...

Data

Labelled photos taken by amateurs are the ideal data to look for. Discussion below. NB1: we can't use externally sourced data for the PlantClef competition, so this data would be strictly for the app. NB2: below are my (alexd) opinions, but I may be wrong!

Amateur Photos

The objective is to distinguish plants based on images taken by smartphone users. So we need to train our classifier on similar images. What might be our classifier's biggest challenge is segmenting the target object from the rest of the user photo. Deep learning can cope with complex tasks like that, but it needs to be given many realistic examples to train on. In a statistical sense, training data is realistic if it is randomly sampled from the same population distribution as that from which test data (that is, smartphone user photos) is sampled. If for example all of our leaf images are 2D scans of leaves on a white background, our classifier will think that an object can only ever be a leaf if it's surrounded in white, and laid perfectly flat, and will not know how to segment the leaf in a noisy environment. So ideal data is labelled photos taken by amateurs.

:information_source: Possibility we could take high-quality labeled data and adjust it to mimic the properties the photo would have had, had it been taken by the average iPhone user. Eg add noise, de-noise, reduce resolution, offset the white balance and hue, over/under expose, add extra background etc @ac7513

Labeled Data

Deep learning has proved its superiority in the ImageNet competitions, which are supervised learning challenges. If we are to use the cuda-convnet code which won these competitions, we need labeled data. Labels are discussed below. The species label is crucial, the category (plant component) label is in my opinion very good for improving performance, and I am less sure about the others.

Species Label

The ImageNet data contained a single class label, so I think cuda-convnet is built to deal with just that. The other labels below aren't crucial, but are likely to increase app performance.

Category Label

Eg fruit, flower, stem, leaf, leaf-scan, bark, branch. This is provided in the PlantClef data. I think these could heavily boost performance. In the ImageNet Fine-Grained challenge, you had to identify the correct plane model, or the correct dog species, from one photo only. In our case, if our classifier is uncertain about which of 3 species a given leaf belongs to, but has learnt that these 3 species have very different flowers, it could ask the user for a flower photo (see the sketch below). I think learning different components could be great because if our classifier could only recognise leaves, there'd be only so much that querying the user for more leaf photos could do to help. Moreover, it's what botanists sometimes need to do. Note that leaf and leaf-scan are different categories! This is really important, because in a non-scan environment, segmenting the target object is difficult, and the classifier needs to learn how to do it. Segmenting is easy in a scan environment, so we don't want these images confusing our classifier.
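A minimal sketch of the "ask for another component" idea, with made-up probability vectors: if the leaf photo leaves the top candidates too close together, request a flower photo and combine the two predictions by multiplying the class probabilities (treating the photos as roughly independent evidence).

```python
import numpy as np

def needs_more_evidence(probs, gap=0.2):
    top2 = np.sort(probs)[-2:]
    return (top2[1] - top2[0]) < gap          # top two species too close to call

def combine(probs_leaf, probs_flower):
    joint = probs_leaf * probs_flower         # multiply per-species probabilities
    return joint / joint.sum()                # renormalise

p_leaf = np.array([0.35, 0.33, 0.30, 0.02])   # ambiguous from the leaf alone
if needs_more_evidence(p_leaf):
    p_flower = np.array([0.10, 0.75, 0.10, 0.05])   # hypothetical flower prediction
    print(combine(p_leaf, p_flower))                 # the second species now clearly wins
```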

Bounding Box

A bounding box is basically a rectangle drawn around the object that needs to be identified in the photo. In the ImageNet Fine-Grained challenge, bounding boxes were provided in the training data and sometimes in the test data. They help the program with segmentation. I don't think PlantClef provides bounding boxes, so it's unlikely that external sources will. Bounding boxes are important for computer vision approaches which detect features in an image but do not know how the features piece together in the image. I don't know about deep neural nets. But it's worth keeping in mind. Moreover, we could enforce bounding boxes in our test data! Apps that take a photo of a face and make it look old have a preliminary stage in which the user is asked to place markers on key facial features. We could do the same with the heart of a flower, the central stem of a leaf, the trunk of a tree, or the rough contour.

Taxon Label

The full taxon names (species, genus, family…) following as far as possible the most recent international taxonomical references. We might be able to create a tree graph from this data (a root node, then a layer of family nodes, then a layer of genus nodes, then a layer of species nodes, etc). If our classifier is uncertain between 3 species, and 2 of those are much closer together in the tree than the 3rd species, then the 3rd one can be ruled out. Jack mentioned a paper about making use of such a graph to "make good mistakes" when the classifier can't recognise the object it's been given. I don't know how that works, but it might be worth checking out.
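A minimal sketch of building such a tree from (family, genus, species) labels and measuring the distance between two species as the number of edges on the path through their lowest common ancestor. The label format and species used here are just illustrative assumptions.

```python
def build_paths(taxa):
    """taxa: list of (family, genus, species) tuples -> species -> path from root."""
    return {species: ("root", family, genus, species)
            for family, genus, species in taxa}

def tree_distance(paths, a, b):
    pa, pb = paths[a], paths[b]
    common = sum(1 for x, y in zip(pa, pb) if x == y)   # length of shared prefix
    return (len(pa) - common) + (len(pb) - common)      # edges up + edges down

taxa = [("Rosaceae", "Rosa", "Rosa canina"),
        ("Rosaceae", "Rosa", "Rosa gallica"),
        ("Fagaceae", "Quercus", "Quercus robur")]
paths = build_paths(taxa)
print(tree_distance(paths, "Rosa canina", "Rosa gallica"))   # 2 (same genus)
print(tree_distance(paths, "Rosa canina", "Quercus robur"))  # 6 (differ at family level)
```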

Location Label

This is also provided in PlantClef data. Suppose that based on some photos from the user, our classifier is uncertain between two species which, for some reason, look very similar, but grow in completely different parts of the world. Then the user's location would help to distinguish the two. This is worth trying out, but I wonder whether the risks of overfitting are high, because for a given species, we might not have training images from every location in which the species grows. Our classifier might think a plant cannot grow in a region it's never seen it in.

Other PlantClef Labels

Individual Plant ID, GeographicalDistribution, PhotographerName, Date, ImageQualityRating. Any ideas how these might be useful?

Non-Amateur Photos Still Good

There are 2 types of non-amateur photos: professional wildlife photos, and scientific scan images (or have we missed any?).

Professional Wildlife Photos Still Good

Possibility we could take high-quality labeled data and adjust it to mimic the properties the photo would have had, had it been taken by the average iPhone user. Eg add noise, de-noise, reduce resolution, offset the white balance and hue, over/under expose, add extra background etc.
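A minimal sketch of what such degradation might look like, using Pillow and numpy; the parameter values are arbitrary assumptions to be tuned, and the file names are placeholders.

```python
import numpy as np
from PIL import Image, ImageEnhance, ImageFilter

def degrade(path, out_path, rng=np.random.default_rng(0)):
    """Make a high-quality labeled photo look more like an average smartphone shot."""
    img = Image.open(path).convert("RGB")
    img = img.resize((img.width // 2, img.height // 2))                 # reduce resolution
    img = img.filter(ImageFilter.GaussianBlur(radius=1))                # slight de-focus
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.7, 1.3))   # over/under expose
    img = ImageEnhance.Color(img).enhance(rng.uniform(0.8, 1.2))        # saturation shift (crude stand-in for white balance/hue drift)
    arr = np.asarray(img, dtype=np.float32)
    arr += rng.normal(0, 8, arr.shape)                                  # add sensor-like noise
    Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8)).save(out_path)

# degrade("scan_of_leaf.jpg", "fake_phone_photo.jpg")
```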

Scientific Scan Images Still Good

I can think of two approaches for using scientific scan images:

  • Prompt For User Scan-Like Photos You might argue that leaf-scan images are not realistic examples of user photos. But we could prompt the user to tear off a leaf, hold it up against a uniform background (the sky, a tissue, a blank sheet of paper at home), and take a photo. I'm interested in exploring this, because scan images are very easy to segment and are likely to be plentiful and labeled.
  • Generate Synthetic Leaf-Scan Images Alternatively, our classifier could learn to map leaf-wildlife images to leaf-scan images. When faced with a leaf-wildlife photo taken by a user, it would implicitly create a synthetic leaf-scan image, and extract features from that (which is like comparing the synthetic leaf-scan image to the leaf-scan images in its database). The synthetic leaf-scan image won't carry more information than the leaf-wildlife image it was mapped from, but the classifier might have more knowledge about leaf-scan images than leaf-wildlife images, if it has seen many more of them during training.

Unlabeled Amateur Photos Still Good

We might still be able to make good use of unlabeled amateur photos. The ImageNet winners / cuda-convnet use deep convolutional networks, which have been around since the 90s and are a form of supervised learning. Geoffrey Hinton has been focusing on Restricted Boltzmann Machines for the past decade, which are a form of unsupervised learning. Unsupervised learning is all about clustering. We could combine our labeled and unlabeled images into one set, and cluster that set, using each label from the labeled images to define a cluster outline. We could then assume that the unlabeled images which fall within a cluster outline have that label. These newly labeled images, if they are realistic examples of user photos, could help our classifier to get an idea of what a user photo is like. I think this could reduce overfitting and increase processing speed at test time, but I'm not sure.
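A minimal sketch of one way this could work, assuming we already have some feature vector per image (raw pixels would be too crude): each label's images define a cluster, and an unlabeled image inherits the label of the nearest cluster centre, but only if it falls inside that cluster's "outline" (approximated here by the cluster's mean distance-to-centre plus a margin). The helper and variable names are assumptions, not project code.

```python
import numpy as np

def propagate_labels(X_lab, y_lab, X_unlab, margin=1.5):
    """Assign each unlabeled point the label of the nearest cluster it falls inside of."""
    labels = sorted(set(y_lab))
    centres, radii = {}, {}
    for c in labels:
        pts = X_lab[np.array(y_lab) == c]
        centres[c] = pts.mean(axis=0)
        radii[c] = np.linalg.norm(pts - centres[c], axis=1).mean() * margin
    new_labels = []
    for x in X_unlab:
        dists = {c: np.linalg.norm(x - centres[c]) for c in labels}
        best = min(dists, key=dists.get)
        new_labels.append(best if dists[best] <= radii[best] else None)  # None = leave unlabeled
    return new_labels

# Usage: y_new = propagate_labels(features_labeled, species_labels, features_unlabeled)
```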