Leaf Classification Project

For my project I am working on the leaf classification competition in Kaggle. There are four files provided to download: a training set, a testing set, images of each leaf, and an example submission that shows the format to submit the results in. There are a total of 1584 images of leaves and 99 different species of leaves. This means there are 16 leaves per species: 10 of each species were put into the training set, while the other 6 were put into the testing set. The training set given did not consist of images; it consisted of statistics associated with each leaf. For each leaf it has a unique id, the species the leaf belongs to, a 64-attribute vector for the margin feature (fine-scale margin histogram), a 64-attribute vector for the shape feature (shape contiguous descriptor), and a 64-attribute vector for the texture feature (interior texture histogram). The test set consists of the same things as the training set, except it does not include the name of the species that a leaf belongs to. Here is what the provided table looks like for the training set.
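
For reference, here is a minimal sketch of loading the provided CSVs with pandas (assuming the files are saved locally under their Kaggle names, train.csv and test.csv):

```python
# Load the provided tables and check their shapes.
import pandas as pd

train = pd.read_csv('train.csv')   # id, species, margin1-64, shape1-64, texture1-64
test = pd.read_csv('test.csv')     # same columns, minus species

print(train.shape)                 # (990, 194): 99 species x 10 leaves each
print(test.shape)                  # (594, 193): 99 species x 6 leaves each
print(train['species'].nunique())  # 99
```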

The images of each leaf are titled after the corresponding id in the training and testing sets. It should be noted that the dimensions of the leaf images differ from one another. Also, the images consist of a black leaf on a white background. It seems that, other than making sure 10 leaves per species went into the training set and 6 per species into the testing set, the leaves were chosen at random from the complete set (the complete set meaning both data sets combined). Also, it seems that the leaves are randomly distributed in all of the data sets provided. Here is what some of the leaves look like, along with their corresponding species names.

I decided to start with a softmax classifier (a fully-connected neural network with a softmax output) and see how accurate I could be with only that. This involved using only the data given to me in the training set and not using the images at all. I used three hidden layers: the first of size 400, the second of size 280, and the third of size 140. There was dropout after each hidden layer, dropout meaning that a certain percentage of a layer's outputs are randomly set to zero during training to help prevent over-fitting. It ran for a total of 299 epochs and I used a batch size of 10. I received a Kaggle score of 0.03473, which put me around spot 460 on the public leader board. Additionally, here is the structure of my softmax classifier.
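
As a rough Keras sketch of this set-up (the dropout rate, activation functions, and optimizer are assumptions, not necessarily the exact settings I used):

```python
# Three hidden layers (400, 280, 140) with dropout, softmax output over 99 species.
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(400, activation='relu', input_dim=192))  # 64 margin + 64 shape + 64 texture
model.add(Dropout(0.3))
model.add(Dense(280, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(140, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(99, activation='softmax'))               # one probability per species

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=299, batch_size=10, validation_split=0.1)
```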

I wanted to try larger hidden layer sizes, but when I submit jobs to the supercomputer, Keras doesn't seem to run properly and I receive horrible results. Consequently, that is currently on hold. I let my program run for more epochs while saving the weights only when my val_loss decreased. val_loss is the loss calculated on the validation set, a fraction of the training set set aside to test our weights and keep track of over-fitting; the lower the loss, the better. I obtained a final Kaggle score of 0.02302. This puts me around spot 260 on the public leader board.
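
Saving the weights only when val_loss improves can be done with a Keras callback; a minimal sketch (the file name and validation split are placeholders):

```python
# Keep only the best weights seen so far, judged by validation loss.
from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('best_weights.h5', monitor='val_loss',
                             save_best_only=True, verbose=1)
model.fit(x_train, y_train, epochs=1000, batch_size=10,
          validation_split=0.1, callbacks=[checkpoint])
```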

Later, I was able to find the problem with submitting jobs with my code. The batch size was too small, which was causing my program to get terrible results. So I was forced to increase my batch size, which I discovered gave me worse results. My best results with my three-hidden-layer network were obtained with a batch size of 10. When submitting a job with the same network I have to increase the batch size to 25 in order for it to run correctly. This led to worse results than when I ran it in the terminal. After adjusting my neural network multiple times in an attempt to receive a better score, I have been unsuccessful with the increased batch size.

The next step was to set up a CNN (convolutional neural network) and utilize the images given to see if I could receive better results. I wanted to merge this with my softmax classifier. The main problem currently is the different dimensions each picture has. This means I have to resize each image to the same dimensions, which can be a problem since some images are very wide, or long, and some are nearly square. First, I used SciPy's image resize function, finding the smallest dimension in the entire set of images and trying to downscale the rest of the images to that dimension. This led to issues, since I failed to rescale the images correctly. I ended up using Abhijeet Mulgund's method of rescaling and centering the images, which gave results for my images that matched his. Here is a link to the discussion I read through.

https://www.kaggle.com/abhmul/keras-convnet-lb-0-0052-w-visualization/notebook
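
A rough sketch of this kind of rescale-and-center step (this is not Abhijeet Mulgund's exact code): shrink the longest side to the target size, then paste the result in the middle of a square canvas filled with the background colour, so every image ends up with the same dimensions.

```python
# Resize a leaf image to fit inside a (size, size) square, keeping its aspect ratio.
import numpy as np
from PIL import Image

def load_square(path, size=96):
    img = Image.open(path).convert('L')            # greyscale leaf image
    scale = float(size) / max(img.size)            # shrink the longest side to `size`
    new_w = max(1, int(img.size[0] * scale))
    new_h = max(1, int(img.size[1] * scale))
    arr = np.asarray(img.resize((new_w, new_h)))

    canvas = np.full((size, size), arr[0, 0], dtype=arr.dtype)  # pad with the background colour
    top = (size - new_h) // 2                      # offsets that center the leaf
    left = (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = arr
    return canvas
```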

Here are some of the leaves rescaled to dimensions (96, 96).

As we can see, our images are scaled correctly, but we lose the structure of the leaves at the current dimensions. So, to combat this problem I just increased the dimensions.

It should be noted that the dimensions of the original image for this leaf are (331, 771). Looking at the different species of leaves, my main concern with using lower-dimension images is that some of them look nearly identical.

I displayed the images of three different species that are in the training set to see how the shape of the leaves varies within each species.

Also, I computed the cosine similarity between the images of each species to see how similar they were. I did this by picking out the first picture of each species in the training set and computing the cosine similarity between it and the other nine images. I then took the average of all nine similarities for each species. Not the most thorough way to do it, but it got the job done. The larger the value, the more similar the images are to each other. The results can be found here. I decided to look at the images of species that had cosine similarities between 0.6 and 0.7 (none were lower than 0.6).
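
A minimal sketch of this per-species check (assuming `images` maps each species name to its ten flattened, equally-sized image arrays):

```python
# Average cosine similarity between a species' first image and its other nine images.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def species_similarity(images):
    scores = {}
    for species, imgs in images.items():
        first = imgs[0].reshape(1, -1)
        rest = np.vstack([img.reshape(1, -1) for img in imgs[1:]])
        scores[species] = cosine_similarity(first, rest).mean()  # average of 9 values
    return scores
```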

With the rescaled images I set up the CNN to see how well it would do. Images of size (50, 50) with a very basic CNN ran significantly slower than my softmax classifier, and the val_loss decreased much more slowly. I set up the model so that first the images would be convolved and pooled, flattened, run through one dense hidden layer, and attached to the pre-extracted features. These combined features were then run through a network structured just like my original neural network. The overall structure of my CNN will be shown later. After much trial and error I was unable to beat my lowest score from the softmax classifier network. I tried increasing the image size to (100, 100), but received worse results, so I reverted back to (50, 50) and tried to come up with a new idea. One of the top competitors in this competition, Ivan Sosnovik, discussed how he used the original images to gain more features that could be used to train our models. The link attached to his name will take you to his blog on how he obtained first place for this competition. Following his advice, I ended up using the following features from each image (a short sketch of computing most of them appears after the list):

  • Height
  • Width
  • Aspect ratio
  • Whether they were horizontal or not (the VAE made this seem more important; results shown later)
  • Square
  • Average pixel value
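
A rough sketch of pulling these simple per-image features from the original, un-resized images (`img` is assumed to be a 2-D numpy array of pixel values):

```python
# Basic per-image statistics taken from the original leaf image.
import numpy as np

def image_features(img):
    height, width = img.shape
    return {
        'height': height,
        'width': width,
        'aspect_ratio': float(width) / height,
        'is_horizontal': int(width > height),  # wider than it is tall
        'mean_pixel': img.mean(),              # average pixel value
    }
```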

I attached these features onto the pre-extracted features and ran my CNN with them to see if I could receive better results. After tweaking the number of hidden layers and their sizes I was able to obtain a Kaggle score of 0.01763. This placed me around 180 on the public leader board. Unsure of what to do next, I looked into using a variational auto-encoder (VAE) to see if there were any characteristics of the images that I was not taking into account. The following images are the training set leaves placed in the latent space of the VAE to see where they lie. The bright yellow dots single out certain species.

Since this leaf species had the only diagonal image in the whole set, I decided to take a closer look. The image on the right is the latent space for my VAE, where each coordinate has a leaf corresponding to it. The following shows where the testing set lies relative to the training set.

It seems the thin, long leaves are very similar to one another, so that might be something I will need to address later if I am unable to obtain better results. The orientation of the thinner leaves might also cause some confusion while I am training my model.

My next plan of action was to use PCA on the images and attach the PCA components to the pre-extracted features along with the features I grabbed from each picture, which is also similar to what Ivan Sosnovik did. PCA takes each image and reduces it down to a specified number of principal components, where each component contains a certain amount of the information from the picture. Thus, the fewer components, the less information is retained from the image, but also the fewer features I would have to train on. Fewer features means my network will run faster and use fewer resources. I decided to do this with images of dimension (50, 50) since those worked best for the CNN. Here is a graph showing how much information is retained versus the number of principal components specified.
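
A minimal sketch of producing this kind of curve with sklearn's PCA (assuming X is an array with one flattened 50x50 image per row); the same fitted object also provides the inverse transform mentioned below:

```python
# Cumulative fraction of variance retained as the number of components grows.
import numpy as np
from sklearn.decomposition import PCA

pca = PCA().fit(X)
retained = np.cumsum(pca.explained_variance_ratio_)
print(retained[79])   # information kept by the first 80 components

# Reduce to N components, then map back to (50, 50) images for inspection.
pca40 = PCA(n_components=40).fit(X)
X_reduced = pca40.transform(X)
X_restored = pca40.inverse_transform(X_reduced).reshape(-1, 50, 50)
```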

We can see there is a certain point where increasing the number of principal components (N) gives us diminishing returns. It should be noted that the majority of the information is contained within the first 80 principal components. The points marked are for the different numbers of principal components that I investigated. Sklearn has a function that allows us to inverse transform our PCA images back into their original dimensions; here are the results:

This link shows each point in a little more detail. We can see how the image quality becomes worse as the number of components decreases. In the end I chose to run PCA with N = 40. After attaching the PCA components to the other pre-extracted features, I ran my training set through my softmax classifier. The set-up was much the same as last time, except that after tweaking the hidden layers I set all three to size 1000 to receive the best Kaggle score. That score was 0.01419, which placed me at around 145 on the public leader board. In hindsight I should have tried fewer hidden layers since I was training with more features; at the time I didn't think about it, since my softmax classifier worked best with three hidden layers for just the original pre-extracted features. I also ran the same features through my CNN. After tweaking the hidden layers I was able to receive a Kaggle score of 0.01340, which placed me around 140 on the public leader board. My best CNN model was set up similar to the diagram shown below.
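
As a rough code-level sketch of this two-branch idea (a small convolutional branch on the (50, 50) images merged with the pre-extracted and added features), something like the following Keras functional model captures the structure; the filter counts, layer sizes, and dropout rates are placeholders, not the values from my best run.

```python
# Two inputs: the leaf image and the feature vector; one softmax output over 99 species.
from keras.models import Model
from keras.layers import (Input, Conv2D, MaxPooling2D, Flatten,
                          Dense, Dropout, concatenate)

image_in = Input(shape=(50, 50, 1))
x = Conv2D(8, (5, 5), activation='relu')(image_in)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dense(100, activation='relu')(x)        # one dense layer on the image branch

features_in = Input(shape=(238,))           # 192 pre-extracted + 6 image features + 40 PCA (illustrative)
merged = concatenate([x, features_in])

y = Dense(400, activation='relu')(merged)   # same style as the original network
y = Dropout(0.3)(y)
y = Dense(280, activation='relu')(y)
y = Dropout(0.3)(y)
y = Dense(140, activation='relu')(y)
y = Dropout(0.3)(y)
out = Dense(99, activation='softmax')(y)

model = Model(inputs=[image_in, features_in], outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy')
```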

Note that this is a general look at my CNN; the layer sizes shown are not the sizes used when I received my best score. Following code by Abhijeet Mulgund in his discussions, I was able to view what my hidden layers looked like. These images are not based on my best CNN run, but they give a good representation of what each run looks like. Here is the leaf.

Here is the first convolutional layer.

Here is the second convolutional layer.

Here is a look at a different leaf.

I decided to see what score I would get if I set the number of principal components to 300, since that retained 95% of the information from the original image. I received worse scores doing this; maybe this was due to having too many hidden layers, or hidden layers that were too large, for the added features. At this point I could have fiddled around more with my CNN's hidden layers and convolutional layers, but I wanted to try something different first. After looking through the discussion board for the competition I noticed some people used sklearn's LogisticRegression function to train their models and were receiving decent results. Their scores weren't as low as mine when I used a CNN, but they also didn't include the extra features and PCA images that mine did. So, I wanted to see if I could do better with that approach.
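
A minimal sketch of that kind of sklearn set-up (the parameter grid and solver are illustrative; X_train, y_train, and X_test are the assembled feature matrices and species labels):

```python
# Multinomial logistic regression with a small grid search over regularization and tolerance.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

params = {'C': [100, 1000, 2000], 'tol': [1e-4, 1e-5]}
clf = GridSearchCV(LogisticRegression(solver='lbfgs', multi_class='multinomial', max_iter=500),
                   params, scoring='neg_log_loss', cv=5)
clf.fit(X_train, y_train)

print(clf.best_params_, clf.best_score_)
probs = clf.predict_proba(X_test)   # Kaggle wants one probability per species per leaf
```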

First, I set up the logistic regression model that trained only on the pre-extracted features. Sklearn has a function called GridSearchCV that allows you to send in multiple inputs for each parameter (regularization, tolerance, etc.) and trains the model on each combination of these inputs, which makes finding the optimal parameters significantly easier and faster than creating models and adjusting them in Keras. Once I tweaked my model I was able to receive a Kaggle score of 0.03806, which is on par with what some other people were getting using this method. Then I added just the extra features that I grabbed from each image, tweaked my model's parameters when training with them included, and received a Kaggle score of 0.02225, a significant improvement. I then added the PCA components, trained my model with them attached to the pre-extracted and added features, and was able to receive a Kaggle score of 0.01194. This put me around 125 on the public leader board. This score was obtained when I set my number of principal components to 30 when I used PCA on the images. If I went any higher or lower I received higher Kaggle scores, so 30 was the best amount to use.

With this fact in mind I decided to go back to my original softmax classifier in Keras. Looking through the discussions again, I noticed a lot of people used smaller neural networks than mine, so I gave it another shot. I set up my model like this:

I ran it with all of the added features I obtained from the images, including the PCA images (N = 30 this time), and received a Kaggle score of 0.00554, which put me around 45 on the leader board. I tried adjusting the size of the hidden layers and N for PCA to different amounts, but was unable to receive a lower score. Upon changing the seed of my code, which changes what random values I will receive, I got worse scores. This leads me to believe I lucked out with the seed I was running my code on, and that the model is not performing as well as I had thought. I added 10-fold cross validation to my Keras softmax classifier and started to receive consistent results. I also added three additional features obtained using openCV (a sketch of computing them follows the list). These features are:

  • Perimeter of each leaf
  • Area of each leaf
  • Average size of defect for each leaf
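
A rough sketch of computing these with openCV (`img` is assumed to be one of the original greyscale leaf images as a uint8 numpy array; the threshold value is a placeholder):

```python
# Perimeter, area, and average convexity-defect depth of the leaf contour.
import cv2

def contour_features(img):
    # findContours expects white objects on a black background, so threshold
    # (and invert if needed) before calling it.
    _, binary = cv2.threshold(img, 127, 255, cv2.THRESH_BINARY)
    contours = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    cnt = contours[0]                    # first contour in the list; a later section
                                         # discusses why this is not always the leaf
    perimeter = cv2.arcLength(cnt, True)
    area = cv2.contourArea(cnt)

    hull = cv2.convexHull(cnt, returnPoints=False)
    defects = cv2.convexityDefects(cnt, hull)
    avg_defect = defects[:, 0, 3].mean() / 256.0 if defects is not None else 0.0
    return perimeter, area, avg_defect
```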

My new results gave me a Kaggle score of a little over 0.012. Using sklearn's logistic regression gave me Kaggle scores of around 0.009, with the best being 0.00903. This placed me around 100 on the public leader board. There is another function from sklearn I wanted to try out called t-SNE. It lets me plot my data points in two dimensions, after using PCA on them, to look for patterns in the data. Here are the results, where each color represents one of the 99 species.

Here is the same graph, but a picture of each leaf now occupies its data point.

I was able to set up these plots with the help of Gábor Vecsei's code on plotting the t-SNE results. The link attached to his name will take you to the code I followed. Some species are in their own clusters, but there are quite a few species that look similar to each other and end up in big blobs. It also looks like rotating the leaves so that they all share the same orientation would help cluster some species closer together. I decided to use the t-SNE coordinates of each leaf as additional features, since some of the leaf species were grouped up.
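
A minimal sketch of generating these coordinates (PCA first to shrink the flattened images, then t-SNE down to two dimensions; the number of PCA components and the perplexity are placeholders):

```python
# Two t-SNE coordinates per leaf, coloured by species.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# X: flattened images, y: integer species labels
X_pca = PCA(n_components=50).fit_transform(X)
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap='jet', s=10)
plt.show()
# The two columns of `coords` can also be appended to the feature matrix.
```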

Before that, I made modifications to the features I was using with openCV. After getting undesired results from adding more features than the ones listed above, I decided to take a step back and try to figure out why I was not improving. I tried adding in the moments of the leaves, which are characteristics of the leaf shape. While what most of the moments mean is not clear to me, there are a total of 24, and one of them, named 'm00', is the area of the leaf. But my results were not improving, so I stuck with just the moments and the perimeter of the leaf.

This idea to use only the moments of the leaves came from Ivan Sosnovik; I decided to keep using the perimeter as well to see if it would help at all. His results were better than mine despite him using only the moments along with PCA and some extra features obtained from the original images. So I tried using only the moments and the perimeter, and sure enough I obtained a better Kaggle score. But this result was counter-intuitive, since my openCV code was not working properly when using the moments the way he did. This is because openCV's findContours function gives a list of contours that pertain to different objects in the image. My code accesses the first set of contours, contours[0], in the list and uses the moments obtained from it. I only have one leaf per picture, so that should not be an issue if I do it for all images, right? Wrong. Some of the pixels in the original images are not connected to their leaves, which causes openCV's findContours to treat them as their own objects. These contours are put first in the list, and the contours of the leaf are put second, or third, etc., depending on how many areas of the leaf have detached pixels. This means that sometimes the moments are all zeros, since there are times when it is only one detached pixel, or they are the moments of a tiny object rather than of the leaf. This happens to leaves at random, not to a specific species, and it affects around 50-60 of the leaves in total. Despite this flaw I still receive better results using it. Attempts to fix this by choosing the correct contours, so that the moments actually describe the leaf, have given me worse results, oddly enough. So the openCV code that isn't working 100% correctly is still being used.
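
For reference, here is a sketch of the moment extraction and of the contour choice just described; picking the largest contour instead of contours[0] is the 'fix' that, oddly, gave me worse results.

```python
# Moments of either the first contour (current behaviour) or the largest one (the leaf).
import cv2

def leaf_moments(binary_img, use_first=True):
    contours = cv2.findContours(binary_img, cv2.RETR_EXTERNAL,
                                cv2.CHAIN_APPROX_SIMPLE)[-2]
    if use_first:
        cnt = contours[0]                         # may be a stray pixel, not the leaf
    else:
        cnt = max(contours, key=cv2.contourArea)  # always the leaf itself
    return cv2.moments(cnt)                       # dict of 24 moments; m['m00'] is the area
```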

With this and the coordinates from t-SNE I was now averaging a Kaggle score of around 0.0067 with sklearn. When using my Keras network I averaged around 0.0085, so both showed signs of improvement. The score I received from sklearn put me around spot 65 on the public leader board. Using the coordinates from t-SNE gave me a slightly better score: one run without them received 0.00686, while one run with them received 0.00667. I decided to look at the leaves that were getting confused. I did this by looking at the predictions of leaves that had a probability of less than 95%. Here are the leaves that were predicted and the top three predictions.

I added in finding the perimeter through OpenCV and tweaked the logistic regression model. I received a Kaggle score of 0.00650, which places me at exactly 65 on the public leaderboard.