s27_machinelearning_Medium - agarmstrong21/agarmstrong_swd_2019 GitHub Wiki

# S27_MachineLearning_Medium

Portfolio Home Page

## Problem Statement

In the modern world, machine learning is a tool used by pretty much everyone. In essence, machine learning algorithms use statistical/mathematical models to "learn" from data to allow for future inference. In supervised learning, each data point has a known label, usually a class membership label (for classification problems) or a numerical value (for regression problems). Unsupervised learning (such as clustering) uses unlabeled data to find patterns and relationships among features.

### Similarity Functions

In several machine learning algorithms, functions that compute a "similarity" metric between two data points are needed. Here are several such metrics:

  • Cosine similarity: dot(a, b) / (‖a‖ · ‖b‖)
  • Euclidean distance: sqrt(Σ (a_i − b_i)²)
  • Hamming distance: the number of positions at which two equal-length strings differ

### Easy

Implement all three of the above similarity functions as class methods. You may assume that the inputs to the cosine and Euclidean methods are arrays, and that the inputs to the Hamming method are Strings consisting of 1s and 0s. Your solution should work for any size of array or string, and each method should check that its two inputs are the same length.

Note: JUnit 5 tests must be written for each similarity function, and your tests need to cover at least 5 examples per function. To set up JUnit 5 in IntelliJ, read the module posted on ICON (also covered in class), and use the following links for supplemental help:

https://www.jetbrains.com/help/idea/configuring-testing-libraries.html (use JUnit 5 instead of JUnit 4)

https://www.jetbrains.com/help/idea/create-tests.html

Examples:

  • Cosine similarity of [1,2,3] and [2,6,3] = 0.8781
  • Euclidean distance between [1,2,3] and [2,6,3] = 4.1231
  • Hamming distance between "0110101" and "1110010" = 4
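The three metrics in the examples above can be sketched as plain Java methods. This is a minimal illustration only: the class and method names below are invented here, and the real solution should follow the assignment's class-method and length-checking requirements.

```java
// Hypothetical sketch of the three similarity metrics; names are
// illustrative, not the assignment's required API.
public class SimilaritySketch {

    // Cosine similarity: dot(a, b) / (|a| * |b|)
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Euclidean distance: sqrt(sum of squared differences)
    public static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Hamming distance: count of positions where the bit strings differ
    public static int hamming(String a, String b) {
        if (a.length() != b.length()) throw new IllegalArgumentException("length mismatch");
        int count = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.printf("%.4f%n", cosine(new double[]{1, 2, 3}, new double[]{2, 6, 3}));    // 0.8781
        System.out.printf("%.4f%n", euclidean(new double[]{1, 2, 3}, new double[]{2, 6, 3})); // 4.1231
        System.out.println(hamming("0110101", "1110010"));                                    // 4
    }
}
```

Running `main` reproduces the three example values listed above.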

### Medium

Supervised learning: Do the Easy assignment, and also implement a k-nearest-neighbor classifier (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) using the supplied dataset. The data points consist of 5 numerical features. This is a supervised learning problem, so each data point in the supplied data also belongs to a class ("class1" or "class2"). The data is supplied to you in the "S27-MLMedium.csv" file on ICON. Given a dataset, a new data point, and a parameter k, your solution will need to find the k data points closest to the new data point according to your Euclidean distance function, and classify the new data point with the most common class among them.

Example:

  • knearest("datasetfilepath.csv",[1.5, 3.5, 2, 2, 8], 5) <- here, the last argument is k (5)
  • with output: "New data point belongs to class1"
  • knearest("datasetfilepath.csv",[3, 3, 2, 2, 1], 5) <- here, the last argument is k (5)
  • with output: "New data point belongs to class2"
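The classification step described above can be sketched as follows. This is a minimal illustration that assumes the CSV has already been parsed into feature rows and labels (the file-reading step for "S27-MLMedium.csv" is omitted), and all class and method names here are invented.

```java
import java.util.*;

// Hypothetical kNN sketch operating on already-parsed data; the real
// solution would first read rows from the CSV file on ICON.
public class KnnSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the majority class among the k rows closest to 'query'.
    public static String classify(List<double[]> features, List<String> labels,
                                  double[] query, int k) {
        Integer[] idx = new Integer[features.size()];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort row indices by distance to the query point
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(features.get(i), query)));
        // Vote among the k nearest neighbors
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) {
            votes.merge(labels.get(idx[i]), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Tiny made-up dataset: 5 features per row, plus a class label
        List<double[]> feats = Arrays.asList(
            new double[]{1, 2, 3, 4, 5}, new double[]{1, 1, 2, 2, 7},
            new double[]{5, 5, 5, 5, 1}, new double[]{4, 4, 3, 3, 1},
            new double[]{6, 5, 4, 4, 0});
        List<String> labels = Arrays.asList("class1", "class1", "class2", "class2", "class2");
        System.out.println("New data point belongs to "
            + classify(feats, labels, new double[]{1.5, 1.5, 2, 3, 6}, 3));
        // prints: New data point belongs to class1
    }
}
```

The sketch sorts all rows by distance for simplicity; a solution could instead keep only the k best candidates while scanning.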

### Hard

Unsupervised learning: Do the Medium assignment, and also implement k-means clustering (https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) using the supplied dataset. The data points consist of 2 numerical features. The data is supplied to you in the "MLHard.csv" file on ICON. Given a dataset and a number of clusters k, your code will converge on a clustering of the data points. After clustering, your program should print how many data points are in each cluster.

Example:

kmeans("datasetfilepath.csv", 4) <- here, the last argument is the number of clusters (k=4)

with output:

  • cluster1: 20 data points
  • cluster2: 30 data points
  • cluster3: 20 data points
  • cluster4: 50 data points

Note: which cluster has which number of data points isn't important, as long as you end up with two clusters with 20 data points, one with 30, and one with 50. Slight variation is not unusual with random initialization (e.g. 20, 22, 28, 50 with certain initializations instead of 20, 20, 30, 50).
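The clustering loop (Lloyd's algorithm, per the linked standard-algorithm section) can be sketched on in-memory 2-D points. This is a minimal illustration with invented names; the CSV-reading step is omitted, and it uses a deterministic evenly-spaced initialization instead of the random initialization discussed in the note above, so that the small example is reproducible.

```java
import java.util.*;

// Hypothetical k-means sketch on in-memory 2-D points (the real solution
// would read them from "MLHard.csv" and typically initialize randomly).
public class KMeansSketch {

    // Lloyd's algorithm: assign points to the nearest centroid, recompute
    // centroids, and repeat until assignments stop changing.
    public static int[] cluster(double[][] points, int k) {
        // Deterministic init: pick k points evenly spread through the data
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c * points.length / k].clone();
        }
        int[] assign = new int[points.length];
        boolean changed = true;
        while (changed) {
            changed = false;
            // Assignment step: nearest centroid by squared distance
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centroids[c][0];
                    double dy = points[i][1] - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            // Update step: move each centroid to the mean of its points
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assign[i]][0] += points[i][0];
                sums[assign[i]][1] += points[i][1];
                counts[assign[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        // Two well-separated blobs around (0,0) and (10,10)
        double[][] pts = {
            {0, 0}, {0, 1}, {1, 0}, {1, 1},
            {10, 10}, {10, 11}, {11, 10}, {11, 11}
        };
        int[] assign = cluster(pts, 2);
        int[] sizes = new int[2];
        for (int a : assign) sizes[a]++;
        System.out.println(sizes[0] + " " + sizes[1]); // prints "4 4"
    }
}
```

After convergence, counting the assignments per cluster gives the per-cluster sizes the assignment asks you to print.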

## User Documentation

To start Machine Learning Medium, run MachineLearningDriver, which calls MachineLearningTest. MachineLearningTest exercises many different inputs for each of Cosine_Similarity, Euclidean_Distance, Hamming_Distance, and kNearest. Cosine_Similarity and Euclidean_Distance each take two equal-length double arrays and compute a double result. Hamming_Distance takes two binary strings and returns, as an integer, the number of positions at which they differ. Lastly, kNearest takes a file, a double array, and an integer giving how many neighbors to consider; it uses kNearestHelper to override the compare function so the k nearest neighbors can be found.

## Developer Documentation

This program consists of four classes: MachineLearningDriver, MachineLearningTest, MachineLearning, and kNearestHelper. MachineLearningDriver starts MachineLearningTest, which runs many tests with different inputs to exercise the capabilities of MachineLearning. MachineLearning contains Cosine_Similarity, Euclidean_Distance, Hamming_Distance, and kNearest. Cosine_Similarity takes two double arrays and calculates their cosine similarity. Euclidean_Distance takes two double arrays and calculates the Euclidean distance between them. Hamming_Distance takes two binary strings and counts the positions at which they differ. Lastly, kNearest takes a file, a double array, and an integer, and uses kNearestHelper, which takes an integer and a double, to find the k nearest points.
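One common way to implement a helper like the kNearestHelper described above is a small comparable pair of (index, distance), so that sorting a collection of candidates orders them by distance to the query point. The sketch below is hypothetical: the names are invented and may not match the actual class.

```java
import java.util.*;

// Hypothetical sketch of an (index, distance) pair whose compareTo
// orders candidates by distance, so a sort yields nearest-first.
public class NeighborSketch implements Comparable<NeighborSketch> {
    final int rowIndex;     // which data point this is
    final double distance;  // its distance to the query point

    NeighborSketch(int rowIndex, double distance) {
        this.rowIndex = rowIndex;
        this.distance = distance;
    }

    @Override
    public int compareTo(NeighborSketch other) {
        // Smaller distance sorts first
        return Double.compare(this.distance, other.distance);
    }

    public static void main(String[] args) {
        List<NeighborSketch> candidates = new ArrayList<>(Arrays.asList(
            new NeighborSketch(0, 3.2),
            new NeighborSketch(1, 0.9),
            new NeighborSketch(2, 1.7)));
        Collections.sort(candidates); // nearest first
        System.out.println(candidates.get(0).rowIndex); // prints "1"
    }
}
```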

## Java Docs

The Java docs are served from a local web server on this machine. To start the web server, navigate to the directory immediately above where the source code is checked out (i.e. ~/git) and then run `python -m SimpleHTTPServer` in that directory:

```shell
cd ~/git
python -m SimpleHTTPServer &
```

Note: if you are running Python 3 (which you can check by opening a terminal and typing `python --version`), then the command is:


```shell
python3 -m http.server
```

[Java Docs for S27_MachineLearning_Medium](http://localhost:8000/agarmstrong_swd/oral_exam1/S27_MachineLearning_Medium/doc/allclasses.html)

## Source Code
[Source Code for S27_MachineLearning_Hard](https://class-git.engineering.uiowa.edu/swd2019/agarmstrong_swd/tree/master/oral_exam1/S27_MachineLearning_Hard)