LAB 2 WIKI - grhoads/CS5590PythonDeepLearning GitHub Wiki

Program #1.)

  • The problem states: Pick any data set from the data set sheet in the class sheet or online which includes both numeric and non-numeric features.
  • For this problem we decided to go with the cars data set that was used for in-class programming, due to familiarity.

Part A.)

  • "Perform exploratory data analysis on the data set"

We begin this problem by importing the necessary Python libraries:

  • pandas
  • numpy
  • sklearn
  1. To get this kicked off we read the csv file with pandas and then split the rows randomly into sets with numpy.
  2. We then create the training and test data sets.
  3. Next we handle the categorical data, converting the non-numeric features so the training and test sets hold the correct encoded values.
  4. Once train and test are initialized, we create and assign X_train and Y_train as well as X_test (a sketch of these steps appears after the assignments below):

X_train = train.drop(" brand", axis=1)

Y_train = train[" brand"]

X_test = test.drop(" brand",axis=1)
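
Putting steps 1-4 together, here is a minimal sketch of that preparation. It assumes the file is named cars.csv, the label column is " brand" (with the leading space from the csv header), and that the categorical columns are one-hot encoded with pandas; the exact encoding used in the lab may differ.

import pandas as pd
import numpy as np

# read the csv and split rows into train/test at random with numpy
cars = pd.read_csv("cars.csv")
mask = np.random.rand(len(cars)) < 0.8
train, test = cars[mask].copy(), cars[~mask].copy()

# handle categorical data: one-hot encode every non-numeric column except the label
categorical = [c for c in cars.columns if cars[c].dtype == object and c != " brand"]
train = pd.get_dummies(train, columns=categorical)
test = pd.get_dummies(test, columns=categorical)
test = test.reindex(columns=train.columns, fill_value=0)  # align test columns with train

# split features and label for training and testing
X_train = train.drop(" brand", axis=1)
Y_train = train[" brand"]
X_test = test.drop(" brand", axis=1)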

Part B.)

  • "Apply the three classification algorithms Naïve Baye’s, SVM and KNN on the chosen data set and report which classifier gives better result."

This is the meat and potatoes of the program.

  1. Here we start with the SVM algorithm, utilizing the sklearn library's svm module. Using the .fit() function we pass X_train and Y_train as parameters, and finally we create the prediction using .predict().
  2. We then move on to the KNN algorithm using sklearn's KNeighborsClassifier, once again calling .fit() and .predict(). The main difference for this algorithm is the use of KNeighborsClassifier(n_neighbors=3).
  3. Finally we come to the Gaussian classifier, implemented with GaussianProcessClassifier(). We once again fit on the X and Y training data and then create a Y prediction (a sketch of all three classifiers follows this list).
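
A condensed sketch of the three classifiers described above. It reuses the X_train, Y_train, and X_test frames from Part A and assumes a held-out Y_test is available for scoring; the lab's actual parameters may differ.

from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.metrics import accuracy_score

# SVM: fit on the training data, then predict the held-out brands
svc = svm.SVC()
svc.fit(X_train, Y_train)
svm_pred = svc.predict(X_test)

# KNN: same fit/predict pattern, with k fixed at 3 neighbors
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
knn_pred = knn.predict(X_test)

# Gaussian process classifier: again fit on the training split and predict
gpc = GaussianProcessClassifier()
gpc.fit(X_train, Y_train)
gpc_pred = gpc.predict(X_test)

# Y_test is assumed to exist for scoring; accuracies are reported as percentages
print("SVM accuracy is:", round(accuracy_score(Y_test, svm_pred) * 100, 2))
print("KNN accuracy is:", round(accuracy_score(Y_test, knn_pred) * 100, 2))
print("GAUSSIAN accuracy is:", round(accuracy_score(Y_test, gpc_pred) * 100, 2))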

Final Results:

Our ending results/output for this program using the cars.csv file were as follows:

  • SVM accuracy is: 100.0
  • KNN accuracy is: 83.33
  • GAUSSIAN accuracy is: 100.0

Program #2.)

  • The problem states: "Choose any dataset of your choice. Apply K-means on the dataset and visualize the clusters using matplotlib or seaborn."

Let's look at the libraries used in this problem:

  • pandas
  • sklearn
  • matplotlib

Part A.)

  • "Report which K is the best using the elbow method."
  1. To get things going, we first need to load the data set. We went with the pre-loaded iris data set provided by sklearn.datasets.
  2. After initializing and assigning the necessary variables, we standardize the features using a scaler.
  3. Then we create a for loop that runs KMeans clustering over a range of cluster counts and records the SSE for each.
  4. The results are then plotted with matplotlib as an elbow graph, with the SSE approaching 600 at 0 clusters and approaching 0 at 9 clusters (a sketch of this loop follows this list).
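
A minimal sketch of that elbow loop, assuming the iris features are standardized with sklearn's StandardScaler and k runs from 1 to 10 (the exact range used in the lab is not shown in the write-up).

from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# load the pre-packaged iris data and standardize the features
iris = datasets.load_iris()
X = StandardScaler().fit_transform(iris.data)

# fit KMeans for each candidate k and record the SSE (inertia_)
sse = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k)
    km.fit(X)
    sse.append(km.inertia_)

# plot SSE against k; the "elbow" marks the best k
plt.plot(list(ks), sse, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("SSE")
plt.show()

The best k is read off at the bend of the curve: past that point, adding more clusters gives little further reduction in SSE.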

Part B.)

  • Evaluate with silhouette score or other scores relevant for unsupervised approaches
  1. Using sklearn and X and Y variables containing the iris data and target, we create a for loop that calls the silhouette_score() function with kmeans.labels_ to gather the coefficients (a short sketch follows).
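
A short sketch of that silhouette loop, assuming X holds the iris features and the loop sweeps the cluster counts reported in the results below.

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X holds the iris features; the target is not needed for the silhouette score
X = datasets.load_iris().data

# fit KMeans for several cluster counts and report the silhouette coefficient
for n in (2, 5, 10):
    kmeans = KMeans(n_clusters=n).fit(X)
    score = silhouette_score(X, kmeans.labels_)
    print("at n_clusters =", n, ":", score)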

Final Results:

  • We ended with an elbow graph with the SSE approaching 600 at 0 clusters and approaching 0 at 9 clusters.
  • Output for the silhouette coefficient: at n_clusters = 2: 0.68104; at n_clusters = 5: 0.48874; at n_clusters = 10: 0.327608.

Program #3.)

"Write a program in which takes an Input file, use the simple approach below to summarize a text file":

  1. Read the data from a file.
  2. Tokenize the text into words and apply lemmatization technique on each word.
  3. Find all the trigrams for the words.
  4. Extract the top 10 of the most repeated trigrams based on their count.
  5. Go through the text in the file
  6. Find all the sentences with the most repeated tri-grams
  7. Extract those sentences and concatenate
  8. Print the concatenated result
  • In this problem we use the nltk and operator libraries available in Python.
  1. The first thing to do is read in the text file, nlp_input.txt, and tokenize it by word.
  2. Next we initialize the lemmatizer and print out every word and its lemmatized version using the WordNetLemmatizer() and .lemmatize() functions.
  3. The next step is to build the trigrams. We use a dictionary to keep each trigram and its count, via a for loop (for words in trigrams:) and an if/else: if words not in trigramDict: trigramDict[words] = 1, else: trigramDict[words] += 1.
  4. Then we move on to the ten most used trigrams, starting with a list. We use the sorted() function to sort the dictionary of trigrams by count in descending order, then add the top ten to the list.
  5. For the last parts, we tokenize the text file by sentence and convert the trigrams from tuples to single strings with a simple for loop.
  6. Next we create and initialize a dictionary of the sentences that contain the most-used trigrams. With a nested for loop, any sentence not already in the dictionary gets a count of 1; otherwise its count is incremented.
  7. Finally we sort the dictionary once again with the sorted() function, concatenate the extracted sentences into a single string, and print it to the console (a condensed sketch of the whole pipeline follows this list).
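
The steps above form a small pipeline; the following compact sketch assumes the input file is nlp_input.txt and that the required nltk corpora (punkt and wordnet) have already been downloaded. Variable names mirror the description but may differ from the lab's code.

import operator
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import trigrams

# steps 1-2: read the file, tokenize by word, and lemmatize each word
text = open("nlp_input.txt").read()
lemmatizer = WordNetLemmatizer()
words = [lemmatizer.lemmatize(w) for w in word_tokenize(text)]

# steps 3-4: count every trigram, then keep the ten most frequent
trigramDict = {}
for tri in trigrams(words):
    trigramDict[tri] = trigramDict.get(tri, 0) + 1
topTen = sorted(trigramDict.items(), key=operator.itemgetter(1), reverse=True)[:10]
topStrings = [" ".join(tri) for tri, count in topTen]

# steps 5-7: find the sentences containing the top trigrams, count hits, and concatenate
sentenceDict = {}
for sentence in sent_tokenize(text):
    for tri in topStrings:
        if tri in sentence:
            sentenceDict[sentence] = sentenceDict.get(sentence, 0) + 1
summary = " ".join(sorted(sentenceDict, key=sentenceDict.get, reverse=True))

# step 8: print the concatenated result
print(summary)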

Final Results: The appended final string output:

  • The gradient descent algorithm is used to find the optimal cost function by going over a number of iterations. Visualization of the squared error (from Setosa.io) The equation for this model is referred to as the cost function and is a way to find the optimal error by minimizing and measuring it. But the data we need to define and analyze is not always so easy to characterize with the base OLS model. These are known as L1 regularization(Lasso regression) and L2 regularization(ridge regression).The best model we can hope to come up with minimizes both the bias and the variance: Ridge regression uses L2 regularization which adds the following penalty term to the OLS equation. First we need to understand the basics of regression and what parameters of the equation are changed when using a specific model. = 1 denotes lasso) Performing Elastic Net regression Performing Elastic Net requires us to tune parameters to identify the best alpha and lambda values and for this we need to use the caret package. We will tune the model by iterating over a number of alpha and lambda pairs and we can see which pair has the lowest associated error. The larger the value of lambda the more features are shrunk to zero. A third commonly used model of regression is the Elastic Net which incorporates penalties from both L1 and L2 regularization: In addition to setting and choosing a lambda value elastic net also allows us to tune the alpha parameter where ?? This constraint results in minimized coefficients (aka shrinkage) that trend towards zero the larger the value of lambda. To produce a more accurate model of complex data we can add a penalty term to the OLS equation. The penalty applied for L2 is equal to the absolute value of the magnitude of the coefficients: L1 regularization penalty term Similar to ridge regression, a lambda value of zero spits out the basic OLS equation, however given a suitable lambda value lasso regression can drive some coefficients to zero.

Program #4.)

  • Create Multiple Regression by choosing a dataset of your choice (again before evaluating, clean the data set with the EDA learned in the class). Evaluate the model using RMSE and R2 and also report if you saw any improvement before and after the EDA.

NOTE: We were not able to do EDA so the comparison portion of this program is incomplete.

Libraries used:

  • matplotlib
  • numpy
  • sklearn

For this problem we went with the diabetes data set provided through sklearn's datasets.

  1. To begin, we load the data and report the matrix shape, vector shape, and column names.
  2. We then set up and initialize X_train, X_test, Y_train, and Y_test using the train_test_split() function.
  3. Next we create the model with sklearn:

model = LinearRegression()

model.fit(X_train, Y_train)

model.score(X_test, Y_test)

  4. For the next part, we print the coefficients and the accuracy and then plot the predictions against the data.
  5. Finally, we plot the perfect-prediction line and show where the predictions fall on the graph (a sketch of these steps follows below).
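
A sketch of the modeling and evaluation flow described above, assuming an 80/20 train/test split and that RMSE and R2 are computed with sklearn's metrics; the plotting calls approximate the scatter plot and prediction line shown in the results.

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# load the diabetes data and report its shapes and column names
diabetes = datasets.load_diabetes()
X, Y = diabetes.data, diabetes.target
print(X.shape, Y.shape, diabetes.feature_names)

# split into train/test, fit the multiple regression, and predict
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, Y_train)
Y_pred = model.predict(X_test)

# report intercept, R2, and RMSE on the held-out data
print("Model Intercept:", model.intercept_)
print("R2:", r2_score(Y_test, Y_pred))
print("RMSE:", np.sqrt(mean_squared_error(Y_test, Y_pred)))

# scatter the true vs. predicted values and draw the line of perfect prediction
plt.scatter(Y_test, Y_pred)
plt.plot([Y_test.min(), Y_test.max()], [Y_test.min(), Y_test.max()])
plt.xlabel("actual")
plt.ylabel("predicted")
plt.show()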

Final Results:

We end the program and the Lab with a scatter plot and a prediction line. Output:

  • Model Intercept: 152.255
  • Accuracy: 0.5342
  • RMSE: 2911.828