16 01 Classification - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

01 Supervised Learning

What is machine learning

  • The art and science of:
    • Giving computers the ability to learn to make decisions from data
    • without being explicitly programmed
  • Examples:
    • Learning to predict whether an email is spam or not
    • Clustering Wikipedia entries into different categories
  • Supervised learning: Uses labeled data
    • Classification
    • Regression
  • Unsupervised learning: Uses unlabeled data

Reinforcement learning

  • Software agents interact with an environment
    • Learn how to optimize their behavior
    • Given a system of rewards and punishments
    • Draws inspiration from behavioral psychology
  • Example: AlphaGo

Features = predictor variables
Target variable = dependent variable

In Python

  • scikit-learn (imported as sklearn)
    • Integrates well with the SciPy stack
  • TensorFlow / Keras

02 EDA

The Iris dataset in scikit-learn

  • import the datasets
from sklearn import datasets
iris = datasets.load_iris()
type(iris)  # sklearn.utils.Bunch (sklearn.datasets.base.Bunch in older releases): a dictionary-like object of key-value pairs
  • keys
print(iris.keys())
# dict_keys(['data', 'target_names', 'DESCR', 'feature_names', 'target'])  (newer releases may list additional keys)
  • the features are in iris.data and the target is in iris.target; both are np.ndarray
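A minimal sketch of a first look at the data described above. It assumes pandas is installed; the names X, y, and df are chosen here for illustration.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data      # features: one row per flower, one column per measurement
y = iris.target    # target: integer-encoded species

print(X.shape)             # (150, 4)
print(y.shape)             # (150,)
print(iris.target_names)   # species names that the integer labels map to

# Put the features in a DataFrame for easier inspection
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())
print(df.describe())
```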

03 The classification

Scikit-learn fit and predict

  • All machine learning models are implemented as Python classes
    • They implement the algorithms for learning and predicting
    • They store the information learned from the data
  • Train a model on the data (learn from it)
    • .fit() method
  • Predict labels for new data
    • .predict() method
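The points above can be sketched on the Iris data. This is only an illustration: the choice of KNeighborsClassifier with 6 neighbors and the names X, y, and labels are assumptions made here, and classes_ is just one example of information stored on a fitted model.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# A model is an instance of a Python class
knn = KNeighborsClassifier(n_neighbors=6)

# Learning: .fit() stores what was learned on the object itself
knn.fit(X, y)
print(knn.classes_)   # class labels seen during fit: [0 1 2]

# Predicting: .predict() returns one label per row passed in
labels = knn.predict(X[:3])
print(labels)
```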

K-Nearest Neighbors

  • KNeighborsClassifier
  • fit the model: knn.fit()
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create arrays for the features and the response variable
# (df is assumed to be a pandas DataFrame, loaded earlier, with a 'party' column)
y = df['party'].values
X = df.drop('party', axis=1).values

# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)

# Fit the classifier to the data
knn.fit(X, y)
  • predict: knn.predict()
# Predict the labels for the training data X
y_pred = knn.predict(X)

# Predict and print the label for the new data point X_new
# (X_new is assumed to be a 2D array with the same feature columns as X)
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))
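The snippets above rely on df and X_new, which are not defined in these notes. A self-contained sketch of the same fit/predict steps, using the Iris data instead, might look like this. The values in X_new are made up for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Create and fit a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X, y)

# Hypothetical measurements for two unseen flowers (made-up values).
# New data must be 2D: one row per observation, one column per feature.
X_new = np.array([
    [5.0, 3.4, 1.5, 0.2],
    [6.3, 2.9, 5.6, 1.8],
])

new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))

# Map the integer labels back to species names
print(iris.target_names[new_prediction])
```

A single observation stored as a 1D array would need reshaping first, for example with point.reshape(1, -1).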

04 Measuring model performance

  • Split data into training and test sets
  • Fit the classifier on the training set
  • Make predictions on the test set
  • Compare predictions with the known labels
    • knn.score(X_test, y_test) returns the accuracy on the test set

Train/test split

  • from sklearn.model_selection import train_test_split
    • test_size: the proportion of the data to hold out as the test set
    • random_state: sets the seed of the random split, so the result is reproducible
    • stratify=y: stratifies the split according to the labels, so that they are distributed in the training and test sets as they are in the original dataset
# Import necessary modules
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the digits dataset and create feature and target arrays
digits = datasets.load_digits()
X = digits.data
y = digits.target

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=7)

# Fit the classifier to the training data
knn.fit(X_train, y_train)

# Print the accuracy
print(knn.score(X_test, y_test)) # 0.9833
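To see what stratify=y does, one option is to compare the class proportions in the full data with those in each subset. This is a sketch on the digits data; the variable names full_frac, train_frac, and test_frac are chosen here for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
X, y = digits.data, digits.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fraction of samples belonging to each digit class (0 to 9)
full_frac = np.bincount(y) / len(y)
train_frac = np.bincount(y_train) / len(y_train)
test_frac = np.bincount(y_test) / len(y_test)

print(np.round(full_frac, 3))
print(np.round(train_frac, 3))
print(np.round(test_frac, 3))
```

With stratification the three rows should be nearly identical. Without it, the proportions in a small test set can drift away from those of the original data.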

Model complexity and over/underfitting

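# Import the array and plotting libraries used below
import numpy as np
import matplotlib.pyplot as plt

# Set up arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))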

# Loop over different values of k
for i, k in enumerate(neighbors):
    # Set up a k-NN classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)

    # Fit the classifier to the training data
    knn.fit(X_train, y_train)

    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)

    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)

# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
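The same accuracies can also be read as numbers instead of a plot. In this sketch, a large gap between training and testing accuracy at small k hints at overfitting, while accuracy falling on both sets as k grows hints at underfitting. The names gap and best_k are introduced here for illustration. Picking k by test accuracy is a simplification for these notes, not a recommended model selection procedure.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target,
    test_size=0.2, random_state=42, stratify=digits.target
)

neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)

# Difference between training and testing accuracy for each k
gap = train_accuracy - test_accuracy

# The k with the highest testing accuracy (first one in case of ties)
best_k = neighbors[np.argmax(test_accuracy)]

for k, tr, te, g in zip(neighbors, train_accuracy, test_accuracy, gap):
    print("k={}: train={:.4f}, test={:.4f}, gap={:.4f}".format(k, tr, te, g))
print("Best k by testing accuracy: {}".format(best_k))
```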