MachineLearning - henk52/knowledgesharing GitHub Wiki
Machine learning
Introduction
Vocabulary
DBSCAN: a density-based clustering algorithm that groups together points in high-density regions and marks points in low-density regions as noise (outliers).
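A minimal sketch of DBSCAN with scikit-learn (the data and the `eps`/`min_samples` values here are illustrative, not from the source):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus one far-away point.
blob_a = rng.normal(loc=(0, 0), scale=0.2, size=(20, 2))
blob_b = rng.normal(loc=(5, 5), scale=0.2, size=(20, 2))
outlier = np.array([[10.0, -10.0]])
X = np.vstack([blob_a, blob_b, outlier])

model = DBSCAN(eps=0.8, min_samples=5).fit(X)
# Label -1 marks noise (outliers); other labels are cluster ids.
print(sorted(set(model.labels_)))
```

The lone point far from both blobs ends up labeled -1 (noise), which is how DBSCAN flags outliers.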
Data: Predictor or set of predictors used to make a prediction.
Dendrogram: a tree diagram that shows the sequence and distances at which clusters are merged in hierarchical clustering.
Factor analysis: a regression method for discovering root causes or hidden factors that are present in the data set but not directly observable. You regress on the features to discover factors that you can then use as variables to represent the original data set.
Feature: variable, column, attribute, or field.
hierarchical clustering: an unsupervised machine learning method that you can use to predict subgroups based on the difference between data points and their nearest neighbors. Typical applications include:
hospital resource management
business process management
customer segmentation analysis
social network analysis
Instances: row, data point, value, or case.
K-means: an unsupervised clustering method that partitions observations into k clusters by repeatedly assigning each point to its nearest centroid and recomputing the centroids.
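A minimal K-means sketch with scikit-learn (the six illustrative points form two obvious groups):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
              [8.0, 8.0], [8.1, 7.9], [7.9, 8.2]])
# Each point gets the label of its nearest centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```

The first three points share one label and the last three share the other, whichever numeric ids the fit happens to assign.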
K-means precision: a measure of the model's relevancy (higher is better).
K-means recall: a measure of the model's completeness (higher is better).
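Precision and recall are standard classification metrics computed against known labels; a small sketch with hand-picked illustrative labels:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
# Precision: of everything predicted positive, how much was relevant?
# Recall: of everything actually positive, how much did we find?
print(precision_score(y_true, y_pred))  # 2 of 3 predicted positives are right
print(recall_score(y_true, y_pred))     # 2 of 3 actual positives were found
```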
Linear regression: a method for predicting a continuous target from one or more predictors. It assumes that:
all variables are continuous numeric variables and not categorical ones,
your data is free of missing values and outliers,
there's a linear relationship between the predictors and the predictant,
all predictors are independent of one another, and
your residuals are normally distributed.
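A minimal linear regression sketch with scikit-learn, using noise-free illustrative data generated from y = 2x + 1 so the fit is exact:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data on the line y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

reg = LinearRegression().fit(X, y)
# The model recovers the slope and intercept.
print(reg.coef_[0], reg.intercept_)
```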
Logistic regression: a simple machine learning method that you can use to predict an observation's category based on the relationship between the target feature and the independent predictive features in the data set.
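A minimal logistic regression sketch with scikit-learn (one illustrative feature whose class flips between x = 2 and x = 3):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative, linearly separable data.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
# Predict the category of new observations.
print(clf.predict([[1.0], [4.0]]))
```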
Outliers: you want less than 5% of your data set marked as outliers.
Point outliers: observations anomalous with respect to the majority observations in a feature.
Contextual outliers: observations considered anomalous given a specific context.
Collective outliers: a collection of observations that is anomalous as a group; the points appear close to one another because they all share similar anomalous values.
Network analysis
Nodes: vertices around which the graph is formed
Edges: the lines that connect them.
Directed graphs: a graph where there is a direction assigned to each edge that connects the nodes.
Directed edge: an edge that has been assigned a direction between two nodes.
Undirected graphs: all the edges are bidirectional
Undirected edge: an edge that flows both ways between nodes.
Graph size: the number of edges in a graph.
Graph order: the number of vertices in a graph.
Degree: the number of edges connected to a vertex; it's a measure of connectedness.
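The graph terms above can be sketched in plain Python with a small undirected edge set (the node names are illustrative):

```python
# Undirected edges as a set of node pairs.
edges = {("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")}
nodes = {n for e in edges for n in e}

graph_size = len(edges)   # size: number of edges
graph_order = len(nodes)  # order: number of vertices

# Degree: number of edges connected to each vertex.
degree = {n: sum(n in e for e in edges) for n in nodes}
print(graph_size, graph_order, degree["C"])
```

Here "C" touches three edges, so it has the highest degree (connectedness) in the graph.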
Naive Bayes
Assumptions:
your predictors are independent of one another
has an a priori assumption: past conditions still hold true, meaning that when we make predictions from historical values, we'll get incorrect results if present circumstances have changed.
Multinomial: is good when your features describe discrete frequency counts (for example, word counts).
Bernoulli: is good for making predictions from binary features.
Gaussian: approach is good for making predictions from normally distributed features.
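A minimal Gaussian Naive Bayes sketch with scikit-learn (one illustrative, roughly normally distributed feature per class):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Class 0 clusters near 1.0, class 1 near 5.0.
X = np.array([[1.0], [1.1], [0.9], [5.0], [5.1], [4.9]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X, y)
print(nb.predict([[1.05], [4.95]]))
```

Swapping in `MultinomialNB` or `BernoulliNB` from the same module covers the count-based and binary cases described above.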
PCA: Principal Component Analysis
Singular Value Decomposition: a linear algebra method that you use to decompose a matrix into three resultant matrices.
supervised methods: make predictions from labeled data.
SVD: Singular Value Decomposition.
Target: predictant or dependent variable
unsupervised methods: make predictions from unlabeled data.
Overview
Hierarchical Clustering
Distance metrics can be set as either
Euclidean
Manhattan
Cosine
linkage parameters are:
Ward
Complete
Average
Basically, with hierarchical clustering you try every possible combination of distance metric and linkage parameter, and go with the model that returns the most accurate results.
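The try-every-combination approach can be sketched with SciPy's hierarchical clustering (the four points are illustrative; SciPy calls the Manhattan metric "cityblock", and Ward linkage only supports Euclidean distance):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four illustrative 2-D points forming two obvious pairs.
X = np.array([[1.0, 0.0], [1.1, 0.1], [0.0, 5.0], [0.2, 5.1]])

# Try each distance metric with each (non-Ward) linkage.
for metric in ("euclidean", "cityblock", "cosine"):
    for method in ("complete", "average"):
        Z = linkage(X, method=method, metric=metric)
        print(metric, method, fcluster(Z, t=2, criterion="maxclust"))

# Ward linkage requires Euclidean distances.
Z = linkage(X, method="ward", metric="euclidean")
labels_ward = fcluster(Z, t=2, criterion="maxclust")
print("euclidean ward", labels_ward)
```

In practice you would score each combination (for example against known labels, or with a silhouette score) and keep the best one.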
Python modules
Pandas
Dataframe: like a spreadsheet
Series: like a single column or a single row.
series = dataframe['SeriesName']
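A minimal sketch of the DataFrame/Series relationship (column and data names are illustrative):

```python
import pandas as pd

# A DataFrame is like a spreadsheet: named columns, numbered rows.
dataframe = pd.DataFrame({"SeriesName": [1, 2, 3],
                          "Other": ["a", "b", "c"]})

series = dataframe["SeriesName"]  # selecting a column gives a Series
row = dataframe.loc[0]            # selecting a row also gives a Series
print(type(series).__name__, type(row).__name__)
```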
cookbook
Put up to 2/3 of the data set into learning
Use the last 1/3 for testing.
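The 2/3 train, 1/3 test split above can be done with scikit-learn's `train_test_split` (the nine data points are illustrative):

```python
from sklearn.model_selection import train_test_split

data = list(range(9))
labels = [0, 1, 0, 1, 0, 1, 0, 1, 0]

# Hold out one third of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    data, labels, test_size=1/3, random_state=0)
print(len(X_train), len(X_test))
```

`random_state` just makes the shuffle reproducible; omit it for a fresh random split each run.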
Using Anaconda
Factor analysis
The factor analysis model assumes that your features are metric, meaning they are either continuous or ordinal; that you have a correlation coefficient R greater than 0.3; that you have more than 100 observations and more than five observations per feature; and that your sample is homogeneous.
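A minimal factor analysis sketch with scikit-learn; the synthetic data here is illustrative (200 observations of 4 correlated metric features driven by 2 hidden factors, satisfying the more-than-100-observations rule above):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Two hidden factors drive four observed features, plus small noise.
factors = rng.normal(size=(200, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.1],
                     [0.0, 1.0], [0.1, 0.9]])
X = factors @ loadings.T + 0.1 * rng.normal(size=(200, 4))

fa = FactorAnalysis(n_components=2).fit(X)
# components_ holds the estimated loadings: one row per factor.
print(fa.components_.shape)
```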