Guidelines for Analyzing a Dataset - AP-State-Skill-Development-Corporation/Datasets GitHub Wiki
Follow these guidelines when performing your initial analysis of a machine learning dataset.
## Identify Noise within a Dataset
In machine learning, noise takes on a number of forms that cause algorithms to misidentify patterns and get bogged down in unnecessary processing, ultimately resulting in an ineffective model. When looking for noise in tables of training data, examine:
- Specific data items—A particular data value may be missing, or it may contain an extreme or atypical value (outlier) that does not follow the general pattern established by other data items in the same column. Anomalous data items may need to be modified to adhere to the general pattern, or the record that contains them may need to be dropped from the training dataset.
- Columns—Some of the columns you have initially selected for use in a model may have little or no correlation to the target, or may be redundant since they are dependent on other columns that are already included in the model. Unnecessary columns may need to be dropped from the training dataset.
- Rows—In some cases, the individual data items within a particular row/record may each be fine, but together they may represent an anomalous combination, one that defies the overall pattern established by data in other records. Unusual or anomalous data combinations may mislead the algorithm during training. Anomalous rows may need to be dropped from the training dataset.
Note: Odd combinations of data values within records can be hard to detect by inspecting individual statistical measures, but they can often be discovered through machine learning methods such as anomaly detection.
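The three checks above can be sketched with pandas. The DataFrame, column names, and thresholds below are hypothetical examples, not part of the original guidelines:

```python
import pandas as pd

# Hypothetical training data: one missing value and one implausible value
df = pd.DataFrame({
    "age":    [25, 31, 29, None, 27, 250],   # 250 is an anomalous item
    "income": [40000, 52000, 48000, 45000, 46000, 47000],
    "target": [0, 1, 0, 1, 0, 1],
})

# Specific data items: count missing values per column
missing = df.isna().sum()

# Columns: check each column's correlation with the target
corr_with_target = df.corr(numeric_only=True)["target"]

# Rows: drop records with missing or clearly anomalous values
cleaned = df.dropna()
cleaned = cleaned[cleaned["age"] < 120]
```

Here the missing `age` and the implausible value 250 would both be removed before training; in practice the thresholds depend on the domain.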
## Reveal Outliers in a Dataset
The primary goal of some models, such as those detecting fraud or network intrusions, is to reveal outliers. For other models, however, the presence of outliers in the training dataset can skew and mislead your initial analysis, hamper training, increase training time, and diminish the model's effectiveness. To identify outliers in a training dataset:
- Use statistical measures to reveal extreme values. For example, compute each value's z-score—its distance from the column mean, measured in standard deviations—and flag values that lie more than 2 or 3 standard deviations from the mean as outliers.
- Use visualization tools to reveal extreme values. For example, use a histogram or box-and-whisker plot and look for distributions extending out in long, heavy tails (as in skewed or leptokurtic distributions), which may represent outliers in the data. In a scatterplot, look for values far away from any clusters.
- Use machine learning algorithms to reveal outliers. For example, the k-means clustering algorithm can identify natural clusters of data based on multiple values within a dataset. Treat points with the greatest distance from any cluster centroid as outliers.
Note: Confirm suspected outliers by removing them from the dataset, retraining the model, and testing whether their removal improves the model's performance.
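As a minimal sketch of the statistical approach above, the following flags values more than 2 standard deviations from the column mean; the data and column name are hypothetical:

```python
import pandas as pd

# Hypothetical column of transaction amounts with one extreme value
df = pd.DataFrame({"amount": [10, 12, 11, 9, 13, 10, 11, 95]})

# z-score: how many standard deviations each value lies from the column mean
z = (df["amount"] - df["amount"].mean()) / df["amount"].std()

# Flag values more than 2 standard deviations from the mean as outliers
outliers = df[z.abs() > 2]
```

Here only the value 95 exceeds the threshold; the 2-versus-3 cutoff is a judgment call that depends on how aggressive you want outlier removal to be.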
## Use Python to Analyze Correlations
The pandas library provides various functions you can call to get information about the data loaded within a DataFrame. The following examples assume you have already created a DataFrame named my_dataframe, which is loaded with data.
- my_dataframe.corr()—Returns a DataFrame (a correlation matrix) containing the correlation coefficient (a floating-point number) for each pair of numeric columns in the DataFrame. You can specify the type of correlation to compute via the method parameter; Pearson standard correlation is the default.
- my_dataframe.sort_values()—Sorts the values along either axis. You pass the name (or list of names) to sort by via the by parameter, and can choose ascending or descending order with the ascending parameter. The function returns a new DataFrame with the values sorted; assign the return value back to the original variable if you want it to contain the sorted results.
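Putting the two calls together—the columns in this my_dataframe are invented for illustration:

```python
import pandas as pd

# Hypothetical study-time data loaded into a DataFrame
my_dataframe = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 68, 75],
    "id":    [5, 3, 1, 4, 2],
})

# Pairwise Pearson correlation coefficients between columns
corr_matrix = my_dataframe.corr()

# Sort rows by one column, descending; assign back to keep the sorted result
my_dataframe = my_dataframe.sort_values(by="score", ascending=False)
```

In this example "hours" correlates strongly with "score" while "id" does not, which is the kind of signal you would use to decide which columns to keep in a model.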