Working diary from 'Practical Statistics for Data Scientists', authors: Peter Bruce & Andrew Bruce & Peter Gedeck
- Relevant link: https://github.com/gedeck/practical-statistics-for-data-scientists
- Languages: R & Python
############################################### Chapter 1 Exploring data
- Data Types:
  - Numeric data
    - Continuous: float (with decimals), interval (e.g., temperature in Celsius/Fahrenheit)
    - Discrete: count, integer
  - Categorical data
    - Ordinal
    - Binary (1/0, yes/no, etc.)
- Rectangular data: a data frame/matrix with rows ('record'/'case'/'sample'/'observation') and columns ('feature'/'attribute'/'variable'/'predictor')
  - In R, the function to create a data frame is data.frame(); the packages 'data.table' and 'dplyr' make it possible to specify multilevel indexes
  - In Python, the function to create a data frame is pandas.DataFrame(), which supports multilevel/hierarchical indexes (see the sketch below)
Further reference: data frame
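A minimal Python/pandas sketch of creating a data frame and setting a multilevel (hierarchical) index; the state/county columns and values here are illustrative, not taken from the book.

```python
# Build a small data frame and set a two-level (hierarchical) index.
import pandas as pd

df = pd.DataFrame({
    "state": ["WA", "WA", "OR", "OR"],
    "county": ["King", "Pierce", "Lane", "Marion"],
    "population": [2_250_000, 905_000, 383_000, 346_000],  # illustrative numbers
})

df_indexed = df.set_index(["state", "county"])   # multilevel index on state, county

print(df_indexed.loc["WA"])            # all rows for one outer index level
print(df_indexed.loc[("OR", "Lane")])  # a single record addressed by both levels
```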
- Estimates of location
- Mean
- Median
- Percentile
- Outlier
- Robust
- Weighted mean:
- Weighted median
- Trimmed mean
Outliers can distort the mean, but the median and weighted median are robust to outliers. The trimmed mean also avoids the influence of extreme values: it is a compromise between the median and the mean, using more of the data than the median while eliminating the influence of the extremes. The weighted mean is used when some values are intrinsically more variable than others or when the data do not equally represent the different groups of interest. The R package 'matrixStats' provides a function to calculate the weighted median, while in Python NumPy can calculate the weighted mean (see the sketch after the summary below).
# ### In summary, the median, weighted median, and trimmed mean are robust to outliers
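A minimal Python sketch of these location estimates on illustrative data; scipy's trim_mean and numpy's average are used here (a weighted median would need a separate package such as wquantiles).

```python
# Compare location estimates on data containing one outlier.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 100])   # 100 is an outlier
w = np.array([1, 1, 1, 1, 2, 2, 2, 1])     # illustrative weights

print(np.mean(x))                  # pulled up by the outlier
print(np.median(x))                # robust to the outlier
print(stats.trim_mean(x, 0.2))     # trimmed mean, 20% cut from each tail
print(np.average(x, weights=w))    # weighted mean
```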
- Estimates of Variability
- Deviations
- Variance
- Standard deviation
- Mean absolute deviation
- Median absolute deviation from the median
- Range
- Order statistics
- Percentile
- Interquartile range: difference between 25th percentile and 75th percentile.
Averaging the deviations themselves does not work, because negative and positive deviations offset one another; the mean absolute deviation solves this problem. The best-known estimates of variability are the variance and the standard deviation; mathematically, working with squared values is more convenient than working with absolute values. Variance, standard deviation, and mean absolute deviation are not robust to outliers and extreme values, but the median absolute deviation from the median (MAD) is robust. Among order statistics, the range (difference between largest and smallest) is sensitive to outliers, but useful for identifying them. Differences between percentiles are robust to outliers, and the IQR is a common measure of variability.
## # In summary, the median absolute deviation and the IQR are the most robust estimates of variability with respect to outliers (see the sketch below)
Reference for percentiles: percentile
Reference for deviations from the median: Deviations from median
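A minimal Python sketch of these variability estimates on illustrative data; statsmodels' robust MAD is scaled to be comparable to the standard deviation for normally distributed data.

```python
# Compare variability estimates on data containing one outlier.
import numpy as np
from scipy import stats
from statsmodels import robust

x = np.array([1, 2, 3, 4, 5, 6, 7, 100])   # 100 is an outlier

print(np.std(x, ddof=1))                            # standard deviation (not robust)
print(np.mean(np.abs(x - np.mean(x))))              # mean absolute deviation
print(robust.scale.mad(x))                          # median absolute deviation (robust)
print(np.percentile(x, 75) - np.percentile(x, 25))  # IQR by hand (robust)
print(stats.iqr(x))                                 # same IQR via scipy
```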
- Data distribution
- Boxplot: based on percentiles
- Histogram:
- Density plot: kernel density estimate (KDE) (see the sketch below)
Reference for choosing kernels and bandwidth in kernel density estimation: kernel estimate
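A minimal Python/pandas sketch of these three plots on illustrative random data; matplotlib is assumed for display.

```python
# Boxplot, histogram, and kernel density plot of the same series.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.default_rng(0).normal(size=500))  # illustrative data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
s.plot.box(ax=axes[0], title="Boxplot")            # based on percentiles
s.plot.hist(ax=axes[1], bins=20, title="Histogram")
s.plot.density(ax=axes[2], title="Density (KDE)")  # kernel density estimate
plt.tight_layout()
plt.show()
```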
## # The mode is the value that appears most often; it is used for categorical data rather than numeric data
## # Expected value (EV) is a form of weighted mean that adds the ideas of future expectations and probability weights
- Correlation
- The correlation coefficient gives an estimate of the correlation between two variables
Pearson's correlation coefficient: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) · s_x · s_y) (https://www.gstatic.com/education/formulas2/472522532/en/correlation_coefficient_formula.svg)
## # But it is not robust to outliers.
## # Spearman's rho and Kendall's tau are based on the ranks of the data rather than the values, so they are more robust to outliers and can handle certain types of nonlinearities.
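A minimal Python/pandas sketch comparing the three correlation estimates on illustrative data with one injected outlier.

```python
# Pearson vs. rank-based (Spearman, Kendall) correlation with an outlier.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)
x[0], y[0] = 10, -10                      # inject an outlier

df = pd.DataFrame({"x": x, "y": y})
print(df.corr(method="pearson"))   # sensitive to the outlier
print(df.corr(method="spearman"))  # rank-based, more robust
print(df.corr(method="kendall"))   # rank-based, more robust
```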
- Exploring two or more variables
  - Scatterplot: suited to a relatively small number of data points
  - Hexagonal binning plot: useful when there are many data points (see the sketch below)
  - Contour plot
  - Contingency table
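A minimal Python/pandas sketch of a hexagonal binning plot (two numeric variables) and a contingency table (two categorical variables) on illustrative data.

```python
# Hexbin plot for many numeric points, crosstab for two categorical variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x": rng.normal(size=5000),
    "y": rng.normal(size=5000),
    "group": rng.choice(["A", "B"], size=5000),
    "outcome": rng.choice(["yes", "no"], size=5000),
})

df.plot.hexbin(x="x", y="y", gridsize=25)        # hexagonal binning plot
plt.show()

print(pd.crosstab(df["group"], df["outcome"]))   # contingency table of counts
```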
########################################################### Chapter 2 Data and Sampling Distributions
## # Sample (a subset from a larger dataset), corresponding to n (sample size) / x̄ (sample mean)
## # Population (the larger dataset), corresponding to N (population size) / μ (population mean)
## # Random sampling
## # Stratified sampling: dividing the population into strata and randomly sampling from each stratum
## # Stratum: a homogeneous subgroup of a population with common characteristics
## # Simple random sample: the sample results from sampling without stratifying the population
## # Bias: systematic error
- Sample bias: a sample that misrepresents the population
- Selection bias: bias that results from the way in which observations are selected
Data snooping: extensive hunting through data in search of something interesting
## # Vast search effect: bias or nonreproducibility resulting from repeated data modeling, or from modeling data with large numbers of predictor variables.
The more variables you have, the easier it becomes to 'oversearch' and identify false patterns
## # Permutation test (hypothesis test): a nonparametric test based on re-randomizing (shuffling) the observed data, using proof-by-contradiction logic to test whether the samples could come from the same population distribution. (In R, the package 'coin' performs permutation tests directly rather than manually.) With very few observations, there are too few distinct permutations for the test to be informative.
References: Permutation test; Computing & mathematics; Permutation & t-test
## # Target shuffling: shuffling is the permutation step of a permutation test. Target shuffling, sometimes called a randomization test, is a process to test the statistical accuracy of data mining results, particularly for identifying false positives or coincidental effects.
Reference: Target shuffling
## # Regression to the mean:
  - Randomized evaluation is essential in avoiding regression to the mean; randomized controlled trials are the gold standard in medical research (i.e., patients at all symptom levels are treated with the same drug)
  - To avoid regression to the mean, use a probability sampling method
  - The percent of regression to the mean can be calculated as Prm = 100(1 − r); when r = 1 (perfect correlation between variables), there is no regression to the mean. For example, if r = 0.7, Prm = 100(1 − 0.7) = 30%.
## # The sampling distribution can be estimated via the bootstrap or via formulas that rely on the central limit theorem.
## # Central limit theorem: it states that the distribution of sample means approximates a normal distribution as the sample size gets larger; the standard error of the sample mean equals the population standard deviation divided by the square root of the sample size (SE = σ/√n).
- Standard deviation (measure the variability of individual data points/a sample) vs Standard error (metric sums up the variability in the sampling distribution)
Standard deviation reflects variability within a single sample; standard error estimates the variability across samples drawn from a population
In practice, collecting new samples to estimate the standard error is not feasible (and is statistically wasteful), so the bootstrap (resampling from the observed sample) is the standard way to estimate it. The validity of the standard-error formula arises from the central limit theorem.
## # Bootstrap: draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample to estimate the sampling distribution. Steps as follows (see the sketch below):
  1. Draw a sample value, record it, and replace it
  2. Repeat n times
  3. Record the mean of the n resampled values
  4. Repeat steps 1-3 R times
  5. Use the R results to:
     - calculate the standard deviation (this estimates the standard error of the sample mean)
     - produce a histogram or boxplot
     - find a confidence interval
The R package 'boot' combines these steps into a single function, while Python does not provide a built-in implementation; the package 'scikit-learn' offers a resample function.
Bagging (bootstrap aggregating): with classification and regression trees (decision trees), running multiple trees on bootstrap samples and averaging their predictions may perform better than using a single tree.
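A minimal Python sketch of the bootstrap steps above, using scikit-learn's resample on an illustrative sample.

```python
# Bootstrap estimate of the standard error and a confidence interval for the mean.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(3)
sample = rng.exponential(scale=10, size=100)   # an illustrative observed sample

R = 1000
boot_means = np.array([resample(sample).mean() for _ in range(R)])  # with replacement

print(sample.mean())                            # original sample mean
print(boot_means.std())                         # bootstrap estimate of the standard error
print(np.percentile(boot_means, [2.5, 97.5]))   # 95% bootstrap confidence interval
```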
## # Summary of these two parts:
  - Resampling is the process of taking repeated samples from observed data; it includes bootstrap and permutation (shuffling) procedures
  - The bootstrap is a powerful tool for assessing the variability of a sample statistic, but it does not compensate for a small sample size
  - Aggregating multiple bootstrap-sample predictions (bagging) generally outperforms using a single model
  - The bootstrap focuses on quantifying/estimating population parameters (confidence intervals, variability of the sampling process); the permutation test focuses on the null distribution (testing hypotheses) and is a special case of the randomization test; randomization tests are not particularly concerned with populations and their parameters, but consider every possible permutation of the labels
  - Bagging: with classification and regression trees (decision trees), running multiple trees on bootstrap samples and then averaging their predictions generally performs better than using a single tree. Bagging is an ensemble technique for improving robustness; the random forest is based on bagging and decision trees
Confidence Intervals
References: bootstrap in R; Computational statistics
- Normal Distribution: bell-shaped, also referred to as the Gaussian distribution.
  - Standard normal: a normal distribution with mean = 0 and standard deviation = 1
  - 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively
  - Standardization: to compare data to a standard normal distribution, subtract the mean and divide by the standard deviation; the transformed value is termed a z-score (see the sketch below).
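A minimal Python sketch of standardization (z-scores) on illustrative values, both by hand and via scipy.

```python
# z-score: subtract the mean and divide by the standard deviation.
import numpy as np
from scipy import stats

x = np.array([52.0, 61.0, 47.0, 55.0, 70.0, 58.0])   # illustrative values

z_manual = (x - x.mean()) / x.std(ddof=1)   # standardization by hand
z_scipy = stats.zscore(x, ddof=1)           # same result via scipy

print(z_manual)
print(np.allclose(z_manual, z_scipy))       # True
```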
- Long-Tailed Distribution
- Student's t-Distribution: bell-shaped like the normal distribution, but a bit thicker and longer on the tails.
  - Degrees of freedom: a parameter that allows the t-distribution to adjust to different sample sizes, statistics, and numbers of groups
- Binomial Distribution: distribution of the number of successes in n trials; the single-trial case (n = 1) is known as the Bernoulli distribution
- Chi-Square Distribution
Reference: https://en.wikipedia.org/wiki/Chi-squared_distribution
The primary reason the chi-squared distribution is extensively used in hypothesis testing is its relationship to the normal distribution. The chi-square statistic measures the extent of departure from what you would expect under a null model. It is typically concerned with counts of subjects falling into categories.
- F-Distribution: a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance (ANOVA) and other F-tests
  - It compares the extent to which differences among group means are greater than we might expect under normal random variation
  - The F-statistic measures the ratio of variability among group means to the variability within each group (residual variability)
  - Similar to the A/B/C testing setting discussed for the chi-square distribution, but it is typically concerned with continuous values rather than counts
- Poisson distribution: gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events per interval
- Exponential distribution: deals with the time between occurrences of successive events as time flows by continuously; it is the companion of the Poisson distribution (the time between events of a Poisson process is exponentially distributed)
- Weibull distribution: used extensively in reliability applications to model failure times
  - Three parameters: shape (β), scale (η, eta), and threshold
In summary (see the sketch below):
  - For events that occur at a constant rate, the number of events per unit of time or space can be modeled as a Poisson distribution
  - The time or distance between one event and the next can be modeled as an exponential distribution
  - A changing event rate over time can be modeled with a Weibull distribution
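A minimal Python/numpy sketch of drawing from these three distributions; the rate, scale, and shape values are illustrative.

```python
# Sample from Poisson (event counts), exponential (waiting times), and Weibull (failure times).
import numpy as np

rng = np.random.default_rng(4)

events = rng.poisson(lam=2, size=5)            # events per unit time, mean rate λ = 2
waits = rng.exponential(scale=1 / 2, size=5)   # time between events at rate 2 (mean wait 0.5)
lifetimes = rng.weibull(a=1.5, size=5) * 5000  # shape β = 1.5, multiplied by an illustrative scale η = 5000

print(events)
print(waits)
print(lifetimes)
```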
############################################### Chapter 3 Statistical Experiments and Significance Testing
A/B Testing: an experiment with two groups to establish which of two treatments is superior
blind study: one in which the subjects are unaware of whether they are getting treatment A or treatment B
double-blind study: one in which the investigators and facilitators also are unaware of which subjects are getting which treatment
Ideally, subjects are randomized (randomly assigned) to treatments
Hypothesis Tests
Null hypothesis; Alternative hypothesis; One-way test; Two-way test
Resampling: Repeatedly sample values from observed data with a general goal of assessing random variability in a statistic, including Permutation and Bootstrap
bootstrap: used to assess the reliability of an estimate; permutation: used to test hypotheses
Permutation test: the procedure of combining two or more samples together and randomly reallocating the observations to resamples, also referred to as a randomization test or exact test. The permutation procedure is as follows (see the sketch below):
*** Combine the results from the different groups into a single data set
*** Shuffle the combined data and randomly draw (without replacement) a resample of the same size as group A
*** From the remaining data, randomly draw (without replacement) a resample of the same size as group B
*** Do the same for groups C, D, and so on
*** Calculate the statistic or estimate for the resamples
*** Repeat the previous steps R times to yield a permutation distribution of the test statistic
*** Fisher's exact test is simply an exhaustive permutation test, classically applied to counts in a 2x2 contingency table (see the chi-square section below)
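A minimal Python sketch of this permutation procedure for two groups; the group values are illustrative.

```python
# Permutation test of the difference in means between two groups.
import numpy as np

rng = np.random.default_rng(5)
group_a = np.array([23.0, 25.5, 28.1, 30.2, 21.7])
group_b = np.array([27.3, 31.0, 29.8, 33.5, 30.9])
observed_diff = group_b.mean() - group_a.mean()

combined = np.concatenate([group_a, group_b])
n_a = len(group_a)

R = 5000
perm_diffs = np.empty(R)
for i in range(R):
    shuffled = rng.permutation(combined)                       # shuffle the combined data
    perm_diffs[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# One-sided p-value: share of permuted differences at least as large as the observed one
p_value = np.mean(perm_diffs >= observed_diff)
print(observed_diff, p_value)
```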
Statistical significance and p-values
*** How statisticians measure whether an experiment yields a result more extreme than what chance might produce
*** p-value: the probability of obtaining results as unusual or extreme as the observed results, given a chance model that embodies the null hypothesis
*** Alpha (e.g., 5%, 1%): the probability threshold of 'unusualness' that chance results must surpass for actual outcomes to be statistically significant
*** Type 1 error: mistakenly concluding that an effect is real when it is really just due to chance
*** Type 2 error: mistakenly concluding that an effect is not real when it is actually real
t-Test: can be used to compare the means of two groups, as an alternative to a permutation test (see the sketch below)
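A minimal Python/scipy sketch of a two-sample t-test on the same kind of illustrative groups; Welch's version (equal_var=False) is one common choice, not necessarily the book's.

```python
# Two-sample t-test comparing group means.
import numpy as np
from scipy import stats

group_a = np.array([23.0, 25.5, 28.1, 30.2, 21.7])
group_b = np.array([27.3, 31.0, 29.8, 33.5, 30.9])

t_stat, p_value = stats.ttest_ind(group_b, group_a, equal_var=False)  # Welch's t-test
print(t_stat, p_value)
```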
Multiple Testing
Degrees of Freedom
ANOVA (analysis of variance, one factor)
F-statistic: it is based on the ratio of the variance across group means to the variance due to residual error.
Two-Way ANOVA (two factors)
*** ANOVA is the extension of the A/B test, used to assess whether the overall variation among groups is within the range of chance variation (see the sketch below)
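A minimal Python/scipy sketch of a one-way ANOVA F-test on illustrative groups.

```python
# One-way ANOVA: F-statistic is variance across group means relative to variance within groups.
from scipy import stats

group_a = [89, 93, 85, 90, 88]
group_b = [92, 97, 94, 96, 95]
group_c = [85, 84, 88, 83, 86]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```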
Chi-Square Test: used to assess whether the null hypothesis of independence among variables is reasonable; it measures the extent to which some observed data departs from expectation
*** The chi-square statistic is the sum of the squared Pearson residuals, i.e., χ² = Σ (observed − expected)² / expected, where the Pearson residual is (observed − expected) / √expected
*** Fisher's Exact Test: used when you have a small sample size and want to assess the significance of association between categorical variables in a 2x2 contingency table, whereas the Chi-Square Test is used for larger contingency tables and larger sample sizes (see the sketch below)
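A minimal Python/scipy sketch of a chi-square test of independence and Fisher's exact test on an illustrative 2x2 table of counts.

```python
# Chi-square test and Fisher's exact test on a 2x2 contingency table.
import numpy as np
from scipy import stats

# rows: treatment A / B; columns: converted / not converted (illustrative counts)
table = np.array([[14, 986],
                  [ 8, 992]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)

print(chi2, p_chi2)          # chi-square test (larger tables / samples)
print(odds_ratio, p_fisher)  # Fisher's exact test (small counts, 2x2)
```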
Multi-Arm Bandit Algorithm:
Power and Sample Size
############################################### Chapter 4 Regression and Prediction
Simple Linear Regression; Multiple Linear Regression; Prediction Using Regression; Factor Variables in Regression; Interpreting the Regression Equation; Regression Diagnostics; Polynomial and Spline Regression
############################################### Chapter 5 Classification
Naïve Bayes; Discriminant Analysis; Logistic Regression; Performance of Classification Models; Strategies for Imbalanced Data
############################################### Chapter 6 Statistical Machine Learning
K-Nearest Neighbors; Tree Models; Bagging and the Random Forest; Boosting
############################################### Chapter 7 Unsupervised Learning
Principal Component Analysis; K-Means Clustering; Hierarchical Clustering; Model-Based Clustering; Scaling and Categorical Variables