Working diary from 'Practical Statistics for Data Scientists', authors: Peter Bruce & Andrew Bruce & Peter Gedeck
- Relevant link: https://github.com/gedeck/practical-statistics-for-data-scientists
- Languages: R & Python
############################################### Chapter 1 Exploring data
- Data Types:
  - Numeric data
    - Continuous: float (with decimals), interval (e.g., temperature in Celsius/Fahrenheit)
    - Discrete: count, integer
  - Categorical data
    - Ordinal
    - Binary (1/0, yes/no, etc.)
- Rectangular data: a data frame/matrix with rows ('record'/'case'/'sample'/'observation') and columns ('feature'/'attribute'/'variable'/'predictor')
  - In R, the function to create a data frame is data.frame(); the packages 'data.table' and 'dplyr' make it possible to specify multilevel indexes
  - In Python, the function to create a data frame is pandas.DataFrame(), which supports multilevel/hierarchical indexes (see the sketch below)
Further reference: data frame
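A minimal Python/pandas sketch of creating a data frame and setting a multilevel (hierarchical) index; the state/county columns and values here are illustrative, not taken from the book.

```python
# Build a small data frame and set a two-level (hierarchical) index.
import pandas as pd

df = pd.DataFrame({
    "state": ["WA", "WA", "OR", "OR"],
    "county": ["King", "Pierce", "Lane", "Marion"],
    "population": [2_250_000, 905_000, 383_000, 346_000],  # illustrative numbers
})

df_indexed = df.set_index(["state", "county"])   # multilevel index on state, county

print(df_indexed.loc["WA"])            # all rows for one outer index level
print(df_indexed.loc[("OR", "Lane")])  # a single record addressed by both levels
```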
- Estimates of location
- Mean
- Median
- Percentile
- Outlier
- Robust
- Weighted mean:
- Weighted median
- Trimmed mean
Outliers can distort the mean, but the median and weighted median are robust to outliers. The trimmed mean also avoids the influence of extreme values: it is a compromise between the median and the mean, using more of the data than the median while eliminating the influence of the extremes. The weighted mean is used when some values are intrinsically more variable than others or when the data do not equally represent the different groups of interest. The R package 'matrixStats' provides a function to calculate the weighted median, while in Python NumPy can calculate the weighted mean (see the sketch after the summary below).
# ### In summary, the median, weighted median, and trimmed mean are robust to outliers
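A minimal Python sketch of these location estimates on illustrative data; scipy's trim_mean and numpy's average are used here (a weighted median would need a separate package such as wquantiles).

```python
# Compare location estimates on data containing one outlier.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 100])   # 100 is an outlier
w = np.array([1, 1, 1, 1, 2, 2, 2, 1])     # illustrative weights

print(np.mean(x))                  # pulled up by the outlier
print(np.median(x))                # robust to the outlier
print(stats.trim_mean(x, 0.2))     # trimmed mean, 20% cut from each tail
print(np.average(x, weights=w))    # weighted mean
```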
- Estimates of Variability
- Deviations
- Variance
- Standard deviation
- Mean absolute deviation
- Median absolute deviation from the median
- Range
- Order statistics
- Percentile
- Interquartile range: difference between 25th percentile and 75th percentile.
Averaging the deviations themselves does not work, because negative and positive deviations offset one another; the mean absolute deviation solves this problem. The best-known estimates of variability are the variance and the standard deviation; mathematically, working with squared values is more convenient than working with absolute values. Variance, standard deviation, and mean absolute deviation are not robust to outliers and extreme values, but the median absolute deviation from the median (MAD) is robust. Among order statistics, the range (difference between largest and smallest) is sensitive to outliers, but useful for identifying them. Differences between percentiles are robust to outliers, and the IQR is a common measure of variability.
## # In summary, the median absolute deviation and the IQR are the most robust estimates of variability with respect to outliers (see the sketch below)
Reference for percentiles: percentile
Reference for deviations from the median: Deviations from median
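A minimal Python sketch of these variability estimates on illustrative data; statsmodels' robust MAD is scaled to be comparable to the standard deviation for normally distributed data.

```python
# Compare variability estimates on data containing one outlier.
import numpy as np
from scipy import stats
from statsmodels import robust

x = np.array([1, 2, 3, 4, 5, 6, 7, 100])   # 100 is an outlier

print(np.std(x, ddof=1))                            # standard deviation (not robust)
print(np.mean(np.abs(x - np.mean(x))))              # mean absolute deviation
print(robust.scale.mad(x))                          # median absolute deviation (robust)
print(np.percentile(x, 75) - np.percentile(x, 25))  # IQR by hand (robust)
print(stats.iqr(x))                                 # same IQR via scipy
```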
- Data distribution
- Boxplot: based on percentiles
- Histogram:
- Density plot: kernel density estimate (KDE) (see the sketch below)
Reference for choosing kernels and bandwidth in kernel density estimation: kernel estimate
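A minimal Python/pandas sketch of these three plots on illustrative random data; matplotlib is assumed for display.

```python
# Boxplot, histogram, and kernel density plot of the same series.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.default_rng(0).normal(size=500))  # illustrative data

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
s.plot.box(ax=axes[0], title="Boxplot")            # based on percentiles
s.plot.hist(ax=axes[1], bins=20, title="Histogram")
s.plot.density(ax=axes[2], title="Density (KDE)")  # kernel density estimate
plt.tight_layout()
plt.show()
```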
## # The mode is the value that appears most often; it is used for categorical data rather than numeric data
## # Expected value (EV) is a form of weighted mean that adds the ideas of future expectations and probability weights
- Correlation
- The correlation coefficient gives an estimate of the correlation between two variables
Pearson's correlation coefficient: r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ((n − 1) · s_x · s_y) (https://www.gstatic.com/education/formulas2/472522532/en/correlation_coefficient_formula.svg)
## # But it is not robust to outliers.
## # Spearman's rho and Kendall's tau are based on the ranks of the data rather than the values, so they are more robust to outliers and can handle certain types of nonlinearities.
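A minimal Python/pandas sketch comparing the three correlation estimates on illustrative data with one injected outlier.

```python
# Pearson vs. rank-based (Spearman, Kendall) correlation with an outlier.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)
x[0], y[0] = 10, -10                      # inject an outlier

df = pd.DataFrame({"x": x, "y": y})
print(df.corr(method="pearson"))   # sensitive to the outlier
print(df.corr(method="spearman"))  # rank-based, more robust
print(df.corr(method="kendall"))   # rank-based, more robust
```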
- Exploring two or more variables
  - Scatterplot: suited to a relatively small number of data points
  - Hexagonal binning plot: useful when there are many data points (see the sketch below)
  - Contour plot
  - Contingency table
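A minimal Python/pandas sketch of a hexagonal binning plot (two numeric variables) and a contingency table (two categorical variables) on illustrative data.

```python
# Hexbin plot for many numeric points, crosstab for two categorical variables.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "x": rng.normal(size=5000),
    "y": rng.normal(size=5000),
    "group": rng.choice(["A", "B"], size=5000),
    "outcome": rng.choice(["yes", "no"], size=5000),
})

df.plot.hexbin(x="x", y="y", gridsize=25)        # hexagonal binning plot
plt.show()

print(pd.crosstab(df["group"], df["outcome"]))   # contingency table of counts
```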
########################################################### Chapter 2 Data and Sampling Distributions
## # Sample (a subset from a larger dataset), corresponding to n (sample size) / x̄ (sample mean)
## # Population (the larger dataset), corresponding to N (population size) / μ (population mean)
## # Random sampling
## # Stratified sampling: dividing the population into strata and randomly sampling from each stratum
## # Stratum: a homogeneous subgroup of a population with common characteristics
## # Simple random sample: the sample results from sampling without stratifying the population
## # Bias: systematic error
- Sample bias: a sample that misrepresents the population
- Selection bias: bias that results from the way in which observations are selected
Data snooping: extensive hunting through data in search of something interesting
## # Vast search effect: bias or nonreproducibility resulting from repeated data modeling, or from modeling data with large numbers of predictor variables.
The more variables you have, the easier it becomes to 'oversearch' and identify false patterns
## # Permutation test (hypothesis test): a nonparametric test based on re-randomizing (shuffling) the observed data, using proof-by-contradiction logic to test whether the samples could come from the same population distribution. (In R, the package 'coin' performs permutation tests directly rather than manually.) With very few observations, there are too few distinct permutations for the test to be informative.
References: Permutation test; Computing & mathematics; Permutation & t-test
## # Target shuffling: shuffling is the permutation step of a permutation test. Target shuffling, sometimes called a randomization test, is a process to test the statistical accuracy of data mining results, particularly for identifying false positives or coincidental effects.
Reference: Target shuffling
## # Regression to the mean:
  - Randomized evaluation is essential in avoiding regression to the mean; randomized controlled trials are the gold standard in medical research (i.e., patients at all symptom levels are treated with the same drug)
  - To avoid regression to the mean, use a probability sampling method
  - The percent of regression to the mean can be calculated as Prm = 100(1 − r); when r = 1 (perfect correlation between variables), there is no regression to the mean. For example, if r = 0.7, Prm = 100(1 − 0.7) = 30%.
## # The sampling distribution can be estimated via the bootstrap or via formulas that rely on the central limit theorem.
## # Central limit theorem: it states that the distribution of sample means approximates a normal distribution as the sample size gets larger; the standard error of the sample mean equals the population standard deviation divided by the square root of the sample size (SE = σ/√n).
- Standard deviation (measure the variability of individual data points/a sample) vs Standard error (metric sums up the variability in the sampling distribution)
Standard deviation reflects variability within a single sample; standard error estimates the variability across samples drawn from a population
In practice, collecting new samples to estimate the standard error is not feasible (and is statistically wasteful), so the bootstrap (resampling from the observed sample) is the standard way to estimate it. The validity of the standard-error formula arises from the central limit theorem.
## # Bootstrap: draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample to estimate the sampling distribution. Steps as follows (see the sketch below):
  1. Draw a sample value, record it, and replace it
  2. Repeat n times
  3. Record the mean of the n resampled values
  4. Repeat steps 1-3 R times
  5. Use the R results to:
     - calculate the standard deviation (this estimates the standard error of the sample mean)
     - produce a histogram or boxplot
     - find a confidence interval
The R package 'boot' combines these steps into a single function, while Python does not provide a built-in implementation; the package 'scikit-learn' offers a resample function.
Bagging (bootstrap aggregating): with classification and regression trees (decision trees), running multiple trees on bootstrap samples and averaging their predictions may perform better than using a single tree.
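A minimal Python sketch of the bootstrap steps above, using scikit-learn's resample on an illustrative sample.

```python
# Bootstrap estimate of the standard error and a confidence interval for the mean.
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(3)
sample = rng.exponential(scale=10, size=100)   # an illustrative observed sample

R = 1000
boot_means = np.array([resample(sample).mean() for _ in range(R)])  # with replacement

print(sample.mean())                            # original sample mean
print(boot_means.std())                         # bootstrap estimate of the standard error
print(np.percentile(boot_means, [2.5, 97.5]))   # 95% bootstrap confidence interval
```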
## # Summary of these two parts:
  - Resampling is the process of taking repeated samples from observed data; it includes bootstrap and permutation (shuffling) procedures
  - The bootstrap is a powerful tool for assessing the variability of a sample statistic, but it does not compensate for a small sample size
  - Aggregating multiple bootstrap-sample predictions (bagging) generally outperforms using a single model
  - The bootstrap focuses on quantifying/estimating population parameters (confidence intervals, variability of the sampling process); the permutation test focuses on the null distribution (testing hypotheses) and is a special case of the randomization test; randomization tests are not particularly concerned with populations and their parameters, but consider every possible permutation of the labels
  - Bagging: with classification and regression trees (decision trees), running multiple trees on bootstrap samples and then averaging their predictions generally performs better than using a single tree. Bagging is an ensemble technique for improving robustness; the random forest is based on bagging and decision trees
Confidence Intervals
References: bootstrap in R; Computational statistics
- Normal Distribution: bell-shaped, also referred to as the Gaussian distribution.
  - Standard normal: a normal distribution with mean = 0 and standard deviation = 1
  - 68%, 95%, and 99.7% of the values lie within one, two, and three standard deviations of the mean, respectively
  - Standardization: to compare data to a standard normal distribution, subtract the mean and divide by the standard deviation; the transformed value is termed a z-score (see the sketch below).
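A minimal Python sketch of standardization (z-scores) on illustrative values, both by hand and via scipy.

```python
# z-score: subtract the mean and divide by the standard deviation.
import numpy as np
from scipy import stats

x = np.array([52.0, 61.0, 47.0, 55.0, 70.0, 58.0])   # illustrative values

z_manual = (x - x.mean()) / x.std(ddof=1)   # standardization by hand
z_scipy = stats.zscore(x, ddof=1)           # same result via scipy

print(z_manual)
print(np.allclose(z_manual, z_scipy))       # True
```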
- Long-Tailed Distribution
- Student's t-Distribution: bell-shaped like the normal distribution, but a bit thicker and longer on the tails.
  - Degrees of freedom: a parameter that allows the t-distribution to adjust to different sample sizes, statistics, and numbers of groups
- Binomial Distribution: distribution of the number of successes in n trials; the single-trial case (n = 1) is known as the Bernoulli distribution
- Chi-Square Distribution
Reference: https://en.wikipedia.org/wiki/Chi-squared_distribution
The primary reason the chi-squared distribution is extensively used in hypothesis testing is its relationship to the normal distribution. The chi-square statistic measures the extent of departure from what you would expect under a null model. It is typically concerned with counts of subjects falling into categories.
- F-Distribution: a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance (ANOVA) and other F-tests
  - It compares the extent to which differences among group means are greater than we might expect under normal random variation
  - The F-statistic measures the ratio of variability among group means to the variability within each group (residual variability)
  - Similar to the A/B/C testing setting discussed for the chi-square distribution, but it is typically concerned with continuous values rather than counts
- Poisson distribution: gives the probability of an event happening a certain number of times (k) within a given interval of time or space. The Poisson distribution has only one parameter, λ (lambda), which is the mean number of events per interval
- Exponential distribution: deals with the time between occurrences of successive events as time flows by continuously; it is the companion of the Poisson distribution (the time between events of a Poisson process is exponentially distributed)
- Weibull distribution: used extensively in reliability applications to model failure times
  - Three parameters: shape (β), scale (η, eta), and threshold
In summary (see the sketch below):
  - For events that occur at a constant rate, the number of events per unit of time or space can be modeled as a Poisson distribution
  - The time or distance between one event and the next can be modeled as an exponential distribution
  - A changing event rate over time can be modeled with a Weibull distribution
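A minimal Python/numpy sketch of drawing from these three distributions; the rate, scale, and shape values are illustrative.

```python
# Sample from Poisson (event counts), exponential (waiting times), and Weibull (failure times).
import numpy as np

rng = np.random.default_rng(4)

events = rng.poisson(lam=2, size=5)            # events per unit time, mean rate λ = 2
waits = rng.exponential(scale=1 / 2, size=5)   # time between events at rate 2 (mean wait 0.5)
lifetimes = rng.weibull(a=1.5, size=5) * 5000  # shape β = 1.5, multiplied by an illustrative scale η = 5000

print(events)
print(waits)
print(lifetimes)
```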
############################################### Chapter 3 Statistical Experiments and Significance Testing
A/B Testing: an experiment with two groups to establish which of two treatments is superior
blind study: one in which the subjects are unaware of whether they are getting treatment A or treatment B
double-blind study: one in which the investigators and facilitators also are unaware of which subjects are getting which treatment
Ideally, subjects are randomized (randomly assigned) to treatments
Hypothesis Tests
Null hypothesis; Alternative hypothesis; One-way test; Two-way test
Resampling: Repeatedly sample values from observed data with a general goal of assessing random variability in a statistic, including Permutation and Bootstrap
bootstrap: used to assess the reliability of an estimate; permutation: used to test hypotheses
Permutation test: the procedure of combining two or more samples together and randomly reallocating the observations to resamples, also referred to as a randomization test or exact test. The permutation procedure is as follows (see the sketch below):
*** Combine the results from the different groups into a single data set
*** Shuffle the combined data and randomly draw (without replacement) a resample of the same size as group A
*** From the remaining data, randomly draw (without replacement) a resample of the same size as group B
*** Do the same for groups C, D, and so on
*** Calculate the statistic or estimate for the resamples
*** Repeat the previous steps R times to yield a permutation distribution of the test statistic
*** Fisher's exact test is simply an exhaustive permutation test, classically applied to counts in a 2x2 contingency table (see the chi-square section below)
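A minimal Python sketch of this permutation procedure for two groups; the group values are illustrative.

```python
# Permutation test of the difference in means between two groups.
import numpy as np

rng = np.random.default_rng(5)
group_a = np.array([23.0, 25.5, 28.1, 30.2, 21.7])
group_b = np.array([27.3, 31.0, 29.8, 33.5, 30.9])
observed_diff = group_b.mean() - group_a.mean()

combined = np.concatenate([group_a, group_b])
n_a = len(group_a)

R = 5000
perm_diffs = np.empty(R)
for i in range(R):
    shuffled = rng.permutation(combined)                       # shuffle the combined data
    perm_diffs[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# One-sided p-value: share of permuted differences at least as large as the observed one
p_value = np.mean(perm_diffs >= observed_diff)
print(observed_diff, p_value)
```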
Statistical significance and p-values
*** How statisticians measure whether an experiment yields a result more extreme than what chance might produce
*** p-value: the probability of obtaining results as unusual or extreme as the observed results, given a chance model that embodies the null hypothesis
*** Alpha (e.g., 5%, 1%): the probability threshold of 'unusualness' that chance results must surpass for actual outcomes to be statistically significant
*** Type 1 error: mistakenly concluding that an effect is real when it is really just due to chance
*** Type 2 error: mistakenly concluding that an effect is not real when it is actually real
t-Test: can be used to compare the means of two groups, as an alternative to a permutation test (see the sketch below)
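A minimal Python/scipy sketch of a two-sample t-test on the same kind of illustrative groups; Welch's version (equal_var=False) is one common choice, not necessarily the book's.

```python
# Two-sample t-test comparing group means.
import numpy as np
from scipy import stats

group_a = np.array([23.0, 25.5, 28.1, 30.2, 21.7])
group_b = np.array([27.3, 31.0, 29.8, 33.5, 30.9])

t_stat, p_value = stats.ttest_ind(group_b, group_a, equal_var=False)  # Welch's t-test
print(t_stat, p_value)
```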
Multiple Testing
Degrees of Freedom
ANOVA (analysis of variance, one factor)
F-statistic: it is based on the ratio of the variance across group means to the variance due to residual error.
Two-Way ANOVA (two factors)
*** ANOVA is the extension of the A/B test, used to assess whether the overall variation among groups is within the range of chance variation (see the sketch below)
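A minimal Python/scipy sketch of a one-way ANOVA F-test on illustrative groups.

```python
# One-way ANOVA: F-statistic is variance across group means relative to variance within groups.
from scipy import stats

group_a = [89, 93, 85, 90, 88]
group_b = [92, 97, 94, 96, 95]
group_c = [85, 84, 88, 83, 86]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
```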
Chi-Square Test: used to assess whether the null hypothesis of independence among variables is reasonable; it measures the extent to which some observed data departs from expectation
*** The chi-square statistic is the sum of the squared Pearson residuals, i.e., χ² = Σ (observed − expected)² / expected, where the Pearson residual is (observed − expected) / √expected
*** Fisher's Exact Test: used when you have a small sample size and want to assess the significance of association between categorical variables in a 2x2 contingency table, whereas the Chi-Square Test is used for larger contingency tables and larger sample sizes (see the sketch below)
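A minimal Python/scipy sketch of a chi-square test of independence and Fisher's exact test on an illustrative 2x2 table of counts.

```python
# Chi-square test and Fisher's exact test on a 2x2 contingency table.
import numpy as np
from scipy import stats

# rows: treatment A / B; columns: converted / not converted (illustrative counts)
table = np.array([[14, 986],
                  [ 8, 992]])

chi2, p_chi2, dof, expected = stats.chi2_contingency(table)
odds_ratio, p_fisher = stats.fisher_exact(table)

print(chi2, p_chi2)          # chi-square test (larger tables / samples)
print(odds_ratio, p_fisher)  # Fisher's exact test (small counts, 2x2)
```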
Multi-Arm Bandit Algorithm:
Power and Sample Size
############################################### Chapter 4 Regression and Prediction
Simple Linear Regression; Multiple Linear Regression; Prediction Using Regression; Factor Variables in Regression; Interpreting the Regression Equation; Regression Diagnostics; Polynomial and Spline Regression
############################################### Chapter 5 Classification
Naïve Bayes; Discriminant Analysis; Logistic Regression; Performance of Classification Models; Strategies for Imbalanced Data
############################################### Chapter 6 Statistical Machine Learning
K-Nearest Neighbors; Tree Models; Bagging and the Random Forest; Boosting
############################################### Chapter 7 Unsupervised Learning
Principal Component Analysis; K-Means Clustering; Hierarchical Clustering; Model-Based Clustering; Scaling and Categorical Variables