15 02 Statistical Thinking in Python (Part 2) - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

01 Parameter estimation by optimization

Optimal parameters

Parameter values that bring the model in closest agreement with the data.
Overlay the theoretical CDF with the ECDF from the data.
- This helps you to verify that the Exponential distribution describes the observed data.
Packages to do statistical inference
- scipy.stats
- statsmodels
- hacker stats with numpy
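
The overlay of a theoretical CDF on the ECDF described above can be sketched numerically as follows. This is a minimal illustration, not the course's own code: the `waiting_times` array is synthetic, `ecdf` is a helper written here under the usual definition of an empirical CDF, and the Exponential model is assumed because it is the distribution mentioned above. For the Exponential distribution, the optimal parameter is the sample mean.

```python
import numpy as np


def ecdf(data):
    """Return x, y values of the empirical CDF of a 1-D array."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y


# Hypothetical data: 500 waiting times drawn from an Exponential distribution
np.random.seed(42)
waiting_times = np.random.exponential(scale=3.0, size=500)

# For the Exponential distribution, the optimal parameter is the sample mean
tau = np.mean(waiting_times)

# ECDF of the data and theoretical CDF evaluated at the same x values
x, y = ecdf(waiting_times)
y_theor = 1 - np.exp(-x / tau)

# Largest vertical gap between the two curves; small values indicate agreement
max_gap = np.max(np.abs(y - y_theor))
print('estimated tau =', tau)
print('largest ECDF/CDF gap =', max_gap)
```

Plotting `x` against `y` as points and `x` against `y_theor` as a line gives the visual overlay; the gap printed here is only a rough numerical stand-in for that visual check.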

Linear regression by least squares

The process of finding the parameters for which the sum of the squares of the residuals is minimal.
np.polyfit(x, y, 1)
- slope, intercept = np.polyfit(x, y, 1)

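import numpy as np
import matplotlib.pyplot as plt

# illiteracy and fertility are assumed to be NumPy arrays loaded beforehand

# Plot the illiteracy rate versus fertility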
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')

# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility, 1)

# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')

# Make theoretical line to plot
x = np.array([0,100])
y = a * x + b

# Add regression line to your plot
_ = plt.plot(x, y)

# Draw the plot
plt.show()
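
One way to see what "least squares" means is to check that the slope returned by `np.polyfit` really does give the smallest sum of squared residuals. The sketch below assumes synthetic `x` and `y` arrays standing in for the illiteracy and fertility data, and `rss` is a hypothetical helper introduced only for this check.

```python
import numpy as np

# Hypothetical data standing in for illiteracy (x) and fertility (y)
np.random.seed(42)
x = np.random.uniform(0, 100, size=100)
y = 0.05 * x + 2.0 + np.random.normal(0, 0.5, size=100)

# Least-squares fit: slope a and intercept b
a, b = np.polyfit(x, y, 1)


def rss(slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    return np.sum((y - slope * x - intercept) ** 2)


# Scan slopes around the fitted value while holding the intercept fixed
a_vals = np.linspace(a - 0.05, a + 0.05, 201)
rss_vals = np.array([rss(a_val, b) for a_val in a_vals])

# The slope with the smallest RSS on the grid should match np.polyfit's slope
best_a = a_vals[np.argmin(rss_vals)]
print('polyfit slope =', a)
print('grid slope with minimal RSS =', best_a)
```

Plotting `rss_vals` against `a_vals` would show a parabola whose minimum sits at the fitted slope.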

The importance of EDA: Anscombe's quartet

02 Bootstrap confidence intervals

Generating bootstrap replicates

Bootstrapping: The use of resampled data to perform statistical inference.
Bootstrap sample: A resampled array of the data.
- Sampling with replacement.
Bootstrap replicates: A statistic computed from a resampled array (e.g. mean)
Resampling engine: np.random.choice
- np.random.choice([1,2,3,4,5], size=5)
- e.g: array([5, 3, 5, 5, 2])
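
The definitions above can be made concrete with a single bootstrap sample and its replicate. The `data` values below are made up for illustration.

```python
import numpy as np

np.random.seed(42)

# Hypothetical measurements
data = np.array([2.1, 3.4, 1.9, 5.0, 4.2])

# Bootstrap sample: draw len(data) values with replacement
bs_sample = np.random.choice(data, size=len(data))

# Bootstrap replicate: a summary statistic of the bootstrap sample
bs_replicate = np.mean(bs_sample)

print('bootstrap sample:', bs_sample)
print('bootstrap replicate (mean):', bs_replicate)
```

Because sampling is with replacement, some values may appear more than once in `bs_sample` and others not at all.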

Bootstrap confidence interval

In fact, it can be shown theoretically that, under not-too-restrictive conditions, the sample mean is approximately Normally distributed when the sample is large (the central limit theorem). This does not hold in general, only for the mean and a few other statistics.

Step 1: Generate (many) bootstrap replicates

def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""

    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_sample = np.random.choice(data, len(data))
        bs_replicates[i] = func(bs_sample)

    return bs_replicates

Step 2: Plot a histogram of bootstrap replicates
Step 3: Confidence interval np.percentile(bs_replicates, [2.5, 97.5])
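
The steps above can be put together as a short sketch. The `rainfall` array is synthetic and only stands in for real measurements; `draw_bs_reps` is repeated in compact form so the example runs on its own.

```python
import numpy as np


def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates of func applied to resampled data."""
    bs_replicates = np.empty(size)
    for i in range(size):
        bs_replicates[i] = func(np.random.choice(data, size=len(data)))
    return bs_replicates


# Hypothetical data: 200 annual rainfall totals in mm
np.random.seed(42)
rainfall = np.random.normal(loc=800, scale=100, size=200)

# Step 1: generate many bootstrap replicates of the mean
bs_replicates = draw_bs_reps(rainfall, np.mean, size=10000)

# Step 3: 95% confidence interval from the 2.5th and 97.5th percentiles
conf_int = np.percentile(bs_replicates, [2.5, 97.5])

# The spread of the replicates estimates the standard error of the mean
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print('95% confidence interval =', conf_int)
print('SEM =', sem, ' std of replicates =', np.std(bs_replicates))
```

Step 2 (the histogram) is omitted here; passing `bs_replicates` to `plt.hist` would show the roughly Normal shape of the replicates.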

Pairs bootstrap

Non parametric inference

Make no assumptions about the model or probability distribution underlying the data.

Pairs bootstrap for linear regression

Resample data in pairs
Compute slope and intercept from resampled data
Each slope and intercept is a bootstrap replicate
Compute confidence intervals from percentiles

A function to do pairs bootstrap

def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""

    # Set up array of indices to sample from: inds
    inds = np.arange(len(x))

    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)

    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)

    return bs_slope_reps, bs_intercept_reps
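
A possible way to use this function is to turn the replicates into confidence intervals for the slope and intercept. The `x` and `y` arrays below are synthetic, and the function is repeated in compact form so the sketch is self-contained.

```python
import numpy as np


def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    inds = np.arange(len(x))
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)
    for i in range(size):
        # Resample indices so that each (x, y) pair stays together
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(
            x[bs_inds], y[bs_inds], 1)
    return bs_slope_reps, bs_intercept_reps


# Hypothetical data with a known linear trend plus noise
np.random.seed(42)
x = np.random.uniform(0, 100, size=100)
y = 0.05 * x + 2.0 + np.random.normal(0, 0.5, size=100)

# Fit on the original data, then bootstrap the fit
a, b = np.polyfit(x, y, 1)
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(x, y, size=1000)

# 95% confidence intervals from percentiles of the replicates
slope_ci = np.percentile(bs_slope_reps, [2.5, 97.5])
intercept_ci = np.percentile(bs_intercept_reps, [2.5, 97.5])
print('slope =', a, 'with 95% CI', slope_ci)
print('intercept =', b, 'with 95% CI', intercept_ci)
```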

03 Introduction to hypothesis testing

Formulating and simulating a hypothesis

Hypothesis testing

Assessment of how reasonable the observed data are assuming a hypothesis is true.

Null hypothesis

Another name for the hypothesis you are testing.

Permutation

Random reordering of entries in an array.
Permutation sampling is a great way to simulate the hypothesis that two variables have identical probability distributions.
A function to generate a permutation sample from two data sets.

def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""

    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))

    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2
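
A quick check of what this function does: the two permutation samples keep the original sizes, and together they contain exactly the original values, only reassigned between the two groups. The arrays `rain_a` and `rain_b` are made-up values for illustration.

```python
import numpy as np


def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    data = np.concatenate((data1, data2))
    permuted_data = np.random.permutation(data)
    return permuted_data[:len(data1)], permuted_data[len(data1):]


np.random.seed(42)

# Hypothetical rainfall measurements for two months
rain_a = np.array([10.0, 12.5, 8.3, 15.1])
rain_b = np.array([30.2, 28.7, 33.0])

perm_a, perm_b = permutation_sample(rain_a, rain_b)
print('permutation sample 1:', perm_a)
print('permutation sample 2:', perm_b)
```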

Visualizing permutation sampling

for _ in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(rain_june, rain_november)


    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)

    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
                 color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
                 color='blue', alpha=0.02)

# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_june)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')

# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()

Conclusion: June and November rainfall are apparently not identically distributed.

Test statistics and p-values

Test statistic

A single number that can be computed from observed data and from data you simulate under the null hypothesis.
It serves as a basis of comparison between the two.

p-value

The probability of obtaining a value of your test statistic that is at least as extreme as what was observed, under the assumption that the null hypothesis is true.
NOT the probability that the null hypothesis is true.

Statistical significance

Determined by the smallness of a p-value.

Statistical significance vs. practical significance

Remember: statistical significance (a low p-value) and practical significance (whether the difference between the data and the null hypothesis matters for practical purposes) are two different things.

Generate permutation replicates

def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""

    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)

    return perm_replicates
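
A sketch of how the permutation replicates lead to a p-value, using the difference of means as the test statistic. The two samples are synthetic, and `diff_of_means` is a helper defined here for illustration; the other two functions are repeated in compact form so the example runs on its own.

```python
import numpy as np


def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    permuted = np.random.permutation(np.concatenate((data1, data2)))
    return permuted[:len(data1)], permuted[len(data1):]


def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""
    perm_replicates = np.empty(size)
    for i in range(size):
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates


def diff_of_means(data_1, data_2):
    """Test statistic: difference in means of two arrays."""
    return np.mean(data_1) - np.mean(data_2)


# Hypothetical samples with clearly different means
np.random.seed(42)
sample_a = np.random.normal(1.0, 0.2, size=20)
sample_b = np.random.normal(0.4, 0.2, size=20)

# Observed test statistic and its permutation replicates under the null
empirical_diff = diff_of_means(sample_a, sample_b)
perm_replicates = draw_perm_reps(sample_a, sample_b, diff_of_means, size=10000)

# p-value: fraction of replicates at least as extreme as the observed value
p = np.sum(perm_replicates >= empirical_diff) / len(perm_replicates)
print('p-value =', p)
```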

Bootstrap hypothesis tests

Pipeline for hypothesis testing

Clearly state the null hypothesis
Define test statistic
Generate many sets of simulated data assuming the null hypothesis is true
Compute the test statistic for each simulated data set
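Compute the p-value: the fraction of simulated data sets for which the test statistic is at least as extreme as for the real data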

One sample test

Compare one set of data to a single number
- The mean of dataset A is equal to number B.

To set up the bootstrap hypothesis test, take the mean as the test statistic. The goal is to calculate the probability of getting a mean impact force less than or equal to what was observed for Frog B, if the hypothesis that the true mean of Frog B's impact forces equals that of Frog C is true. First, translate all of Frog B's data so that its mean is 0.55 N, the mean force of Frog C: subtract the mean force of Frog B from each measurement and add the mean force of Frog C. This leaves other properties of Frog B's distribution, such as the variance, unchanged.

# Make an array of translated impact forces: translated_force_b
translated_force_b = force_b - force_b.mean() + 0.55

# Take bootstrap replicates of Frog B's translated impact forces: bs_replicates
bs_replicates = draw_bs_reps(translated_force_b, np.mean, 10000)

# Compute fraction of replicates less than or equal to the observed Frog B mean force: p
p = np.sum(bs_replicates <= np.mean(force_b)) / len(bs_replicates)

# Print the p-value
print('p =', p)  # p = 0.0046

The low p-value suggests that the null hypothesis that Frog B and Frog C have the same mean impact force is false.

Two sample test

Compare two sets of data
- The mean of data set A equals the mean of data set B (the two distributions need not be identical otherwise)
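
A two-sample bootstrap test of equal means can be sketched as below. This is one common approach rather than a quotation from the course: both arrays are shifted to share the mean of the combined data, which simulates the null hypothesis while keeping each array's own spread. The arrays `data_a` and `data_b` are synthetic.

```python
import numpy as np


def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates of func applied to resampled data."""
    bs_replicates = np.empty(size)
    for i in range(size):
        bs_replicates[i] = func(np.random.choice(data, size=len(data)))
    return bs_replicates


# Hypothetical samples with different means and different spreads
np.random.seed(42)
data_a = np.random.normal(1.0, 0.20, size=20)
data_b = np.random.normal(0.4, 0.10, size=20)

# Observed test statistic: difference of means
empirical_diff = np.mean(data_a) - np.mean(data_b)

# Shift both arrays so they share the mean of the combined data
combined_mean = np.mean(np.concatenate((data_a, data_b)))
a_shifted = data_a - np.mean(data_a) + combined_mean
b_shifted = data_b - np.mean(data_b) + combined_mean

# Bootstrap each shifted array separately, then take the difference of means
bs_reps_a = draw_bs_reps(a_shifted, np.mean, size=10000)
bs_reps_b = draw_bs_reps(b_shifted, np.mean, size=10000)
bs_replicates = bs_reps_a - bs_reps_b

# p-value: fraction of replicates at least as large as the observed difference
p = np.sum(bs_replicates >= empirical_diff) / len(bs_replicates)
print('p-value =', p)
```

Unlike a permutation test, this does not assume the two data sets have identical distributions, only identical means.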

04 Hypothesis test examples

A/B testing

Used by organizations to see if a strategy change gives a better result.
Null hypothesis: the test statistic is impervious to the change.
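
As an illustration, an A/B test on conversion rates can be run as a permutation test. The counts below are invented for the example, and the difference in conversion fraction is an assumed choice of test statistic.

```python
import numpy as np

np.random.seed(42)

# Hypothetical A/B data: True means the visitor converted
control = np.array([True] * 40 + [False] * 460)    # 8% conversion
treatment = np.array([True] * 70 + [False] * 430)  # 14% conversion

# Observed test statistic: difference in conversion fraction
observed_diff = np.mean(treatment) - np.mean(control)

# Under the null hypothesis the group labels do not matter, so permute them
combined = np.concatenate((control, treatment))
perm_replicates = np.empty(10000)
for i in range(10000):
    permuted = np.random.permutation(combined)
    perm_control = permuted[:len(control)]
    perm_treatment = permuted[len(control):]
    perm_replicates[i] = np.mean(perm_treatment) - np.mean(perm_control)

# p-value: fraction of replicates at least as large as the observed difference
p = np.sum(perm_replicates >= observed_diff) / len(perm_replicates)
print('observed difference =', observed_diff)
print('p-value =', p)
```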

Test of correlation

Posit null hypothesis: the two variables are completely uncorrelated
Simulate data assuming null hypothesis is true
Use Pearson correlation as test statistic
Compute p-value

To do so, permute the illiteracy values but leave the fertility values fixed. This simulates the hypothesis that they are totally independent of each other. For each permutation, compute the Pearson correlation coefficient, then count how many of your permutation replicates have a Pearson correlation coefficient at least as large as the observed one.

# Compute observed correlation: r_obs
r_obs = pearson_r(illiteracy, fertility)

# Initialize permutation replicates: perm_replicates
perm_replicates = np.empty(10000)

# Draw replicates
for i in range(10000):
    # Permute illiteracy measurements: illiteracy_permuted
    illiteracy_permuted = np.random.permutation(illiteracy)

    # Compute Pearson correlation
    perm_replicates[i] = pearson_r(illiteracy_permuted, fertility)

# Compute p-value: p
p = np.sum(perm_replicates >= r_obs)/len(perm_replicates)
print('p-val =', p)
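
The code above calls `pearson_r`, which is not defined in these notes. A minimal version, assuming it simply wraps `np.corrcoef`, might look like this; the test arrays are made up.

```python
import numpy as np


def pearson_r(x, y):
    """Compute the Pearson correlation coefficient between two arrays."""
    # np.corrcoef returns the 2x2 correlation matrix; entry [0, 1] is r
    corr_mat = np.corrcoef(x, y)
    return corr_mat[0, 1]


# Sanity check on perfectly correlated and anti-correlated data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
r_up = pearson_r(x, 2 * x + 1)
r_down = pearson_r(x, -3 * x + 4)
print('r for increasing line =', r_up)
print('r for decreasing line =', r_down)
```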