15 01 Statistical Thinking in Python (Part 1) - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

01 Graphical exploratory data analysis

Before diving into sophisticated statistical inference techniques, you should first explore your data by plotting them and computing simple summary statistics. This process, called exploratory data analysis, is a crucial first step in statistical analysis of data.

EDA tools

Histogram
Bee swarm plot
ECDF: empirical cumulative distribution function

import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)

    # x-data for the ECDF: x
    x = np.sort(data)

    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n

    return x, y

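import matplotlib.pyplot as plt

# Compute ECDF for versicolor data: x_vers, y_vers
# (versicolor_petal_length is assumed to be a 1-D array loaded elsewhere)
x_vers, y_vers = ecdf(versicolor_petal_length)

# Generate plot
_ = plt.plot(x_vers, y_vers, marker='.', linestyle='none')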

# Label the axes
_ = plt.xlabel('petal_length')
_ = plt.ylabel('ECDF')

# Display the plot
plt.show()

02 Quantitative exploratory data analysis

mean: np.mean()
- heavily influenced by outliers
median: np.median()
- immune to data that takes on extreme values
percentiles: np.percentile(df['col'], [25,50,75])
- box plots: sns.boxplot()
- an outlier is not necessarily erroneous
variance & standard deviation
- np.var(), np.std()
- the standard deviation is the square root of the variance; to take a square root, use np.sqrt()
covariance and Pearson correlation coefficient
- np.cov()
- np.corrcoef()
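
The summary statistics listed above can be tried out on a small array. The sketch below uses made-up measurements (the `data` and `other` arrays are illustrative assumptions, not course data) with one deliberate outlier, to show how the mean shifts while the median stays put.

```python
import numpy as np

# Hypothetical measurements; the last value is a deliberate outlier
data = np.array([4.7, 4.5, 4.9, 4.0, 4.6, 4.5, 4.7, 3.3, 4.6, 40.0])

# The mean is pulled toward the outlier, the median is not
mean = np.mean(data)
median = np.median(data)

# 25th, 50th and 75th percentiles (the 50th is the median)
q25, q50, q75 = np.percentile(data, [25, 50, 75])

# Variance and standard deviation (both use ddof=0 by default)
variance = np.var(data)
std = np.std(data)

# A second, perfectly linearly related variable (hypothetical)
other = 2.0 * data + 1.0

# np.cov returns a 2x2 covariance matrix and uses ddof=1 by default
cov_matrix = np.cov(data, other)

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Pearson correlation
pearson_r = np.corrcoef(data, other)[0, 1]

print(mean, median, q25, q75, std, pearson_r)
```

Note that np.var() and np.cov() use different default normalizations (n versus n - 1), so the diagonal of the covariance matrix only matches np.var(data, ddof=1).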

03 Thinking probabilistically-- Discrete variables

Given a set of data, you describe probabilistically what you might expect if those data were acquired again and again.

Random number generators and hacker statistics

Use simulated repeated measurements to compute probabilities.
- Determine how to simulate data
- Simulate many, many times (like 10,000)
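- Probability is approximately the fraction of trials with the outcome of interest.
np.random.random(): draws uniformly distributed numbers between 0 and 1; the size= keyword sets how many to draw
np.random.seed(): seeds the random number generator so that results are reproducible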
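
The three steps above can be sketched with a simple coin-flip simulation. The scenario (four flips, all heads) and the variable names are illustrative assumptions; the exact answer is 0.5**4 = 0.0625, so the simulated fraction should land close to that.

```python
import numpy as np

# Seed the generator so the simulation is reproducible
np.random.seed(42)

n_trials = 10000  # number of simulated repeats
n_flips = 4       # coin flips per trial

# Each row is one trial; a draw below 0.5 counts as heads
flips = np.random.random(size=(n_trials, n_flips)) < 0.5

# Count the trials in which all four flips came up heads
n_all_heads = np.sum(np.all(flips, axis=1))

# Probability is approximately the fraction of trials with that outcome
prob_all_heads = n_all_heads / n_trials
print(prob_all_heads)
```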

Binomial distribution

np.random.binomial(n=, p=, size=): number of successes in n trials, each with success probability p
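
As a rough illustration of sampling from the Binomial distribution, the sketch below assumes a hypothetical bank that makes 100 loans, each defaulting independently with probability 0.05. The numbers are made up for the example.

```python
import numpy as np

np.random.seed(42)

# Each sample is the number of defaults among 100 hypothetical loans
n_defaults = np.random.binomial(n=100, p=0.05, size=10000)

# The sample mean should be close to n * p = 5
mean_defaults = np.mean(n_defaults)

# Estimated probability of 10 or more defaults out of 100 loans
prob_10_or_more = np.sum(n_defaults >= 10) / len(n_defaults)

print(mean_defaults, prob_10_or_more)
```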

Poisson process and Poisson distribution

Poisson Process

The timing of the next event is completely independent of when the previous event happened.
- e.g. Natural births in a given hospital

Poisson Distribution

np.random.poisson(mean, size=)
The number r of arrivals of a Poisson process in a given time interval with an average rate of λ arrivals per interval is Poisson distributed.
Limit of the Binomial distribution for low probability of success & large number of trials (rare events).
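
One way to see the Binomial-to-Poisson limit is to hold n * p fixed while n grows and p shrinks. In the sketch below the rate of 10 and the (n, p) pairs are arbitrary choices for illustration; the standard deviation of the Binomial samples should approach that of the Poisson samples, which is sqrt(10).

```python
import numpy as np

np.random.seed(42)

rate = 10  # assumed average number of arrivals per interval
samples_poisson = np.random.poisson(rate, size=10000)
print('Poisson:', np.mean(samples_poisson), np.std(samples_poisson))

# Binomial parameters chosen so that n * p stays equal to the rate
n_values = [20, 100, 1000]
p_values = [0.5, 0.1, 0.01]

binomial_stds = []
for n, p in zip(n_values, p_values):
    samples_binomial = np.random.binomial(n, p, size=10000)
    binomial_stds.append(np.std(samples_binomial))
    print(f'n = {n}:', np.mean(samples_binomial), np.std(samples_binomial))
```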

04 Thinking probabilistically-- Continuous variables

Probability density function (PDF)

Continuous analog to the PMF
Mathematical description of the relative likelihood of observing a value of a continuous variable.

Normal distribution

np.random.normal(mean, std, size)
The normal PDF

# Draw 100000 samples from a Normal distribution with mean 20 and std 1: samples_std1
samples_std1 = np.random.normal(20, 1, size=100000)

# Make a normalized histogram (density=True replaces the removed normed=True keyword)
_ = plt.hist(samples_std1, density=True, histtype='step', bins=100)

# Set the y-axis limits and show the plot
plt.ylim(-0.01, 0.42)
plt.show()

The normal CDF

# Generate CDFs
x_std1, y_std1 = ecdf(samples_std1)

# Plot CDFs
_ = plt.plot(x_std1, y_std1, marker='.', linestyle='none')

# Show the plot
plt.show()

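caveats:
- before assuming data are Normally distributed, compare the theoretical CDF with the ECDF of the data
- light tails: the Normal distribution gives a tiny probability of being more than 4 standard deviations from the mean, so it describes data with extreme values poorly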
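
Both caveats can be checked numerically. The sketch below is one possible approach, not the course's code: it uses the closed-form Normal CDF via math.erf rather than the sampling approach used elsewhere in these notes, redefines ecdf() so it runs on its own, and uses synthetic data as a stand-in for real measurements.

```python
import math
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    return np.sort(data), np.arange(1, n + 1) / n

np.random.seed(42)

# Synthetic stand-in for real measurements (assumption for this example)
data = np.random.normal(20, 1, size=1000)

# Estimate the Normal parameters from the data
mu, sigma = np.mean(data), np.std(data)

# ECDF of the data
x, y = ecdf(data)

# Theoretical Normal CDF evaluated at the same x values
y_theor = np.array(
    [0.5 * (1 + math.erf((xi - mu) / (sigma * math.sqrt(2)))) for xi in x]
)

# Largest vertical gap between the ECDF and the theoretical CDF
max_gap = np.max(np.abs(y - y_theor))

# Light tails: probability of lying more than 4 std from the mean (both tails)
tail_prob = math.erfc(4 / math.sqrt(2))

print(max_gap, tail_prob)
```

A small gap is consistent with the data being Normal; a large or systematic gap, especially in the tails, suggests the Normal model is a poor fit.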

The Exponential distribution

The waiting time between arrivals of a Poisson process is Exponentially distributed.

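# inter_times is assumed to be a 1-D array of observed waiting times between events
mean = np.mean(inter_times)
# Sample from an Exponential distribution with the observed mean
samples = np.random.exponential(mean, size=10000)
# ECDF of the observed data and of the theoretical samples
x, y = ecdf(inter_times)
x_theor, y_theor = ecdf(samples)
# Overlay the theoretical CDF (line) and the observed ECDF (points)
_ = plt.plot(x_theor, y_theor)
_ = plt.plot(x, y, marker='.', linestyle='none')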
_ = plt.xlabel('time (days)')
_ = plt.ylabel('CDF')
plt.show()
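
The link between the Exponential and Poisson distributions can also be checked in the other direction. The sketch below, with an assumed mean waiting time of 0.5 days, builds arrival times from Exponential waiting times and counts arrivals per day. If the story holds, the daily counts should have mean and variance both close to 1 / 0.5 = 2, as a Poisson distribution does.

```python
import numpy as np

np.random.seed(42)

mean_wait = 0.5     # hypothetical mean waiting time between events, in days
n_events = 100000   # number of simulated events

# Waiting times between consecutive events
waits = np.random.exponential(mean_wait, size=n_events)

# Arrival time of each event is the running sum of the waiting times
arrival_times = np.cumsum(waits)

# Count arrivals in each complete one-day interval
n_days = int(arrival_times[-1])
counts, _ = np.histogram(arrival_times, bins=np.arange(n_days + 1))

# For a Poisson distribution the mean and the variance are equal
count_mean = np.mean(counts)
count_var = np.var(counts)

print(count_mean, count_var)
```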