00 STA130 Course Wiki Textbook

00 Tools

This is the course wiki-textbook. The other primary tools and resources of the course are as follows.

  1. UofT JupyterHub (classic notebook, JupyterHub, or Google Colab are all fine) for .ipynb notebook files
  2. ChatGPT (or Copilot) "vanilla" ChatBots
  3. STA130 custom NBLM ChatBot
  4. Course GitHub
  5. Course Quercus Homepage
  6. Course Piazza Discussion Board

Week 01 Data Summarization

Simple exploratory data analysis (EDA)
and AI ChatBots are Very Good (at some things)

Tutorial/Homework: Topics

  1. importing libraries... like pandas
  2. loading data... with pd.read_csv()
  3. counting missing values... with df.isna().sum()
  4. observations (rows) and variables (columns)... df.shape and df.columns
  5. numeric versus non-numeric... df.describe() and df.value_counts()
  6. removing missing data... with df.dropna() and del df['col']
  7. grouping and aggregation... with df.groupby("col1")["col2"].describe()
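
A minimal sketch of the topics above, assuming a hypothetical file data.csv with columns "group", "value", and "notes" (substitute the actual dataset and column names you are working with):

```python
import pandas as pd

# hypothetical file and column names -- not a specific course dataset
df = pd.read_csv("data.csv")

print(df.shape)                    # (number of observations, number of variables)
print(df.columns)                  # variable (column) names
print(df.isna().sum())             # missing value counts per column
print(df.describe())               # summaries of numeric columns
print(df["group"].value_counts())  # counts for a non-numeric column

df_clean = df.dropna()             # remove rows with any missing values
del df["notes"]                    # remove an entire column

# grouped summaries: describe "value" within each level of "group"
print(df_clean.groupby("group")["value"].describe())
```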

Tutorial/Homework: Lecture Extensions

Topic numbers below correspond to extensions of topic items above.

2. function/method arguments (like encoding, dropna, inplace, and return vs side-effect)
3. boolean values and coercion
4. i. .dtypes and .astype()
   ii. statistic calculation functions
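
A short sketch of these extensions, continuing the hypothetical data.csv example above (the encoding value and column names are illustrative assumptions, not course specifics):

```python
import pandas as pd

# an explicit encoding argument is sometimes needed when reading a file
df = pd.read_csv("data.csv", encoding="UTF-8")  # hypothetical file name

print(df.dtypes)                              # the type of each column
df["group"] = df["group"].astype("category")  # coerce a column to another type

# "return" versus "side-effect": dropna() returns a new DataFrame by default,
# while inplace=True modifies the original DataFrame and returns None
df_clean = df.dropna()
df.dropna(inplace=True)

# booleans coerce to 0/1, so summing a boolean column counts the True values
print(df.isna().sum())

# statistic calculation functions
print(df["value"].mean(), df["value"].median(), df["value"].std())
```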

Lecture: New Topics

  1. sorting and (0-based) indexing
  2. subsetting via conditionals and boolean selection
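
A minimal sketch of sorting, 0-based indexing, and boolean selection, again using the hypothetical data.csv columns from the sketches above:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file from the sketches above

df_sorted = df.sort_values("value", ascending=False)  # sort rows by a column
print(df_sorted.iloc[0:5])    # 0-based positional indexing: rows 0 through 4

# subsetting via conditionals and boolean selection
print(df[df["value"] > 10])
print(df[(df["value"] > 10) & (df["group"] == "A")])
```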

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as how to handle missing values using more advanced techniques that don't just "ignore" or "remove" them (for example by filling or imputing the missing values and the assumptions required when doing so...)
  4. ...further "data wrangling topics" such as "joining" and "merging"; "pivoting", "wide to long", and "tidy" data formats; etc.

Week 02 Coding and Probability

Chance is intuitive, and AI ChatBots make coding and understanding code easier

Tutorial/Homework: Topics

  1. python object types... tuple, list, dict
  2. another key data type... np.array (and np.random.choice)
  3. for loops... for i in range(n):
    1. print()
    2. for x in some_list:
    3. for i,x in enumerate(some_list):
    4. for key,val in dictionary.items() and dictionary.keys() and dictionary.values()
  4. logical flow control... if, elif, else
    1. try-except blocks
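
A minimal sketch tying the pieces above together (all names and values are made up for illustration):

```python
import numpy as np

# core python object types
my_tuple = (1, 2, 3)
my_list = ["a", "b", "c"]
my_dict = {"a": 1, "b": 2}

# a numpy array, and random sampling from a list of options
arr = np.array([1, 2, 3, 4])
draw = np.random.choice(my_list, size=5, replace=True)
print(draw)

# for loops and logical flow control
for i in range(3):
    print(i)

for i, x in enumerate(my_list):
    print(i, x)

# my_dict.keys() and my_dict.values() iterate over just the keys or just the values
for key, val in my_dict.items():
    if val > 1:
        print(key, "is greater than 1")
    elif val == 1:
        print(key, "equals 1")
    else:
        print(key, "is less than 1")

# try-except blocks catch errors instead of stopping the program
try:
    result = 1 / 0
except ZeroDivisionError:
    print("cannot divide by zero")
```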

Tutorial/Homework: Lecture Extensions

  1. more object types... type()
    1. more indexing for "lists"
    2. more np.array with .dtype
    3. more "list" behavior with str and .split()
      1. text manipulation with .apply(lambda x: ...), .replace(), and re
    4. operator overloading
  2. What are pandas DataFrame objects?
  3. for word in sentence.split():

Lecture: New Topics

  1. from scipy import stats, stats.multinomial, and probability (and np.random.choice)
    1. conditional probability Pr(A|B) and independence Pr(A|B)=Pr(A)
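
A small sketch of these ideas (the seed and the two-coin-flip example are illustrative assumptions, not course specifics):

```python
import numpy as np
from scipy import stats

# one multinomial draw: 10 rolls of a fair six-sided die
rolls = stats.multinomial(n=10, p=[1/6] * 6).rvs()
print(rolls)

# simulation-based estimate of a conditional probability Pr(A|B)
# for two fair coin flips: A = "first flip is heads", B = "at least one heads"
np.random.seed(130)  # hypothetical seed for reproducibility
flips = np.random.choice(["H", "T"], size=(10000, 2), replace=True)
B = (flips == "H").any(axis=1)
A = flips[:, 0] == "H"
print((A & B).sum() / B.sum())  # approximately Pr(A|B) = 2/3, not Pr(A) = 1/2
```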

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as modular code design (with def based functions or classes)
  4. ...such as dictionary iteration (which has been removed from the above material)
  5. ...such as text manipulation with .apply(lambda x: ...), .replace(), re (which are introduced but are generally out of scope for STA130)

Week 03 Data Visualization

Populations and Sampling and more interesting EDA
by making figures with AI ChatBots

Tutorial/Homework: Topics

  1. More Precise Data Types (As Opposed to Object Types): continuous, discrete, nominal and ordinal categorical, and binary
  2. Bar Plots and Modes
  3. Histograms
  4. Box Plots, Range, IQR, and Outliers
  5. Skew and Multimodality
    1. Mean versus Median
    2. Normality and Standard Deviations
    3. Characteristics of a Normal Distribution

Tutorial/Homework: Lecture Extensions

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

Topic numbers below correspond to extensions of topic items above.

2. Plotting: Plotly, Seaborn, Matplotlib, Pandas, and other visualization tools.
    i. Legends, annotations, figure panels, etc.
3. Kernel Density Estimation using Violin Plots
5. Log Transformations
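
One possible sketch using Plotly Express (the other libraries named above work similarly); px.data.tips() is just a convenient built-in example dataset, not a course dataset:

```python
import plotly.express as px

df = px.data.tips()  # built-in example data with numeric and categorical columns

fig = px.histogram(df, x="total_bill", title="Histogram of total bill")
fig.show()

fig = px.box(df, x="day", y="total_bill", title="Box plots of total bill by day")
fig.show()

# a violin plot overlays a kernel density estimate on the raw data
fig = px.violin(df, y="total_bill", box=True, points="all")
fig.show()
```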

Lecture: New Topics

This section introduces new concepts that are not covered in the tutorial/homework topics.

  1. Populations (via from scipy import stats)
    1. stats.multinomial and np.random.choice()
    2. stats.norm, stats.gamma, and stats.poisson
  2. Samples versus populations (distributions)
  3. Statistical Inference
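
A minimal sketch of populations (distributions) versus samples, with arbitrarily chosen parameter values:

```python
import numpy as np
from scipy import stats

# population models (distributions) with hypothetical parameter choices
normal_pop = stats.norm(loc=0, scale=1)
gamma_pop = stats.gamma(a=2, scale=1)
poisson_pop = stats.poisson(mu=3)

# samples drawn from those populations
normal_sample = normal_pop.rvs(size=100)
gamma_sample = gamma_pop.rvs(size=100)
poisson_sample = poisson_pop.rvs(size=100)

# sample statistics estimate (but do not equal) the population parameters
print(normal_sample.mean(), normal_sample.std())
```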

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above
    1. Expectation, moments, integration, heavy tailed distributions
    2. Kernel functions for kernel density estimation
  3. bokeh, shiny, d3, etc...

Week 04 Bootstrapping

Confidence Intervals and Statistical Inference
(as opposed to just Estimation) using Sampling Distributions

Tutorial/Homework: Topics

  1. Simulation (with for loops and from scipy import stats)
  2. Sampling Distribution of the Sample Mean
  3. Standard Deviation versus Standard Error
  4. How n Drives Standard Error
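
A minimal simulation sketch of the sampling distribution of the sample mean (the population, sample size, and seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

np.random.seed(130)  # hypothetical seed
n = 50               # hypothetical sample size
population = stats.norm(loc=10, scale=2)

# simulate the sampling distribution of the sample mean
sample_means = np.zeros(1000)
for i in range(1000):
    sample = population.rvs(size=n)
    sample_means[i] = sample.mean()

# the standard deviation of the sample means is the standard error,
# which shrinks as n grows (roughly population sd / sqrt(n))
print(sample_means.std())
print(2 / np.sqrt(n))
```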

Tutorial/Homework: Lecture Extensions

  1. Independent Sampling with functions like df.sample(n=n, replace=False) or df.sample(frac=1, replace=False)
    1. Are Sampling Distributions Skewed?
    2. Bootstrapping
    3. Not Bootstrapping

Lecture: New Topics

  1. Confidence Intervals
  2. Bootstrapped Confidence Intervals
  3. "Double" for loops
    1. Proving Bootstrapped Confidence Intervals using Simulation
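
A minimal bootstrapping sketch (the simulated "observed" data, seed, and number of resamples are illustrative assumptions):

```python
import numpy as np
import pandas as pd

np.random.seed(130)  # hypothetical seed
# hypothetical observed sample stored in a DataFrame column named "value"
df = pd.DataFrame({"value": np.random.gamma(2, 2, size=100)})

# bootstrap: resample the observed data with replacement many times
bootstrap_means = np.zeros(1000)
for i in range(1000):
    bootstrap_sample = df["value"].sample(n=len(df), replace=True)
    bootstrap_means[i] = bootstrap_sample.mean()

# a 95% bootstrapped confidence interval for the population mean
print(np.quantile(bootstrap_means, [0.025, 0.975]))
```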

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as the Central Limit Theorem (CLT), Law of Large Numbers (LLN), and theoretical "x-bar plus/minus about 2 standard errors" confidence intervals (based on the so-called "pivot" form)
  4. ... the alternative sampling function np.random.choice(list_of_options, p, replace=True) which will be introduced for different purposes later

Week 05 Hypothesis Testing

P-values And How To Use And Not Use Them

Tutorial/Homework: Topics

  1. Null and Alternative Hypotheses
  2. The Sampling Distribution of the Null Hypothesis
    1. The role of Sample Size n (re: How n Drives Standard Error)
    2. "One sample" paired difference hypothesis tests with a "no effect" null
  3. p-values
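
A minimal sketch of a simulation-based "one sample" paired difference test with a "no effect" null (the sample size and observed proportion are made-up numbers):

```python
import numpy as np

np.random.seed(130)  # hypothetical seed
n = 30                      # hypothetical number of paired differences
observed_proportion = 0.7   # hypothetical observed proportion of positive differences

# sampling distribution of the proportion under the "no effect" null (50/50 signs)
simulated_proportions = np.zeros(10000)
for i in range(10000):
    signs = np.random.choice([1, -1], size=n, replace=True)
    simulated_proportions[i] = (signs == 1).mean()

# two-sided p-value: how often a simulated statistic is at least as extreme,
# relative to the null value of 0.5, as what was observed
p_value = (abs(simulated_proportions - 0.5) >= abs(observed_proportion - 0.5)).mean()
print(p_value)
```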

Tutorial/Homework: Lecture Extensions

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

  1. Using p-values
    1. Using confidence intervals
    2. Misusing p-values
    3. One- versus two-sided hypothesis tests

Lecture: New Topics

  1. Type I and Type II Errors
  2. The Reproducibility Crisis

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. Simulation versus theoretical hypothesis testing frameworks, z-tests and t-tests, parametric versus nonparametric hypothesis testing frameworks, other tests such as Fisher's Exact Test, Chi-squared tests, or F-tests, etc...
  4. Well, the topics above are indeed out of scope for the STA130 final exam, but it looks like they're going to be DEFINITELY NOT out of scope for the course project*...

Week 7ate9 Simple Linear Regression

Normal Distributions gettin' jiggy wit it

LEC 1 New Topics

  1. Correlation Association (IS NOT Causation)
    1. DO NOT USE Correlation to Measure ANYTHING EXCEPT "Straight Line" Linear Association
    2. Correlation is just for Y = mx + b
  2. Simple Linear Regression is Just a Normal Distribution
    1. Terminology: predictor, outcome, intercept and slope coefficients, and error terms

TUT/HW Topics

  1. import statsmodels.formula.api as smf
  2. smf.ols
    1. "R-style" formulas I
    2. "quoting" non-standard columns
  3. smf.ols("y~x", data=df).fit() and .params $\hat \beta_k$ versus $\beta_k$
    1. .fittedvalues
    2. .rsquared "variation proportion explained"
    3. .resid residuals and assumption diagnostics
  4. smf.ols("y~x", data=df).fit().summary() and .tables[1] for Testing "On Average" Linear Association

LEC 2 New Topics / Extensions

  1. Two (2) unpaired sample group comparisons
  2. Two (2) unpaired sample permutation tests
  3. Two (2) unpaired sample bootstrapping
  4. Indicator variables and contrasts in linear regression
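
A minimal sketch of a two (2) unpaired sample permutation test on simulated data (the group sizes, effect size, and seed are illustrative assumptions):

```python
import numpy as np
import pandas as pd

np.random.seed(130)  # hypothetical seed
# hypothetical data: an outcome measured in two unpaired groups "A" and "B"
df = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50,
    "y": np.concatenate([np.random.normal(0, 1, 50), np.random.normal(0.5, 1, 50)])
})

observed_diff = df[df["group"] == "B"]["y"].mean() - df[df["group"] == "A"]["y"].mean()

# permutation test: shuffle the group labels to simulate the "no difference" null
simulated_diffs = np.zeros(1000)
for i in range(1000):
    shuffled = df["group"].sample(frac=1, replace=False).values
    simulated_diffs[i] = df["y"][shuffled == "B"].mean() - df["y"][shuffled == "A"].mean()

p_value = (abs(simulated_diffs) >= abs(observed_diff)).mean()
print(p_value)

# an indicator variable ("contrast") regression addresses the same group comparison:
# import statsmodels.formula.api as smf
# smf.ols("y ~ group", data=df).fit().summary().tables[1]
```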

Out of scope:

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as all the stuff around multi/bivariate normal distributions and their covariance matrices, ellipses and their math and visual weirdness outside of a 1:1 aspect ratio, and eigenvectors and eigenvalues and major axis lines, etc...
  4. ...such as the mathematical formulas for correlation, beyond just noting that they sort of look like formulas for variance...

Week 10 Multiple Linear Regression

Normal Distributions Now REGRESSION'S gettin' jiggy wit it

Tutorial/Homework: Topics

  1. Multiple Linear Regression
    1. Interactions
    2. Categoricals
  2. Model Fitting
    1. Evidence-based Model Building
    2. Performance-based Model Building
    3. Complexity, Multicollinearity, and Generalizability
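
A minimal sketch of additive and interaction model specifications on simulated data (all coefficient values, the seed, and the group labels are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(130)  # hypothetical simulated data
df = pd.DataFrame({
    "x1": np.random.uniform(0, 10, 200),
    "x2": np.random.uniform(0, 10, 200),
    "group": np.random.choice(["A", "B", "C"], size=200),
})
df["y"] = 1 + 0.5 * df["x1"] - 0.3 * df["x2"] + np.random.normal(0, 1, 200)

# additive multiple linear regression ("group" becomes indicator variables)
additive = smf.ols("y ~ x1 + x2 + group", data=df).fit()

# interaction between a continuous and a categorical predictor
interaction = smf.ols("y ~ x1 * group + x2", data=df).fit()

print(additive.summary().tables[1])             # evidence-based view: coefficient hypothesis tests
print(additive.rsquared, interaction.rsquared)  # performance-based comparison (in sample)
```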

Tutorial/Homework/Lecture Extensions

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

  1. Logistic Regression
    1. Categorical to Binary Cat2Bin Variables
  2. And Beyond

Lecture: New Topics

  1. I'm planning to just show you how I work on this kind of data with a pretty interesting example...

Out of scope:

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...the deep mathematical details of condition numbers, variance inflation factors, K-Folds Cross-Validation...
  4. ...the actual deep details of log odds, link functions, generalized linear models...

Week 11 Classification Decision Trees

Machine Learning

Tutorial/Homework: Topics

  1. Classification Decision Trees
    1. Classification versus Regression
  2. scikit-learn versus statsmodels
    1. Feature Importances
  3. Confusion Matrices
    1. Metrics
  4. In Sample versus Out of Sample
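
A minimal scikit-learn sketch of the topics above, using simulated data (the feature names, tree depth, and train/test split fraction are illustrative choices):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

np.random.seed(130)  # hypothetical simulated data: two features, one binary outcome
df = pd.DataFrame({"x1": np.random.normal(size=300), "x2": np.random.normal(size=300)})
df["outcome"] = (df["x1"] + df["x2"] + np.random.normal(0, 1, 300) > 0).astype(int)

# in sample versus out of sample: hold out a test set before fitting
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["outcome"], test_size=0.25, random_state=130)

clf = DecisionTreeClassifier(max_depth=3, random_state=130).fit(X_train, y_train)

print(clf.feature_importances_)                       # feature importances
print(confusion_matrix(y_test, clf.predict(X_test)))  # out-of-sample confusion matrix
print(accuracy_score(y_train, clf.predict(X_train)))  # in-sample accuracy
print(accuracy_score(y_test, clf.predict(X_test)))    # out-of-sample accuracy
```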

Tutorial/Homework Extensions/New Topics for Lecture

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

  1. Model Fitting: Decision Trees Construction
  2. Model Complexity and Machine Learning
  3. Prediction
    1. ROC curves
    2. Partial Dependency Plots

Out of scope:

  1. Additional classification metrics and additional considerations around confusion matrices beyond those discussed above [previously, "Material covered in future weeks"]
  2. Deeper details of Decision Tree and Random Forest construction (model fitting) processes [previously, "Anything not substantively addressed above"]
  3. ...the actual deep details of log odds, link functions, generalized linear models, and now multi-class classification since we can instead just use .predict() and we now know about predict_proba()...
  4. ...other Machine Learning models and the rest of scikit-learn, e.g., K-Folds Cross-Validation and model complexity regularization tuning with sklearn.model_selection.GridSearchCV, etc.