00 STA130 Course Wiki Textbook

00 Tools

This is the course wiki-textbook. The other primary tools and resources of the course are as follows.

  1. UofT JupyterHub (classic notebook, JupyterHub, or Google Colab are all fine) for .ipynb notebook files
  2. ChatGPT (or Copilot) "vanilla" ChatBots
  3. STA130 custom NBLM ChatBot
  4. Course GitHub
  5. Course Quercus Homepage
  6. Course Piazza Discussion Board

Week 01 Data Summarization

Simple exploratory data analysis (EDA)
and AI ChatBots are Very Good (at some things)

Tutorial/Homework: Topics

  1. importing libraries... like pandas
  2. loading data... with pd.read_csv()
  3. counting missing values... with df.isna().sum()
  4. observations (rows) and variables (columns)... df.shape and df.columns
  5. numeric versus non-numeric... df.describe() and df.value_counts()
  6. removing missing data... with df.dropna() and del df['col']
  7. grouping and aggregation... with df.groupby("col1")["col2"].describe()
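
A minimal sketch of the topics above, assuming a hypothetical file data.csv with columns "group", "value", and "notes" (substitute the actual dataset and column names you are working with):

```python
import pandas as pd

# hypothetical file and column names -- not a specific course dataset
df = pd.read_csv("data.csv")

print(df.shape)                    # (number of observations, number of variables)
print(df.columns)                  # variable (column) names
print(df.isna().sum())             # missing value counts per column
print(df.describe())               # summaries of numeric columns
print(df["group"].value_counts())  # counts for a non-numeric column

df_clean = df.dropna()             # remove rows with any missing values
del df["notes"]                    # remove an entire column

# grouped summaries: describe "value" within each level of "group"
print(df_clean.groupby("group")["value"].describe())
```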

Tutorial/Homework: Lecture Extensions

Topic numbers below correspond to extensions of topic items above.

2. function/method arguments (like encoding, dropna, inplace, and return vs side-effect)
3. boolean values and coercion
4. i. .dtypes and .astype()
   ii. statistic calculation functions
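
A short sketch of these extensions, continuing the hypothetical data.csv example above (the encoding value and column names are illustrative assumptions, not course specifics):

```python
import pandas as pd

# an explicit encoding argument is sometimes needed when reading a file
df = pd.read_csv("data.csv", encoding="UTF-8")  # hypothetical file name

print(df.dtypes)                              # the type of each column
df["group"] = df["group"].astype("category")  # coerce a column to another type

# "return" versus "side-effect": dropna() returns a new DataFrame by default,
# while inplace=True modifies the original DataFrame and returns None
df_clean = df.dropna()
df.dropna(inplace=True)

# booleans coerce to 0/1, so summing a boolean column counts the True values
print(df.isna().sum())

# statistic calculation functions
print(df["value"].mean(), df["value"].median(), df["value"].std())
```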

Lecture: New Topics

  1. sorting and (0-based) indexing
  2. subsetting via conditionals and boolean selection
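
A minimal sketch of sorting, 0-based indexing, and boolean selection, again using the hypothetical data.csv columns from the sketches above:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file from the sketches above

df_sorted = df.sort_values("value", ascending=False)  # sort rows by a column
print(df_sorted.iloc[0:5])    # 0-based positional indexing: rows 0 through 4

# subsetting via conditionals and boolean selection
print(df[df["value"] > 10])
print(df[(df["value"] > 10) & (df["group"] == "A")])
```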

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as how to handle missing values using more advanced techniques that don't just "ignore" or "remove" them (for example by filling or imputing the missing values and the assumptions required when doing so...)
  4. ...further "data wrangling topics" such as "joining" and "merging"; "pivoting", "wide to long", and "tidy" data formats; etc.

Week 02 Coding and Probability

Chance is intuitive, and AI ChatBots make coding and understanding code easier

Tutorial/Homework: Topics

  1. python object types... tuple, list, dict
  2. another key data type... np.array (and np.random.choice)
  3. for loops... for i in range(n):
    1. print()
    2. for x in some_list:
    3. for i,x in enumerate(some_list):
    4. for key,val in dictionary.items() and dictionary.keys() and dictionary.values()
  4. logical flow control... if, elif, else
    1. try-except blocks
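
A minimal sketch tying the pieces above together (all names and values are made up for illustration):

```python
import numpy as np

# core python object types
my_tuple = (1, 2, 3)
my_list = ["a", "b", "c"]
my_dict = {"a": 1, "b": 2}

# a numpy array, and random sampling from a list of options
arr = np.array([1, 2, 3, 4])
draw = np.random.choice(my_list, size=5, replace=True)
print(draw)

# for loops and logical flow control
for i in range(3):
    print(i)

for i, x in enumerate(my_list):
    print(i, x)

# my_dict.keys() and my_dict.values() iterate over just the keys or just the values
for key, val in my_dict.items():
    if val > 1:
        print(key, "is greater than 1")
    elif val == 1:
        print(key, "equals 1")
    else:
        print(key, "is less than 1")

# try-except blocks catch errors instead of stopping the program
try:
    result = 1 / 0
except ZeroDivisionError:
    print("cannot divide by zero")
```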

Tutorial/Homework: Lecture Extensions

  1. more object types... type()
    1. more indexing for "lists"
    2. more np.array with .dtype
    3. more "list" behavior with str and .split()
      1. text manipulation with .apply(lambda x: ...), .replace(), and re
    4. operator overloading
  2. What are pandas DataFrame objects?
  3. for word in sentence.split():

Lecture: New Topics

  1. from scipy import stats, stats.multinomial, and probability (and np.random.choice)
    1. conditional probability Pr(A|B) and independence Pr(A|B)=Pr(A)
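
A small sketch of these ideas (the seed and the two-coin-flip example are illustrative assumptions, not course specifics):

```python
import numpy as np
from scipy import stats

# one multinomial draw: 10 rolls of a fair six-sided die
rolls = stats.multinomial(n=10, p=[1/6] * 6).rvs()
print(rolls)

# simulation-based estimate of a conditional probability Pr(A|B)
# for two fair coin flips: A = "first flip is heads", B = "at least one heads"
np.random.seed(130)  # hypothetical seed for reproducibility
flips = np.random.choice(["H", "T"], size=(10000, 2), replace=True)
B = (flips == "H").any(axis=1)
A = flips[:, 0] == "H"
print((A & B).sum() / B.sum())  # approximately Pr(A|B) = 2/3, not Pr(A) = 1/2
```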

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as modular code design (with def based functions or classes)
  4. ...such as dictionary iteration (which has been removed from the above material)
  5. ...such as text manipulation with .apply(lambda x: ...), .replace(), re (which are introduced but are generally out of scope for STA130)

Week 03 Data Visualization

Populations and Sampling and more interesting EDA
by making figures with AI ChatBots

Tutorial/Homework: Topics

  1. More Precise Data Types (As Opposed to Object Types): continuous, discrete, nominal and ordinal categorical, and binary
  2. Bar Plots and Modes
  3. Histograms
  4. Box Plots, Range, IQR, and Outliers
  5. Skew and Multimodality
    1. Mean versus Median
    2. Normality and Standard Deviations
    3. Characteristics of a Normal Distribution

Tutorial/Homework: Lecture Extensions

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

Topic numbers below correspond to extensions of topic items above.

2. Plotting: Plotly, Seaborn, Matplotlib, Pandas, and other visualization tools.
    i. Legends, annotations, figure panels, etc.
3. Kernel Density Estimation using Violin Plots
5. Log Transformations
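
One possible sketch using Plotly Express (the other libraries named above work similarly); px.data.tips() is just a convenient built-in example dataset, not a course dataset:

```python
import plotly.express as px

df = px.data.tips()  # built-in example data with numeric and categorical columns

fig = px.histogram(df, x="total_bill", title="Histogram of total bill")
fig.show()

fig = px.box(df, x="day", y="total_bill", title="Box plots of total bill by day")
fig.show()

# a violin plot overlays a kernel density estimate on the raw data
fig = px.violin(df, y="total_bill", box=True, points="all")
fig.show()
```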

Lecture: New Topics

This section introduces new concepts that are not covered in the tutorial/homework topics.

  1. Populations (via from scipy import stats)
    1. stats.multinomial and np.random.choice()
    2. stats.norm, stats.gamma, and stats.poisson
  2. Samples versus populations (distributions)
  3. Statistical Inference
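
A minimal sketch of populations (distributions) versus samples, with arbitrarily chosen parameter values:

```python
import numpy as np
from scipy import stats

# population models (distributions) with hypothetical parameter choices
normal_pop = stats.norm(loc=0, scale=1)
gamma_pop = stats.gamma(a=2, scale=1)
poisson_pop = stats.poisson(mu=3)

# samples drawn from those populations
normal_sample = normal_pop.rvs(size=100)
gamma_sample = gamma_pop.rvs(size=100)
poisson_sample = poisson_pop.rvs(size=100)

# sample statistics estimate (but do not equal) the population parameters
print(normal_sample.mean(), normal_sample.std())
```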

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above
    1. Expectation, moments, integration, heavy tailed distributions
    2. Kernel functions for kernel density estimation
  3. bokeh, shiny, d3, etc...

Week 04 Bootstrapping

Confidence Intervals and Statistical Inference
(as opposed to just Estimation) using Sampling Distributions

Tutorial/Homework: Topics

  1. Simulation (with for loops and from scipy import stats)
  2. Sampling Distribution of the Sample Mean
  3. Standard Deviation versus Standard Error
  4. How n Drives Standard Error
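
A minimal simulation sketch of the sampling distribution of the sample mean (the population, sample size, and seed are arbitrary choices for illustration):

```python
import numpy as np
from scipy import stats

np.random.seed(130)  # hypothetical seed
n = 50               # hypothetical sample size
population = stats.norm(loc=10, scale=2)

# simulate the sampling distribution of the sample mean
sample_means = np.zeros(1000)
for i in range(1000):
    sample = population.rvs(size=n)
    sample_means[i] = sample.mean()

# the standard deviation of the sample means is the standard error,
# which shrinks as n grows (roughly population sd / sqrt(n))
print(sample_means.std())
print(2 / np.sqrt(n))
```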

Tutorial/Homework: Lecture Extensions

  1. Independent Sampling with functions like df.sample(n=n, replace=False) or df.sample(frac=1, replace=False)
    1. Are Sampling Distributions Skewed?
    2. Bootstrapping
    3. Not Bootstrapping

Lecture: New Topics

  1. Confidence Intervals
  2. Bootstrapped Confidence Intervals
  3. "Double" for loops
    1. Proving Bootstrapped Confidence Intervals using Simulation
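
A minimal bootstrapping sketch (the simulated "observed" data, seed, and number of resamples are illustrative assumptions):

```python
import numpy as np
import pandas as pd

np.random.seed(130)  # hypothetical seed
# hypothetical observed sample stored in a DataFrame column named "value"
df = pd.DataFrame({"value": np.random.gamma(2, 2, size=100)})

# bootstrap: resample the observed data with replacement many times
bootstrap_means = np.zeros(1000)
for i in range(1000):
    bootstrap_sample = df["value"].sample(n=len(df), replace=True)
    bootstrap_means[i] = bootstrap_sample.mean()

# a 95% bootstrapped confidence interval for the population mean
print(np.quantile(bootstrap_means, [0.025, 0.975]))
```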

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as the Central Limit Theorem (CLT), Law of Large Numbers (LLN), and theoretical "x-bar plus/minus about 2 standard errors" confidence intervals (based on the so-called "pivot" form)
  4. ... the alternative sampling function np.random.choice(list_of_options, p, replace=True) which will be introduced for different purposes later

Week 05 Hypothesis Testing

P-values And How To Use And Not Use Them

Tutorial/Homework: Topics

  1. Null and Alternative Hypotheses
  2. The Sampling Distribution of the Null Hypothesis
    1. The role of Sample Size n (re: How n Drives Standard Error)
    2. "One sample" paired difference hypothesis tests with a "no effect" null
  3. p-values
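
A minimal sketch of a simulation-based "one sample" paired difference test with a "no effect" null (the sample size and observed proportion are made-up numbers):

```python
import numpy as np

np.random.seed(130)  # hypothetical seed
n = 30                      # hypothetical number of paired differences
observed_proportion = 0.7   # hypothetical observed proportion of positive differences

# sampling distribution of the proportion under the "no effect" null (50/50 signs)
simulated_proportions = np.zeros(10000)
for i in range(10000):
    signs = np.random.choice([1, -1], size=n, replace=True)
    simulated_proportions[i] = (signs == 1).mean()

# two-sided p-value: how often a simulated statistic is at least as extreme,
# relative to the null value of 0.5, as what was observed
p_value = (abs(simulated_proportions - 0.5) >= abs(observed_proportion - 0.5)).mean()
print(p_value)
```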

Tutorial/Homework: Lecture Extensions

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

  1. Using p-values
    1. Using confidence intervals
    2. Misusing p-values
    3. One- versus two-sided hypothesis tests

Lecture: New Topics

  1. Type I and Type II Errors
  2. The Reproducibility Crisis

Out of Scope

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. Simulation versus theoretical hypothesis testing frameworks, z-tests and t-tests, parametric versus nonparametric hypothesis testing frameworks, other tests such as Fisher's Exact Test, Chi-squared tests, or F-tests, etc...
  4. Well, the topics above are indeed out of scope for the STA130 final exam, but it looks like they're going to be DEFINITELY NOT out of scope for the course project*...

Week 7ate9 Simple Linear Regression

Normal Distributions gettin' jiggy wit it

LEC 1 New Topics

  1. Correlation Association (IS NOT Causation)
    1. DO NOT USE Correlation to Measure ANYTHING EXCEPT "Straight Line" Linear Association
    2. Correlation is just for Y = mx + b
  2. Simple Linear Regression is Just a Normal Distribution
    1. Terminology: predictor, outcome, intercept and slope coefficients, and error terms

TUT/HW Topics

  1. import statsmodels.formula.api as smf
  2. smf.ols
    1. "R-style" formulas I
    2. "quoting" non-standard columns
  3. smf.ols("y~x", data=df).fit() and .params $\hat \beta_k$ versus $\beta_k$
    1. .fittedvalues
    2. .rsquared "variation proportion explained"
    3. .resid residuals and assumption diagnostics
  4. smf.ols("y~x", data=df).fit().summary() and .tables[1] for Testing "On Average" Linear Association

LEC 2 New Topics / Extensions

  1. Two (2) unpaired sample group comparisons
  2. Two (2) unpaired sample permutation tests
  3. Two (2) unpaired sample bootstrapping
  4. Indicator variables and contrasts in linear regression
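
A minimal sketch of a two (2) unpaired sample permutation test on simulated data (the group sizes, effect size, and seed are illustrative assumptions):

```python
import numpy as np
import pandas as pd

np.random.seed(130)  # hypothetical seed
# hypothetical data: an outcome measured in two unpaired groups "A" and "B"
df = pd.DataFrame({
    "group": ["A"] * 50 + ["B"] * 50,
    "y": np.concatenate([np.random.normal(0, 1, 50), np.random.normal(0.5, 1, 50)])
})

observed_diff = df[df["group"] == "B"]["y"].mean() - df[df["group"] == "A"]["y"].mean()

# permutation test: shuffle the group labels to simulate the "no difference" null
simulated_diffs = np.zeros(1000)
for i in range(1000):
    shuffled = df["group"].sample(frac=1, replace=False).values
    simulated_diffs[i] = df["y"][shuffled == "B"].mean() - df["y"][shuffled == "A"].mean()

p_value = (abs(simulated_diffs) >= abs(observed_diff)).mean()
print(p_value)

# an indicator variable ("contrast") regression addresses the same group comparison:
# import statsmodels.formula.api as smf
# smf.ols("y ~ group", data=df).fit().summary().tables[1]
```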

Out of scope:

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...such as all the stuff around multi/bivariate normal distributions and their covariance matrices, ellipses and their math and visual weirdness outside of a 1:1 aspect ratio, and eigenvectors and eigenvalues and major axis lines, etc...
  4. ...such as the mathematical formulas for correlation, beyond just noting that they sort of look like formulas for variance...

Week 10 Multiple Linear Regression

Normal Distributions Now REGRESSION'S gettin' jiggy wit it

Tutorial/Homework: Topics

  1. Multiple Linear Regression
    1. Interactions
    2. Categoricals
  2. Model Fitting
    1. Evidence-based Model Building
    2. Performance-based Model Building
    3. Complexity, Multicollinearity, and Generalizability
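
A minimal sketch of additive and interaction model specifications on simulated data (all coefficient values, the seed, and the group labels are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(130)  # hypothetical simulated data
df = pd.DataFrame({
    "x1": np.random.uniform(0, 10, 200),
    "x2": np.random.uniform(0, 10, 200),
    "group": np.random.choice(["A", "B", "C"], size=200),
})
df["y"] = 1 + 0.5 * df["x1"] - 0.3 * df["x2"] + np.random.normal(0, 1, 200)

# additive multiple linear regression ("group" becomes indicator variables)
additive = smf.ols("y ~ x1 + x2 + group", data=df).fit()

# interaction between a continuous and a categorical predictor
interaction = smf.ols("y ~ x1 * group + x2", data=df).fit()

print(additive.summary().tables[1])             # evidence-based view: coefficient hypothesis tests
print(additive.rsquared, interaction.rsquared)  # performance-based comparison (in sample)
```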

Tutorial/Homework/Lecture Extensions

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

  1. Logistic Regression
    1. Categorical to Binary Cat2Bin Variables
  2. And Beyond

Lecture: New Topics

  1. I'm planning to just show you how I work on this kind of data with a pretty interesting example...

Out of scope:

  1. Material covered in future weeks
  2. Anything not substantively addressed above...
  3. ...the deep mathematical details of condition numbers, variance inflation factors, K-Folds Cross-Validation...
  4. ...the actual deep details of log odds, link functions, generalized linear models...

Week 11 Classification Decision Trees

Machine Learning

Tutorial/Homework: Topics

  1. Classification Decision Trees
    1. Classification versus Regression
  2. scikit-learn versus statsmodels
    1. Feature Importances
  3. Confusion Matrices
    1. Metrics
  4. In Sample versus Out of Sample
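
A minimal scikit-learn sketch of the topics above, using simulated data (the feature names, tree depth, and train/test split fraction are illustrative choices):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

np.random.seed(130)  # hypothetical simulated data: two features, one binary outcome
df = pd.DataFrame({"x1": np.random.normal(size=300), "x2": np.random.normal(size=300)})
df["outcome"] = (df["x1"] + df["x2"] + np.random.normal(0, 1, 300) > 0).astype(int)

# in sample versus out of sample: hold out a test set before fitting
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["outcome"], test_size=0.25, random_state=130)

clf = DecisionTreeClassifier(max_depth=3, random_state=130).fit(X_train, y_train)

print(clf.feature_importances_)                       # feature importances
print(confusion_matrix(y_test, clf.predict(X_test)))  # out-of-sample confusion matrix
print(accuracy_score(y_train, clf.predict(X_train)))  # in-sample accuracy
print(accuracy_score(y_test, clf.predict(X_test)))    # out-of-sample accuracy
```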

Tutorial/Homework Extensions/New Topics for Lecture

These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above

  1. Model Fitting: Decision Trees Construction
  2. Model Complexity and Machine Learning
  3. Prediction
    1. ROC curves
    2. Partial Dependency Plots

Out of scope:

  1. Additional classification metrics and additional considerations around confusion matrices beyond those discussed above [previously, "Material covered in future weeks"]
  2. Deeper details of Decision Tree and Random Forest construction (model fitting) processes [previously, "Anything not substantively addressed above"]
  3. ...the actual deep details of log odds, link functions, generalized linear models, and now multi-class classification since we can instead just use .predict() and we now know about predict_proba()...
  4. ...other Machine Learning models and the rest of scikit-learn, e.g., K-Folds Cross-Validation and model complexity regularization tuning with sklearn.model_selection.GridSearchCV, etc.