00 STA130 Course Wiki Textbook - pointOfive/stat130chat130 GitHub Wiki
00 Tools
This is the course wiki-textbook. The other primary tools and resources of the course are as follows.
- UofT Jupyterhub (classic notebook, jupyterhub, or google colab are all fine) .ipynb notebook files
- ChatGPT (or Copilot) "vanilla" ChatBots
- STA130 custom NBLM ChatBot
- Course GitHub
- Course Quercus Homepage
- Course Piazza Discussion Board
Week 01 Data Summarization
Simple exploratory data analysis (EDA)
and AI ChatBots are Very Good (at some things)
Tutorial/Homework: Topics
- importing libraries... like pandas
- loading data... with pd.read_csv()
- counting missing values... with df.isna().sum()
- observations (rows) and variables (columns)... df.shape and df.columns
- numeric versus non-numeric... df.describe() and df.value_counts()
- removing missing data... with df.dropna() and del df['col']
- grouping and aggregation.... with df.groupby("col1")["col2"].describe()
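The topics above can be strung together into one short EDA pass. The following is only a sketch: the CSV text and its column names (species, island, mass_g) are hypothetical stand-ins for a real data file.

```python
import io
import pandas as pd

# Hypothetical toy data standing in for a real CSV file
csv_text = """species,island,mass_g
A,X,3700
A,Y,
B,X,5000
B,Y,4800
,X,4100
"""
# pd.read_csv also accepts a file path or URL instead of a text buffer
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)     # (number of rows, number of columns)
print(df.columns)   # variable names

missing_counts = df.isna().sum()     # missing values per column
print(missing_counts)

print(df.describe())                 # summaries of numeric columns
print(df["species"].value_counts())  # counts for a non-numeric column

df_complete = df.dropna()            # keep only rows with no missing values

df_no_island = df.copy()
del df_no_island["island"]           # remove a whole column

# Summaries of mass_g within each species
summary = df_complete.groupby("species")["mass_g"].describe()
print(summary)
```

Note that `df.dropna()` returns a new DataFrame, so `df` itself still has all five rows here.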
Tutorial/Homework: Lecture Extensions
Topic numbers below correspond to extensions of topic items above.
2. function/method arguments (like encoding, dropna, inplace, and return vs side-effect)
3. boolean values and coercion
4. _ i. .dtypes and .astype()
___ ii. statistic calculation functions
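A small sketch of these extensions, using a hypothetical two-column DataFrame (the names score and passed are made up for illustration). It contrasts a method that returns a new object with one that changes the object as a side-effect, and shows boolean coercion and type conversion.

```python
import pandas as pd

# Hypothetical data: scores stored as text, with one missing value
df = pd.DataFrame({"score": ["1", "2", "3", None],
                   "passed": [True, False, True, True]})

# Return: dropna() gives back a new DataFrame and leaves df unchanged
df_clean = df.dropna()

# Side-effect: inplace=True modifies the object it is called on
df_inplace = df.copy()
df_inplace.dropna(inplace=True)

# The dropna argument of value_counts controls whether missing values are counted
counts_without_missing = df["score"].value_counts()
counts_with_missing = df["score"].value_counts(dropna=False)

# Boolean coercion: True counts as 1 and False as 0
n_passed = df["passed"].sum()      # number of True values
prop_passed = df["passed"].mean()  # proportion of True values

print(df.dtypes)                          # score is not numeric yet
scores = df_clean["score"].astype(int)    # convert text to integers
mean_score = scores.mean()                # statistics now work as expected
```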
Lecture: New Topics
Out of Scope
- Material covered in future weeks
- Anything not substantively addressed above...
- ...such as how to handle missing values using more advanced techniques that don't just "ignore" or "remove" them (for example by filling or imputing the missing values and the assumptions required when doing so...)
- ...further "data wrangling topics" such as "joining" and "merging"; "pivoting", "wide to long", and "tidy" data formats; etc.
Week 02 Coding and Probability
Chance is intuitive and use AI ChatBots to make coding and understanding code easier
Tutorial/Homework: Topics
- python object types... tuple, list, dict
- another key data type... np.array (and np.random.choice)
- for loops... for i in range(n):
- print()
- for x in some_list:
- for i,x in enumerate(some_list):
- for key,val in dictionary.items() and dictionary.keys() and dictionary.values()
- logical flow control... if, elif, else
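As a rough illustration of the object types, loops, and flow control listed above (all variable names here are made up for the example):

```python
import numpy as np

a_tuple = (1, 2, 3)                     # ordered, cannot be changed
a_list = ["a", "b", "c"]                # ordered, can be changed
a_dict = {"heads": 0.5, "tails": 0.5}   # maps keys to values
an_array = np.array([1, 2, 3])          # supports elementwise arithmetic
doubled = an_array * 2

for i in range(3):
    print(i)                            # prints 0, 1, 2

for x in a_list:
    print(x)

pairs = []
for i, x in enumerate(a_list):          # position and value together
    pairs.append((i, x))

for key, val in a_dict.items():
    print(key, val)

labels = []
for value in an_array:
    if value < 2:
        labels.append("low")
    elif value == 2:
        labels.append("mid")
    else:
        labels.append("high")

np.random.seed(130)                     # seed chosen arbitrarily
draw = np.random.choice(a_list)         # one random element of a_list
```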
Tutorial/Homework: Lecture Extensions
- more object types... type()
- more indexing for "lists"
- more np.array with .dtype
- more "list" behavior with str and .split()
- text manipulation with .apply(lambda x: ...), .replace(), and re
- operator overloading
- What are pandas DataFrame objects?
- for word in sentence.split():
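The extensions above can be sketched in a few lines. The example sentence and the small Series are hypothetical; the point is only how the same operator or method behaves on different object types.

```python
import numpy as np
import pandas as pd

sentence = "chance is intuitive"
words = sentence.split()            # str -> list of words
print(type(sentence), type(words))

first_word = words[0]               # indexing from the start
last_word = words[-1]               # negative indexing from the end
middle = words[1:2]                 # slicing returns a list

# Operator overloading: + depends on the types involved
list_sum = [1, 2] + [3]             # list concatenation
array_sum = np.array([1, 2]) + 3    # elementwise addition
str_sum = "ab" + "c"                # string concatenation

int_array = np.array([1, 2, 3])
float_array = np.array([1.0, 2, 3])
print(int_array.dtype, float_array.dtype)

lengths = []
for word in sentence.split():
    lengths.append(len(word))

# Text manipulation applied to each element of a pandas Series
s = pd.Series(["a-b", "c-d"])
s_clean = s.apply(lambda x: x.replace("-", "_"))
```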
Lecture: New Topics
- from scipy import stats, stats.multinomial, and probability (and np.random.choice)
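A minimal sketch of how these two tools relate, assuming a made-up set of three outcomes and probabilities: np.random.choice draws individual outcomes, while stats.multinomial describes the counts of each outcome across n draws.

```python
import numpy as np
from scipy import stats

np.random.seed(130)                     # seed chosen arbitrarily
outcomes = ["red", "green", "blue"]     # hypothetical outcomes
probs = [0.5, 0.3, 0.2]                 # must sum to 1

# Ten individual draws, each outcome chosen with its probability
draws = np.random.choice(outcomes, size=10, p=probs)

# One draw of the counts per outcome over n=10 trials
counts = stats.multinomial(n=10, p=probs).rvs(size=1, random_state=130)[0]

# Probability of exactly one head and one tail in two fair coin flips
pmf_one_each = stats.multinomial(n=2, p=[0.5, 0.5]).pmf([1, 1])

print(draws)
print(counts, counts.sum())
print(pmf_one_each)
```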
Out of Scope
- Material covered in future weeks
- Anything not substantively addressed above...
- ...such as modular code design (with def based functions or classes)
- ...such as dictionary iteration (which has been removed from the above material)
- ...such as text manipulation with .apply(lambda x: ...), .replace(), re (which are introduced but are generally out of scope for STA130)
Week 03 Data Visualization
Populations and Sampling and more interesting EDA
by making figures with AI ChatBots
Tutorial/Homework: Topics
- More Precise Data Types (As Opposed to Object Types): continuous, discrete, nominal and ordinal categorical, and binary
- Bar Plots and Modes
- Histograms
- Box Plots, Range, IQR, and Outliers
- Skew and Multimodality
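The numeric summaries behind these plots can be computed directly. This sketch uses made-up values and assumes the common box plot convention of flagging points more than 1.5 IQRs beyond the quartiles as outliers; other conventions exist.

```python
import pandas as pd

values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])   # hypothetical data

mode = values.mode()[0]                    # most frequent value
data_range = values.max() - values.min()

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1                              # spread of the middle 50%

# Assumed convention: 1.5 * IQR "fences"
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# A mean well above the median is one sign of right skew
mean_minus_median = values.mean() - values.median()

print(mode, data_range, iqr)
print(outliers.tolist())
print(mean_minus_median)
```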
Tutorial/Homework: Lecture Extensions
These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above
Topic numbers below correspond to extensions of topic items above.
2. Plotting: Plotly, Seaborn, Matplotlib, Pandas, and other visualization tools.
___ i. Legends, annotations, figure panels, etc.
3. Kernel Density Estimation using Violin Plots
5. Log Transformations
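To see why a log transformation helps with skew, consider a small made-up example in which each value is ten times the previous one. On the raw scale the mean sits far above the median; on the log scale the values are evenly spaced and symmetric.

```python
import numpy as np

values = np.array([1, 10, 100, 1000, 10000])   # hypothetical right-skewed data
log_values = np.log10(values)                  # evenly spaced after the transform

raw_gap = values.mean() - np.median(values)            # large: right skew
log_gap = log_values.mean() - np.median(log_values)    # about zero: symmetric

print(log_values)
print(raw_gap, log_gap)
```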
Lecture: New Topics
This section introduces new concepts that are not covered in the tutorial/homework topics.
- Populations from scipy import stats
  - stats.multinomial and np.random.choice()
  - stats.norm, stats.gamma, and stats.poisson
- Samples versus populations (distributions)
- Statistical Inference
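One way to picture the distinction is to treat a scipy.stats distribution object as the population and the values drawn from it as the sample. The parameter values below are arbitrary choices for illustration.

```python
from scipy import stats

# Populations: distributions with known (here, arbitrarily chosen) parameters
population = stats.norm(loc=10, scale=2)
gamma_population = stats.gamma(a=2, scale=1)
poisson_population = stats.poisson(mu=3)

# A sample: a finite set of values drawn from the population
sample = population.rvs(size=500, random_state=130)

population_mean = population.mean()   # a population parameter
sample_mean = sample.mean()           # a sample statistic estimating it

print(population_mean, sample_mean)
print(gamma_population.mean(), poisson_population.mean())
```

The sample mean will be close to, but not exactly equal to, the population mean; using the former to say something about the latter is the starting point of statistical inference.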
Out of Scope
- Material covered in future weeks
- Anything not substantively addressed above
- Expectation, moments, integration, heavy tailed distributions
- Kernel functions for kernel density estimation
- bokeh, shiny, d3, etc...
Week 04 Bootstrapping
Confidence Intervals and Statistical Inference
(as opposed to just Estimation) using Sampling Distributions
Tutorial/Homework: Topics
- Simulation (with for loops and from scipy import stats)
- Sampling Distribution of the Sample Mean
- Standard Deviation versus Standard Error
- How n Drives Standard Error
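The topics above can be sketched as a simulation. This is one possible setup, assuming a standard normal population and 2000 repetitions per sample size: the standard deviation of the simulated sample means is the standard error, and it shrinks as n grows.

```python
import numpy as np
from scipy import stats

np.random.seed(130)                        # seed chosen arbitrarily
population = stats.norm(loc=0, scale=1)    # population standard deviation is 1
reps = 2000                                # number of simulated samples per n

standard_errors = {}
for n in [25, 100]:
    sample_means = np.zeros(reps)
    for i in range(reps):
        # Draw one sample of size n and record its mean
        sample_means[i] = population.rvs(size=n).mean()
    # The spread of the sample means is the standard error of the mean
    standard_errors[n] = sample_means.std()

print(standard_errors)
```

The standard deviation describes variability of individual observations and stays near 1 regardless of n; the standard error describes variability of the sample mean and should land near 1/sqrt(n), roughly 0.2 for n=25 and 0.1 for n=100.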
Tutorial/Homework: Lecture Extensions
- Independent Sampling functions like df.sample([n=n/frac=1], replace=False)
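A sketch of df.sample with a hypothetical one-column DataFrame. The first two calls use replace=False as written above. The bootstrap loop is an assumption about how this week's confidence intervals are typically built: it resamples with replace=True and takes percentiles of the resulting means.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]})  # made-up data

# Without replacement: n rows, each original row used at most once
subsample = df.sample(n=3, replace=False, random_state=130)

# frac=1 without replacement returns every row once, in shuffled order
shuffled = df.sample(frac=1, replace=False, random_state=130)

# Assumed bootstrap procedure: resample all rows WITH replacement many times
reps = 1000
bootstrap_means = np.zeros(reps)
for i in range(reps):
    # random_state=i is only here to make the sketch reproducible
    bootstrap_means[i] = df.sample(frac=1, replace=True, random_state=i)["x"].mean()

# 95% bootstrap percentile interval for the mean
ci_low, ci_high = np.quantile(bootstrap_means, [0.025, 0.975])
print(ci_low, ci_high)
```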
Lecture: New Topics
Out of Scope
- Material covered in future weeks
- Anything not substantively addressed above...
- ...such as the Central Limit Theorem (CLT), Law of Large Numbers (LLN), and theoretical "x-bar plus/minus about 2 standard errors" confidence intervals (based on the so-called "pivot" form)
- ... the alternative sampling function np.random.choice(list_of_options, p, replace=True) which will be introduced for different purposes later
Week 05 Hypothesis Testing
P-values And How To Use And Not Use Them
Tutorial/Homework: Topics
Tutorial/Homework: Lecture Extensions
These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above
Lecture: New Topics
Out of Scope
- Material covered in future weeks
- Anything not substantively addressed above...
- Simulation versus theoretical hypothesis testing frameworks, z-tests and t-tests, parametric versus nonparametric hypothesis testing frameworks, other tests such as Fisher Exact or Chi-squared or F-tests, etc...
- Well, these above are indeed out of scope for the STA130 final exam but it looks like they're going to be DEFINITELY NOT out of scope for the course project*...
Week 7ate9 Simple Linear Regression
Normal Distributions gettin' jiggy wit it
LEC 1 New Topics
TUT/HW Topics
- import statsmodels.formula.api as smf
- smf.ols
- smf.ols("y~x", data=df).fit() and .params $\hat \beta_k$ versus $\beta_k$
- smf.ols("y~x", data=df).fit().summary() and .tables[1] for Testing "On Average" Linear Association
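A minimal sketch of this workflow on simulated data, where the true coefficients are chosen by us (intercept 1, slope 2) so that the estimates in .params can be compared against them. The column names x and y match the formula string above; everything else is an assumption for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(130)                                   # seed chosen arbitrarily
n = 200
x = np.random.uniform(0, 10, size=n)
# Simulated data: true beta_0 = 1 and beta_1 = 2, plus normal noise
y = 1.0 + 2.0 * x + np.random.normal(0, 1, size=n)
df = pd.DataFrame({"x": x, "y": y})

fitted = smf.ols("y~x", data=df).fit()

print(fitted.params)               # estimated coefficients (the "hat" versions)
print(fitted.summary().tables[1])  # coefficient table with p-values

# p-value for the null hypothesis of no "on average" linear association
slope_pvalue = fitted.pvalues["x"]
```

The fitted values in .params estimate the true coefficients but will not equal them exactly, which is the distinction between $\hat \beta_k$ and $\beta_k$.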
LEC 2 New Topics / Extensions
- Two(2) unpaired samples group comparisons
- Two(2) unpaired sample permutation tests
- Two(2) unpaired sample bootstrapping
- Indicator variables and contrasts linear regression
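A two-sample permutation test can be sketched with a loop that shuffles group labels. The two groups below are made-up numbers, and the two-sided p-value calculation is one common choice rather than the only one.

```python
import numpy as np

np.random.seed(130)                                       # seed chosen arbitrarily
group_a = np.array([5.1, 4.9, 6.2, 5.8, 6.0, 5.5])        # hypothetical data
group_b = np.array([6.8, 7.1, 6.5, 7.4, 6.9, 7.0])        # hypothetical data
observed_diff = group_b.mean() - group_a.mean()

# Under the null hypothesis of no group difference, labels are exchangeable
pooled = np.concatenate([group_a, group_b])
n_a = len(group_a)
reps = 5000
shuffled_diffs = np.zeros(reps)
for i in range(reps):
    shuffled = np.random.permutation(pooled)              # shuffle group labels
    shuffled_diffs[i] = shuffled[n_a:].mean() - shuffled[:n_a].mean()

# Proportion of shuffled differences at least as extreme as the observed one
p_value = (np.abs(shuffled_diffs) >= np.abs(observed_diff)).mean()
print(observed_diff, p_value)
```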
Out of scope:
- Material covered in future weeks
- Anything not substantively addressed above...
- ...such as all the stuff around multi/bivariate normal distribution and their covariance matrices, ellipses and their math and visual weirdness outside of a 1:1 aspect ratio, and eigenvectors and eigenvalues and major axis lines, etc...
- ...such as the mathematical formulas for correlation, but just noting that they sort of just look like formulas for variance...
Weeks 10 Multiple Linear Regression
Normal Distributions Now REGRESSION'S gettin' jiggy wit it
Tutorial/Homework: Topics
Tutorial/Homework/Lecture Extensions
These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above
Lecture: New Topics
- I'm planning to just show you how I work on this kind of data with a pretty interesting example...
Out of scope:
- Material covered in future weeks
- Anything not substantively addressed above...
- ...the deep mathematical details of condition numbers, variance inflation factors, K-Folds Cross-Validation...
- ...the actual deep details of log odds, link functions, generalized linear models...
Weekz 11 Classification Decision Trees
Machine Learning
Tutorial/Homework: Topics
- Classification Decision Trees
  - scikit-learn versus statsmodels
- Confusion Matrices
- In Sample versus Out of Sample
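The three topics above fit together in a short scikit-learn sketch. The data is simulated, the max_depth value is arbitrary, and using train_test_split for the in-sample versus out-of-sample division is an assumption about the workflow rather than something stated here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

np.random.seed(130)                          # seed chosen arbitrarily
n = 400
X = np.random.normal(size=(n, 2))            # two simulated predictors
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # simulated binary outcome

# Hold out 25% of the rows to evaluate out-of-sample performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=130)

clf = DecisionTreeClassifier(max_depth=3, random_state=130)
clf.fit(X_train, y_train)

in_sample_accuracy = clf.score(X_train, y_train)      # data the tree was fit on
out_of_sample_accuracy = clf.score(X_test, y_test)    # data the tree has not seen

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, clf.predict(X_test))
print(in_sample_accuracy, out_of_sample_accuracy)
print(cm)
```

Unlike the statsmodels formula interface, scikit-learn takes the predictors and outcome as separate array arguments to .fit().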
Tutorial/Homework Extensions/New Topics for Lecture
These are topics introduced in the lecture that build upon the tutorial/homework topics discussed above
Out of scope:
- Additional classification metrics and additional considerations around confusion matrices beyond those discussed above [previously, "Material covered in future weeks"]
- Deeper details of Decision Tree and Random Forest construction (model fitting) processes [previously, "Anything not substantively addressed above"]
- ...the actual deep details of log odds, link functions, generalized linear models, and now multi-class classification since we can instead just use .predict() and we now know about predict_proba()...
- ...other Machine Learning models and the rest of scikit-learn, e.g., K-Folds Cross-Validation and model complexity regularization tuning with sklearn.model_selection.GridSearchCV, etc.