
14.310x Data Analysis for Social Scientists

General

Module 1: Introduction to the Course

Lecture summary (from slides)
  • Instructors and Course Title: Esther Duflo and Sara Ellison are leading a data analysis course titled "14.310x Data Analysis for Social Scientists."
  • Data Visualization Examples: They demonstrate the beauty and insightfulness of data through examples like mapping Facebook networks of Somalis in Eastleigh and illustrating pollution levels in China using high-resolution grids.
  • Impact of Data Analysis: The course highlights the power of data analysis by showcasing its influence on policy changes, such as altering regulations in India based on experiments eliminating conflicts of interest in reporting structures.
  • Cautionary Notes on Data Interpretation: Students are warned that data can mislead, using examples such as spurious correlations involving autism, and are reminded to distinguish causation from correlation.
  • Learning Objectives: The course emphasizes understanding causality, modeling data-generating processes, and practical skills like using R, experimental design, and presenting results effectively through graphics, tables, and text.

Module 2: Fundamentals of Probability, Random Variables, Joint Distributions + Collecting Data

Fundamentals of Probability

Lecture summary (from slides)
  • Definitions of Sample Space and Event: The sample space $S$ is the collection of all possible outcomes of an experiment, while an event $A$ is any collection of outcomes. Events can be individual outcomes, the entire sample space, or the null set.
  • Probability Set Operations: Probability is defined on sets, using operations such as intersection, union, and complement; set-theoretic properties (associative, commutative, distributive) apply.
  • Axioms of Probability: Probability assigns a number $P(A)$ to each event $A$, ensuring $P(A) \geq 0$ for all $A$, $P(S) = 1$, and satisfying the summation rule for disjoint sets.
  • Special Cases and Simple Sample Space: For finite sample spaces, probability can be calculated by counting outcomes. A simple sample space is one where $P(A) = \frac{n(A)}{n(S)}$, allowing for easy probability calculation.
  • Combinatorics and Probability: Probability computations often involve counting methods like permutations and combinations. These methods aid in determining outcomes and probabilities in various scenarios (see the counting sketch below).
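As a quick illustration of the counting approach (not an example from the slides), the R sketch below computes a simple-sample-space probability with `choose()` and checks it by simulation:

```r
# Probability of exactly two aces in a 5-card hand from a standard 52-card deck,
# using the simple-sample-space rule P(A) = n(A) / n(S) and counting with choose().
p_exact <- choose(4, 2) * choose(48, 3) / choose(52, 5)

# Monte Carlo check: simulate many random hands and count how often the event occurs.
set.seed(1)
deck <- rep(c("ace", "other"), times = c(4, 48))
sims <- replicate(100000, sum(sample(deck, 5) == "ace") == 2)
c(analytical = p_exact, simulated = mean(sims))
```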

Random Variables, Distributions, and Joint Distributions

Lecture summary (from slides)
  • Lecture outline includes housekeeping notes, topics for the day, and upcoming sessions.
  • Introduction to random variables and their significance in analyzing numerical characteristics of sample spaces.
  • Explanation of discrete and continuous random variables, with examples such as pizza topping selection and basketball shot outcomes.
  • Discussion on probability functions (PF) for discrete random variables and probability density functions (PDF) for continuous random variables.
  • Introduction to cumulative distribution functions (CDF) and their relationship with PFs/PDFs, along with joint distributions for multiple random variables.
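A minimal R sketch of the PF/PDF/CDF distinction using built-in distribution functions; the specific distributions are chosen for illustration, not taken from the lecture:

```r
# Discrete case: probability function (PF) and CDF of a Binomial(10, 0.3) variable.
x <- 0:10
pf  <- dbinom(x, size = 10, prob = 0.3)   # P(X = x)
cdf <- pbinom(x, size = 10, prob = 0.3)   # P(X <= x)

# Continuous case: density (PDF) and CDF of a standard normal at a few points.
y <- c(-1, 0, 1)
dnorm(y)   # density values f(y)
pnorm(y)   # cumulative probabilities F(y)

# The CDF is the running sum (discrete) or integral (continuous) of the PF/PDF:
all.equal(cdf, cumsum(pf))
```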

Gathering and Collecting Data

Lecture summary (from slides)
  • Finding Data Sources: Various methods for accessing data are discussed, including existing data libraries, international household survey data, replication data from researchers, and scraping data from the internet.
  • Existing Data Libraries: Resources such as Data.gov, IPUMS, ICPSR, and Harvard-MIT Data Center are highlighted for accessing datasets.
  • International Household Survey Data: Sources like Demographic and Health Surveys, World Bank, LSMS, and RAND public-use databases are mentioned for international household survey data.
  • Replication Data from Researchers: Randomized control trials and data repositories like Harvard Dataverse are noted for accessing replication data.
  • Harvesting Data and Web Scraping: Techniques such as scraping data from the internet, using APIs, and examples like using Google Maps API for traffic analysis are discussed.
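A minimal scraping sketch, assuming the `rvest` package is installed; the URL and table target below are placeholders, not a data source used in the course:

```r
# Minimal web-scraping sketch using the rvest package.  The URL is a hypothetical
# placeholder; in practice you would point read_html() at a real page of interest.
library(rvest)

page   <- read_html("https://example.com/some-table-page")  # placeholder page
tables <- html_table(html_nodes(page, "table"))             # parse all HTML tables
str(tables[[1]])                                            # inspect the first one
```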

R Course and R Tutorial: ggplot

Module 3: Describing Data, Joint and Conditional Distributions of Random Variables

Summarizing and Describing Data

Lecture summary (from slides)
  • Lecture by Prof. Esther Duflo on Exploratory Data Analysis (EDA)
  • Techniques covered include: Histogram plotting, Kernel density estimation, Comparing distributions, and Plotting estimates of Cumulative Distribution Functions (CDFs)
  • Utilization of RStudio and packages like ggplot2, tidyverse, and dplyr emphasized for data manipulation and visualization
  • Explanation of histogram construction, density calculation, and bin width adjustment
  • Discussion on Kernel Density Estimation (KDE), bandwidth selection, and comparison between PDF and CDF for probability representation.
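A short ggplot2 sketch of the three tools mentioned above (histogram, kernel density estimate, empirical CDF) on simulated data rather than a course dataset:

```r
# Histogram with a chosen bin width, kernel density estimate, and empirical CDF.
library(ggplot2)
set.seed(42)
df <- data.frame(x = rexp(500, rate = 1))

ggplot(df, aes(x)) + geom_histogram(aes(y = after_stat(density)), binwidth = 0.25)
ggplot(df, aes(x)) + geom_density(bw = 0.2)   # KDE; bw is the bandwidth
ggplot(df, aes(x)) + stat_ecdf()              # empirical CDF
```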

Joint, Marginal, and Conditional Distributions

Lecture summary (from slides)
  • Lecture topic: Probability with examples
  • Introducing the joint probability density function (PDF) with an example involving $f_{XY}(x,y)$
  • Determining the constant $c$ for $f_{XY}(x,y) = cx^2y$ within given limits (a numerical sketch follows below)
  • Exploring marginal distributions for discrete and continuous random variables
  • Understanding conditional distributions and their relationship with independence
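The summary does not state the limits of the support, so the sketch below assumes the unit square purely for illustration, then finds $c$ and a marginal density by numerical integration:

```r
# Assume 0 <= x <= 1 and 0 <= y <= 1 (an illustrative choice, not necessarily the
# lecture's).  Then c must satisfy  c * (integral of x^2 * y over the unit square) = 1.
inner <- function(y) sapply(y, function(yy) integrate(function(x) x^2 * yy, 0, 1)$value)
total <- integrate(inner, 0, 1)$value   # integral of x^2 * y dx dy = (1/3)*(1/2) = 1/6
c_const <- 1 / total                    # so c = 6 under the assumed support

# Marginal of X: f_X(x) = integral over y of c * x^2 * y dy = c * x^2 / 2 on [0, 1].
f_X <- function(x) c_const * x^2 / 2
integrate(f_X, 0, 1)$value              # integrates to 1, as a marginal PDF should
```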

Module 4: Functions and Moments of Random Variables & Intro to Regressions

Functions of Random Variables

Lecture summary (from slides)
  • Methods for determining the distribution of a function of random variables depend on the nature of the original random variable(s) (discrete or continuous), whether it's a single variable or a vector, and the invertibility of the function.
  • To find the distribution of $Y = h(X)$, first calculate the cumulative distribution function (CDF) by integrating over the appropriate region, then differentiate it to obtain the probability density function (PDF) if $Y$ is continuous.
  • An example explores finding the PDF $f_Y$ when $X$ has a given PDF and $Y = X^2$, demonstrating the process of integrating over the appropriate region and solving for $f_Y$ (a simulation sketch under an assumed PDF follows this list).
  • Key examples include linear transformation of random variables, probability integral transformation, convolution, and order statistics.
  • Order statistics, such as the $n$th order statistic $Y_n$, can be crucial in various applications, especially when dealing with independent, identically distributed random variables, providing insights into the distribution of extreme values.
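A sketch of the CDF method for $Y = X^2$, assuming $X \sim U[0,1]$ for concreteness (the slides' example may use a different PDF):

```r
# With X ~ Uniform(0, 1) and Y = X^2:  F_Y(y) = P(X <= sqrt(y)) = sqrt(y) on (0, 1),
# so differentiating gives f_Y(y) = 1 / (2 * sqrt(y)).
set.seed(7)
y_sim <- runif(100000)^2

# Compare the empirical CDF of the simulated Y with the derived F_Y at a few points.
pts <- c(0.1, 0.25, 0.5, 0.9)
rbind(empirical = ecdf(y_sim)(pts), derived = sqrt(pts))
```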

Moments of a Distribution

Lecture summary (from slides)
  • Lecture on probability: an example involving the reverse probability integral transformation is discussed.
  • Given $X$ follows a uniform distribution $U[0,1]$, and $Y = -\log(X)/\lambda$, where $\lambda > 0$.
  • Calculated the cumulative distribution function $F_Y(y)$ for $y \geq 0$.
  • Derived the probability density function $f_Y(y)$ for $y > 0$, and found it to be the exponential distribution (a simulation check follows below).
  • Explored moments of a distribution, including mean, median, and mode, to summarize key features.
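A quick simulation check of the transformation described above; the value of $\lambda$ is chosen arbitrarily:

```r
# If X ~ U[0, 1] and Y = -log(X) / lambda, then Y should be Exponential(lambda).
set.seed(14)
lambda <- 2
y <- -log(runif(100000)) / lambda

# Compare simulated mean/variance with the exponential's 1/lambda and 1/lambda^2,
# and run a Kolmogorov-Smirnov test against the exponential CDF.
c(mean = mean(y), var = var(y))          # roughly 0.5 and 0.25 for lambda = 2
ks.test(y, "pexp", rate = lambda)
```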

Expectation, Variance, and an Introduction to Regression

Lecture summary (from slides)
  • Moments of a distribution: Discusses moments and expectations of distributions, particularly focusing on finding features of the distribution of $Y = g(X)$ rather than of $X$ itself.
  • St. Petersburg paradox: Introduces the paradox in which players should theoretically be willing to pay an infinite amount to play a game, yet in reality their willingness to pay is modest because the value they place on additional money diminishes. Explores calculations related to the paradox (see the sketch below).
  • Properties of expectation: Lists properties including linearity and handling non-independent variables.
  • Variance of a function: Describes variance as an expectation and explores variance of a function of a random variable.
  • Covariance and correlation: Introduces covariance and correlation to describe the relationship between random variables, detailing their properties and implications.
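A small sketch of the St. Petersburg calculation; the payoff rule of $2^k$ for a first head on toss $k$ is the standard version of the game and may differ in detail from the slides:

```r
# The first head on toss k occurs with probability (1/2)^k and pays 2^k, so every
# toss contributes (1/2)^k * 2^k = 1 to the expectation: the truncated expected
# payoff grows without bound as more tosses are allowed.
k <- 1:40
truncated_ev <- cumsum((1 / 2)^k * 2^k)   # 1, 2, 3, ..., 40
tail(truncated_ev, 3)

# A simulated typical payoff is nonetheless modest, which is the heart of the paradox.
set.seed(3)
payoff <- replicate(100000, 2^(which(rbinom(100, 1, 0.5) == 1)[1]))
mean(payoff)
median(payoff)
```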

Optional Unit: Auctions

Module 5: Special Distributions, the Sample Mean, the Central Limit Theorem

Special Distributions

Lecture summary (from slides)
  • Some distributions are special because they are connected to others in useful ways.
  • Special distributions can model a wide variety of random phenomena, often due to fundamental underlying principles or rich collections of probability density functions with a small number of parameters.
  • To be considered truly special, a distribution must be mathematically elegant and have interesting and diverse applications.
  • Many special distributions have standard members corresponding to specified parameter values.
  • Examples of special distributions include Bernoulli, Binomial, Uniform, Negative Binomial, Geometric, Normal, Log-normal, and Pareto distributions.
Special distribution descriptions (from slides)
  • Bernoulli: Represents two possible outcomes (success or failure) with a given probability for success $p$.
  • Binomial: Models the number of successes in a sequence of independent trials, each with the same probability of success $p$.
  • Uniform: Probability distribution where all outcomes are equally likely within a given range.
  • Negative Binomial: Describes the number of trials required to achieve a specified number of successes $r$ in a sequence of independent Bernoulli trials.
  • Geometric: Represents the number of failures before the first success in a sequence of independent Bernoulli trials.
  • Normal: Characterized by its bell-shaped curve and defined by its mean $\mu$ and variance $\sigma^2$.
  • Log-normal: Distribution of a random variable whose logarithm is normally distributed.
  • Pareto: Represents a power-law distribution often used to describe wealth or income distributions, where a small number of items account for the majority of the effect.
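R's built-in `d`/`p`/`q`/`r` functions cover most of these distributions; a few illustrative calls are sketched below (note that R parameterizes the geometric and negative binomial by the number of failures, not trials):

```r
dbinom(3, size = 10, prob = 0.4)   # P(Binomial(10, 0.4) = 3)
pgeom(2, prob = 0.3)               # P(at most 2 failures before the first success)
qnorm(0.975, mean = 0, sd = 1)     # 97.5th percentile of the standard normal
rlnorm(5, meanlog = 0, sdlog = 1)  # 5 draws from a log-normal

set.seed(1)
mean(rnbinom(1e5, size = 3, prob = 0.5))  # avg. failures before the 3rd success (~3)
```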

The Sample Mean, Central Limit Theorem, and Estimation

Lecture summary (from slides)
  • Sample Mean: Defined as the arithmetic average of the n random variables in a random sample of size n, denoted $\bar{X}_n$.
  • Distribution of Sample Mean: Explores how the sample mean's distribution is centered around the population mean, becomes more concentrated as the sample size increases, and has variance $\sigma^2/n$.
  • Central Limit Theorem: States that, for a large enough random sample from any distribution (with finite variance), the distribution of the sample mean is approximately normal (see the simulation below).
  • Importance of Central Limit Theorem: Highlights the practical significance of being able to approximate the distribution of the sample mean regardless of the original distribution.
  • Statistics: Introduces statistics as the study of estimation and inference, focusing initially on estimation and defining parameters and estimators.
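A CLT sketch in R, using an exponential population chosen only because it is visibly skewed; it is not a course example:

```r
# Sample means of a skewed distribution look roughly normal once n is moderately large.
set.seed(310)
n <- 50
xbar <- replicate(5000, mean(rexp(n, rate = 1)))

c(mean = mean(xbar), sd = sd(xbar), theory_sd = 1 / sqrt(n))  # sd is about sigma/sqrt(n)
hist(xbar, breaks = 40, freq = FALSE, main = "Sampling distribution of the mean")
curve(dnorm(x, mean = 1, sd = 1 / sqrt(n)), add = TRUE)       # normal approximation
```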

R Tutorial: Simulations

Module 6: Assessing and Deriving Estimators, Confidence Intervals, and Hypothesis Testing

Assessing and Deriving Estimators

Lecture summary (from slides)
  • Estimators are assessed based on the characteristics of their distributions, focusing on unbiasedness.
  • The sample mean and variance for an i.i.d. sample are unbiased estimators for the population mean and variance, respectively.
  • Efficiency of estimators is compared based on a given sample size, with a preference for more efficient ones among unbiased estimators.
  • Mean squared error is considered to balance bias and variance in estimators, aiming for a minimum.
  • Consistency is another criterion, indicating that an estimator's distribution converges to a single point at the true parameter as sample size increases.
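A simulation sketch comparing two variance estimators, illustrating unbiasedness, the bias-variance tradeoff, and consistency; the distribution and sample sizes are arbitrary choices:

```r
# The usual s^2 (divide by n - 1) is unbiased; dividing by n is biased but has lower
# variance.  Both converge to the true variance as n grows (consistency).
set.seed(8)
sim <- function(n) replicate(10000, {
  x <- rnorm(n, mean = 0, sd = 2)        # true variance is 4
  c(unbiased = var(x), biased = var(x) * (n - 1) / n)
})
res_small <- sim(10)
res_large <- sim(200)
rowMeans(res_small)   # unbiased estimator averages near 4; biased one falls short
rowMeans(res_large)   # with n = 200 both are close to 4
```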

Confidence Intervals, Hypothesis Testing and Power Calculations

Lecture summary (from slides)
  • The lecture discusses the concept of standard error:
    • The standard error measures how tightly the estimator's distribution is concentrated around the unknown parameter.
    • Reporting an estimate along with its standard error, including methods of estimation and calculation.
    • Construction of confidence intervals as an alternative method to quantify reliability, providing equivalent information in a different form.
  • Confidence intervals are introduced as a way to quantify the reliability of an estimate:
    • Finding functions of the data that define the interval's endpoints such that the true parameter falls within the interval with a specified probability.
    • Different scenarios for constructing confidence intervals, such as when the variance is known or unknown.
    • Criteria for choosing functions A & B to construct confidence intervals, typically balancing the probability on each side of the interval.
  • The lecture explains hypothesis testing as a tool to determine if there is enough evidence to contradict some assertion about a population based on a random sample:
    • Purpose and definition of hypothesis testing.
    • Different types of hypotheses: simple, composite, null, alternative.
    • Example setups and calculations for hypothesis testing, considering type I and type II errors.
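A sketch of a confidence interval and a two-sided test with `t.test` on simulated data, followed by the same interval built by hand:

```r
# 95% confidence interval and test of H0: mu = 0, using the t distribution because
# the variance is estimated from the sample.
set.seed(20)
x <- rnorm(40, mean = 0.5, sd = 2)

t.test(x, mu = 0, conf.level = 0.95)

# The same interval by hand: estimate +/- t critical value * standard error.
mean(x) + c(-1, 1) * qt(0.975, df = length(x) - 1) * sd(x) / sqrt(length(x))
```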

Module 7: Causality, Analyzing Randomized Experiments, & Nonparametric Regression

Causality

Lecture summary (from slides)
  • Causality is the relationship between cause and effect, often expressed in everyday statements like taking a pill for a headache or attending MIT for a good job.
  • Many questions in economics and social sciences revolve around causality, such as the impact of immigration on wages or whether a wall between Mexico and the US would stop immigration.
  • The potential outcome framework, pioneered by Donald Rubin, helps to conceptualize causality by associating each action with a potential outcome.
  • Understanding causal effects involves comparing potential outcomes with and without a treatment, where the treatment effect is the difference between the two outcomes.
  • Issues like selection bias and interference can complicate causal inference, but randomized controlled trials (RCTs) offer a solution by ensuring treatment assignment does not depend on potential outcomes.
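A toy potential-outcomes simulation (not from the lecture) showing why a naive comparison is biased under selection while randomization recovers the average treatment effect:

```r
# People with higher Y(0) are more likely to take the treatment, so the naive
# comparison is biased upward; random assignment recovers the true effect of 1.
set.seed(31)
n  <- 100000
y0 <- rnorm(n)            # outcome without treatment
y1 <- y0 + 1              # outcome with treatment; true effect = 1

d_selected <- as.numeric(y0 > 0)   # selection into treatment depends on Y(0)
d_random   <- rbinom(n, 1, 0.5)    # randomized assignment

naive <- mean(y1[d_selected == 1]) - mean(y0[d_selected == 0])   # biased upward
rct   <- mean(y1[d_random == 1])   - mean(y0[d_random == 0])     # close to 1
c(naive = naive, randomized = rct)
```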

Analyzing Randomized Experiments

Lecture summary (from slides)
  • Lecture 15 by Prof. Esther Duflo focuses on analyzing randomized experiments in the context of 14.310x.
  • The lecture covers conventional approaches to analyzing Randomized Controlled Trials (RCTs), including the Fisher exact test and power calculations.
  • It discusses estimating treatment effects and their standard deviation using sample averages and variance estimators.
  • Confidence intervals for treatment effects are computed using the ratio of the difference and estimated standard error, with critical values from t-distribution or normal approximation.
  • Hypothesis testing involves assessing whether the average treatment effect is zero, utilizing test statistics like t-tests and p-values.
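A sketch of this conventional analysis on simulated data; the outcome model and effect sizes are arbitrary:

```r
# Difference in means, its standard error, a t-based CI, and (for a binary outcome)
# Fisher's exact test.
set.seed(15)
treat   <- rep(c(0, 1), each = 200)
outcome <- rnorm(400, mean = 10 + 2 * treat, sd = 5)   # true treatment effect = 2

t.test(outcome ~ treat)                     # difference in means, SE, CI, p-value

binary <- rbinom(400, 1, prob = 0.3 + 0.15 * treat)    # e.g., an employment indicator
fisher.test(table(treat, binary))           # Fisher exact test of no effect
```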

Exploratory Data Analysis: Nonparametric Comparisons and Regressions

Lecture summary (from slides)
  • Exploratory Data Analysis (EDA) is crucial before building models, especially with observational data where clear hypotheses may not exist.
  • Techniques like the Kolmogorov-Smirnov Test are used to compare distributions, assessing differences between groups or conditions.
  • One-sided Kolmogorov-Smirnov Tests are employed to ascertain if one distribution dominates another.
  • Asymptotic distribution of the KS statistic allows for hypothesis testing using critical values.
  • Non-parametric regression techniques like Kernel regression estimate relationships between variables without assuming a specific functional form.
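A short sketch of a two-sample Kolmogorov-Smirnov test and a kernel regression, using base R's `ks.test` and `ksmooth` on simulated data:

```r
set.seed(9)
g1 <- rnorm(300)
g2 <- rnorm(300, mean = 0.3)
ks.test(g1, g2)                       # compares the two empirical CDFs

x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd = 0.3)
fit <- ksmooth(x, y, kernel = "normal", bandwidth = 1)   # no functional form assumed
plot(x, y)
lines(fit)                            # smoothed estimate of E[Y | X = x]
```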

Module 8: Single and Multivariate Linear Models

The Linear Model

Lecture summary (from slides)
  • Lecture 17 covers the linear model in statistics, building upon probability and parameter estimation concepts.
  • It discusses the importance of joint distributions in social science and introduces the concept of the linear model as a way to estimate parameters of these joint distributions.
  • The linear model is presented as a means to analyze the conditional distribution of an outcome variable given continuous explanatory variables.
  • Estimation methods like linear regression and least squares are introduced for parameter estimation.
  • Assumptions for linear regression include fixed explanatory variables, homoskedasticity, and no serial correlation.
  • Interpretation of parameter estimates involves understanding the effect of explanatory variables on the outcome variable, with special considerations for dummy variables and non-linear relationships.

The Multivariate Linear Model

Lecture summary (from slides)
  • Introduction to a general linear model with matrix notation
  • Assumptions including identification and error behavior
  • Explanation of linear independence and invertibility of $X^TX$
  • Examples illustrating issues with linear dependence among regressors
  • Inference methods including hypothesis testing and estimation under restrictions, particularly F-tests
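A sketch of an F-test of joint restrictions via `anova`, plus an illustration of how perfect collinearity makes $X^TX$ non-invertible; the data and coefficients are simulated, not from the course:

```r
set.seed(22)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 2 * x1 + 0.5 * x2 + rnorm(n)        # x3 is irrelevant

unrestricted <- lm(y ~ x1 + x2 + x3)
restricted   <- lm(y ~ x1)                    # imposes beta2 = beta3 = 0
anova(restricted, unrestricted)               # F statistic for the joint restriction

# Perfect collinearity: a regressor that is a linear combination of the others gets
# an NA coefficient because X'X cannot be inverted.
x4 <- x1 + x2
coef(lm(y ~ x1 + x2 + x4))
```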
Summary (from video)
  • Explanation of lm as a function for building linear models.
  • Concept of classes and objects using the analogy of bananas as instances of the class "fruit."
  • Creation of artificial data points for demonstration.
  • Demonstration of fitting the linear model, checking its class, summarizing its coefficients, plotting the data points, and visualizing the linear regression line.

Some important code commands used in the video include:

  • lm: Function used to fit linear models.
  • summary: Function used to summarize the output of the linear model.
  • plot: Function used to create plots of data.
  • abline: Function used to add a straight line to a plot.
  • class: Function used to check the class of an object.
  • names: Function used to retrieve the names of an object's components (e.g., the columns of a data frame or the elements of a fitted model).
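A self-contained sketch of the workflow these commands describe, with artificial data points; the video's actual numbers are not reproduced here:

```r
set.seed(14310)
x <- 1:30
y <- 3 + 0.7 * x + rnorm(30, sd = 2)
dat <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = dat)   # fit the linear model
class(fit)                     # "lm": fit is an object of class lm
names(fit)                     # components stored inside the fitted object
summary(fit)                   # coefficient estimates, SEs, t statistics, R^2

plot(dat$x, dat$y)             # scatter plot of the data points
abline(fit)                    # add the fitted regression line
```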

Module 9: Practical Issues in Running Regressions and Omitted Variable Bias

Practical Issues in Running Regressions

Additional content about the t-test for regression coefficients (at beginning of Lecture 19 slides)
  • The t-test is an essential tool for statistical inference within the linear model framework, and in practice it is used even when the errors do not necessarily follow a normal distribution.
  • Its exact mathematical justification relies on normally distributed errors, which makes the coefficient estimates normally distributed; their variances, however, are typically unknown and must be estimated.
  • When normality is not guaranteed, the t-test provides a more conservative hypothesis test than the normal approximation, especially for small sample sizes.
  • The test statistic $T$ for a hypothesis about a coefficient $\beta$ divides the deviation of the estimate from its hypothesized value by its standard error (SE), which accounts for the unknown error variance.
  • The t-test is particularly useful for testing single coefficients, such as in a one-sided test, and is printed automatically when running regressions, making it convenient for analysis (reproduced by hand in the sketch below).
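A sketch on simulated data showing that the t statistic printed by `summary()` is just the estimate divided by its standard error:

```r
set.seed(19)
x <- rnorm(100)
y <- 1 + 0.4 * x + rnorm(100)
fit <- lm(y ~ x)

est <- coef(fit)["x"]
se  <- sqrt(diag(vcov(fit)))["x"]
c(t_by_hand      = unname(est / se),
  t_from_summary = summary(fit)$coefficients["x", "t value"])
```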
Lecture summary (from slides)
  • Dummy variables play a crucial role in regression analysis, representing categorical variables with binary indicators, allowing for the inclusion of qualitative data in regression models.
  • When dealing with a categorical variable with multiple levels, transforming it into dummy variables is necessary for regression analysis, typically resulting in one less dummy variable than the number of levels to avoid multicollinearity.
  • Interpreting coefficients of dummy variables involves comparing each group to a reference group, with the coefficient representing the difference in the outcome variable between the group and the reference.
  • Dummy variables can be combined with interaction terms to analyze the effects of multiple variables simultaneously, providing insights into how different groups or factors influence the regression function.
  • Techniques such as difference-in-differences models utilize dummy variables and interaction terms to examine changes in outcomes before and after an intervention or treatment, facilitating causal inference in observational studies.
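A sketch of dummies, an interaction term, and a difference-in-differences style regression on simulated data; the group structure and effect sizes are arbitrary:

```r
set.seed(19310)
n     <- 1000
group <- rbinom(n, 1, 0.5)                 # 1 = treated group
post  <- rbinom(n, 1, 0.5)                 # 1 = after the intervention
y     <- 2 + 1 * group + 0.5 * post + 1.5 * group * post + rnorm(n)

did <- lm(y ~ group * post)     # expands to group + post + group:post
summary(did)                    # the interaction coefficient estimates the DiD effect

# A categorical variable entered as a factor is expanded into dummies automatically,
# with one level left out as the reference category.
region <- factor(sample(c("north", "south", "east"), n, replace = TRUE))
coef(lm(y ~ region))            # each coefficient compares a region to the reference
```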

Omitted Variable Bias

Lecture summary (from slides)
  • Kernel regression can reveal non-linear relationships between variables, but OLS can still be used by transforming data through polynomials or partitioning ranges of variables.
  • Polynomial models offer flexibility, with options like straight polynomials, series expansions, or orthogonal polynomials.
  • Non-linear transformations like taking the log of variables or interacting them can be useful, and machine learning tools can aid in choosing transformations.
  • Using dummies for approximation involves partitioning the range of variables into intervals and defining dummies accordingly.
  • Locally linear regression balances bias and variance by performing weighted linear regressions around points of interest, using a kernel to determine weights within a bandwidth.
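A sketch of three of these approximations (a cubic polynomial, interval dummies, and a local regression; base R's `loess` stands in here for the locally weighted idea described) on simulated data:

```r
set.seed(44)
x <- runif(400, 0, 10)
y <- sin(x) + rnorm(400, sd = 0.3)

poly_fit  <- lm(y ~ poly(x, 3))                      # orthogonal polynomial terms
bins      <- cut(x, breaks = seq(0, 10, by = 1))     # partition the range of x
dummy_fit <- lm(y ~ bins)                            # step-function approximation
local_fit <- loess(y ~ x, span = 0.3)                # locally weighted regression

plot(x, y)
ord <- order(x)
lines(x[ord], fitted(poly_fit)[ord])
lines(x[ord], fitted(local_fit)[ord], lty = 2)
```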

Module 10: Endogeneity, Instrumental Variables, Experimental Design, and Data Visualization

Endogeneity and Instrumental Variables

Lecture summary (from slides)
  • Establishing causality often requires more than regression controls, especially when facing endogeneity issues.
  • Endogeneity arises when an explanatory variable is correlated with the unobserved error term, for example through reverse causality or omitted factors, making it challenging to determine causality.
  • Instrumental variables (IV) provide a solution by using external factors that affect the treatment variable but not the outcome directly.
  • Randomized experiments, like the scholarship program in Ghana, can serve as instrumental variables, providing exogenous variation in the treatment variable.
  • IV estimation involves two stages: the first stage estimates the effect of the instrument on the treatment variable, while the second stage estimates the effect of the treatment variable on the outcome.
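A two-stage least squares sketch done by hand on simulated data; in practice a package such as `AER::ivreg` would be used, which also reports correct second-stage standard errors:

```r
set.seed(255)
n <- 5000
z <- rbinom(n, 1, 0.5)                          # instrument, e.g., randomized offer
u <- rnorm(n)                                   # unobserved confounder
d <- as.numeric(0.8 * z + u + rnorm(n) > 0.5)   # treatment depends on z and on u
y <- 2 * d + u + rnorm(n)                       # true causal effect of d is 2

coef(lm(y ~ d))["d"]                      # OLS is biased because d is endogenous

first  <- lm(d ~ z)                       # first stage: instrument -> treatment
second <- lm(y ~ fitted(first))           # second stage: outcome on predicted treatment
coef(second)[2]                           # close to the true effect of 2
```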

Experimental Design

Lecture summary (from slides)
  • Experimental Design:
    • Randomization targets various aspects including interventions, participants, and levels of randomization such as schools, individuals, villages, or cells.
    • Methods like simple randomization, stratification, and clustering are employed.
    • Randomization can occur around cutoffs or through phase-ins, with methods like encouragement design utilized.
  • Question-driven Designs:
    • Aim to answer specific questions or achieve objectives like estimating equilibrium effects or understanding interventions better.
  • Example: Active Labor Market Policy Evaluation:
    • Job placement assistance evaluated through randomized trials to address high unemployment concerns.
    • Criticism regarding displacement effects highlighted, emphasizing the need for thorough evaluation.
  • Two-step Randomized Controlled Trial:
    • Involves a two-stage randomization process: first, treatment proportions are assigned to geographic areas, followed by random assignment of treatment status to individuals within those areas.
    • Enables researchers to assess the impact of the program while accounting for both individual and geographical variations, thus providing robust insights into its effectiveness.
    • Example: evaluating the efficacy of a large-scale search assistance program in France aimed at young, educated job seekers.
  • Impact Assessment of Raskin Program in Indonesia:
    • Investigates the effect of providing transparency about subsidy eligibility on household subsidy receipt and usage.
    • Utilizes randomized trials in villages and various card design variations to elucidate mechanisms and impacts.
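A sketch of simple and stratified randomization in R; the strata and sample size are made up for illustration:

```r
set.seed(275)
units <- data.frame(id = 1:200, stratum = rep(c("urban", "rural"), each = 100))

# Simple randomization: assign exactly half of all units to treatment.
units$treat_simple <- sample(rep(c(0, 1), length.out = nrow(units)))

# Stratified randomization: randomize separately within each stratum so that
# treatment is balanced in both urban and rural areas by construction.
units$treat_strat <- ave(rep(0, nrow(units)), units$stratum,
                         FUN = function(v) sample(rep(c(0, 1), length.out = length(v))))

table(units$stratum, units$treat_strat)   # 50 treated / 50 control in each stratum
```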

Visualizing Data

Lecture summary (from slides)
  • Lecture focuses on two goals of data visualization: personal understanding and communication to others.
  • Emphasizes scientific visualization over journalistic or non-academic methods.
  • Defines visualization as transforming data into visible forms, emphasizing readability and relevance.
  • Goals include showing data truthfully, illustrating a story, reducing clutter, and convincing while complementing text.
  • Highlights principles from Tufte, such as maximizing data-ink ratio and avoiding chart junk, for effective visualization.
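A small ggplot2 sketch in the spirit of the data-ink principle, using the built-in `mtcars` data rather than an example from the lecture:

```r
# The same scatter plot with a minimal theme and informative labels instead of
# default chart furniture.
library(ggplot2)

ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Heavier cars get fewer miles per gallon") +
  theme_minimal()
```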

Module 11: Machine Learning

Lecture summary (from slides)
  • Introduction to Machine Learning and Econometrics: Presented by Sendhil Mullainathan with Jann Spiess, covering their interest in the topic and its relevance to economics.
  • Magic of Machine Learning: Discusses the awe-inspiring nature of machine learning, its potential applications beyond vision, and the AI approach emphasizing perfect execution and empirical learning.
  • Programming vs. Learning: Contrasts traditional programming methods with machine learning's data-driven approach, exemplified by sentiment analysis using word vectors.
  • Applications and Innovations of Machine Learning: Highlights various domains benefiting from machine learning, such as post office address reading, voice recognition, spam filters, recommender systems, and driverless cars.
  • Overfitting and High Dimensionality: Explores challenges like overfitting in estimation, the impact of wide data, and the necessity of understanding algorithms to address these issues effectively.
Lecture summary II (from slides)
  • Understanding OLS and Model Fit: The focus is on the tension between in-sample fit and out-of-sample fit in econometrics, particularly in the context of decision trees.
  • OLS vs Subset Selection: Discussion on the issue of using too many variables and exploring functions that use a subset of variables to avoid overfitting.
  • Constrained Minimization and Regularization: Introduction of constrained minimization to control complexity and regularization techniques like Lasso and Ridge to penalize more expressive functions.
  • Tuning Parameter Lambda: Exploring the tradeoff between penalizing expressiveness and in-sample fit, highlighting the importance of choosing the right level of complexity.
  • Empirical Tuning and Cross-Validation: Use of cross-validation to estimate the performance of different regularization levels and the importance of out-of-sample performance in choosing the best model.
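A sketch of lasso and ridge with a cross-validated choice of lambda, assuming the `glmnet` package; the wide-ish predictors are simulated, not a course dataset:

```r
library(glmnet)
set.seed(299)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] * 2 + X[, 2] * (-1) + rnorm(n)   # only the first two predictors matter

lasso <- cv.glmnet(X, y, alpha = 1)   # alpha = 1 is lasso, alpha = 0 is ridge
ridge <- cv.glmnet(X, y, alpha = 0)

lasso$lambda.min                      # lambda chosen by cross-validation
coef(lasso, s = "lambda.min")[1:6, ]  # lasso sets many coefficients exactly to zero
```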

Optional Module: Writing an Empirical Paper

Writing an Empirical Social Science Paper
