SLLD - EMbeDS-education/ComputingDataAnalysisModeling20242025 GitHub Wiki
This is the home page of the course Statistical Learning & Large Data, Modules 1 and 2.
The right sidebar can be used to navigate pages related to the course, e.g., to consult the calendar, access example datasets, and retrieve slides, code and materials for our Lectures and Practicum sessions.
SYLLABUS INFORMATION
Instructor: Francesca Chiaromonte
Practicum coordinator: Simone Tonini
Course referent: Giaime Paolo Pes
Language: English
Duration: Module 1, 20h (February 5 to 21, 2025); Module 2, 20h (March 5 to 21, 2025).
Description: This course will introduce the students to various aspects of contemporary Statistical Learning, with a particular focus on approaches for the analysis of large, complex datasets. The content will be organized in two Modules, and will include topics selected from the following areas:
Module 1
- Unsupervised classification; Clustering methods
- Unsupervised dimension reduction; Principal Components Analysis and related techniques
- Supervised classification methods
- Non-parametric regression methods
- Resampling methods; Cross-Validation, the Bootstrap and permutation-based techniques
Module 2
- Feature selection and regularization techniques for high-dimensional Linear and Generalized Linear Models
- Feature screening algorithms for ultra-high dimensional supervised problems
- Supervised dimension reduction; Sufficient Dimension Reduction and related techniques
- Subsampling/partitioning approaches for ultra-high sample sizes
- Under- and oversampling approaches for data rebalancing
Compared to traditional courses on multivariate statistics, regression and linear/generalized linear models, the focus will be on analyzing actual datasets of interest to the students, through projects and the Practicum sessions associated with each lecture.
Materials: Our main reference texts will be
- An Introduction to Statistical Learning – with Applications in R (James, Witten, Hastie, Tibshirani; Springer 1st ed. 2013, 2nd ed. 2021). Copies of the book are available at the SSSA library, but you can also download PDF versions of both the 1st and the 2nd edition (the latter comprises additional topics and materials) from the site https://www.statlearning.com/ (scroll to the bottom). At the site you can also find a Python-based version of the book. There are also numerous publicly available MOOC materials associated with the textbook, some from the authors themselves.
- Computer Age Statistical Inference (Efron, Hastie; Cambridge University Press 2016).
We will employ R, which is available as a free download, as the statistical software of choice for the course.
Slides and other support materials for the course will be made available through this GitHub Wiki.
Evaluation: Evaluation will be based on project presentations and written reports to be held/handed in at the end of the course. Each project will revolve around a dataset, to which students will apply techniques and approaches described during the course – thus building an overall analysis to be summarized in the final presentation and report. Ideally, students will work on datasets of their own choice. These could be related to their own research, or selected from public sources, see for instance:
- Compilation of data sources by R.H. Lock
- UCI Machine Learning Repository
- UCR Time Series Classification Archive
- AWS public data sets (Amazon)
- Italian Covid-19 data (Protezione Civile)
Note 1: Projects can be individual or group-based, depending on the interests and background of the students attending each module. Details will be determined and groups will be formed at the beginning of the course.
Note 2: Students attending only Module 1 or only Module 2 will still present their projects and hand in their written reports during a joint session to be scheduled after the end of both modules (likely in May).
Attendance: The course will be offered in person (see rooms specified in the general calendar), possibly with an option for remote attendance through WebEx or Teams for students who may not be able to participate in person. Details on remote attendance will also be determined and announced at the beginning of the course. Allievi Ordinari of Scuola Superiore Sant'Anna are required to attend in person unless explicitly excused (e.g., Allievi participating in the ERASMUS program abroad).
Prerequisites: A working knowledge of basic statistical inference procedures (point estimation, confidence intervals, testing) and of linear and generalized linear models. Such working knowledge can be obtained, or refreshed, through the ASM course offered by Chiara Seghieri.