# CS410: Text Information Systems

**Semester Last Updated:** Fall 2021

## Instructor(s):

Chengxiang Zhai

## Topics Covered:

**Languages used**: Python

Covers the following:

- Overview of NLP and information extraction (techniques, motivations, issues, and limitations)
- Text retrieval systems:
  - Vector space models (includes bag-of-words representation, similarity measures, and frequently used VSMs)
  - TF-IDF weighting
  - Evaluation of text retrieval systems (precision, recall, average precision, F-score)

- Web crawlers and web indexing (includes coverage of PageRank and MapReduce)
- Recommender systems
- Topic modeling and text mining:
  - Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA)
  - Paradigmatic and syntagmatic word relations
  - Clustering (k-means clustering and hierarchical agglomerative clustering (HAC))
  - Text categorization methods (Naive Bayes, k-nearest neighbors)
  - Context-based text mining (CPLSA, time series analysis)

- Statistical language models:
  - Mixture models, n-gram models
  - Maximum likelihood estimation
  - Smoothing methods (Dirichlet prior smoothing, Jelinek-Mercer)
  - Basic overview of information theory (entropy, conditional entropy, mutual information) and its applications to statistical language models (SLMs)
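
As a rough illustration of the smoothing methods listed above, the following Python sketch applies the standard Jelinek-Mercer and Dirichlet prior formulas to a unigram language model. It is not taken from the course materials: the function names, the toy documents, and the parameter values (`lam`, `mu`) are assumptions chosen purely for illustration.

```python
from collections import Counter


def collection_model(docs):
    """Maximum likelihood unigram model p(w | C) over all documents."""
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jelinek_mercer(word, doc, p_coll, lam=0.1):
    """Linear interpolation: (1 - lam) * p_ml(w | d) + lam * p(w | C)."""
    p_ml = doc.count(word) / len(doc)
    return (1 - lam) * p_ml + lam * p_coll.get(word, 0.0)


def dirichlet_prior(word, doc, p_coll, mu=2000.0):
    """Dirichlet prior: (c(w, d) + mu * p(w | C)) / (|d| + mu)."""
    return (doc.count(word) + mu * p_coll.get(word, 0.0)) / (len(doc) + mu)


# Toy corpus (hypothetical data, for illustration only).
docs = [["text", "mining", "text"], ["information", "retrieval"]]
p_coll = collection_model(docs)

# "retrieval" never occurs in the first document, yet both smoothed
# estimates assign it a nonzero probability.
p_jm = jelinek_mercer("retrieval", docs[0], p_coll, lam=0.5)
p_dir = dirichlet_prior("retrieval", docs[0], p_coll, mu=2.0)
```

In both functions, a larger smoothing parameter (`lam` or `mu`) pulls the document estimate closer to the collection model, while a smaller one stays closer to the maximum likelihood estimate.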

## Meeting Schedule:

- None. CS410 is currently offered as an online-only asynchronous course.

## Assignment/Exam Overview:

- **Quizzes**: Weekly conceptual quizzes based on that week's lectures.
- **MPs**: Mid-sized programming assignments given at various points throughout the semester.
- **Exams**: Two non-cumulative exams, with Exam 1 covering the first half of the course material and Exam 2 covering the second half.
- **Final Project**: An open-ended, end-of-semester coding project on any topic relevant to the course.

## Grading Scheme:

Sourced from CS410's syllabus:

## Tips for Success:

- **Beware MP3:** MP3 is the last MP of the semester, and it is notably more conceptually difficult and much more time-consuming than the other MPs in the course. If there is any MP that you want to start early, this is the one.
- **Linear algebra and statistics help:** A fair portion of the course material relies on a working knowledge of linear algebra and (even more so) statistics. Although neither is a course prerequisite, credit or concurrent enrollment in MATH257/STAT400 or equivalent classes is genuinely helpful for wrapping your head around some of the material. That said, lacking this background is unlikely to affect your overall grade.

## Additional Notes:

- CS410 has been offered as an online-only asynchronous Coursera course since FA2019. Students looking for an in-person class experience are unlikely to get one, as the course looks set to continue on Coursera in the future. Students interested in text mining and information retrieval should take a look at CS412 and CS510 in addition to CS410.