Introduction to Search Engines - EMbeDS-education/ComputingDataAnalysisModeling20242025 GitHub Wiki

Please use the right sidebar to navigate the pages of interest.

Instructor: Paolo Ferragina ([email protected])

Language: English

Duration: 20h, January-May 2025.

Room: TBD (Lectures are held in presence)

Description: The goal of this course is to introduce, with a lightly technical yet rigorous educational approach, the main methodologies, algorithms, and AI techniques that underlie the design of modern search engines and, more generally, Information Retrieval systems. The lectures are structured so that we describe, study, and analyze the main components of a search engine: Crawler, Parser, Indexer, Query resolver, Query and Document annotator, and Results Ranker. We will also deal with novel Generative AI tools and how they contribute to shaping a new search task. The structure of the lectures also allows us to discuss text mining and text analytics in general, thus stimulating multi-disciplinary discussions that pertain to other applications, not limited to (web) search.

Materials:

Topics of the lectures: This is a tentative schedule of the topics, just to give an idea of the course content.

  • Topic 1 (Introduction) [MRS: Chap 1, and Sect. 2.3-2.4]:

    • Modern IR, not just search engines! Web search engine's structure.
    • The Inverted List (IL): dictionary + postings.
    • The boolean retrieval model. How to implement AND, OR, NOT queries, skip pointers, and time complexities.
    • The IL with positional information: solving phrase queries and proximity queries. Zone indexes.
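As a small illustration of the boolean retrieval model listed above, here is a minimal sketch (not course material) of the classic AND merge over two sorted postings lists; the function name and toy docIDs are my own choices.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted lists of docIDs in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID occurs in both postings lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Postings lists for two hypothetical terms:
print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 21]))  # [3, 8]
```

Skip pointers, discussed in the lectures, speed up exactly this merge by letting the smaller pointer jump ahead in long lists.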
  • Topic 2 (Storage) [F: Chap 11, Sect 5.3, 19.6]:

    • Posting list compression and its codes: gamma, variable bytes (t-nibble), PForDelta, Elias-Fano.
    • Document compression: a sketch of gzip for individual docs and groups of docs.
    • Document deduplication (exact or approximate): Classic versus Locality-sensitive hashing (LSH).
    • LSH: basics, Hamming distance (proof of the probability bounds), shingling & Jaccard similarity (min-hashing), cosine similarity.
    • LSH: use in offline and online settings.
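To make the posting-list compression item concrete, here is a minimal sketch of Elias gamma coding applied to docID gaps (one of the codes named above); the helper names are hypothetical.

```python
def gamma_encode(x):
    """Elias gamma code for a positive integer: unary length prefix + binary offset."""
    b = bin(x)[2:]                          # binary representation, leading '1' included
    return '1' * (len(b) - 1) + '0' + b[1:]  # (len(b)-1) ones, a zero, then the offset

def encode_postings(postings):
    """Gap-encode a sorted postings list, then gamma-code each gap."""
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return ''.join(gamma_encode(g) for g in gaps)

print(gamma_encode(9))              # 1110001
print(encode_postings([3, 7, 11]))  # gaps 3, 4, 4
```

Gap encoding works because sorted docIDs yield small gaps, and gamma codes spend few bits on small integers.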
  • Topic 3 (Parsing) [MRS: Sect 2.1, 2.2, 5.1, 19.2]

    • Sketch of crawling: properties of the Web graph (a note on the Bloom Filter).
    • Sketch of document parsing: tokenization, normalization, lemmatization, stemming, thesauri.
    • The Zipf, Heaps, and Luhn laws.
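The parsing and Zipf-law items above can be sketched in a few lines; this is a toy pipeline under my own simplifying assumptions (lowercasing plus splitting on non-alphanumerics, nothing like real lemmatization or stemming).

```python
import re
from collections import Counter

def tokenize(text):
    """Toy tokenization + normalization: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r'[^a-z0-9]+', text.lower()) if t]

def rank_frequency(tokens):
    """(rank, frequency) pairs by decreasing frequency: the data behind a Zipf plot."""
    counts = Counter(tokens)
    return [(r + 1, f) for r, (_, f) in enumerate(counts.most_common())]

toks = tokenize("The cat, the hat!")
print(toks)                  # ['the', 'cat', 'the', 'hat']
print(rank_frequency(toks))  # frequencies 2, 1, 1
```

On a real corpus, Zipf's law predicts that rank times frequency stays roughly constant across the ranked list.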
  • Topic 4 (Indexing) [MRS: Chap 3 and Sect 5.2]

    • Exact search over the dictionary: hashing. Prefix search: compacted trie, front coding, 2-level indexing.
    • Approximate search over the dictionary: via overlap measure with k-gram index.
    • Wild-card queries: Permuterm.
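The Permuterm idea named above can be sketched as follows; this is a minimal in-memory version (a dict standing in for the sorted dictionary, and a linear scan standing in for prefix search), with all function names my own.

```python
def permuterm_rotations(term):
    """All rotations of term + '$' terminator; each rotation identifies the term."""
    t = term + '$'
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    """Map every rotation back to its original dictionary term."""
    return {rot: term for term in vocab for rot in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Answer a single-wildcard query X*Y by rotating it to Y$X and prefix-matching."""
    x, y = query.split('*')
    prefix = y + '$' + x
    return sorted({t for rot, t in index.items() if rot.startswith(prefix)})

idx = build_permuterm(['hello', 'help', 'halo'])
print(wildcard_lookup(idx, 'hel*'))  # ['hello', 'help']
print(wildcard_lookup(idx, 'h*o'))   # ['halo', 'hello']
```

In a real index the rotations would live in a sorted structure (e.g. a compacted trie, as in the prefix-search item above) so the prefix match is fast instead of a scan.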
  • Topic 5 (Content-based Ranking) [MRS: Sect 6.2, 6.3, 7.1, 8.1, 9.1]

    • Text-based ranking: tf-idf.
    • Vector space model and cosine similarity doc-doc and query-doc. Sketch of fast top-k retrieval (high idf, clustering).
    • Performance measures: precision, recall, F1, DCG, and NDCG.
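A minimal sketch of the tf-idf vector space model and cosine similarity listed above, assuming raw term counts for tf and log10(N/df) for idf (one common convention; the lectures may use a different weighting).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse tf-idf vectors for a list of tokenized docs: tf = count, idf = log10(N/df)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
    return [{t: c * math.log10(N / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors([['a', 'b'], ['a', 'c']])
print(cosine(vecs[0], vecs[1]))  # 0.0: 'a' appears everywhere, so its idf is 0
```

The same `cosine` works for query-doc scoring by treating the query as a tiny document.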
  • Topic 6 (Network-based ranking) [MRS: Chap 21]

    • Random Walks. Link-based ranking: PageRank.
    • Topic-specific PageRank. Personalized PageRank.
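The random-walk view of PageRank listed above can be sketched via power iteration; this is a toy version with uniform teleportation and a standard treatment of dangling nodes, not the exact formulation used in the lectures.

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank. links: dict node -> list of out-neighbours."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}            # start from the uniform distribution
    for _ in range(iters):
        nxt = {u: (1.0 - d) / n for u in nodes}  # teleportation mass
        for u, outs in links.items():
            if outs:
                share = d * pr[u] / len(outs)    # walker follows a random out-link
                for v in outs:
                    nxt[v] += share
            else:                                # dangling node: spread mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr

print(pagerank({'a': ['b'], 'b': ['a']}))  # symmetric graph: both scores 0.5
```

Topic-specific and personalized PageRank replace the uniform teleportation vector with one concentrated on the topic's (or user's) preferred pages.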
  • Topic 7 (Advanced topics):