Introduction to Search Engines - EMbeDS-education/ComputingDataAnalysisModeling20242025 GitHub Wiki

Please use the right sidebar to navigate the pages of interest.

Instructor: Paolo Ferragina ([email protected])

Language: English

Duration: 20h, January-May 2025.

Room: TBD (Lectures are held in presence)

Description: The goal of this course is to introduce, with a lightly technical yet rigorous educational approach, the main methodologies, algorithms, and AI techniques that underlie the design of modern search engines and, more generally, Information Retrieval systems. The lectures are structured so that we describe, study, and analyze the main components of a search engine: Crawler, Parser, Indexer, Query resolver, Query and Document annotator, and Results Ranker. We will also deal with novel Generative AI tools and how they contribute to shaping a new search task. The structure of the lectures also allows us to discuss text mining and text analytics in general, thus stimulating multi-disciplinary discussions that pertain to other applications, not limited to (web) search.

Materials:

Topics of the lectures: This is a tentative schedule of the topics, just to give an idea of the course content.

  • Topic 1 (Introduction) [MRS: Chap 1, and Sect. 2.3-2.4]:

    • Modern IR, not just search engines! Web search engine's structure.
    • The Inverted List (IL): dictionary + postings.
    • The boolean retrieval model. How to implement AND, OR, NOT queries, skip pointers, and time complexities.
    • The IL with positional information: solving phrase queries and proximity queries. Zone indexes.
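As a small illustration of the boolean retrieval model listed above, here is a minimal sketch (not course material) of the classic AND merge over two sorted postings lists; the function name and toy docIDs are my own choices.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted lists of docIDs in O(len(p1) + len(p2)) time."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:          # docID occurs in both postings lists
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Postings lists for two hypothetical terms:
print(intersect([1, 3, 5, 8, 13], [2, 3, 8, 21]))  # [3, 8]
```

Skip pointers, discussed in the lectures, speed up exactly this merge by letting the smaller pointer jump ahead in long lists.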
  • Topic 2 (Storage) [F: Chap 11, Sect 5.3, 19.6]:

    • Posting list compression and its codes: gamma, variable bytes (t-nibble), PForDelta, Elias-Fano.
    • Document compression: a sketch of gzip for individual docs and groups of docs.
    • Document deduplication (exact or approximate): Classic versus Locality-sensitive hashing (LSH).
    • LSH: basics, Hamming distance (proof of the probability bounds), shingling & Jaccard similarity (min-hashing), cosine similarity.
    • LSH: use in offline and online settings.
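To make the posting-list compression item concrete, here is a minimal sketch of Elias gamma coding applied to docID gaps (one of the codes named above); the helper names are hypothetical.

```python
def gamma_encode(x):
    """Elias gamma code for a positive integer: unary length prefix + binary offset."""
    b = bin(x)[2:]                          # binary representation, leading '1' included
    return '1' * (len(b) - 1) + '0' + b[1:]  # (len(b)-1) ones, a zero, then the offset

def encode_postings(postings):
    """Gap-encode a sorted postings list, then gamma-code each gap."""
    gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    return ''.join(gamma_encode(g) for g in gaps)

print(gamma_encode(9))              # 1110001
print(encode_postings([3, 7, 11]))  # gaps 3, 4, 4
```

Gap encoding works because sorted docIDs yield small gaps, and gamma codes spend few bits on small integers.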
  • Topic 3 (Parsing) [MRS: Sect 2.1, 2.2, 5.1, 19.2]

    • Sketch of crawling: properties of the Web graph (a note on the Bloom Filter).
    • Sketch of document parsing: tokenization, normalization, lemmatization, stemming, thesauri.
    • The Zipf, Heaps, and Luhn laws.
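The parsing and Zipf-law items above can be sketched in a few lines; this is a toy pipeline under my own simplifying assumptions (lowercasing plus splitting on non-alphanumerics, nothing like real lemmatization or stemming).

```python
import re
from collections import Counter

def tokenize(text):
    """Toy tokenization + normalization: lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r'[^a-z0-9]+', text.lower()) if t]

def rank_frequency(tokens):
    """(rank, frequency) pairs by decreasing frequency: the data behind a Zipf plot."""
    counts = Counter(tokens)
    return [(r + 1, f) for r, (_, f) in enumerate(counts.most_common())]

toks = tokenize("The cat, the hat!")
print(toks)                  # ['the', 'cat', 'the', 'hat']
print(rank_frequency(toks))  # frequencies 2, 1, 1
```

On a real corpus, Zipf's law predicts that rank times frequency stays roughly constant across the ranked list.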
  • Topic 4 (Indexing) [MRS: Chap 3 and Sect 5.2]

    • Exact search over the dictionary: hashing. Prefix search: compacted trie, front coding, 2-level indexing.
    • Approximate search over the dictionary: via overlap measure with k-gram index.
    • Wild-card queries: Permuterm.
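The Permuterm idea named above can be sketched as follows; this is a minimal in-memory version (a dict standing in for the sorted dictionary, and a linear scan standing in for prefix search), with all function names my own.

```python
def permuterm_rotations(term):
    """All rotations of term + '$' terminator; each rotation identifies the term."""
    t = term + '$'
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(vocab):
    """Map every rotation back to its original dictionary term."""
    return {rot: term for term in vocab for rot in permuterm_rotations(term)}

def wildcard_lookup(index, query):
    """Answer a single-wildcard query X*Y by rotating it to Y$X and prefix-matching."""
    x, y = query.split('*')
    prefix = y + '$' + x
    return sorted({t for rot, t in index.items() if rot.startswith(prefix)})

idx = build_permuterm(['hello', 'help', 'halo'])
print(wildcard_lookup(idx, 'hel*'))  # ['hello', 'help']
print(wildcard_lookup(idx, 'h*o'))   # ['halo', 'hello']
```

In a real index the rotations would live in a sorted structure (e.g. a compacted trie, as in the prefix-search item above) so the prefix match is fast instead of a scan.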
  • Topic 5 (Content-based Ranking) [MRS: Sect 6.2, 6.3, 7.1, 8.1, 9.1]

    • Text-based ranking: tf-idf.
    • Vector space model and cosine similarity doc-doc and query-doc. Sketch of fast top-k retrieval (high idf, clustering).
    • Performance measures: precision, recall, F1, DCG, and NDCG.
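A minimal sketch of the tf-idf vector space model and cosine similarity listed above, assuming raw term counts for tf and log10(N/df) for idf (one common convention; the lectures may use a different weighting).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse tf-idf vectors for a list of tokenized docs: tf = count, idf = log10(N/df)."""
    N = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # document frequency per term
    return [{t: c * math.log10(N / df[t]) for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(u, v):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tfidf_vectors([['a', 'b'], ['a', 'c']])
print(cosine(vecs[0], vecs[1]))  # 0.0: 'a' appears everywhere, so its idf is 0
```

The same `cosine` works for query-doc scoring by treating the query as a tiny document.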
  • Topic 6 (Network-based ranking) [MRS: Chap 21]

    • Random Walks. Link-based ranking: PageRank.
    • Topic-specific PageRank. Personalized PageRank.
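The random-walk view of PageRank listed above can be sketched via power iteration; this is a toy version with uniform teleportation and a standard treatment of dangling nodes, not the exact formulation used in the lectures.

```python
def pagerank(links, d=0.85, iters=50):
    """Power iteration for PageRank. links: dict node -> list of out-neighbours."""
    nodes = list(links)
    n = len(nodes)
    pr = {u: 1.0 / n for u in nodes}            # start from the uniform distribution
    for _ in range(iters):
        nxt = {u: (1.0 - d) / n for u in nodes}  # teleportation mass
        for u, outs in links.items():
            if outs:
                share = d * pr[u] / len(outs)    # walker follows a random out-link
                for v in outs:
                    nxt[v] += share
            else:                                # dangling node: spread mass uniformly
                for v in nodes:
                    nxt[v] += d * pr[u] / n
        pr = nxt
    return pr

print(pagerank({'a': ['b'], 'b': ['a']}))  # symmetric graph: both scores 0.5
```

Topic-specific and personalized PageRank replace the uniform teleportation vector with one concentrated on the topic's (or user's) preferred pages.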
  • Topic 7 (Advanced topics):