Database tables and queries - VioletGiraffe/cpp-db GitHub Wiki

URLs:
{ *uID -> URL text | *Last fetch date | Number of tokens in this document }

Tokens:
{ *tID -> Token text | Total number of token encounters }

Postings (inverted index): for each token, the list of URLs where it occurs, with some attributes (location on the page, etc.)
{ *tID -> [{*uID | [token_location_1, token_location_2, ...]}, ...] }
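As a rough illustration, the three tables above could be mirrored by in-memory C++ structures. This is only a sketch: every type and field name (`UrlRecord`, `TokenRecord`, `Posting`, `Index`, `appendLocation`) is hypothetical, the date representation is an assumption, and the `*` markers from the schema are not interpreted here. Keeping locations in ascending order is also an assumption, made because it simplifies distance calculations later.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names; the wiki only lists the fields of each table.
using UrlId = std::uint64_t;   // uID, key of the URLs table
using TokenId = std::uint64_t; // tID, key of the Tokens table

struct UrlRecord {
    std::string urlText;
    std::int64_t lastFetchDate = 0; // assumption: seconds since the Unix epoch
    std::uint32_t tokenCount = 0;   // number of tokens in this document
};

struct TokenRecord {
    std::string tokenText;
    std::uint64_t totalEncounters = 0; // total number of token encounters
};

// One element of the per-token array: a document and the token's positions in it.
struct Posting {
    UrlId uID = 0;
    std::vector<std::uint32_t> locations;
};

// Appends a location, assuming locations are recorded in ascending order.
inline void appendLocation(Posting& posting, std::uint32_t location) {
    assert(posting.locations.empty() || posting.locations.back() < location);
    posting.locations.push_back(location);
}

struct Index {
    std::unordered_map<UrlId, UrlRecord> urls;
    std::unordered_map<TokenId, TokenRecord> tokens;
    std::unordered_map<TokenId, std::vector<Posting>> postings;
};
```

A persistent store would replace the hash maps with on-disk tables, but the key-to-record shape stays the same.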

Also, two global single-integer counters are needed for normalizing the relevancy metrics:

  • Total number of documents indexed
  • Total number of tokens in all documents indexed
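The wiki does not say which relevancy metrics these counters normalize. As one possible reading, the sketch below derives two commonly used quantities from them: the average document length and a classic inverse document frequency, log(N / df). The names `GlobalCounters`, `averageDocumentLength` and `inverseDocumentFrequency` are hypothetical, and the choice of formulas is an assumption rather than something the source specifies.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical holder for the two global counters.
struct GlobalCounters {
    std::uint64_t documentsIndexed = 0; // total number of documents indexed
    std::uint64_t tokensIndexed = 0;    // total number of tokens in all documents
};

// Average number of tokens per document; 0 for an empty index.
inline double averageDocumentLength(const GlobalCounters& c) {
    if (c.documentsIndexed == 0)
        return 0.0;
    return static_cast<double>(c.tokensIndexed) /
           static_cast<double>(c.documentsIndexed);
}

// Assumed metric: inverse document frequency, log(N / df),
// where df is the number of documents containing the token.
inline double inverseDocumentFrequency(const GlobalCounters& c,
                                       std::uint64_t documentsWithToken) {
    assert(documentsWithToken > 0 && documentsWithToken <= c.documentsIndexed);
    return std::log(static_cast<double>(c.documentsIndexed) /
                    static_cast<double>(documentsWithToken));
}
```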

Crawling queries:

  1. INSERT into URLs IF UNIQUE uID (increment the global document counter)
  2. INSERT into Tokens IF UNIQUE tID
  3. UPDATE Postings by appending an ARRAY of {*uID | token_location} items to the existing array for the specified *tID; INSERT if it doesn't exist yet (increment the global token counter)
  4. DELETE from Postings WHERE *uID MATCHES (the only query that requires an index on *uID) (update the document counter and the token counter)
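The four crawling queries above can be sketched against a simple in-memory index. This is a minimal illustration, not the project's implementation: `CrawlIndex` and its member names are hypothetical, the URLs and Tokens tables are reduced to their text column, and query 4 is assumed to also drop the URL record, since the source says the document counter is updated. Without an index on uID, the deletion below has to scan every posting list, which is exactly the cost such an index would avoid.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using UrlId = std::uint64_t;
using TokenId = std::uint64_t;

struct Posting {
    UrlId uID;
    std::vector<std::uint32_t> locations;
};

// Hypothetical in-memory stand-in for the three tables and two counters.
struct CrawlIndex {
    std::unordered_map<UrlId, std::string> urls;    // other URL columns omitted
    std::unordered_map<TokenId, std::string> tokens; // encounter count omitted
    std::unordered_map<TokenId, std::vector<Posting>> postings;
    std::uint64_t documentCount = 0; // total number of documents indexed
    std::uint64_t tokenCount = 0;    // total number of tokens indexed

    // Query 1: insert only if uID is new; count the new document.
    bool insertUrl(UrlId uID, const std::string& url) {
        const bool inserted = urls.emplace(uID, url).second;
        if (inserted)
            ++documentCount;
        return inserted;
    }

    // Query 2: insert only if tID is new.
    bool insertToken(TokenId tID, const std::string& text) {
        return tokens.emplace(tID, text).second;
    }

    // Query 3: append to the array for tID, creating it if it does not exist.
    void addPosting(TokenId tID, UrlId uID, std::vector<std::uint32_t> locations) {
        tokenCount += locations.size();
        postings[tID].push_back(Posting{uID, std::move(locations)});
    }

    // Query 4: remove every posting of uID and adjust both counters.
    void removeUrl(UrlId uID) {
        if (urls.erase(uID) == 0)
            return; // unknown document, nothing to do
        assert(documentCount > 0);
        --documentCount;
        for (auto& entry : postings) { // full scan: no index on uID here
            std::vector<Posting>& list = entry.second;
            for (auto it = list.begin(); it != list.end();) {
                if (it->uID == uID) {
                    tokenCount -= it->locations.size();
                    it = list.erase(it);
                } else {
                    ++it;
                }
            }
        }
    }
};
```

Posting lists that become empty after a deletion are left in place in this sketch; a real store might prune them.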

Search queries:
1.1 For each token in the search query: SELECT tID FROM Tokens MATCH WHERE Token text == text of the query token
Requires an index on token text!
1.2 SELECT uID FROM Postings MATCH WHERE EXISTS tid1 AND tid2 AND ... AND tidN ORDER BY total distance between all tokens
Requires an index on tID.
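A minimal sketch of the two search steps follows, using hash maps as stand-ins for the two required indexes. All names (`SearchIndex`, `minDistance`, `search`) are hypothetical. The source does not define "total distance between all tokens"; the sketch assumes it is the sum, over consecutive query tokens, of the smallest gap between their locations in the document, which is only one possible interpretation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using UrlId = std::uint64_t;
using TokenId = std::uint64_t;
using Locations = std::vector<std::uint32_t>;
using PostingList = std::map<UrlId, Locations>; // uID -> token locations

struct SearchIndex {
    std::unordered_map<std::string, TokenId> tokenIdByText; // index on token text (1.1)
    std::unordered_map<TokenId, PostingList> postings;      // index on tID (1.2)
};

// Smallest gap between any location in a and any location in b.
std::uint64_t minDistance(const Locations& a, const Locations& b) {
    assert(!a.empty() && !b.empty());
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (std::uint32_t x : a)
        for (std::uint32_t y : b)
            best = std::min<std::uint64_t>(best, x > y ? x - y : y - x);
    return best;
}

// Returns the uIDs containing every query token, closest tokens first.
std::vector<UrlId> search(const SearchIndex& index,
                          const std::vector<std::string>& query) {
    // Step 1.1: resolve each token text to its posting list via tID.
    std::vector<const PostingList*> lists;
    for (const std::string& text : query) {
        auto tokenIt = index.tokenIdByText.find(text);
        if (tokenIt == index.tokenIdByText.end())
            return {}; // unknown token: no document can match
        auto postingIt = index.postings.find(tokenIt->second);
        if (postingIt == index.postings.end())
            return {};
        lists.push_back(&postingIt->second);
    }
    if (lists.empty())
        return {};

    // Step 1.2: keep documents present in every list, scored by distance.
    std::vector<std::pair<std::uint64_t, UrlId>> ranked; // (total distance, uID)
    for (const auto& candidate : *lists.front()) {
        const UrlId uID = candidate.first;
        const Locations* previous = &candidate.second;
        std::uint64_t total = 0;
        bool inAll = true;
        for (std::size_t i = 1; i < lists.size(); ++i) {
            auto current = lists[i]->find(uID);
            if (current == lists[i]->end()) {
                inAll = false;
                break;
            }
            total += minDistance(*previous, current->second);
            previous = &current->second;
        }
        if (inAll)
            ranked.emplace_back(total, uID);
    }
    std::sort(ranked.begin(), ranked.end()); // ascending total distance

    std::vector<UrlId> result;
    for (const auto& entry : ranked)
        result.push_back(entry.second);
    return result;
}
```

Iterating over the first token's posting list and probing the others keeps the sketch short; starting from the shortest list would reduce the number of probes.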