Database tables and queries - VioletGiraffe/cpp-db GitHub Wiki
URLs:
{ *uID -> URL text | *Last fetch date | Number of tokens in this document }
Tokens:
{ *tID -> Token text | Total number of token encounters }
Postings (reverse index):
For each token, a list of URLs where it occurs with some attributes (location on the page etc.)
{ *tID -> [{*uID | [token_location_1, token_location_2, ...]}, ...] }
Also, two global single-integer counters are needed for normalizing the relevancy metrics:
- Total number of documents indexed
- Total number of tokens in all documents indexed
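The three tables and the two counters could be modelled in memory roughly as follows. This is only a sketch for illustration, not the cpp-db storage layout: every type and field name (`UrlRecord`, `TokenRecord`, `Posting`, `Index`, `averageDocumentLength`) is hypothetical, the date is assumed to be stored as a Unix timestamp, and average document length is just one example of a quantity the counters could normalize.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using UrlId = std::uint64_t;   // uID
using TokenId = std::uint64_t; // tID

// One row of the URLs table, keyed by uID.
struct UrlRecord {
    std::string url;                    // URL text
    std::int64_t lastFetchUnixTime = 0; // last fetch date (representation is an assumption)
    std::uint32_t tokenCount = 0;       // number of tokens in this document
};

// One row of the Tokens table, keyed by tID.
struct TokenRecord {
    std::string text;                  // token text
    std::uint64_t totalEncounters = 0; // total number of token encounters
};

// One element of a posting list: a document plus the token locations in it.
struct Posting {
    UrlId uid = 0;
    std::vector<std::uint32_t> locations;
};

struct Index {
    std::unordered_map<UrlId, UrlRecord> urls;
    std::unordered_map<TokenId, TokenRecord> tokens;
    std::unordered_map<TokenId, std::vector<Posting>> postings; // reverse index

    // The two global counters.
    std::uint64_t totalDocuments = 0;
    std::uint64_t totalTokens = 0;

    // Example normalizing quantity: mean number of tokens per indexed document.
    double averageDocumentLength() const {
        assert(totalDocuments > 0 || totalTokens == 0); // no tokens without documents
        if (totalDocuments == 0) return 0.0;
        return static_cast<double>(totalTokens) / static_cast<double>(totalDocuments);
    }
};
```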
Crawling queries:
- INSERT into URLs IF UNIQUE uID (increment the global document counter)
- INSERT into Tokens IF UNIQUE tID
- UPDATE Postings by adding ARRAY of {*uID | token_location} items to the existing array for the specified *tID, INSERT if it doesn't exist yet (increment the global token counter)
- DELETE from Postings WHERE *uID MATCHES (the only query that requires an index on *uID) (update the document counter and token counter)
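The four crawling queries might look like this against a simple in-memory index. It is a sketch under several assumptions: all names (`Index`, `insertUrl`, `insertToken`, `addPostings`, `removeDocument`) are hypothetical, the token counter is interpreted as counting every token occurrence, and deleting a document is assumed to also drop its URLs row. Because this sketch keeps no index on uID, the delete has to scan every posting list, which is exactly the cost the uID index is meant to avoid.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using UrlId = std::uint64_t;
using TokenId = std::uint64_t;

struct Posting {
    UrlId uid = 0;
    std::vector<std::uint32_t> locations;
};

struct Index {
    std::unordered_map<UrlId, std::string> urls;     // uID -> URL text (other columns omitted)
    std::unordered_map<TokenId, std::string> tokens; // tID -> token text
    std::unordered_map<TokenId, std::vector<Posting>> postings;
    std::uint64_t totalDocuments = 0;
    std::uint64_t totalTokens = 0;

    // INSERT into URLs IF UNIQUE uID; returns true if a row was added.
    bool insertUrl(UrlId uid, const std::string& url) {
        const bool inserted = urls.emplace(uid, url).second;
        if (inserted) ++totalDocuments;
        return inserted;
    }

    // INSERT into Tokens IF UNIQUE tID; returns true if a row was added.
    bool insertToken(TokenId tid, const std::string& text) {
        return tokens.emplace(tid, text).second;
    }

    // UPDATE Postings: append locations for (tid, uid), creating entries on demand.
    void addPostings(TokenId tid, UrlId uid, const std::vector<std::uint32_t>& locations) {
        std::vector<Posting>& list = postings[tid]; // inserts an empty list if missing
        auto it = std::find_if(list.begin(), list.end(),
                               [uid](const Posting& p) { return p.uid == uid; });
        if (it == list.end()) {
            list.emplace_back();
            list.back().uid = uid;
            it = list.end() - 1;
        }
        it->locations.insert(it->locations.end(), locations.begin(), locations.end());
        totalTokens += locations.size(); // assumption: counter tracks occurrences
    }

    // DELETE from Postings WHERE uID MATCHES, then fix up both counters.
    void removeDocument(UrlId uid) {
        if (urls.erase(uid) == 0) return; // unknown document, nothing to do
        --totalDocuments;
        for (auto& entry : postings) {
            std::vector<Posting>& list = entry.second;
            for (auto it = list.begin(); it != list.end();) {
                if (it->uid == uid) {
                    assert(totalTokens >= it->locations.size());
                    totalTokens -= it->locations.size();
                    it = list.erase(it);
                } else {
                    ++it;
                }
            }
        }
    }
};
```

Posting lists that become empty after a delete are left in place here; a real store would probably remove them as well.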
Search queries:
1.1 For each token in the search query: SELECT tID FROM Tokens MATCH WHERE Token text == search token text
Requires index by token text!
1.2 SELECT uID FROM Postings MATCH WHERE EXISTS tid1 AND tid2 AND ... AND tidN ORDER BY total distance between all tokens
Requires index for tID.
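Steps 1.1 and 1.2 could be sketched as below. The names (`Index`, `tokenIdByText`, `minGap`, `search`) are hypothetical, and "total distance between all tokens" is not defined above, so the sketch assumes one possible reading: for each pair of consecutive query tokens, take the smallest gap between their locations in the document, and sum those gaps. Postings are stored here as a map from uID to locations purely to keep the intersection short.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using UrlId = std::uint64_t;
using TokenId = std::uint64_t;
using Locations = std::vector<std::uint32_t>;
using PostingList = std::map<UrlId, Locations>; // uID -> token locations

struct Index {
    std::unordered_map<std::string, TokenId> tokenIdByText; // the index by token text
    std::unordered_map<TokenId, PostingList> postings;      // the index by tID
};

// Smallest gap between any location in a and any location in b (both non-empty).
std::uint64_t minGap(const Locations& a, const Locations& b) {
    assert(!a.empty() && !b.empty());
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (std::uint32_t x : a) {
        for (std::uint32_t y : b) {
            const std::uint64_t gap = x > y ? x - y : y - x;
            best = std::min(best, gap);
        }
    }
    return best;
}

// Returns uIDs containing every query token, closest-together matches first.
std::vector<UrlId> search(const Index& index, const std::vector<std::string>& queryTokens) {
    // Step 1.1: resolve each query token to its posting list via its tID.
    std::vector<const PostingList*> lists;
    for (const std::string& text : queryTokens) {
        const auto tokenIt = index.tokenIdByText.find(text);
        if (tokenIt == index.tokenIdByText.end()) return {}; // unknown token: no match
        const auto postingIt = index.postings.find(tokenIt->second);
        if (postingIt == index.postings.end()) return {};
        lists.push_back(&postingIt->second);
    }
    if (lists.empty()) return {};

    // Step 1.2: keep documents present in every list and score them.
    std::vector<std::pair<std::uint64_t, UrlId>> scored;
    for (const auto& candidate : *lists.front()) {
        if (candidate.second.empty()) continue;
        std::uint64_t total = 0;
        bool inAll = true;
        const Locations* previous = &candidate.second;
        for (std::size_t i = 1; i < lists.size(); ++i) {
            const auto it = lists[i]->find(candidate.first);
            if (it == lists[i]->end() || it->second.empty()) {
                inAll = false;
                break;
            }
            total += minGap(*previous, it->second);
            previous = &it->second;
        }
        if (inAll) scored.emplace_back(total, candidate.first);
    }
    std::sort(scored.begin(), scored.end()); // ascending distance, ties by uID

    std::vector<UrlId> result;
    for (const auto& entry : scored) result.push_back(entry.second);
    return result;
}
```

The nested loop in `minGap` is quadratic in the number of locations; if location arrays are kept sorted, a two-pointer merge would give the same result in linear time.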