Database tables and queries - VioletGiraffe/cpp-db GitHub Wiki
URLs:
{ *uID -> URL text | *Last fetch date | Number of tokens in this document }
Tokens:
{ *tID -> Token text | Total number of token encounters }
Postings (reverse index):
For each token, a list of URLs where it occurs with some attributes (location on the page etc.)
{ *tID -> [{*uID | [token_location_1, token_location_2, ...]}, ...] }
Also, two global single-integer counters are needed for normalizing the relevancy metrics:
- Total number of documents indexed
- Total number of tokens in all documents indexed
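The three tables and the two counters can be pictured as plain in-memory containers. This is only a sketch of the layout described above, not the actual cpp-db storage format: the type names (`UrlRecord`, `TokenRecord`, `Posting`, `Index`), the integer widths, and the choice of a Unix timestamp for the fetch date are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using UrlId = std::uint64_t;    // uID (hypothetical width)
using TokenId = std::uint64_t;  // tID (hypothetical width)

// One row of the URLs table.
struct UrlRecord {
    std::string urlText;
    std::int64_t lastFetchUnixTime = 0;  // assumed representation of "last fetch date"
    std::uint32_t tokenCount = 0;        // number of tokens in this document
};

// One row of the Tokens table.
struct TokenRecord {
    std::string text;
    std::uint64_t totalEncounters = 0;
};

// One element of a token's posting list: a document and the token's locations in it.
struct Posting {
    UrlId uID = 0;
    std::vector<std::uint32_t> locations;
};

// All three tables plus the two global counters.
struct Index {
    std::unordered_map<UrlId, UrlRecord> urls;
    std::unordered_map<TokenId, TokenRecord> tokens;
    std::unordered_map<TokenId, std::vector<Posting>> postings;  // reverse index
    std::uint64_t totalDocuments = 0;  // total number of documents indexed
    std::uint64_t totalTokens = 0;     // total number of tokens in all documents indexed
};
```

Hash maps keyed by uID and tID stand in for the primary-key lookups; a secondary index (for example, by token text) would be a separate map.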
Crawling queries:
- INSERT into URLs IF UNIQUE uID (increment the global document counter)
- INSERT into Tokens IF UNIQUE tID
- UPDATE Postings by adding an ARRAY of {*uID | token_location} items to the existing array for the specified *tID; INSERT if it doesn't exist yet (increment the global token counter)
- DELETE from Postings WHERE *uID MATCHES (the only query that requires an index on *uID) (update the document counter and the token counter)
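The four crawling queries can be sketched against simple in-memory maps. This is an illustration under stated assumptions, not the cpp-db implementation: the function names (`insertUrl`, `insertToken`, `addPostings`, `removeDocument`) are hypothetical, the extra URL and token columns are omitted for brevity, the token counter is assumed to grow by one per recorded location, and removing a document is assumed to also drop its URL row.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

using Id = std::uint64_t;

struct Posting {
    Id uID = 0;
    std::vector<std::uint32_t> locations;
};

struct Index {
    std::unordered_map<Id, std::string> urls;    // uID -> URL text
    std::unordered_map<Id, std::string> tokens;  // tID -> token text
    std::unordered_map<Id, std::vector<Posting>> postings;
    std::uint64_t totalDocuments = 0;
    std::uint64_t totalTokens = 0;
};

// INSERT into URLs IF UNIQUE uID; bumps the document counter only for new rows.
bool insertUrl(Index& idx, Id uID, const std::string& url) {
    const bool inserted = idx.urls.emplace(uID, url).second;
    if (inserted) ++idx.totalDocuments;
    return inserted;
}

// INSERT into Tokens IF UNIQUE tID.
bool insertToken(Index& idx, Id tID, const std::string& text) {
    return idx.tokens.emplace(tID, text).second;
}

// UPDATE Postings: append locations for (tID, uID), creating entries as needed.
void addPostings(Index& idx, Id tID, Id uID,
                 const std::vector<std::uint32_t>& locations) {
    std::vector<Posting>& list = idx.postings[tID];  // inserts an empty list if tID is new
    auto it = std::find_if(list.begin(), list.end(),
                           [uID](const Posting& p) { return p.uID == uID; });
    if (it == list.end()) {
        list.push_back(Posting{});
        list.back().uID = uID;
        it = std::prev(list.end());
    }
    it->locations.insert(it->locations.end(), locations.begin(), locations.end());
    idx.totalTokens += locations.size();  // assumption: one count per location
}

// DELETE from Postings WHERE uID matches. The full scan stands in for the
// uID index that the real query would need.
void removeDocument(Index& idx, Id uID) {
    std::uint64_t removed = 0;
    for (auto it = idx.postings.begin(); it != idx.postings.end();) {
        std::vector<Posting>& list = it->second;
        for (auto p = list.begin(); p != list.end();) {
            if (p->uID == uID) {
                removed += p->locations.size();
                p = list.erase(p);
            } else {
                ++p;
            }
        }
        it = list.empty() ? idx.postings.erase(it) : std::next(it);
    }
    idx.totalTokens -= removed;
    if (idx.urls.erase(uID) != 0) --idx.totalDocuments;  // assumption: URL row is dropped too
}
```

The linear scan in `removeDocument` shows why the deletion is the one query that benefits from an index on uID: without it, every posting list has to be visited.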
Search queries:
1.1 For each token in the search query: SELECT tID FROM Tokens MATCH WHERE Token text == query token text
Requires an index by token text!
1.2 SELECT uID FROM Postings MATCH WHERE EXISTS tid1 AND tid2 AND ... AND tidN ORDER BY total distance between all tokens
Requires an index on tID.
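The two search steps can be sketched as follows. This is a rough illustration, not the cpp-db query engine. The page does not define "total distance between all tokens", so the sketch assumes one plausible reading: the sum, over consecutive query tokens, of the smallest gap between their locations in the document. The names `Index`, `minGap`, and `search` are hypothetical, and the postings are stored as a nested map (tID, then uID) purely to keep the lookup short.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Id = std::uint64_t;
using Locations = std::vector<std::uint32_t>;

struct Index {
    std::unordered_map<std::string, Id> tokenIdByText;  // the index by token text
    std::unordered_map<Id, std::unordered_map<Id, Locations>> postings;  // tID -> (uID -> locations)
};

// Smallest absolute gap between any location in a and any location in b.
// Returns 0 if either list is empty, so the sum below cannot overflow.
std::uint64_t minGap(const Locations& a, const Locations& b) {
    if (a.empty() || b.empty()) return 0;
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (std::uint32_t x : a)
        for (std::uint32_t y : b)
            best = std::min<std::uint64_t>(best, x > y ? x - y : y - x);
    return best;
}

// Returns the uIDs containing every query token, closest matches first.
std::vector<Id> search(const Index& idx, const std::vector<std::string>& queryTokens) {
    // Step 1.1: resolve each token text to its tID.
    std::vector<Id> tIDs;
    for (const std::string& text : queryTokens) {
        auto it = idx.tokenIdByText.find(text);
        if (it == idx.tokenIdByText.end()) return {};  // unknown token: nothing can match
        tIDs.push_back(it->second);
    }
    if (tIDs.empty()) return {};

    // Step 1.2: keep documents that have a posting for every tID and score them.
    auto first = idx.postings.find(tIDs.front());
    if (first == idx.postings.end()) return {};
    std::vector<std::pair<std::uint64_t, Id>> scored;  // (total distance, uID)
    for (const auto& candidate : first->second) {
        const Id uID = candidate.first;
        const Locations* prev = &candidate.second;
        std::uint64_t total = 0;
        bool hasAll = true;
        for (std::size_t i = 1; i < tIDs.size(); ++i) {
            auto byToken = idx.postings.find(tIDs[i]);
            if (byToken == idx.postings.end()) { hasAll = false; break; }
            auto byUrl = byToken->second.find(uID);
            if (byUrl == byToken->second.end()) { hasAll = false; break; }
            total += minGap(*prev, byUrl->second);
            prev = &byUrl->second;
        }
        if (hasAll) scored.emplace_back(total, uID);
    }
    std::sort(scored.begin(), scored.end());  // ascending distance, ties broken by uID

    std::vector<Id> result;
    for (const auto& s : scored) result.push_back(s.second);
    return result;
}
```

Both lookups in `search` go through a hash map, which mirrors the two index requirements stated above: one by token text for step 1.1 and one by tID for step 1.2.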