database structure - ozzmos/searx GitHub Wiki

Brainstorming for a database that could be used inside searx in the future:

Related to issue #205.

Database-Systems

SQLite3

  • minimalistic, already available inside Python

Peewee

  • can use SQLite as well as MySQL and PostgreSQL
  • more flexible for the admin

Tables

server preferences

We should also load every table at server start: themes, icons, paths, locales, etc.

users

If we have a login (at least for admins), we need a user table.

user

| db column | db options | description |
|---|---|---|
| id | AI, PRIMARY | |
| username | INDEX | |
| password | | salted and hashed |
| is_admin | | a finer-grained rights system would be more flexible |
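A minimal sketch of how the user table and the "salted and hashed" password column could look with the built-in `sqlite3` module. The column types, the index name, the `salt$hash` storage format, PBKDF2 as the hash function and the iteration count are all assumptions for illustration, not decisions made on this page.

```python
import hashlib
import hmac
import os
import sqlite3
from typing import Optional

PBKDF2_ITERATIONS = 100_000  # assumed value, tune for the target hardware


def hash_password(password: str, salt: Optional[bytes] = None) -> str:
    """Return 'salt_hex$hash_hex' computed with PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # fresh random salt for every new password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, PBKDF2_ITERATIONS
    )
    return salt.hex() + "$" + digest.hex()


def verify_password(password: str, stored: str) -> bool:
    """Re-hash the candidate with the stored salt and compare in constant time."""
    salt_hex, _ = stored.split("$")
    candidate = hash_password(password, bytes.fromhex(salt_hex))
    return hmac.compare_digest(candidate, stored)


def create_user_table(conn: sqlite3.Connection) -> None:
    """Create the user table with the four columns listed above."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS user (
            id       INTEGER PRIMARY KEY AUTOINCREMENT,
            username TEXT NOT NULL,
            password TEXT NOT NULL,            -- 'salt$hash', never plain text
            is_admin INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_user_username ON user (username)")
```

Storing the salt next to the hash keeps the table at the four columns listed above; a separate salt column would work just as well.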

Alternatively, we could skip the users table and use only a login/password defined in settings.yml. In any case, no sensitive data would be accessible through the admin page, only configuration and maybe more stats.

Engines

We should have a way of keeping all information about every search engine directly in the database.

| db column | db options | description |
|---|---|---|
| id | AI, PRIMARY | |
| name | INDEX, NOT NULL | |
| engine | NOT NULL | Name of the Python file used |
| shortcut | NOT NULL | Shortcut (bang) for the search engine |
| base_url | | Base URL in case needed |
| number_of_results | | |
| locale | | |
| timeout | | |
| api_key | | |
| url_xpath | | Only for the xpath engine |
| title_xpath | | Only for the xpath engine |
| content_xpath | | Only for the xpath engine |
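As a rough sketch, the engine columns above could be declared and loaded at server start like this with `sqlite3`. The table name `engines`, the column types, the index name and the helper names `open_engine_db` and `load_engines` are assumptions made for the example.

```python
import sqlite3

# Column types are guesses; the page only lists names and options.
ENGINE_SCHEMA = """
CREATE TABLE IF NOT EXISTS engines (
    id                INTEGER PRIMARY KEY AUTOINCREMENT,
    name              TEXT NOT NULL,
    engine            TEXT NOT NULL,   -- name of the Python file used
    shortcut          TEXT NOT NULL,   -- bang for the search engine
    base_url          TEXT,
    number_of_results INTEGER,
    locale            TEXT,
    timeout           REAL,
    api_key           TEXT,
    url_xpath         TEXT,            -- only for the xpath engine
    title_xpath       TEXT,            -- only for the xpath engine
    content_xpath     TEXT             -- only for the xpath engine
);
CREATE INDEX IF NOT EXISTS idx_engines_name ON engines (name);
"""


def open_engine_db(path=":memory:"):
    """Open the database and make sure the engines table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    conn.executescript(ENGINE_SCHEMA)
    return conn


def load_engines(conn):
    """Read every engine once and return a dict keyed by engine name."""
    rows = conn.execute("SELECT * FROM engines").fetchall()
    return {row["name"]: dict(row) for row in rows}
```

Loading everything into a dict once keeps the per-request path free of database access, at the cost of needing a reload when an admin edits an engine.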

stats

How could we best represent the stats?

It would be nice to represent the data as a timeline, but without too much overhead (see #162).

We could log every request with its timestamp, duration and engine (but without the query). That would be precise, but somewhat invasive as far as privacy is concerned, and it could be heavy on the DB.
We could also keep a running mean for every time span we are interested in (hour, day, week, month). For every week, for example, we could have one row per engine, holding the number of queries and the mean time of those queries. Adding a query would be simple: M = (m*n + t)/(n + 1), with M the new mean, m the old one, n the number of queries so far, and t the time of the new query.
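The incremental mean update can be sketched in a few lines. The function name `update_mean` is hypothetical; the arithmetic is the mean over n + 1 values expressed in terms of the previous mean over n values.

```python
def update_mean(old_mean, count, new_time):
    """Return the mean after adding one value to `count` earlier values.

    old_mean: mean of the earlier values (ignored when count is 0)
    count:    number of earlier values
    new_time: the value being added
    """
    return (old_mean * count + new_time) / (count + 1)


def mean_of_stream(times):
    """Fold a sequence of query times into (mean, count) without storing them."""
    mean, count = 0.0, 0
    for t in times:
        mean = update_mean(mean, count, t)
        count += 1
    return mean, count
```

Only the current mean and the counter have to be stored, so no individual request needs to be logged.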

| db column | db options | description |
|---|---|---|
| id | AI, PRIMARY | ID of the time span |
| timestamp | | Precise way of determining the time |
| readable | | A readable format of the time span like '{{Week}} 26' |
| span | | A way of defining the time span chosen (allows varying the time span in the preferences without losing old stats) |

| db column | db options | description |
|---|---|---|
| id_timespan | FK, PRIMARY | Identifies the time span |
| id_engine | FK, PRIMARY | Identifies the engine |
| mean_time | | Mean time of a query |
| mean_nb_result | | Mean number of results |
| mean_score | | Mean score |
| mean_score_result | | Mean score per result |
| nb_query | | Number of queries |
| nb_error | | Number of errors |
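A possible shape for the two stats tables in `sqlite3`, with the composite primary key on (id_timespan, id_engine). The table names `stats_timespan` and `stats_engine`, the column types and the helper `record_query` are assumptions, and the other mean_* columns are left out to keep the sketch short; they would be updated the same way as mean_time.

```python
import sqlite3

STATS_SCHEMA = """
CREATE TABLE IF NOT EXISTS stats_timespan (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp INTEGER NOT NULL,   -- start of the span, e.g. a Unix timestamp
    readable  TEXT,
    span      TEXT                -- e.g. 'hour', 'day', 'week', 'month'
);
CREATE TABLE IF NOT EXISTS stats_engine (
    id_timespan INTEGER NOT NULL REFERENCES stats_timespan (id),
    id_engine   INTEGER NOT NULL, -- would reference the engines table
    mean_time   REAL    NOT NULL DEFAULT 0,
    nb_query    INTEGER NOT NULL DEFAULT 0,
    nb_error    INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (id_timespan, id_engine)
);
"""


def record_query(conn, id_timespan, id_engine, query_time, error=False):
    """Fold one query into the (time span, engine) row, creating it if needed."""
    row = conn.execute(
        "SELECT mean_time, nb_query FROM stats_engine "
        "WHERE id_timespan = ? AND id_engine = ?",
        (id_timespan, id_engine),
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO stats_engine "
            "(id_timespan, id_engine, mean_time, nb_query, nb_error) "
            "VALUES (?, ?, ?, 1, ?)",
            (id_timespan, id_engine, query_time, int(error)),
        )
        return
    mean_time, nb_query = row
    # Mean over nb_query + 1 values, computed from the stored mean.
    new_mean = (mean_time * nb_query + query_time) / (nb_query + 1)
    conn.execute(
        "UPDATE stats_engine "
        "SET mean_time = ?, nb_query = nb_query + 1, nb_error = nb_error + ? "
        "WHERE id_timespan = ? AND id_engine = ?",
        (new_mean, int(error), id_timespan, id_engine),
    )
```

With this layout the table grows by one row per engine and time span, regardless of how many queries are served.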

https_rewrite

If we specify only one url_pattern for every possible URL, we can use a database lookup instead of regex matching. This should improve speed considerably, especially for larger datasets.

I think the most efficient implementation would be an n:m representation. We can cache the rewrite_rules inside Python and use the database to find out which rewrite_rules have to be applied.

https_urls

| db column | db options | description |
|---|---|---|
| id | AI, PRIMARY | |
| url_pattern | UNIQUE, INDEX | the wildcard is stored inside the database; how to handle this? |

https_url_to_rewrite

| db column | db options | description |
|---|---|---|
| id | AI, PRIMARY | |
| url_id | INDEX | |
| rewrite_id | | |

https_rewrite_rules

| db column | db options | description |
|---|---|---|
| id | AI, PRIMARY | |
| rewrite_from | | |
| rewrite_to | | |
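A sketch of the n:m lookup with `sqlite3`: the rewrite rules are compiled once and cached in Python, and the database only answers which rule ids apply to a host. The wildcard question is still open on this page; the sketch assumes one possible answer, namely that patterns are either an exact host or of the form `*.example.com`, so that every candidate can be found by an exact, indexed match. The helper names and the use of Python regex syntax in rewrite_from/rewrite_to are also assumptions.

```python
import re
import sqlite3
from urllib.parse import urlparse

HTTPS_SCHEMA = """
CREATE TABLE https_urls (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url_pattern TEXT UNIQUE            -- UNIQUE also creates an index
);
CREATE TABLE https_rewrite_rules (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    rewrite_from TEXT,
    rewrite_to TEXT
);
CREATE TABLE https_url_to_rewrite (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    url_id INTEGER,
    rewrite_id INTEGER
);
CREATE INDEX idx_url_to_rewrite_url ON https_url_to_rewrite (url_id);
"""


def candidate_patterns(host):
    """Exact host plus '*.suffix' for every parent domain (assumed convention)."""
    parts = host.split(".")
    wildcards = ["*." + ".".join(parts[i:]) for i in range(1, len(parts) - 1)]
    return [host] + wildcards


def load_rule_cache(conn):
    """Compile every rewrite rule once; keyed by rule id."""
    rows = conn.execute("SELECT id, rewrite_from, rewrite_to FROM https_rewrite_rules")
    return {rid: (re.compile(frm), to) for rid, frm, to in rows}


def rewrite(conn, cache, url):
    """Return the rewritten URL, or the URL unchanged if no rule matches."""
    host = urlparse(url).hostname or ""
    patterns = candidate_patterns(host)
    placeholders = ",".join("?" * len(patterns))
    rows = conn.execute(
        "SELECT m.rewrite_id FROM https_urls u "
        "JOIN https_url_to_rewrite m ON m.url_id = u.id "
        "WHERE u.url_pattern IN (%s) ORDER BY m.id" % placeholders,
        patterns,
    ).fetchall()
    for (rule_id,) in rows:
        regex, target = cache[rule_id]
        new_url, hits = regex.subn(target, url, count=1)
        if hits:
            return new_url
    return url
```

Only the handful of rules linked to the host are tried, instead of running every regex against every URL.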

spellchecker

spellchecker

This is an implementation of a spellchecker with a lookup time of O(1), because it uses precalculated words. Since we only search for exact matches of these strings, we can use hash tables to reach O(1) lookups.

The technique is described in the faroo blog.

The disadvantage is the large disk consumption caused by precalculating the variants with up to 2 deleted characters for every word, which multiplies the size of the database by a factor of 20 or more, depending on the length of the words. But I think the O(1) runtime is much more important, especially as the number of requests per second or the number of database entries grows.

| db column | db options | description |
|---|---|---|
| id | AI, PRIMARY | |
| precalc | INDEX USING HASH | precalculated word |
| correct | | |
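A small sketch of the precalculated-deletes idea on top of the table above, using `sqlite3`. SQLite only offers B-tree indexes, so the hash index is replaced by a plain index here; the function names and the choice to apply the same deletes to the search term are assumptions for illustration. The result is a candidate list that would still need ranking or an edit-distance check.

```python
import sqlite3


def deletes(word, max_deletes=2):
    """Return word plus every non-empty string with up to max_deletes chars removed."""
    variants = {word}
    frontier = {word}
    for _ in range(max_deletes):
        # Remove one more character from every string of the previous round.
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    variants.discard("")
    return variants


def build_spellchecker(conn, dictionary):
    """Precalculate the delete variants of every dictionary word."""
    conn.execute(
        "CREATE TABLE spellchecker ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, precalc TEXT, correct TEXT)"
    )
    conn.execute("CREATE INDEX idx_spellchecker_precalc ON spellchecker (precalc)")
    conn.executemany(
        "INSERT INTO spellchecker (precalc, correct) VALUES (?, ?)",
        [(variant, word) for word in dictionary for variant in deletes(word)],
    )


def suggest(conn, term, max_deletes=2):
    """Exact-match the term and its delete variants against precalc."""
    candidates = sorted(deletes(term, max_deletes))
    placeholders = ",".join("?" * len(candidates))
    rows = conn.execute(
        "SELECT DISTINCT correct FROM spellchecker "
        "WHERE precalc IN (%s)" % placeholders,
        candidates,
    )
    return sorted(row[0] for row in rows)
```

Every lookup is an exact string match, which is what makes a hash index usable on databases that support one.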