database structure
Brainstorming for a database which could be used inside searx in the future, regarding Issue #205.
Database-Systems
SQLite3
- minimalistic, already included in Python's standard library (`sqlite3` module)
Peewee
- can use SQLite as well as MySQL and PostgreSQL
- more flexible for the admin
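As a rough illustration of the Peewee option, here is a minimal sketch of how one of the tables below could be declared so the same model code runs against SQLite, MySQL or PostgreSQL. The model and field names are only examples, not a final schema.

```python
# Minimal sketch: a Peewee model that works unchanged on SQLite, MySQL or
# PostgreSQL by swapping the database object. Model and field names are
# illustrative, not a final schema.
from peewee import SqliteDatabase, Model, CharField, FloatField

db = SqliteDatabase('searx.db')  # or MySQLDatabase(...) / PostgresqlDatabase(...)


class BaseModel(Model):
    class Meta:
        database = db


class Engine(BaseModel):
    name = CharField(index=True)
    engine = CharField()            # name of the python engine file
    shortcut = CharField()          # bang shortcut
    timeout = FloatField(null=True)


db.connect()
db.create_tables([Engine])
```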
Tables
server preferences
We should also have every table loaded at server start: themes, icons, paths, locales, etc.
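A minimal sketch of reading such preference tables into memory once at server start; the `preferences` table and its key/value columns are hypothetical.

```python
# Sketch: read preference tables into memory once when the server starts.
# The `preferences` table and its key/value columns are hypothetical.
import sqlite3


def load_preferences(db_path='searx.db'):
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute('SELECT key, value FROM preferences').fetchall()
    finally:
        conn.close()
    return dict(rows)

# called once at startup, e.g.: settings = load_preferences()
```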
users
If we have a login (at least for admins), we need a user table.
user
db column | db options | description |
---|---|---|
id | AI, PRIMARY | |
username | INDEX | |
password | | salted and hashed |
is_admin | | a finer-grained rights system would be more flexible |
Or we could bypass the users table and use only a login/password defined in settings.yml. In any case, there would be no sensitive data accessible through the admin page, only configuration and maybe more stats.
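For the `password` column, a minimal sketch of salted-and-hashed storage using only the standard library; the iteration count and the `hexsalt$hexdigest` storage format are assumptions.

```python
# Sketch: salted + hashed value for the `password` column, standard library
# only. Iteration count and the "hexsalt$hexdigest" format are assumptions.
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100000):
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode('utf-8'), salt, iterations)
    return salt.hex() + '$' + digest.hex()


def verify_password(password, stored):
    salt_hex, _ = stored.split('$')
    return hmac.compare_digest(hash_password(password, bytes.fromhex(salt_hex)), stored)
```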
Engines
We should have a way of storing all information about every search engine directly in the database (a loading sketch follows the table below).
db column | db options | description |
---|---|---|
id | AI, PRIMARY | |
name | INDEX, NOT NULL | |
engine | NOT NULL | Name of the python file used |
shortcut | NOT NULL | Shortcut (bang) for the search engine |
base_url | | Base URL, in case one is needed |
number_of_results | | |
locale | | |
timeout | | |
api_key | | |
url_xpath | | Only for the xpath engine |
title_xpath | | Only for the xpath engine |
content_xpath | | Only for the xpath engine |
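A sketch of how these rows could replace the engines section of settings.yml at startup, building one settings dict per engine. The row-to-dict mapping and the dropping of NULL columns are assumptions.

```python
# Sketch: build one settings dict per engine from the table above, mirroring
# what the engines section of settings.yml provides today. Dropping NULL
# columns so that engine defaults still apply is an assumption.
import sqlite3


def load_engine_settings(db_path='searx.db'):
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    engines = []
    for row in conn.execute('SELECT * FROM engines'):
        engines.append({key: row[key] for key in row.keys() if row[key] is not None})
    conn.close()
    return engines
```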
stats
How could we best represent the stats?
It would be cool if we could show the data on a timeline, but without too much overhead. #162
We could log every request with its timestamp, duration and engine (but without the query). That would be precise, but a bit invasive where privacy is concerned, and could be heavy on the DB.
We could also log a weighted mean for every time span we like (hour, day, week, month). For every week, we could have one row per engine, counting the number of queries and the mean time of those queries. Adding a query would be simple: M = (m*n + t)/(n+1), with M the new mean, m the old one, n the number of previous queries, and t the time of the new query.
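A small sketch of that incremental update (the function name is only for illustration):

```python
# Incremental update of the stored mean: M = (m*n + t) / (n + 1)
def update_mean(old_mean, n, t):
    """New mean after adding a query of duration t to n queries averaging old_mean."""
    return (old_mean * n + t) / (n + 1)

# example: 9 queries averaging 0.5s, a new query takes 1.5s -> new mean 0.6s
assert update_mean(0.5, 9, 1.5) == 0.6
```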
db column | db options | description |
---|---|---|
id | AI, PRIMARY | ID of the timespan |
timestamp | | Precise way of determining the time |
readable | | A readable format of the time span, like 'Week 26' |
span | | A way of defining the chosen time span (allows varying the time span in the preferences without losing old stats) |

db column | db options | description |
---|---|---|
id_timespan | FK, PRIMARY | Identify the time span |
id_engine | FK, PRIMARY | Identify the engine |
mean_time | | Mean time of a query |
mean_nb_result | | Mean number of results |
mean_score | | Mean score |
mean_score_result | | Mean score per result |
nb_query | | Number of queries |
nb_error | | Number of errors |
https_rewrite
If we specify only one url_pattern for every possible URL, we can use a database lookup instead of regex matching. This should improve speed considerably, especially for larger rule sets.
I think the most efficient implementation would be an n:m representation. We can cache the rewrite_rules inside Python and use the database to find out which rewrite_rules have to be applied (see the sketch after the table definitions below).
https_urls
db column | db options | description |
---|---|---|
id | AI, PRIMARY | |
url_pattern | UNIQUE, INDEX | the wildcard is stored inside the database, how do we handle this? |
https_url_to_rewrite
db column | db options | description |
---|---|---|
id | AI, PRIMARY | |
url_id | INDEX | |
rewrite_id | | |
https_rewrite_rules
db column | db options | description |
---|---|---|
id | AI, PRIMARY | |
rewrite_from | | |
rewrite_to | | |
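A sketch of the n:m lookup described above: the rewrite rules are cached in Python at startup, and the database is only asked which rule ids apply to a given host. Exact matching on `url_pattern` is an assumption here, since wildcard handling is still an open question.

```python
# Sketch of the n:m lookup: rules cached in Python, the database only answers
# which rule ids apply to a host. Exact matching on url_pattern is an
# assumption, since wildcard handling is still an open question.
import re
import sqlite3

conn = sqlite3.connect('searx.db')

# cache id -> (compiled from-regex, to-pattern) once at startup
rewrite_rules = {
    rule_id: (re.compile(rewrite_from), rewrite_to)
    for rule_id, rewrite_from, rewrite_to
    in conn.execute('SELECT id, rewrite_from, rewrite_to FROM https_rewrite_rules')
}


def rewrite(url, host):
    rows = conn.execute(
        'SELECT m.rewrite_id'
        ' FROM https_urls u'
        ' JOIN https_url_to_rewrite m ON m.url_id = u.id'
        ' WHERE u.url_pattern = ?', (host,))
    for (rule_id,) in rows:
        pattern, replacement = rewrite_rules[rule_id]
        url = pattern.sub(replacement, url)
    return url
```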
spellchecker
This is the implementation of a spellchecker with a lookup time of O(1), because we use precalculated words. Since we search for exact matches of these strings, we can use hash tables to bring the lookup down to O(1).
The technique is described in the faroo blog.
The disadvantage is the huge disk consumption caused by precalculating variants with up to 2 deleted characters for every word, which multiplies the size of the database by a factor of 20 or more, depending on the length of the words. But I think the O(1) lookup is much more important, especially as the number of requests per second or the number of database entries grows.
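A small in-memory sketch of the precalculation idea from the faroo blog (delete-based, SymSpell-style): every variant of a dictionary word with up to 2 characters deleted is stored, so correcting a term is a handful of exact hash lookups. In the table below, `precalc` would hold the variant and `correct` the original word; the dict here stands in for that table.

```python
# In-memory sketch of the delete-based precalculation: `index` plays the role
# of the table below (precalc -> correct words).
def deletes(word, max_distance=2):
    """All variants of `word` with up to `max_distance` characters removed."""
    variants = {word}
    frontier = {word}
    for _ in range(max_distance):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        variants |= frontier
    return variants


def build_index(dictionary_words):
    index = {}
    for word in dictionary_words:
        for variant in deletes(word):
            index.setdefault(variant, set()).add(word)
    return index


def lookup(term, index):
    """Dictionary words whose delete-variants overlap the term's delete-variants."""
    candidates = set()
    for variant in deletes(term):
        candidates |= index.get(variant, set())
    return candidates


index = build_index(['search', 'engine'])
print(lookup('serch', index))   # {'search'}
```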
db column | db options | description |
---|---|---|
id | AI, PRIMARY | |
precalc | INDEX USING HASH | precalculated word |
correct | | |