DB Operation Timing - CameronD73/529renew GitHub Wiki
Timing the import of a 529 export: 2,581 rows into IDalias.
The default commit mode is autocommit after each row (no explicit transactions).
Commit mode:
- default is one insert at a time, with autocommit after each - test is idalias fill only (0.45MB)
- "1x1" is single-row inserts, all in a single transaction - test is idalias and ibdsegs (5.8MB total)
- "1 x/1000" is a commit every 1000 inserts - test is idalias and ibdsegs (5.8MB total, 29k rows of segs)
- Journal mode can be: DELETE (default) | TRUNCATE | PERSIST | MEMORY | OFF. (WAL mode is not supported in WASM)
- Synch mode can be: EXTRA | FULL (default) | NORMAL | OFF
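The commit modes above boil down to where the transaction boundaries fall. As a point of reference, here is a minimal sketch using Python's stdlib `sqlite3` module (not the WASM build the extension actually uses); the table layout and function name are illustrative assumptions, not the project's code:

```python
import sqlite3
import time

def timed_import(rows, dbpath, journal="DELETE", synch="FULL", batch=None):
    """Time inserting `rows` into idalias, one INSERT per row.

    batch=None reproduces the default mode (autocommit after each row);
    batch=N commits every N inserts ("1 x/N"); a batch larger than
    len(rows) gives the single-transaction "1x1" mode.
    """
    con = sqlite3.connect(dbpath, isolation_level=None)  # autocommit mode
    con.execute(f"PRAGMA journal_mode={journal}")
    con.execute(f"PRAGMA synchronous={synch}")
    con.execute("CREATE TABLE IF NOT EXISTS idalias (id INTEGER, alias TEXT)")
    t0 = time.perf_counter()
    if batch:
        con.execute("BEGIN")
    for i, row in enumerate(rows, 1):
        con.execute("INSERT INTO idalias VALUES (?, ?)", row)
        if batch and i % batch == 0:
            con.execute("COMMIT")
            con.execute("BEGIN")
    if batch:
        con.execute("COMMIT")
    elapsed = time.perf_counter() - t0
    con.close()
    return elapsed
```

The key point is that with autocommit every row pays the full journal-and-sync cost, which is why the default mode dominates the timings below.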
Elapsed time in seconds:

commit mode | Journal mode | synch:EXTRA | synch:FULL | synch:NORMAL | synch:OFF |
---|---|---|---|---|---|
default | DELETE (dflt) | 129 | 129 | 131 | 114 |
default | PERSIST | - | 126 | - | 107 |
default | TRUNCATE | - | - | - | - |
default | MEMORY | - | 14 | - | 10 |
1 x 1 | DELETE (dflt) | - | 2 | - | 3 |
1 x 1 | PERSIST | - | 3 | - | 4 |
1 x 1 | MEMORY | - | 2 | - | - |
1 x/1000 | DELETE (dflt) | - | 52 | - | - |
1 x/1000 | MEMORY | - | 4 | - | - |
1 x/10000 | DELETE (dflt) | - | 7 | - | - |
1 x/10000 | MEMORY | - | 3 | - | - |
Unfortunately, having done all this, I later discovered that all PRAGMA statements are blocked in the API, so we are stuck with the slowest process. I initially thought that I could set WAL mode independently, but the developers have decided that the API will refuse to open a db in WAL mode. I could try to bundle multiple statements in a transaction, but in most, if not all, cases that is not worth the effort.
The process of importing the old-data csv file is exceptionally slow, so I did a few experiments:
- processed the csv with awk to generate an sql script matching the plan for the javascript,
- then ran it through the sqlite3 program,
- editing the script to change journalling, etc.
- the db is zeroed and vacuumed at the start of each run.
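The awk step just mechanically turns csv rows into INSERT statements with transaction boundaries. A rough equivalent in Python (the column layout and table name here are assumptions for illustration, not the actual export format):

```python
import csv

def csv_to_sql(csv_path, sql_path, batch=1000):
    """Turn an export csv into an sql script of INSERTs, with a COMMIT
    every `batch` rows, so journalling and batching can be varied by
    editing the script before feeding it to the sqlite3 program."""
    with open(csv_path, newline="") as src, open(sql_path, "w") as out:
        out.write("BEGIN;\n")
        for i, row in enumerate(csv.reader(src), 1):
            # hypothetical column layout: name, chromosome, start, end
            name, chrom, seg_start, seg_end = row[:4]
            out.write(
                "INSERT INTO ibdsegs (name, chromosome, seg_start, seg_end) "
                f"VALUES ('{name}', {chrom}, {seg_start}, {seg_end});\n")
            if i % batch == 0:
                out.write("COMMIT;\nBEGIN;\n")
        out.write("COMMIT;\n")
```

The generated script can then be piped through `sqlite3 test.db < out.sql`, prepending whatever PRAGMA lines the run calls for.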
My test export file (from 529 and you, V3.3.0) is 7,929 lines (segments) - no chromosome 100 data. This is from 1,461 individuals.
The DB build has two phases:
- entry of the `idalias` table, simple inserts - 1,460 rows;
- inserting the segments, where each insert contains two subqueries on the `idalias` table - 7,922 rows.
- IO rate stats are very approximate: determined by watching the value shown in task manager and guessing an average.
- IO load likewise, from viewing task manager - it is approaching being I/O bound on a hard drive (non-RAID partition).
- the final DB size is 1.4MB.
- total IO volume is very approximate (IO rate times elapsed time).
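The two-subquery shape of the phase 2 inserts is what makes them so much slower per row than phase 1. A hedged sketch of that shape against Python's stdlib `sqlite3` (table and column names are assumptions; `start`/`end` are renamed `seg_start`/`seg_end` here to sidestep SQL keywords):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE idalias (idtext TEXT PRIMARY KEY, rowidkey INTEGER);
CREATE TABLE ibdsegs (id1 INTEGER, id2 INTEGER, chromosome INTEGER,
                      seg_start INTEGER, seg_end INTEGER);
INSERT INTO idalias VALUES ('A', 1), ('B', 2);
""")
# Each phase 2 insert resolves both text IDs through subqueries on
# idalias before the row can be written.
con.execute("""
INSERT INTO ibdsegs (id1, id2, chromosome, seg_start, seg_end)
VALUES ((SELECT rowidkey FROM idalias WHERE idtext = ?),
        (SELECT rowidkey FROM idalias WHERE idtext = ?),
        ?, ?, ?)""", ("A", "B", 7, 1000, 2000))
```

With an index (or primary key) on the lookup column the subqueries are cheap individually, but they still run twice per inserted segment.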
commit mode | Journal mode | synchronous | phase 1 time (seconds) | phase 2 time (seconds) | phase 1 IO rate (MB/s) | phase 2 IO rate (MB/s) | IO queue load | IO vol (MBytes) |
---|---|---|---|---|---|---|---|---|
default | DELETE (dflt) | full (dflt) | 110 | 750 | 1.3 | 1.4 | 87% | 1200 |
default | PERSIST | full | 140 | 810 | 1.0 | 1.2 | 85% | 1200 |
large blocks | WAL | full | <1 | 1 | too small | too small | too small | 1.5? |
large blocks | DELETE | full | <1 | 1 | too small | too small | too small | 1.5? |
default | MEMORY | full | 30 | 260 | 1.0 | 1.6 | 83% | 450 |
default | WAL | full | 32 | 190 | 1.2 | 2.0 | 80% | 420 |
default | WAL | OFF | 2 | 21 | 1.0 | 1.6 | 83% | 36 |
commit mode | Journal mode | synchronous | phase 1 time (seconds) | phase 2 time (seconds) | phase 1 IO rate (MB/s) | phase 2 IO rate (MB/s) | IO queue load | IO vol (MBytes) |
---|---|---|---|---|---|---|---|---|
default | DELETE (dflt) | full (dflt) | 15 | 97 | 13 | 13 | 20, peak at 50% | 1500 |
default | MEMORY | full | 4 | 28 | 13 | 13 | 32% | 410 |
default | WAL | full | 4 | 32 | 13 | 13 | 25% | 460 |
default | WAL | OFF | 2 | 22 | too small | <1.0 | 0% | <11 |
I doubt the various compound indexes make enough difference to warrant inclusion. This test compares size and timing results for the same dataset under various schemas.
The test query was the most complex one in the code - a 4-way join, using 'idalias' twice and 'ibdsegs' twice - listing all overlapping segments for a given match segment.
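The exact SQL of that query is not shown in the wiki; the following is a hedged reconstruction of its shape against the integer-key schema, using Python's stdlib `sqlite3` (table and column names are assumptions, with `start`/`end` renamed `seg_start`/`seg_end`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE idalias (rowidkey INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE ibdsegs (id1 INTEGER, id2 INTEGER, chromosome INTEGER,
                      seg_start INTEGER, seg_end INTEGER);
INSERT INTO idalias VALUES (1, 'me'), (2, 'alice'), (3, 'bob');
INSERT INTO ibdsegs VALUES (1, 2, 7, 1000, 5000),   -- the match segment
                           (1, 3, 7, 3000, 9000),   -- overlaps it
                           (1, 3, 8, 1000, 5000);   -- other chromosome
""")
# Four-way join: ibdsegs twice (the given segment vs candidates) and
# idalias twice (to resolve both IDs of each candidate to names).
OVERLAP_SQL = """
SELECT a1.name, a2.name, s2.chromosome, s2.seg_start, s2.seg_end
FROM ibdsegs AS s1
JOIN ibdsegs AS s2 ON s2.chromosome = s1.chromosome
                  AND s2.seg_start < s1.seg_end
                  AND s2.seg_end > s1.seg_start
                  AND s2.rowid <> s1.rowid
JOIN idalias AS a1 ON a1.rowidkey = s2.id1
JOIN idalias AS a2 ON a2.rowidkey = s2.id2
WHERE s1.rowid = ?
"""
overlaps = con.execute(OVERLAP_SQL, (1,)).fetchall()
```

The overlap condition (`start < other.end AND end > other.start`) restricted to the same chromosome is the part the compound `(chromosome, ...)` indices below are meant to serve.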
Indices on ibdsegs originally were:
- id1 (compound index on id1_1 and id1_2, the two halves of the big integer)
- id2
- chromosome
- chromosome, id1
- chromosome, id2
- id1, id2
- unique on id1, id2, chromosome, start, end, build - this is probably just to enforce a uniqueness constraint
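Spelled out as DDL, that index list looks roughly like the following sketch against the split-ID schema (exact names are assumptions; it also assumes id2 is split the same way as id1, and renames `start`/`end` to `seg_start`/`seg_end`):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE ibdsegs (
    id1_1 INTEGER, id1_2 INTEGER,   -- two halves of the big-integer ID
    id2_1 INTEGER, id2_2 INTEGER,   -- assuming id2 is split the same way
    chromosome INTEGER, seg_start INTEGER, seg_end INTEGER, build INTEGER);
CREATE INDEX ix_id1     ON ibdsegs (id1_1, id1_2);
CREATE INDEX ix_id2     ON ibdsegs (id2_1, id2_2);
CREATE INDEX ix_chr     ON ibdsegs (chromosome);
CREATE INDEX ix_chr_id1 ON ibdsegs (chromosome, id1_1, id1_2);
CREATE INDEX ix_chr_id2 ON ibdsegs (chromosome, id2_1, id2_2);
CREATE INDEX ix_id1_id2 ON ibdsegs (id1_1, id1_2, id2_1, id2_2);
CREATE UNIQUE INDEX ux_seg ON ibdsegs
    (id1_1, id1_2, id2_1, id2_2, chromosome, seg_start, seg_end, build);
""")
```

Each extra index adds to both DB size and insert cost, which is what the "all" vs "min" comparison below measures.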
Schema | indices | DB size (MBytes) | Big Join time, millisec (min/max) |
---|---|---|---|
V2 | all | 3.22 | 11/47 |
V2 | min | 2.25 | 13/42 |
V3 int key | all | 1.31 | 9/38 |
V3 int key | min | 1.02 | 11/40 |
V3 textkey | all | 2.42 | 8/35 |
V3 textkey | min | 1.68 | 10/41 |
Using the text ID is faster than the old split-integer version (presumably because the split version requires extra comparisons).
Using the text ID is no different in speed from using a single intermediate integer key, so the integer key's size advantage is not enough to justify its added complexity.