DB Operation Timing - CameronD73/529renew GitHub Wiki

Timing with Wasm

Timing the import of a 529 export: 2581 rows into IDalias.

The default commit mode is autocommit after each row (no explicit transactions).

  • Commit mode (a sketch of the three strategies follows this list):
    • "default" is insert one row at a time, with autocommit after each - test is idalias fill only (0.45MB)
    • "1 x 1" is single-row inserts, all within a single transaction - test is idalias and ibdsegs (5.8MB total)
    • "1 x/1000" and "1 x/10000" are a commit every 1000 or 10000 inserts respectively - test is idalias and ibdsegs (5.8MB total, 29k rows of segs)
  • Journal mode can be: DELETE (default) | TRUNCATE | PERSIST | MEMORY | OFF. (WAL mode is not supported in WASM.)
  • Synch mode can be: EXTRA | FULL (default) | NORMAL | OFF
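Written as plain SQL, the three commit strategies correspond roughly to the following (the `idalias` table and its columns are shown with illustrative names only, not the actual schema):

```sql
CREATE TABLE IF NOT EXISTS idalias (idtext TEXT, alias TEXT);  -- illustrative schema

-- "default": no explicit transaction, so every INSERT autocommits and the
-- journal/database is flushed once per row.
INSERT INTO idalias (idtext, alias) VALUES ('A123', 'Match 1');
INSERT INTO idalias (idtext, alias) VALUES ('B456', 'Match 2');

-- "1 x 1": still one row per INSERT, but the whole import sits inside a
-- single explicit transaction, so there is only one flush at COMMIT.
BEGIN;
INSERT INTO idalias (idtext, alias) VALUES ('A123', 'Match 1');
-- ... remaining rows ...
COMMIT;

-- "1 x/1000" (and "1 x/10000"): commit and restart the transaction
-- every 1000 (or 10000) inserts.
BEGIN;
-- ... 1000 INSERTs ...
COMMIT;
BEGIN;
-- ... next 1000 INSERTs ...
COMMIT;
```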
DB import timings to OPFS (Windows 10 x64), idalias (450kB)

| commit mode | Journal mode  | synch: EXTRA | synch: FULL | synch: NORMAL | synch: OFF |
|-------------|---------------|--------------|-------------|---------------|------------|
| default     | DELETE (dflt) | 129          | 129         | 131           | 114        |
| default     | PERSIST       | -            | 126         | -             | 107        |
| default     | TRUNCATE      | -            | -           | -             | -          |
| default     | MEMORY        | -            | 14          | -             | 10         |
| 1 x 1       | DELETE (dflt) | -            | 2           | -             | 3          |
| 1 x 1       | PERSIST       | -            | 3           | -             | 4          |
| 1 x 1       | MEMORY        | -            | 2           | -             | -          |
| 1 x/1000    | DELETE (dflt) | -            | 52          | -             | -          |
| 1 x/1000    | MEMORY        | -            | 4           | -             | -          |
| 1 x/10000   | DELETE (dflt) | -            | 7           | -             | -          |
| 1 x/10000   | MEMORY        | -            | 3           | -             | -          |

Journalling and Synch options - sqlite3 executable

Unfortunately, having done all this, I later discovered that all PRAGMA statements are blocked in the API, so we are stuck with the slowest process. I initially thought that I could set WAL mode independently, but the developers (words fail me) have decided that they will refuse to open a db in WAL mode. I could try to bundle multiple statements into a transaction, but in most (if not all) cases that is not worth the effort.

The process of importing the old data csv file is exceptionally slow, so I did a few experiments.

  1. Processed the csv through awk to generate an sql script matching the plan for the javascript,
  2. then ran it through the sqlite3 program,
  3. editing the script between runs to change journalling, synchronous mode, etc.
  4. The db is zeroed and vacuumed at the start of each run.
The default mode is journal_mode=delete, synchronous=full. The default for the webkit implementation is a commit after every statement, i.e. autocommit.
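For reference, a minimal sketch of how such a generated script might begin, with the pragmas under test at the top (the table and column names, and the particular pragma values shown, are illustrative only):

```sql
-- Pragmas edited between runs; sqlite3 defaults are journal_mode=DELETE, synchronous=FULL.
PRAGMA journal_mode = WAL;      -- DELETE | TRUNCATE | PERSIST | MEMORY | WAL | OFF
PRAGMA synchronous = NORMAL;    -- EXTRA | FULL | NORMAL | OFF

CREATE TABLE IF NOT EXISTS idalias (idtext TEXT, alias TEXT);  -- illustrative schema

-- "default" commit mode: each INSERT is its own statement, so sqlite3
-- autocommits after every row.
INSERT INTO idalias (idtext, alias) VALUES ('A123', 'Match 1');
INSERT INTO idalias (idtext, alias) VALUES ('B456', 'Match 2');
-- ... remaining rows generated by the awk script ...
```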

My test export file (from 529 and you, V3.3.0) is 7,929 lines (segments) - no chromosome 100 data. This is from 1,461 individuals.

The DB build is two phases:

  1. filling the `idalias` table with simple inserts - 1460 rows;
  2. inserting the segments, where each insert contains two subqueries on the `idalias` table (see the sketch after these notes) - 7922 rows.
Notes on the measurements:

  • IO rate stats are very approximate, determined by watching the value shown by task manager and guessing an average.
  • IO load is likewise from viewing task manager - it is approaching being I/O bound on the hard drive (non-RAID partition).
  • The final DB size is 1.4MB.
  • Total IO volume is very approximate (IO rate times elapsed time).
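A hedged sketch of the phase-2 insert pattern, assuming illustrative table and column names and that the `idalias` rows already exist from phase 1:

```sql
-- Each segment row resolves both participant ids via subqueries on idalias.
-- Column names (idtext, segstart, segend, etc.) are illustrative only.
INSERT INTO ibdsegs (id1, id2, chromosome, segstart, segend, build)
VALUES (
    (SELECT rowid FROM idalias WHERE idtext = 'A123'),
    (SELECT rowid FROM idalias WHERE idtext = 'B456'),
    7, 12345678, 23456789, 37
);
```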
DB import timings to HDD (Windows 10 x64)

| commit mode  | Journal mode  | synchronous | phase 1 time (s) | phase 2 time (s) | phase 1 IO rate (MB/s) | phase 2 IO rate (MB/s) | IO queue load | IO vol (MB) |
|--------------|---------------|-------------|------------------|------------------|------------------------|------------------------|---------------|-------------|
| default      | DELETE (dflt) | full (dflt) | 110              | 750              | 1.3                    | 1.4                    | 87%           | 1200        |
| default      | PERSIST       | full        | 140              | 810              | 1.0                    | 1.2                    | 85%           | 1200        |
| large blocks | WAL           | full        | <1               | 1                | too small              | too small              | too small     | 1.5?        |
| large blocks | DELETE        | full        | <1               | 1                | too small              | too small              | too small     | 1.5?        |
| default      | MEMORY        | full        | 30               | 260              | 1.0                    | 1.6                    | 83%           | 450         |
| default      | WAL           | full        | 32               | 190              | 1.2                    | 2.0                    | 80%           | 420         |
| default      | WAL           | OFF         | 2                | 21               | 1.0                    | 1.6                    | 83%           | 36          |
DB import timings to SSD (M.2 PCIe) (Windows 10 x64)

| commit mode | Journal mode  | synchronous | phase 1 time (s) | phase 2 time (s) | phase 1 IO rate (MB/s) | phase 2 IO rate (MB/s) | IO queue load   | IO vol (MB) |
|-------------|---------------|-------------|------------------|------------------|------------------------|------------------------|-----------------|-------------|
| default     | DELETE (dflt) | full (dflt) | 15               | 97               | 13                     | 13                     | 20, peak at 50% | 1500        |
| default     | MEMORY        | full        | 4                | 28               | 13                     | 13                     | 32%             | 410         |
| default     | WAL           | full        | 4                | 32               | 13                     | 13                     | 25%             | 460         |
| default     | WAL           | OFF         | 2                | 22               | too small              | <1.0                   | 0%              | <11         |

Index selection

I doubt the various compound indexes make enough difference to warrant their inclusion. This test compares size and timing results for the same dataset under various schema variants.

The test query was the most complex one in the code - a 4-way join, using 'idalias' twice and 'ibdsegs' twice - to list all overlapping segments for a given match segment.
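A rough sketch of the shape of that query (column names and the exact overlap rules are illustrative; the real query in the code may differ):

```sql
-- For the match segment identified by :segid, list all other segments that
-- overlap it on the same chromosome, with both matches' names resolved.
-- Joins ibdsegs twice (reference + candidates) and idalias twice (both ids).
SELECT a1.alias, a2.alias, s.chromosome, s.segstart, s.segend
FROM ibdsegs AS ref
JOIN ibdsegs AS s
  ON s.chromosome = ref.chromosome
 AND s.segstart < ref.segend
 AND s.segend > ref.segstart
JOIN idalias AS a1 ON a1.rowid = s.id1
JOIN idalias AS a2 ON a2.rowid = s.id2
WHERE ref.rowid = :segid
  AND s.rowid <> ref.rowid;
```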

Indices on `ibdsegs` originally were:

  • id1 (a compound index on id1_1 and id1_2, the two halves of the big integer)
  • id2
  • chromosome
  • chromosome, id1
  • chromosome, id2
  • id1, id2
  • unique on id1, id2, chromosome, start, end, build - this probably exists just to enforce a uniqueness constraint
As a "minimal" test, I removed the 4th to 6th indices.
DB sizes and Big Join query timings (on HDD)

| Schema      | indices | DB size (MB) | Big Join time, ms (min/max) |
|-------------|---------|--------------|-----------------------------|
| V2          | all     | 3.22         | 11/47                       |
| V2          | min     | 2.25         | 13/42                       |
| V3 int key  | all     | 1.31         | 9/38                        |
| V3 int key  | min     | 1.02         | 11/40                       |
| V3 textkey  | all     | 2.42         | 8/35                        |
| V3 textkey  | min     | 1.68         | 10/41                       |
Conclusion: the extra indices make a small difference in speed, but at the cost of substantially larger files. In every case there is a much bigger difference between the first and second pass, presumably due to file caching the second time around, so the reduced size might be more significant if the files get much bigger. (I doubt that will ever eventuate under current conditions.)

The use of the text ID is faster than the old split-integer version (presumably because of the extra comparisons the split key requires).

The use of the text ID is no slower than using a single intermediate integer key, so the integer key's size advantage is not enough to justify its added complexity.
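For context, a guess at how the three key layouts compared above might differ (column names and types, and the exact treatment of id2 in V2, are assumptions rather than the project's actual schema):

```sql
-- V2: the large external id is stored as two integer halves.
-- (The split is documented above for id1; treating id2 the same way is an assumption.)
CREATE TABLE ibdsegs_v2 (
    id1_1 INTEGER, id1_2 INTEGER,
    id2_1 INTEGER, id2_2 INTEGER,
    chromosome INTEGER, segstart INTEGER, segend INTEGER, build INTEGER
);

-- V3 "int key": a small surrogate integer key into idalias.
CREATE TABLE ibdsegs_v3_intkey (
    id1 INTEGER,   -- surrogate key of the first match in idalias
    id2 INTEGER,   -- surrogate key of the second match in idalias
    chromosome INTEGER, segstart INTEGER, segend INTEGER, build INTEGER
);

-- V3 "textkey": the external text id stored directly in each segment row.
CREATE TABLE ibdsegs_v3_textkey (
    id1 TEXT, id2 TEXT,
    chromosome INTEGER, segstart INTEGER, segend INTEGER, build INTEGER
);
```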
