DB Operation Timing - CameronD73/529renew GitHub Wiki

Timing with Wasm

Timing the import of a 529 export: 2581 rows into IDalias.

The default commit mode is autocommit after each row (no explicit transactions).

  • Commit mode (a sketch of the three strategies follows this list):
    • "default" is insert one row at a time, with autocommit after each - test is idalias fill only (0.45MB)
    • "1 x 1" is single-row inserts, all within a single transaction - test is idalias and ibdsegs (5.8MB total)
    • "1 x/1000" and "1 x/10000" are a commit every 1000 or 10000 inserts respectively - test is idalias and ibdsegs (5.8MB total, 29k rows of segs)
  • Journal mode can be: DELETE (default) | TRUNCATE | PERSIST | MEMORY | OFF. (WAL mode is not supported in WASM.)
  • Synch mode can be: EXTRA | FULL (default) | NORMAL | OFF
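Written as plain SQL, the three commit strategies correspond roughly to the following (the `idalias` table and its columns are shown with illustrative names only, not the actual schema):

```sql
CREATE TABLE IF NOT EXISTS idalias (idtext TEXT, alias TEXT);  -- illustrative schema

-- "default": no explicit transaction, so every INSERT autocommits and the
-- journal/database is flushed once per row.
INSERT INTO idalias (idtext, alias) VALUES ('A123', 'Match 1');
INSERT INTO idalias (idtext, alias) VALUES ('B456', 'Match 2');

-- "1 x 1": still one row per INSERT, but the whole import sits inside a
-- single explicit transaction, so there is only one flush at COMMIT.
BEGIN;
INSERT INTO idalias (idtext, alias) VALUES ('A123', 'Match 1');
-- ... remaining rows ...
COMMIT;

-- "1 x/1000" (and "1 x/10000"): commit and restart the transaction
-- every 1000 (or 10000) inserts.
BEGIN;
-- ... 1000 INSERTs ...
COMMIT;
BEGIN;
-- ... next 1000 INSERTs ...
COMMIT;
```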
DB import timings to OPFS (Windows 10 x64), idalias (450kB)

| commit mode | Journal mode  | synch: EXTRA | synch: FULL | synch: NORMAL | synch: OFF |
|-------------|---------------|--------------|-------------|---------------|------------|
| default     | DELETE (dflt) | 129          | 129         | 131           | 114        |
| default     | PERSIST       | -            | 126         | -             | 107        |
| default     | TRUNCATE      | -            | -           | -             | -          |
| default     | MEMORY        | -            | 14          | -             | 10         |
| 1 x 1       | DELETE (dflt) | -            | 2           | -             | 3          |
| 1 x 1       | PERSIST       | -            | 3           | -             | 4          |
| 1 x 1       | MEMORY        | -            | 2           | -             | -          |
| 1 x/1000    | DELETE (dflt) | -            | 52          | -             | -          |
| 1 x/1000    | MEMORY        | -            | 4           | -             | -          |
| 1 x/10000   | DELETE (dflt) | -            | 7           | -             | -          |
| 1 x/10000   | MEMORY        | -            | 3           | -             | -          |

Journalling and Synch options - sqlite3 executable

Unfortunately, having done all this, I later discovered that all PRAGMA statements are blocked in the API, so we are stuck with the slowest process. I initially thought that I could set WAL mode independently, but the developers (words fail me) have decided that they will refuse to open a db in WAL mode. I could try to bundle multiple statements into a transaction, but in most (if not all) cases that is not worth the effort.

The process of importing the old data csv file is exceptionally slow, so I did a few experiments.

  1. Processed the csv through awk to generate an sql script matching the plan for the javascript,
  2. then ran it through the sqlite3 program,
  3. editing the script between runs to change journalling, synchronous mode, etc.
  4. The db is zeroed and vacuumed at the start of each run.
The default mode is journal_mode=delete, synchronous=full. The default for the webkit implementation is a commit after every statement, i.e. autocommit.
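For reference, a minimal sketch of how such a generated script might begin, with the pragmas under test at the top (the table and column names, and the particular pragma values shown, are illustrative only):

```sql
-- Pragmas edited between runs; sqlite3 defaults are journal_mode=DELETE, synchronous=FULL.
PRAGMA journal_mode = WAL;      -- DELETE | TRUNCATE | PERSIST | MEMORY | WAL | OFF
PRAGMA synchronous = NORMAL;    -- EXTRA | FULL | NORMAL | OFF

CREATE TABLE IF NOT EXISTS idalias (idtext TEXT, alias TEXT);  -- illustrative schema

-- "default" commit mode: each INSERT is its own statement, so sqlite3
-- autocommits after every row.
INSERT INTO idalias (idtext, alias) VALUES ('A123', 'Match 1');
INSERT INTO idalias (idtext, alias) VALUES ('B456', 'Match 2');
-- ... remaining rows generated by the awk script ...
```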

My test export file (from 529 and you, V3.3.0) is 7,929 lines (segments) - no chromosome 100 data. This is from 1,461 individuals.

The DB build is two phases:

  1. filling the `idalias` table with simple inserts - 1460 rows;
  2. inserting the segments, where each insert contains two subqueries on the `idalias` table (see the sketch after these notes) - 7922 rows.
Notes on the measurements:

  • IO rate stats are very approximate, determined by watching the value shown by task manager and guessing an average.
  • IO load is likewise from viewing task manager - it is approaching being I/O bound on the hard drive (non-RAID partition).
  • The final DB size is 1.4MB.
  • Total IO volume is very approximate (IO rate times elapsed time).
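A hedged sketch of the phase-2 insert pattern, assuming illustrative table and column names and that the `idalias` rows already exist from phase 1:

```sql
-- Each segment row resolves both participant ids via subqueries on idalias.
-- Column names (idtext, segstart, segend, etc.) are illustrative only.
INSERT INTO ibdsegs (id1, id2, chromosome, segstart, segend, build)
VALUES (
    (SELECT rowid FROM idalias WHERE idtext = 'A123'),
    (SELECT rowid FROM idalias WHERE idtext = 'B456'),
    7, 12345678, 23456789, 37
);
```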
DB import timings to HDD (Windows 10 x64)

| commit mode  | Journal mode  | synchronous | phase 1 time (s) | phase 2 time (s) | phase 1 IO rate (MB/s) | phase 2 IO rate (MB/s) | IO queue load | IO vol (MB) |
|--------------|---------------|-------------|------------------|------------------|------------------------|------------------------|---------------|-------------|
| default      | DELETE (dflt) | full (dflt) | 110              | 750              | 1.3                    | 1.4                    | 87%           | 1200        |
| default      | PERSIST       | full        | 140              | 810              | 1.0                    | 1.2                    | 85%           | 1200        |
| large blocks | WAL           | full        | <1               | 1                | too small              | too small              | too small     | 1.5?        |
| large blocks | DELETE        | full        | <1               | 1                | too small              | too small              | too small     | 1.5?        |
| default      | MEMORY        | full        | 30               | 260              | 1.0                    | 1.6                    | 83%           | 450         |
| default      | WAL           | full        | 32               | 190              | 1.2                    | 2.0                    | 80%           | 420         |
| default      | WAL           | OFF         | 2                | 21               | 1.0                    | 1.6                    | 83%           | 36          |
DB import timings to SSD (M.2 PCIe) (Windows 10 x64)

| commit mode | Journal mode  | synchronous | phase 1 time (s) | phase 2 time (s) | phase 1 IO rate (MB/s) | phase 2 IO rate (MB/s) | IO queue load   | IO vol (MB) |
|-------------|---------------|-------------|------------------|------------------|------------------------|------------------------|-----------------|-------------|
| default     | DELETE (dflt) | full (dflt) | 15               | 97               | 13                     | 13                     | 20, peak at 50% | 1500        |
| default     | MEMORY        | full        | 4                | 28               | 13                     | 13                     | 32%             | 410         |
| default     | WAL           | full        | 4                | 32               | 13                     | 13                     | 25%             | 460         |
| default     | WAL           | OFF         | 2                | 22               | too small              | <1.0                   | 0%              | <11         |

Index selection

I doubt the various compound indexes make enough difference to warrant their inclusion. This test compares size and timing results for the same dataset under various schema variants.

The test query was the most complex one in the code - a 4-way join, using 'idalias' twice and 'ibdsegs' twice - to list all overlapping segments for a given match segment.
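A rough sketch of the shape of that query (column names and the exact overlap rules are illustrative; the real query in the code may differ):

```sql
-- For the match segment identified by :segid, list all other segments that
-- overlap it on the same chromosome, with both matches' names resolved.
-- Joins ibdsegs twice (reference + candidates) and idalias twice (both ids).
SELECT a1.alias, a2.alias, s.chromosome, s.segstart, s.segend
FROM ibdsegs AS ref
JOIN ibdsegs AS s
  ON s.chromosome = ref.chromosome
 AND s.segstart < ref.segend
 AND s.segend > ref.segstart
JOIN idalias AS a1 ON a1.rowid = s.id1
JOIN idalias AS a2 ON a2.rowid = s.id2
WHERE ref.rowid = :segid
  AND s.rowid <> ref.rowid;
```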

Indices on `ibdsegs` originally were:

  • id1 (a compound index on id1_1 and id1_2, the two halves of the big integer)
  • id2
  • chromosome
  • chromosome, id1
  • chromosome, id2
  • id1, id2
  • unique on id1, id2, chromosome, start, end, build - this probably exists just to enforce a uniqueness constraint
As a "minimal" test, I removed the 4th to 6th indices.
DB sizes and Big Join query timings (on HDD)

| Schema      | indices | DB size (MB) | Big Join time, ms (min/max) |
|-------------|---------|--------------|-----------------------------|
| V2          | all     | 3.22         | 11/47                       |
| V2          | min     | 2.25         | 13/42                       |
| V3 int key  | all     | 1.31         | 9/38                        |
| V3 int key  | min     | 1.02         | 11/40                       |
| V3 textkey  | all     | 2.42         | 8/35                        |
| V3 textkey  | min     | 1.68         | 10/41                       |
Conclusion: the extra indices make a small difference in speed, but at the cost of substantially larger files. In every case there is a much bigger difference between the first and second pass, presumably due to file caching the second time around, so the reduced size might be more significant if the files get much bigger. (I doubt that will ever eventuate under current conditions.)

The use of the text ID is faster than the old split-integer version (presumably because of the extra comparisons the split key requires).

The use of the text ID is no slower than using a single intermediate integer key, so the integer key's size advantage is not enough to justify its added complexity.
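For context, a guess at how the three key layouts compared above might differ (column names and types, and the exact treatment of id2 in V2, are assumptions rather than the project's actual schema):

```sql
-- V2: the large external id is stored as two integer halves.
-- (The split is documented above for id1; treating id2 the same way is an assumption.)
CREATE TABLE ibdsegs_v2 (
    id1_1 INTEGER, id1_2 INTEGER,
    id2_1 INTEGER, id2_2 INTEGER,
    chromosome INTEGER, segstart INTEGER, segend INTEGER, build INTEGER
);

-- V3 "int key": a small surrogate integer key into idalias.
CREATE TABLE ibdsegs_v3_intkey (
    id1 INTEGER,   -- surrogate key of the first match in idalias
    id2 INTEGER,   -- surrogate key of the second match in idalias
    chromosome INTEGER, segstart INTEGER, segend INTEGER, build INTEGER
);

-- V3 "textkey": the external text id stored directly in each segment row.
CREATE TABLE ibdsegs_v3_textkey (
    id1 TEXT, id2 TEXT,
    chromosome INTEGER, segstart INTEGER, segend INTEGER, build INTEGER
);
```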
