DB Schema changes - CameronD73/529renew GitHub Wiki
Schema for Sqlite-WASM
The change imposed by the demise of WebSQL requires substantial code change, so I might as well remove a few kludges and perhaps prepare for a few improvements.
Whether the added complexity makes up for the more efficient storage remains to be seen. In particular, the "chromosome 100 and 200" special entries have been shifted to the DNAmatches table.
A Profile person refers to a DNA test kit that is managed under this account. The DNArelative table contains data relating to a match between a profile person and one other. The DNAmatch table contains match data between any two testers of interest, which may or may not include a profile person.
proposed tables
profiles
IDprofile TEXT (P.K.)
pname TEXT UNIQUE not NULL
Simply the UIDs of each kit managed under this account, along with a short name the user can use to identify the kit (does not have to match the name in the idalias table)
idalias
Unique data for a single person who has done a DNA test with 23 and Me.
IDText TEXT primary key, -- the tester's unique ID
name TEXT, -- the tester's name
"date" TEXT, -- the date they were added to the DB
hapMat TEXT, -- maternal haplogroup
hapPat TEXT, -- maternal haplogroup
bYear INTEGER -- a birth year, if they have shared it.
DNArelatives
Information about relationship between a profile person and another tester (everybody who has a line on 23andMe's "DNA Relatives" page. There is exactly one record for each such pair of people. Comments entered on 23 and Me are specific to a relative in relation to a profile, so cannot be in idalias table because that table only has one record per tester.
IDprofile TEXT PK, -- the ID of the profile person
IDrelative TEXT PK, -- the unique ID of the tester who shares DNA with the profile
ICWScanned INTEGER -- boolean - if we have done a "triangulation" run of ICWs for this profile/relative pair.
datescanned TEXT, -- the date this pair's relatives in common were last checked for triangulations.
comment TEXT, -- the comment field from the import
side TEXT, -- if maternal/paternal has been indicated.
DNAmatches
data for any two people who share DNA. There is usually one record for each pair of people, but can be two if they switch between hiding and revealing their segments.
ID1 TEXT PK, -- the unique ID of tester 1
ID2 TEXT PK, -- the unique ID of tester 2
ishidden INTEGER PK, -- 0 if segments are available, 1 if segment details are hidden
pctshared REAL, -- the percent shared dna
cMtotal REAL -- the calculated or supplied centiMorgan value
nsegs INTEGER -- the number of shared segments
hasSegs INTEGER -- boolean - 1 if we have stored matching segment data
lastdate TEXT -- date of last update
ibdsegs, ibdsegsFull
Data for one shared segment of DNA between two testers. There is one record for each segment reported by 23andMe . 23andMe reports "any" and "full-match" segments separately where I use "any" to mean at least half-IBD match. Where a particular segment is reported as a "full-IBD" match, such as between siblings, there will also be a separate entry in the "any" category that covers the same range as the full segment (and is often larger). There is no plan to use the full-ibd information, but at least keeping track of them separately can avoid conflicts in primary keys and will reduce redundant information in GDAT.
ID1 and ID2 are always saved in alphabetical order to make later search easier.
ID1 TEXT PK, -- the unique ID of tester 1
ID2 TEXT PK, -- the unique ID of tester 2
chromosome INTEGER, -- chromosome number, or 23 for X
start INTEGER, -- the MBase address of the start of the shared segment region
end INTEGER, -- the MBase address of the end of the shared segment region
cM REAL -- the calculated or supplied centiMorgan value
snps INTEGER -- the number of SNPs in this shared segment
ICWsets
A set of 3-way in-common matches. there will be a single record. There is no real distinction between id2 and id3, so they are inserted in alphabetical order to make later search easier.
to be decided: how to handle if more than one is a profile person.
IDprofile TEXT, -- the ID of the profile person
ID2 TEXT, -- the ID of the first relative
ID3 TEXT, -- the ID of the 3rd tester
chromosome INTEGER -- either -1 (we don't know), or 0 (there is no overlap found), or a chromosome on which the 3-way overlap (quasi-triangulation) occurs
start INTEGER, -- the MBase address of the start of the mutual overlap region (or zero)
end INTEGER -- the MBase address of the end of the mutual overlap region (or zero)
This table records all the 3-way ICW values reported by 23 and Me, and whether there is actually any overlapping match for all 3 testers. We have to allow for the possibility of multiple records for one set of 3 people, especially when two may be closely related. Consequently, we leave ROWID as the primary key and impose no uniqueness constraints (or it would have to be p1, id2, id3, chromosome, start)
ICWSegXref
ICWset_row INTEGER -- a ROWID value in the ICWsets table
segment_row INTEGER -- a ROWID value in the ibdsegs table
This is a cross-reference table identifying all matching segment pairs in a 3-way ICW
settings
A copy of the settings normally kept in local storage.
DBversion
just a version number in case we ever need to modify the schema.
schema for 529 and You
There are two main tables: the tester ID and the segment pairs.
The crucial key lies in the tester UID (unique identification code) that 23 and me uses to reference each person (actually DNA test). This is a 64-bit unsigned integer represented in URLs as 16 hex digits, which raises the following issues...
- integers in sqlite3 are always signed.
- integers in the javascript Number do not exist - There is now the BigInt as an arbitrarily large integer but using it would not solve all the issues.
- simple operations using large integers with be converted into js double precision floating point, which is not an exact representation of the original number. Thus entering the number into the sqlite tables will rarely return the correct tester UID and near enough is not good enough.
- attempting to use bigint values is cumbersome (and unnecessary)
- the code relies on the pair of UIDs in the segment table being in order - higher value first. This seems to be purely for lookup efficiency, as the actual values appear to be arbitrary. Mixing signed and unsigned integers will probably be ok in this exact context, but is looking for trouble.
- the consequence of this numerical mess is that the original coder chose to split the ID into two 32-bit numbers. Hence id_1 and id_2 are the high and low order parts of the UID.
Table schema are described on a separate page
proposed schema - manifest V3, WebSQL
We should take this opportunity to modify the database where improvements can be made and reduncancies removed.
Main Keys
- Code could be much simplified if the UID is stored simply as a text string and a separate primary key created for the idalias table (aliased to ROWID). This can then be used as the single integer key for each tester 1 and 2 in the segment table.
- Alternatively, the UID text string itself could be used as the ID. This consumes a bit more than double the space of 64-bit integers in each row, and also seems to increase index size.
After looking at how this would be coded, it looks like using the first option will
- make the DB more fragile (e.g. any replace operation assigns a new ID if it is autoincrement - it would require some fancy foreign key constraint triggers to make safer)
- makes the coding more complex, either within the JS code or in the queries. I think any disadvantages in the lookup speed or db size are outweighed by the simpler code in the string options. This is evaluated on the timing wiki page
Other Issues
- is
company_id
any use? build
number - build 37 has been the only one used for a long time - we leave in the field in case they ever decide to update to the last decade's standard reference genome, but I would propose removing any build-36 support and not reimporting those segments.
Phase, Relationship and Comments (Ancestors)
These values do not really belong in this table due to their ambiguity, they could be
- retained and ignored,
- deleted, or
- a new table would be needed to actually store these values unambiguously, and would use less space in the db.
convert old format
This should only be needed for testing. Import should be via the csv file.
Add to the idalias
table:
"ID" INTEGER UNIQUE,
"IDtext" TEXT UNIQUE,
PRIMARY KEY("ID")
and add to the ibdsegs table:
"ID1" INTEGER,
"ID2" INTEGER
To populate the alias table
update idalias set ID = _rowid_;
update idalias set IDtext = printf("%08.8x%08.8x", id_1, id_2);
And the segment table is a bit trickier, as an update cannot be done on a join...
update ibdsegs set ID1 =( select ID from idalias where ibdsegs.id1_1 = idalias.id_1 and ibdsegs.id1_2 = idalias.id_2 );
update ibdsegs set ID2 =( select ID from idalias where ibdsegs.id2_1 = idalias.id_1 and ibdsegs.id2_2 = idalias.id_2 );
and the next gotcha is that, for efficiency of code, the IDs are always stored highest value first. The process of assigning a separate ID breaks this, so we need to swap...
update ibdsegs set (ID1, ID2) = (ID2, ID1) where ID1 < ID2;
Use the following join to check the results...
select ID1, a.name, ID2, b.name, *
from ibdsegs as s
join idalias as a on s.id1_1 = a.id_1 and s.id1_2 = a.id_2
join idalias as b on s.id2_1 = b.id_1 and s.id2_2 = b.id_2 ;
After replacing the split integers with separate index as described here, the DB size for my small sample dropped from 2.4MB to 1.5MB. Then removing all the mainly unused columns (company, phase[12], relationship[12] and comment) made almost no difference (~50kB)