DB Schema changes - CameronD73/529renew GitHub Wiki

Schema for Sqlite-WASM

The change imposed by the demise of WebSQL requires substantial code change, so I might as well remove a few kludges and perhaps prepare for a few improvements.

Whether the added complexity makes up for the more efficient storage remains to be seen. In particular, the "chromosome 100 and 200" special entries have been shifted to the DNAmatches table.

A Profile person refers to a DNA test kit that is managed under this account. The DNArelative table contains data relating to a match between a profile person and one other. The DNAmatch table contains match data between any two testers of interest, which may or may not include a profile person.

proposed tables

profiles

    IDprofile   TEXT (P.K.)
    pname       TEXT UNIQUE not NULL

Simply the UIDs of each kit managed under this account, along with a short name the user can use to identify the kit (does not have to match the name in the idalias table)

idalias

Unique data for a single person who has done a DNA test with 23 and Me.

    IDText   TEXT primary key,          -- the tester's unique ID
    name     TEXT,          -- the tester's name
    "date"   TEXT,          -- the date they were added to the DB
    hapMat   TEXT,          -- maternal haplogroup
    hapPat   TEXT,          -- maternal haplogroup
    bYear    INTEGER        -- a birth year, if they have shared it.

DNArelatives

Information about relationship between a profile person and another tester (everybody who has a line on 23andMe's "DNA Relatives" page. There is exactly one record for each such pair of people. Comments entered on 23 and Me are specific to a relative in relation to a profile, so cannot be in idalias table because that table only has one record per tester.

    IDprofile    TEXT PK,       -- the ID of the profile person
    IDrelative   TEXT  PK,      -- the unique ID of the tester who shares DNA with the profile
    ICWScanned   INTEGER        -- boolean - if we have done a "triangulation" run of ICWs for this profile/relative pair.
    datescanned  TEXT,          -- the date this pair's  relatives in common were last checked for triangulations.
    comment      TEXT,          -- the comment field from the import
    side         TEXT,          -- if maternal/paternal has been indicated.

DNAmatches

data for any two people who share DNA. There is usually one record for each pair of people, but can be two if they switch between hiding and revealing their segments.

    ID1     TEXT  PK,       -- the unique ID of tester 1
    ID2     TEXT  PK,       -- the unique ID of tester 2
    ishidden  INTEGER PK,      -- 0 if segments are available, 1 if segment details are hidden
    pctshared  REAL,        -- the percent shared dna
    cMtotal    REAL         -- the calculated or supplied centiMorgan value
    nsegs      INTEGER      -- the number of shared segments
    hasSegs    INTEGER      -- boolean - 1 if we have stored matching segment data
    lastdate     TEXT       -- date of last update

ibdsegs, ibdsegsFull

Data for one shared segment of DNA between two testers. There is one record for each segment reported by 23andMe . 23andMe reports "any" and "full-match" segments separately where I use "any" to mean at least half-IBD match. Where a particular segment is reported as a "full-IBD" match, such as between siblings, there will also be a separate entry in the "any" category that covers the same range as the full segment (and is often larger). There is no plan to use the full-ibd information, but at least keeping track of them separately can avoid conflicts in primary keys and will reduce redundant information in GDAT.

ID1 and ID2 are always saved in alphabetical order to make later search easier.

    ID1     TEXT  PK,       -- the unique ID of tester 1
    ID2     TEXT  PK,       -- the unique ID of tester 2
    chromosome INTEGER,     -- chromosome number, or 23 for X
    start      INTEGER,     -- the MBase address of the start of the shared segment region
    end        INTEGER,     -- the MBase address of the end of the shared segment region
    cM         REAL         -- the calculated or supplied centiMorgan value
    snps       INTEGER      -- the number of SNPs in this shared segment

ICWsets

A set of 3-way in-common matches. there will be a single record. There is no real distinction between id2 and id3, so they are inserted in alphabetical order to make later search easier.

to be decided: how to handle if more than one is a profile person.

    IDprofile   TEXT,          -- the ID of the profile person
    ID2         TEXT,          -- the ID of the first relative 
    ID3         TEXT,          -- the ID of the 3rd tester
    chromosome  INTEGER        -- either -1 (we don't know), or 0 (there is no overlap found), or a chromosome on which the 3-way overlap (quasi-triangulation) occurs
    start       INTEGER,       -- the MBase address of the start of the mutual overlap region (or zero)
    end         INTEGER        -- the MBase address of the end of the mutual overlap region (or zero)

This table records all the 3-way ICW values reported by 23 and Me, and whether there is actually any overlapping match for all 3 testers. We have to allow for the possibility of multiple records for one set of 3 people, especially when two may be closely related. Consequently, we leave ROWID as the primary key and impose no uniqueness constraints (or it would have to be p1, id2, id3, chromosome, start)

ICWSegXref

    ICWset_row    INTEGER      -- a ROWID value in the ICWsets table
    segment_row   INTEGER      -- a ROWID value in the ibdsegs table

This is a cross-reference table identifying all matching segment pairs in a 3-way ICW

settings

A copy of the settings normally kept in local storage.

DBversion

just a version number in case we ever need to modify the schema.

schema for 529 and You

There are two main tables: the tester ID and the segment pairs.

The crucial key lies in the tester UID (unique identification code) that 23 and me uses to reference each person (actually DNA test). This is a 64-bit unsigned integer represented in URLs as 16 hex digits, which raises the following issues...

  • integers in sqlite3 are always signed.
  • integers in the javascript Number do not exist - There is now the BigInt as an arbitrarily large integer but using it would not solve all the issues.
  • simple operations using large integers with be converted into js double precision floating point, which is not an exact representation of the original number. Thus entering the number into the sqlite tables will rarely return the correct tester UID and near enough is not good enough.
  • attempting to use bigint values is cumbersome (and unnecessary)
  • the code relies on the pair of UIDs in the segment table being in order - higher value first. This seems to be purely for lookup efficiency, as the actual values appear to be arbitrary. Mixing signed and unsigned integers will probably be ok in this exact context, but is looking for trouble.
  • the consequence of this numerical mess is that the original coder chose to split the ID into two 32-bit numbers. Hence id_1 and id_2 are the high and low order parts of the UID.

Table schema are described on a separate page

proposed schema - manifest V3, WebSQL

We should take this opportunity to modify the database where improvements can be made and reduncancies removed.

Main Keys

  • Code could be much simplified if the UID is stored simply as a text string and a separate primary key created for the idalias table (aliased to ROWID). This can then be used as the single integer key for each tester 1 and 2 in the segment table.
  • Alternatively, the UID text string itself could be used as the ID. This consumes a bit more than double the space of 64-bit integers in each row, and also seems to increase index size.

After looking at how this would be coded, it looks like using the first option will

  1. make the DB more fragile (e.g. any replace operation assigns a new ID if it is autoincrement - it would require some fancy foreign key constraint triggers to make safer)
  2. makes the coding more complex, either within the JS code or in the queries. I think any disadvantages in the lookup speed or db size are outweighed by the simpler code in the string options. This is evaluated on the timing wiki page

Other Issues

  • is company_id any use?
  • build number - build 37 has been the only one used for a long time - we leave in the field in case they ever decide to update to the last decade's standard reference genome, but I would propose removing any build-36 support and not reimporting those segments.

Phase, Relationship and Comments (Ancestors)

These values do not really belong in this table due to their ambiguity, they could be

  • retained and ignored,
  • deleted, or
  • a new table would be needed to actually store these values unambiguously, and would use less space in the db.

convert old format

This should only be needed for testing. Import should be via the csv file. Add to the idalias table:

	"ID"	INTEGER UNIQUE,
	"IDtext"	TEXT UNIQUE,
	PRIMARY KEY("ID")

and add to the ibdsegs table:

	"ID1"	INTEGER,
	"ID2"	INTEGER

To populate the alias table

update idalias set ID = _rowid_;
update idalias set IDtext = printf("%08.8x%08.8x", id_1, id_2);

And the segment table is a bit trickier, as an update cannot be done on a join...

update ibdsegs set ID1 =( select  ID from   idalias where ibdsegs.id1_1 = idalias.id_1 and ibdsegs.id1_2 = idalias.id_2 );  
update ibdsegs set ID2 =( select  ID from   idalias where ibdsegs.id2_1 = idalias.id_1 and ibdsegs.id2_2 = idalias.id_2 );  

and the next gotcha is that, for efficiency of code, the IDs are always stored highest value first. The process of assigning a separate ID breaks this, so we need to swap...

update ibdsegs set (ID1, ID2) = (ID2, ID1) where ID1 < ID2;

Use the following join to check the results...

select ID1, a.name, ID2, b.name, *
	from ibdsegs as s
		join idalias as a on s.id1_1 = a.id_1 and s.id1_2 = a.id_2
		join idalias as b on s.id2_1 = b.id_1 and s.id2_2 = b.id_2 ;

After replacing the split integers with separate index as described here, the DB size for my small sample dropped from 2.4MB to 1.5MB. Then removing all the mainly unused columns (company, phase[12], relationship[12] and comment) made almost no difference (~50kB)