DB Schema changes - CameronD73/529renew Wiki

current schema

There are two main tables: the tester ID and the segment pairs.

The crucial key lies in the tester UID (unique identification code) that 23andMe uses to reference each person (strictly, each DNA test). This is a 64-bit unsigned integer represented in URLs as 16 hex digits, which raises the following issues:

  • integers in sqlite3 are always signed.
  • the JavaScript Number type has no true integers: every value is a double-precision float, so integers are exact only up to 2^53 - 1. BigInt now provides arbitrarily large integers, but using it would not solve all the issues.
  • simple operations on large integers will be converted to JS double-precision floating point, which cannot represent the original number exactly. A UID that has passed through a Number before being written to the sqlite tables will therefore rarely come back as the correct tester UID, and near enough is not good enough.
  • attempting to use BigInt values is cumbersome (and unnecessary).
  • the code relies on the pair of UIDs in the segment table being in order, higher value first. This seems to be purely for lookup efficiency, as the actual values appear to be arbitrary. Mixing signed and unsigned integers will probably be OK in this exact context, but is asking for trouble.
  • the consequence of this numerical mess is that the original coder chose to split the ID into two 32-bit numbers. Hence id_1 and id_2 are the high and low order parts of the UID.
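
The precision problem and the split workaround can be illustrated with a short sketch. The UID below is made up for illustration, and `splitUid` is a hypothetical helper, not a function from the 529renew code; only the names id_1 and id_2 come from the schema described above.

```javascript
// A made-up 16-hex-digit UID, for illustration only (not a real tester ID).
const uidHex = "f123456789abcdef";

// A Number has only 53 bits of integer precision, so a 64-bit value
// does not survive being parsed and printed again.
const asNumber = parseInt(uidHex, 16);
const roundTrip = asNumber.toString(16);
console.log(Number.isSafeInteger(asNumber)); // false
console.log(roundTrip === uidHex);           // false

// Splitting the text into two 8-digit halves keeps each part within 32 bits,
// which both a JS Number and a signed 64-bit SQLite INTEGER hold exactly.
function splitUid(hex) {
  return {
    id_1: parseInt(hex.slice(0, 8), 16), // high-order 32 bits
    id_2: parseInt(hex.slice(8), 16),    // low-order 32 bits
  };
}

const { id_1, id_2 } = splitUid(uidHex);
console.log(id_1.toString(16), id_2.toString(16)); // f1234567 89abcdef
```

Each half is at most 2^32 - 1, so neither the sign bit in sqlite nor the 53-bit limit in JavaScript comes into play.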

Table schemas are described on a separate page.

proposed new schema

We should take this opportunity to modify the database where improvements can be made and redundancies removed.

Main Keys

  • Code could be much simplified if the UID is stored simply as a text string and a separate primary key created for the idalias table (aliased to ROWID). This can then be used as the single integer key for each tester 1 and 2 in the segment table.
  • Alternatively, the UID text string itself could be used as the ID. This consumes a bit more than double the space of 64-bit integers in each row, and also seems to increase index size.

After looking at how this would be coded, it looks like using the first option will

  1. make the DB more fragile (e.g. any replace operation assigns a new ID if it is autoincrement, so it would require some fancy foreign key constraint triggers to make it safer)
  2. make the coding more complex, either within the JS code or in the queries.

I think any disadvantages in lookup speed or DB size are outweighed by the simpler code of the string option. This is evaluated on the timing wiki page.
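
If the UID text is used directly as the key and the higher-value-first convention is kept, the ordering can be done with a plain string comparison, because fixed-width hex strings in the same letter case sort in the same order as the unsigned numbers they encode. A minimal sketch, assuming 16-digit hex UIDs; `orderPair` is a hypothetical helper, not part of the existing code:

```javascript
// Hypothetical helper: return the two UID strings with the higher value first.
// Assumes both inputs are exactly 16 hex digits; case is normalised so that
// "A".."F" and "a".."f" compare consistently.
function orderPair(uidA, uidB) {
  const a = uidA.toLowerCase();
  const b = uidB.toLowerCase();
  return a >= b ? [a, b] : [b, a];
}

// Made-up UIDs for illustration.
const pair = orderPair("00000000000000ff", "F000000000000000");
console.log(pair); // [ 'f000000000000000', '00000000000000ff' ]
```

No BigInt or split integers are needed for the comparison, which is part of what makes the string option simpler to code.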

Other Issues

  • is company_id any use?
  • build number: build 37 has been the only one used for a long time. We leave the field in, in case they ever decide to update to the last decade's standard reference genome, but I would propose removing any build-36 support and not reimporting those segments.

Phase, Relationship and Comments (Ancestors)

These values do not really belong in this table because they are ambiguous. They could be

  • retained and ignored,
  • deleted, or
  • moved to a new table, which would be needed to actually store these values unambiguously and would use less space in the db.

convert old format

This should only be needed for testing. Import should be via the csv file. Add to the idalias table:

	"ID"	INTEGER UNIQUE,
	"IDtext"	TEXT UNIQUE,
	PRIMARY KEY("ID")

and add to the ibdsegs table:

	"ID1"	INTEGER,
	"ID2"	INTEGER

To populate the alias table

update idalias set ID = _rowid_;
update idalias set IDtext = printf('%08.8x%08.8x', id_1, id_2);
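
The same text form can be built on the JS side. The sketch below mirrors the printf format above (two zero-padded 8-digit lower-case hex fields); `joinUid` is a hypothetical name, and the `>>> 0` step is an added assumption to guard against a half that has been read back as a negative 32-bit value.

```javascript
// Hypothetical helper: rebuild the 16-hex-digit UID text from its two halves.
function joinUid(id_1, id_2) {
  // ">>> 0" reinterprets a possibly negative 32-bit value as unsigned,
  // then the hex text is left-padded with zeros to 8 digits.
  const hex8 = (n) => (n >>> 0).toString(16).padStart(8, "0");
  return hex8(id_1) + hex8(id_2);
}

console.log(joinUid(0x1a, 0xff)); // 0000001a000000ff
console.log(joinUid(-1, 0));      // ffffffff00000000
```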

And the segment table is a bit trickier, as an update cannot be done on a join...

update ibdsegs set ID1 = (select ID from idalias where ibdsegs.id1_1 = idalias.id_1 and ibdsegs.id1_2 = idalias.id_2);
update ibdsegs set ID2 = (select ID from idalias where ibdsegs.id2_1 = idalias.id_1 and ibdsegs.id2_2 = idalias.id_2);

The next gotcha is that, for efficiency of code, the IDs are always stored higher value first. Assigning a separate ID breaks this ordering, so we need to swap any pair that is now the wrong way round...

update ibdsegs set (ID1, ID2) = (ID2, ID1) where ID1 < ID2;

Use the following join to check the results...

select ID1, a.name, ID2, b.name, *
	from ibdsegs as s
		join idalias as a on s.id1_1 = a.id_1 and s.id1_2 = a.id_2
		join idalias as b on s.id2_1 = b.id_1 and s.id2_2 = b.id_2 ;

After replacing the split integers with a separate index as described here, the DB size for my small sample dropped from 2.4MB to 1.5MB. Removing all the mainly unused columns (company, phase[12], relationship[12] and comment) then made almost no difference (~50kB).