File Formats - orenlivne/ober GitHub Wiki

In the examples we assume the data set's file prefix is hutt.

PLINK Formats

See the PLINK documentation. The input to phasing is a PLINK TPED data set, which consists of the files hutt.tped and hutt.tfam.

Identity Coefficient File

hutt.id. Contains condensed identity coefficients for all pairs of individuals in the genotyped data set. Row format:

id1 id2 lam delta1 ... delta9

where delta1,...,delta9 are the 9 condensed identity coefficients as defined in Lange's book, and lam is the estimated recombination transition rate defined in the IBDLD paper.

Example:

2592 2592 6.522090e-01 1.812744e-02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.818726e-01 0.000000e+00 0.000000e+00
2592 2882 2.904150e-01 1.264513e-04 5.991831e-04 2.589047e-03 1.481276e-02 5.379498e-03 3.295737e-02 1.942925e-03 1.433675e-01 7.982252e-01
...
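
The row format above can be read with a few lines of Python. The following is a minimal sketch, not part of PRIMAL: the names IdRow, parse_id_line and kinship are hypothetical. The kinship helper assumes the nine coefficients are stored in the standard (Jacquard) order, in which case the kinship coefficient is f = delta1 + (delta3 + delta5 + delta7)/2 + delta8/4.

```python
from typing import NamedTuple, Tuple


class IdRow(NamedTuple):
    """One row of hutt.id (hypothetical container, not a PRIMAL class)."""
    id1: str
    id2: str
    lam: float
    delta: Tuple[float, ...]  # (delta1, ..., delta9)


def parse_id_line(line: str) -> IdRow:
    """Parse 'id1 id2 lam delta1 ... delta9' into an IdRow."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, got {len(fields)}")
    return IdRow(fields[0], fields[1], float(fields[2]),
                 tuple(float(x) for x in fields[3:]))


def kinship(delta: Tuple[float, ...]) -> float:
    """Kinship coefficient from condensed identity coefficients.

    Assumes the standard ordering delta1..delta9 (an assumption about
    the file, not something the wiki states explicitly).
    """
    d1, _d2, d3, _d4, d5, _d6, d7, d8, _d9 = delta
    return d1 + 0.5 * (d3 + d5 + d7) + 0.25 * d8


# First example row above: an individual paired with itself.
row = parse_id_line(
    "2592 2592 6.522090e-01 1.812744e-02 0.000000e+00 0.000000e+00 "
    "0.000000e+00 0.000000e+00 0.000000e+00 9.818726e-01 0.000000e+00 "
    "0.000000e+00"
)
self_kinship = kinship(row.delta)
```

For a self pair the kinship should equal (1 + F)/2, where F = delta1 is the inbreeding coefficient. The first example row is consistent with this under the assumed ordering.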

PRIMAL PLINK Format

PRIMAL's phasing program accepts two types of input: a PRIMAL NPZ file hutt.npz (a NumPy-compressed serialization of the Problem Python object that holds the data set's information), or a PRIMAL TPED data set, which is a PLINK TPED set plus the following additional files:

  • hutt.mnr - minor allele file. Row format: snv_identifier minor_allele_letter. E.g.,
rs9605923 T
rs5747999 G
rs5746679 T
rs11089263 A
rs11089264 A
  • hutt.frm - SNV frame file. Each row lists one frame as 0-based indices into the list of PLINK SNVs. Row format:
chromosome index1 ... indexN

E.g., our data set has 3218 SNPs on chromosome 22, so the indices range from 0 to 3217. The hutt.frm rows for that chromosome look like this:

22 0 1 10 26 31 ... 3201 3216 3217
22 0 1 10 26 30 40 ... 3201 3210 3217
...

A frame is a subset of the SNVs in which every pair has LD r^2 below a threshold (we use 0.3). Frames are not necessarily disjoint: the same SNV may appear in more than one frame.
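
As an illustration of the hutt.frm layout, here is a minimal sketch that groups frames by chromosome. The function name parse_frames is hypothetical and is not part of impute.io. It assumes a real file contains only integers after the chromosome field (the "..." in the example above is an elision, not file content).

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_frames(lines: Iterable[str]) -> Dict[str, List[List[int]]]:
    """Map chromosome -> list of frames, each a list of 0-based SNV indices."""
    frames: Dict[str, List[List[int]]] = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom, indices = fields[0], [int(x) for x in fields[1:]]
        frames[chrom].append(indices)
    return dict(frames)


# Made-up toy input in the same layout (not real data).
toy = parse_frames(["22 0 1 10 26", "22 0 1 10 30", ""])
shared = set(toy["22"][0]) & set(toy["22"][1])  # frames may overlap
```

With a real data set one would pass an open file handle, e.g. parse_frames(open("hutt.frm")), and use each index list to select rows of the TPED file.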

  • hutt.lam - a file with the average values of lam in the population vs. the kinship coefficient f at certain discrete points. Example:
5.000000000000000104e-03 6.660384666666666620e-01
1.499999999999999944e-02 6.942750542168673045e-01
2.500000000000000139e-02 6.961791467391308386e-01
3.500000000000000333e-02 6.611896463414631553e-01
4.499999999999999833e-02 6.141153714285715326e-01
5.500000000000000028e-02 5.607391297297296129e-01
6.500000000000000222e-02 5.789846129032256705e-01
7.500000000000001110e-02 5.076148750000001320e-01
8.499999999999999223e-02 4.967449999999999366e-01
9.500000000000000111e-02 4.817949999999999733e-01
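
A table like this can be loaded and queried with a short script. The sketch below reads the first column as f and the second as the average lam, which is an interpretation of the example values rather than something stated here. Linear interpolation between tabulated points (with clamping outside the range) is shown only as one plausible way to look up lam for a given f; it is not necessarily what PRIMAL does. The names parse_lam_table and interpolate_lam are hypothetical.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


def parse_lam_table(lines: Iterable[str]) -> List[Tuple[float, float]]:
    """Read (f, lam) pairs, assuming column 1 is f and column 2 is lam."""
    table = []
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        table.append((float(fields[0]), float(fields[1])))
    table.sort()  # bisect below needs f in ascending order
    return table


def interpolate_lam(table: List[Tuple[float, float]], f: float) -> float:
    """Linearly interpolate lam at kinship f, clamping outside the table."""
    if not table:
        raise ValueError("empty lam table")
    fs = [row[0] for row in table]
    if f <= fs[0]:
        return table[0][1]
    if f >= fs[-1]:
        return table[-1][1]
    i = bisect_left(fs, f)  # fs[i-1] < f <= fs[i]
    f0, lam0 = table[i - 1]
    f1, lam1 = table[i]
    return lam0 + (lam1 - lam0) * (f - f0) / (f1 - f0)


# First two rows of the example above.
table = parse_lam_table([
    "5.000000000000000104e-03 6.660384666666666620e-01",
    "1.499999999999999944e-02 6.942750542168673045e-01",
])
lam_mid = interpolate_lam(table, 0.01)  # halfway between the two points
```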

The impute.io Python directory includes useful conversion modules. $OBER/impute/batch contains useful programs for splitting a PLINK TPED data set into chromosomal data sets and for preparing the additional files.