File Formats - orenlivne/ober GitHub Wiki
In the examples we assume the data set's file prefix is hutt.
PLINK Formats
See the PLINK documentation. The input to phasing is in TPED format (requires the hutt.tped and hutt.tfam files).
Identity Coefficient File
hutt.id. Contains condensed identity coefficients for all pairs of individuals in the genotyped data set. Row format:
id1 id2 lam delta1 ... delta9
where delta1,...,delta9 are the 9 condensed identity coefficients as defined in Lange's book, and lam is the estimated recombination transition rate defined in the IBDLD paper.
Example:
2592 2592 6.522090e-01 1.812744e-02 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 9.818726e-01 0.000000e+00 0.000000e+00
2592 2882 2.904150e-01 1.264513e-04 5.991831e-04 2.589047e-03 1.481276e-02 5.379498e-03 3.295737e-02 1.942925e-03 1.433675e-01 7.982252e-01
...
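A minimal sketch of reading one row of this file, using only the row layout given above. The names parse_id_line and kinship are hypothetical and are not part of PRIMAL. The kinship helper assumes the deltas follow the standard Jacquard ordering, in which the kinship coefficient is delta1 + (delta3 + delta5 + delta7)/2 + delta8/4.

```python
from typing import List, NamedTuple


class IdCoefRow(NamedTuple):
    id1: str
    id2: str
    lam: float          # estimated recombination transition rate
    delta: List[float]  # delta1..delta9, in file order


def parse_id_line(line: str) -> IdCoefRow:
    """Parse one 'id1 id2 lam delta1 ... delta9' row."""
    fields = line.split()
    if len(fields) != 12:
        raise ValueError("expected 12 fields, got %d" % len(fields))
    return IdCoefRow(fields[0], fields[1], float(fields[2]),
                     [float(x) for x in fields[3:]])


def kinship(delta: List[float]) -> float:
    """Kinship coefficient, ASSUMING the standard Jacquard ordering of the deltas."""
    d1, _, d3, _, d5, _, d7, d8, _ = delta
    return d1 + 0.5 * (d3 + d5 + d7) + 0.25 * d8


# First example row from this page (an individual paired with itself).
row = parse_id_line(
    "2592 2592 6.522090e-01 1.812744e-02 0.000000e+00 0.000000e+00 "
    "0.000000e+00 0.000000e+00 0.000000e+00 9.818726e-01 0.000000e+00 0.000000e+00"
)
print(row.id1, row.id2, row.lam, sum(row.delta), kinship(row.delta))
```

For a valid row the nine deltas should sum to 1 up to rounding, which makes a cheap sanity check when loading the file.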
PRIMAL PLINK Format
PRIMAL's phasing program accepts two types of input: a PRIMAL NPZ file hutt.npz (a NumPy-compressed form of the Problem Python object holding the data set's information), or a PRIMAL TPED format, which is a PLINK TPED set plus the following additional files:
hutt.mnr - minor allele frequency file. Format: snv_identifier minor_allele_letter. E.g.,
rs9605923 T
rs5747999 G
rs5746679 T
rs11089263 A
rs11089264 A
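As an illustration of the two-column layout above, the following sketch loads such a file into a dictionary. The function name parse_mnr is hypothetical, and whitespace-separated fields are assumed.

```python
from typing import Dict, Iterable


def parse_mnr(lines: Iterable[str]) -> Dict[str, str]:
    """Map each SNV identifier to its minor allele letter."""
    minor = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError("expected 'snv_identifier minor_allele_letter': %r" % line)
        snv, allele = fields
        minor[snv] = allele
    return minor


# The example rows shown above.
example = ["rs9605923 T", "rs5747999 G", "rs5746679 T",
           "rs11089263 A", "rs11089264 A"]
minor_allele = parse_mnr(example)
print(minor_allele["rs5747999"])
```

With a real file the same function can be called as parse_mnr(open("hutt.mnr")).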
hutt.frm - SNV frames, given as 0-based indices into the list of PLINK SNVs. Each row lists one frame. Row format:
E.g., our data set has 3218 SNPs on chromosome 22. Therefore the hutt.frm file for that chromosome looks like this:
22 0 1 10 26 31 ... 3201 3216 3217
22 0 1 10 26 30 40 ... 3201 3210 3217
...
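A frame is a subset of the SNVs in which every pair has LD r^2 < threshold (we use 0.3). Frames are not necessarily mutually exclusive: the same SNV index may appear in more than one frame, as indices 0, 1, 10 and 26 do in the example above.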
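A minimal sketch of reading the frame file, assuming one frame per row as in the row format above. The names parse_frm and shared_indices are hypothetical, and the two input rows are shortened toy rows rather than real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_frm(lines: Iterable[str]) -> Dict[str, List[List[int]]]:
    """Group frames by chromosome; each frame is a list of 0-based SNV indices."""
    frames = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        chrom = fields[0]  # kept as a string so labels such as 'X' also work
        frames[chrom].append([int(x) for x in fields[1:]])
    return dict(frames)


def shared_indices(frame_a: List[int], frame_b: List[int]) -> List[int]:
    """SNV indices that occur in both frames (frames may overlap)."""
    return sorted(set(frame_a) & set(frame_b))


# Toy rows, NOT real data: two short overlapping frames on chromosome 22.
toy = ["22 0 1 10 26 31", "22 0 1 10 26 30 40"]
frames = parse_frm(toy)
print(len(frames["22"]))                # 2
print(shared_indices(*frames["22"]))    # [0, 1, 10, 26]
```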
hutt.lam - a file with the average values of lam in the population vs. the kinship coefficient f at certain discrete points. Example:
5.000000000000000104e-03 6.660384666666666620e-01
1.499999999999999944e-02 6.942750542168673045e-01
2.500000000000000139e-02 6.961791467391308386e-01
3.500000000000000333e-02 6.611896463414631553e-01
4.499999999999999833e-02 6.141153714285715326e-01
5.500000000000000028e-02 5.607391297297296129e-01
6.500000000000000222e-02 5.789846129032256705e-01
7.500000000000001110e-02 5.076148750000001320e-01
8.499999999999999223e-02 4.967449999999999366e-01
9.500000000000000111e-02 4.817949999999999733e-01
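One way such a table could be used is to look up an approximate lam for an arbitrary kinship value. The sketch below assumes the first column is f and the second is lam, which the evenly spaced first column suggests but this page does not state. The linear interpolation and the clamping at both ends are illustrative choices, not necessarily what PRIMAL does, and parse_lam and lam_at are hypothetical names.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple


def parse_lam(lines: Iterable[str]) -> List[Tuple[float, float]]:
    """Read (f, lam) pairs, ASSUMING column 1 is f and column 2 is lam."""
    points = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        points.append((float(fields[0]), float(fields[1])))
    points.sort()  # make sure f is increasing
    return points


def lam_at(points: List[Tuple[float, float]], f: float) -> float:
    """Linearly interpolate lam at f, clamping outside the tabulated range."""
    fs = [p[0] for p in points]
    if f <= fs[0]:
        return points[0][1]
    if f >= fs[-1]:
        return points[-1][1]
    i = bisect_right(fs, f)  # points[i-1].f <= f < points[i].f
    (f0, l0), (f1, l1) = points[i - 1], points[i]
    t = (f - f0) / (f1 - f0)
    return l0 + t * (l1 - l0)


# The first three example rows shown above.
example = [
    "5.000000000000000104e-03 6.660384666666666620e-01",
    "1.499999999999999944e-02 6.942750542168673045e-01",
    "2.500000000000000139e-02 6.961791467391308386e-01",
]
points = parse_lam(example)
print(lam_at(points, 0.01))  # halfway between the first two rows
```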
The impute.io Python directory includes useful modules for converting a .
$OBER/impute/batch contains useful programs for splitting a PLINK TPED data set into chromosomal data sets and preparing the additional files.
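To illustrate what splitting a TPED data set by chromosome involves, here is a minimal sketch. It is not the code in $OBER/impute/batch. It relies only on the PLINK TPED layout, in which the first four columns of each row are chromosome, SNP identifier, genetic distance and base-pair position. The function name and the input rows are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_tped_by_chromosome(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group TPED rows by the chromosome code in their first column."""
    by_chrom = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        chrom = line.split(None, 1)[0]
        by_chrom[chrom].append(line.rstrip("\n"))
    return dict(by_chrom)


# Made-up rows (two individuals, four alleles each), NOT real data.
toy_tped = [
    "21 snp_a 0 100 A A A G\n",
    "22 snp_b 0 200 T T C T\n",
    "22 snp_c 0 300 G G G G\n",
]
groups = split_tped_by_chromosome(toy_tped)
for chrom, rows in sorted(groups.items()):
    print(chrom, len(rows))
```

The TFAM file lists individuals rather than SNVs, so the same TFAM content applies to every chromosomal subset.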