IBD Segments - orenlivne/ober GitHub Wiki
- Phasing NPZ output `hutt.phased.npz` (for a single chromosome).
- Pairwise kinship file for all individuals in the pedigree (both genotyped and untyped). Independent of the chromosome.
- Segment file `segments.out`. Each row corresponds to a segment between two haplotypes, in the format
`snv_index_start snv_index_end bp_start bp_end id1 allele1 id2 allele2`
The framework SNVs are numbered 0..m-1. `bp_start` is the base-pair location of SNV number `snv_index_start` (similarly for `bp_end` and `snv_index_end`). A haplotype is identified by a pair `(id,allele)`, where `id` is the 0-based index of the individual in the `Problem` file (individuals are automatically numbered serially this way when the `Problem` object is created) and `allele` is the allele index, where 0=paternal and 1=maternal. For instance, `(100,1)` represents the maternal haplotype of individual number 101 in the original TFAM file.
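As a quick illustration of the haplotype convention above, the following minimal sketch turns an `(id,allele)` pair into a readable description. The helper name `describe_haplotype` and the lookup table are hypothetical and not part of the pipeline.

```python
# Allele index convention of the segment file: 0 = paternal, 1 = maternal.
PARENT_OF_ALLELE = {0: 'paternal', 1: 'maternal'}

def describe_haplotype(sample_id, allele):
    '''Describe the haplotype (sample_id, allele). Hypothetical helper.
    sample_id is the 0-based Problem index, so the 1-based position of the
    individual in the TFAM file is sample_id + 1.'''
    return '%s haplotype of individual number %d' % (PARENT_OF_ALLELE[allele], sample_id + 1)

print(describe_haplotype(100, 1))
```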
Example of segments.out:
188 244 18896301 19736003 0 1 0 0 <-- a short HBD segment between individual 0's haplotypes
1036 1159 28165003 29828751 0 1 0 0
188 244 18896301 19736003 0 1 0 0
1036 1159 28165003 29828751 0 1 0 0
1186 1232 30088671 30735908 0 0 1 1
0 142 16484792 18369916 2 0 0 0
239 279 19582066 20742450 0 0 2 1
2560 2993 45115579 48742097 0 1 2 0
275 287 20075858 20764994 0 1 2 1
2560 2799 45115579 47500783 0 0 3 1 <-- segment between 0's paternal haplotype and 3's maternal haplotype
...
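The row format can be read back with a few lines of Python. This is a minimal sketch, not code from the repository: `Segment`, `parse_segment` and `is_hbd` are hypothetical names, and it assumes that any text after the eight integer columns (such as the `<--` annotations in the example) can be ignored.

```python
from collections import namedtuple

# Field names follow the column order documented for segments.out.
Segment = namedtuple(
    'Segment',
    'snv_index_start snv_index_end bp_start bp_end id1 allele1 id2 allele2')

def parse_segment(line):
    '''Parse one segments.out row. Only the first 8 whitespace-separated
    columns are used; trailing annotations are dropped.'''
    return Segment(*[int(x) for x in line.split()[:8]])

def is_hbd(segment):
    '''True if the segment lies between the two haplotypes of one individual.'''
    return segment.id1 == segment.id2 and segment.allele1 != segment.allele2

rows = [
    '188 244 18896301 19736003 0 1 0 0',
    '2560 2799 45115579 47500783 0 0 3 1',
]
for seg in map(parse_segment, rows):
    # Print both sample ids, the segment length in base pairs, and the HBD flag.
    print(seg.id1, seg.id2, seg.bp_end - seg.bp_start, is_hbd(seg))
```

Placeholder rows such as the `...` line in the example are not valid input for this parser and would have to be skipped by the caller.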
Main script: `$OBER/code/impute/batch/pipeline-phase` (the same as for phasing); make sure to run it with the `-i` flag.
The script also splits the data set into chromosomes and converts it into the PRIMAL npz format.
Usage: pipeline-phase <work-dir> <architecture> <plink-data-set>
The full phasing and imputation pipeline. Creates temp and submission files
under work-dir. Architecture can be beagle or cri. plink-data-set
is the absolute path to the common prefix of the input PLINK binary files
(.bed,.bim).
Optional flags:
-s start-chr Start processing from this chromosome index. Default: 22
-e stop-chr Stop processing at this chromosome index. Default: 22
-p Run phasing stage.
-i Run IBD segment creation.
-t Run IBD segment post-processing.
-n Run IBD segment indexing.
-o Run parent-of-origin alignment.
-c Clean the output directory first.
-g Generate the pipeline.
-r Run the pipeline.
Example:
pipeline-phase -i $OBER/out/impute_cgi_work beagle $OBER/data/hutt
If your cluster doesn't have a custom architecture like Beagle, use the value `cri`.
If your machine can run `num_processes` parallel threads (use 1 for a serial run, or 4 if you have a 4-core machine), use the following command to generate all IBD segments for a single chromosome.
python $OBER/code/impute/batch/ibd_segments_threaded.py -p <num_processes> hutt.phased.npz hutt.kinship > segments.out
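If the command needs to be launched from another Python script, its argument list can be assembled as in the sketch below. The function `ibd_segments_command` is a hypothetical wrapper, and it assumes that the `OBER` environment variable points at the repository root; only the script path, the `-p` flag and the two input files come from the command above.

```python
import os

def ibd_segments_command(phased_npz, kinship_file, num_processes=1):
    '''Build the argument list for the ibd_segments_threaded.py call.
    Hypothetical wrapper; falls back to the literal string $OBER when the
    OBER environment variable is not set.'''
    script = os.path.join(os.environ.get('OBER', '$OBER'),
                          'code', 'impute', 'batch', 'ibd_segments_threaded.py')
    return ['python', script, '-p', str(num_processes), phased_npz, kinship_file]

cmd = ibd_segments_command('hutt.phased.npz', 'hutt.kinship', num_processes=4)
print(' '.join(cmd))

# To run it and write the segments to a file (not executed here):
# import subprocess
# with open('segments.out', 'w') as out:
#     subprocess.check_call(cmd, stdout=out)
```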
The module `im.ibd.dist.ibd_distant_hap` provides several calls:
def sample_segments((problem, i, j, params, lock)):
    '''Return IBD segments between phased samples i,j.'''

def among_samples_segments(problem, s, params):
    '''Return IBD segments among samples in the set s. Only supports a serial run for now.'''

def between_samples_segments(problem, s, t, params, num_processes=1):
    '''Return IBD segments between samples in the set s and samples in the set t.
    s,t should be disjoint (if not, duplicate pairs will be present in the result set).
    If num_processes > 1, uses a multiprocessing pool with num_processes parallel processes.'''

def paired_samples_segments(problem, s, params):
    '''Return IBD segments between pairs of samples in the tuple list s.'''

def hap_segments_from_problem(problem, hap1, hap2, params):
    '''A utility method that conveniently accepts a problem object directly. Useful in a non-parallel
    environment where we don't need to serialize method arguments.'''
    hap_segments(IbdProblem(problem, hap1, hap2, None, params))

def hap_segments(h):
    '''Return IBD segments between two phased sample haplotypes (i,a),(j,b) encapsulated by
    the IbdProblem object 'h'.'''
The most commonly used method in batch mode is `hap_segments`: many `IbdProblem` instances are generated and passed to it. The other methods are mainly for debugging purposes.
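The disjointness requirement in the `between_samples_segments` docstring can be illustrated without the library. The sketch below assumes that the function visits every pair in the Cartesian product of `s` and `t`; this is an assumption about its internals, and `sample_pairs` and `repeated_pairs` are hypothetical helpers, not part of `ibd_distant_hap`.

```python
from itertools import product

def sample_pairs(s, t):
    '''All (i, j) pairs with i in s and j in t. Hypothetical stand-in for the
    pair enumeration assumed to happen inside between_samples_segments.'''
    return list(product(s, t))

def repeated_pairs(pairs):
    '''Unordered pairs that occur more than once, treating (i, j) and (j, i)
    as the same pair of samples.'''
    seen, repeated = set(), set()
    for i, j in pairs:
        key = (min(i, j), max(i, j))
        if key in seen:
            repeated.add(key)
        seen.add(key)
    return sorted(repeated)

# Disjoint sets: every pair of samples is visited once.
print(repeated_pairs(sample_pairs([0, 1], [2, 3])))
# Samples 1 and 2 belong to both sets: the pair (1, 2) is visited twice.
print(repeated_pairs(sample_pairs([0, 1, 2], [1, 2])))
```

Under the same assumption, overlapping sets also produce self-pairs such as `(1, 1)`, so keeping `s` and `t` disjoint avoids both effects.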