IBD Segments - orenlivne/ober GitHub Wiki

Input

Phasing NPZ output hutt.phased.npz (for a single chromosome).
Pairwise kinship file for all individuals in the pedigree (both genotyped and untyped). Independent of the chromosome.

Output

Segment file segments.out. Each row corresponds to a segment between two haplotypes in the format

snv_index_start snv_index_end bp_start bp_end id1 allele1 id2 allele2

The framework SNVs are numbered 0..m-1. bp_start is the base-pair location of SNV number snv_index_start (similarly end). A haplotype is identified by a pair (id,allele), where id is the 0-based index of the individual in the Problem file (individuals are automatically numbered serially this way when the Problem object is created), and an allele index, where 0=paternal and 1=maternal. For instance, (100,1) represents the maternal haplotype of individual number 101 in the original TFAM file.

Example of segments.out:

188 244 18896301 19736003 0 1 0 0      <-- a short HBD segment between individual 0's haplotypes
1036 1159 28165003 29828751 0 1 0 0
188 244 18896301 19736003 0 1 0 0
1036 1159 28165003 29828751 0 1 0 0
1186 1232 30088671 30735908 0 0 1 1
0 142 16484792 18369916 2 0 0 0
239 279 19582066 20742450 0 0 2 1
2560 2993 45115579 48742097 0 1 2 0
275 287 20075858 20764994 0 1 2 1
2560 2799 45115579 47500783 0 0 3 1    <-- segment between 0's paternal haplotype and 3's maternal haplotype
...

Parallel Pipeline

Main script: $OBER/code/impute/batch/pipeline-phase (the same as for phasing). Just make sure to run it with the -i flag.

Contains the useful splitting of the data set into chromosomes and conversion into the PRIMAL npz format.

Usage: pipeline-phase <work-dir> <architecture> <plink-data-set>

The full phasing and imputation pipeline. Creates temp and submission files
under work-dir. Architecture can be beagle or cri. plink-data set
is the absolute path to the common prefix of the input PLINK binary files
(.bed,.bim).

Optional flags:
        -s start-chr    Start processing from this chromosome index. Default: 22
        -e stop-chr     Stop processing at this chromosome index. Default: 22
        -p              Run phasing stage.
        -i              Run IBD segment creation.
        -t              Run IBD segment post-processing.
        -n              Run IBD segment indexing.
        -o              Run parent-of-origin alignment.
        -c              Clean the output directory first.
        -g              Generate the pipeline.
        -r              Run the pipeline.```

Example:

pipeline-phase -i $OBER/out/impute_cgi_work beagle $OBER/data/hutt

If your cluster doesn't have custom architecture like Beagle, use the value cri.

Serial Mode

If your machine can run num_processes parallel threads (use 1 for serial or 4 if you have a 4-core machine), use the following command to generate all IBD segments for a single chromosome.

python $OBER/code/impute/batch/ibd_segments_threaded.py -p <num_processes> hutt.phased.npz hutt.kinship > segments.out

Python API

The module ``im.ibd.dist.ibd_distant_hap` provides several calls:

def sample_segments((problem, i, j, params, lock)):
    '''Return IBD segments between phased samples i,j.'''

def among_samples_segments(problem, s, params):
    '''Return IBD segments among samples in the set s. Only supports a serial run for now.'''

def between_samples_segments(problem, s, t, params, num_processes=1):
    '''Return IBD segments between samples in the set s and samples in the set t.
    s,t should be disjoint (if not, duplicate pairs will be present in the result set).
    If num_processes > 1, uses a multiprocessing pool with num_processes parallel processes.'''

def paired_samples_segments(problem, s, params):
    '''Return IBD segments between pairs of samples in the tuple list s.'''

def hap_segments_from_problem(problem, hap1, hap2, params):
   '''A utility method that conveniently accepts a problem object directly. Useful in a non-parallel
    environment where we don''t need to serialize method arguments.'''
    hap_segments(IbdProblem(problem, hap1, hap2, None, params))

def hap_segments(h):
    '''Return IBD segments between two phased sample haplotypes (i,a),(j,b) encapsulated by
    the IbdProblem object ''h''.'''

The most commonly used method in batch mode is hap_segments, and many IbdProblems instances are generated and passed to it. The other methods are more for debugging purposes.