Understanding makeDesignFiles.py - ShubhankarLondhe/diff-chick GitHub Wiki
makeDesignFiles.py is used to generate the required files to run CHiCAGO. These files contain information on the regions around the baits that have been specified in the .baitmap file.
The following parameters are given to the code:
- Minimum fragment length (minFragLen) - Only consider the restriction fragments which are longer than this parameter.
- Maximum fragment length (maxFragLen) - Only consider the restriction fragments which are shorter than this parameter.
- Maximum length for Brownian estimation (maxLBrownEst) - The Brownian estimation of a bait fragment is only done with other fragments that are at a distance less than this parameter (since at large distance, the contribution to the estimate is very low).
- Bin size (binsize) - Sizes of the bins which will be created around the bait fragment.
- Remove bait-to-bait comparison (removeb2b) - A boolean parameter to exclude bait-to-bait interactions from the calculation of the estimates.
- Remove adjacent (removeAdjacent) - A boolean parameter to exclude interaction between bait fragments and their adjacent fragments from the calculation of the estimates.
- Restriction map file (rmapfile) - path to the .rmap file.
- Bait map file (baitmapfile) - path to the .baitmap file.
- Prefix for output files (outfilePrefix).
- Design directory address (designDir).
The code is called using the command:
python makeDesignFiles.py [--minFragLen = ]
[--maxFragLen = ]
[--maxLBrownEst = ]
[--binsize = ]
[--removeb2b = True]
[--removeAdjacent = True]
[--rmapfile = /.rmap]
[--baitmapfile = /.baitmap]
[--outfilePrefix = ]
[--designDir = .]
There are 2 input files to the code: the .rmap file and the .baitmap file.
Description of the .rmap file:
The .rmap file contains information about the restriction fragments that have been produced during the experiment. The format of the file is as follows:
20 1 60024 403460
20 60025 62654 403461
20 62655 66475 403462
20 66476 71941 403463
20 71942 73977 403464
Each line describes a fragment.
Column 1: describes the chromosome on which the fragment is found.
Column 2: describes the position of the first nucleotide of the fragment.
Column 3: describes the position of the last nucleotide of the fragment.
Column 4: describes the fragment ID.
Note: All the fragments that are present in this file will be called "fragments".
Description of the .baitmap file:
The bait fragments are a subset of the restriction fragments that are of interest. Studying the interaction between these baits to the rest of the genome is the objective of Capture Hi-C. The .baitmap file gives information on these bait fragments. The format of the file is as follows:
20 66476 71941 403463 DEFB125
20 119103 138049 403475 DEFB126
20 161620 170741 403482 DEFB128
20 206075 209203 403491 DEFB129
20 233983 239479 403499 DEFB132
Each line describes a bait fragment.
Column 1: the chromosome on which the bait fragment is found.
Column 2: the position of the first nucleotide of the bait fragment.
Column 3: the position of the last nucleotide of the bait fragment.
Column 4: the fragment ID corresponding to the .rmap file.
Column 5: the bait ID.
Note: All the fragments that are present in this file will be called "baits".
The code outputs 3 files: .poe file, .npb file, and .nbpb file.
Description of the .poe file:
# minFragLen=150 maxFragLen=40000 maxLBrownEst=1500000 binsize=20000 removeb2b=True removeAdjacent=True rmapfile=./h19_chr20and21.rmap baitmapfile=./h19_chr20and21.baitmap
403463 403461 7869
403463 403465 5827
403463 403466 11771
403463 403467 19559
The first line of the file describes the parameters given to the code.
Each subsequent line describes a bait-fragment pair. The columns are as follows:
- Column 1 - describes the bait ID.
- Column 2 - describes the fragment ID.
- Column 3 - describes the distance between these 2 IDs.
It is important to note that not all bait-fragment pair is considered. The distance between a bait and a fragment is calculated and written to this file only if these conditions are satisfied:
- The length of the fragment is more than the minimum length given by minFragLen.
- The length of the fragment is less than the maximum length given by maxFragLen.
- If removeb2b is True, then the fragment should not be a bait itself. (As stated earlier, baits are a subset of fragments. Hence all the baits are also found in fragments list)
- If removeAdjacent is True, then the fragment adjacent to the bait is not considered.
- If the distance between bait and fragment is more than the value specified in maxLBrownEst, then the fragment is not considered.
Note: If a bait-fragment pair satisfy all these 5 conditions, then I shall call the pair valid bait-fragment pair.
The distance between the fragment and the bait is defined by the distance between their midpoints. From the .rmap and .baitmap files, we get the information about the starting nucleotide and end nucleotide position for the fragments and baits. Let f_st and f_end denote the starting and end position of the fragment and b_st and b_end denote the starting and end position of the bait. Then the distance is given by:
Distance = Absolute value of { [(f_st + f_end)/2] - [(b_st + b_end)/2] }
Description of the .npb file:
# minFragLen=150 maxFragLen=40000 maxLBrownEst=1500000 binsize=20000 removeb2b=True removeAdjacent=True rmapfile=./h19_chr20and21.rmap baitmapfile=./h19_chr20and21.baitmap
403463 4 4 3 4 1 4 4 6 6 1 1 3 4 2 1 2 2 2 1 4 5 4 4 3 3 2 3 3 1 3 4 7 2 2 4 5 2 2 2 1 8 2 4 2 7 1 5 8 6 6 3 3 8 4 5 2 5 5 5 8 5 3 0 1 4 6 2 7 9 4 6 5 8 4 5
403475 6 6 7 6 6 6 1 1 3 4 2 1 2 2 2 1 3 5 5 4 3 3 2 3 3 1 3 4 7 2 2 4 5 2 2 2 1 8 2 4 2 6 2 5 8 6 6 3 3 7 5 5 2 5 5 5 8 5 3 0 1 4 6 2 6 10 4 6 5 8 4 4 6 4 5
403482 4 7 8 10 6 2 4 4 1 2 2 1 3 0 4 5 4 5 2 4 2 3 2 2 3 3 8 2 2 3 6 1 2 3 1 8 2 3 3 4 3 5 8 6 7 2 3 8 5 5 2 5 4 6 5 7 4 0 1 3 7 1 5 11 4 7 4 7 6 3 7 3 5 5 5
403491 8 9 4 3 7 9 4 4 2 2 2 1 3 5 4 5 3 3 2 3 3 1 3 4 7 2 2 3 6 2 2 2 1 8 2 3 3 5 3 5 8 5 7 2 4 7 5 5 2 5 4 6 6 6 4 0 1 4 6 2 5 11 4 6 5 7 5 4 6 4 5 4 5 4 4
The first line of the file describes the parameters given to the code.
In order to understand the output in the rest of the lines, the implementation of bins has to be understood first. Bins are created with respect to a bait. The number of bins created around a bait is given by the formula:
Number of bins = round ( maxLBrownEst / binsize )
As we can see from the header of the .npb file, maxLBrownEst = 1500000 and binsize = 20000. Hence the number of bins should be 75. Now we go to the output file. The first column of the output file describes the bait ID. There are 75 columns after the bait ID column each representing a bin. Column 2 represents the bin closest to the bait and the consecutive columns represent bins with increasing distance from the bait. So column 76 has the bin which is farthest away from the bait.
Before talking about the numbers that are in these columns, we need to be clear about the directional invariance. Let's consider 2 fragments. Both these fragments are at a distance of 10,000 from the bait. But one fragment (let's call it A) is to the right of the bait and the other (let's call it B) is to the left. Therefore these two fragments are in different directions with respect to the bait. But they will still be put in the same bin since what matters is their distance from the bait and not the direction in which they are found.
Now, in the output file, the numbers that are in column 2 - 76 are the number of fragments that are in the bins which each column corresponds to. A fragment goes to the bin n (bin 1 is closest and bin 76 is farthest) where:
n = floor ( d / binsize )
where d is the distance between bait and fragment.
Note: The fragment will only be considered for the given bait if the bait-fragment pair is a valid bait-fragment pair (Conditions for this is given in "Description of the .poe file").