unambiguous_codes - SimonHegele/SSfSBT GitHub Wiki
Replacing ambiguous codes in FASTA/FASTQ-files.
The genomic nucleotides are adenine, cytosine, guanine and thymine, denoted as A, C, G and T. When the identity of a nucleotide is uncertain, other characters can be used to denote the set of possible bases (see IUPAC Codes). However, many bioinformatics tools do not accept FASTA/FASTQ files that use these codes.
Ambiguous codes are replaced with a randomly selected base from A, C, G, T. Within a sequence, identical ambiguity codes are replaced with the same base. The total number of replacements and the number of replacements per ambiguity code are reported. Additionally, an "uncertainty" is reported, which is calculated as:
$$ u = \frac{\sum_{ac} ap(ac) \cdot \left(1 - \frac{1}{pb(ac)}\right)}{\text{total bases}} $$
- ac ≙ Ambiguous code
- ap(ac) ≙ Appearances of ac in all sequences
- pb(ac) ≙ Number of possible bases represented by ac
Assuming all non-ambiguous bases are correct, this is equivalent to the expected error rate.
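The replacement rule and the uncertainty formula described above can be sketched in a few lines of Python. This is a minimal illustration, not the tool's actual implementation: the names `IUPAC`, `replace_ambiguous` and `uncertainty` are hypothetical, and the sketch assumes that each replacement base is drawn uniformly from the bases the ambiguity code represents (the assumption under which the uncertainty equals the expected error rate).

```python
import random

# IUPAC ambiguity codes and the bases each one represents.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def replace_ambiguous(seq, rng):
    """Return (new_sequence, counts) for one sequence.

    Identical ambiguity codes within the sequence receive the same base.
    """
    chosen = {}  # ambiguity code -> base picked for this sequence
    counts = {}  # ambiguity code -> number of replacements
    out = []
    for ch in seq:
        if ch in IUPAC:
            if ch not in chosen:
                # Assumption: draw uniformly from the bases the code represents.
                chosen[ch] = rng.choice(IUPAC[ch])
            counts[ch] = counts.get(ch, 0) + 1
            out.append(chosen[ch])
        else:
            out.append(ch)
    return "".join(out), counts

def uncertainty(counts, total_bases):
    """u = sum over codes of ap(ac) * (1 - 1/pb(ac)), divided by total bases."""
    weighted = sum(n * (1 - 1 / len(IUPAC[code])) for code, n in counts.items())
    return weighted / total_bases

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    new_seq, counts = replace_ambiguous("ACGRRTNNY", rng)
    print(new_seq, counts, uncertainty(counts, len(new_seq)))
```

Keeping the code-to-base mapping local to each call is what makes identical codes consistent within a sequence while still allowing different choices across sequences.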
usage: unambiguous_codes [-h] [-t THREADS] in_file out_file
Replacing ambigouity codes in FASTA/FASTQ with A,C,G or T.
positional arguments:
in_file
out_file
options:
-h, --help show this help message and exit
-t THREADS, --threads THREADS
Number of parallel threads [default: 1]
Exemplary report (stdout)
Ambiguity code Bases Replaced
0 R [A, G] 6977
1 Y [C, T] 6323
2 S [G, C] 2798
3 W [A, T] 3806
4 K [G, T] 7836
5 M [A, C] 6576
6 B [C, G, T] 626
7 D [A, G, T] 533
8 H [A, C, T] 496
9 V [A, C, G] 749
10 N [A, C, G, T] 0
11 All [A, C, G, T] 36720
# Sequences: 1000000
# Bases: 1439072372
Uncertainty: 0.0000130366
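As a worked check, the uncertainty in the exemplary report can be recomputed from its per-code counts. The short Python sketch below only copies the numbers from the report; the variable names are illustrative and not part of the tool.

```python
# Replacement counts per ambiguity code, copied from the exemplary report.
replaced = {
    "R": 6977, "Y": 6323, "S": 2798, "W": 3806, "K": 7836, "M": 6576,
    "B": 626, "D": 533, "H": 496, "V": 749, "N": 0,
}
# pb(ac): number of possible bases represented by each code.
possible_bases = {
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}
total_bases = 1439072372

total_replaced = sum(replaced.values())
u = sum(n * (1 - 1 / possible_bases[c]) for c, n in replaced.items()) / total_bases

print(total_replaced)  # 36720, matching the "All" row
print(f"{u:.10f}")     # 0.0000130366, matching the reported uncertainty
```

The two-base codes contribute 34316 × 1/2 and the three-base codes 2404 × 2/3, so roughly 18761 of the 1439072372 bases are expected to be wrong after replacement.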