unambiguous_codes - SimonHegele/SSfSBT GitHub Wiki

Replaces ambiguity codes in FASTA/FASTQ files.

The genomic nucleotides are adenine, cytosine, guanine and thymine, denoted as A, C, G and T. When the identity of a nucleotide is uncertain, other characters can be used to denote the set of possible bases (see IUPAC codes). However, many bioinformatics tools do not accept FASTA/FASTQ files that use these codes.

Each ambiguity code is replaced with a randomly selected base from A, C, G, T. Within a sequence, identical ambiguity codes are replaced with the same base. The total number of replacements and the number of replacements per ambiguity code are reported. Additionally, an "uncertainty" is reported, which is calculated as:

$$ u = \frac{\sum_{ac} ap(ac) \cdot \left(1 - \frac{1}{pb(ac)}\right)}{\text{total bases}} $$

ac ≙ Ambiguous code
ap(ac) ≙ Appearances of ac in all sequences
pb(ac) ≙ Number of possible bases represented by ac

Assuming all non-ambiguous bases are correct and each replacement is drawn uniformly from the bases its code represents, this is equivalent to the expected error rate after replacement.
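
The replacement rule described above (one random base per ambiguity code, kept fixed within a sequence) can be illustrated with a minimal Python sketch. This is not the tool's actual source: the names `IUPAC` and `replace_ambiguous` are hypothetical. It also assumes that the base is drawn from the bases the code represents (which is what the uncertainty formula implies) and that only uppercase codes are handled.

```python
import random

# IUPAC ambiguity codes and the bases each one can stand for
IUPAC = {
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def replace_ambiguous(seq, rng=random):
    """Replace ambiguity codes in one sequence (hypothetical helper).

    Returns the new sequence and the number of replacements per code.
    Assumption: characters that are not uppercase IUPAC ambiguity codes
    are copied unchanged.
    """
    chosen = {}   # ambiguity code -> base picked for this sequence
    counts = {}   # ambiguity code -> number of replaced positions
    out = []
    for ch in seq:
        if ch in IUPAC:
            # Pick a base the first time a code is seen, then reuse it
            if ch not in chosen:
                chosen[ch] = rng.choice(IUPAC[ch])
            counts[ch] = counts.get(ch, 0) + 1
            out.append(chosen[ch])
        else:
            out.append(ch)
    return "".join(out), counts

# Both R positions receive the same base because they share one sequence
new_seq, counts = replace_ambiguous("ACGRTTRNAY", random.Random(0))
print(new_seq, counts)
```

Passing a seeded `random.Random` instance makes a run reproducible; the choice for each code is reset for every new sequence because `chosen` is local to the call.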

usage: unambiguous_codes [-h] [-t THREADS] in_file out_file

Replacing ambiguity codes in FASTA/FASTQ with A,C,G or T.

positional arguments:
  in_file
  out_file

options:
  -h, --help            show this help message and exit
  -t THREADS, --threads THREADS
                        Number of parallel threads [default: 1]

Example report (stdout)

    Ambiguity code         Bases  Replaced
0               R        [A, G]      6977
1               Y        [C, T]      6323
2               S        [G, C]      2798
3               W        [A, T]      3806
4               K        [G, T]      7836
5               M        [A, C]      6576
6               B     [C, G, T]       626
7               D     [A, G, T]       533
8               H     [A, C, T]       496
9               V     [A, C, G]       749
10              N  [A, C, G, T]         0
11            All  [A, C, G, T]     36720

# Sequences: 1000000
# Bases:     1439072372
Uncertainty: 0.0000130366
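
As a cross-check, the reported uncertainty can be recomputed from the per-code counts in the report above. The following Python sketch applies the formula directly; the names `POSSIBLE_BASES` and `uncertainty` are hypothetical and not taken from the tool.

```python
# pb(ac): number of bases each IUPAC ambiguity code can stand for
POSSIBLE_BASES = {
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3, "N": 4,
}

def uncertainty(replaced, total_bases):
    """Sum ap(ac) * (1 - 1/pb(ac)) over all codes, divided by total bases."""
    expected_errors = sum(
        n * (1 - 1 / POSSIBLE_BASES[code]) for code, n in replaced.items()
    )
    return expected_errors / total_bases

# ap(ac): "Replaced" column of the example report
report = {
    "R": 6977, "Y": 6323, "S": 2798, "W": 3806, "K": 7836, "M": 6576,
    "B": 626, "D": 533, "H": 496, "V": 749, "N": 0,
}

u = uncertainty(report, total_bases=1439072372)
print(sum(report.values()))  # 36720, the "All" row
print(f"{u:.10f}")           # 0.0000130366
```

The two-base codes contribute half a base of expected error per replacement and the three-base codes two thirds, giving about 18760.67 expected errors over 1439072372 bases.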