dsIUPAC - MetabolicEngineeringGroupCBMA/MetabolicEngineeringGroupCBMA.github.io GitHub Wiki
This document defines an extension to the IUPAC DNA alphabet for double-stranded DNA. The extension is called dsIUPAC and allows unambiguous description of a double-stranded DNA molecule with single-stranded regions using a single sequence of characters.
IUPAC
The IUPAC DNA alphabet is a set of symbols designated by the International Union of Pure and Applied Chemistry (IUPAC) to represent nucleotide bases in DNA sequences, including ambiguity codes for cases where multiple nucleotides are possible at a particular position. Here are the symbols and their meanings:
- A - Adenine
- T - Thymine
- C - Cytosine
- G - Guanine
Ambiguity codes (representing multiple possible nucleotides):
- R - Purine (A or G)
- Y - Pyrimidine (C or T)
- S - Strong interaction (G or C)
- W - Weak interaction (A or T)
- K - Keto group (T or G)
- M - Amino group (A or C)
- B - Not A (C, G, or T)
- D - Not C (A, G, or T)
- H - Not G (A, C, or T)
- V - Not T (A, C, or G)
- N - Any nucleotide (A, T, C, or G)
These symbols allow for flexibility in representing DNA sequences, especially when there is uncertainty in base composition at specific positions.
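As a small illustration of how the ambiguity codes can be used in practice, the sketch below stores the symbols listed above in a dictionary and checks whether a concrete sequence is allowed by an ambiguous pattern. The names `IUPAC_BASES` and `matches` are hypothetical and chosen only for this example.

```python
# Map each IUPAC DNA symbol to the bases it can stand for
# (the names IUPAC_BASES and matches are illustrative, not part of any library).
IUPAC_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def matches(pattern: str, sequence: str) -> bool:
    """Return True if every base of `sequence` is allowed by the
    IUPAC symbol at the same position of `pattern`."""
    if len(pattern) != len(sequence):
        return False
    return all(
        base in IUPAC_BASES[symbol]
        for symbol, base in zip(pattern.upper(), sequence.upper())
    )


print(matches("GGWCC", "GGACC"))  # True: W allows A
print(matches("GGWCC", "GGCCC"))  # False: W does not allow C
```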
dsIUPAC
Alphabet | Symbol | Complement | Bases |
---|---|---|---|
IUPAC | G | C | G |
" | A | T | A |
" | T | A | T |
" | C | G | C |
" | R | Y | G or A |
" | Y | R | T or C |
" | M | K | A or C |
" | K | M | G or T |
" | S | S | G or C |
" | W | W | A or T |
" | H | D | A or C or T |
" | B | V | G or T or C |
" | V | B | G or C or A |
" | D | H | G or A or T |
" | N | N | G or A or T or C |
RNA | U | A | U |
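dsIUPAC | E | F | A in top strand, complementary strand empty |
" | I | J | C in top strand, complementary strand empty |
" | P | Q | G in top strand, complementary strand empty |
" | X | Z | T in top strand, complementary strand empty |
" | Z | X | A in complementary strand, top strand empty |
" | Q | P | C in complementary strand, top strand empty |
" | J | I | G in complementary strand, top strand empty |
" | F | E | T in complementary strand, top strand empty |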
The choice of symbols for the dsIUPAC extension facilitates intuitive recognition of compatible single-stranded regions, i.e. sticky ends.
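The table can be turned into a simple encoder. The following is a minimal sketch, assuming the two strands are given as aligned strings of equal length, with the complementary strand written 3'->5' under the top strand and a space marking an empty position. The function name `encode` and the convention that a paired position is written as its top-strand base are assumptions made for this example, and the sketch does not check that paired bases are actually complementary.

```python
# dsIUPAC symbols for single-stranded positions, following the dsIUPAC table.
TOP_ONLY = {"A": "E", "C": "I", "G": "P", "T": "X"}     # base in top strand only
BOTTOM_ONLY = {"A": "Z", "C": "Q", "G": "J", "T": "F"}  # base in complementary strand only


def encode(top: str, bottom: str, gap: str = " ") -> str:
    """Encode two aligned strands of equal length as one dsIUPAC string.

    `bottom` is written 3'->5' under `top`; `gap` marks an empty position.
    Assumption: paired positions are written as the top-strand base.
    """
    if len(top) != len(bottom):
        raise ValueError("strands must be aligned to the same length")
    out = []
    for t, b in zip(top, bottom):
        if t != gap and b != gap:
            out.append(t)                       # double-stranded position
        elif t != gap:
            out.append(TOP_ONLY[t.upper()])     # top strand only
        elif b != gap:
            out.append(BOTTOM_ONLY[b.upper()])  # complementary strand only
        else:
            raise ValueError("position is empty in both strands")
    return "".join(out)


print(encode("GATCaaa    ",
             "    tttCTAG"))  # PEXIaaaQFZJ
```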
Example
Two double-stranded DNA molecules with compatible terminal 5' single-stranded overhangs:
GATCaaa         GATCaaa         ad-hoc representation
    tttCTAG         tttCTAG
PEXIaaaQFZJ     PEXIaaaQFZJ     representation using dsIUPAC
We can easily recognize that, alphabetically, `P` is followed by `Q`, `E` by `F` and `I` by `J`. This symmetry is broken only by the `X`, `Z` pair, out of necessity, since `Y` is already used in the IUPAC alphabet.
Two double-stranded DNA molecules with compatible terminal 3' single-stranded overhangs:
    aaaGATC         aaaGATC     ad-hoc representation
CTAGttt         CTAGttt
QFZJaaaPEXI     QFZJaaaPEXI     representation using dsIUPAC
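The compatibility of the overhangs in both examples can be checked mechanically with the Complement column of the dsIUPAC rows. The sketch below is one possible way to do this: it assumes upper-case dsIUPAC symbols, overhangs that anneal over their full length, and that a joined position is written as its top-strand base. The helper names (`right_overhang`, `left_overhang`, `compatible`, `join`) are hypothetical.

```python
# Complement column of the dsIUPAC rows: E<->F, I<->J, P<->Q, X<->Z.
DS_COMPLEMENT = str.maketrans("EFIJPQXZ", "FEJIQPZX")
SINGLE = set("EFIJPQXZ")                            # single-stranded symbols
TOP_BASE = {"E": "A", "I": "C", "P": "G", "X": "T"}  # top-only symbol -> base


def right_overhang(seq: str) -> str:
    """Trailing run of single-stranded symbols."""
    i = len(seq)
    while i > 0 and seq[i - 1] in SINGLE:
        i -= 1
    return seq[i:]


def left_overhang(seq: str) -> str:
    """Leading run of single-stranded symbols."""
    i = 0
    while i < len(seq) and seq[i] in SINGLE:
        i += 1
    return seq[:i]


def compatible(left: str, right: str) -> bool:
    """True if the right end of `left` can anneal to the left end of `right`
    (assumption: the two overhangs must pair over their full length)."""
    a, b = right_overhang(left), left_overhang(right)
    return bool(a) and a.translate(DS_COMPLEMENT) == b


def join(left: str, right: str) -> str:
    """Join two molecules; annealed positions become ordinary bases."""
    if not compatible(left, right):
        raise ValueError("ends are not compatible")
    a, b = right_overhang(left), left_overhang(right)
    # In each annealed pair exactly one symbol is top-strand only.
    merged = "".join(TOP_BASE.get(x) or TOP_BASE[y] for x, y in zip(a, b))
    return left[: len(left) - len(a)] + merged + right[len(b):]


print(join("PEXIaaaQFZJ", "PEXIaaaQFZJ"))  # PEXIaaaGATCaaaQFZJ
print(join("QFZJaaaPEXI", "QFZJaaaPEXI"))  # QFZJaaaGATCaaaPEXI
```

A 5' overhang molecule and a 3' overhang molecule are rejected by `compatible`, because both facing ends are then single-stranded on the same strand.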
Alphabets
ASCII CAPS = ABCDEFGHIJKLMNOPQRSTUVWXYZ
IUPAC      = ABCD  GH  K MN   RST VW Y
RNA        =                     U
dsIUPAC    =     EF  IJ     PQ      X Z   + IUPAC
still free =            L  O
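The letter bookkeeping above can be verified with a few set operations. This is only a sanity check of the listing; the variable names are chosen for this example.

```python
import string

IUPAC = set("ABCDGHKMNRSTVWY")   # four bases plus ambiguity codes
RNA = set("U")
DS_EXTENSION = set("EFIJPQXZ")   # symbols added by dsIUPAC

# The three groups must not share any letter.
assert IUPAC.isdisjoint(RNA)
assert IUPAC.isdisjoint(DS_EXTENSION)
assert RNA.isdisjoint(DS_EXTENSION)

used = IUPAC | RNA | DS_EXTENSION
free = sorted(set(string.ascii_uppercase) - used)
print(free)  # ['L', 'O']
```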
Representations of double stranded DNA
>format1 two strings & space
GATCaaa
    tttCTAG
>format2 two strings & pipe
GATCCaaaA||||
||||GTTTTCTAG
>format3 two strings & hyphen
GATCCaaaA----
----GTTTTCTAG
>format4 three strings, pipe & hyphen
GATCCaaaA----
|||||||||||||
----GTTTTCTAG
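A dsIUPAC string can be expanded back into these two-string layouts. The sketch below is a hypothetical helper, `to_two_strands`, that assumes paired positions contain only A, C, G or T (ambiguity codes would need a full IUPAC complement table) and that the complementary strand is written 3'->5' under the top strand. The gap character selects the layout: a space gives format1, a hyphen gives format3, and adding a line of pipes gives format4.

```python
# Complement for paired positions (assumption: only A, C, G, T occur there).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
TOP_ONLY = {"E": "A", "I": "C", "P": "G", "X": "T"}     # base in top strand only
BOTTOM_ONLY = {"Z": "A", "Q": "C", "J": "G", "F": "T"}  # base in complementary strand only


def to_two_strands(ds: str, gap: str = "-") -> tuple[str, str]:
    """Expand a dsIUPAC string into aligned top and complementary strands."""
    top, bottom = [], []
    for symbol in ds:
        if symbol in TOP_ONLY:
            top.append(TOP_ONLY[symbol])
            bottom.append(gap)
        elif symbol in BOTTOM_ONLY:
            top.append(gap)
            bottom.append(BOTTOM_ONLY[symbol])
        else:
            top.append(symbol)                        # paired position
            bottom.append(symbol.translate(COMPLEMENT))
    return "".join(top), "".join(bottom)


top, bottom = to_two_strands("PEXICaaaAQFZJ")
print(top)             # GATCCaaaA----
print("|" * len(top))  # middle line used by format4
print(bottom)          # ----GtttTCTAG
```

The case of paired bases is preserved, so the lower-case `aaa` in the input yields `ttt` in the complementary strand.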