sequence_formats - MetabolicEngineeringGroupCBMA/MetabolicEngineeringGroupCBMA.github.io GitHub Wiki
FASTA format
The FASTA format is the simplest text format for biological sequences that still allows some metadata for the sequence (source: NCBI).
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column, followed by a name that cannot contain spaces. Any text after the first space is treated as a free-form description:
>myDNAsequence
agctactactgagtcatcgtgtatgcgtatgatcatctatgcgtagtcgtacgtatctattcgatcgt
>myprotein 255 amino acids
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
A FASTA file can represent either a nucleotide or protein sequence.
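The record layout described above can be read with a few lines of code. The following is a minimal sketch, not a reference implementation: the function name `parse_fasta` and the shortened example records are assumptions made for illustration, and the parser simply treats everything up to the first whitespace on the ">" line as the name.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, description, sequence) for each record in FASTA text."""
    name, description, chunks = None, "", []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if name is not None:
                # A new header closes the previous record.
                yield name, description, "".join(chunks)
            # The name runs up to the first whitespace; the rest is description.
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence data may span several lines
    if name is not None:
        yield name, description, "".join(chunks)


# Shortened versions of the two example records (hypothetical test data).
example = """>myDNAsequence
agctactactgag
>myprotein 255 amino acids
MNSERSDVTLYQ
PFLDYAIAYMRS
"""

for name, description, sequence in parse_fasta(example):
    print(name, repr(description), len(sequence))
```

Sequence lines are concatenated without separators, so a record wrapped over several lines comes back as one string.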
Sequences are expected to be represented in the standard IUPAC amino acid or nucleic acid codes, with these exceptions:
Lower-case letters are accepted and treated the same as upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
The nucleic acid codes are:
A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanosine           W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          -     gap of indeterminate length
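The nucleic acid codes above map naturally onto a lookup table. The sketch below is one possible encoding, assuming the hypothetical names `IUPAC_NUCLEOTIDES`, `possible_bases` and `is_valid_nucleotide_sequence`; it expands an ambiguity code into the bases it stands for and checks that a sequence uses only the listed codes or the gap symbol.

```python
# Each IUPAC code mapped to the bases it can represent (from the table above).
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT",
}


def possible_bases(code: str) -> set:
    """Return the set of bases a single IUPAC code can stand for."""
    return set(IUPAC_NUCLEOTIDES[code.upper()])


def is_valid_nucleotide_sequence(sequence: str) -> bool:
    """Check that every character is an IUPAC code or the gap symbol '-'."""
    allowed = set(IUPAC_NUCLEOTIDES) | {"-"}
    # Lower-case letters are treated the same as upper-case.
    return all(ch in allowed for ch in sequence.upper())


print(sorted(possible_bases("r")))
print(is_valid_nucleotide_sequence("agctNN-rycg"))
print(is_valid_nucleotide_sequence("agct123"))
```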
The accepted amino acid codes are:
A  ALA  alanine                    P  PRO  proline
B  ASX  aspartate or asparagine    Q  GLN  glutamine
C  CYS  cysteine                   R  ARG  arginine
D  ASP  aspartate                  S  SER  serine
E  GLU  glutamate                  T  THR  threonine
F  PHE  phenylalanine              U  SEC  selenocysteine
G  GLY  glycine                    V  VAL  valine
H  HIS  histidine                  W  TRP  tryptophan
I  ILE  isoleucine                 Y  TYR  tyrosine
K  LYS  lysine                     Z  GLX  glutamate or glutamine
L  LEU  leucine                    X       any
M  MET  methionine                 *       translation stop
N  ASN  asparagine                 -       gap of indeterminate length
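The rule about numerical digits and the list of accepted amino acid codes can be combined into a small clean-up step. This is only a sketch under stated assumptions: the name `clean_protein_query` is hypothetical, digits and whitespace are removed rather than replaced by X, and any character outside the table above is reported as an error.

```python
import re

# One-letter codes accepted in the table above, plus '*' and the gap symbol.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")


def clean_protein_query(raw: str) -> str:
    """Remove digits and whitespace, upper-case, and verify the alphabet."""
    cleaned = re.sub(r"[\d\s]+", "", raw).upper()
    unknown = sorted(set(cleaned) - AMINO_ACID_CODES)
    if unknown:
        raise ValueError(f"unexpected characters: {''.join(unknown)}")
    return cleaned


# Position numbers copied along with a sequence are stripped out.
print(clean_protein_query("1 mnser 11 sdvtl*"))
```

Removing digits is the simpler of the two options the text allows; replacing them with X would preserve the sequence length instead, which may matter if positions are referenced elsewhere.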
GenBank format
GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section. The start of the annotation section is marked by a line beginning with the word "LOCUS". The start of the sequence section is marked by a line beginning with the word "ORIGIN", and the end of the section is marked by a line containing only "//".
Example:
LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
ACCESSION   AF068625 REGION: 1..200
VERSION     AF068625.2  GI:6449467
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 200)
  AUTHORS   Okano,M., Xie,S. and Li,E.
  TITLE     Cloning and characterization of a family of novel mammalian DNA
            (cytosine-5) methyltransferases
  JOURNAL   Nat. Genet. 19 (3), 219-220 (1998)
   PUBMED   9662389
REFERENCE   2  (bases 1 to 200)
  AUTHORS   Xie,S., Okano,M. and Li,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-MAY-1998) CVRC, Mass. Gen. Hospital, 149 13th Street,
            Charlestown, MA 02129, USA
REFERENCE   3  (bases 1 to 200)
  AUTHORS   Okano,M., Chijiwa,T., Sasaki,H. and Li,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-NOV-1999) CVRC, Mass. Gen. Hospital, 149 13th Street,
            Charlestown, MA 02129, USA
  REMARK    Sequence update by submitter
COMMENT     On Nov 18, 1999 this sequence version replaced gi:3327977.
FEATURES             Location/Qualifiers
     source          1..200
                     /organism="Mus musculus"
                     /mol_type="mRNA"
                     /db_xref="taxon:10090"
                     /chromosome="12"
                     /map="4.0 cM"
     gene            1..>200
                     /gene="Dnmt3a"
ORIGIN
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//
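The three markers described above ("LOCUS", "ORIGIN" and "//") are enough to separate the annotation from the sequence. The following is a minimal sketch, assuming a single record per input and a hypothetical helper named `split_genbank_record`; it does not interpret the annotation fields, and the record used here is an abridged copy of the example with most annotation lines left out.

```python
def split_genbank_record(text: str):
    """Split one GenBank record into (annotation_lines, sequence)."""
    annotation, sequence_parts = [], []
    section = None  # None before LOCUS, then "annotation", then "sequence"
    for line in text.splitlines():
        if line.strip() == "//":
            break  # end of the record
        if section is None:
            if line.startswith("LOCUS"):
                section = "annotation"
                annotation.append(line)
        elif section == "annotation":
            if line.startswith("ORIGIN"):
                section = "sequence"
            else:
                annotation.append(line)
        else:
            # Drop the leading position number and the spaces between blocks.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
    return annotation, "".join(sequence_parts)


# Abridged record: only three annotation lines are kept for brevity.
record = """LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
ORIGIN
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//
"""

annotation, sequence = split_genbank_record(record)
print(len(annotation), len(sequence))
```

For real files, an established parser such as Biopython's `Bio.SeqIO` is usually a better choice, since it also handles features, qualifiers and multi-record files.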