sequence_formats - MetabolicEngineeringGroupCBMA/MetabolicEngineeringGroupCBMA.github.io GitHub Wiki

FASTA format

The FASTA format is the simplest text format for biological sequences that still allows some metadata about the sequence (NCBI).

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.

The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column, followed immediately by a name that cannot contain spaces. Any text after the first space is treated as a free-text description:

>myDNAsequence
agctactactgagtcatcgtgtatgcgtatgatcatctatgcgtagtcgtacgtatctattcgatcgt


>myprotein 255 amino acids
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK

A FASTA file can represent either a nucleotide or protein sequence.
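
The structure described above (a ">" header line followed by one or more sequence lines) can be read with a few lines of Python. This is only a minimal sketch: the function name `parse_fasta` is hypothetical, and it assumes that everything after the first space in the header is a description and that blank lines can be ignored. For real work, an established parser such as the one in Biopython is usually a better choice.

```python
def parse_fasta(text):
    """Yield (name, description, sequence) tuples from FASTA-formatted text.

    Assumption: the header is '>name optional description' and the
    sequence may be wrapped over several lines.
    """
    name, description, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if name is not None:
                # a new header closes the previous record
                yield name, description, "".join(chunks)
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            description = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, description, "".join(chunks)


example = """>myDNAsequence
agctactactgagtcatcgtgt
atgcgtatgatcatctatgcg

>myprotein 255 amino acids
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGF
"""

records = list(parse_fasta(example))
for name, description, sequence in records:
    print(name, repr(description), len(sequence))
```

Wrapped sequence lines are joined into a single string, so the line width used in the file does not affect the result.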

Sequences are expected to be represented in the standard IUPAC amino acid or nucleic acid codes, with these exceptions:

Lower-case letters are accepted and treated the same as upper-case letters.
A single hyphen or dash can be used to represent a gap of indeterminate length.
For amino acid sequences, U and * are acceptable letters (see below).
Any numerical digits in the query sequence should either be removed or replaced by N for an unknown nucleic acid residue or X for an unknown amino acid residue.
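
The rules about letter case and digits can be applied mechanically before a sequence is written to a FASTA file. The sketch below is one possible way to do this; the function name `clean_sequence` and its parameters are hypothetical, and removing whitespace is an extra assumption that is convenient when pasting numbered sequence blocks.

```python
def clean_sequence(raw, is_protein=False, drop_digits=True):
    """Normalise a raw sequence string.

    - whitespace is removed (assumption, not part of the rules above)
    - letters are upper-cased, since case carries no meaning
    - each digit is removed, or replaced by N (nucleic acid)
      or X (amino acid) when drop_digits is False
    """
    unknown = "X" if is_protein else "N"
    cleaned = []
    for ch in raw:
        if ch.isspace():
            continue
        if ch.isdigit():
            if not drop_digits:
                cleaned.append(unknown)
            continue
        cleaned.append(ch.upper())
    return "".join(cleaned)


print(clean_sequence("  1 gaattc cggc"))
print(clean_sequence("ag1ct", drop_digits=False))
print(clean_sequence("MK2*", is_protein=True, drop_digits=False))
```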

The IUPAC nucleic acid codes are:

A --> adenosine           M --> A C (amino)
C --> cytidine            S --> G C (strong)
G --> guanine             W --> A T (weak)
T --> thymidine           B --> G T C
U --> uridine             D --> G A T
R --> G A (purine)        H --> A C T
Y --> T C (pyrimidine)    V --> G C A
K --> G T (keto)          N --> A G C T (any)
                          -     gap of indeterminate length
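
The table above maps naturally onto a dictionary from each code to the set of bases it can stand for. The following is a small sketch under that assumption; the names `IUPAC_NUCLEOTIDES`, `matches` and `expand` are made up for this example, and the gap character is simply passed through unchanged.

```python
from itertools import product

# Each code maps to the bases it may represent, as listed in the table.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT", "B": "GTC", "D": "GAT",
    "H": "ACT", "V": "GCA", "N": "AGCT",
    "-": "-",  # gap, kept as-is
}


def matches(code, base):
    """Return True if the (possibly ambiguous) code can stand for base."""
    return base.upper() in IUPAC_NUCLEOTIDES[code.upper()]


def expand(sequence):
    """List every unambiguous sequence a short ambiguous sequence can represent."""
    choices = [IUPAC_NUCLEOTIDES[ch.upper()] for ch in sequence]
    return ["".join(combo) for combo in product(*choices)]


print(matches("Y", "c"))
print(expand("ARG"))
```

The number of expansions grows multiplicatively with each ambiguous position, so `expand` is only practical for short sequences such as primers.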

The accepted amino acid codes are:

A ALA alanine                         P PRO proline
B ASX aspartate or asparagine         Q GLN glutamine
C CYS cysteine                        R ARG arginine
D ASP aspartate                       S SER serine
E GLU glutamate                       T THR threonine
F PHE phenylalanine                   U     selenocysteine
G GLY glycine                         V VAL valine
H HIS histidine                       W TRP tryptophan
I ILE isoleucine                      Y TYR tyrosine
K LYS lysine                          Z GLX glutamate or glutamine
L LEU leucine                         X     any
M MET methionine                      *     translation stop
N ASN asparagine                      -     gap of indeterminate length
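
A protein sequence can be checked against the accepted codes with a simple set lookup. This is a sketch only: `PROTEIN_CODES` and `invalid_residues` are hypothetical names, and the set contains exactly the one-letter codes from the table above.

```python
# One-letter codes from the table above, including *, - and X.
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")


def invalid_residues(sequence):
    """Return (position, character) pairs for characters not in the table.

    Positions are 1-based; lower-case letters are treated as upper-case.
    """
    return [
        (position, ch)
        for position, ch in enumerate(sequence, start=1)
        if ch.upper() not in PROTEIN_CODES
    ]


print(invalid_residues("MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGF"))
print(invalid_residues("MJK*o"))
```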

GenBank format

GenBank format (GenBank Flat File Format) consists of an annotation section and a sequence section. The start of the annotation section is marked by a line beginning with the word "LOCUS". The start of the sequence section is marked by a line beginning with the word "ORIGIN", and the end of the sequence section is marked by a line containing only "//".

Example:

LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
ACCESSION   AF068625 REGION: 1..200
VERSION     AF068625.2  GI:6449467
KEYWORDS    .
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 200)
  AUTHORS   Okano,M., Xie,S. and Li,E.
  TITLE     Cloning and characterization of a family of novel mammalian DNA
            (cytosine-5) methyltransferases
  JOURNAL   Nat. Genet. 19 (3), 219-220 (1998)
   PUBMED   9662389
REFERENCE   2  (bases 1 to 200)
  AUTHORS   Xie,S., Okano,M. and Li,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (28-MAY-1998) CVRC, Mass. Gen. Hospital, 149 13th Street,
            Charlestown, MA 02129, USA
REFERENCE   3  (bases 1 to 200)
  AUTHORS   Okano,M., Chijiwa,T., Sasaki,H. and Li,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (04-NOV-1999) CVRC, Mass. Gen. Hospital, 149 13th Street,
            Charlestown, MA 02129, USA
  REMARK    Sequence update by submitter
COMMENT     On Nov 18, 1999 this sequence version replaced gi:3327977.
FEATURES             Location/Qualifiers
     source          1..200
                     /organism="Mus musculus"
                     /mol_type="mRNA"
                     /db_xref="taxon:10090"
                     /chromosome="12"
                     /map="4.0 cM"
     gene            1..>200
                     /gene="Dnmt3a"
ORIGIN
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//
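
The three markers described above ("LOCUS", "ORIGIN" and "//") are enough to separate the annotation from the sequence in a single record. The sketch below illustrates this with a shortened copy of the example record; `split_genbank` is a hypothetical helper that does not interpret the annotation lines at all. A full parser, for example the GenBank reader in Biopython, is preferable when the annotation itself is needed.

```python
def split_genbank(text):
    """Return (annotation_lines, sequence) for the first record in text.

    LOCUS starts the annotation section, ORIGIN starts the sequence
    section and a line containing only // ends the record.
    """
    annotation, sequence_parts = [], []
    section = None
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            section = "annotation"
        elif line.startswith("ORIGIN"):
            section = "sequence"
            continue  # the ORIGIN line itself holds no sequence
        elif line.strip() == "//":
            break
        if section == "annotation":
            annotation.append(line)
        elif section == "sequence":
            # keep letters only: drops position numbers and spaces
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
    return annotation, "".join(sequence_parts)


record = """LOCUS       AF068625                 200 bp    mRNA    linear   ROD 06-DEC-1999
DEFINITION  Mus musculus DNA cytosine-5 methyltransferase 3A (Dnmt3a) mRNA,
            complete cds.
ORIGIN
        1 gaattccggc ctgctgccgg gccgcccgac ccgccgggcc acacggcaga gccgcctgaa
       61 gcccagcgct gaggctgcac ttttccgagg gcttgacatc agggtctatg tttaagtctt
      121 agctcttgct tacaaagacc acggcaattc cttctctgaa gccctcgcag ccccacagcg
      181 ccctcgcagc cccagcctgc
//
"""

annotation, sequence = split_genbank(record)
print(len(annotation), len(sequence))
```

The sequence length can be compared with the length stated on the LOCUS line (200 bp here) as a quick sanity check.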