February 2025

Egon A. Ozer, MD PhD
Ramon Lorenzo Redondo, PhD

There are certain file types you can expect to encounter consistently when working with sequencing and genomics data. This is not intended to be an exhaustive list, but more of a quick introduction and a reference.

Section 1 - FASTA

FASTA format is the most widely used file format for storing sequence data.

The basic format of a FASTA file is a header line that starts with the greater-than sign ">" followed by identifier information. There can be as much or as little information on this line as you want. Spaces and most special characters are allowed, though be aware that some programs will trim everything after the first space.

The next line and all subsequent lines after the header line contain the sequence associated with the header line. The sequence can all be on one line or split across several lines. Spaces and skipped lines are also allowed.

FASTA Example 1:

>AB040536.1:142-1209 Streptococcus pyogenes fbaA gene for fibronectin-binding protein, complete cds
ATGCGTAGAGCAGAAAATAACAAACACAGCCGCTATTCCATTCGCAAACTGAGCGTTGGGGTAACGAGTA
TAGCAATTGCGAGTCTCTTTTTAGGAAAGGTTGCCTATGCCGTAGATGGCATCCCTCCAATCTCTCTTAC
TCAAAAGACTACAGCCACTACATCAGAAAATTGGCATCATATTGATAAGGATGGCCTTATTCCTTTAGGT
ATAAGCTTAGAAGCTGCCAAAGAGGAATTTAAAAAAGAAGTAGAAGAATCACGTTTATCTGAAGCACAAA
AAGAAACGTATAAACAAAAAATTAAAACTGCACCAGACAAAGATAAGCTATTATTCACGTATCATAGTGA
GTATATGACAGCCGTTAAGGATCTTCCAGCGTCTACTGAGTCTACTACTCAGCCAGTTGAGGCACCCGTG
CAGGAGACACAGGCATCAGCTTCAGATTCGATGGTGACAGGTGATTCAACATCAGTTACGACTGATTCTC
CTGAGGAAACCCCATCTTCGGAAAGTCCAGTGGCCCCAGCTTTATCTGAGGCTCCAGCTCAACCAGCTGA
GAGTGAGGAACCTTCAGTAGCAGCATCTTCTGAGGAAACCCCATCTCCATCAACTCCAGCAGCCCCATCA
ACTCCAGCGGCTCCAGAAACTCCTGAAGAACCAGCAGCTCCATCTCAACCAGCTGAGAGTGAAGAATCTT
CAGTAGCAGCTACGACAAGCCCGTCTCCATCAACTCCAGCTGAATCAGAGACTCAGACGCCACCAGCTGT
TACTAAAGACTCTGATAAGCCATCTTCAGCAGCTGAAAAACCAGCAGCCTCTTCACTTGTTTCAGAACAA
ACCGTTCAACAACCAACTTCAAAGAGATCTTCTGATAAAAAAGAAGAGCAAGAACAGTCTTACTCTCCAA
ATCGCTCATTGTCAAGACAGGTTAGGGCCCATGAGTCAGGTAAGTACTTGCCTTCAACAGGTGAAAAAGC
ACAGCCACTCTTTATAGCTACTATGACTTTGATGTCTCTATTTGGCAGTCTTTTAGTCACAAAACGCCAA
AAAGAAACTAAAAAATAG

A single fasta file can contain multiple separate sequences as long as they are separated by distinct header lines.

FASTA Example 2:

>NODE_29_length_255_cov_42.625000
TTTAATTTGCGTTTGAACTTACTCGTTCCTTCTGTCGCTGACAGATTTATTTCTCGTTTC
TTGACGGGTAATATGTCTCCATATCACCCTCACGTTTGGTTCGTCTTATTCAGTTCTCAA
AGGTCTTCTAATCGGGAAGACAGGATTCGAACCTGCGACACCTTGGTCCCAAACCAAGTA
CTCTACCAAGCTGAGCTACTTCCCGAACTGATGCACCCTAGAGGAGTCGAACCTCTAACC
GCCTGATTCGTAGTC
>NODE_30_length_205_cov_19.102564
TCGAACCCGTGTTACCGCCGTGAAAAGGCGGTGTCTTAACCCCTTGACCAACGGACCATA
ATAATATAATTATAGATAATGGGCACGAGTGGACTCGAACCACCGACCTCACGCTTATCA
GGCGTGCGCTCTAACCACCTGAGCTACGCGCCCAAGCTTTTATTGATATAGCTTGGGAAA
ACTATAAAGCGGGTGACGAGAATCG
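
The rules described above (a ">" header line, sequence split across any number of lines, optional spaces and blank lines, multiple records per file) can be sketched in a few lines of Python. This is a minimal illustration rather than a production parser: the function name `parse_fasta` is an assumption made for this example, and the sequences are shortened versions of the records above.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank (skipped) lines are allowed
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Drop any spaces inside the sequence line before storing it
            chunks.append(line.replace(" ", ""))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened, illustrative records (not the full sequences shown above)
example = """>NODE_29_length_255_cov_42.625000
TTTAATTTGCGTTTGAACTTACTC
GTTCCTTCTGTCGCTGACAG

>NODE_30_length_205_cov_19.102564
TCGAACCCGTGTTACC GCCGTGAAAAGG
"""

for header, seq in parse_fasta(example):
    # Many tools keep only the text before the first space as the sequence ID
    seq_id = header.split()[0]
    print(seq_id, len(seq))
```

Splitting the header on whitespace mirrors the trimming behavior mentioned earlier, which is why identifiers placed before the first space should be unique within a file.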

FASTA sequences can contain either nucleotide or amino acid sequences.

FASTA Example 3:

>BAB62098.1 fibronectin-binding protein [Streptococcus pyogenes]
MRRAENNKHSRYSIRKLSVGVTSIAIASLFLGKVAYAVDGIPPISLTQKTTATTSENWHHIDKDGLIPLG
ISLEAAKEEFKKEVEESRLSEAQKETYKQKIKTAPDKDKLLFTYHSEYMTAVKDLPASTESTTQPVEAPV
QETQASASDSMVTGDSTSVTTDSPEETPSSESPVAPALSEAPAQPAESEEPSVAASSEETPSPSTPAAPS
TPAAPETPEEPAAPSQPAESEESSVAATTSPSPSTPAESETQTPPAVTKDSDKPSSAAEKPAASSLVSEQ
TVQQPTSKRSSDKKEEQEQSYSPNRSLSRQVRAHESGKYLPSTGEKAQPLFIATMTLMSLFGSLLVTKRQ
KETKK

Section 2 - FASTQ

FASTQ files usually contain sequence read data.

Each individual sequence record in a FASTQ file consists of four lines:

1. Sequence identifier, starts with the @ character
2. Nucleotide sequence
3. Separator line, starts with the + character
4. Quality values for the sequence in line 2

FASTQ Example (Two sequence reads):

@M01915:185:000000000-JLJ5N:1:1101:12139:1000 1:N:0:TAAGGCGA+GTAAGGAG
NGTTAGTCTAGCTGGAGAAAAGTCCAGACCGGTTAAACTAAAAGATGTGGATAATATTAGTTATCACAGAACACAGACTG
+
#8ACCGGGGGGGGGGFGGGGGGGGGGGFGC@EGGGGFGGGGGFGFGGGFCGGGFGGGGGGGGFGGGGFGGGGGGGGFFGG
@M01915:185:000000000-JLJ5N:1:1101:17584:1000 1:N:0:TAAGGCGA+GTAAGGAG
NTAACGTCAATTTCCTGCTGATTTCCAAAACGGATAACCATCTGATTTTCTGGAAGATATGGGTCAATAATCGGATCTCC
+
#8ACCGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGG<FGGGGGGGGGDG?FFGFGCFFEGGGFGG

Each letter in line 4 represents the "Phred" quality score for its corresponding base in line 2. The letters can be translated to values indicating the probability that the corresponding base is incorrectly called. These values are assigned by the sequencing instrument during the sequencing and base-calling process. For more info on Phred scores, see here.
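
As a rough sketch of how the quality letters translate to numbers: a Phred score Q relates to the error probability P by Q = -10 * log10(P). The example below assumes the common "Phred+33" encoding, where the score is the character's ASCII code minus 33. That offset is an assumption about the data (older Illumina files used a different offset), and the function names and the shortened read are made up for illustration.

```python
def parse_fastq(text):
    """Yield (identifier, sequence, quality string) from four-line FASTQ records."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        ident, seq, sep, qual = lines[i:i + 4]
        if not ident.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"Malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"Sequence/quality length mismatch in {ident}")
        yield ident[1:], seq, qual


def phred_scores(qual, offset=33):
    """Convert a quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Probability that a base with Phred score q was called incorrectly."""
    return 10 ** (-q / 10)


# Shortened, illustrative read (hypothetical identifier)
example = "@read1 1:N:0\nNGTTAGTC\n+\n#8ACCGGG\n"

for ident, seq, qual in parse_fastq(example):
    for base, q in zip(seq, phred_scores(qual)):
        print(base, q, f"{error_probability(q):.5f}")
```

Under this encoding a score of 20 corresponds to a 1 in 100 chance of an incorrect call, and a score of 30 to 1 in 1,000.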

Section 3 - GenBank

GenBank-formatted files (.gbk), sometimes referred to as GenBank Flat Files (.gbff), store detailed annotation data and other metadata about a sequence and often the nucleotide sequence itself. These files can contain information about a single gene or an entire genome.

GenBank Example (excerpt from a complete genome record):

LOCUS       NZ_CP010450          1791401 bp    DNA     circular CON 15-JAN-2022
DEFINITION  Streptococcus pyogenes strain NGAS638 chromosome, complete genome.
ACCESSION   NZ_CP010450
VERSION     NZ_CP010450.1
DBLINK      BioProject: PRJNA224116
            BioSample: SAMN03274510
            Assembly: GCF_001267845.1
KEYWORDS    RefSeq.
SOURCE      Streptococcus pyogenes
  ORGANISM  Streptococcus pyogenes
            Bacteria; Firmicutes; Bacilli; Lactobacillales; Streptococcaceae;
            Streptococcus.
REFERENCE   1  (bases 1 to 1791401)
  AUTHORS   Fittipaldi,N.
  TITLE     Direct Submission
  JOURNAL   Submitted (05-JAN-2015) Public Health Ontario, 661 University Ave,
            Toronto, Ontario M5G 1M1, Canada
FEATURES             Location/Qualifiers
     source          1..1791401
                     /organism="Streptococcus pyogenes"
                     /mol_type="genomic DNA"
                     /strain="NGAS638"
                     /db_xref="taxon:1314"
     gene            join(1790598..1791401,1..3)
                     /locus_tag="SD90_RS00005"
                     /old_locus_tag="SD90_00005"
     CDS             join(1790598..1791401,1..3)
                     /locus_tag="SD90_RS00005"
                     /old_locus_tag="SD90_00005"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_010922805.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="ParB/RepB/Spo0J family partition protein"
                     /protein_id="WP_011106931.1"
                     /translation="MFLILSRNSLMTKELLIDLPIEDIITNPYQPRIQFNQRELQDLA
                     TSIKSNGLIQPIIVRKSDIFGYELVAGERRLKASKMAGLKKVPAIIKKISTLESMQQA
                     IVENLQRSNLNAIEEAKAYQLLVEKKHMTHDEIAKYMGKSRPYISNTLRLLQLPAPII
                     KAIEEGKISAGHARALLTLSDDKQQLYLTHKIQNEGLSVRQIEQLVTSTPSSKLSKKT
                     KNIFATSLEKQLAKSLGLSVNMKLTANHSGYLQISFSNDDELNRIINKLK"
     gene            236..1591
                     /gene="dnaA"
                     /locus_tag="SD90_RS00010"
                     /old_locus_tag="SD90_00010"
     CDS             236..1591
                     /gene="dnaA"
                     /locus_tag="SD90_RS00010"
                     /old_locus_tag="SD90_00010"
                     /inference="COORDINATES: similar to AA
                     sequence:RefSeq:WP_012657571.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="WP_002987659.1"
                     /translation="MTENEQIFWNRVLELAQSQLKQATYEFFVHDARLLKVDKHIATI
                     YLDQMKELFWEKNLKDVILTAGFEVYNAQISVDYVFEEDLMIEQNQTKINQKPKQQAL
                     NSLPTVTSDLNSKYSFENFIQGDENRWAVAASIAVANTPGTTYNPLFIWGGPGLGKTH
                     LLNAIGNSVLLENPNARIKYITAENFINEFVIHIRLDTMDELKEKFRNLDLLLIDDIQ
                     SLAKKTLSGTQEEFFNTFNALHNNNKQIVLTSDRTPDHLNDLEDRLVTRFKWGLTVNI
                     TPPDFETRVAILTNKIQEYNFIFPQDTIEYLAGQFDSNVRDLEGALKDISLVANFKQI
                     DTITVDIAAEAIRARKQDGPKMTVIPIEEIQAQVGKFYGVTVKEIKATKRTQNIVLAR
                     QVAMFLAREMTDNSLPKIGKEFGGRDHSTVLHAYNKIKNMISQDESLRIEIETIKNKI
                     K"
ORIGIN
        1 tagcttgttg atattctgtt ttttcttttt tagttttcca cataaaaaat agttgaaaac
       61 aatagcggtg tcaccttaaa atgacttttc cacaggttgt ggagaaccca aattaacagt
      121 gttaatttat tttccacaga ttgtggaaaa actaactatt atccattgtt ctgtggaaaa
      181 ctagaatagt ttgtggtaga atagttctag aattatccac aagaaggaac ctagtatgac
      241 tgaaaatgaa caaatttttt ggaacagggt cttggaatta gctcagagtc aattaaaaca
      301 ggcaacttat gaattttttg ttcatgatgc ccgtctatta aaggtcgata agcatattgc
      361 aactatttac ttagatcaaa tgaaagaact cttttgggaa aaaaatctta aagatgttat
      421 tcttactgct ggttttgaag tttataacgc tcaaatttct gttgactatg ttttcgaaga
      481 agacctaatg attgagcaaa atcagaccaa aatcaatcaa aaacctaagc agcaagcctt
      541 aaattctttg cctactgtta cttcagattt aaactcgaaa tatagttttg aaaactttat
      601 tcaaggagat gaaaatcgtt gggctgttgc tgcttcaata gcagtagcta atactcctgg
      661 aactacctat aatcctttgt ttatttgggg tggccctggg cttggaaaaa cccatttatt
      721 aaatgctatt ggtaattctg tactattaga aaatccaaat gctcgaatta aatatatcac
      781 agctgaaaac tttattaatg agtttgttat ccatattcgc cttgatacca tggatgaatt
      841 gaaagaaaaa tttcgtaatt tagatttact ccttattgat gatatccaat ctttagctaa
      901 aaaaacgctc tctggaacac aagaagagtt ctttaatact tttaatgcac ttcataataa
      961 taacaaacaa attgtcctaa caagcgaccg tacaccagat catctcaatg atttagaaga
     1021 tcgattagtt actcgtttta aatggggatt aacagtcaat atcacacctc ctgattttga
     1081 aacacgagtg gctattttga caaataaaat tcaagaatat aactttattt ttcctcaaga
     1141 taccattgag tatttggctg gtcaatttga ttctaatgtc agagatttag aaggtgcctt
     1201 aaaagatatt agtctggttg ctaatttcaa acaaattgac acgattactg ttgacattgc
     1261 tgccgaagct attcgcgcca gaaagcaaga tggacctaaa atgacagtta ttcccatcga
     1321 agaaattcaa gcgcaagttg gaaaatttta cggtgttacc gtcaaagaaa ttaaagctac
     1381 taaacgaaca caaaatattg ttttagcaag acaagtagct atgtttttag cacgtgaaat
     1441 gacagataac agtcttccta aaattggaaa agaatttggt ggcagagacc attcaacagt
     1501 actccatgcc tataataaaa tcaaaaacat gatcagccag gacgaaagcc ttaggatcga
     1561 aattgaaacc ataaaaaaca aaattaaata acatgtggaa aagaatatct tttatgaaat
//

Like FASTA files, GenBank files can contain information about multiple separate sequences. Each record ends with a line containing two forward slashes (//), which separates it from the next record. There is a nice example GenBank file and detailed explainer at NCBI here.
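
To illustrate the record structure, here is a minimal Python sketch that splits GenBank text on the // lines, reads the name from each LOCUS line, and collects the sequence from the ORIGIN block. The function names and the two tiny records are invented for this example. For real annotation work, an established parser (for instance Biopython's `Bio.SeqIO` module) is likely a better choice than hand-written code like this.

```python
def split_genbank_records(text):
    """Split GenBank-formatted text into one string per record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            # A line of two slashes terminates the current record
            if any(ln.strip() for ln in current):
                records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    # Keep a trailing record even if its closing // is missing
    if any(ln.strip() for ln in current):
        records.append("\n".join(current))
    return records


def locus_name(record):
    """Return the second token of the LOCUS line, or None if absent."""
    for line in record.splitlines():
        if line.startswith("LOCUS"):
            return line.split()[1]
    return None


def origin_sequence(record):
    """Join the letters after ORIGIN, dropping position numbers and spaces."""
    seq, in_origin = [], False
    for line in record.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
        elif in_origin:
            seq.append("".join(ch for ch in line if ch.isalpha()))
    return "".join(seq)


# Two made-up miniature records, for illustration only
example = """LOCUS       EXAMPLE_1   12 bp    DNA     linear   UNK 01-JAN-2000
ORIGIN
        1 atgcatgcat gc
//
LOCUS       EXAMPLE_2   8 bp    DNA     linear   UNK 01-JAN-2000
ORIGIN
        1 ttgacatg
//
"""

for rec in split_genbank_records(example):
    print(locus_name(rec), origin_sequence(rec))
```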

Section 4 - SAM / BAM

SAM-formatted files are used to store read alignment information.

In SAM files, each line corresponds to a single sequence read. Information about the read is organized into columns:

Col	Field	Description
1	QNAME	Query template name, i.e. read ID
2	FLAG	Bitwise flag with read mapping information. See below for more information.
3	RNAME	Reference sequence name
4	POS	Leftmost mapping position of the read on the reference sequence
5	MAPQ	Mapping quality score
6	CIGAR	Information about which portions of the read mapped. More detail on CIGAR strings can be found in the SAM manual or this brief explainer
7	RNEXT	Reference sequence name of the paired (mate) read, if using paired reads; "=" if it is the same as RNAME
8	PNEXT	Leftmost mapping position of the paired read
9	TLEN	Template length, i.e. distance between the aligned read pairs
10	SEQ	Sequence of the read
11	QUAL	Base quality scores of the read, encoded as in FASTQ

Many alignment programs will include extra data about the alignment in columns 12 and up. You should refer to the alignment software's manual for help interpreting these data, if interested.
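
The column layout above can be sketched as a small Python function that splits one tab-separated alignment line into named fields. This is an illustrative sketch only: the function name, the "TAGS" key, and the shortened read are assumptions made for the example. Optional columns follow a TAG:TYPE:VALUE pattern (for example NM:i:0), and header lines beginning with @ would need to be skipped before calling it.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = {"FLAG", "POS", "MAPQ", "PNEXT", "TLEN"}


def parse_sam_line(line):
    """Parse one SAM alignment line into a dict keyed by field name."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("A SAM alignment line needs at least 11 columns")
    record = {}
    for name, value in zip(SAM_FIELDS, cols):
        record[name] = int(value) if name in INT_FIELDS else value
    # Columns 12 and up: optional TAG:TYPE:VALUE entries
    tags = {}
    for item in cols[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["TAGS"] = tags
    return record


# Shortened, illustrative alignment line (hypothetical read name)
example = "\t".join([
    "read1", "147", "NZ_CP010450", "2743", "60", "8M", "=", "2743", "-8",
    "ACCTTATT", "AGGGGGGG", "NM:i:0", "MD:Z:8",
])

rec = parse_sam_line(example)
print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["CIGAR"], rec["TAGS"])
```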

SAM Flags:

The second column in a SAM alignment file is an integer that represents the flag value. This value can be decoded to identify several properties of the alignment for an individual read, such as whether it mapped to the forward or reverse strand of the reference sequence or whether it is part of a pair of mapped reads. There is a useful website that allows you to quickly decode individual flag values: Decoding SAM Flags
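
Decoding works because the flag is a sum of powers of two, one per property. The sketch below tests each bit in turn. The bit values follow the SAM specification, while the short descriptions and the function name `decode_flag` are wording chosen for this example.

```python
FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read on reverse strand",
    0x20: "mate on reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "secondary alignment",
    0x200: "failed quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM flag."""
    return [desc for bit, desc in FLAG_BITS.items() if flag & bit]


for flag in (147, 163, 83):
    print(flag, decode_flag(flag))
```

For example, 147 = 128 + 16 + 2 + 1, so that read is paired, mapped in a proper pair, on the reverse strand, and second in its pair.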

SAM Example:

M01915:185:000000000-JLJ5N:1:2117:4014:11709    147     NZ_CP010450     2743    60      85M     =       2743    -85     ACCTTATTGAGTCTTTAAAAGCTATTAAAAGTGAAACAGTAAAAATTCATTTCTTATCACCAGTTCGACCATTCACCCTAACACC   AGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC       NM:i:0  MD:Z:85 AS:i:85 XS:i:0  RG:Z:GAS_alignment      MQ:i:60 MC:Z:85M        ms:i:3207
M01915:185:000000000-JLJ5N:1:1115:16084:21280   163     NZ_CP010450     2747    60      100M    =       2747    100     TATTGAGTCTTTAAAAGCTATTAAAAGTGAAACAGTAAAAATTCATTTCTTATCACCAGTTCGACCATTCACCCTAACACCAGGCGATGAGGAAGAAAGT    CCCCCGGGGGGGGFFGGGGGGGFGFGGGGGGGGGGGGGFGFFGGGFGGFGGGGGGFGGGFFFGGGDFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFC        NM:i:0  MD:Z:100        AS:i:100        XS:i:0  RG:Z:GAS_alignment      MQ:i:60 MC:Z:100M       ms:i:3766
M01915:185:000000000-JLJ5N:1:1115:16084:21280   83      NZ_CP010450     2747    60      100M    =       2747    -100    TATTGAGTCTTTAAAAGCTATTAAAAGTGAAACAGTAAAAATTCATTTCTTATCACCAGTTCGACCATTCACCCTAACACCAGGCGATGAGGAAGAAAGT    GGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFFFFGGGGGGGGGGFEGGGGGGGGGGGGDGGFGGGGGGFFGGGGGGGGGGGGGGGGCCCCC        NM:i:0  MD:Z:100        AS:i:100        XS:i:0  RG:Z:GAS_alignment      MQ:i:60 MC:Z:100M       ms:i:3758

BAM files contain the same data as SAM files, but are converted to a binary format to reduce storage requirements as well as for more rapid access by programs that read alignment data. Unlike a SAM file, you won't be able to open and read a BAM file in a text editing program.

Section 5 - Pileup

SAM files represent alignments one read at a time. To summarize the base calls of the alignment per position along the reference sequence, the SAM alignment can be converted to the Pileup format by programs such as samtools.

Pileup columns:

Col	Description
1	Sequence identifier
2	Position in sequence (1-based)
3	Reference base
4	Number of aligned reads
5	Bases at that position from aligned reads
6	Base quality scores, in Phred format

Pileup example:

NC_009089.1	2210	A	32	..,..,,,.,.,,,,...,,,..,,.,.C,,.	A>GDFGGGDGFDFD,GGGGEFGGGEF,F,GCF
NC_009089.1	2211	A	32	..,..,,,.,.,,,,...,,,.,,,.,..,,.	ECGDFGGGCGFDDEGGGGFFEGGE=F=FCGFG
NC_009089.1	2212	T	32	CCcCCcccCcCccccCCCcccCcccCcC.ccC	E7GFCGGG,GG>66EGGGGECGDCGE+E,GEG
NC_009089.1	2213	G	32	..,..,,,.,.,,,,...,,,.,,,.,..,,.	E:GADGGGCG:GFGEGGGGFFGGFFG<F,GFG
NC_009089.1	2214	A	32	..,..,,,.,.,,,,...,,,.,,,.,..,,.	F5GFGGGFEGFGDEGGGGG@GGGGEG:E;G=C

Bases column interpretation:

. (dot): Base matches the reference on the forward strand
, (comma): Base matches the reference on the reverse strand
AGTCN (upper case): Base that did not match the reference on the forward strand
agtcn (lower case): Base that did not match the reference on the reverse strand
^ (caret): Start of a read segment followed by mapping quality
$ (dollar): End of a read segment
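
The symbols above can be tallied with a short Python sketch. The function name and the keys of the returned dictionary are assumptions made for this example. It also skips insertion and deletion markers (a + or - followed by a length and that many bases), which are part of the pileup format although they are not listed above.

```python
def summarize_bases(bases):
    """Count reference matches and mismatches in a pileup bases column."""
    counts = {"match_fwd": 0, "match_rev": 0,
              "mismatch_fwd": {}, "mismatch_rev": {}}
    i = 0
    while i < len(bases):
        ch = bases[i]
        if ch == "^":
            i += 2  # skip the caret and the mapping-quality character after it
            continue
        if ch in "+-":
            # Indel marker: sign, length digits, then that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            length = int(bases[i + 1:j]) if j > i + 1 else 0
            i = j + length
            continue
        if ch == ".":
            counts["match_fwd"] += 1
        elif ch == ",":
            counts["match_rev"] += 1
        elif ch in "ACGTN":
            counts["mismatch_fwd"][ch] = counts["mismatch_fwd"].get(ch, 0) + 1
        elif ch in "acgtn":
            counts["mismatch_rev"][ch] = counts["mismatch_rev"].get(ch, 0) + 1
        # Anything else (for example "$") carries no base call and is ignored
        i += 1
    return counts


# Bases column from position 2212 in the example above
print(summarize_bases("CCcCCcccCcCccccCCCcccCcccCcC.ccC"))
```

At position 2212 nearly every read carries a C where the reference has a T, on both strands, which is the kind of pattern a variant caller reports as a substitution.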

Section 6 - Variant Call Format (VCF)

VCF files contain sequence variation data, i.e. base calls at each position based on sequence alignments, along with supporting information about each call.

VCF columns:

Col	Field	Description
1	CHROM	The name of the sequence
2	POS	Position of the variation on the sequence (1-based)
3	ID	Identifier of the variation; "." if unknown
4	REF	Reference base
5	ALT	Alternative base(s) based on alignment
6	QUAL	Quality score of the base call(s)
7	FILTER	Variation filter results; "." if unknown
8	INFO	Descriptions of the variation. See here for more information.
9	FORMAT	Descriptions of the samples. See link above.
10+	SAMPLEs	Per-sample values for each of the fields listed in FORMAT

VCF example:

NC_009089.1	2210	.	A	.	117	.	DP=32;AF1=0;AC1=0;DP4=13,16,0,0;MQ=60;FQ=-114	PL:DP:SP	0:29:0
NC_009089.1	2211	.	A	.	126	.	DP=32;AF1=0;AC1=0;DP4=14,18,0,0;MQ=60;FQ=-123	PL:DP:SP	0:32:0
NC_009089.1	2212	.	T	C	222	.	DP=32;VDB=9.43e-02;AF1=1;AC1=2;DP4=4,0,11,15;MQ=60;FQ=-105	PL:DP:SP	0:29:0
NC_009089.1	2213	.	G	.	123	.	DP=32;AF1=0;AC1=0;DP4=13,18,0,0;MQ=60;FQ=-120	PL:DP:SP	0:31:0
NC_009089.1	2214	.	A	.	120	.	DP=32;AF1=0;AC1=0;DP4=13,17,0,0;MQ=60;FQ=-117	PL:DP:SP	0:30:0
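
A data line like those above can be unpacked with a short Python sketch. The function name `parse_vcf_line` and the "SAMPLES" key are assumptions made for this example, and the input is an abbreviated line modeled on position 2212. Real VCF files also begin with header lines starting with #, which would need to be skipped first.

```python
VCF_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT",
              "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_vcf_line(line):
    """Parse one tab-separated VCF data line into a dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("A VCF data line needs at least 8 columns")
    record = dict(zip(VCF_FIELDS, cols[:9]))
    record["POS"] = int(record["POS"])

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, sep, value = item.partition("=")
            info[key] = value if sep else True
    record["INFO"] = info

    # Each sample column holds colon-separated values in FORMAT order
    keys = record.get("FORMAT", "").split(":")
    record["SAMPLES"] = [dict(zip(keys, s.split(":"))) for s in cols[9:]]
    return record


# Abbreviated line modeled on position 2212 in the example above
example = "\t".join([
    "NC_009089.1", "2212", ".", "T", "C", "222", ".",
    "DP=32;AF1=1;AC1=2;MQ=60;FQ=-105", "PL:DP:SP", "0:29:0",
])

rec = parse_vcf_line(example)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"], rec["SAMPLES"])
```

Values are left as strings here because their types depend on definitions in the file header, which this sketch does not read.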

Section 7 - Tree files

There are a number of different formats for storing phylogenetic trees or other relational data. One of the most common and simplest formats is Newick format. A relative of the Newick format is the Nexus format, which contains a Newick-formatted tree but can also encode other data associated with the tree.

Common suffixes for tree files are .tre and .nwk, but there are others out there.

Newick Tree File Example:

(COV0536:0.000228603,(COV0118:0.000114239,(COV0074:0.000038079,COV0420:0.000076177)0.997:0.000342946)1.000:0.000342930,(COV1650:0.000114265,COV0415:0.000000005)0.438:0.000000005);
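
In the tree above, parentheses group sibling branches, each name is followed by a colon and its branch length, and the number after a closing parenthesis is a label on that internal node (here most likely a support value). As a rough sketch, the leaf names and their branch lengths can be pulled out with a regular expression. This assumes unquoted names without spaces and a branch length on every leaf, so it is not a general Newick parser, and the names `LEAF_PATTERN` and `leaf_branch_lengths` are made up for this example.

```python
import re

# A leaf is a name that directly follows "(" or "," and is followed by ":length".
# Internal node labels follow ")" instead, so they are not matched.
LEAF_PATTERN = re.compile(r"[(,]\s*([^\s(),:;]+):([0-9.eE+-]+)")


def leaf_branch_lengths(newick):
    """Return {leaf name: branch length} for a simple Newick string."""
    return {name: float(length) for name, length in LEAF_PATTERN.findall(newick)}


tree = ("(COV0536:0.000228603,(COV0118:0.000114239,(COV0074:0.000038079,"
        "COV0420:0.000076177)0.997:0.000342946)1.000:0.000342930,"
        "(COV1650:0.000114265,COV0415:0.000000005)0.438:0.000000005);")

for name, length in leaf_branch_lengths(tree).items():
    print(name, length)
```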


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

08 Sequencing File Types - NU-CPGME/quest_genomics_2025 GitHub Wiki
