File Formats - NBISweden/workshop-genome_assembly GitHub Wiki

Common File Formats in De Novo Assembly

Fastq

Fastq is a common format for storing sequencing data together with quality information. Each sequence record spans four lines, and a Fastq file can contain anywhere from millions to billions of records. The first line is the header, which starts with an @ symbol followed by a string that uniquely identifies the record. The second line contains the sequence. The third line starts with a + symbol, optionally followed by the header string again. The fourth line holds the ASCII-encoded quality scores, one symbol per base. Each symbol encodes a number, the quality score, which is a scaled probability that the base call is incorrect. Most tools expect Phred-scaled scores (Q = -10 log10 P, where P is the probability of error), although older formats used other encodings and value ranges.

For paired data, sequences can be either interleaved (each read2 directly follows its read1) or in separate files. If read1 and read2 are in separate files, the sequence records are expected to be in the same order in both files.

@sequence_specific_information
ACGTACGTACGTACGTACGTACGTACGTACGT
+
!"#$%&'()*+,-./0123456789:;<=>?@
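As a minimal sketch of how the quality line is decoded, the following assumes the Phred+33 (Sanger) offset that most current tools expect. The function names are illustrative and do not come from any particular library.

```python
def phred_scores(quality, offset=33):
    """Convert an ASCII quality string into a list of Phred scores.

    The offset of 33 is an assumption (Sanger / Illumina 1.8+ encoding);
    older Illumina pipelines used an offset of 64.
    """
    return [ord(symbol) - offset for symbol in quality]


def error_probability(score):
    """Probability that a base call with this Phred score is incorrect."""
    return 10 ** (-score / 10)


# The four lines of the example record above.
record = [
    "@sequence_specific_information",
    "ACGTACGTACGTACGTACGTACGTACGTACGT",
    "+",
    "!\"#$%&'()*+,-./0123456789:;<=>?@",
]
header, sequence, separator, quality = record
scores = phred_scores(quality)

print(scores[0], scores[-1])   # lowest and highest score in the example
print(error_probability(20))   # Q20 corresponds to a 1 in 100 error rate
```

If the decoded scores fall outside the range a tool expects (for example, all values above 30), an unexpected encoding is a likely explanation.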

Possible problems:

  • Unexpected quality score encoding.
  • Empty sequence.
  • Varying sequence length for fixed length sequencing (prior processing).
  • Quality score sequence length does not match sequence length
  • Missing mate record for paired data.
  • Merged data (e.g. from multiple runs, multiplexed samples)
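Some of the structural problems above can be detected with a simple scan. This is a rough sketch that assumes strict four-line records (no wrapped sequence lines); `check_fastq` is a hypothetical helper, not part of any existing tool.

```python
import io


def check_fastq(handle):
    """Yield (record_number, problem) tuples for simple structural problems."""
    record_number = 0
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        record_number += 1
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            yield record_number, "header does not start with @"
        if not separator.startswith("+"):
            yield record_number, "third line does not start with +"
        if not sequence:
            yield record_number, "empty sequence"
        if len(sequence) != len(quality):
            yield record_number, "quality length differs from sequence length"


# Three made-up records: one valid, one with a short quality line, one empty.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII\n@r3\n\n+\n\n")
problems = list(check_fastq(example))
for record_number, problem in problems:
    print(record_number, problem)
```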

Illumina metadata

Illumina data includes a lot of metadata in the header that can help in detecting things like merged samples or barcode mix-ups.

@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 1:N:0:ATTCCT
CTTATCGGATCGATCCCAGTTTGGGCTTGTAAACGGTGAATCCTCAAAGACCACCAATGTTG
+
CCCFFFFFHHHHHJJJJJJHIJIIJGGJGFEGIGHIBFGHJIJIICHIIIDHGGIGIGHEFG
@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 2:N:0:ATTCCT
TAACCGAGCAAACAAAAGTTGGTTGTCACAAATTGTAATGACCTGATTAAACTTGATTTTTT
+
CCCFFFFFHHHHHJIIIJHIJJHIJJJJJJJJJJJIJJJIJJJJJIIIJJIJJJJGIJJJJH

The header information can be broken down as follows:

| Header part | Description |
| --- | --- |
| HWI-ST486 | The unique instrument name |
| 212 | The run ID |
| D0C8BACXX | The flowcell ID |
| 6 | The flowcell lane |
| 1101 | The tile number on the flowcell |
| 2365 | The x-coordinate within the tile |
| 1998 | The y-coordinate within the tile |
| 1 | The member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
| N | Y if the read fails Illumina's quality filter, N otherwise |
| 0 | Control bits |
| ATTCCT | The adapter barcode index |
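The breakdown above can be sketched in code as follows. It assumes headers with exactly the layout of the example (seven colon-separated location fields, a space, then four colon-separated description fields); `parse_illumina_header` and the field names are illustrative choices.

```python
def parse_illumina_header(header):
    """Split an Illumina Fastq header of the form shown above into named fields."""
    location, description = header.strip().lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    read, filtered, control, index = description.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),
        "fails_filter": filtered == "Y",  # Y means the read failed the filter
        "control_bits": int(control),
        "index": index,
    }


headers = [
    "@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 1:N:0:ATTCCT",
    "@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 2:N:0:ATTCCT",
]
parsed = [parse_illumina_header(h) for h in headers]

# More than one (flowcell, lane) combination in a file suggests merged data.
lanes = {(p["flowcell"], p["lane"]) for p in parsed}
print(lanes)
```

The same idea applies to the barcode index: collecting the distinct `index` values in a file is a quick way to spot multiplexed samples that were merged.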

PacBio metadata

PacBio headers also include a lot of metadata.

@m140415_143853_42175_c100635972550000001823121909121417_s1_p0/533/3100_11230

The header information can be broken down as follows:

| Header part | Description |
| --- | --- |
| m | Movie |
| 140415_143853 | The run start time (yymmdd_hhmmss) |
| 42175 | The instrument serial number |
| c100635972550000001823121909121417 | The SMRT Cell barcode |
| s1 | Set number (deprecated) |
| p0 | Part number |
| 533 | ZMW hole number |
| 3100_11230 | The subread region (start_stop) |
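A small sketch of the same breakdown in code, assuming subread headers with exactly the layout of the example above. The function and key names are hypothetical.

```python
def parse_pacbio_header(header):
    """Split a PacBio RS II style subread header into its parts."""
    movie, hole, region = header.strip().lstrip("@").split("/")
    # Drop the leading "m" and split the movie name on underscores.
    date, time, instrument, cell, set_number, part = movie[1:].split("_")
    start, stop = region.split("_")
    return {
        "run_start": date + "_" + time,
        "instrument": instrument,
        "smrt_cell": cell,
        "set": set_number,
        "part": part,
        "zmw": int(hole),
        "start": int(start),
        "stop": int(stop),
    }


fields = parse_pacbio_header(
    "@m140415_143853_42175_c100635972550000001823121909121417_s1_p0/533/3100_11230"
)
print(fields["zmw"], fields["stop"] - fields["start"])  # ZMW and subread length
```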

Fasta

Fasta is another common format for storing sequences, but without any accompanying quality information. A file can contain any number of sequence records (a file with more than one is often called multi-fasta), and each record has two parts. The first part is the unique sequence header, a single line that starts with a > character. The second part is the sequence, which can span multiple lines, as long sequences are often folded to a fixed width for display purposes.

>sequence_id optional_data
ACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGT
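Because the sequence can be folded over several lines, a reader has to join lines until the next header. A minimal sketch of such a reader follows; `read_fasta` is an illustrative name, and in practice an established parser is usually preferable.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples, joining folded sequence lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # last record in the file


example = io.StringIO(
    ">sequence_id optional_data\n"
    "ACGTACGTACGTACGTACGTACGTACGT\n"
    "ACGTACGTACGTACGTACGTACGTACGT\n"
    "ACGTACGTACGTACGTACGTACGTACGT\n"
    "ACGTACGTACGTACGT\n"
)
for header, sequence in read_fasta(example):
    identifier = header.split()[0]  # text up to the first whitespace
    print(identifier, len(sequence))
```

Records that come back with an empty sequence, or identifiers containing characters a downstream tool rejects, can be caught at this point.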

Possible problems:

  • Illegal characters in the header (tool specific, e.g. | or whitespace)
  • Empty sequence

SAM/BAM

SAM and its binary-encoded counterpart BAM are the standard sequence alignment formats. They are most often used to store alignment information, but more recently also to store unaligned sequence data together with metadata beyond quality scores.

@HD VN:1.0  SO:coordinate
@SQ SN:1    LN:249250621    AS:NCBI37   UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2    LN:243199373    AS:NCBI37   UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3    LN:198022430    AS:NCBI37   UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80   DT:2010-05-05T20:00:00-0400 SM:SD37743  CN:UMCORE
@RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80   DT:2010-05-05T20:00:00-0400 SM:SD37743  CN:UMCORE
@PG ID:bwa  VN:0.5.4
1:497:R:-272+13M17D24M  113 1   497 37  37M 15  100338662   0  CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG  0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>>   XT:A:U   NM:i:0  SM:i:37 AM:i:0  X0:i:1  X1:i:0  XM:i:0  XO:i:0  XG:i:0  MD:Z:37 
19:20389:F:275+18M2D19M 99  1   17644   0   37M =   17919   314  TATGACTGCTAATAATACCTACACATGTTAGAACCAT   >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1   XT:A:R  NM:i:0  SM:i:0  AM:i:0  X0:i:4  X1:i:0  XM:i:0  XO:i:0  XG:i:0   MD:Z:37
19:20389:F:275+18M2D19M 147 1   17919   0   18M2D19M    =   17644   -314   GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT   ;44999;499<8<8<<<8<<><<<<><7<;<<<>><<   XT:A:R    NM:i:2  SM:i:0  AM:i:0  X0:i:4  X1:i:0  XM:i:0  XO:i:1  XG:i:2  MD:Z:18^CA19
9:21597+10M2I25M:R:-209 83  1   21678   0   8M2I27M =   21469   -244    CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT   <;9<<5><<<<><<<>><<><>><9>><>>>9>>><>   XT:A:R    NM:i:2  SM:i:0  AM:i:0  X0:i:5  X1:i:0  XM:i:0  XO:i:1  XG:i:2  MD:Z:35

There are two parts to a SAM file. The first is the header section, which contains metadata records. Each record starts with the @ symbol, followed by a two-letter code for the type of information that follows. The rest of the line consists of tab-separated TAG:VALUE pairs that hold the data. The second part of a SAM file contains the alignment records. These are also tab-separated and have the following columns:

| Column | Description |
| --- | --- |
| 1 | Sequence identifier |
| 2 | Bitwise flag encoding several alignment properties |
| 3 | Reference sequence name |
| 4 | 1-based (SAM) or 0-based (BAM) leftmost alignment position |
| 5 | Mapping quality score |
| 6 | CIGAR string - run-length encoded alignment description |
| 7 | Reference sequence name of the mate/next read (= if the same as column 3) |
| 8 | Alignment position of the mate/next read |
| 9 | Observed template (DNA fragment) length |
| 10 | Read sequence |
| 11 | Phred-scaled base quality scores |
| 12+ | Optional TAG:TYPE:VALUE data descriptors |
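As an illustration of the column layout, the sketch below splits one alignment line into named fields. The short column names follow the SAM specification; `parse_alignment` is a hypothetical helper, and the example line is the second alignment record shown above with only a subset of its optional tags.

```python
SAM_COLUMNS = (
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
)


def parse_alignment(line):
    """Split a tab-separated SAM alignment line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    record["tags"] = {}
    for item in fields[11:]:
        # Split on the first two colons only; values may contain colons.
        tag, kind, value = item.split(":", 2)
        record["tags"][tag] = int(value) if kind == "i" else value
    return record


line = "\t".join([
    "19:20389:F:275+18M2D19M", "99", "1", "17644", "0", "37M", "=", "17919", "314",
    "TATGACTGCTAATAATACCTACACATGTTAGAACCAT",
    ">>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9",
    "RG:Z:UM0098:1", "NM:i:0", "MD:Z:37",
])
record = parse_alignment(line)
print(record["rname"], record["pos"], record["cigar"], record["tags"]["RG"])
```

For real data, a dedicated library or samtools is the safer choice, since BAM files are binary and cannot be read this way.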

Possible problems:

  • Missing fields (e.g. pulse-field data for PacBio reads)
  • Missing metadata
  • Incorrect bitwise flag settings (column 2)

Interpreting a bitwise flag

A bitwise flag is the sum of the values below, encoding multiple properties for a read.

For example, 67 = 1 + 2 + 64.

| Description | Decimal |
| --- | --- |
| Read paired | 1 |
| Read mapped in proper pair | 2 |
| Read unmapped | 4 |
| Mate unmapped | 8 |
| Read reverse strand | 16 |
| Mate reverse strand | 32 |
| First in pair | 64 |
| Second in pair | 128 |
| Not primary alignment | 256 |
| Read fails platform/vendor quality checks | 512 |
| Read is PCR or optical duplicate | 1024 |
| Supplementary alignment | 2048 |

A bitwise flag of 67, for example, is written in binary as 000001000011. Reading from the right, the first, second, and seventh bits are set, which correspond to the 1st, 2nd, and 7th rows in the table above. A flag of 67 therefore means that the read is paired, the read is mapped in a proper pair, and the read is the first in the pair. As the third, fourth, and ninth bits are not set, a flag of 67 also states that the read is mapped, its mate is mapped, and it is a primary alignment.
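The same decoding can be done with a bitwise AND against each value in the table. A minimal sketch, with `decode_flag` as an illustrative name:

```python
FLAG_BITS = [
    (1, "read paired"),
    (2, "read mapped in proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read reverse strand"),
    (32, "mate reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "not primary alignment"),
    (512, "read fails platform/vendor quality checks"),
    (1024, "read is PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in a SAM flag."""
    return [description for value, description in FLAG_BITS if flag & value]


print(format(67, "012b"))  # the flag as a 12-digit binary string
print(decode_flag(67))
print(decode_flag(147))    # flag of the third alignment record in the example
```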

HDF5

HDF5 is the Hierarchical Data Format version 5. It is a flexible binary format designed to store large amounts of data. The structure of the data is described within the file itself.

Oxford Nanopore uses a variant called Fast5. The Pacific Biosciences RS II platform uses variants named bax.h5 and bas.h5.

These formats store the signal data generated by the platform. Additional processing is often required to add base calls and quality scores to the file, although these are now often written directly to Fastq or BAM instead.

HDF5 tools is a package that lets you view the data and its hierarchy within a file (for example with h5ls or h5dump).

Possible problems:

  • Unable to extract the sequence and quality scores. The files need to be run through a base-caller first, after which sequence and quality scores should be available.

GFA

GFA is the Graphical Fragment Assembly format. It is commonly used to store relationships between sequences, often as an intermediate format of assemblers.

Each record in a GFA file is one line long, and starts with a one-letter character to describe the information that follows.

| Record | Type |
| --- | --- |
| H VN:Z:1.0 | Header |
| S 11 ACCTT | Segment |
| S 12 TCAAGG | Segment |
| S 13 CTTGATT | Segment |
| L 11 + 12 - 4M | Link |
| L 12 - 13 + 5M | Link |
| L 11 + 13 + 3M | Link |
| P 14 11+,12-,13+ 4M,5M | Path |

A header line (starts with H) encodes metadata as TAG:TYPE:VALUE data descriptors.

A segment line (starts with S) contains a unique ID for the segment (numeric in this example), followed by the sequence.

A link line (starts with L) contains the IDs of the two segments that are linked together, each followed by its strand. The last column is an optional CIGAR string describing the overlap between the two segments.

A path line (starts with P) contains another unique ID, followed by the oriented segments to tile to make the sequence, and optionally the overlaps between them as CIGAR strings.

For example, the path above tiles the segments as follows (segment 12 is used on the reverse strand, so it is shown reverse-complemented):

11   ACCTT
12    CCTTGA
13     CTTGATT
14   ACCTTGATT
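The tiling can be reproduced with a short sketch. It assumes tab-separated GFA lines, segments that carry their sequence (not *), and overlaps that are pure matches such as 4M; `path_sequence` is a hypothetical helper rather than part of any GFA library.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def path_sequence(gfa_lines, path_name):
    """Spell out the sequence of one path by tiling its oriented segments."""
    segments, paths = {}, {}
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            paths[fields[1]] = (fields[2].split(","), fields[3].split(","))
    steps, overlaps = paths[path_name]
    sequence = ""
    for index, step in enumerate(steps):
        name, strand = step[:-1], step[-1]
        oriented = segments[name]
        if strand == "-":
            oriented = reverse_complement(oriented)
        if index == 0:
            sequence = oriented
        else:
            # Assumption: the overlap is a match-only CIGAR such as "4M".
            overlap = int(overlaps[index - 1].rstrip("M"))
            sequence += oriented[overlap:]
    return sequence


gfa = [
    "H\tVN:Z:1.0",
    "S\t11\tACCTT",
    "S\t12\tTCAAGG",
    "S\t13\tCTTGATT",
    "L\t11\t+\t12\t-\t4M",
    "L\t12\t-\t13\t+\t5M",
    "L\t11\t+\t13\t+\t3M",
    "P\t14\t11+,12-,13+\t4M,5M",
]
print(path_sequence(gfa, "14"))
```

Overlaps containing insertions or deletions would need a proper CIGAR parser, which this sketch deliberately leaves out.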

Possible problems:

  • GFA segments are not 1-to-1 with Fasta sequences (path entries should be 1-to-1 with the Fasta in this case, but then path sequences need to be inferred).
  • Segment sequence is * instead of nucleotides (there should be a corresponding Fasta whose sequences are 1-to-1 with the segments).
  • Graph visualization of the link relationships can be misleading when paths, rather than segments, correspond 1-to-1 with the Fasta.
  • Neither GFA segments nor GFA paths correspond 1-to-1 with the Fasta. In this case, GFA path identifiers are usually the contig names followed by an underscore and a number. Multiple GFA segments have been collapsed into a single contig, and the segment IDs for a contig are obtained by removing the underscore and number from the path IDs and merging the segments of the paths that share the resulting name.