File Formats - NBISweden/workshop-genome_assembly GitHub Wiki
Fastq is a common format for storing sequencing data along with quality information. Each sequence record
spans four lines, and a Fastq file can contain from millions to billions of sequence records. The first line
of a record is the header line, which starts with an `@` symbol followed by a string that uniquely
identifies the record. The second line contains the sequence. The third line starts with a `+` symbol
and optionally repeats the header string. The last line holds ASCII-encoded quality scores, one for each
base. Each symbol encodes a number, the quality score, which is a scaled probability that the base is
incorrect. Most tools expect quality scores to be Phred scaled, although older formats used other
quality score value ranges.
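The relationship between quality symbols, Phred scores, and error probabilities can be sketched as follows. This is a minimal illustration, assuming an ASCII offset of 33 (the Sanger / Illumina 1.8+ convention); the function names `decode_phred` and `error_probability` are hypothetical and not part of any particular tool.

```python
def decode_phred(quality_line, offset=33):
    """Convert an ASCII quality string into a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding);
    some older Illumina pipelines used an offset of 64 instead.
    """
    return [ord(symbol) - offset for symbol in quality_line]


def error_probability(phred_score):
    """Phred scaling: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


# A 32-symbol quality line that runs from the lowest symbol "!" up to "@".
quality_line = "!\"#$%&'()*+,-./0123456789:;<=>?@"
scores = decode_phred(quality_line)
print(scores[0], scores[-1])   # first and last Phred score on the line
print(error_probability(20))   # error probability for a Q20 base
```

If a file decodes to scores that are all implausibly high or negative, the offset assumption is probably wrong for that file.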
For paired data, sequences can be either interleaved (read2 follows read1) or stored in separate files. If read1 and read2 are in separate files, the sequence records are expected to be in the same order in both files.
@sequence_specific_information
ACGTACGTACGTACGTACGTACGTACGTACGT
+
!"#$%&'()*+,-./0123456789:;<=>?@
Possible problems:
- Unexpected quality score encoding.
- Empty sequence.
- Varying sequence length for fixed length sequencing (prior processing).
- Quality score sequence length does not match sequence length.
- Missing mate record for paired data.
- Merged data (e.g. from multiple runs, multiplexed samples).
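Some of these problems can be detected with a simple pass over the file. The sketch below is a hypothetical checker (`check_fastq` is not an existing tool) that assumes the strict four-lines-per-record layout and only looks for truncated records, malformed header or separator lines, empty sequences, and quality lines whose length does not match the sequence.

```python
import io
from itertools import zip_longest


def check_fastq(handle):
    """Yield (record_number, problem) pairs for a Fastq text stream.

    Assumes exactly four lines per record and no blank trailing lines.
    """
    stripped = (line.rstrip("\n") for line in handle)
    # Group the stream into chunks of four lines; a short final chunk is padded with None.
    records = zip_longest(*[stripped] * 4)
    for number, (header, sequence, separator, quality) in enumerate(records, start=1):
        if quality is None:
            yield (number, "truncated record")
            continue
        if not header.startswith("@"):
            yield (number, "header does not start with @")
        if not separator.startswith("+"):
            yield (number, "third line does not start with +")
        if not sequence:
            yield (number, "empty sequence")
        if len(sequence) != len(quality):
            yield (number, "quality length does not match sequence length")


# Made-up records: read2 is empty, read3 has one quality symbol too few.
example = io.StringIO(
    "@read1\nACGT\n+\n!!!!\n"
    "@read2\n\n+\n\n"
    "@read3\nACGT\n+\n!!!\n"
)
for problem in check_fastq(example):
    print(problem)
```

Because the function is a generator over the input stream, it does not need to hold the whole file in memory.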
Illumina data includes a lot of metadata in the header that can help identify things like merged samples or barcode mix-ups.
@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 1:N:0:ATTCCT
CTTATCGGATCGATCCCAGTTTGGGCTTGTAAACGGTGAATCCTCAAAGACCACCAATGTTG
+
CCCFFFFFHHHHHJJJJJJHIJIIJGGJGFEGIGHIBFGHJIJIICHIIIDHGGIGIGHEFG
@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 2:N:0:ATTCCT
TAACCGAGCAAACAAAAGTTGGTTGTCACAAATTGTAATGACCTGATTAAACTTGATTTTTT
+
CCCFFFFFHHHHHJIIIJHIJJHIJJJJJJJJJJJIJJJIJJJJJIIIJJIJJJJGIJJJJH
The header information can be broken down as follows:
Header part | Description |
---|---|
HWI-ST486 | The unique instrument name |
212 | The run ID |
D0C8BACXX | The flowcell ID |
6 | The flowcell lane |
1101 | The tile number on the flowcell |
2365 | The x-coordinate within the tile |
1998 | The y-coordinate within the tile |
1 | The member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
N | Y if the read fails Illumina's quality filter, N otherwise |
0 | Control bits |
ATTCCT | The adapter barcode index |
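The breakdown in the table can be expressed as a small parser. This is a sketch that assumes the header layout shown in the example record (seven colon-separated fields, a space, then four colon-separated fields); the names `IlluminaHeader` and `parse_illumina_header` are hypothetical.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: str
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool
    control_bits: int
    index: str


def parse_illumina_header(header):
    """Split an Illumina Fastq header into the fields listed in the table."""
    location, description = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell_id, lane, tile, x, y = location.split(":")
    pair_member, filter_flag, control_bits, index = description.split(":")
    return IlluminaHeader(
        instrument, run_id, flowcell_id, int(lane), int(tile), int(x), int(y),
        int(pair_member), filter_flag == "Y", int(control_bits), index,
    )


parsed = parse_illumina_header("@HWI-ST486:212:D0C8BACXX:6:1101:2365:1998 1:N:0:ATTCCT")
print(parsed.flowcell_id, parsed.lane, parsed.index)
```

Collecting the distinct `(flowcell_id, lane, index)` combinations across a file is one way to spot merged runs or mixed barcodes.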
PacBio headers also include a lot of metadata:
@m140415_143853_42175_c100635972550000001823121909121417_s1_p0/533/3100_11230
The header information can be broken down as follows:
Header part | Description |
---|---|
m | movie |
140415_143853 | The run start time (yymmdd_hhmmss) |
42175 | The Instrument serial number |
c100635972550000001823121909121417 | The SMRT Cell barcode |
s1 | Set number (deprecated) |
p0 | Part number |
533 | ZMW hole number |
3100_11230 | The subread region (start_stop) |
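The same breakdown can be sketched in code. This assumes the RSII-style subread header layout shown in the example (`movie/hole/start_stop`); `parse_pacbio_header` is a hypothetical helper, not part of any PacBio software.

```python
def parse_pacbio_header(header):
    """Split a PacBio subread header into the parts listed in the table."""
    movie, hole_number, region = header.lstrip("@").split("/")
    date, time, serial, cell_barcode, set_number, part_number = movie.split("_")
    start, stop = region.split("_")
    return {
        "run_start": date[1:] + "_" + time,  # drop the leading "m" (movie)
        "instrument": serial,
        "smrt_cell": cell_barcode,
        "set_number": set_number,
        "part_number": part_number,
        "hole_number": int(hole_number),
        "subread_start": int(start),
        "subread_stop": int(stop),
    }


parsed = parse_pacbio_header(
    "@m140415_143853_42175_c100635972550000001823121909121417_s1_p0/533/3100_11230"
)
print(parsed["hole_number"], parsed["subread_stop"] - parsed["subread_start"])
```

Subreads that share a movie name and hole number come from the same ZMW, which is useful when grouping reads from one molecule.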
Fasta is another common format for storing sequences, but without any accompanying quality information. A
file can hold any number of sequence records (a file with several records is known as multi-fasta), and each
record has two parts. The first part is the unique sequence header, which starts with a `>` character. The
second part is the sequence, which can span multiple lines, as long sequences are often folded to a fixed
width for display purposes.
>sequence_id optional_data
ACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGT
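Because a sequence can be folded over several lines, a reader has to collect lines until the next header. A minimal sketch, assuming plain uncompressed text input (`read_fasta` is a hypothetical name):

```python
import io


def read_fasta(handle):
    """Yield (identifier, description, sequence) for each record in a Fasta stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield _finish(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)  # folded sequence lines are joined later
    if header is not None:
        yield _finish(header, chunks)


def _finish(header, chunks):
    # Everything up to the first space is the ID; the rest is optional data.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


example = io.StringIO(">sequence_id optional_data\nACGTACGT\nACGT\n>empty_record\n")
for identifier, description, sequence in read_fasta(example):
    print(identifier, len(sequence))
```

Records that come back with a zero-length sequence correspond to the "empty sequence" problem.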
Possible problems:
- Illegal characters in the header (tool specific, e.g. `|` or whitespace).
- Empty sequence.
SAM, and its binary-encoded counterpart BAM, is the sequence alignment format. It is most often used to store alignment information, but more recently it has also been used to store unaligned sequence data together with metadata beyond quality scores.
@HD VN:1.0 SO:coordinate
@SQ SN:1 LN:249250621 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ SN:2 LN:243199373 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:a0d9851da00400dec1098a9255ac712e
@SQ SN:3 LN:198022430 AS:NCBI37 UR:file:/data/local/ref/GATK/human_g1k_v37.fasta M5:fdfd811849cc2fadebc929bb925902e5
@RG ID:UM0098:1 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L001 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
@RG ID:UM0098:2 PL:ILLUMINA PU:HWUSI-EAS1707-615LHAAXX-L002 LB:80 DT:2010-05-05T20:00:00-0400 SM:SD37743 CN:UMCORE
@PG ID:bwa VN:0.5.4
1:497:R:-272+13M17D24M 113 1 497 37 37M 15 100338662 0 CGGGTCTGACCTGAGGAGAACTGTGCTCCGCCTTCAG 0;==-==9;>>>>>=>>>>>>>>>>>=>>>>>>>>>> XT:A:U NM:i:0 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
19:20389:F:275+18M2D19M 99 1 17644 0 37M = 17919 314 TATGACTGCTAATAATACCTACACATGTTAGAACCAT >>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9 RG:Z:UM0098:1 XT:A:R NM:i:0 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:37
19:20389:F:275+18M2D19M 147 1 17919 0 18M2D19M = 17644 -314 GTAGTACCAACTGTAAGTCCTTATCTTCATACTTTGT ;44999;499<8<8<<<8<<><<<<><7<;<<<>><< XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:4 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:18^CA19
9:21597+10M2I25M:R:-209 83 1 21678 0 8M2I27M = 21469 -244 CACCACATCACATATACCAAGCCTGGCTGTGTCTTCT <;9<<5><<<<><<<>><<><>><9>><>>>9>>><> XT:A:R NM:i:2 SM:i:0 AM:i:0 X0:i:5 X1:i:0 XM:i:0 XO:i:1 XG:i:2 MD:Z:35
There are two parts to a SAM file. The first is the header section, which contains metadata records. Each
record starts with the `@` symbol, followed by a two-letter code for the information that follows. That
information is a set of tab-separated TAG:VALUE pairs that hold extra data. The second part of a SAM file
contains the alignment records. These are also tab-separated values, in the following format:
Column | Description |
---|---|
1 | Sequence identifier |
2 | Bitwise flag encoding several alignment properties |
3 | Reference sequence name |
4 | 1-based (SAM) or 0-based (BAM) leftmost alignment position |
5 | Mapping quality score (Phred scaled) |
6 | CIGAR string - run-length encoded alignment description |
7 | Reference sequence name of the mate/next read (`=` if the same as column 3) |
8 | Alignment position of the mate/next read |
9 | Observed DNA fragment length |
10 | Read sequence |
11 | Phred-scaled base quality scores |
12+ | Optional TAG:TYPE:VALUE data descriptors |
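The column layout above can be read with a short parser. This is a sketch rather than a replacement for a real SAM library: `SamRecord` and `parse_sam_line` are hypothetical names, and only integer (`i`) tags are converted, with every other tag type left as a string.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    name: str
    flag: int
    reference: str
    position: int
    mapping_quality: int
    cigar: str
    mate_reference: str
    mate_position: int
    template_length: int
    sequence: str
    qualities: str
    tags: dict


def parse_sam_line(line):
    """Parse one tab-separated SAM alignment line (not a header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 columns, found {len(fields)}")
    tags = {}
    for item in fields[11:]:
        # Split on the first two colons only, since values may contain colons.
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


line = "\t".join([
    "19:20389:F:275+18M2D19M", "99", "1", "17644", "0", "37M", "=", "17919", "314",
    "TATGACTGCTAATAATACCTACACATGTTAGAACCAT",
    ">>>>>>>>>>>>>>>>>>>><<>>><<>>4::>>:<9",
    "RG:Z:UM0098:1", "NM:i:0",
])
record = parse_sam_line(line)
print(record.reference, record.position, record.cigar, record.tags["RG"])
```

A line with fewer than 11 columns raises `ValueError`, which covers the "missing fields" problem for the mandatory columns.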
Possible problems:
- Missing fields (e.g. pulse-field data for PacBio reads)
- Missing metadata
- Incorrect bitwise flag settings (column 2)
A bitwise flag is the sum of the values below, encoding multiple properties for a read.
For example, 67 = 1 + 2 + 64.
Description | Decimal |
---|---|
Read paired | 1 |
Read mapped in proper pair | 2 |
Read unmapped | 4 |
Mate unmapped | 8 |
Read reverse strand | 16 |
Mate reverse strand | 32 |
First in pair | 64 |
Second in pair | 128 |
Not primary alignment | 256 |
Read fails platform/vendor quality checks | 512 |
Read is PCR or optical duplicate | 1024 |
Supplementary alignment | 2048 |
A bitwise flag of 67, for example, is written in binary as `000001000011`. Reading from the right, the
first, second, and seventh bits are set, which correspond to the 1st, 2nd, and 7th rows in the table above.
A flag of 67 therefore means: the read is paired, the read is mapped in a proper pair, and the read is the
first in the pair. As the third, fourth, and ninth bits are not set, a flag of 67 also states that the read
is mapped, its mate is mapped, and it is a primary alignment.
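Decoding a flag amounts to testing each bit in turn. The sketch below uses the values and descriptions from the flag table; `FLAG_BITS` and `decode_flag` are hypothetical names.

```python
FLAG_BITS = [
    (1, "read paired"),
    (2, "read mapped in proper pair"),
    (4, "read unmapped"),
    (8, "mate unmapped"),
    (16, "read reverse strand"),
    (32, "mate reverse strand"),
    (64, "first in pair"),
    (128, "second in pair"),
    (256, "not primary alignment"),
    (512, "read fails platform/vendor quality checks"),
    (1024, "read is PCR or optical duplicate"),
    (2048, "supplementary alignment"),
]


def decode_flag(flag):
    """Return the descriptions of every bit that is set in a SAM flag."""
    return [description for value, description in FLAG_BITS if flag & value]


print(format(67, "012b"))  # the flag as a 12-digit binary string
for description in decode_flag(67):
    print(description)
```

A quick sanity check on real data is that no paired read should have both "first in pair" and "second in pair" set.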
HDF5 is the Hierarchical Data Format version 5. It is a flexible binary format designed to store large amounts of data. The structure of the data is described within the file in its own format.
Oxford Nanopore uses a variant called Fast5. The Pacific Biosciences RSII platform uses variants stored as bax.h5 and bas.h5 files.
These data formats are used to store the signal data generated by the platform. Additional processing is often required to add the base calls and quality scores to the file, although these are now often written directly to Fastq or BAM format instead.
HDF5 tools is a package that lets you view the data and its hierarchy within a file.
Possible problems:
- Unable to extract the sequence and quality scores. The files need to be run through a base-caller first, after which sequence and quality scores should be available.
GFA is the Graphical Fragment Assembly format. It is commonly used to store relationships between sequences, often as an intermediate format of assemblers.
Each record in a GFA file is one line long, and starts with a one-letter character to describe the information that follows.
Record | Type |
---|---|
H VN:Z:1.0 | Header |
S 11 ACCTT | Segment |
S 12 TCAAGG | Segment |
S 13 CTTGATT | Segment |
L 11 + 12 - 4M | Link |
L 12 - 13 + 5M | Link |
L 11 + 13 + 3M | Link |
P 14 11+,12-,13+ 4M,5M | Path |
A header line (starts with H) encodes metadata as TAG:TYPE:VALUE data descriptors.
A segment line (starts with S) contains a unique numeric ID for the segment and is followed by the sequence.
A link line (starts with L) contains the IDs of the segments that are linked together, and their strands. The 5th column after the record type is an optional CIGAR string describing the alignment in the link.
A path line (starts with P) contains another unique ID and the ordered list of oriented segments that are tiled to make the sequence, optionally followed by CIGAR strings describing the overlaps.
For example, the path above tiles as follows (segment 12 is shown reverse complemented because the path uses it in the `-` orientation):
11 ACCTT
12 CCTTGA
13 CTTGATT
14 ACCTTGATT
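The tiling of a path can be reproduced in a few lines. This sketch assumes tab-separated GFA lines and overlaps that are simple match-only CIGAR strings such as `4M`; `parse_gfa` and `path_sequence` are hypothetical helpers, not part of any GFA library.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    return sequence.translate(COMPLEMENT)[::-1]


def parse_gfa(lines):
    """Collect segment sequences and paths from tab-separated GFA lines."""
    segments, paths = {}, {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            segments[fields[1]] = fields[2]
        elif fields[0] == "P":
            paths[fields[1]] = (fields[2].split(","), fields[3].split(","))
    return segments, paths


def path_sequence(segments, steps, overlaps):
    """Tile oriented segments, skipping the overlapping prefix of each new one."""
    result = ""
    for index, step in enumerate(steps):
        segment_id, orientation = step[:-1], step[-1]
        sequence = segments[segment_id]
        if orientation == "-":
            sequence = reverse_complement(sequence)
        # Assumption: each overlap is "<n>M", so n bases are shared with the previous step.
        skip = 0 if index == 0 else int(overlaps[index - 1].rstrip("M"))
        result += sequence[skip:]
    return result


gfa_lines = [
    "S\t11\tACCTT",
    "S\t12\tTCAAGG",
    "S\t13\tCTTGATT",
    "P\t14\t11+,12-,13+\t4M,5M",
]
segments, paths = parse_gfa(gfa_lines)
steps, overlaps = paths["14"]
print(path_sequence(segments, steps, overlaps))
```

When segment sequences are given as `*`, the same tiling has to be done against the sequences in the accompanying Fasta file instead.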
Possible problems:
- GFA segments are not 1-to-1 with Fasta sequences (Path entries should be 1-to-1 with Fasta in this case, but then path sequences need to be inferred).
- Segment sequence is `*` instead of nucleotides (There should be a corresponding fasta where segments are 1-to-1 with sequences in the fasta).
- Graph visualization of the link relationships can be misleading when paths correspond 1-to-1 with the fasta, rather than the segments.
- Neither GFA segments nor GFA paths correspond 1-to-1 with Fasta. In this case, GFA path identifiers are usually the contig names followed by an underscore and then a number. Multiple GFA segments have been collapsed into a single contig, and the segment IDs are obtained by removing the underscore and number and merging obtained path IDs.