Quality - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Data [4/5].
For many reasons, most of which are covered here (slide 3). Mostly, because we need to know better our data before deciding whether to confidently proceeding any further (or not).
Note: it is normal to get "Quality reports" now, delivered as companion to PacBio or ONT sequencing.
We will first assess our fasta/q reads, in order to get a feeling for:
- total bp sum
- theoretical coverage
- N50 read length
You can come up with some solution yourself or use a prepared script:
The script below accepts input in fasta
format. Some awk magic and it's done:
cat <file.fastq> | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > <file.fasta>
wget https://drive.switch.ch/index.php/s/BIUdQPc2wSGG5vk/download -O assesV4.2.pl
chmod +x assesV4.2.pl
./assesV4.2.pl -s 1 -t 1 -i <input.fasta>
The part that is interesting for us now is the "overall stats" part.
Sample output:
overall stats
GC ratio (in %): 38.90
Shortest sequence length: 75
Longest sequence length: 64553
bp-sum of all sequences : 73760116
Number of sequences >1: 6355
N50 sequence length : 19664
NG50 sequence length : 55118
lower Case (#/bps/%): 0 0 0.00
gaps (#/bps/%): 0 0 0.00
What can you conclude?
BBMap has some utilities to check on (any!) fasta files. Of course it will be "too much" info for us at this point, as it was also thought to give stats on fasta containing genome assemblies already. But that's a pretty good start.
If you want to try it:
no need to compile!
wget https://sourceforge.net/projects/bbmap/files/BBMap_38.29.tar.gz
tar -xvzf BBMap_38.29.tar.gz
The utility of interest for us now is stats.sh
(e.g. in bbmap/stats.sh
), usage as per bbmap/stats.sh -h
Sample output:
training@af85a354ae34:~$ bbmap/stats.sh in=PBbam_subset.fasta
A C G T N IUPAC Other GC GC_stdev
0.2520 0.2503 0.2286 0.2692 0.0000 0.0000 0.0000 0.4788 0.0697
Main genome scaffold total: 200
Main genome contig total: 200
Main genome scaffold sequence total: 0.592 MB
Main genome contig sequence total: 0.592 MB 0.000% gap
Main genome scaffold N/L50: 58/3.111 KB
Main genome contig N/L50: 58/3.111 KB
Main genome scaffold N/L90: 148/1.818 KB
Main genome contig N/L90: 148/1.818 KB
Max scaffold length: 8.639 KB
Max contig length: 8.639 KB
Number of scaffolds > 50 KB: 0
% main genome in scaffolds > 50 KB: 0.00%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 200 200 592,068 592,068 100.00%
50 200 200 592,068 592,068 100.00%
100 198 198 591,921 591,921 100.00%
250 196 196 591,515 591,515 100.00%
500 187 187 588,161 588,161 100.00%
1 KB 181 181 583,362 583,362 100.00%
2.5 KB 128 128 490,678 490,678 100.00%
5 KB 30 30 199,701 199,701 100.00%
If you've done some NGS data before, you might have heard of FastQC, as quality control tool on fastq files.
FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.
The reports look like this:
- PacBio, subreads
- PacBio, CCS
- ONT, 2D
- ONT, 1D2
Pretty terrible huh? Yes, definitely. Because FastQC was thought with Illumina data in mind. It's not really meaningful for long reads data.
You can also visualize your aligned data with IGV. This is especially meaningful if you have high-coverage of a gene/fragment on interest and you want to explore.
You'll have to:
- download the executable suitable for your machine
- import an indexed and sorted (hint:
samtools
) alignment file (hint:minimap2
) - the corresponding gene of interest (known fasta and annotation)
We will anyway try this later on for the "Transcriptome assessment" part!