Quality - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

Quality check on your reads

Section: Data [4/5].

Why is this important?

For many reasons, most of which are covered here (slide 3). Mostly, because we need to know better our data before deciding whether to confidently proceeding any further (or not).

Note: it is normal to get "Quality reports" now, delivered as companion to PacBio or ONT sequencing.

Let's get some stats!

We will first assess our fasta/q reads, in order to get a feeling for:

  • total bp sum
  • theoretical coverage
  • N50 read length

You can come up with some solution yourself or use a prepared script:

Careful!

The script below accepts input in fasta format. Some awk magic and it's done:

cat <file.fastq> | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > <file.fasta>

Get the script and execute

wget https://drive.switch.ch/index.php/s/BIUdQPc2wSGG5vk/download -O assesV4.2.pl
chmod +x assesV4.2.pl
./assesV4.2.pl -s 1 -t 1 -i <input.fasta>

The part that is interesting for us now is the "overall stats" part.

Sample output:

overall stats

    GC ratio (in %):                            38.90
    Shortest sequence length:                      75
    Longest sequence length:                    64553
    bp-sum of all sequences :                73760116
    Number of sequences >1:                      6355
    N50 sequence length :                       19664
    NG50 sequence length :                      55118
    lower Case (#/bps/%):                           0          0    0.00
   gaps (#/bps/%):                                  0          0    0.00

What can you conclude?

Alternative: BBMap!

BBMap has some utilities to check on (any!) fasta files. Of course it will be "too much" info for us at this point, as it was also thought to give stats on fasta containing genome assemblies already. But that's a pretty good start.

If you want to try it:

Get BBMap suite

no need to compile!

wget https://sourceforge.net/projects/bbmap/files/BBMap_38.29.tar.gz 
tar -xvzf BBMap_38.29.tar.gz

The utility of interest for us now is stats.sh (e.g. in bbmap/stats.sh), usage as per bbmap/stats.sh -h

Sample output:

training@af85a354ae34:~$ bbmap/stats.sh in=PBbam_subset.fasta
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.2520  0.2503  0.2286  0.2692  0.0000  0.0000  0.0000  0.4788  0.0697
 
Main genome scaffold total:             200
Main genome contig total:               200
Main genome scaffold sequence total:    0.592 MB
Main genome contig sequence total:      0.592 MB        0.000% gap
Main genome scaffold N/L50:             58/3.111 KB
Main genome contig N/L50:               58/3.111 KB
Main genome scaffold N/L90:             148/1.818 KB
Main genome contig N/L90:               148/1.818 KB
Max scaffold length:                    8.639 KB
Max contig length:                      8.639 KB
Number of scaffolds > 50 KB:            0
% main genome in scaffolds > 50 KB:     0.00%
 
 
Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                    200             200         592,068         592,068   100.00%
     50                    200             200         592,068         592,068   100.00%
    100                    198             198         591,921         591,921   100.00%
    250                    196             196         591,515         591,515   100.00%
    500                    187             187         588,161         588,161   100.00%
   1 KB                    181             181         583,362         583,362   100.00%
 2.5 KB                    128             128         490,678         490,678   100.00%
   5 KB                     30              30         199,701         199,701   100.00%

Why not FastQC?

If you've done some NGS data before, you might have heard of FastQC, as quality control tool on fastq files.

FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis.

The reports look like this:

  • PacBio, subreads

pacbiosubreads

  • PacBio, CCS

pacbioccs

  • ONT, 2D

ont2d

  • ONT, 1D2

ont1d2

Pretty terrible huh? Yes, definitely. Because FastQC was thought with Illumina data in mind. It's not really meaningful for long reads data.

Another suggestion: IGV!

You can also visualize your aligned data with IGV. This is especially meaningful if you have high-coverage of a gene/fragment on interest and you want to explore.

You'll have to:

  • download the executable suitable for your machine
  • import an indexed and sorted (hint: samtools) alignment file (hint: minimap2)
  • the corresponding gene of interest (known fasta and annotation)

We will anyway try this later on for the "Transcriptome assessment" part!

Next

Back

⚠️ **GitHub.com Fallback** ⚠️