Polymorphism detection from fastq files on a linux terminal - SouthGreenPlatform/tutorials GitHub Wiki


Description

This page describes a serie of tools and linux commands used to detect polymorphism from raw data (fastq files) to polymorphism file (vcf).

We need, in this tutorial:

  • 2 fastq files paired
  • a reference file used for the mapping step.

Author(s)

Authors Christine Tranchant-Dubreuil
Research Unit UMR DIADE
Institut

Keywords

fastq-stats, fastqc, cutadapt, bwa, samtools, gatk, picardtools

Files format

fastq, sam, bam, vcf

Date

10/03/2017


Summary


Getting basic informations about fastq files (file size, sequences number)

How to get fastq file size

[tranchant@master0 ~]$ du -sh *fastq 
311M    C3KB2ACXX_5_12_11_debar.fastq
525M    C3KB2ACXX_5_12_12_debar.fastq

How to get sequences number by fastq files (uncompressed files only)

[tranchant@master0 ~]$ wc -l *.fastq | awk '{ print $2" \t "$1/4}'
C3KB2ACXX_5_12_11_debar.fastq 	 1354891
C3KB2ACXX_5_12_12_debar.fastq 	 2249353

Getting a report with various statistics about fastq file with ea-utils

This software runs quickly, faster than fastqc and the output can be parsed and formatted with some basics linux command https://code.google.com/p/ea-utils/

How to use fastq-stats with one fastq file

[tranchant@node11 FASTQ]$ fastq-stats -D PdFIE94_R1.fq.gz 
reads	15034838
len	144
len mean	144.0000
len stdev	0.0000
len min	144
phred	33
window-size	2000000
cycle-max	35
qual min	2
qual max	42
qual mean	40.2191
qual stdev	4.8129
%A	25.5314
%C	26.3730
%G	22.4570
%T	25.6209
%N	0.0178
total bases	2165016672

How to use fastq-stats on several fastq files

[tranchant@node2 tranchant]$ ls
DV.UNMAPPED_1.fastq              DV.UNMAPPED_2.fastq
DT.UNMAPPED_1.fastq              DT.UNMAPPED_2.fastq

# you can execute every fastq-stats command by using for loop (bash)
[tranchant@node2 tranchant]$ for file in *fastq; do fastq-stats -D $file > $file.fastq-stats ; done;

#Check that the files have been created
[tranchant@node2 tranchant]$ ls -l *stats
-rw-r--r-- 1 tranchant ggr 246  8 juin  14:55 DT.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 247  8 juin  14:55 DT.UNMAPPED_2.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246  8 juin  14:55 DV.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246  8 juin  14:55 DV.UNMAPPED_2.fastq.fastq-stats

#check one file		
[tranchant@node2 tranchant]$ cat DT.UNMAPPED_1.fastq.fastq-stats
reads	11043989
len	101
len mean	99.3377
len stdev	7.4780
len min	30
phred	33
window-size	2000000
cycle-max	35
qual min	2
qual max	41
qual mean	37.7341
qual stdev	3.8379
%A	29.2366
%C	20.8381
%G	20.7269
%T	29.1967
%N	0.0017
total bases	1097084447

How to get a tab file from all files generated by the command ls | cut | sed | sh

# To generate a header
[tranchant@node11 FASTQ]$ cut -f1 PdFIE94_R1.fq.gz.fastq-stats | paste -s - - > all.stats

# To display header file
[tranchant@node11 FASTQ]$ head all.stats 
reads	len	len mean	len stdev	len min	phred	window-size	cycle-max	qual min	qual max	qual mean	qual stdev	%A	%C	%G	%T	%N	total bases

# To parse every file and to create a tab-delimited file
[tranchant@node11 FASTQ]$ awk 'BEGIN { num=0;  }  { num++; if (num == 1) { print FILENAME }  print $NF; if (num == 18) {print "\n"; num=0; } } END {  } ' *fastq-stats   |   awk ' BEGIN { RS="\n\n"; ;} { print $1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19;}' >> all.stats

# Display 10 first lines of the file all.stats
[tranchant@node11 FASTQ]$ head all.stats 
reads	len	len mean	len stdev	len min	phred	window-size	cycle-max	qual min	qual max	qual mean	qual stdev	%A	%C	%G	%T	%N	total bases
PdFIE94_R1.fq.gz.fastq-stats 144 144.0000 0.0000 144 33 2000000 35 2 42 40.2191 4.8129 25.5314 26.3730 22.4570 25.6209 0.0178 2165016672
PdFIE94_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.8602 10.4600 26.6998 23.6108 22.8715 26.4631 0.3548 2270260538
PdFIE95_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1422 4.9354 25.2075 26.6116 22.5737 25.5869 0.0202 1873211197
PdFIE95_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0387 10.3787 26.5003 24.0084 23.0377 26.0990 0.3546 1978006229
PdFIE96_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1133 4.9931 25.0246 26.6680 22.6115 25.6756 0.0204 1488540911
PdFIE96_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.9494 10.4096 26.5905 24.0838 23.1924 25.7780 0.3554 1571815927
PdFIE98_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.2191 4.8173 25.3874 26.4649 22.4244 25.7031 0.0201 1443785200
PdFIE98_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0267 10.3824 26.6316 23.8662 22.8668 26.2791 0.3564 1524556400

Getting various statistics about fastq and performing a quality control check with fastqc

FastQC perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it.

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

fastqc command

[tranchant@master0 ~]$ fastqc *.fastq -o .
[tranchant@master0 ~]$ ls 
total 543577236
drwxr-xr-x 4 tranchant ggr                4096  9 mars  22:11 PdFIE94_R1.fq_fastqc
-rw-r--r-- 1 tranchant ggr              225516  9 mars  22:11 PdFIE94_R1.fq_fastqc.zip

Using cutadapt to remove adapters and to trim reads based on quality

Cutadapt is a tool specifically designed to remove adapters from NGS data. https://code.google.com/p/cutadapt/

cutadadapt command

cutadapt  -q 30,30 -m 35  -B GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -B GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -b GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG-o P1_R1.CUTADAPT.fastq.gz -p P1_R2.CUTADAPT.fastq.gz P1_R1.fq.gz P1_R2.fq.gz

Cutadapt supports trimming of multiple types of adapters:

Adapter type Command-line option
3’ adapter -a ADAPTER
5’ adapter -g ADAPTER
5’ or 3’ (both possible) -b ADAPTER

-q 30, 30 : by default, only the 3’ end of each read is quality-trimmed. If you want to trim the 5’ end as well, use the -q option with two comma-separated cutoffs

-p is the short form of --paired-output. The option -B is used here to specify an adapter sequence that cutadapt should remove from the second read in each pair.

Mapping reads with bwa

http://bio-bwa.sourceforge.net/

There are several steps involved in mapping sequence reads.

Creating an index of the reference genome with bwa index

bwa index -a is

is for short genome and bwtsw for genome >2Gb

Performing mapping with bwa aln and bwa sampe/samse

Getting sai files with bwa aln

bwa aln reference file_forward.fastq > file_forward.sai
bwa aln reference file_reverse.fastq > file_reverse.sai

Getting sam file with bwa sampe (paired)

Here is a description for the contents of the SAM file: https://samtools.github.io/hts-specs/SAMv1.pdf

bwa sampe reference file_forward.sai file_reverse.sai  file_forward.fastq file_reverse.fastq -f file.sam  -r '@RG        ID:RG  SM:RG  PL:Illumina'

Getting sam file with bwa samse (single)

bwa samse reference file.sai file.fastq file_reverse.fastq -f file.sam  -r '@RG        ID:RG  SM:RG  PL:Illumina'

Performing mapping with bwa mem

bwa mem  reference file.fastq  file_reverse.fastq  -R '@RG\tID:RC3\tSM:RC3\tPL:Illumina' > file.sam 

Processing sam file with picardtools and samtools

https://broadinstitute.github.io/picard/ http://samtools.sourceforge.net/

Creating Reference dictionary and indexes

with picardtools CreateSequenceDictionary

/usr/bin/java -Xmx8g  -jar /us/local/picard-tools-2.5.0/picard.jar CreateSequenceDictionary REFERENCE=Reference.fasta OUTPUT=Reference.dict

with samtools faidx

samtools faidx Rreference.fasta

Converting the sam file into a bam file and sorting the bam file

/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar SortSam  CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT SORT_ORDER=coordinate  INPUT=file.BWASAMPE.sam OUTPUT=file.PICARDTOOLSSORT.bam

The bam index (.bai) is created auomtically by picardtools (samtools index command unnecessary)

Getting some stats from bam file

samtools flagstat file.PICARDTOOLSSORT.bam
14524094 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
11219302 + 0 mapped (77.25%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
samtools idxstats file.PICARDTOOLSSORT.bam

The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. It is written to stdout.

Chromosome_8.1	7978604	2274010	0
Chromosome_8.2	8319966	2274086	0
Chromosome_8.3	6606598	1774006	0
Chromosome_8.4	5546968	1483879	0
Chromosome_8.5	4490059	1217779	0
Chromosome_8.6	4133993	1132535	0
Chromosome_8.7	3415785	945505	0
Chromosome_8.8	535760	117502	0
*	0	0	3304792

Remove multimapping and improperly paired reads with samtools view

https://broadinstitute.github.io/picard/explain-flags.html https://ppotato.wordpress.com/2010/08/25/samtool-bitwise-flag-paired-reads/

# Reads correctly mapped extracted
samtools view -h -b -f=0*02 -o file.SAMTOOLSVIEW-MAPPED.bam file.PICARDTOOLSSORT.bam

# Unmapped reads extracted
samtools view -h -b -F=0*02 -o file.SAMTOOLSVIEW-UNMAPPED.bam file.PICARDTOOLSSORT.bam

Creating index of the last bam generated with samtools index

samtools index file.SAMTOOLSVIEW-MAPPED.bam

Adding read group with picardtools AddOrReplaceReadGroups

... if they haven't been added precedently

/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar AddOrReplaceReadGroups INPUT=20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT=20135.PICARDTOOLSADDORREPLACEREADGROUPS.bam RGPL=Illumina RGLB=20135 RGPU=20135 RGSM=20135 CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log
  • RGPL= Read Group Platform (i.e. Illumina)
  • RGLB= Read Group Library (i.e. DNA preparation library identity i.e "HG19")
  • RGPU= Read Group Platform Unit (i.e Barcode)
  • RGSM= Read Group Sample Name (i.e. Sample Name)

Marking duplicates with picardtools AddOrReplaceReadGroups

... if it is necessary

/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar MarkDuplicates INPUT=20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT=20135.PICARDTOOLSMARKDUPLICATES.bam METRICS_FIL=metrics.txt CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log

Local realignment around INDELS using GATK

https://www.broadinstitute.org/gatk/

Firstly, creating a target list of intervals which need to be realigned with gatk realignerTargetCreator

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T RealignerTargetCreator  -R reference -I file.bam -o file.GATKREALIGNERTARGETCREATOR.intervals 

Then, perform realignment of the target intervals with gatk indelRealigner

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T IndelRealigner  -R reference -I file.bam -targetIntervals file.GATKREALIGNERTARGETCREATOR.intervals -o file.GATKINDELREALIGNER.bam

Variant calling using GATK unifiedGenotyper or GATK haplotypeCaller

GATK haplotypeCaller

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar  -T HaplotypeCaller  -R reference -I file-ind1.PICARDTOOLSMARKDUPLICATES.bam -I  file-ind2.PICARDTOOLSMARKDUPLICATES.bam  -o file.GATKHAPLOTYPECALLER.vcf

GATK unifiedGenotyper

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar   -T UnifiedGenotyper -R reference -I file.bam  -o file.GATKUNIFIEDGENOTYPER.vcf

Basic Filters vcf files using GATK

gatk variantFiltration

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T VariantFiltration --filterName 'FILTER-DP' --filterExpression 'DP<10 || DP>600' --filterName 'LowQual' --filterExpression 'QUAL<30'   -R /reference -o file.GATKVARIANTFILTRATION.vcf --variant file.GATKHAPLOTYPECALLER.vcf

gatk selectVariants

/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -selectType SNP -T SelectVariants  -R reference --variant file.GATKVARIANTFILTRATION.vcf -o file.GATKSELECTVARIANT.vcf

More informations

cf. NGS trainings and linux trainings

⚠️ **GitHub.com Fallback** ⚠️