Polymorphism detection from fastq files on a linux terminal - SouthGreenPlatform/tutorials GitHub Wiki
This page describes a serie of tools and linux commands used to detect polymorphism from raw data (fastq files) to polymorphism file (vcf).
We need, in this tutorial:
- 2 fastq files paired
- a reference file used for the mapping step.
Authors | Christine Tranchant-Dubreuil |
---|---|
Research Unit | UMR DIADE |
Institut |
fastq-stats, fastqc, cutadapt, bwa, samtools, gatk, picardtools
fastq, sam, bam, vcf
10/03/2017
- Getting basic informations about fastq files
- Getting a report with various statistics about file with
ea-utils
- Getting various statistics about fastq and performing a quality control check with
fastqc
- Using
cutadapt
to remove adapters and to trim reads based on quality - Mapping reads with
bwa
- Processing sam file with
picardtools
andsamtools
- Local realignment around INDELS using
GATK
- Variant calling using
GATK unifiedGenotyper
orGATK haplotypeCaller
- Basic Filters vcf files using
GATK
[tranchant@master0 ~]$ du -sh *fastq
311M C3KB2ACXX_5_12_11_debar.fastq
525M C3KB2ACXX_5_12_12_debar.fastq
[tranchant@master0 ~]$ wc -l *.fastq | awk '{ print $2" \t "$1/4}'
C3KB2ACXX_5_12_11_debar.fastq 1354891
C3KB2ACXX_5_12_12_debar.fastq 2249353
This software runs quickly, faster than fastqc and the output can be parsed and formatted with some basics linux command https://code.google.com/p/ea-utils/
[tranchant@node11 FASTQ]$ fastq-stats -D PdFIE94_R1.fq.gz
reads 15034838
len 144
len mean 144.0000
len stdev 0.0000
len min 144
phred 33
window-size 2000000
cycle-max 35
qual min 2
qual max 42
qual mean 40.2191
qual stdev 4.8129
%A 25.5314
%C 26.3730
%G 22.4570
%T 25.6209
%N 0.0178
total bases 2165016672
[tranchant@node2 tranchant]$ ls
DV.UNMAPPED_1.fastq DV.UNMAPPED_2.fastq
DT.UNMAPPED_1.fastq DT.UNMAPPED_2.fastq
# you can execute every fastq-stats command by using for loop (bash)
[tranchant@node2 tranchant]$ for file in *fastq; do fastq-stats -D $file > $file.fastq-stats ; done;
#Check that the files have been created
[tranchant@node2 tranchant]$ ls -l *stats
-rw-r--r-- 1 tranchant ggr 246 8 juin 14:55 DT.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 247 8 juin 14:55 DT.UNMAPPED_2.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246 8 juin 14:55 DV.UNMAPPED_1.fastq.fastq-stats
-rw-r--r-- 1 tranchant ggr 246 8 juin 14:55 DV.UNMAPPED_2.fastq.fastq-stats
#check one file
[tranchant@node2 tranchant]$ cat DT.UNMAPPED_1.fastq.fastq-stats
reads 11043989
len 101
len mean 99.3377
len stdev 7.4780
len min 30
phred 33
window-size 2000000
cycle-max 35
qual min 2
qual max 41
qual mean 37.7341
qual stdev 3.8379
%A 29.2366
%C 20.8381
%G 20.7269
%T 29.1967
%N 0.0017
total bases 1097084447
How to get a tab file from all files generated by the command ls | cut | sed | sh
# To generate a header
[tranchant@node11 FASTQ]$ cut -f1 PdFIE94_R1.fq.gz.fastq-stats | paste -s - - > all.stats
# To display header file
[tranchant@node11 FASTQ]$ head all.stats
reads len len mean len stdev len min phred window-size cycle-max qual min qual max qual mean qual stdev %A %C %G %T %N total bases
# To parse every file and to create a tab-delimited file
[tranchant@node11 FASTQ]$ awk 'BEGIN { num=0; } { num++; if (num == 1) { print FILENAME } print $NF; if (num == 18) {print "\n"; num=0; } } END { } ' *fastq-stats | awk ' BEGIN { RS="\n\n"; ;} { print $1,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19;}' >> all.stats
# Display 10 first lines of the file all.stats
[tranchant@node11 FASTQ]$ head all.stats
reads len len mean len stdev len min phred window-size cycle-max qual min qual max qual mean qual stdev %A %C %G %T %N total bases
PdFIE94_R1.fq.gz.fastq-stats 144 144.0000 0.0000 144 33 2000000 35 2 42 40.2191 4.8129 25.5314 26.3730 22.4570 25.6209 0.0178 2165016672
PdFIE94_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.8602 10.4600 26.6998 23.6108 22.8715 26.4631 0.3548 2270260538
PdFIE95_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1422 4.9354 25.2075 26.6116 22.5737 25.5869 0.0202 1873211197
PdFIE95_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0387 10.3787 26.5003 24.0084 23.0377 26.0990 0.3546 1978006229
PdFIE96_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.1133 4.9931 25.0246 26.6680 22.6115 25.6756 0.0204 1488540911
PdFIE96_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 34.9494 10.4096 26.5905 24.0838 23.1924 25.7780 0.3554 1571815927
PdFIE98_R1.fq.gz.fastq-stats 143 143.0000 0.0000 143 33 2000000 35 2 42 40.2191 4.8173 25.3874 26.4649 22.4244 25.7031 0.0201 1443785200
PdFIE98_R2.fq.gz.fastq-stats 151 151.0000 0.0000 151 33 2000000 35 2 42 35.0267 10.3824 26.6316 23.8662 22.8668 26.2791 0.3564 1524556400
FastQC
perform some simple quality control checks to ensure that the raw data looks good and there are no problems or biases in data which may affect how user can usefully use it.
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
[tranchant@master0 ~]$ fastqc *.fastq -o .
[tranchant@master0 ~]$ ls
total 543577236
drwxr-xr-x 4 tranchant ggr 4096 9 mars 22:11 PdFIE94_R1.fq_fastqc
-rw-r--r-- 1 tranchant ggr 225516 9 mars 22:11 PdFIE94_R1.fq_fastqc.zip
Cutadapt
is a tool specifically designed to remove adapters from NGS data.
https://code.google.com/p/cutadapt/
cutadapt -q 30,30 -m 35 -B GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -B GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG -b GATCGGAAGAGCACACGTCTGAACTCCAGTCACATCACGATCTCGTATGCCGTCTTCTGCTTG -b GTTCGTCTTCTGCCGTATGCTCTAGCACTACACTGACCTCAAGTCTGCACACGAGAAGGCTAG-o P1_R1.CUTADAPT.fastq.gz -p P1_R2.CUTADAPT.fastq.gz P1_R1.fq.gz P1_R2.fq.gz
Cutadapt
supports trimming of multiple types of adapters:
Adapter type | Command-line option |
---|---|
3’ adapter | -a ADAPTER |
5’ adapter | -g ADAPTER |
5’ or 3’ (both possible) | -b ADAPTER |
-q 30, 30 : by default, only the 3’ end of each read is quality-trimmed. If you want to trim the 5’ end as well, use the -q option with two comma-separated cutoffs
-p is the short form of --paired-output. The option -B is used here to specify an adapter sequence that cutadapt should remove from the second read in each pair.
http://bio-bwa.sourceforge.net/
There are several steps involved in mapping sequence reads.
bwa index -a is
is
for short genome and bwtsw
for genome >2Gb
Getting sai files with bwa aln
bwa aln reference file_forward.fastq > file_forward.sai
bwa aln reference file_reverse.fastq > file_reverse.sai
Getting sam file with bwa sampe
(paired)
Here is a description for the contents of the SAM file: https://samtools.github.io/hts-specs/SAMv1.pdf
bwa sampe reference file_forward.sai file_reverse.sai file_forward.fastq file_reverse.fastq -f file.sam -r '@RG ID:RG SM:RG PL:Illumina'
Getting sam file with bwa samse
(single)
bwa samse reference file.sai file.fastq file_reverse.fastq -f file.sam -r '@RG ID:RG SM:RG PL:Illumina'
bwa mem reference file.fastq file_reverse.fastq -R '@RG\tID:RC3\tSM:RC3\tPL:Illumina' > file.sam
https://broadinstitute.github.io/picard/ http://samtools.sourceforge.net/
with picardtools CreateSequenceDictionary
/usr/bin/java -Xmx8g -jar /us/local/picard-tools-2.5.0/picard.jar CreateSequenceDictionary REFERENCE=Reference.fasta OUTPUT=Reference.dict
with samtools faidx
samtools faidx Rreference.fasta
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar SortSam CREATE_INDEX=TRUE VALIDATION_STRINGENCY=SILENT SORT_ORDER=coordinate INPUT=file.BWASAMPE.sam OUTPUT=file.PICARDTOOLSSORT.bam
The bam index (.bai) is created auomtically by picardtools
(samtools index
command unnecessary)
samtools flagstat file.PICARDTOOLSSORT.bam
14524094 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplimentary
0 + 0 duplicates
11219302 + 0 mapped (77.25%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
samtools idxstats file.PICARDTOOLSSORT.bam
The output is TAB-delimited with each line consisting of reference sequence name, sequence length, # mapped reads and # unmapped reads. It is written to stdout.
Chromosome_8.1 7978604 2274010 0
Chromosome_8.2 8319966 2274086 0
Chromosome_8.3 6606598 1774006 0
Chromosome_8.4 5546968 1483879 0
Chromosome_8.5 4490059 1217779 0
Chromosome_8.6 4133993 1132535 0
Chromosome_8.7 3415785 945505 0
Chromosome_8.8 535760 117502 0
* 0 0 3304792
https://broadinstitute.github.io/picard/explain-flags.html https://ppotato.wordpress.com/2010/08/25/samtool-bitwise-flag-paired-reads/
# Reads correctly mapped extracted
samtools view -h -b -f=0*02 -o file.SAMTOOLSVIEW-MAPPED.bam file.PICARDTOOLSSORT.bam
# Unmapped reads extracted
samtools view -h -b -F=0*02 -o file.SAMTOOLSVIEW-UNMAPPED.bam file.PICARDTOOLSSORT.bam
samtools index file.SAMTOOLSVIEW-MAPPED.bam
... if they haven't been added precedently
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar AddOrReplaceReadGroups INPUT=20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT=20135.PICARDTOOLSADDORREPLACEREADGROUPS.bam RGPL=Illumina RGLB=20135 RGPU=20135 RGSM=20135 CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log
- RGPL= Read Group Platform (i.e. Illumina)
- RGLB= Read Group Library (i.e. DNA preparation library identity i.e "HG19")
- RGPU= Read Group Platform Unit (i.e Barcode)
- RGSM= Read Group Sample Name (i.e. Sample Name)
... if it is necessary
/usr/bin/java -Xmx12g -jar /usr/local/picard-tools-2.5.0/picard.jar MarkDuplicates INPUT=20135.SAMTOOLSVIEW-MAPPED.bam OUTPUT=20135.PICARDTOOLSMARKDUPLICATES.bam METRICS_FIL=metrics.txt CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT > 20135.PICARDTOOLSADDORREPLACEREADGROUPS.log
https://www.broadinstitute.org/gatk/
Firstly, creating a target list of intervals which need to be realigned with gatk realignerTargetCreator
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference -I file.bam -o file.GATKREALIGNERTARGETCREATOR.intervals
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T IndelRealigner -R reference -I file.bam -targetIntervals file.GATKREALIGNERTARGETCREATOR.intervals -o file.GATKINDELREALIGNER.bam
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar -T HaplotypeCaller -R reference -I file-ind1.PICARDTOOLSMARKDUPLICATES.bam -I file-ind2.PICARDTOOLSMARKDUPLICATES.bam -o file.GATKHAPLOTYPECALLER.vcf
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -rf BadCigar -T UnifiedGenotyper -R reference -I file.bam -o file.GATKUNIFIEDGENOTYPER.vcf
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -T VariantFiltration --filterName 'FILTER-DP' --filterExpression 'DP<10 || DP>600' --filterName 'LowQual' --filterExpression 'QUAL<30' -R /reference -o file.GATKVARIANTFILTRATION.vcf --variant file.GATKHAPLOTYPECALLER.vcf
/usr/bin/java -Xmx12g -jar /usr/local/gatk-3.7/GenomeAnalysisTK.jar -selectType SNP -T SelectVariants -R reference --variant file.GATKVARIANTFILTRATION.vcf -o file.GATKSELECTVARIANT.vcf
cf. NGS trainings and linux trainings