II. Illumina data quality - barrettlab/2021-Genomics-bootcamp GitHub Wiki

1. Illumina short-read data

Illumina data are in a format called FASTQ

Illumina's explanation of FASTQ format

The file extension is typically '.fastq' or 'fq'. General FASTQ format:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

A more realistic example:

@HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
TTAATTGGTAAATAAATCTCCTAATAGCTTAGATNTTACCTTNNNNNNNNNNTAGTTTCTTGAGATTTGTTGGGGGAGACATTTTTGTGATTGCCTTGAT
+HWI-EAS209_0006_FC706VJ:5:58:5894:21141#ATCACG/1
efcfffffcfeefffcffffffddf`feed]`]_Ba_^__[YBBBBBBBBBBRTT\]][]dddd`ddd^dddadd^BBBBBBBBBBBBBBBBBBBBBBBB

From Wikipedia: Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line). Line 2 is the raw sequence letters. Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again. Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence. Notice the 6-letter barcode, unique to each sequencing library, in the @ and + headers PHRED quality scores are on a log scale

2. Downloading FASTQ data from the Sequence Read Archive using SRAtools

SRA

fastq-dump -X 2 SRR001666 --split-3
### Read 2 spots for SRR001666
### Written 2 spots for SRR001666

From the Wikipedia page:

head SRR001666_1.fastq  SRR001666_2.fastq
###  Because SRA stores the read names as their own format, you can specify the original format:
fastq-dump -X 2 SRR001666 --split-3 --origfmt

3. Assessing data quality with FastQC

Here is a useful link

fastqc --help                   # gives the options for running fastqc
fastqc *.fq                     # use a wildcard to open all .fq files in a directory (DON'T do this for lots of files!!!)
### example
cd /data/cbarrett/2020_07_06_C_striata_ISSRseq/FASTQ2x100_Generation_2020-07-03_20_49_50Z-276174951/
ls -l
fastqc 0103a*.fastq.gz
### This will create a few files that you can now view, e.g. with firefox or a web browser (the .html file)

you can use a 'for' loop in UNIX to run FASTQC on all of your files, if they are in the same directory:

### make an output dir. I'm using the absolute path here, but if you are already in the taget dir, you can use the relative path
mkdir /data/cbarrett/2020_07_06_C_striata_ISSRseq/FASTQ2x100_Generation_2020-07-03_20_49_50Z-276174951/fastqc_output

### use an alias to identify the output dir
output=/data/cbarrett/2020_07_06_C_striata_ISSRseq/FASTQ2x100_Generation_2020-07-03_20_49_50Z-276174951/fastqc_output

### run the loop
for file in /data/cbarrett/2020_07_06_C_striata_ISSRseq/FASTQ2x100_Generation_2020-07-03_20_49_50Z-276174951/*.fastq
do
fastqc -f fastq -o ${output} ${file}
done

4. Quality and adapter trimming/filtering with bbduk

BBDUK Running bbduk:


### Adapter trimming (do this first). Requires input of the adapter sequence, OR (here) a fasta file of the adapter(s)
bbduk.sh in=reads_R1.fq in2=reads_R2.fq  out=clean_R1.fq out2=clean_R2.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo

### This is fine for one pair of reads, but what if you want to trime/filter an entire directory of reads? 
### Use a FOR Loop!

for i in `ls -1 *_1.fastq | sed 's/_1.fastq//'`
do
bbduk.sh -Xmx1g in1=$i\_R1.fastq in2=$i\_R2.fastq out1=$i\_clean_R1.fastq out2=$i\_clean_R2.fastq ref=adapters.fa qtrim=rl trimq=10
done

### Move the cleaned reads to a new dir
mkdir clean_output
mv *clean* clean_output

### Now, you can run FastQC again on the cleaned reads to compare with unprocessed reads

### If we want to get fancy like above when we used FASTQC:

### make an output dir. I'm using the absolute path here, but if you are already in the target dir, you can use the relative path
mkdir /data/cbarrett/2020_07_06_C_striata_ISSRseq/FASTQ2x100_Generation_2020-07-03_20_49_50Z-276174951/clean_output

### use an alias to identify the output dir
output=/data/cbarrett/2020_07_06_C_striata_ISSRseq/FASTQ2x100_Generation_2020-07-03_20_49_50Z-276174951/clean_output

### Disclaimer: I have not tried this yet!!!!!
for i in /data/cbarrett/2020_07_06_C_striata_ISSRseq/FASTQ2x100_Generation_2020-07-03_20_49_50Z-276174951/*.fastq
do
bbduk.sh -Xmx1g in1=$i\_R1.fastq in2=$i\_R2.fastq out1=${output}/$i\_clean_R1.fastq out2=${output}/$i\_clean_R2.fastq ref=adapters.fa qtrim=rl trimq=10
echo ""${sample}" reads processed"
done

### Getting fancier, we can make a 'samples' text file and use that with a loop to run a bunch of commands:

### Get all the filenames of all files (stored in separate subdirectories) 
find . -name "*.fastq.gz" > names.txt
head names.txt


./0114a_UT_L001-ds.c84cc551e3174952a130cf70267b1c80/0114a-UT_S72_L001_R2_001.fastq.gz
./0116e_UT_L001-ds.6106adb48ecf4639b36188159441c31c/0116e-UT_S80_L001_R2_001.fastq.gz
./0116e_UT_L001-ds.6106adb48ecf4639b36188159441c31c/0116e-UT_S80_L001_R1_001.fastq.gz
./0110a_AZ_L001-ds.25f61b606ce84ac195ff4ead1578034c/0110a-AZ_S39_L001_R1_001.fastq.gz
./0110a_AZ_L001-ds.25f61b606ce84ac195ff4ead1578034c/0110a-AZ_S39_L001_R2_001.fastq.gz
./312a_CA_L001-ds.2a83413665cd4c0e98bfd18bb886ff7d/312a-CA_S18_L001_R1_001.fastq.gz
./312a_CA_L001-ds.2a83413665cd4c0e98bfd18bb886ff7d/312a-CA_S18_L001_R2_001.fastq.gz
./0120e_UT_L001-ds.606fce4a62a2462b89e1605fab4305b7/0120e-UT_S64_L001_R2_001.fastq.gz
./0120e_UT_L001-ds.606fce4a62a2462b89e1605fab4305b7/0120e-UT_S64_L001_R1_001.fastq.gz


### We want: '0114a_UT_R1'  --   Use awk to save these fields, using underscores as delimiters
awk -F '_' '{print $1"_"$2"_"$6}' < names.txt > newnames.txt

head newnames.txt

./0114a_UT_R1
./0114a_UT_R2
./0116e_UT_R2
./0116e_UT_R1
./0110a_AZ_R1
./0110a_AZ_R2
./312a_CA_R1
./312a_CA_R2
./0120e_UT_R2
./0120e_UT_R1


### Now, use sed to get rid of the leading './'
sed 's/\.\///g' < newnames.txt > newernames.txt


0114a_UT_R1
0114a_UT_R2
0116e_UT_R2
0116e_UT_R1
0110a_AZ_R1
0110a_AZ_R2
312a_CA_R1
312a_CA_R2
0120e_UT_R2
0120e_UT_R1

### Combine it all with a pipe (|)

find . -name "*.fastq.gz" | awk -F '_' '{print $1"_"$2"_"$6}' | sed 's/\.\///g' > new_names.txt