1. Parental short read sequence data processing - USDA-ARS-GBRU/Pepper_TrioBinning GitHub Wiki

Quality control of short-read, paired-end Illumina sequencing data for the parental lines HDA149 and HDA330. Raw sequencing data can be downloaded from NCBI SRA under accessions SRR21710630 (HDA149) and SRR21710629 (HDA330).

Raw File details

HDA149

HDA149_BDPL200001952-1A_HJWJNDSXY_L3_1.fq.gz (12 Gb)
HDA149_BDPL200001952-1A_HJWJNDSXY_L3_2.fq.gz (12 Gb)
HDA149_BDPL200001952-1A_HJWKKDSXY_L2_1.fq.gz (11 Gb)
HDA149_BDPL200001952-1A_HJWKKDSXY_L2_2.fq.gz (11 Gb)
HDA149_BDPL200001952-1A_HJWKKDSXY_L3_1.fq.gz (9.8 Gb)
HDA149_BDPL200001952-1A_HJWKKDSXY_L3_2.fq.gz (11 Gb)
HDA149_BDPL200001952-1A_HJWKKDSXY_L4_1.fq.gz (11 Gb)
HDA149_BDPL200001952-1A_HJWKKDSXY_L4_2.fq.gz (11 Gb)

HDA330

HDA330_FDPL210204928-1a_H3VYNDSX2_L4_1.fq.gz (4.6 Gb)
HDA330_FDPL210204928-1a_H3VYNDSX2_L4_2.fq.gz (5.0 Gb)
HDA330_FDPL210204928-1a_H3Y2LDSX2_L3_1.fq.gz (32 Gb)
HDA330_FDPL210204928-1a_H3Y2LDSX2_L3_2.fq.gz (34 Gb)

FastQC

The quality of the raw sequence data was checked with FastQC using the following Slurm script.

#!/bin/bash
#SBATCH --job-name=fastqc
#SBATCH -N 1
#SBATCH -n 20
#SBATCH -o "%x_%j.o"
#SBATCH -e "%x_%j.e"

module load java
module load fastqc

# -t specifies number of threads (#SBATCH -n 20)
fastqc /rawdata/*.fq.gz -t 20 -outdir /fastqc/

-t 20 sets the number of threads to 20, matching #SBATCH -n 20
/rawdata/*.fq.gz selects all gzipped FASTQ files in the rawdata directory
-outdir /fastqc/ writes the corresponding output files to the /fastqc/ directory; FastQC does not create this directory, so it must exist before the job runs

Examine the resulting HTML files to find where trimming should be made. In this case, the first 12 bp of each read needed to be trimmed.
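With several lanes per parent, opening every HTML report by hand is slow. As a rough sketch, assuming FastQC was run with --extract (or the *_fastqc.zip archives were unzipped) so that each report directory contains a summary.txt, the helper below lists every module that did not pass. The function name fastqc_flags is hypothetical and not part of FastQC.

```bash
#!/bin/bash
# Hypothetical helper: list FastQC modules flagged WARN or FAIL.
# Assumes each summary.txt holds tab-separated lines:
#   STATUS<TAB>MODULE<TAB>FILENAME
fastqc_flags() {
    # Skip empty lines, then print file, status, and module
    # for every line whose status column is not PASS.
    awk -F '\t' 'NF && $1 != "PASS" { print $3 "\t" $1 "\t" $2 }' "$@"
}
```

It could be called as fastqc_flags /fastqc/*_fastqc/summary.txt, where the path pattern is an assumption about where the extracted reports live. The HTML reports are still needed to decide how many bases to trim; this only narrows down which files to open first.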

fastp

Reads were trimmed and filtered with fastp v0.23.4 using the following Slurm script. For brevity, only the first pair of HDA149 files is shown as an example. Thanks fastp! See fastp's manual.

#!/bin/bash
#SBATCH --job-name=fastp_HDA149
#SBATCH -N 1
#SBATCH -n 20
#SBATCH -o "%x_%j.o"
#SBATCH -e "%x_%j.e"


date

# download and execute pre-compiled binary for fastp
# wget http://opengene.org/fastp/fastp
# chmod a+x ./fastp
# From <https://github.com/OpenGene/fastp>
# in /software/fastp

fastp='/software/fastp'
IN='/rawdata/HDA149'
OUT='/fastp/HDA149'
file='HDA149_BDPL200001952-1A_HJWJNDSXY_L3'  # _1.fq.gz
########################################################

echo 'Running fastp for...'
echo "${file}"

${fastp} \
  -i ${IN}/${file}_1.fq.gz -I ${IN}/${file}_2.fq.gz \
  -o ${OUT}/${file}_1.fp.fq.gz -O ${OUT}/${file}_2.fp.fq.gz \
  --json ${OUT}/${file}.json --html ${OUT}/${file}.html \
  --length_required 50 \
  --detect_adapter_for_pe \
  --trim_poly_g \
  --trim_front1 12 --trim_front2 12

echo 'fastp complete for...'
echo "${file}"

date

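Input files are read 1 (${file}_1.fq.gz) and read 2 (${file}_2.fq.gz); trimmed output is written to ${file}_1.fp.fq.gz and ${file}_2.fp.fq.gz
--length_required 50 discards reads shorter than 50 bp after trimming
--detect_adapter_for_pe enables adapter auto-detection for paired-end data
--trim_poly_g trims polyG from the 3' ends
--trim_front1 12 --trim_front2 12 trims the first 12 bp from all reads of both pairs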
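Only one lane is shown above, and the remaining pairs need the same command. One way to cover them, sketched under the assumption that every pair follows the prefix_1.fq.gz / prefix_2.fq.gz naming seen in the raw file list, is to derive the prefixes from the input directory and print one fastp command per pair. The function names list_prefixes and fastp_commands are hypothetical; the fastp options are copied from the script above.

```bash
#!/bin/bash
# Sketch only: function names are hypothetical; fastp flags mirror the script above.

# Print the shared prefix of each read pair in a directory, one per line.
list_prefixes() {
    local dir=$1 r1
    for r1 in "${dir}"/*_1.fq.gz; do
        [ -e "${r1}" ] || continue      # the glob matched nothing
        r1=${r1##*/}                    # drop the directory part
        printf '%s\n' "${r1%_1.fq.gz}"  # drop the read-1 suffix
    done
}

# Print (do not run) one fastp command per read pair.
# Arguments: fastp binary, input directory, output directory.
fastp_commands() {
    local fastp=$1 indir=$2 outdir=$3 p
    list_prefixes "${indir}" | while IFS= read -r p; do
        printf '%s -i %s -I %s -o %s -O %s --json %s --html %s %s\n' \
            "${fastp}" \
            "${indir}/${p}_1.fq.gz" "${indir}/${p}_2.fq.gz" \
            "${outdir}/${p}_1.fp.fq.gz" "${outdir}/${p}_2.fp.fq.gz" \
            "${outdir}/${p}.json" "${outdir}/${p}.html" \
            "--length_required 50 --detect_adapter_for_pe --trim_poly_g --trim_front1 12 --trim_front2 12"
    done
}
```

For example, fastp_commands /software/fastp /rawdata/HDA149 /fastp/HDA149 prints the commands for review. Printing rather than executing keeps the sketch safe to try; once the output looks right it can be piped to bash or pasted into a Slurm script, provided the paths contain no spaces.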