Test_2_Sanskriti - sansuach/Sanskritiacharya GitHub Wiki

First, I logged into the sphinx and navigated to the analysis directory using the following command:

cd /pickett_sphinx/teaching/EPP622_2024/test2/analysis/

I created a new directory named sachar10 within the analysis directory:

mkdir sachar10
cd sachar10

Inside sachar10, I set up a directory structure to store various analysis results. The first subdirectory I created was 1_fastqc, which will be used to store FastQC results and entered the directory:

mkdir 1_fastqc
cd 1_fastqc/

I created symbolic links to each of the raw FastQ files stored in the raw_data directory using the ln -s command :

ln -s ../../../raw_data/SRR6922141_1.fastq .
ln -s ../../../raw_data/SRR6922185_1.fastq .
ln -s ../../../raw_data/SRR6922187_1.fastq .
ln -s ../../../raw_data/SRR6922236_1.fastq .

To analyze the FastQ files, I loaded the FastQC software using Spack:

spack load fastqc

I then ran FastQC on the linked FastQ files to generate quality reports:

fastqc SRR6922141_1.fastq 
fastqc SRR6922185_1.fastq 
fastqc SRR6922187_1.fastq 
fastqc SRR6922236_1.fastq 

After generating the reports, I used scp (secure copy) to transfer the FastQC HTML files to my local Desktop for review:

scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/1_fastqc/SRR6922141_1_fastqc.html /home/mobaxterm/Desktop/
scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/1_fastqc/SRR6922185_1_fastqc.html /home/mobaxterm/Desktop/
scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/1_fastqc/SRR6922187_1_fastqc.html /home/mobaxterm/Desktop/
scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/1_fastqc/SRR6922236_1_fastqc.html /home/mobaxterm/Desktop/

I went back to the sachar10 directory and created a new directory called 2_skewer for running the Skewer, which is used for adapter trimming and quality filtering of FastQ files and cd into it:

mkdir 2_skewer
cd 2_skewer

I then created symbolic links to the same raw data files in this directory:

ln -s /pickett_sphinx/teaching/EPP622_2024/test2/raw_data/SRR6922141_1.fastq .
ln -s /pickett_sphinx/teaching/EPP622_2024/test2/raw_data/SRR6922185_1.fastq .
ln -s /pickett_sphinx/teaching/EPP622_2024/test2/raw_data/SRR6922187_1.fastq .
ln -s /pickett_sphinx/teaching/EPP622_2024/test2/raw_data/SRR6922236_1.fastq .

Next, I moved on to trimming the raw sequence data using skewer. To trim all my sequence files in one go, I used a simple for loop. Here’s the command I used:

for filename in SRR69* 
do
  BASE=$( basename $filename | sed 's/.fastq//g') 
  echo $BASE 
  /sphinx_local/software/skewer/skewer 
  -t 2 -l 95 -Q 30 
  -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG 
  $filename -o $BASE 
done

After trimming the raw reads with skewer, I wanted to check the quality of the trimmed files using fastqc. To do this for all the trimmed files in the directory at once, I set up a simple for loop:

for filename in SRR69*-trimmed.fastq
do
  echo "Running fastqc on $filename"
  fastqc $filename
done

Then,I used scp (secure copy) to transfer the trimmed HTML files to my local Desktop for review:

scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/2_skewer/SRR692141_1-trimmed_fastqc.html /home/mobaxterm/Desktop/
scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/2_skewer/SRR6922185_1-trimmed_fastqc.html /home/mobaxterm/Desktop/
scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/2_skewer/SRR6922187_1-trimmed_fastqc.html /home/mobaxterm/Desktop/
scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/2_skewer/SRR6922236_1-trimmed_fastqc.html /home/mobaxterm/Desktop/

I went back to the sachar10 directory and created a new directory called 3_bwa to organize all the files and results related to BWA (Burrows-Wheeler Aligner) mapping:

mkdir 3_bwa
cd 3_bwa/

Then, I navigated to the directory containing the reference genome file which was downloaded and indexed and was provided in the lab wiki and came back to the 3_bwa directory. The reference genome acts as a template for mapping my reads:

cd /pickett_sphinx/teaching/EPP622_2024/raw_data/genome

First, I created a new directory called solenopsis_genome_index within my 3_bwa directory and I created symbolic links to the necessary genome files using for loops.This ensures that my workflow has access to the reference genome files without duplicating them:

cd ../../test2/analysis/sachar10/3_bwa/
mkdir solenopsis_genome_index
cd solenopsis_genome_index/
for filename in /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz*
do
    ln -s "$filename"
    echo "Created symbolic link for $filename"
done

Next, I changed back into my 3_bwa directory and symbolically linked all my trimmed fastq files from 2_skewer directory:

cd ../

for filename in ../2_skewer/SRR69*-trimmed.fastq
do
    ln -s "$filename"
done

Now, that all the trimmed fastq files are linked in my 3_bwa directory, for each of the linked trimmed FASTQ files, I perform mapping and sorting by installing BWA and Samtools:

spack load bwa
spack load [email protected]

Here, I run BWA to align my trimmed reads (SRR69*.fastq) to the Solenopsis invicta reference genome.The bwa mem command performs the mapping, and samtools sort sorts the resulting alignments by genomic coordinates, creating a .bam file (Binary Alignment/Map) for each input FASTQ file:

for filename in SRR69*.fastq
do
  basename=$(echo "$filename" | sed 's/.fastq//')
  echo $filename
  echo $basename
  bwa mem -t 3 \
  /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz \
  ${basename}.fastq \
  | samtools sort \
  -@ 3 -m 4G \
  -o ${basename}_sorted.bam
done

Next, I used samtools flagstat to calculate basic statistics for each BAM file. The .stats files summarize key metrics like:

  • Total number of reads
  • Number of primary mapped reads and supplementary reads
  • Percentage of properly paired reads
for filename in SRR69*_sorted.bam
do
  samtools flagstat "$filename">"$filename.stats"
  echo $filename.stats
done
samtools flagstat SRR6922141_1-trimmed_sorted.bam
samtools flagstat SRR6922185_1-trimmed_sorted.bam
samtools flagstat SRR6922187_1-trimmed_sorted.bam
samtools flagstat SRR6922236_1-trimmed_sorted.bam

Then, I added read group information to my sorted BAM files using GATK's AddOrReplaceReadGroups tool for downstream analyses, such as variant calling, as it allows me to keep track of the origin of each read.Before running the command, I loaded the required software.Then, I looped through each sorted.bam file in the current directory, extracted the sample name from the file name, and then used GATK to add the read group information. The command below also indexes the resulting BAM file with Samtools for efficient access:

spack load [email protected]_10
spack load [email protected]
for BAM in *_sorted.bam; do
    SAMPLENAME=$(basename "$BAM" | sed 's/_sorted.bam$//')
    
    /pickett_shared/software/gatk-4.2.6.1/gatk AddOrReplaceReadGroups \
        --INPUT $BAM \
        --OUTPUT ${SAMPLENAME}_sorted.RG.bam \
        --RGSM $SAMPLENAME \
        --RGLB $SAMPLENAME \
        --RGPL ILLUMINA \
        --RGPU $SAMPLENAME

    samtools index ${SAMPLENAME}_sorted.RG.bam
done

I created the 4_gatk directory within sachar10 to organize the sorted BAM files and reference genome files needed for GATK analysis. Using symbolic links (ln -s), I connected the necessary files to the 4_gatk directory and navigated into it:

mkdir 4_gatk
cd 4_gatk

I used a for loop to iterate over the sorted BAM files (_sorted.RG.) from the ../3_bwa directory. Using ln -s, I created symbolic links to these files within the 4_gatk directory:

for filename in ../3_bwa/SRR69*_sorted.RG.* 
do
    ln -s "$filename" .
done

Next, I linked four key reference genome files from the /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/ directory. This includes the compressed genome sequence, its index, and the sequence dictionary file.The .fa.gz file contains the genome sequence, while the .gzi, .fai, and .dict files provide the necessary indexing and dictionary for rapid access during analysis:

for ref_file in /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz \
                /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz.gzi \
                /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.gz.fai \
                /pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.dict; do
    ln -s "$ref_file" .
done

I used GATK’s HaplotypeCaller tool to identify variants for multiple read files (SRR69*_sorted.RG.bam).The HaplotypeCaller is a core variant calling tool in GATK, useful for identifying SNPs and Indels. The command repeats over each BAM file, calling variants against a reference genome:

for filename in SRR69*_sorted.RG.bam; do
    base=$(basename "$filename" _sorted.RG.bam)  
    /pickett_shared/software/gatk-4.2.6.1/gatk \
   --java-options "-Xmx4G" \
   HaplotypeCaller \
    -R solenopsis_invicta_genome.fa.gz \
    -I "$filename" \
   *O "${base}_NC_052664.1.vcf" \
    *bamout "${base}_sorted_NC_052664.1.RG.realigned.bam" \
    -L NC_052664.1
done

Once the VCF files were generated, I used bcftools to count the number of SNPs and Indels separately in each VCF file:

spack load bcftools
for filename in SRR69*.1.vcf; do  bcftools view -v snps "$filename" | grep -v "^#" | wc -l; done
for filename in SRR69*.1.vcf; do  bcftools view -v indels "$filename" | grep -v "^#" | wc -l; done 

To visually inspect the identified variants, I extracted the first 5 SNPs and Indels from each VCF file for quick review and also to fill up file and line in question 6 and 7 in the answer sheet for test 2:

for filename in SRR69*_NC_052664.1.vcf
do
    echo "SNPs in $filename:"
    bcftools view -v snps "$filename" | grep -v '^##' | head -n 5 
done
for filename in SRR69*_NC_052664.1.vcf
do
     echo "Indels in $filename:"
    bcftools view -v indels "$filename" | grep -v '^##' | head -n 5 
done

I used GATK’s HaplotypeCaller again to produce GVCF files, which can be used for multi-sample joint genotyping. This was done with the following command:

for filename in SRR69*_sorted.RG.bam
do
        /pickett_shared/software/gatk-4.2.6.1/gatk \
	--java-options "-Xmx4G" \
	HaplotypeCaller \
	-R solenopsis_invicta_genome.fa.gz \
	-I "$filename" \
	-O "$filename".g.vcf \
	-ERC GVCF
done

I used GATK (Genome Analysis Toolkit) and BCFTools for variant filtration and statistical analysis.First, I created a directory to store the GVCF files and navigated into it:

mkdir GVCFs
cd GVCFs

I copied all the GVCF files (SRR69*.g.vcf) created earlier into the GVCFs directory using a loop:

for filename in SRR69*.g.vcf
do
  cp "$filename" /pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/GVCFs
done                                                                                    

Next, I created a 5_gatk_combine directory and moved into it:

mkdir 5_gatk_combine
cd 5_gatk_combine

Then I created symbolic links to the genome reference files and the GVCF files to prepare for GATK's CombineGVCFs and GenotypeGVCFs steps:


ln -s $(readlink -e ../4_gatk/solenopsis_invicta_genome.*) .

ln -s /pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/GVCFs/SRR69*.g.vcf .

I generated a list called sample.list of all the GVCF files for further processing.GATK will use this file for the combination step:

ls SRR69*.g.vcf > sample.list

I ran the GATK's CombineGVCFs tool to merge the individual GVCF files into one combined file for further variant calling:

/pickett_shared/software/gatk-4.2.6.1/gatk \
	CombineGVCFs \
	-R solenopsis_invicta_genome.fa.gz \
	--variant sample.list \
	-O solenopsis_combined.g.vcf

After combining, I ran GenotypeGVCFs to generate a VCF file from the combined GVCFs:

/pickett_shared/software/gatk-4.2.6.1/gatk \
	GenotypeGVCFs \
	-R solenopsis_invicta_genome.fa.gz  \
	-V solenopsis_combined.g.vcf \
	-O solenopsis_combined.vcf

To get the statistics, I loaded BCFTools and generated a stats file for the combined VCF:

spack load bcftools

bcftools stats solenopsis_combined.vcf > solenopsis_combined.vcf.stats.txt

less solenopsis_combined.vcf.stats.txt

Next, I created a directory 6_filter for filtering and linked the necessary files:

mkdir 6_filter
cd 6_filter
ln -s $(readlink -e ../5_gatk_combine/solenopsis_invicta_genome*) .
ln -s $(readlink -e ../5_gatk_combine/solenopsis_combined.vcf*) . 

For basic filtering, I used GATK’s VariantFiltration to apply a filter based on QUAL and DP:

/pickett_shared/software/gatk-4.2.6.1/gatk \
> VariantFiltration \
> -R solenopsis_invicta_genome.fa.gz  \
> -V solenopsis_combined.vcf \
> -O solenopsis_combined.Basicfilters.vcf \
> --filter-name "basic_filter" --filter-expression "QUAL > 20.0 && DP > 5"

I then ran GATK to apply more advanced filtering with several additional filters:

/pickett_shared/software/gatk-4.2.6.1/gatk \
	VariantFiltration \
	-R solenopsis_invicta_genome.fa.gz  \
	-V solenopsis_combined.vcf \
	-O solenopsis_combined.GATKfilters.vcf \
	--filter-name "QD_filter" --filter-expression "QD < 2.0" \
	--filter-name "FS_filter" --filter-expression "FS > 60.0" \
	--filter-name "MQ_filter" --filter-expression "MQ < 40.0" \
	--filter-name "SOR_filter" --filter-expression "SOR > 4.0" \
	--filter-name "MQRankSum" --filter-expression "MQRankSum < -12.5" \
	--filter-name "ReadPosRankSum" --filter-expression "ReadPosRankSum < -8.0"

To generate statistics for both the basic and advanced filtered VCF files, I used BCFTools:

spack load bcftools

bcftools stats -f "PASS,." ../5_gatk_combine/solenopsis_combined.vcf > solenopsis_combined.vcf.stats.txt

bcftools stats -f "PASS,." solenopsis_combined.Basicfilters.vcf > solenopsis_combined.Basicfilters.vcf.stats.txt

bcftools stats -f "PASS,." solenopsis_combined.GATKfilters.vcf > solenopsis_combined.GATKfilters.vcf.stats.txt

Finally, I used grep to pull out the number of SNPs and indels from the stats files after applying different filters. This helps to assess the impact of the filters on the data:

grep 'number of SNPs:' *stats.txt
grep 'number of indels:' *stats.txt

At last, I transferred all necessary files from the remote server to my local desktop. The files included the genome reference (solenopsis_invicta_genome.fa), its index file (solenopsis_invicta_genome.fa.fai), BAM files for alignments (SRR6922141_1-trimmed_sorted.RG.bam and its index), as well as GATK-processed BAM and VCF files:

scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa .
scp [email protected]:/pickett_sphinx/teaching/EPP622_2024/raw_data/genome/solenopsis_invicta_genome.fa.fai .
scp [email protected]://pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/3_bwa/SRR6922141_1-trimmed_sorted.RG.bam . 
scp [email protected]://pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/3_bwa/SRR6922141_1-trimmed_sorted.RG.bam.bai .

scp [email protected]://pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/4_gatk/SRR6922141_1-trimmed_sorted_NC_052664.1.RG.realigned.bai .
scp [email protected]://pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/4_gatk/SRR6922141_1-trimmed_sorted_NC_052664.1.RG.realigned.bam .
scp [email protected]://pickett_sphinx/teaching/EPP622_2024/test2/analysis/sachar10/4_gatk/SRR6922141_1-trimmed_NC_052664.1.vcf .

Next, I loaded the transferred BAM and VCF files into IGV (Integrative Genomics Viewer) for analysis. I used IGV to visually inspect the alignment of reads against the reference genome, paying close attention to regions where variants were called.Using IGV, I reviewed the variant positions highlighted in the VCF file and analyzed whether the alignments supported these variants.Screenshots were taken of the IGV views for documentation, including the alignment tracks and variant calls, and the final answers were based on both visual inspection and data from the VCF file.