8 Gene Flow Analysis - PhyloAI/Ortho2Web GitHub Wiki

8.1 SNP Calling

This pipeline provides a comprehensive approach to SNP calling and variant filtering in genomic data, utilizing tools such as Burrows-Wheeler Aligner (BWA), SAMtools, Picard, Genome Analysis Toolkit (GATK), and VCFtools. Each tool plays a crucial role, from initial reference sequence indexing to alignment, variant calling, and filtering. The objective of this pipeline is to detect reliable SNPs for downstream analysis, ensuring data quality and accuracy.

Dependencies:

  • BWA: Used for aligning sequencing reads to a reference genome, supporting the identification of sequence variations.
  • SAMtools: Manages alignment data in SAM/BAM format, enabling file conversion, sorting, and indexing.
  • Picard: Primarily used here for creating sequence dictionaries and marking duplicate reads, a step that improves variant calling accuracy.
  • GATK: A toolkit for variant discovery and genotyping; in this pipeline, it was used for SNP calling, variant filtration, and genotype processing. VCFtools: A set of tools for filtering and working with VCF files, enabling further refinement of SNP data.

Step 1: Creating reference sequence index

conda install bioconda::bwa
conda install bioconda::samtools=1.9
conda install bioconda/label/cf201901::picard
conda install bioconda/label/cf201901::gatk4
# Install necessary tools for reference sequence indexing and SNP calling

bwa index -a bwtsw ASM2249599_genomic.fna
# Create BWA index for reference genome
samtools faidx ASM2249599_genomic.fna
# Create FASTA index for reference genome using Samtools
java -Xmx8G -jar picard-2.18.23-0/picard.jar CreateSequenceDictionary R=reference_genomic.fna O=reference_genomic.dict
# Create sequence dictionary for reference genome using Picard

Step 2: BWA-MEM alignment

for name in $(cat sample_namelist.txt); do 
  bwa mem -t $SLURM_NTASKS -M -R "@RG\tID:${name}\tPL:UNKNOWN\tLB:library\tSM:${name}" reference_genomic.fna ${name}_1.fq.gz ${name}_2.fq.gz > ${name}.sam;
done
# Align paired-end reads to the reference genome using BWA-MEM
# -t: Number of threads to use
# -M: Mark shorter split hits as secondary
# -R: Read group header line, including ID (read group ID), PL (platform), LB (library), and SM (sample)

Step 3: Converting SAM to BAM

for name in $(cat namelist.txt); do 
  samtools view -@ $SLURM_NTASKS -bS ${name}.sam -o ${name}.bam;
done
# -@: Number of threads to use
# -bS: Input format is SAM, output BAM
# -o: Output file name
# Convert SAM files to BAM files for each sample

Step 4: Sorting BAM files

for name in $(cat namelist.txt); do 
  samtools sort -@ $SLURM_NTASKS -m 3G -O bam -o ${name}.sorted.bam ${name}.bam;
done
# Sort BAM files for each sample
# -m: Maximum memory per thread (3G in this case)
# -O: Output format (bam)
# -o: Output sorted BAM file

Step 5: Marking duplicates

for name in $(cat namelist.txt); do 
  java -Xmx8g -jar picard.jar MarkDuplicates I=${name}.sorted.bam O=${name}.sorted.markdup.bam M=${name}.markdup_metrics.cvs;
done
# I: Input BAM file
# O: Output BAM file with duplicates marked
# M: Metrics file for duplicate information
# Mark duplicate reads in sorted BAM files using Picard

Step 6: Indexing BAM Files

for name in $(cat namelist.txt); do 
  samtools index ${name}.sorted.markdup.bam;
done
# Index the BAM files with marked duplicates for each sample

Step 7: Calling variants and output GVCF

for name in $(cat namelist.txt); do 
  gatk HaplotypeCaller -R reference_genomic.fna -I ${name}.sorted.markdup.bam -O ${name}.gvcf.gz -ERC GVCF;
done
# Call variants for each sample and output in GVCF format using GATK HaplotypeCaller
# -R: Reference genome
# -I: Input BAM file
# -O: Output GVCF file
# -ERC: Mode to emit reference confidence (GVCF in this case)

Step 8: Combining GVCF files

ls *gvcf.gz > gvcf.list
gatk CombineGVCFs -R reference_genomic.fna -V gvcf.list -O combine.gvcf.gz
# Combine GVCF files from all samples into a single GVCF file

Step 9: Genotyping GVCF

gatk GenotypeGVCFs -R reference_genomic.fna -V combine.gvcf.gz -O combine.vcf.gz
# Extract genotypes from combined GVCF file

Step 10: Applying variant filtration

gatk VariantFiltration \
    -R reference_genomic.fna \
    -V combine.vcf.gz \
    -filter "QD < 2.0" --filter-name "QD2" \
    -filter "QUAL < 30.0" --filter-name "QUAL30" \
    -filter "SOR > 3.0" --filter-name "SOR3" \
    -filter "FS > 60.0" --filter-name "FS60" \
    -filter "MQ < 40.0" --filter-name "MQ40" \
    -filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5" \
    -filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8" \
    -O combine_filtered.vcf.gz
# Apply filters to variants using GATK VariantFiltration

Step 11: Selecting SNP variants

gatk SelectVariants \
    -R reference_genomic.fna \
    -V combine_filtered.vcf.gz \
    --select-type-to-include SNP \
    -O combine_filtered_selected.vcf.gz
# Select only SNP variants from the filtered VCF file

Step 12: Filtering variants using VCFtools

conda install bioconda::vcftools
# Install VCFtools for further filtering

# Option 1: General filtering
vcftools --gzvcf combine_filtered_selected.vcf.gz --max-missing 0.5 --mac 3 --minQ 30 --recode --recode-INFO-all --out g5mac3

# Option 2: Minimum depth filtering
vcftools --vcf g5mac3.recode.vcf --minDP 3 --recode --recode-INFO-all --maf 0.05 --hwe 0.001 --out g5mac3dp3

8.2 Dsuite

Dsuite is a powerful tool for analyzing gene flow, especially suited for detecting introgression events between species or populations. It works by assessing shared genetic variation through the computation of D-statistics (also known as ABBA-BABA tests) and other related statistics. This makes it effective for pinpointing gene flow events across evolutionary lineages and studying hybridization. Key Features of Dsuite:

  • D-statistics (ABBA-BABA test): This statistic compares derived allele patterns across different populations to detect imbalances in allele sharing, which indicates gene flow between certain lineages. Dsuite automates the computation of D-statistics across all possible trios in a phylogenetic tree, streamlining the analysis of large datasets.

  • fd Statistic: Dsuite also calculates the fd statistic, a refined measure of gene flow that accounts for population structure. It is useful for distinguishing between genuine introgression signals and random genetic drift, which can often skew D-statistic results.

  • Tree-based Testing with Dtrios: Dsuite allows users to input a Newick-formatted phylogenetic tree to analyze introgression patterns among specified branches. The tree structure provides a framework to identify gene flow between different groups, making it more efficient than analyzing each trio separately.

  • Branch-specific Gene Flow (Fbranch): With the Fbranch functionality, Dsuite can highlight branches in a phylogenetic tree that show significant gene flow, providing detailed insights into lineage-specific introgression events.

  • Ease of Visualization: Dsuite produces outputs that can be easily visualized, enabling researchers to interpret introgression patterns across the tree. This visualization is useful for summarizing complex introgression patterns across species or population networks.

Step 1: Installtion

git clone https://github.com/millanek/Dsuite.git
cd Dsuite
make

Step 2: Running

Dsuite Dtrios combine.vcf.gz set.txt -o sample
Dsuite Dtrios g5mac3.recode.vcf set.txt -o sample -t species.newick
# set.txt: The left side contains species names, and the right side lists `Species1, Species2, ..., SpeciesN, Outgroup`
# -t: Species tree file is optional.

Step 3: P-value adjustment

install.packages('p.adjust')
install.packages("xlsx")
library(xlsx)

BBAA_P <- read.table("sample_BBAA.txt", as.is=T, header=T)
test <- p.adjust(BBAA_P$p.value, method="BH")
write.xlsx(test, file="BBAA_P.xlsx")
# Paste the resulting adjusted p-values back into the `sample_BBAA.txt` file.

Step 4: Plotting

cut -f 2 set.txt | uniq | head -n 5 > plot_order.txt
# Sorting
# -n: represents the number of lines in `set.txt`.

conda install conda-forge::ruby
wget https://github.com/mmatschiner/tutorials/archive/refs/heads/master.zip
unzip master.zip

ruby plot_d.rb BBAA_P.txt plot_order.txt 0.7 ade_BBAA_D_p.svg
ruby plot_f4ratio.rb BBAA_P.txt plot_order.txt 0.2 ade_BBAA_f4ratio_p.svg

Step 5: f-branch plotting

cd Dsuite/utils
python3 setup.py install --user --prefix=

Dsuite Fbranch species.newick sample_tree.txt > fbranch.out
# species.newick: Specify the species tree file.
# sample_tree.txt: Use the result file `sample_tree.txt` generated from `Dsuite Dtrios` with `-t` specified tree topology as input.
# Result: `fbranch.out`, f-branch statistics saved in matrix format.

python Dsuite/utils/dtools.py fbranch.out species.newick --outgroup Outgroup --use_distances --dpi 1200 --tree-label-size 30
# Arguments: `fbranch` file, species tree file, specified outgroup (must match the name in the tree), resolution (`dpi`), node label size.