10 Polyploidy Analysis

10.1 WGD Mapping

This approach maps the duplication nodes observed in gene trees onto the corresponding nodes in the species tree. A high concentration of gene duplications mapped to the same branch of the species tree may indicate a whole-genome duplication (WGD) event. Methods typically reconcile gene trees with the species tree by minimizing duplications and losses, allowing researchers to infer ancestral gene copies and duplication events.
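To make the mapping concrete, here is a minimal R sketch using ape (toy trees, not pipeline data): given the set of species descending from a duplication node in a gene tree, the duplication is assigned to the most recent common ancestor (MRCA) of those species in the species tree.

library(ape)

# Toy species tree
sptree <- read.tree(text = "((A,B),(C,D));")

# Species descending from one duplication node in a reconciled gene tree
dup_taxa <- c("A", "B")

# The duplication maps to the MRCA of those species in the species tree;
# many duplications mapping to the same node suggest a WGD on that branch
mrca_node <- getMRCA(sptree, dup_taxa)
mrca_node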

Step 1: Extracting rooted orthogroups from homolog groups

python extract_clades.py 6_treeshrink tt orthogroups 4 taxon_table.txt out
# 6_treeshrink:     input directory containing the trees
# tt:               tree file suffix
# orthogroups:      output directory for the extracted orthogroups
# 4:                minimum number of ingroup taxa required
# taxon_table.txt:  list of taxa containing both ingroups and outgroups
# out:              output file suffix

This script extracts rooted orthogroups from the homolog groups using the ingroup/outgroup assignments in the taxon table; orthogroups with fewer than the specified minimum number of ingroup taxa are discarded.
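As a quick check of the extracted orthogroups, the sketch below counts tips per tree with ape. It assumes the trees were written to the orthogroups directory with the "out" suffix passed above; adjust the pattern to your actual file names.

library(ape)

files <- list.files("orthogroups", pattern = "\\.out$", full.names = TRUE)
ntips <- sapply(files, function(f) Ntip(read.tree(f)))

summary(ntips)   # distribution of orthogroup sizes
any(ntips < 4)   # should be FALSE given the 4-ingroup-taxon minimum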

Step 2: Mapping gene duplications onto the species tree

python map_dups_mrca.py orthogroups speciestree.rooted.newick 4 out
# orthogroups:                directory containing the output from the previous step
# speciestree.rooted.newick:  rooted species tree in Newick format
# 4:                          minimum number of taxa in each orthogroup
# out:                        output file suffix

This script maps each gene duplication event onto the branch of the species tree subtending the MRCA of the duplicated clade, using the orthogroups generated in Step 1.

Step 3: Plotting duplication percentages across branches

python plot_branch_labels.py dup_perc_filter10_global.out
# dup_perc_filter10_global.out: output file from Step 2 containing per-branch duplication percentages

This script generates branch labels giving the percentage of gene duplications on each branch, ready for plotting. The resulting labels show how duplication events are distributed across the tree.
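If the branch labels are written as a node-labeled Newick tree (an assumption about the script's output format; the file name below is hypothetical), they can be drawn directly on the species tree with ape:

library(ape)

# Hypothetical file name; substitute the labeled tree produced by the script
tr <- read.tree("dup_perc_filter10_global.out.tre")
plot(tr, cex = 0.6)
nodelabels(tr$node.label, frame = "none", adj = c(1.1, -0.3), cex = 0.5)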

Step 4: Plotting using R

# Read the per-branch duplication percentages
a <- read.table('dup_perc_filter10_global.out.branch_labels')
pdf(file='perc_dup_labels.pdf', width=4, height=4)
hist(a[,1], breaks=60, col='grey', xlab='Percent of Gene Duplication',
     ylab='Number of Branches', main='', axes=FALSE, xlim=c(0,0.6))
axis(1, pos=0)  # x-axis anchored at y = 0
axis(2, pos=0)  # y-axis anchored at x = 0
dev.off()

This R script draws a histogram of the per-branch gene duplication percentages, showing how duplication events are distributed across the dataset; branches in the upper tail are candidate WGD placements.

10.2 GRAMPA Analysis

GRAMPA reconstructs phylogenies while accommodating the effects of whole-genome duplications (WGDs) and other gene family expansions, which often confound standard tree-based analyses. It reconciles gene trees against multi-labeled species trees (MUL-trees), in which a polyploid lineage appears more than once, so that homology relationships, including duplications and losses arising from polyploidy, can be scored explicitly; the MUL-tree with the lowest total reconciliation score indicates the most likely placement and mode of a WGD.
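For intuition, a MUL-tree is ordinary Newick in which a lineage may appear more than once; ape will read and plot one (toy example, not project data):

library(ape)

# "B" appears twice, once with each putative parental lineage, as expected
# for an allopolyploid whose subgenomes trace to ancestors of A and of C
mul <- read.tree(text = "((A,B),(C,B));")
plot(mul)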

grampa.py -s species.tre -g gene_trees.txt -o outputdir -f out -p 10 -v 0
# -s: Species tree file in Newick format.
# -g: File containing gene trees.
# -o: Output directory for results.
# -f: Output file prefix.
# -p: Number of parallel processes.
# -v: Verbosity level.

10.3 nQuack Analysis

nQuack is an R package for preparing genomic data, fitting statistical mixture models, and interpreting the results to determine the ploidy level of each sample.

Step 1: Installation

# In R: install nQuack from GitHub
install.packages("devtools")
devtools::install_github("mgaynor1/nQuack")

# In the shell: link a samtools executable into a local bin/ directory;
# nQuack calls samtools during data preparation
mkdir bin
cd bin
ln -s /apps/samtools/1.15/bin/samtools samtools
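Before running prepare_data(), a quick sanity check from R (a sketch, assuming R's working directory is the one containing the bin/ created above) confirms the link resolves to a working samtools:

file.exists("bin/samtools")                              # TRUE if the link exists
system2("bin/samtools", "--version", stdout = TRUE)[1]   # prints the samtools version line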

Step 2: Data preparation

# Prepare data
# Set input and output paths
inpath <- "../inst/extdata/01_raw/"
outpath <- "../inst/extdata/02_prepared/"

# List the BAM files in inpath and strip their extension
filelist <- list.files(path = inpath, pattern = "\\.bam$")
filelist <- gsub("\\.bam$", "", filelist)

for (i in 1:length(filelist)) {
  prepare_data(filelist[i], inpath, outpath)
}

# Process data
inpathtext <- "../inst/extdata/02_prepared/"
newfilelist <- list.files(path = inpathtext, pattern = "\\.txt$")

for (i in 1:length(newfilelist)) {
  samp <- newfilelist[i]
  temp <- process_data(paste0(inpathtext, samp),
                       min.depth = 10,
                       max.depth.quantile.prob = 1,
                       error = 0.01,
                       trunc = c(0.15, 0.85))
  write.csv(temp,
            file = paste0("../inst/extdata/03_processed/", gsub("\\.txt$", "", samp), ".csv"),
            row.names = FALSE)
}
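A brief QC sketch (it makes no assumptions about the processed matrix's columns, only counting rows) confirms that each sample retained sites after depth filtering and truncation:

processed <- list.files("../inst/extdata/03_processed/", pattern = "\\.csv$", full.names = TRUE)
sapply(processed, function(f) nrow(read.csv(f)))   # sites retained per sample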

Step 3: Model inference

This step fits several candidate statistical models to the processed data and summarizes the results to determine the best fit.

# Explore all models: fit normal, beta, and beta-binomial mixtures to each sample
samples <- c("sample1", "sample2", "sample3")

for (i in 1:length(samples)) {
  temp <- as.matrix(read.csv(paste0("../inst/extdata/03_processed/", samples[i], ".csv")))
  out1 <- quackNormal(xm = temp, samplename = samples[i], cores = 10, parallel = FALSE)
  out2 <- quackBeta(xm = temp, samplename = samples[i], cores = 10, parallel = FALSE)
  out3 <- quackBetaBinom(xm = temp, samplename = samples[i], cores = 10, parallel = FALSE)
  allout <- rbind(out1, out2, out3)
  write.csv(allout,
            file = paste0("../inst/extdata/04_output/", samples[i], ".csv"),
            row.names = FALSE)
}


# Model interpretation
inpathtext <- "../inst/extdata/04_output/"
samples <- c("sample1", "sample2", "sample3")
 
for (i in 1:length(samples)) {
  temp <- read.csv(paste0(inpathtext, samples[i], ".csv"))
  summary <- quackit(model_out = temp,
                     summary_statistic = "BIC",
                     mixtures = c("diploid", "triploid", "tetraploid"))
  write.csv(summary,
            file = paste0("../inst/extdata/05_interpret/", samples[i], ".csv"),
            row.names = FALSE)
}

# Create a key of the expected ploidy for each sample
library(dplyr)
library(kableExtra)

key <- data.frame(sample = c("sample1", "sample2", "sample3"),
                  ploidal.level = c("diploid", "tetraploid", "triploid"))

# Read in quackit() output
dfs <- lapply(list.files("../inst/extdata/05_interpret/", full.names = TRUE), read.csv)
alloutput <- do.call(rbind, dfs)

# Combine with the key
alloutputcombo <- dplyr::left_join(alloutput, key)

# Check the accuracy: does the best-supported model match the expected ploidy?
alloutputcombo <- alloutputcombo %>%
                  dplyr::mutate(accuracy = ifelse(winnerBIC == ploidal.level, 1, 0))

sumcheck <- alloutputcombo %>%
            group_by(Distribution, Type) %>%
            summarize(total = n(), correct = sum(accuracy))

kbl(sumcheck) %>%
  kable_paper("hover", full_width = FALSE)
            
This step runs the best-performing model on each sample, followed by bootstrapping to validate the result.

# Running only the best model
samples <- c("sample1", "sample2", "sample3")
out <- list()

for (i in 1:length(samples)) {
  temp <- as.matrix(read.csv(paste0("../inst/extdata/03_processed/", samples[i], ".csv")))
  out[[i]] <- bestquack(temp,
                        distribution = "normal",
                        type = "fixed",
                        uniform = 1,
                        mixtures = c("diploid", "triploid", "tetraploid"),
                        samplename = samples[i])
}

# Bootstrap replicates
samples <- c("sample1", "sample2", "sample3")
bout <- list()

for (i in 1:length(samples)) {
  temp <- as.matrix(read.csv(paste0("../inst/extdata/03_processed/", samples[i], ".csv")))
  bout[[i]] <- quackNboots(temp,
                           nboots = 100,
                           distribution = "normal",
                           type = "fixed",
                           uniform = 1,
                           mixtures = c("diploid", "triploid", "tetraploid"),
                           samplename = samples[i])
}

write.csv(bout[[1]], file = "../inst/extdata/sample1-boots.csv", row.names = FALSE)
write.csv(bout[[2]], file = "../inst/extdata/sample2-boots.csv", row.names = FALSE)
write.csv(bout[[3]], file = "../inst/extdata/sample3-boots.csv", row.names = FALSE)
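To summarize the bootstrap results, tabulate how often each mixture wins across replicates. This is a sketch under stated assumptions: it treats each row of the quackNboots() output as one replicate and reuses the winnerBIC column name from the quackit() output above, which may not match the actual column name in the bootstrap table.

boots <- read.csv("../inst/extdata/sample1-boots.csv")
table(boots$winnerBIC)   # how often each ploidy model wins across the 100 bootstraps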