4 Multi‐script Workflow for Genes Assembly and Orthology Inference - PhyloAI/Ortho2Web GitHub Wiki
Considering the limitations of the Nextflow pipeline, we proposed an alternative step-by-step workflow for phylogenomic analyses, especially for handling large datasets. This multi-script workflow still offers necessary flexibility and control, especially for large datasets.
Step 1: hybpiper_assemble
for name in $(cat namelist.txt);
do
hybpiper assemble -r ${name}_1.fq.gz ${name}_2.fq.gz -t_dna reference.fasta --cpu 50 --prefix ${name} --bwa;
done
# -r: readfiles
# -t_data: FASTA file containing DNA target sequences for each gene
# --cpu: number of CPUs
# --prefix: directory name for pipeline output
# --bwa: use BWA to search reads for hits to target
Step 2: hybpiper_summary
hybpiper stats -t_dna reference.fasta gene namelist.txt
# -t_data: FASTA file containing DNA target sequences for each gene
Step 3: hybpiper_heatmap
hybpiper recovery_heatmap seq_lengths.tsv --heatmap_filetype pdf
# seq_lengths.tsv: output from Step2
# --heatmap_filetype: file type to save the output heatmap image
Step 4: paralog_retriever
hybpiper paralog_retriever namelist.txt -t_dna reference.fasta --heatmap_filetype pdf
# -t_data: FASTA file containing DNA target sequences for each gene
# --heatmap_filetype: file type to save the output heatmap image
Below is the detailed workflow using individual scripts (also referring to Yang and Smith (2014) and Morales-Briones et al. (2022)).
Step 1: Renaming sequences
This step modifies sequence headers to include paralog information and standardizes the sequence identifiers by adding unique tags when needed. This ensures paralogs are distinguishable and allows downstream tools to process the sequences accurately.
for i in $(ls *.fasta); do
# Replace the sequence identifier to include paralog information
sed -i -E 's/(>.+)\.(.+)\s(.+)/\1@paralog_\2/' $i &&
# Keep only the first field of each line (sequence header and sequence)
awk '{print $1}' < $i > $i.fas &&
# Add ‘@unique’ to those identifiers that do not already include it.
sed -i '/>/ {/@/! s/$/@unique/}' $i.fas;
done
rename "s/_paralogs_no_chimeras.fasta.fas/.fasta/" *.fas
# Rename files to remove redundant suffixes and make them consistent
Step 2: Alignment
To align the sequences within each dataset to detect homologous regions. Multiple sequence alignment is crucial for comparing sequences across species or individuals, allowing for the accurate inference of phylogenetic relationships.
MAFFT is a fast and accurate alignment tool. It aligns sequences in parallel using multi-threading, which is key to reducing runtime for large datasets. The wrapper script automates the alignment for all input files and uses the appropriate number of threads for efficiency.
python2 paralog_scripts/mafft_wrapper.py 1_rename fasta 50 dna
# Run MAFFT for multiple sequence alignment on DNA sequences in the input directory.
# Arguments: input directory, file suffix, number of threads, sequence type (dna/aa)
Step 3: Filtering
To clean the alignments by removing poorly aligned or ambiguous regions, ensuring the quality of the data before phylogenetic analysis.
The pxclsq_wrapper.py script runs a filtering tool from the Phyx toolkit. It removes low-quality sequences or regions based on a defined threshold.
This step helps to avoid introducing errors into the phylogenetic trees due to poor-quality data.
python paralog_scripts/pxclsq_wrapper.py 2_mafft 0.1 dna
# Filter aligned sequences using the Phyx toolkit, with a threshold of 0.1.
# Arguments: input directory (aligned sequences), filtering threshold, sequence type
Step 4: Gene tree reconstruction
To infer phylogenetic trees for each gene dataset, representing evolutionary relationships among taxa.
To reduce runtime when constructing phylogenetic trees for multiple genes, we implemented parallelization strategies by splitting the genes into smaller batches and running multiple RAxML processes in parallel. This approach maximizes CPU utilization, especially when using a high-performance computing (HPC) cluster. We recommend using 4-8 threads per run to allow more parallel tasks to be submitted efficiently.
python paralog_scripts/raxml_bs_wrapper.py 3_aln-cln 4 dna
# Construct phylogenetic trees using RAxML with bootstrapping.
# Arguments: input directory (aligned cleaned sequence files), number of threads, sequence type
Step 5: Masking tips
# Rename tree file extensions from .tre to .tt to ensure compatibility with the input requirements of the next tool in the workflow.
rename 's/.tre/.tt/' *
# Mask tips in phylogenetic trees based on taxon IDs to remove undesired tips
python paralog_scripts/mask_tips_by_taxonID_transcripts.py 4_raxml 3_aln-cln y
# Arguments: directory of tree files, directory of aligned cleaned files, whether to mask paraphyletic tips (y/n)
Step 6: Removing long branches
Prune phylogenetic trees by removing long branches, which represent sequences with significant divergence. This step helps prevent distortion in tree interpretation.
TreeShrink identifies and removes outlier taxa with disproportionately long branches, potentially caused by sequencing errors or rapid evolution.
This process enhances tree accuracy by minimizing noise and allowing the analysis to focus on more typical evolutionary patterns.
python paralog_scripts/tree_shrink_wrapper.py 5_mask tt 0.01 6_treeshrink
# Prune the phylogenetic trees to remove long branches using TreeShrink with a specified threshold.
# Arguments: input directory, file suffix, threshold, output directory
Step 7-1: MO pruning
To prune paralogs and retain monophyletic orthologs based on MO strategy, ensuring that the trees accurately represent the evolutionary relationships of single-copy genes.
python paralog_scripts/prune_paralogs_MO.py 6_treeshrink tt 4 7_mo
# Prune paralogs to obtain monophyletic orthologs.
# Arguments: input homologs directory, tree file suffix, minimum taxon number, output directory
Step 7-2: MO orthologs generation
python paralog_scripts/write_ortholog_fasta_from_multiple_aln.py 1_rename 7_mo fasta tre 7_mo_fasta
# Extract FASTA sequences for monophyletic orthologs based on input alignment and tree files.
# Arguments: renamed directory (@), tree file directory, fasta file suffix, tree file suffix, output directory
Step 8-1: RT pruning
To prune paralogs and retain monophyletic orthologs based on RT strategy, ensuring that the trees accurately represent the evolutionary relationships of single-copy genes.
python paralog_scripts/prune_paralogs_RT.py 6_treeshrink tt 8_rt 4 RT_taxonIDs_table.txt
# Prune paralogs to RT orthologs, using internal and external group information.
# Arguments: input homologs directory, tree file suffix, output directory, minimum taxon number, taxon ID table
Step 8-2: RT orthologs generation
python paralog_scripts/write_ortholog_fasta_from_multiple_aln_RT.py 1_rename 8_rt fasta tre 8_rt_fasta
# Generate FASTA files for retained orthologs.
# Arguments: renamed directory (@), tree file directory, fasta file suffix, tree file suffix, output directory
Step 9-1: Filtering 1-to-1 orthologs
To prune paralogs and retain monophyletic orthologs based on 1to1 strategy, ensuring that the trees accurately represent the evolutionary relationships of single-copy genes.
python paralog_scripts/filter_1to1_orthologs.py 6_treeshrink tt 4 9_1to1
# Filter orthologs to obtain 1-to-1 relationships.
# Arguments: input homologs directory, tree file suffix, minimum taxon number, output directory
Step 9-2: 1-to-1 orthologs generation
python paralog_scripts/write_ortholog_fasta_from_multiple_aln.py 1_rename 9_1to1 fasta tre 9_1to1_fasta
# Generate FASTA sequences for filtered 1-to-1 orthologs.
# Arguments: renamed directory (@), tree file directory, fasta file suffix, tree file suffix, output directory
This approach offers flexibility by allowing individual steps to be adjusted according to specific needs, making it more adaptable for managing large-scale datasets or addressing custom requirements.