9.1_CAFE - bennestor/hakea_genome GitHub Wiki
This analysis was done using the genome_v1 sequence of H. prostrata
Install
#Conda install cafe v5.0.0 #module load gcc/4.9.3 #module swap PrgEnv-cray/5.2.82 PrgEnv-gnu/5.2.82
Prepare input files for CAFE from Orthofinder
From CAFE github https://github.com/harish0201/Analyses_Pipelines/blob/main/7.CAFE.sh
#Edit genecounts to CAFE format awk -F'\t' '{print "(null)\t"$0}' Orthogroups.GeneCount.tsv > tmp.tsv awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > cafe_input.txt #Change header of first column in cafe_input.txt to 'Desc' #Filter out gene families with > 100 gene copies to reduce variance python ../0_programs/CAFE5/tutorial/clade_and_size_filter.py -i cafe_input.txt -o filtered_cafe_input.txt -s
Run CAFE and try out different parameters From CAFE github https://github.com/hahnlab/CAFE5/blob/master/docs/tutorial/tutorial.md
#Optional multiple arguments for different k values: -k 1 -o 5_cafe_k_out/k1_1 … -k 10 -o 5_cafe_k_out/k10_3 … -p -k 1 -o 5_cafe_k_out/k1_p_1 … -p -k 9 -o 5_cafe_k_out/k9_p_2 #Count number of significant families to identify best model for file in */*_family_results.txt; do echo $file; cat $file | awk '$3 ~ /y/' | wc -l; printf "\n" ; done #Based on model likelihood and convergence of number of significant families and lambda, k2 appears to fit my data the best
Add error model estimation
#Rerun CAFE with k2 lambda cafe5 -t ../../1_inputs/SpeciesTree_rooted.txt.ultrametric.tre -i ../../1_inputs/filtered_cafe_input.txt -ek2_e/results/Base_error_model.txt -l 0.0065200039173412 -o k2_e_1 #k2 with error has highest likelihood
Run CAFE on large gene families that were initially filtered out
cafe5 -t ../../1_inputs/SpeciesTree_rooted.txt.ultrametric.tre -i ../../1_inputs/large_filtered_cafe_input.txt -l 0.0065200039173412 -o k2_e_large_filtered --cores 16 #likelihood = 2087, 8 significantly changed families #Hakea increase = 4, decrease = 2
Final results:
Model | Number of significant families | Hakea increase | Hakea decrease | Model likelihood | Lambda | |
---|---|---|---|---|---|---|
k2_error | 1,713 | 2,770 | 3,333 | 303,798 | 11,713,575 | 0.0065200039173412 |
Get IDs of Hakea genes in expanded and contracted orthogroups
#Append ids of large filtered families cat Base_change.tab | cut -f 1,5 | grep "+" | grep -v "+0" | cut -f 1 >> ../k2_e_1/expanded_orthogroups.ids cat Base_change.tab | cut -f 1,5 | grep "\-" | cut -f 1 >>../k2_e_1/contracted_orthogroups.ids
I used these ID lists for GO term enrichment.