9.1_CAFE - bennestor/hakea_genome GitHub Wiki

CAFE

This analysis was done using the genome_v1 sequence of H. prostrata

Install

   #Conda install cafe v5.0.0
   #module load gcc/4.9.3
   #module swap PrgEnv-cray/5.2.82 PrgEnv-gnu/5.2.82

Prepare input files for CAFE from Orthofinder

From CAFE github https://github.com/harish0201/Analyses_Pipelines/blob/main/7.CAFE.sh

   #Edit genecounts to CAFE format
   awk -F'\t' '{print "(null)\t"$0}' Orthogroups.GeneCount.tsv > tmp.tsv 
   awk -F'\t' '{$NF=""; print $0}' tmp.tsv | rev | sed 's/^\s*//g' | rev | tr ' ' '\t' > cafe_input.txt 
   
   #Change header of first column in cafe_input.txt to 'Desc' 
   
   #Filter out gene families with > 100 gene copies to reduce variance
   python ../0_programs/CAFE5/tutorial/clade_and_size_filter.py -i cafe_input.txt -o filtered_cafe_input.txt -s

Run CAFE and try out different parameters From CAFE github https://github.com/hahnlab/CAFE5/blob/master/docs/tutorial/tutorial.md

    #Optional multiple arguments for different k values:
    -k 1 -o 5_cafe_k_out/k1_1
    …
    -k 10 -o 5_cafe_k_out/k10_3
    …
    -p -k 1 -o 5_cafe_k_out/k1_p_1
    …
    -p -k 9 -o 5_cafe_k_out/k9_p_2
   
   #Count number of significant families to identify best model
   for file in */*_family_results.txt; do echo $file; cat $file | awk '$3 ~ /y/' | wc -l; printf "\n" ; done
   
   #Based on model likelihood and convergence of number of significant families and lambda, k2 appears to fit my data the best
   

Add error model estimation

   #Rerun CAFE with k2 lambda 
   cafe5 -t ../../1_inputs/SpeciesTree_rooted.txt.ultrametric.tre -i ../../1_inputs/filtered_cafe_input.txt -ek2_e/results/Base_error_model.txt -l 0.0065200039173412 -o k2_e_1 
   
   #k2 with error has highest likelihood

Run CAFE on large gene families that were initially filtered out

   cafe5 -t ../../1_inputs/SpeciesTree_rooted.txt.ultrametric.tre -i ../../1_inputs/large_filtered_cafe_input.txt -l 0.0065200039173412 -o k2_e_large_filtered --cores 16 
   #likelihood = 2087, 8 significantly changed families 
   #Hakea increase = 4, decrease = 2 

Final results:

Model Number of significant families Hakea increase Hakea decrease Model likelihood Lambda
k2_error 1,713 2,770 3,333 303,798 11,713,575 0.0065200039173412

Get IDs of Hakea genes in expanded and contracted orthogroups

   #Append ids of large filtered families 
   cat Base_change.tab | cut -f 1,5 | grep "+" | grep -v "+0" | cut -f 1 >> ../k2_e_1/expanded_orthogroups.ids 
   cat Base_change.tab | cut -f 1,5 | grep "\-" | cut -f 1 >>../k2_e_1/contracted_orthogroups.ids

I used these ID lists for GO term enrichment.

⚠️ **GitHub.com Fallback** ⚠️