Metazoa Dataset - Pas-Kapli/CoME-Tutorials GitHub Wiki
Previous steps
The transcriptomes were assembled de novo from the retrieved transcriptome samples using the Trinity pipeline (Grabherr et al., 2011). Protein-coding regions were extracted from each assembly using TransDecoder (Haas et al., 2013, https://github.com/TransDecoder) in three steps: i) Open Reading Frames (ORFs) of at least 100 amino acids were predicted; ii) the ORFs were searched against the Pfam (Finn et al., 2016) and UniProt (The UniProt Consortium, 2017) databases; and iii) the likely coding sequences (CDS) were predicted, with peptides that had either a BLAST or a Pfam hit retained in the final CDS set.
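The retention rule in step iii can be illustrated with a small Python sketch. This is not TransDecoder's code: the `Orf` class, its `predicted_coding` flag (standing in for the outcome of the coding-sequence prediction), and the example identifiers are all hypothetical. It only shows the logic described above, namely that ORFs shorter than 100 amino acids are never considered and that a BLAST or Pfam hit is enough to keep an ORF.

```python
from dataclasses import dataclass

MIN_ORF_LENGTH_AA = 100  # minimum ORF length stated in the text


@dataclass(frozen=True)
class Orf:
    orf_id: str
    length_aa: int
    predicted_coding: bool  # hypothetical flag: result of the coding prediction


def select_cds(orfs, blast_hit_ids, pfam_hit_ids):
    """Return the IDs of ORFs kept in the final CDS set."""
    kept = []
    for orf in orfs:
        if orf.length_aa < MIN_ORF_LENGTH_AA:
            continue  # too short to be predicted as an ORF in step i
        has_hit = orf.orf_id in blast_hit_ids or orf.orf_id in pfam_hit_ids
        if orf.predicted_coding or has_hit:
            kept.append(orf.orf_id)
    return kept


# Made-up example data, for illustration only.
example_orfs = [
    Orf("orf_a", 150, True),   # predicted as coding
    Orf("orf_b", 120, False),  # rescued by a BLAST hit
    Orf("orf_c", 130, False),  # no hit, not predicted as coding
    Orf("orf_d", 80, True),    # below the length threshold
]
print(select_cds(example_orfs, blast_hit_ids={"orf_b"}, pfam_hit_ids=set()))
```

In this toy input, `orf_b` is kept only because of its BLAST hit, which is the behaviour the text describes for peptides with homology support.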
To identify orthologous genes among the 98 transcriptome samples, we used the "Forty-Two" pipeline (available as a Bitbucket repository: https://bitbucket.org/dbaurain/42/). In a single run, the pipeline attempts to enrich a given multiple sequence alignment (MSA) with the corresponding orthologous sequences from one or more transcriptome samples. Forty-Two uses a multiple Best Reciprocal Hit (multi-BRH) strategy.
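For readers unfamiliar with the idea, the following Python sketch shows a plain pairwise best reciprocal hit test. It is a simplified illustration and not Forty-Two's multi-BRH implementation, which works against several reference sets. The function names, the score dictionaries, and the sequence identifiers are assumptions made for the example.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` maps (query, subject) pairs to a similarity score,
    for example a bit score from a sequence search.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Made-up scores, for illustration only.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b2"): 90, ("a3", "b1"): 150}
b_vs_a = {("b1", "a1"): 210, ("b1", "a3"): 100, ("b2", "a2"): 95}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here `a3` has `b1` as its best hit, but the best hit of `b1` is `a1`, so the pair (`a3`, `b1`) fails the reciprocity test and is not reported as orthologous.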
Each orthologous group was aligned with MAFFT using the L-INS-i strategy (mafft-linsi).
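A minimal sketch of how such an alignment step could be scripted in Python is given below. It assumes that MAFFT is installed and on the `PATH`, and that each orthologous group is stored as one unaligned FASTA file with a `.fasta` extension; the directory layout, file naming, and function names are hypothetical. The options `--localpair --maxiterate 1000` are the ones that the `mafft-linsi` shortcut stands for.

```python
import subprocess
from pathlib import Path


def linsi_command(input_fasta):
    """Build the MAFFT command line equivalent to `mafft-linsi`."""
    return ["mafft", "--localpair", "--maxiterate", "1000", str(input_fasta)]


def align_group(input_fasta, output_fasta):
    """Align one orthologous group; MAFFT writes the alignment to stdout."""
    with open(output_fasta, "w") as handle:
        subprocess.run(linsi_command(input_fasta), stdout=handle, check=True)


def align_all(input_dir, output_dir):
    """Align every *.fasta file in input_dir (assumed: one file per group)."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(input_dir).glob("*.fasta")):
        align_group(fasta, out_dir / f"{fasta.stem}.aln.fasta")
```

Calling `align_all("orthogroups", "alignments")`, with both directory names chosen only for illustration, would write one aligned file per group. L-INS-i is among the more accurate but slower MAFFT strategies, so running it group by group keeps each job small.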