Transcriptome Curation - a-lud/nf-pipelines GitHub Wiki
This is a somewhat niche pipeline designed to extract non-redundant, high-quality transcripts from Trinity
transcriptome assemblies. That last point is vitally important: the pipeline only works with Trinity-assembled transcriptomes.
Arguments
The pipeline accepts four arguments.
--pipeline transcurate: Run the transcriptome curation pipeline
--cdhit_pid: Percentage identity threshold passed to CD-HIT
--database_dir: Directory path that contains the uniprot_sprot.fasta and Pfam-A.hmm databases
--completeORFs: Flag that indicates whether complete CDS sequences should be filtered for using gffread
Pipeline Overview
Database Download
The UniProt and Pfam databases will be downloaded if the --database_dir argument is left empty. These are used to screen the longest ORF sequences from TransDecoder.
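If you supply your own directory, it can help to confirm that both files are present before launching a run. The following is a minimal sketch of such a check, not code from the pipeline itself. The function name missing_databases is hypothetical, and the only details taken from this page are the two file names.

```python
from pathlib import Path

# File names taken from the --database_dir description on this page.
EXPECTED_DATABASES = ("uniprot_sprot.fasta", "Pfam-A.hmm")


def missing_databases(database_dir):
    """Return the expected database files that are absent from database_dir.

    An empty or unset directory means nothing was supplied, so every
    expected file is reported as missing.
    """
    if not database_dir:
        return list(EXPECTED_DATABASES)
    root = Path(database_dir)
    return [name for name in EXPECTED_DATABASES if not (root / name).is_file()]
```

An empty list means the directory looks usable as a --database_dir value.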
CD-HIT
The software CD-HIT is used to cluster highly similar assembled transcripts. The parameters used in the command are taken from the
Trinity wiki section "Too many transcripts". I relaxed the percentage identity threshold from 0.98 to 0.95, but you can provide whatever value you like using the
--cdhit_pid argument.
The output from CD-HIT is a set of non-redundant transcript sequences in FASTA format. A .clstr file is also generated, which describes the redundant clusters.
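To see which transcripts were collapsed into each representative, the cluster file can be parsed with a few lines of Python. This is a rough sketch that assumes the usual CD-HIT cluster layout: a ">Cluster N" header followed by member lines, with the representative marked by a trailing "*". The function name parse_clstr and the example transcript IDs are hypothetical.

```python
import re

# Captures the sequence name between '>' and the trailing '...' on a member line.
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Map each representative sequence ID to all sequence IDs in its cluster."""
    clusters = {}
    members, representative = [], None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster begins: store the previous one, if any.
            if representative is not None:
                clusters[representative] = members
            members, representative = [], None
            continue
        match = MEMBER_RE.search(line)
        if match is None:
            continue
        members.append(match.group(1))
        if line.endswith("*"):
            # CD-HIT marks the representative sequence with a trailing '*'.
            representative = match.group(1)
    if representative is not None:
        clusters[representative] = members
    return clusters


# Made-up example lines in the assumed layout.
example = [
    ">Cluster 0",
    "0\t1500nt, >TRINITY_DN1_c0_g1_i1... *",
    "1\t1400nt, >TRINITY_DN1_c0_g1_i2... at +/98.50%",
    ">Cluster 1",
    "0\t900nt, >TRINITY_DN2_c0_g1_i1... *",
]
clusters = parse_clstr(example)
```

Clusters with more than one member are the ones where redundant transcripts were removed.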
TransDecoder: LongOrfs
This first step obtains ORFs that are at least 100 amino acids in length. This step is relatively fast.
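The idea of a minimum-length ORF scan can be illustrated with a short sketch. This is a deliberate simplification and not TransDecoder's algorithm: it assumes the standard stop codons, looks only at the forward strand, and reports only ORFs that begin with ATG and end at a stop codon. The function name find_orfs and its min_aa parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_aa=100):
    """Return (start, end) coordinates of forward-strand ORFs of at least min_aa codons.

    Coordinates are 0-based and end-exclusive, and the span includes the
    stop codon. The amino-acid count excludes the stop codon.
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(sequence) - 2, 3):
            codon = sequence[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_aa:
                    orfs.append((start, pos + 3))
                start = None  # look for the next ORF in this frame
    return orfs
```

With the default min_aa=100, an ORF encoding 99 amino acids is discarded while one encoding 100 is kept.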
Homology Search
The longest ORFs obtained in the previous step are screened against uniprot_sprot and Pfam-A. These two processes use the database files provided by the user, or those that are downloaded automatically. The Pfam search can take a while, whereas the BLAST search is usually quick.
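If you want to check how many ORFs received homology support, the BLAST results can be summarised as follows. This sketch assumes the BLAST output was written in the standard 12-column tabular format (-outfmt 6), where the first column is the query ID and the eleventh is the e-value. Both that format and the 1e-5 cutoff are assumptions, not settings confirmed by this page, and the function name orfs_with_hits is hypothetical.

```python
def orfs_with_hits(blast_lines, max_evalue=1e-5):
    """Return the set of query IDs with at least one hit at or below max_evalue.

    Expects tab-separated lines in the 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    supported = set()
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            supported.add(query_id)
    return supported


# Made-up example hits in the assumed layout.
example_hits = [
    "TRINITY_DN1_c0_g1_i1.p1\tsp|P12345|AATM_RABIT\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "TRINITY_DN2_c0_g1_i1.p1\tsp|P12345|AATM_RABIT\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
]
supported = orfs_with_hits(example_hits)
```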
TransDecoder: Predict
The final step of the TransDecoder pipeline integrates the homology information to extract likely coding sequences from the assembled transcripts. Each predicted ORF is assigned a type, including (but not limited to) complete, 5prime_partial and 3prime_partial. Essentially, the output of this step is a high-confidence set of CDS sequences.
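A quick way to see how the predictions are distributed across types is to tally the labels in the output FASTA headers. The sketch below assumes each header carries a "type:<label>" field (for example type:complete or type:5prime_partial), which is how TransDecoder output is commonly laid out. Check your own files before relying on it. The function name count_orf_types and the example headers are hypothetical.

```python
import re
from collections import Counter

# Captures the label that follows 'type:' in a FASTA header.
TYPE_RE = re.compile(r"\btype:(\S+)")


def count_orf_types(fasta_lines):
    """Count ORF type labels found in FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            match = TYPE_RE.search(line)
            # Headers without a type field are counted as 'unknown'.
            counts[match.group(1) if match else "unknown"] += 1
    return counts


# Made-up example records in the assumed header layout.
example_fasta = [
    ">TRINITY_DN1_c0_g1_i1.p1 ORF type:complete len:175 (+)",
    "ATGGCTTAA",
    ">TRINITY_DN2_c0_g1_i1.p1 ORF type:5prime_partial len:120 (+)",
    "GCTGCTTAA",
    ">TRINITY_DN3_c0_g1_i1.p1 ORF type:complete len:300 (-)",
    "ATGGCTTGA",
]
type_counts = count_orf_types(example_fasta)
```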
Complete CDS sequences
The final, optional step keeps only CDS sequences that are complete, that is, any sequence that has a start and stop codon with no internal stops. This can be viewed as the final, high-quality output.
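The completeness criterion described above can be written as a small check. This is only an illustration of the rule, not how gffread implements it. It assumes an ATG start codon and the standard stop codons, and the function name is_complete_cds is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(cds):
    """Return True if cds starts with ATG, ends with a stop codon,
    has a length divisible by three, and contains no internal stop codon."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    return (
        codons[0] == "ATG"
        and codons[-1] in STOP_CODONS
        and not any(codon in STOP_CODONS for codon in codons[1:-1])
    )
```

A sequence such as ATGGCTTAA passes, whereas one that is missing the start codon, is missing the terminal stop, or carries an in-frame stop before the end is rejected.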