Transcriptome Curation - a-lud/nf-pipelines GitHub Wiki

This is a somewhat niche pipeline for extracting non-redundant, high-quality transcripts from Trinity transcriptome assemblies. That last point is vitally important: the pipeline only works with Trinity-assembled transcriptomes.

Arguments

The pipeline is selected with --pipeline; the other three arguments are optional.

--pipeline transcurate: Run the transcriptome curation pipeline
--cdhit_pid: Percentage identity threshold passed to CD-HIT
--database_dir: Directory path that contains the uniprot_sprot.fasta and Pfam-A.hmm databases
--completeORFs: Flag indicating that only complete CDS sequences should be kept (filtered using gffread)

Pipeline Overview

Database Download

The UniProt and Pfam databases are downloaded automatically if the --database_dir argument is not provided. They are used to screen the longest ORF sequences from TransDecoder.

CD-HIT

CD-HIT is used to cluster highly similar assembled transcripts. The parameters used in the command are taken from the Trinity wiki entry on "Too many transcripts". I relaxed the percentage identity threshold from 0.98 to 0.95; however, you can provide whatever value you like using the --cdhit_pid argument.

The output from CD-HIT is a set of non-redundant transcript sequences in FASTA format. A clstr file is also generated, which describes the redundant clusters.
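To see which transcripts were collapsed into each representative, the clstr file can be parsed with a few lines of Python. This is only a minimal sketch: it assumes the usual CD-HIT layout, in which each cluster starts with a ">Cluster N" line and the representative member is marked with a trailing "*". The function name parse_clstr and the example transcript IDs are hypothetical and not part of the pipeline.

```python
import re

# Assumed member-line layout: "0<TAB>1500nt, >TRINITY_DN1_c0_g1_i1... *"
# Redundant members end with something like "at +/98.50%" instead of "*".
MEMBER = re.compile(r">(\S+?)\.\.\.\s+(\*|at\s.+)$")


def parse_clstr(lines):
    """Return one dict per cluster with its representative and redundant IDs."""
    clusters = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster begins; collect its members until the next header.
            current = {"representative": None, "redundant": []}
            clusters.append(current)
            continue
        match = MEMBER.search(line)
        if match is None or current is None:
            continue  # skip lines that do not follow the assumed layout
        seq_id, status = match.groups()
        if status == "*":
            current["representative"] = seq_id
        else:
            current["redundant"].append(seq_id)
    return clusters


# Hypothetical clstr content, for illustration only.
example = """>Cluster 0
0\t1500nt, >TRINITY_DN1_c0_g1_i1... *
1\t1200nt, >TRINITY_DN1_c0_g1_i2... at +/98.50%
>Cluster 1
0\t900nt, >TRINITY_DN2_c0_g1_i1... *
""".splitlines()

for cluster in parse_clstr(example):
    print(cluster["representative"], len(cluster["redundant"]))
```

For a real run, pass an open file handle on the clstr file in place of the example lines.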

TransDecoder: LongOrfs

This first step obtains ORFs that are at least 100 amino acids in length. This step is relatively fast.

Homology Search

The longest ORFs obtained in the previous step are screened against UniProt (uniprot_sprot) and Pfam-A. These two processes use the database files provided by the user or downloaded automatically. The Pfam search can take a while, whereas the BLAST search is usually quick.

TransDecoder: Predict

The final step of the TransDecoder pipeline integrates the homology information to extract likely coding sequences from the assembled transcripts. Each predicted sequence is assigned one of several types, including (but not limited to) complete, 5-prime-truncated and 3-prime-truncated. Essentially, the output of this step is a high-confidence set of CDS sequences.
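A quick way to see how the predictions are distributed across these types is to tally them from the FASTA headers. The sketch below assumes each header carries a "type:<label>" field; the exact header layout and label spellings depend on the TransDecoder version, so treat the pattern, the function name count_orf_types and the example headers as assumptions to check against your own output.

```python
import re
from collections import Counter

# Assumption: each FASTA header contains a field such as "type:complete".
ORF_TYPE = re.compile(r"\btype:(\S+)")


def count_orf_types(fasta_lines):
    """Tally the ORF type label found on each FASTA header line."""
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence lines carry no type information
        match = ORF_TYPE.search(line)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Hypothetical headers, for illustration only.
example = [
    ">TRINITY_DN1_c0_g1_i1.p1 type:complete len:120",
    "MKT",
    ">TRINITY_DN2_c0_g1_i1.p1 type:5prime_partial len:101",
    "KTA",
    ">TRINITY_DN3_c0_g1_i1.p1 type:complete len:250",
    "MAA",
]

for orf_type, n in count_orf_types(example).most_common():
    print(orf_type, n)
```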

Complete CDS sequences

The final, optional step keeps only complete CDS sequences, that is, sequences with a start codon and a stop codon and no internal stops. This can be viewed as the final, high-quality output.
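The completeness criterion can be illustrated with a small check in Python. This is a sketch of the definition given above, not of how gffread implements its filter: it assumes the standard genetic code (ATG start; TAA, TAG and TGA stops) and a sequence that is already in frame. The function name is_complete_cds is hypothetical.

```python
# Assumption: standard genetic code stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_complete_cds(seq, start_codons=("ATG",)):
    """True if seq starts with a start codon, ends with a stop codon,
    has a length divisible by three and contains no internal stop codon."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0:
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    has_start = codons[0] in start_codons
    has_stop = codons[-1] in STOP_CODONS
    internal_stop = any(codon in STOP_CODONS for codon in codons[1:-1])
    return has_start and has_stop and not internal_stop


print(is_complete_cds("ATGAAATAA"))     # start, one codon, stop
print(is_complete_cds("ATGTAAAAATGA"))  # contains an internal TAA
print(is_complete_cds("AAAAAATAA"))     # no start codon
```

The first call returns True and the other two return False, since one has an internal stop and the other lacks a start codon.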