Home - Magdoll/cDNA_Cupcake GitHub Wiki

What is cDNA_Cupcake

This is a GitHub repo of (mostly independent) Python/R scripts that I personally use on a daily basis to make my life easier. Most of the code is very simple, less than 20 lines, and does some kind of basic processing such as reverse complementing a sequence.

This repo + wiki is meant to supplement the official Iso-Seq software.

How to use the scripts

The simplest way to use the scripts is to clone the GitHub repository, then add the repo path to your $PATH variable. The scripts are organized into different sub-directories (ex: sequence/, rarefaction/, etc.), so you will have to add each one individually.

git clone https://github.com/Magdoll/cDNA_Cupcake.git
export PATH=$PATH:<path_to_Cupcake>/sequence/

And so on...
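Since each sub-directory has to be added separately, the export line can also be generated rather than typed by hand. The following is a minimal sketch, not part of the repo: the helper name build_path_export is hypothetical, and it assumes that any top-level sub-directory directly containing a .py or .R file is a script directory.

```python
import os


def build_path_export(repo_root):
    """Return an `export PATH=...` line covering every script sub-directory.

    Assumption for this sketch: a top-level sub-directory counts as a script
    directory if it directly contains at least one .py or .R file.
    Returns an empty string if no such directory is found.
    """
    repo_root = os.path.abspath(repo_root)
    script_dirs = []
    for name in sorted(os.listdir(repo_root)):
        full = os.path.join(repo_root, name)
        # Skip plain files and hidden directories such as .git
        if not os.path.isdir(full) or name.startswith("."):
            continue
        if any(f.endswith((".py", ".R")) for f in os.listdir(full)):
            script_dirs.append(full)
    if not script_dirs:
        return ""
    return "export PATH=$PATH:" + ":".join(script_dirs)
```

Printing the result of build_path_export("<path_to_Cupcake>") gives a single line that can be pasted into a shell profile.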

However, if you wish to use scripts such as collapse_isoforms_by_sam.py and chain_samples.py, you will need to install Cupcake. See Cupcake: supporting scripts for Iso-Seq after clustering step.

# only if you need to use certain scripts 
python setup.py build
python setup.py install

Installing Cupcake (optional)

Most scripts run directly from the cloned repository without installation. The only exception is Cupcake: supporting scripts for Iso-Seq after clustering step, which does require compiling and installation.

Basic Nucleotide Assumption

All the scripts assume that the input/output sequences consist only of: A, T, C, G.

Other nucleotides, such as N, U, or R, might cause incorrect behavior. Use at your own risk.
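To illustrate why the four-letter assumption matters, here is a minimal sketch of a reverse-complement helper that checks its input first. It is not the code from rev_comp.py; the function names are hypothetical, and treating only upper-case letters as valid is an assumption of this sketch.

```python
VALID_BASES = frozenset("ACGT")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def is_plain_dna(seq):
    """Return True if seq consists only of upper-case A, C, G, T."""
    return set(seq) <= VALID_BASES


def reverse_complement(seq):
    """Reverse complement a sequence, rejecting anything outside A/C/G/T."""
    if not is_plain_dna(seq):
        bad = sorted(set(seq) - VALID_BASES)
        raise ValueError(f"unsupported characters: {bad}")
    # Complement each base, then reverse the string
    return seq.translate(COMPLEMENT)[::-1]
```

For example, reverse_complement("AACG") returns "CGTT", while reverse_complement("AACN") raises a ValueError instead of silently passing the N through.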

Current list of maintained scripts

Annotation and Rarefaction

make_file_for_subsampling_from_collapsed.py: Prepare file for running subsampling (rarefaction curve).
subsample.py and subsample_with_category.py: Running subsampling. Results can be plotted with Excel graphics, R, etc.

See Annotation and Rarefaction Wiki for usage details.
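As a rough picture of what a rarefaction curve measures, the sketch below repeatedly draws random subsets of reads and counts how many distinct labels (for example, isoform or gene IDs) appear at each subsample size. This is an illustration of the general idea only: the function name, its parameters, and the plain-list input are assumptions and do not reflect the input format or implementation of subsample.py.

```python
import random


def rarefaction_curve(read_labels, sizes, iterations=10, seed=0):
    """Mean number of distinct labels seen when subsampling reads.

    read_labels: list with one label (e.g. an isoform or gene ID) per read.
    sizes: subsample sizes to evaluate (capped at the number of reads).
    Returns a list of (size, mean_distinct_labels) tuples.
    """
    rng = random.Random(seed)
    curve = []
    for size in sizes:
        size = min(size, len(read_labels))
        total = 0
        for _ in range(iterations):
            # Sample without replacement and count the distinct labels
            total += len(set(rng.sample(read_labels, size)))
        curve.append((size, total / iterations))
    return curve
```

The resulting (size, mean) pairs can be written out as a table and plotted with Excel or R. A curve that flattens as the size grows suggests that additional reads are finding few new labels.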

Targeted Iso-Seq Analysis

calc_probe_hit_from_sam.py: calculate on-target rate based on FL read alignment + probe BED file.

See Targeted Iso-Seq Wiki for usage details.
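Conceptually, an on-target rate is the fraction of aligned reads that fall on a probe region. The sketch below shows that calculation on plain tuples. It is an assumption-laden illustration, not the logic of calc_probe_hit_from_sam.py: the function names are hypothetical, coordinates are taken to be 0-based and half-open as in BED, and a single overlapping base is counted as a hit.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two 0-based, half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def on_target_rate(alignments, probes):
    """Fraction of aligned reads that overlap at least one probe region.

    alignments, probes: iterables of (chrom, start, end) tuples.
    Returns 0.0 when there are no alignments.
    """
    # Group probe intervals by chromosome for quick lookup
    probes_by_chrom = {}
    for chrom, start, end in probes:
        probes_by_chrom.setdefault(chrom, []).append((start, end))

    total = 0
    hits = 0
    for chrom, start, end in alignments:
        total += 1
        if any(overlaps(start, end, p_start, p_end)
               for p_start, p_end in probes_by_chrom.get(chrom, [])):
            hits += 1
    return hits / total if total else 0.0
```

With one probe at ("chr1", 100, 200), a read at ("chr1", 150, 300) counts as on-target, while a read at ("chr1", 200, 250) does not because half-open intervals that merely touch share no base.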

Sequence Manipulation

get_seq_stats.py: Summarize length distribution of a FASTA/FASTQ file.
rev_comp.py: Reverse complement a sequence from command line.
fa2fq.py and fq2fa.py: Convert between FASTA and FASTQ format.
sort_fasta_by_len.py: sort fasta file by length (increasing or decreasing).
get_seqs_from_list.py: extract list of sequences given a fasta file and a list of IDs.
err_correct_w_genome.py: generate fasta sequences given genome and SAM file.
calc_expected_accuracy_from_fastq.py: calculate expected accuracy from FASTQ file. Can be used to calculate expected accuracies in Quiver/Arrow-polished low-quality isoform sequences.
sam_to_bam.py: quick script to run SAM to BAM conversion. Assumes samtools is installed.
sam_to_gff3.py: use BCBio and BioPython to convert SAM file into GFF3 format.
group_ORF_sequences.py: group identical ORF sequences from different isoforms.

See Sequence Manipulation Wiki for usage details.
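For readers unfamiliar with how an expected accuracy can be derived from FASTQ qualities, the sketch below applies the standard Phred definition: a quality score Q corresponds to an error probability of 10^(-Q/10), and the expected accuracy is one minus the mean error probability. Whether calc_expected_accuracy_from_fastq.py computes exactly this is an assumption; the function name and the Phred+33 offset default are also choices made for this sketch.

```python
def expected_accuracy(quality_string, offset=33):
    """Expected per-base accuracy implied by a FASTQ quality string.

    Each character encodes a Phred score Q = ord(char) - offset, and the
    error probability for that base is 10 ** (-Q / 10).
    """
    if not quality_string:
        raise ValueError("empty quality string")
    error_probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    # Accuracy is one minus the mean per-base error probability
    return 1.0 - sum(error_probs) / len(error_probs)
```

As a check on the arithmetic, the two-character quality string "5+" encodes Q20 and Q10, so the error probabilities are 0.01 and 0.1 and the expected accuracy is 1 - 0.11 / 2 = 0.945.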

Cupcake ToFU: supporting scripts for Iso-Seq after clustering step

collapse_isoforms_by_sam.py: Collapse HQ isoform results to unique isoforms (based on genome alignment).
get_abundance_post_collapse.py: Obtain count information post collapse to unique isoforms.
filter_by_count.py: Filter collapse result by FL count information.
filter_away_subset.py: Filter away 5' degraded isoforms.
simple_stats_post_collapse.py: Generating simple stats file to plot in R later.
chain_samples.py: Chaining together multiple samples.
fusion_finder.py: Finding fusion genes.
fusion_collate_info.py: Collate fusion information after running SQANTI(3).
color_bed12_post_sqanti.py: Color BED12 files using FL counts after running SQANTI(3).

See Cupcake: supporting scripts for Iso-Seq after clustering step for usage details.

List of Other Useful Tools

A list of useful tools that complement Cupcake:

SQANTI3: Comparing Cupcake collapsed results against genome annotation (ex: GENCODE)
IsoAnnot and TAPPAS: Functional annotation and differential analysis based on SQANTI3 output.