Home - MethodsDev/LongReadAlignmentAssembler GitHub Wiki

LRAA - Long Read Alignment Assembler

Isoform discovery and quantification based on long read isoform sequence alignments (PacBio or ONT).

LRAA has three modes, described below:

De novo (reference annotation-free) isoform identification and quantification
Reference-guided isoform detection and quantification
Isoform expression quantification only

LRAA also includes functionality to annotate reads or isoform structures according to SQANTI3-like categories. Visit SQANTI‐Like-Isoform-Structure-Comparisons for details.

Isoform Discovery

De novo Reference Annotation-free Isoform Discovery (and quantification)

Given a bam file from aligned reads (using minimap2), perform isoform discovery like so:

LRAA --genome genome.fasta --bam aligned_IsoSeq.mm2.bam

If using with reads that are not PacBio HiFi and have error rates that are > 2%, use the --LowFi parameter. By default, any read alignments with <98% identity are ignored.

Reference Annotation-guided Isoform Discovery (and quantification)

LRAA --genome genome.fasta --bam aligned_IsoSeq.mm2.bam --gtf reference_annotation.gtf

Note that input refernece annotations not found with evidence of expression are excluded from the output. Also, reference annotation structures may be extended if evidence supports alternative TSS or PolyA sites. Transcript identfiers will all be reassigned. Use GFFcompare to the reference to determine relationships to the input ref annotations.

If using with reads that are not PacBio HiFi and have error rates that are > 2%, use the --LowFi parameter. By default, any read alignments with <98% identity are ignored.

Isoform Quantification Only

LRAA --genome genome.fasta --bam aligned_IsoSeq.mm2.bam --gtf target_isoforms.gtf --quant_only

No novel isoform detection is performed. All original gene_id and transcript_ids are retained in the final output, including those with no evidence of expression (0 reads and 0 TPM).

If using with reads that are not PacBio HiFi and have error rates that are > 2%, use the --LowFi parameter. By default, any read alignments with <98% identity are ignored.

Isoform ID and/or Quant for single cell analyses

When using LRAA with single cell data, it's required that the cell barcodes and UMIs are encoded as 'CB' and 'XM' annotations, respectively, in the minimap2 aligned bam file.

Single cell analysis support is being revamped. Check back soon for updates.

Run LRAA in (pseudo)bulk mode (only current execution mode). Once LRAA completes, one of the output files will be a '*.tracking' file that includes read assignments and incorporates the cell barcode and UMI tags.

To construct a cell-by-transcript expression matrix, run the included script:

util/sc/singlecell_tracking_to_sparse_matrix.R --tracking LRAA.tracking --output_prefix sample_name

Note, the above can use a lot of RAM depending on the size of the input data. It's not unusual to require >64G RAM. A future version will be better optimized for memory usage.

Alternatively, if you have run PacBio's 'isoseq bcstats', you can run the script that constructs a filtered and raw cell-by-transcript expression similar to what CellRanger does, this using EmptyDroplet's approach. This script also has the option to produce QC plots, summary statistics and sequencing saturation for the "real cells" similar to CellRangers out.

util/sc/singlecell_tracking_to_sparse_matrix_with_filter_and_qc_options.R --working_dir ./ --tracking LRAA.tracking --bstats_file bstats.output.tsv --sample_name my_precious_sample --create_filtered 1

Optionally - to incorporate reference annotation gene symbols into the gene and isoform feature names, run the following:

 # use gffcompare to map LRAA isoform structures to GENCODE reference annotation structures
 gffcompare -r GENCODE_ref_annot.gtf LRAA.gtf 

 # update the LRAA gene symbols
 util/sc/incorporate_gene_symbols_in_sc_features.py \
       --LRAA_gtf LRAA.gtf  --ref_gtf GENCODE_ref_annot.gtf \
    --gffcompare_tracking gffcmp.tracking \
    --sparseM_dirs 'sample_name^gene-sparseM' 'sample_name^isoform-sparseM'

 # Note, the above will also generate a new GTF file "*.updated.gtf" that will include the revised gene and transcript identifiers.

The resulting sparse matrix can be used with tools such as Seurat for single cell analyses.

Cloud computing (Terra, DNA-Nexus, etc.): LRAA is available on Dockstore

Find LRAA on Dockstore here.

Maximizing parallel execution

Running LRAA directly as shown above can be time-consuming (hours to over a day depending on how large the data set is), as it runs through every chromosome (and top and bottom strand of each chromosome) separately and sequentially.

The cloud computing or local execution can be sped up by running each chromosome-based LRAA assembly in parallel, and this is done through the LRAA WDL workflow and specifying the main_chromosomes parameter with the list of chromosomes to scatter LRAA computes across. An example input json file for doing reference-guided isoform construction might look like so:

    {

    "LRAA_wf.sample_id" : "sample_id",
    "LRAA_wf.inputBAM" : "minimap2.bam",
    "LRAA_wf.annot_gtf" : "Gencode.gtf",
    "LRAA_wf.referenceGenome" : "GRCh38.genome.fa",
        
    "LRAA_wf.main_chromosomes" : "chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY chrM"
        

  }

If running locally, miniwdl could be used like so:

    miniwdl run LRAA.wdl -i input.json

a robust alternative to miniwdl is cromwell