Extracting bundle data for debugging - gpertea/stringtie GitHub Wiki

StringTie groups overlapping alignments and reference annotations into bundles and processes each bundle individually in order to infer its assembled transcripts.

There are rare situations where a particular input data configuration (read alignments + reference annotation in a genomic region) might cause StringTie to crash or produce inadequate results for a specific bundle.

The first step in investigating a situation like this is to identify the bundle (i.e. the genomic region) which triggered the program failure. Since StringTie is generally run with multiple threads (a value >1 for the -p option), several bundles might be in progress at the time of the crash, so the log (generated with the -v option) will not reliably show which bundle actually caused it. The proper way to find the problematic bundle is to re-run StringTie on the same data with the same command line options, except that the -p option must be omitted and the -v option must be added. The last few lines of the stringtie output at stderr will then look something like this:

[08/05 11:39:39]>bundle chr21:5011799-5017150(8) (1 guides) loaded, begins processing...
[08/05 11:39:39]^bundle chr21:5011799-5017150(8) done (1 processed potential transcripts).
[08/05 11:39:39]>bundle chr21:5018461-5244053(4619) (30 guides) loaded, begins processing...
Segmentation fault

In the example above, the bundle causing the crash appears to be chr21:5018461-5244053. Admittedly, it is also possible for the crash to have been caused by the data loading thread, which might have encountered some malformed input data while preparing the next bundle. This "loading thread" reads and parses the read alignments while the current bundle is being processed; it cannot be disabled at run time and it has nothing to do with the -p option. However, this is very unlikely to be the cause of a StringTie crash (unless the input BAM file was corrupted for some reason; samtools can be used to validate it), so for the remainder of this article we'll assume that the problem (a possible StringTie bug) is caused by the bundle being processed.
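When the verbose output is long, the region of the last bundle that began processing can be pulled out of a saved log automatically. The snippet below is only a sketch: the log file name stringtie.log, the output name out.gtf and the helper name last_bundle are assumptions made for illustration, and the pattern relies on log lines having the same shape as the example above. If the run finished normally, the last bundle listed is simply the final one, not a failing one.

```shell
# Single-threaded, verbose re-run (file names follow the article's example;
# out.gtf and stringtie.log are placeholder names):
#   stringtie sample1.bam -G gencode.gtf -v -o out.gtf 2> stringtie.log

# Hypothetical helper: print the region of the last bundle that started processing.
last_bundle() {
    # Keep only ">bundle" lines, take the last one, strip everything but the region.
    grep '>bundle ' "$1" | tail -n 1 | sed -E 's/.*>bundle ([^( ]+)\(.*/\1/'
}

# Usage: last_bundle stringtie.log
```

The printed region can be passed unchanged to the samtools and gffread commands shown further down.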

In order to help developers investigate the StringTie failure on a particular bundle, one should pull all the read alignments from that bundle into a smaller BAM file and share only this smaller BAM file with the developers, if possible. Additionally, if reference annotation was used with the -G option, all the "guides" (reference transcripts) in that bundle should also be extracted to a smaller GFF file and shared with the developers.

Assuming that the original BAM file was called sample1.bam and the reference genome annotation file was called gencode.gtf, the following commands can be used to generate the smaller BAM and GFF files for the bundle in the example above:

Make sure the .bam file is indexed:

samtools index sample1.bam

Extract the alignments from that bundle:

samtools view -b sample1.bam chr21:5018461-5244053 > bundle_c21.bam

Also extract the reference annotation (if any) for that bundle:

gffread -r chr21:5018461-5244053 -o- gencode.gtf | gzip -c > bundle_c21.gff.gz
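For a different bundle, only the region and the file names change in the three commands above. As a minimal sketch, the hypothetical helper below (bundle_cmds is not part of StringTie, samtools or gffread) prints those commands for a given region without running them, so they can be reviewed first. The output file names are derived from the region, which is an assumption of this sketch; the article itself uses the shorter name bundle_c21.

```shell
# Hypothetical dry-run helper: print the extraction commands for one bundle.
# Arguments: region, BAM file, annotation file (all supplied by the user).
bundle_cmds() {
    region=$1 bam=$2 gtf=$3
    # File-name-safe tag derived from the region, e.g. chr21_5018461_5244053
    tag=$(printf '%s' "$region" | tr ':-' '__')
    printf 'samtools index %s\n' "$bam"
    printf 'samtools view -b %s %s > bundle_%s.bam\n' "$bam" "$region" "$tag"
    printf 'gffread -r %s -o- %s | gzip -c > bundle_%s.gff.gz\n' "$region" "$gtf" "$tag"
}

bundle_cmds chr21:5018461-5244053 sample1.bam gencode.gtf
```

Once the printed commands look right, they can be copied into the terminal or piped to sh.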

In order to make sure that this bundle's data is indeed causing StringTie to crash, the user should verify that the error is reproducible by running stringtie again with just these two files as input, keeping the rest of the command line parameters the same as those used when the error first occurred. (The GFF file should be uncompressed for this, so something like this might be needed first: gzip -cd bundle_c21.gff.gz > bundle_c21.gff).
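When checking reproducibility from a script, the exit status of the re-run tells whether the process crashed. The sketch below assumes the common shell convention (bash and most POSIX shells) that a status above 128 means the process was killed by signal number status minus 128; a segmentation fault is signal 11 on Linux, giving 139. The helper name describe_status and the output name out.gtf are hypothetical.

```shell
# Hypothetical helper: describe an exit status saved from "$?".
# Assumes the common shell convention that 128+N means "killed by signal N".
describe_status() {
    if [ "$1" -eq 0 ]; then
        echo "completed normally"
    elif [ "$1" -gt 128 ]; then
        echo "terminated by signal $(( $1 - 128 ))"
    else
        echo "exited with error code $1"
    fi
}

# Usage after the re-run (out.gtf is a placeholder output name):
#   gzip -cd bundle_c21.gff.gz > bundle_c21.gff
#   stringtie bundle_c21.bam -G bundle_c21.gff -v -o out.gtf
#   describe_status $?
```

If the re-run completes normally on the extracted files, the crash was not reproduced and the chosen region may not be the real cause.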

If the error was reproduced on these bundle data on the user side, the two files obtained above (bundle_c21.bam and bundle_c21.gff.gz) should be sent to the developer for debugging, along with the exact command line options used to reproduce the error. If any of these files is larger than a few megabytes (the BAM file might be), sending it as an e-mail attachment might fail, so using a file storage/sharing service such as Dropbox, Box or Google Drive is generally the better option.
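One convenient way to keep the two files and the exact command line together is to put them in a single archive. This is only a suggestion, not a requirement of the developers; the helper name pack_report, the archive name and the cmdline.txt file are all hypothetical.

```shell
# Hypothetical helper: collect the extracted files and the command line in one archive.
# Arguments: archive name, the command line as a single string, then the files.
pack_report() {
    archive=$1 cmdline=$2
    shift 2
    # Record the exact command line used to reproduce the error.
    printf '%s\n' "$cmdline" > cmdline.txt
    tar -czf "$archive" cmdline.txt "$@"
    ls -lh "$archive"    # check the size before deciding how to send it
}

# Usage:
#   pack_report bundle_c21_report.tar.gz \
#     'stringtie bundle_c21.bam -G bundle_c21.gff -v -o out.gtf' \
#     bundle_c21.bam bundle_c21.gff.gz
```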

For such large files, if using a file storage/sharing service is not easy or convenient for the user, we can also provide an FTP server for the user to upload these files. Please send an e-mail to [email protected] if you want to use this FTP upload option, and I'll provide detailed instructions for uploading files to our FTP server.