Importing Sequencing Results

Overview

In the Snyder Mapping Center for ENCODE, sequencing is done at the Stanford Genome Sequencing Service Center (GSSC). GSSC uploads the results to DNAnexus as Projects (essentially folders). For each Illumina flowcell, each lane has a corresponding Project on DNAnexus that contains the FASTQ files and QC reports. Pulsar needs a way to check for new Projects on DNAnexus and import their metadata (e.g. number of reads) into corresponding SequencingResult objects in Pulsar. Recall that a Pulsar SequencingRequest has one or more SequencingRuns, and each SequencingRun has one or more SequencingResults. Each SequencingResult corresponds to a DNAnexus Project, and all of the Projects for a given SequencingRun stem from the same Illumina sequencing run (flowcell).
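
To make that hierarchy concrete, here is a minimal, purely illustrative Python sketch of the relationships. The class and attribute names are stand-ins mirroring the Pulsar records described above, not pulsarpy's actual API:

from dataclasses import dataclass, field
from typing import List

@dataclass
class SequencingResult:
    """One per DNAnexus Project, i.e. one per flowcell lane."""
    dx_project_id: str
    read_count: int = 0

@dataclass
class SequencingRun:
    """One Illumina sequencing run (flowcell)."""
    flowcell_id: str
    results: List[SequencingResult] = field(default_factory=list)

@dataclass
class SequencingRequest:
    record_id: int
    runs: List[SequencingRun] = field(default_factory=list)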

Checking for new DNAnexus Projects

A script is scheduled via Heroku Scheduler to run once a day at 12:30 PST (UTC-8:00). The script, named import_seq_results.py, is part of the pulsarpy_dx Python client. Upon opening the Heroku Scheduler add-on

heroku addons:open scheduler

you can see how the job is scheduled:

[Image: scheduled job]

Let's ignore the dms portion before the Python script for a moment. What you can see here is a script named import_seq_results.py that runs daily. It writes its log files to an S3 bucket in a daily folder (via the --log-s3 flag). It works by scanning DNAnexus for new projects created within the last 2 days (the -d 2 flag) that the user has access to, and checking whether they are billed to the specified account. Some of the sequencing result metadata of such projects is then imported into Pulsar. For that to work, the script has to look up the right SequencingRequest in Pulsar, create a SequencingRun if necessary, and then create any additional SequencingResult records. How does the script know which SequencingRequest to look for? A DNAnexus Project can be tagged with metadata attributes called properties, and GSSC adds a handful automatically, as shown here:

[Image: dx project properties]

The important one here is library_name. The lab lets the GSSC know what value to use for this property; that value is chosen to be the record ID of the SequencingRequest in Pulsar. Thus, the script just needs to look for a SequencingRequest having the ID specified in the library_name property. If a SequencingRequest isn't found, that fact is logged. If any unhandled exception arises during the execution of the script, an email with the error message in the body is sent to the address specified via the environment variable SUPPORT_EMAIL_ADDR. A rough sketch of this whole flow follows.
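
The sketch below uses the dxpy client and the Python standard library. It is not the actual pulsarpy_dx code: the billing org, the SMTP transport, and the function names are assumptions, and the Pulsar lookup itself is reduced to a comment:

import dxpy
import logging
import os
import smtplib
import traceback
from email.message import EmailMessage

logger = logging.getLogger(__name__)
BILLING_ORG = "org-example"  # placeholder: the account projects must be billed to

def scan():
    # Find projects created within the last 2 days (cf. the -d 2 flag).
    for hit in dxpy.find_projects(billed_to=BILLING_ORG, created_after="-2d"):
        # Properties aren't returned by default, so request them explicitly.
        desc = dxpy.api.project_describe(hit["id"], {"properties": True})
        library_name = desc.get("properties", {}).get("library_name")
        if not library_name:
            continue
        # The real script looks up the SequencingRequest whose record ID equals
        # library_name, creating SequencingRun/SequencingResult records as
        # needed; if no such SequencingRequest exists, it just logs that fact.
        logger.info("Project %s -> SequencingRequest %s", hit["id"], library_name)

def notify_support(exc):
    # Assumes an SMTP relay on localhost; the real mail transport may differ.
    msg = EmailMessage()
    msg["Subject"] = "import_seq_results.py failed"
    msg["To"] = os.environ["SUPPORT_EMAIL_ADDR"]
    msg["From"] = os.environ["SUPPORT_EMAIL_ADDR"]
    msg.set_content("".join(traceback.format_exception(type(exc), exc, exc.__traceback__)))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    try:
        scan()
    except Exception as exc:
        notify_support(exc)
        raise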

The dms portion of the scheduled job is a command-line utility that comes from the heroku-buildpack-dms Heroku buildpack. I installed this buildpack and then added the deadmanssnitch add-on as follows:

heroku buildpacks:add https://github.com/deadmanssnitch/heroku-buildpack-dms
heroku addons:create deadmanssnitch

The purpose of dms is to give me a way of knowing if my script doesn't run for some reason (e.g. the web server node crashed). Indeed, the DNAnexus sequencing-result import script is very lightweight and so can run on the web server node. Note that since the web app is based on Rails, there isn't Python support on the web dyno unless one installs the Python buildpack, which I had to do beforehand:

heroku buildpacks:add heroku/python
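
If you want to verify which buildpacks are attached to the app (and in what order), the Heroku CLI can list them:

heroku buildpacks -a pulsar-encode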

How to create a snitch

To use dms, you first have to create a snitch. In the first image above, which shows the scheduled job, the alphanumeric string 21d714c7da is the snitch ID. To create your own snitch, open the deadmanssnitch add-on with heroku addons:open deadmanssnitch and follow the instructions there. That will result in a snitch ID that you can use when scheduling a job via Heroku Scheduler.
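
With a snitch in hand, the scheduled command takes the general shape shown below. The snitch ID is the one from the screenshot above; the script path and the --log-s3 argument are placeholders, so consult the screenshot for the exact invocation:

dms 21d714c7da python import_seq_results.py -d 2 --log-s3 <s3-bucket>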

Deploying updates to import_seq_results.py

Because the Python buildpack is added to the Pulsar app on Heroku, it installs the pip packages listed in your requirements.txt file (if present in the app's root directory). The buildpack caches builds, so if, say, the pulsarpy_dx source code is updated and pushed back to GitHub, you need a way to have Heroku pick up those changes. Simply creating a new version of your app won't do the trick, since the buildpack will still use the cache. The solution is to purge that cache using a Heroku CLI plugin called Heroku Builds. Here is how I would purge the cache for the Pulsar app:

heroku builds:cache:purge -a pulsar-encode
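
As an aside to the requirements.txt point above, pip can install pulsarpy_dx straight from GitHub using its VCS syntax, along these lines (the repository URL here is an assumption):

git+https://github.com/StanfordBioinformatics/pulsarpy_dx.git#egg=pulsarpy_dx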

You still need to create a new version of your app so that it rebuilds with the updated environment. You can use an empty commit and then push to Heroku (assuming the usual git-based deploy):

git commit --allow-empty -m "Force Heroku rebuild"
git push heroku master