The Reference Repository - pb-dyim/SMRT-Analysis GitHub Wiki

The reference repository is a file-based data store used by SMRT Analysis to manage reference sequences and associated information. The full description of all of the attributes of the reference repository is beyond the scope of this document, but you need to use some basic aspects of the reference repository in most SMRT Pipe analyses.

Example: Analysis of multi-contig references can only be handled by supplying a reference entry from a reference repository.

It is simple to create and use a reference repository:

  • A reference repository can be any directory on your system. You can have as many reference repositories as you wish; the input to SMRT Pipe is a fully resolved path to a reference entry, so this can live in any accessible reference repository.

Starting with the FASTA sequence genome.fasta, you upload the sequence to your reference repository using the following command:

referenceUploader -c -p/path/to/repository -nGenomeName
-fgenome.fasta

where:

  • /path/to/repository is the path to your reference repository.
  • GenomeName is the name to use for the reference entry that will be created.
  • genome.fasta is the FASTA file containing the reference sequence to upload.

For a large genome, we highly recommended that you produce the BLASR suffix array during this upload step. Use the following command:

referenceUploader -c -p/path/to/repository -nHumanGenome -fhuman.fasta --Saw='sawriter -welter'

There are many more options for reference management. Consult the MAN page entry for referenceUploader by entering referenceUploader -h.

To learn more about what is being stored in the reference entries, look at the directory containing a reference entry. You will find a metadata description (reference.info.xml) of the reference and its associated files. For example, various static indices for BLASR and SMRT View are stored in the sequence directory along with the FASTA sequence.