Usage Information - kuleshov/architect GitHub Wiki

Input data

Architect takes as input:

  • Genomic contigs in fasta format assembled using a standard (short-read) assembler.
  • A mapping of read clouds to contigs in bam format.
  • Optionally, an alignment of paired-end reads to the contigs.

Scaffolding Usage

Architect is run as follows.

usage: architect.py scaffold [-h] --fasta FASTA [--edges EDGES] --containment
                             CONTAINMENT --out OUT [--min-ctg-len MIN_CTG_LEN]
                             [--cut-tip-len CUT_TIP_LEN]
                             [--pe-abs-thr PE_ABS_THR]
                             [--pe-rel-thr PE_REL_THR]
                             [--pe-rc-rel-thr PE_RC_REL_THR]
                             [--rc-abs-thr RC_ABS_THR]
                             [--rc-rel-edge-thr RC_REL_EDGE_THR]
                             [--rc-rel-prun-thr RC_REL_PRUN_THR] [--log LOG]

optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA         Input scaffolds/contigs
  --edges EDGES         Known paired-end or read cloud connections
  --containment CONTAINMENT
                        Container hits and various meta-data
  --out OUT             Prefix for the output files
  --min-ctg-len MIN_CTG_LEN
                        Discard contigs smaller than this length (def: 0)
  --cut-tip-len CUT_TIP_LEN
                        Cut tips smaller than this length
  --pe-abs-thr PE_ABS_THR
                        Threshold for absolute support when pruning paired-end
                        edges
  --pe-rel-thr PE_REL_THR
                        Threshold for relative support when pruning paired-end
                        edges
  --pe-rc-rel-thr PE_RC_REL_THR
                        Threshold for relative support for read-cloud /
                        paired-end pruning
  --rc-abs-thr RC_ABS_THR
                        Minimum support for creating a read-cloud based edge
  --rc-rel-edge-thr RC_REL_EDGE_THR
                        Threshold for relative support when creating read-
                        cloud based edges
  --rc-rel-prun-thr RC_REL_PRUN_THR
                        Threshold for relative support when pruning read-cloud
                        based edges
  --log LOG             Save stdout to log file

A containment file encodes container hits in the genome. The edges file encodes paired-end read information. The fasta file contains the pre-assembled contigs.

The input files are generated from bam alignments using scripts in the /scripts folder.

The user may tune Architect via various parameters described above. At the moment, we have set their defaults to values that were found to work well with Illumina TruSeq Synthetic Long Read datasets.
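As a minimal sketch, the scaffolding command can be assembled from Python before handing it to a job runner. The helper name build_scaffold_cmd and the file names (contigs.fasta, contigs.edges, contigs.containment, scaffolds) are hypothetical. Only flags that appear in the usage text above are accepted, and the value 4 for --rc-abs-thr is just the value the ISMB pseudocode is said to use, not a recommended setting.

```python
import sys

# Tunable flags copied from the usage text, written as Python keyword names.
TUNABLE = {
    "min_ctg_len", "cut_tip_len", "pe_abs_thr", "pe_rel_thr",
    "pe_rc_rel_thr", "rc_abs_thr", "rc_rel_edge_thr", "rc_rel_prun_thr",
}

def build_scaffold_cmd(fasta, containment, out, edges=None, **thresholds):
    """Assemble an argv list for `architect.py scaffold` (hypothetical helper)."""
    unknown = set(thresholds) - TUNABLE
    if unknown:
        raise ValueError("not in the usage text: %s" % sorted(unknown))
    cmd = [sys.executable, "architect.py", "scaffold",
           "--fasta", fasta, "--containment", containment, "--out", out]
    if edges is not None:  # --edges is optional in the usage text
        cmd += ["--edges", edges]
    # Keyword names map to flags, e.g. rc_abs_thr=4 becomes --rc-abs-thr 4.
    for name, value in sorted(thresholds.items()):
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd

cmd = build_scaffold_cmd("contigs.fasta", "contigs.containment", "scaffolds",
                         edges="contigs.edges", rc_abs_thr=4)
print(" ".join(cmd[1:]))  # the command line without the interpreter path
```

The list can be passed to subprocess.run(cmd, check=True) once the input files exist.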

Details of the scaffolding algorithm

The scaffolding algorithm proceeds in three stages, as described in the ISMB paper.

  1. Graph formation
  2. Graph pruning
  3. Contraction of unambiguous paths

At the graph formation stage, read cloud edges are added when they have more than --rc-abs-thr wells supporting them (this was fixed to 4 in the pseudocode in the ISMB paper).
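The absolute-support rule can be illustrated with a small sketch. It assumes that each contig comes with the set of well ids whose read clouds map to it, and that the support of a candidate edge is the number of wells the two contigs share. The function name, the data layout, and the toy well ids are hypothetical; the relative threshold (--rc-rel-edge-thr) and contig orientation are left out.

```python
from itertools import combinations

def read_cloud_edges(wells_by_contig, rc_abs_thr=4):
    """Return {(u, v): shared_well_count} for pairs sharing more than rc_abs_thr wells.

    Sketch only: Architect's real graph formation also applies a relative
    threshold and tracks orientation, neither of which is modelled here.
    """
    edges = {}
    for u, v in combinations(sorted(wells_by_contig), 2):
        shared = wells_by_contig[u] & wells_by_contig[v]
        if len(shared) > rc_abs_thr:  # strictly more than the threshold
            edges[(u, v)] = len(shared)
    return edges

# Toy input: well ids are arbitrary integers.
wells_by_contig = {
    "ctg1": {1, 2, 3, 4, 5, 6},
    "ctg2": {2, 3, 4, 5, 6, 7},
    "ctg3": {6, 7, 8},
}
edges = read_cloud_edges(wells_by_contig)
print(edges)
```

Here only ctg1 and ctg2 share more than four wells, so they are the only pair that receives an edge.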

The graph pruning stage also consists of three steps:

  1. Pruning links using paired-end reads, when links have less than --pe-abs-thr absolute paired-end support and less than --pe-rel-thr relative support (the cutoffs correspond respectively to tau1 and tau2 in the paper)
  2. Pruning the links using both paired-ends and read clouds, when there is less than --pe-rc-rel-thr support (rho1 in the paper)
  3. Pruning links using read clouds, when there is less than --rc-rel-prun-thr relative support (rho2 in the paper) or the link is supported by fewer than --rc-abs-thr wells that are in common between its two adjacent vertices.
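The first pruning step might look roughly like the sketch below. This page does not define relative support, so the sketch assumes it is the link's paired-end support divided by the largest support on any link touching either endpoint; that definition, the function name, and the threshold values 5 and 0.5 are all illustrative assumptions rather than Architect's actual behaviour or defaults.

```python
def prune_paired_end(pe_support, pe_abs_thr, pe_rel_thr):
    """Drop links that are weak in both absolute and relative paired-end support.

    ASSUMPTION: relative support of (u, v) is its support divided by the
    largest support on any link touching u or v. The wiki does not define it.
    """
    best = {}
    for (u, v), support in pe_support.items():
        best[u] = max(best.get(u, 0), support)
        best[v] = max(best.get(v, 0), support)
    kept = {}
    for (u, v), support in pe_support.items():
        strongest = max(best[u], best[v])
        relative = support / strongest if strongest else 0.0
        # A link is pruned only when it fails both cutoffs (tau1 and tau2).
        if support < pe_abs_thr and relative < pe_rel_thr:
            continue
        kept[(u, v)] = support
    return kept

pe_support = {("a", "b"): 20, ("a", "c"): 2, ("c", "d"): 3}
kept = prune_paired_end(pe_support, pe_abs_thr=5, pe_rel_thr=0.5)
print(kept)
```

In this toy input the a-c link is removed because it is weak both in absolute terms and relative to the much stronger a-b link, while c-d survives because it is the strongest link at both of its endpoints.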

Finally, we produce the output orderings by contracting unambiguous paths in the pruned graph.
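A simplified sketch of the contraction idea follows. It ignores contig orientation and treats a link as unambiguous only when neither endpoint has more than two neighbours; both are simplifying assumptions, and the function name and toy graph are hypothetical. Each returned list is one ordering of contigs.

```python
def contract_unambiguous_paths(vertices, edges):
    """Return orderings of vertices along maximal non-branching chains.

    Simplification: orientation is ignored, and every link touching a
    vertex with more than two neighbours is treated as ambiguous.
    """
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    branching = {v for v, nbrs in adj.items() if len(nbrs) > 2}
    # Keep only links whose two endpoints are both non-branching.
    chain = {v: set() if v in branching else adj[v] - branching
             for v in vertices}
    paths, seen = [], set()
    # Visit chain ends first so each path is walked from one end;
    # vertices still unseen afterwards lie on cycles and are walked last.
    for start in sorted(vertices, key=lambda v: (len(chain[v]) > 1, v)):
        if start in seen:
            continue
        path, prev, cur = [start], None, start
        seen.add(start)
        while True:
            nxt = sorted(w for w in chain[cur] if w != prev and w not in seen)
            if not nxt:
                break
            prev, cur = cur, nxt[0]
            path.append(cur)
            seen.add(cur)
        paths.append(path)
    return paths

vertices = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("d", "f")]
paths = contract_unambiguous_paths(vertices, edges)
print(paths)
```

Vertex d has three neighbours, so its links are left uncontracted and d, e, and f each end up as a single-contig ordering, while a, b, and c are merged into one.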

Inspecting the assembly graph

If an assembly does not reach the desired level of quality, it may be useful to inspect the scaffold graph to find out what went wrong. Architect has a simple viewer module for this purpose.

usage: architect.py view [-h] --fasta FASTA --edges EDGES --containment
                         CONTAINMENT [--edge EDGE] [--vertex VERTEX]
                         [--check-correctness] [--dot DOT] [--gfa GFA]
                         [--log LOG]

optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA         Input scaffolds/contigs
  --edges EDGES         Known paired-end or read cloud connections
  --containment CONTAINMENT
                        Container hits and various meta-data
  --edge EDGE           View neighborhood around particular edge id
  --vertex VERTEX       View neighborhood around particular vertex id
  --check-correctness   Check if edges are correct using true intervals
  --dot DOT             Dotfile for visualization in Abyss Explorer
  --gfa GFA             GFA file for visualization in Bandage
  --log LOG             Save stdout to log file

The input parameters fasta, edges, and containment are the same as in the scaffolding module.

We can ask Architect to print information about a given vertex or edge (and its neighborhood). For that, we use the --edge and --vertex flags, together with the corresponding ids. In addition to neighboring vertices and adjacent edges, this will print information about incident wells, as well as the true genomic regions of the vertices (if the right information is given as input; see below).

We may also print the entire assembly graph in .gfa or .dot format, which can then be fed into third-party visualization software.
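The two kinds of viewer queries described above could be invoked along the following lines. This is a sketch under assumptions: the input file names, the output name graph.gfa, and the vertex id 12 are made up, and whether the query and export flags can be combined in a single call is not stated here, so they are kept in separate commands.

```python
import sys

# Arguments shared by every viewer call (file names are hypothetical).
BASE = [sys.executable, "architect.py", "view",
        "--fasta", "contigs.fasta",
        "--edges", "contigs.edges",
        "--containment", "contigs.containment"]

# Inspect the neighbourhood of one vertex (the id 12 is made up).
inspect_vertex = BASE + ["--vertex", "12"]

# Write the whole graph as GFA, e.g. for loading into Bandage.
export_gfa = BASE + ["--gfa", "graph.gfa"]

for cmd in (inspect_vertex, export_gfa):
    print(" ".join(cmd[1:]))  # command line without the interpreter path

# To actually run one of them: subprocess.run(cmd, check=True)
```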

Providing ground-truth alignments

When analyzing the assembly of a genome with a known reference, it can be very useful to give Architect the true alignments of the input contigs to the known genome.

These alignments can be obtained by aligning the input contigs/scaffolds to the known reference using a tool such as MUMmer. The result is a set of intervals chr:start-end for each contig, which indicate that a portion of the contig aligns to those intervals.
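Interval strings of the form chr:start-end can be parsed with a few lines of Python, which is handy when post-processing aligner output before writing it to the containment file. The Interval type and parse_interval function are hypothetical helpers, not part of Architect, and the sketch makes no assumption about whether coordinates are 0-based or 1-based.

```python
import re
from typing import NamedTuple

class Interval(NamedTuple):
    chrom: str
    start: int
    end: int

# The sequence name may itself contain ':'; the last ':' before the
# numeric range is taken as the separator.
_INTERVAL_RE = re.compile(r"^(?P<chrom>.+):(?P<start>\d+)-(?P<end>\d+)$")

def parse_interval(text):
    """Parse 'chr:start-end' into an Interval, raising ValueError if malformed."""
    match = _INTERVAL_RE.match(text.strip())
    if match is None:
        raise ValueError("not a chr:start-end interval: %r" % text)
    interval = Interval(match["chrom"], int(match["start"]), int(match["end"]))
    if interval.start > interval.end:
        raise ValueError("start is greater than end: %r" % text)
    return interval

iv = parse_interval("chr2:1500-4200")
print(iv.chrom, iv.start, iv.end)
```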

When we start an assembly, we provide these true alignments in the .containment file in the form of R-type records. Architect keeps track of these intervals during the scaffolding process. If we then want to query the assembly graph using the view module, we will be able to see the true regions associated with each vertex. This information can be used in conjunction with the read-cloud alignment to debug the scaffolding process.