Usage Information - kuleshov/architect GitHub Wiki
Input data
Architect takes as input:
- Genomic contigs in fasta format, assembled using a standard (short-read) assembler.
- A mapping of read clouds to contigs in bam format.
- Optionally, an alignment of paired-end reads to the contigs.
Scaffolding Usage
Architect is run as follows.
usage: architect.py scaffold [-h] --fasta FASTA [--edges EDGES] --containment
CONTAINMENT --out OUT [--min-ctg-len MIN_CTG_LEN]
[--cut-tip-len CUT_TIP_LEN]
[--pe-abs-thr PE_ABS_THR]
[--pe-rel-thr PE_REL_THR]
[--pe-rc-rel-thr PE_RC_REL_THR]
[--rc-abs-thr RC_ABS_THR]
[--rc-rel-edge-thr RC_REL_EDGE_THR]
[--rc-rel-prun-thr RC_REL_PRUN_THR] [--log LOG]
optional arguments:
-h, --help show this help message and exit
--fasta FASTA Input scaffolds/contigs
--edges EDGES Known paired-end or read cloud connections
--containment CONTAINMENT
Container hits and various meta-data
--out OUT Prefix for the output files
--min-ctg-len MIN_CTG_LEN
Discard contigs smaller than this length (def: 0)
--cut-tip-len CUT_TIP_LEN
Cut tips smaller than this length
--pe-abs-thr PE_ABS_THR
Threshold for absolute support when pruning paired-end
edges
--pe-rel-thr PE_REL_THR
Threshold for relative support when pruning paired-end
edges
--pe-rc-rel-thr PE_RC_REL_THR
Threshold for relative support for read-cloud /
paired-end pruning
--rc-abs-thr RC_ABS_THR
Minimum support for creating a read-cloud based edge
--rc-rel-edge-thr RC_REL_EDGE_THR
Threshold for relative support when creating read-
cloud based edges
--rc-rel-prun-thr RC_REL_PRUN_THR
Threshold for relative support when pruning read-cloud
based edges
--log LOG Save stdout to log file
A containment file encodes container hits in the genome. The edges file encodes paired-end read information. The fasta file contains the pre-assembled contigs.
The input files are generated from bam alignments using scripts in the /scripts folder.
The user may tune Architect via various parameters described above. At the moment, we have set their defaults to values that were found to work well with Illumina TruSeq Synthetic Long Read datasets.
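As an illustration, a scaffolding run can be launched from Python by assembling the command line from the usage message above. This is only a sketch: the file names, the interpreter name, and the threshold value are placeholders rather than recommended settings, and the helper build_scaffold_cmd is hypothetical (it is not part of Architect).

```python
def build_scaffold_cmd(fasta, containment, out, edges=None, **thresholds):
    """Assemble an 'architect.py scaffold' command line as a list of arguments.

    Keyword arguments such as rc_abs_thr=4 are turned into the
    corresponding flags from the usage message (--rc-abs-thr 4).
    """
    cmd = ["python", "architect.py", "scaffold",
           "--fasta", fasta,
           "--containment", containment,
           "--out", out]
    if edges is not None:  # --edges is optional
        cmd += ["--edges", edges]
    for name, value in sorted(thresholds.items()):
        cmd += ["--" + name.replace("_", "-"), str(value)]
    return cmd


# Placeholder file names; substitute the files produced for your own data.
cmd = build_scaffold_cmd("contigs.fasta", "contigs.containment", "scaffolds",
                         edges="contigs.edges", rc_abs_thr=4)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the argument list separately from running it makes it easy to log or inspect the exact command before a long scaffolding job starts.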
Details of the scaffolding algorithm
The scaffolding algorithm proceeds in three stages, as described in the ISMB paper.
- Graph formation
- Graph pruning
- Contraction of unambiguous paths
At the graph formation stage, a read-cloud edge is added when more than --rc-abs-thr wells support it (this threshold was fixed to 4 in the pseudocode in the ISMB paper).
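The absolute-support rule for read-cloud edges can be sketched as follows. This is a simplified illustration, not Architect's implementation: it assumes each contig is described only by the set of wells with reads mapping to it, it ignores contig orientation and the relative threshold --rc-rel-edge-thr, and the function name and the example data are made up. The value 4 is the one quoted from the paper's pseudocode, not necessarily the tool's default.

```python
from itertools import combinations


def read_cloud_edges(wells_by_contig, rc_abs_thr=4):
    """Return {(contig_a, contig_b): shared_well_count} for supported pairs.

    An edge is created only when MORE than rc_abs_thr wells are shared.
    """
    edges = {}
    for a, b in combinations(sorted(wells_by_contig), 2):
        shared = wells_by_contig[a] & wells_by_contig[b]
        if len(shared) > rc_abs_thr:
            edges[(a, b)] = len(shared)
    return edges


# Hypothetical well ids observed on three contigs.
wells = {
    "ctg1": {1, 2, 3, 4, 5, 6},
    "ctg2": {2, 3, 4, 5, 6, 7},
    "ctg3": {6, 7, 8},
}
print(read_cloud_edges(wells))  # only ctg1-ctg2 share more than 4 wells
```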
The graph pruning stage also consists of three steps:
- Pruning links using paired-end reads, when a link has less than --pe-abs-thr absolute paired-end support and less than --pe-rel-thr relative support (the cutoffs correspond to tau1 and tau2 in the paper, respectively).
- Pruning links using both paired-end reads and read clouds, when there is less than --pe-rc-rel-thr relative support (rho1 in the paper).
- Pruning links using read clouds, when there is less than --rc-rel-prun-thr relative support (rho2 in the paper) or the link is supported by fewer than --rc-abs-thr wells that are common to its two adjacent vertices.
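The first pruning step (paired-end support) might look like the sketch below. The definition of relative support used here, an edge's support divided by the strongest support seen at either of its endpoints, is an assumption made for illustration; the paper gives the precise definition. The function name, example graph, and threshold values are likewise hypothetical. The code assumes Python 3 division.

```python
def prune_paired_end(edges, pe_abs_thr, pe_rel_thr):
    """Drop edges that are weak in both absolute and relative terms.

    edges maps (u, v) to the number of supporting paired-end reads.
    """
    # Strongest paired-end support observed at each vertex.
    best = {}
    for (u, v), support in edges.items():
        best[u] = max(best.get(u, 0), support)
        best[v] = max(best.get(v, 0), support)

    kept = {}
    for (u, v), support in edges.items():
        top = max(best[u], best[v])
        # Assumed definition of relative support (see lead-in).
        relative = support / top if top else 0.0
        if support < pe_abs_thr and relative < pe_rel_thr:
            continue  # below both cutoffs: prune
        kept[(u, v)] = support
    return kept


pe_edges = {("a", "b"): 20, ("a", "c"): 2, ("c", "d"): 3}
print(prune_paired_end(pe_edges, pe_abs_thr=5, pe_rel_thr=0.5))
```

In this example the a-c link is removed because it fails both cutoffs, while c-d survives: its absolute support is low, but it is the strongest link at both of its endpoints.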
Finally, we produce the output orderings by contracting unambiguous paths in the pruned graph.
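The contraction step can be illustrated with a deliberately simplified model: an undirected graph over contigs in which an edge is treated as unambiguous only if neither endpoint has more than two neighbors. Architect's real scaffold graph also tracks contig ends and orientations, which this sketch ignores; the function name and example graph are hypothetical.

```python
def contract_unambiguous_paths(adj):
    """Return contig orderings obtained by contracting unambiguous paths.

    adj maps each vertex to the set of its neighbors (symmetric).
    """
    # Keep an edge only if neither endpoint has more than two neighbors.
    simple = {
        v: {u for u in nbrs if len(nbrs) <= 2 and len(adj[u]) <= 2}
        for v, nbrs in adj.items()
    }

    def walk(start, visited):
        # Follow the (at most one) unvisited neighbor until the path ends.
        path = [start]
        visited.add(start)
        current = start
        while True:
            unvisited = sorted(u for u in simple[current] if u not in visited)
            if not unvisited:
                return path
            current = unvisited[0]
            path.append(current)
            visited.add(current)

    visited, paths = set(), []
    # Start from path ends first; any vertices left over lie on cycles.
    ends = [v for v in sorted(simple) if len(simple[v]) < 2]
    for v in ends + sorted(simple):
        if v not in visited:
            paths.append(walk(v, visited))
    return paths


graph = {
    "c1": {"c2"}, "c2": {"c1", "c3"}, "c3": {"c2", "c4"},
    "c4": {"c3", "c5", "c6"}, "c5": {"c4"}, "c6": {"c4"},
}
print(contract_unambiguous_paths(graph))
```

Here c4 has three neighbors, so every link touching it is ambiguous; c1, c2, and c3 are merged into one ordering and the remaining contigs are reported on their own.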
Inspecting the assembly graph
If an assembly does not achieve the desired level of quality, it may be useful to inspect the scaffold graph to see what went wrong. Architect has a simple viewer module for this purpose.
usage: architect.py view [-h] --fasta FASTA --edges EDGES --containment
CONTAINMENT [--edge EDGE] [--vertex VERTEX]
[--check-correctness] [--dot DOT] [--gfa GFA]
[--log LOG]
optional arguments:
-h, --help show this help message and exit
--fasta FASTA Input scaffolds/contigs
--edges EDGES Known paired-end or read cloud connections
--containment CONTAINMENT
Container hits and various meta-data
--edge EDGE View neighborhood around particular edge id
--vertex VERTEX View neighborhood around particular vertex id
--check-correctness Check if edges are correct using true intervals
--dot DOT Dotfile for visualization in Abyss Explorer
--gfa GFA GFA file for visualization in Bandage
--log LOG Save stdout to log file
The input parameters fasta, edges, and containment are the same as in the scaffolding module.
We can also ask Architect to print information about a given vertex or edge (and its neighborhood) using the --edge and --vertex flags, together with the corresponding ids. In addition to neighboring vertices and adjacent edges, this prints information about incident wells, as well as the true genomic regions of the vertices (if the right information is given as input; see below).
We may also print the entire assembly graph in .gfa or .dot format, which can then be fed into third-party visualization software.
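For readers unfamiliar with the .gfa format, the sketch below writes a tiny graph as GFA 1 text, with a header line, one S (segment) line per contig, and one L (link) line per edge. It does not reproduce Architect's own exporter; the function name, the contig sequences, and the use of a 0M overlap for every link are assumptions chosen for illustration.

```python
def to_gfa(sequences, links):
    """Render contigs and links as GFA 1 text.

    sequences: {contig_name: sequence}
    links: iterable of (from_name, from_orient, to_name, to_orient),
           with orientations given as "+" or "-".
    """
    lines = ["H\tVN:Z:1.0"]
    for name in sorted(sequences):
        lines.append("S\t{}\t{}".format(name, sequences[name]))
    for src, src_orient, dst, dst_orient in links:
        # 0M: the two contigs are linked without any overlap.
        lines.append("L\t{}\t{}\t{}\t{}\t0M".format(src, src_orient, dst, dst_orient))
    return "\n".join(lines) + "\n"


gfa_text = to_gfa({"c1": "ACGT", "c2": "GGCC"}, [("c1", "+", "c2", "-")])
print(gfa_text, end="")
```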
Providing ground-truth alignments
When analyzing the assembly of a genome with a known reference, it can be very useful to give Architect the true alignments of the input contigs to the known genome.
These alignments can be obtained by aligning the input contigs/scaffolds to the known reference using a tool such as MUMmer. The result is a set of intervals chr:start-end for each contig, each indicating a region of the reference to which a portion of the contig aligns.
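Intervals written as chr:start-end are easy to handle programmatically. The sketch below parses one such string and checks whether two intervals lie close together on the same chromosome, which is one plausible way to judge whether a link between two contigs is consistent with the reference. Both helper names and the max_gap parameter are hypothetical and are not taken from Architect.

```python
def parse_interval(text):
    """Parse 'chr:start-end' into (chrom, start, end) with integer coordinates."""
    # rpartition tolerates chromosome names that themselves contain ':'.
    chrom, _, span = text.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start, end = int(start_text), int(end_text)
    if not chrom or start > end:
        raise ValueError("malformed interval: {!r}".format(text))
    return chrom, start, end


def intervals_nearby(a, b, max_gap):
    """True if both intervals are on the same chromosome, at most max_gap apart."""
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return False
    gap = max(start_a, start_b) - min(end_a, end_b)  # negative when overlapping
    return gap <= max_gap


left = parse_interval("chr1:1000-2500")
right = parse_interval("chr1:3000-4000")
print(left, right, intervals_nearby(left, right, max_gap=1000))
```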
When we start an assembly, we provide these true alignments in the .containment file in the form of R-type records. Architect keeps track of these intervals during the scaffolding process. If we then query the assembly graph using the view module, we will be able to see the true regions associated with each vertex. This information can be used in conjunction with the read-cloud alignment to debug the scaffolding process.