D. melanogaster Example - kuleshov/architect GitHub Wiki
Let's try out Architect on an earlier assembly of D. melanogaster.
We will take as input a set of contigs assembled from regular short reads using SPAdes. In addition, we will use as input pre-computed links between contigs that were derived from an alignment of paired-end reads, as well as "container hits" obtained from aligning a TLSR read cloud library, in which the short reads have been subsampled to 25%.
More concretely, we will run Architect on 3 files:
-
scaffolds.fasta
: scaffolds obtained via SPAdes from short reads -
edges.tsv
: paired-end links obtained from the same reads -
drosophila.containment
: container hits obtained from aligning a subsampled TSLR read cloud library
These files can be downloaded from https://stanford.box.com/s/10crf36gvf998ylhjo59pr7xswwb0uga
.
We may run the scaffolder on this data as follows:
python architect.py scaffold \
--fasta scaffolds.fasta \
--edges edges.tsv \
--containment drosophila.containment \
--out drosophila
This will produce two files:
-
drosophila.ordering
: which will contain the scaffold orderings. -
drosophila.fasta
: scaffolds determined by the ordering; the number of N's between contigs is unreliable (may be off by +/-2kbp) and may confuse aligners
The format of the ordering file is:
<ordering id> <ctg_id1;containment_string1;orientation1> <ctg_id2;containment_string2;orientation2>
A containment string, if available, indicates the regions of the (known) reference genome to which a contig maps.
It can be used for validation; its format is ctg:start-end
.
Finally, orientation
is either R
or S
, indicating whether the contig came from the reverse or the forward strand, respectively.
We can use the information in the ordering
file to verify our accuracy.
This is done using an evaluation script:
python scripts/na50.py -o drosophila.ordering
This will print statistics for each contig. The most important line is the last one; it prints out five numbers:
<num_orderings> <num_misorderings> <bp_verifiable> <n50> <na50>
Here, the misorderings are the equivalent of a misassembly (in the QUAST sense). The NA50 is also an extension of the QUAST definition to orderings.
In our case, this result should read:
57567 40 122423762 262797 252190
The alignment information in drosophila.layout
is
taken from drosophila.containment
. This file contains a mapping of
most contigs to the known Drosophila reference. This information
may be used for internald diagnostics, as well as for evaluating the final layout.
The input files edges.tsv
and drosophila.containment
are generated
via scripts in the /scripts
subfolder.
The paired-end link file edges-tsv
is generated using pe-connections.py
from an alignment of paired-end reads to the input scaffolds.fasta
.
We used the same paired-end reads as for de-novo assembly to generate
edges.tsv
.
The file drosophila.containment
is actually a concatenation of two files:
- A map of container hits; these correspond to entries with
W
in the first column. - An alignment of contigs to the reference; these are lines with
R
in the first column.
The container hits are obtained using the script bam_to_containment.py
from an alignment of TSLR read clouds to scaffolds.fasta
.
The true intervals are converted directly from the true alignments
identified by the QUAST tool. The scripts for generating them can
be found in the scripts/ground-truth
subfolder.