GBZ format for space-efficient storage of GFA with many paths.
Wrapper for GBWT and GBWTGraph serialized using the simple-sds format.
Better defined semantics.
Serialization and loading use exceptions to handle failures.
Requires version 2.3 of the vgteam fork of SDSL and GBWT version 1.3.
gfa2gbwt outputs the compressed simple-sds .gbz format by default.
v0.6 (2021-03-17)
Uses the vgteam fork of SDSL.
In CMake builds, uses the same SDSL as GBWT.
When a GFA file contains both P-lines and W-lines, use the P-lines as reference paths.
Reference paths have sample name _gbwt_ref and the path name as contig name.
GBWTGraph file format version 2.
Optional node-to-segment translation for GFA files.
Compatible with version 1.
Preliminary SegmentHandleGraph interface for GBWTGraph.
If node-to-segment translation is included in the graph, each handle maps to a (segment name, starting offset) pair.
Each node also maps to a (segment name, node id range) pair.
Iteration over segments and links.
Compressed GBWTGraph file format for GFA storage.
Both GBWT and GBWTGraph in the same file. GBWTGraph typically compresses to ~20% of the original space.
gfa2gbwt can output plain and compressed formats and convert between the two.
GFA extraction from GBWTGraph.
v0.5 (2021-01-15)
Major improvements to GFA parsing:
Multi-pass algorithm that validates the input before starting GBWT construction.
Segment names are automatically translated into integer ids if necessary.
Long segments are broken down into smaller nodes (at most 1024 bp by default) if necessary.
Segment-to-node translation table is written to basename.trans.
GBWT metadata can be generated by parsing path names.
Support for proposed W-lines.
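
The segment chopping and translation table described above can be illustrated with a small standalone sketch. This is not the library's parser. The `Translation` and `Node` structs, the `chop_segment()` helper, and the half-open node id range are hypothetical choices made for this example; only the default limit of 1024 bp comes from the notes above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record: a segment mapped to the half-open node id range [first, limit).
struct Translation
{
  std::string segment;
  std::uint64_t first;
  std::uint64_t limit;
};

// Hypothetical node: an integer id and the sequence it carries.
struct Node
{
  std::uint64_t id;
  std::string sequence;
};

// Splits `sequence` into consecutive nodes of at most `max_node_length` bp.
// Node ids are assigned from `next_id`, which is advanced past the ids used.
Translation chop_segment(const std::string& name, const std::string& sequence,
                         std::uint64_t& next_id, std::vector<Node>& nodes,
                         std::size_t max_node_length = 1024)
{
  assert(max_node_length > 0);
  assert(!sequence.empty());
  Translation result { name, next_id, next_id };
  for(std::size_t offset = 0; offset < sequence.length(); offset += max_node_length)
  {
    // The last piece may be shorter than the limit.
    std::size_t length = std::min(max_node_length, sequence.length() - offset);
    nodes.push_back({ next_id, sequence.substr(offset, length) });
    next_id++;
  }
  result.limit = next_id;
  return result;
}
```

Under these assumptions, a 2500 bp segment becomes three nodes of 1024, 1024, and 452 bp, and the returned record is the kind of entry a translation table could store for that segment.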
v0.4 (2020-11-05)
CachedGBWTGraph: A GBWTGraph overlay that uses the cached interface automatically. Intended for algorithms that repeatedly access the edges in a small subgraph.
Minimizer index v7 (compatible with v6):
An option to use bounded syncmers instead of minimizers.
Graph algorithms in algorithms.h:
topological_order(): Find a topological order for all handles in the subgraph induced by a subset of nodes.
New queries:
GBWTGraph::cached_follow_edges(): A version of follow_edges() using CachedGBWT.
hits_in_subgraph(): Report minimizer hits in a subgraph induced by a set of nodes.
Changes to path cover:
local_haplotypes(): Revert to the path cover algorithm if there are no haplotypes in the component.
augment_gbwt(): Augment an existing GBWT with a path cover of missing components.
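
As a rough illustration of what a topological order of an induced subgraph means (see topological_order() above), here is a minimal sketch using Kahn's algorithm. It is a simplification and not the code in algorithms.h: it works on plain node ids and directed edges rather than handles with orientations, and the function name `induced_topological_order()`, its signature, and the choice to return an empty vector on a cycle are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Kahn's algorithm restricted to the subgraph induced by `subset`.
// Edges are (from, to) pairs over plain node ids; orientations are ignored.
// Returns an empty vector if the induced subgraph contains a cycle.
std::vector<std::uint64_t> induced_topological_order(
  const std::vector<std::pair<std::uint64_t, std::uint64_t>>& edges,
  const std::set<std::uint64_t>& subset)
{
  std::map<std::uint64_t, std::vector<std::uint64_t>> successors;
  std::map<std::uint64_t, std::size_t> indegree;
  for(std::uint64_t node : subset) { indegree[node] = 0; }

  // Keep only the edges with both endpoints in the subset.
  for(const auto& edge : edges)
  {
    if(subset.count(edge.first) == 0 || subset.count(edge.second) == 0) { continue; }
    successors[edge.first].push_back(edge.second);
    indegree[edge.second]++;
  }

  // Start from the nodes with no incoming edges inside the subgraph.
  std::queue<std::uint64_t> ready;
  for(const auto& entry : indegree)
  {
    if(entry.second == 0) { ready.push(entry.first); }
  }

  std::vector<std::uint64_t> order;
  while(!ready.empty())
  {
    std::uint64_t curr = ready.front(); ready.pop();
    order.push_back(curr);
    for(std::uint64_t next : successors[curr])
    {
      if(--indegree[next] == 0) { ready.push(next); }
    }
  }
  assert(order.size() <= subset.size());

  // Nodes on a cycle never reach indegree 0, so the order stays incomplete.
  if(order.size() != subset.size()) { order.clear(); }
  return order;
}
```

Edges leaving the subset are skipped, so a node only has to come after its predecessors within the selected nodes.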
v0.3 (2020-04-14)
Implemented the new SerializableHandleGraph interface.
New query MinimizerIndex::count_and_find(), which returns the occurrence count and a pointer to the internal representation of the occurrences.
Better path cover for acyclic components by always starting from a head node in such components.
Avoid querying for nonexistent nodes during construction when the source graph has gaps between node ids.
Store 64 bits of payload for each position in the minimizer index.
Minimizer index file format v6 (not compatible with the earlier versions).
v0.2 (2019-11-08)
An option to use 128-bit keys in the minimizer index, supporting up to 63 bp minimizers.
GBWT construction from a greedy path cover of an arbitrary graph.
GBWT construction by sampling local haplotypes according to their true frequencies.
Minimizer index file format v5 (compatible with v4 from GBWTGraph v0.1).
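
The link between key width and minimizer length can be made concrete with a small sketch, assuming a plain 2-bit-per-base encoding. The actual key layout in the minimizer index is not described in these notes, so `kmer_bits()` and `pack_kmer()` are hypothetical helpers, and the example packs into a 64-bit integer only to keep the code portable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bits needed to store a k-mer at 2 bits per base.
constexpr std::size_t kmer_bits(std::size_t k) { return 2 * k; }

// A 63 bp minimizer needs 126 bits, which fits in a 128-bit key.
static_assert(kmer_bits(63) <= 128, "63 bp must fit in 128 bits");

// Packs a k-mer over {A, C, G, T} into a 64-bit integer, with the first base
// in the highest used bits. Returns false for other characters or if k > 32.
bool pack_kmer(const std::string& kmer, std::uint64_t& key)
{
  if(kmer.length() > 32) { return false; }
  key = 0;
  for(char c : kmer)
  {
    std::uint64_t code;
    switch(c)
    {
      case 'A': code = 0; break;
      case 'C': code = 1; break;
      case 'G': code = 2; break;
      case 'T': code = 3; break;
      default: return false;
    }
    key = (key << 2) | code;
  }
  // Sanity check: the packed value uses at most 2k bits.
  assert(kmer.length() == 32 || key < (std::uint64_t(1) << kmer_bits(kmer.length())));
  return true;
}
```

With this encoding, "ACGT" packs to the binary value 00011011. How the index uses any bits left over in a key is outside the scope of this sketch.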
v0.1 (2019-09-06)
The first standalone release of GBWTGraph.
handlegraph::HandleGraph and handlegraph::SerializableHandleGraph interfaces.
A version of follow_edges() that only follows paths supported by the haplotypes.
Direct access to the internal sequence representation.
Graph construction from GFA 1.0 with no overlaps/containments and integer segment identifiers.
Minimizer index construction from the haplotypes.
Requires GBWT v1.0.
Future work
Merging graphs (over the same chromosome / multiple chromosomes).