Version History - jltsiren/gbwtgraph GitHub Wiki

Current version

  • Path name fields used for identifying haplotypes in index_haplotypes_with_paths() are stored as minimizer index tag PATH_NAME_FIELDS_TAG.
  • New algorithms:
    • gbwt_to_canonical_gfa(): Optimized version of handlegraph::algorithms::canonical_gfa() for GBWTGraph.

Releases

v1.2 (2025-10-30)

  • Minimizer index serialization is more consistent with other structures.
    • Failed sanity checks are thrown as exceptions.
    • I/O error handling depends on the exception settings in the stream.
  • GBWTGraph deserialization no longer fails when node-to-segment translation contains unused segments at the end.
  • Minimizer index version 11:
    • The index is parameterized by key type but not by value type.
    • Values are now pairs consisting of a Position and 0–15 words of payload.
    • Payload size is a runtime parameter defined when building the index.
    • Serialization is defined for all minimizer indexes.
    • Tags structure for storing arbitrary key-value pairs.

v1.1 (2025-03-07)

  • Bug fixes:
    • GFA parsing now understands W-lines with with missing (*) SeqStart / SeqEnd fields.
  • Improved support for parallelizing GBWT construction over graph components:
    • MetadataBuilder for generating GBWT metadata.
    • assing_paths() and insert_paths() for including reference/generic paths in the built GBWT.
    • ConstructionJobs::contig_names() for determining contig names for graph components.
  • Subgraph queries similar to GBZ-base:
    • Extract a context within a given distance of a reference path offset/interval or a node.
    • Path-based queries require building a PathIndex for the GBZ graph.
    • The resulting Subgraph object can be converted to GFA.
  • Multithreaded path cover algorithms.
  • path_lcs() for finding the longest common subsequence of two paths, weighted by sequence length in the nodes.
  • Methods for changing the reference samples in a GBZ graph.
  • Sequence extraction using the gbz_extract tool.
    • The extracted sequences maintain the same order as the corresponding paths in the GBWT.
  • Minimizer index version 10.
    • No actual changes. Version number was incremented due to a breaking change in minimizer payloads used in vg.

v1.0 (2023-12-04)

  • Faster copying / moving / swapping of GBZ objects.
  • GFA compression/decompression changes:
    • L-line overlap is now written as 0M instead of *.
    • Option to extract GFA without using the node-to-segment translation.
  • Minimizer index version 9:
    • Payload is now 128 bits per hit (already in version 8).
    • Parameterized over both key type and value type, allowing the use minimizer indexes without a payload.
    • Option to provide a set of frequent kmers that should be avoided as minimizers, as in Winnowmap.
    • Plain kmer indexes can also be used.
    • Iterator over the kmers in the index.
    • Compatible with version 8 but not with earlier versions.
  • New graph operations:
    • subgraph(): Return a subgraph based on the given GBWT index (faster than using a constructor).
  • New algorithms:
    • gbwt_construction_jobs(): Partition a graph into sets of components for parallel GBWT construction.
    • partition_chains(): Partition top-level chains in a snarl decomposition between GBWT construction jobs.

v0.9 (2022-05-13)

  • Terminology change consistent with libhandlegraph senses:
    • Old reference paths are now generic (named) paths.
    • Some samples may be interpreted as reference paths.
    • All other paths are haplotype paths.
  • GFA compression/decompression changes:
    • The output is GFA 1.1 instead of GFA 1.0.
    • Path names are now interpreted as contig names instead of sample names by default.
    • Preset --pan-sn for parsing path names and extracting paths in the PanSN format.
    • --ref-only mode that extracts only graph topology and named paths as P-lines.
    • An optional list of reference samples can be stored using GFA header tag RS and GBWT tag reference_samples.
    • The API supports multiple regexes for parsing path names and assigning path senses by the first regex that matches the path name.
  • GBWTGraph implements the PathMetadata interface from libhandlegraph.
    • Metadata that depends on the sense (type) of the path.
    • Haplotype paths are not exposed through the PathHandleGraph interface.
  • Construction from HandleGraph takes an optional NamedNodeBackTranslation for building the node-to-segment translation.
  • Path cover / subsampled GBWTs now include (a selected subset of) paths from the original graph as generic paths.

v0.8.1 (2022-02-17)

  • Faster GFA decompression in some degenerate cases by caching large GBWT records.

v0.8 (2022-01-21)

  • GBWTGraph implements more libhandlegraph interfaces:
    • PathHandleGraph: Reference paths corresponding to sample name _gbwt_ref are reported as named paths.
    • NamedNodeBackTranslation: Node chopping + node id / segment name translation for GFA.
    • CachedGBWTGraph is only a HandleGraph.
  • Multithreaded GFA (de)compression in gfa2gbwt and with gfa_to_gbwt() / gbwt_to_gfa().
    • Faster and more memory-efficient compression, faster decompression.
    • Path order in the GBWT and path/edge order in the decompressed GFA file is different from the old algorithms.

v0.7 (2021-11-19)

  • File format version 3:
  • GBZ format for space-efficient storage of GFA with many paths.
    • Wrapper for GBWT and GBWTGraph serialized using the simple-sds format.
    • Better defined semantics.
  • Serialization and loading use exceptions to handle failures.
  • Requires version 2.3 of the vgteam fork of SDSL and GBWT version 1.3.
  • gfa2gbwt outputs the compressed simple-sds .gbz format by default.

v0.6 (2021-03-17)

  • Uses the vgteam fork of SDSL.
    • In CMake builds, uses the same SDSL as GBWT.
  • When a GFA file contains both P-lines and W-lines, use the P-lines as reference paths.
    • Reference paths have sample name _gbwt_ref and the path name as contig name.
  • GBWTGraph file format version 2.
    • Optional node-to-segment translation for GFA files.
    • Compatible with version 1.
  • Preliminary SegmentHandleGraph interface for GBWTGraph.
    • If node-to-segment translation is included in the graph, each handle maps to a (segment name, starting offset) pair.
    • Each node also maps to a (segment name, node id range) pair.
    • Iteration over segments and links.
  • Compressed GBWTGraph file format for GFA storage.
    • Both GBWT and GBWTGraph in the same file. GBWTGraph typically compresses to ~20% of the original space.
    • gfa2gbwt can output plain and compressed formats and convert between the two.
  • GFA extraction from GBWTGraph.

v0.5 (2021-01-15)

  • Major improvements to GFA parsing:
    • Multi-pass algorithm that validates the input before starting GBWT construction.
    • Segment names are automatically translated into integer ids if necessary.
    • Long segments are broken down into smaller nodes (at most 1024 bp by default) if necessary.
    • Segment-to-node translation table is written to basename.trans.
    • GBWT metadata can be generated by parsing path names.
    • Support for proposed W-lines.

v0.4 (2020-11-05)

  • CachedGBWTGraph: A GBWTGraph overlay that uses the cached interface automatically. Intended for algorithms that repeatedly access the edges in a small subgraph.
  • Minimizer index v7 (compatible with v6):
    • An option to use bounded syncmers instead of minimizers.
  • Graph algorithms in algorithms.h:
    • topological_order(): Find a topological order for all handles in the subgraph induced by a subset of nodes.
  • New queries:
    • GBWTGraph::cached_follow_edges(): A version of follow_edges() using CachedGBWT.
    • hits_in_subgraph(): Report minimizer hits in a subgraph induced by a set of nodes.
  • Changes to path cover:
    • local_haplotypes(): Revert to the path cover algorithm if there are no haplotypes in the component.
    • augment_gbwt(): Augment an existing GBWT with a path cover of missing components.

v0.3 (2020-04-14)

  • Implemented the new SerializableHandleGraph interface.
  • Query MinimizerIndex::count_and_find() that returns the occurrence count and a pointer to the internal representation of the occurrences.
  • Better path cover for acyclic component by always starting from a head node in such components.
  • Avoid querying for nonexistent nodes during construction when the source graph has gaps between node ids.
  • Store 64 bits of payload for each position in the minimizer index.
  • Minimizer index file format v6 (not compatible with the earlier versions).

v0.2 (2019-11-08)

  • An option to use 128-bit keys in the minimizer index, supporting up to 63 bp minimizers.
  • GBWT construction from a greedy path cover of an arbitrary graph.
  • GBWT construction by sampling local haplotypes by their true frequencies.
  • Minimizer index file format v5 (compatible with v4 from GBWTGraph v0.1).

v0.1 (2019-09-06)

  • The first standalone release of GBWTGraph.
  • handlegraph::HandleGraph and handlegraph::SerializableHandleGraph interfaces.
  • A version of follow_edges() that only follows paths supported by the haplotypes.
  • Direct access to the internal sequence representation.
  • Graph construction from GFA 1.0 with no overlaps/containments and integer segment identifiers.
  • Minimizer index construction from the haplotypes.
  • Requires GBWT v1.0.

Future work

  • Merging graphs (over the same chromosome / multiple chromosomes).