Updates - ACEnglish/truvari GitHub Wiki
October 7, 2025
- New `stratp` command to automatically generate benchmark performance evaluation across stratifications
- `truvari.VariantRecord.allele_freq_annos` now stores the results to speed up reuse in e.g. `collapse`.
- Lazy importing for faster startup times
- `collapse` now allows `--sizemax -1` to work with all large SVs easily.
- New `collapse` argument `--fast-cluster` will dramatically speed up runtime when collapsing large (>100kbp) SVs
- bwapy, which is a bother to install on macs, is now optional by default (#295)
- `vcf2df --parquet` will write a parquet file, which is more stable across environments than the default joblib file.
- Miscellaneous bug fixes (#288, #286, #284, #282, #275)
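The note above about `allele_freq_annos` storing its results describes a compute-once, reuse-many pattern. The sketch below illustrates that general idea with the standard library only. The class `CachedRecord`, its `genotypes` attribute, and the returned keys are hypothetical and are not Truvari's implementation.

```python
from functools import cached_property


class CachedRecord:
    """Hypothetical stand-in for a variant record (not Truvari's class)."""

    def __init__(self, genotypes):
        # genotypes: list of allele tuples, e.g. (0, 1); None marks a missing allele
        self.genotypes = genotypes
        self.n_computed = 0  # counts how often the annotations are actually computed

    @cached_property
    def allele_freq_annos(self):
        # Computed on first access only; later accesses reuse the stored dict
        self.n_computed += 1
        alleles = [a for gt in self.genotypes for a in gt if a is not None]
        ac = sum(1 for a in alleles if a > 0)  # alternate allele count
        an = len(alleles)                      # called allele number
        return {"AC": ac, "AN": an, "AF": ac / an if an else None}


rec = CachedRecord([(0, 1), (1, 1), (None, None)])
first = rec.allele_freq_annos
second = rec.allele_freq_annos  # served from the cache
```

Repeated lookups during a comparison-heavy step such as collapsing would then cost a dictionary access rather than a pass over every sample.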
April 21, 2025
- Fixed FP BNDs being dropped (details).
- Restore default `--sizemax`
  - Some callers make SVs that span the entire chromosome, which disrupts truvari's chunking strategy
- `phab`
  - Can now harmonize samples' variants across any number of VCFs. This entails a UI change of no more `-b/-c`.
  - For large jobs, a `--lowmem` flag:
    - turns on progress bars and process pool monitoring for better tracking of failed harmonization jobs
    - Much lower memory usage, and fewer failures
  - API refactor to programmatically build haplotypes with `phab.VCFtoHaplotypes` and other phab functions
- `bench`
  - Can now run on vcfs without SAMPLE columns (e.g. annotation files)
- Fixed `anno grpaf` header tags. Some tools (e.g. IGV) don't like Type before Number.
- Fixed edge case when iterating variants not inside regions
February 16, 2025
- The default `--align` method for `phab` and `bench` switched to POA. See discussion for details.
- Fix bug in `--pick ac` where FN/FP variants were not being counted/output.
- Fix `--dup-to-ins` Ticket #258
- `ga4gh` now also writes a variant count summary json
February 5, 2025
- `bench`
  - new automatic hook into the refine step via `truvari bench --refine`
- `refine`
  - completely reworked UI in favor of easier whole-genome SV refinement. See wiki for details
  - Now writes a consolidated `refine.base.vcf.gz` and `refine.comp.vcf.gz` for easier tracking of variants' final states.
  - Default behavior counts original variant representations instead of the `phab` variant representations
- `collapse`
  - Add `--dup-to-ins`
  - Fixed bug where regions with >100 variants would sometimes not have all variants compared
  - `--chain` functionality now capped to do only 1 transitive match, preventing uncontrolled over-merging
- `ga4gh`
  - New/renamed parameters as part of general improvement work
  - Output suffixes are now `.base.vcf.gz` and `.comp.vcf.gz` for consistency.
- `stratify`
  - `--complement` now outputs a single line of total variant counts outside of the regions instead of arbitrarily assigning variants to their nearest region
- misc
  - Fix BND bugs
    - `pysam.VariantFile.allele_variant_types` falsely identified some BNDs as INDELs, causing incorrect filtering by Truvari
    - Strandedness of SVs decomposed to BNDs flipped to be more representative of the original SV
  - unroll seqsim checks all directions
  - Match sorting breaks seq/size ties with start/end distance
  - Long SV roll limit for speed: rolling is turned off for SVs ≥500bp
  - `truvari.VariantRecord.within` edge case fix
January 9, 2025
- Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also checking the lexicographically minimum rotation's similarity (details).
- Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to this discussion
- Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
- Symbolic variant sub-types are ignored e.g. `<DUP:TANDEM> == <DUP>`
- `--sizemax` now defaults to `-1`, meaning all variants ≥ `--sizemin / --sizefilt` are compared
- Redundant variants which are collapsed into kept variants (a.k.a. removed variants) are now more clearly labeled (`--removed-output` instead of `--collapsed-output`)
- Fixed 'Unknown error' caused by unset TMPDIR (#229 and #245)
- Fixes to minor record keeping bugs in refine/ga4gh better ensure all variants are counted/preserved
- BND variants are now compared by bench (details)
- Cleaner outputs by not writing matching annotations (e.g. `PctSeqSimilarity`) that are `None`
- Major refactor of Truvari package API for easy reuse of SV comparison functions (details)
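The first item in this release mentions checking the similarity of the lexicographically minimum rotation of each sequence. The sketch below illustrates that idea under stated assumptions. The function names are hypothetical, the rotation search is a naive O(n^2) scan, and `difflib.SequenceMatcher` stands in for whatever alignment-based similarity Truvari actually uses.

```python
from difflib import SequenceMatcher


def min_rotation(seq):
    """Return the lexicographically smallest rotation of seq (naive scan)."""
    if not seq:
        return seq
    return min(seq[i:] + seq[:i] for i in range(len(seq)))


def similarity(a, b):
    """Similarity in [0, 1]; difflib is an assumption, not Truvari's metric."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def rotation_aware_similarity(a, b):
    """Take the better of the direct and the minimum-rotation comparison."""
    a, b = a.upper(), b.upper()
    direct = similarity(a, b)
    rotated = similarity(min_rotation(a), min_rotation(b))
    return max(direct, rotated)


# Two representations of the same repeat unit, shifted by two bases
direct_score = similarity("ACGT", "GTAC")
rotated_score = rotation_aware_similarity("ACGT", "GTAC")
```

Two insertions that describe the same tandem repeat shifted by a few bases have different raw sequences but the same minimum rotation, so the rotated comparison can recover a match that a direct comparison scores poorly.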
September 9, 2024
- `bench`
  - Correctly filtering `ALT=*` alleles (details) and monomorphic reference
    - including test coverage this time
- `stratify`
  - Default behavior is to count variants within (#221)
- `collapse`
  - Faster sub-chunking operations by dropping use of pyintervaltree
- `anno chunks`
  - New command for identifying windows with a high number of SVs (details)
July 31, 2024
- `refine` & `stratify`
  - Fixed variant and bed boundary overlapping issue
- general
March 28, 2024
- `collapse`
  - Fewer comparisons needed per-chunk on average
  - Fixed `--chain` functionality (details)
  - Fixed `--gt` consolidation of format fields
- `bench`
  - Faster result at the cost of less complete annotations with `--short` flag
- `refine`
  - Assures variants are sequence resolved before incorporating into consensus
  - `bench --passonly --sizemax` parameters are used when building consensus for a region. Useful for `refine --use-original-vcfs`
  - When a refined region has more than 5k variants, it is skipped and a warning of the region is written to the log
  - Flag `--use-region-coords` now expands `--region` coordinates by 100bp (`phab --buffer` default) to allow variants to harmonize out of regions.
- general
  - Dynamic bed/vcf parsing tries to choose faster of streaming/fetching variants
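The `--use-region-coords` item above pads region coordinates by a 100bp buffer. A minimal sketch of that padding follows. The helper name `expand_region` is hypothetical, and the clamp at position 0 is an assumption that the changelog does not state.

```python
def expand_region(chrom, start, end, buffer=100):
    """Pad a region on both sides; clamping at 0 is an assumption."""
    return chrom, max(0, start - buffer), end + buffer


padded = expand_region("chr1", 1000, 2000)
near_start = expand_region("chr1", 50, 200)
```

Padding lets a variant whose harmonized representation shifts slightly past a region boundary still be considered with that region.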
February 6, 2024
- `collapse`
  - Faster handling of genotype data for `--gt` and `--keep common`
- general
  - Fix to bed end position bug for including variants (details)
  - Fix to Dockerfile
- `refine`
  - Changes to `--recount` that accompany the fix to bed end positions.
- New command `ga4gh` to convert Truvari results into GA4GH truth/query VCFs with intermediate tags
January 12, 2024
- `collapse`
  - New parameter `--gt` disallows intra-sample events to collapse (details)
  - New parameter `--intra` for consolidating SAMPLE information during intra-sample collapsing (details)
  - Preserve phasing information when available
  - Faster O(n-1) algorithm instead of O(n^2)
  - Faster sub-chunking strategy makes smaller chunks of variants needing fewer comparisons
  - Fixed rare non-determinism error in cases where multiple variants are at the same position and equal qual/ac could be ordered differently.
- `phab`
  - Correct sample handling with `--bSamples` `--cSamples` parameters
  - Faster generation of consensus sequence
  - Resolved 'overlapping' variant issue causing variants to be dropped
  - New `poa` approach to harmonization. Faster than mafft but less accurate. Slower than wfa but more accurate.
- `bench`
  - New, easier `MatchId` field to track which baseline/comparison variants match up (details)
  - `entry_is_present` method now considers partial missing variants (e.g. `./1`) as present
  - Removed the 'weighted' metrics from `summary.json`
- `consistency`
  - Fixed issue with counting duplicate records
  - Added flag to optionally ignore duplicate records
- `anno svinfo` now overwrites existing SVLEN/SVTYPE info fields
- general
  - Reduced FN matches for unroll sequence similarity by reporting maximum of multiple manipulations of variant sequence (roll up/down/none). Comes at a small, but reasonable, expense of some more FP matches.
  - Bump pysam version
  - Fixed bug in `unroll` sequence similarity that sometimes rolled from the wrong end
  - Fixed bug for handling of None in ALT field
  - `truvari.compress_index_vcf` forces overwriting of tabix index to prevent annoying crashes
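The `entry_is_present` change above treats partially missing genotypes such as `./1` as present. The sketch below shows one way to express that rule. It assumes genotypes are tuples of allele indices with `None` for a missing allele (the convention pysam uses), and that "present" means at least one called non-reference allele. The function name `is_present` is hypothetical and is not Truvari's code.

```python
def is_present(gt):
    """True if any called allele is non-reference; missing alleles are ignored."""
    return any(allele is not None and allele > 0 for allele in gt)


examples = {
    "./1": is_present((None, 1)),     # partially missing, alternate allele seen
    "0/1": is_present((0, 1)),
    "0/0": is_present((0, 0)),
    "./.": is_present((None, None)),  # fully missing
    "./0": is_present((None, 0)),     # partially missing, only reference seen
}
```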
August 7, 2023
- `bench`
- `phab`
  - ~2x faster via reduced IO from operating in stages instead of per-region
  - Removed most external calls (e.g. samtools doesn't need to be in the environment anymore)
  - new `--align wfa` allows much faster (but slightly less accurate) variant harmonization
  - increased determinism of results (details)
- `refine`
  - Faster bed file intersection of `--includebed` and `--regions`
  - Refine pre-flight check
  - Correct refine.regions.txt end position from IntervalTree correction
  - Better refine region selection with `--use-original`
  - `--use-includebed` switched to `--use-region-coords` so that default behavior is to prefer the includebed's coordinates
  - `--use-original-vcfs` to use the original pre-bench VCFs
  - `refine.variant_summary.json` is cleaned of uninformative metrics
- `stratify`
  - parallel parsing of truvari directory to make processing ~4x faster
- `msa2vcf` Fixed REPL decomposition bug to now preserve haplotypes
- `anno grpaf` - expanded annotation info fields
- `anno density` - new parameter `--stepsize` for sliding windows
- `collapse`
  - New optional `--median-info` fields #146
- Minor updates
  - Fix some `anno` threading on macOS #154
  - Monomorphic/multiallelic check fix in `bench`
  - `PHAB_WRITE_MAFFT` environment variable to facilitate updating functional test answer key
  - Slightly slimmer docker container
March 13, 2023
As part of the GIAB TR effort, we have made many changes to Truvari's tooling to enable comparison of variants in TR regions down to 5bp. Additionally, in order to keep Truvari user friendly we have made changes to the UI. Namely, we've updated some default parameters, some command-line arguments, and some outputs. There are also a few new tools and how a couple of tools work has changed. Therefore, we decided to bump to a new major release. If you're using Truvari in any kind of production capacity, be sure to test your pipeline before moving to v4.0.
- New `refine` command for refining benchmarking results. Details
- `bench`
  - Unroll is now the default sequence comparison approach.
  - New `--pick` parameter to control the number of matches a variant can participate in (details)
  - The `summary.txt` is now named `summary.json`
  - Outputs parameters to `params.json`
  - Output VCFs are sorted, compressed, and indexed
  - Ambiguous use of 'call' in outputs corrected to 'comp' (e.g. `tp-call.vcf.gz` is now `tp-comp.vcf.gz`)
  - Renamed `--pctsim` parameter to `--pctseq`
  - Fixed bug where FP/FN weren't getting the correct, highest scoring match reported
  - Fixed bug where `INFO/Multi` wasn't being properly applied
  - Fixed bug where variants spanning exactly one `--includebed` region were erroneously being counted.
  - Removed parameters: `--giabreport`, `--gtcomp`, `--multimatch`, `--use-lev`, `--prog`, `--unroll`
- `collapse`
  - Renamed `--pctsim` parameter to `--pctseq`
  - Runtime reduction by ~40% with short-circuiting during `Matcher.build_match`
  - Better output sorting which may allow pipelines to be a little faster.
- `vcf2df`
  - More granular sizebins for `[0,50)` including better handling of SNPs
  - `--multisample` is removed. Now automatically adds all samples with `--format`
  - key index column removed and replaced by chrom, start, end. Makes rows easier to read and easier to work with e.g. pyranges
- `anno`
  - Simplified UI. Commands that work on a single VCF and can stream (stdin/stdout) no longer use `--input` but a positional argument.
  - Added `addid`
- `consistency`
  - Slight speed improvement
  - Better json output format
- `segment`
  - Added `--passonly` flag
  - Changed UI, including writing to stdout by default
  - Fixed END and 1bp DEL bugs, now adds N to segmented variants' REF, and info fields SVTYPE/SVLEN
- API
  - Began a focused effort on improving re-usability of Truvari code.
  - Entry point to run benchmarking programmatically with Bench object.
  - Better development version tracking (details).
  - Improved developer documentation. See readthedocs
- general
  - msa2vcf now left-trims and decomposes variants into indels
  - Functional tests reorganization
  - Fix for off-by-one errors when using pyintervaltree. See ticket
  - Removed progressbar and Levenshtein dependencies as they are no longer used.
August 27, 2022
- `bench`
  - `--dup-to-ins` flag automatically treats SVTYPE==DUP as INS, which helps compare some programs/benchmarks
  - New `--unroll` sequence comparison method for `bench` and `collapse` (details)
- Major `anno trf` refactor (TODO write docs) including:
  - annotation of DEL is fixed (was reporting the ALT copy numbers, not the sample's copy numbers after incorporating the ALT)
  - allow 'denovo' annotation by applying any TRF annotations found, not just those with corresponding annotations
- New `anno grpaf` annotates vcf with allele frequency info for groups of samples
- New `phab` for variant harmonization (details)
- backend
  - `truvari.entry_size` returns the length of the event in the cases where len(REF) == len(ALT) (e.g. SNPs entry_size is 1)
  - New key utility for `truvari.build_anno_trees`
- general
  - Float metrics written to the VCF (e.g. PctSizeSimilarity) are rounded to precision of 4
  - Nice colors in some `--help` with rich
- `divide`
  - output shards are now more easily sorted (i.e. `ls divide_result/*.vcf.gz` will return the shards in the order they were made)
  - compression/indexing of sub-VCFs in separate threads, reducing runtime
- user issues
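The backend note in this release about `truvari.entry_size` says that when len(REF) == len(ALT) the event length itself is returned, so a SNP has size 1. The sketch below illustrates only that rule on plain strings. The function name `entry_size_sketch` is hypothetical, and the fallback of taking the absolute length difference for other sequence-resolved alleles is an assumption rather than Truvari's documented behavior.

```python
def entry_size_sketch(ref, alt):
    """Size of a sequence-resolved variant given REF and ALT strings."""
    if len(ref) == len(alt):
        # Equal-length replacement (SNP or MNP): report the event length
        return len(ref)
    # Assumed fallback: net length change of the allele
    return abs(len(ref) - len(alt))


snp_size = entry_size_sketch("A", "T")
mnp_size = entry_size_sketch("ACG", "TGA")
ins_size = entry_size_sketch("A", "ATTTT")
```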
July 7, 2022
- Improved performance of `consistency` (see #127)
- Added optional json output of `consistency` report
- Allow GT to be missing, which is allowed by VCF format specification
- TRF now uses `truvari.entry_variant_type` instead of trying to use `pysam.VariantRecord.info["SVLEN"]` directly which allows greater flexibility.
- vcf2df now parses fields with `Number=\d` (e.g. 2+), which is a valid description
- `truvari.seqsim` is now case insensitive (see #128)
- Collapse option to skip consolidation of genotype information so kept alleles are unaltered
- `truvari anno dpcnt --present` will only count the depths of non ./. variants
- New collapse annotation `NumConsolidate` records how many FORMATs were consolidated
- Official conda support
May 25, 2022
- New utilities `vcf_ranges` and `make_temp_filename`
- New annotations `dpcnt` and `lcr`
- Fixed a bug in `truvari collapse --keep` that prevented the `maxqual` or `common` options from working
- Increased determinism for `truvari collapse` so that in cases of tied variant position the longer allele is returned. If the alleles also have the same length, they are sorted alphabetically by the REF
- New `truvari bench --extend` functionality. See discussion for details
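The `truvari collapse` determinism rule in this release (tied position, then longer allele, then alphabetical REF) can be expressed as a sort key. The sketch below uses plain tuples and a hypothetical `tie_break_key` helper. Measuring "longer allele" as the larger of len(REF) and len(ALT) is an assumption, since the changelog does not say which allele length is compared.

```python
def tie_break_key(variant):
    """Sort by chrom and position, then longer allele first, then REF."""
    chrom, pos, ref, alt = variant
    allele_len = max(len(ref), len(alt))  # assumed definition of allele length
    return (chrom, pos, -allele_len, ref)


variants = [
    ("chr1", 100, "C", "CGG"),
    ("chr1", 100, "A", "ATT"),
    ("chr1", 100, "A", "ATTTT"),
    ("chr1", 50, "G", "GA"),
]
ordered = sorted(variants, key=tie_break_key)
```

Because every component of the key is fully determined by the record itself, the ordering no longer depends on input order when positions tie.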
Apr 1, 2022
- Removed `truvari.copy_entry` for `pysam.VariantRecord.translate`, a 10x faster operation
- Faster `truvari collapse` (@c8b319b)
- When building `MatchResult` between variants with shared start/end positions, we save processing work by skipping haplotype creation and just comparing REFs/ALTs directly.
- Updated documentation to reference the paper https://doi.org/10.1101/2022.02.21.481353
- New `truvari anno density` for identifying regions with 'sparse' and 'dense' overlapping SVs (details)
- Better `bench` genotype reporting with `summary.txt` having a `gt_matrix` of Base GT x Comp GT for all Base calls' best, TP match.
- New `truvari anno bpovl` for intersecting against tab-delimited files (details)
- New `truvari divide` command to split VCFs into independent parts (details)
- Replaced `--buffer` parameter with `--minhaplen` for slightly better matching specificity
- Bugfix - `truvari anno trf` no longer duplicates entries spanning multiple parallelization regions
- Bugfix - `collapse` MatchId/CollapseId annotation wasn't working
- Bugfixes - from wwliao (@4dd9968 @ef2cfb3)
- Bugfixes - Issues #107, #108
Dec 22, 2021
- `bench` now annotates FPs by working a little differently. See bench for details.
- Recalibrated TruScore and new reciprocal overlap measurement for sequence resolved `INS` (details)
- Match objects are now usable via the SDK. See #94 for an example of using Truvari programmatically
- `file_zipper` VCF iteration strategy (GenomeTree -> RegionVCFIterator) that improves speed, particularly when using `--includebed`
- `collapse` refactored to use Match object and for prettier code, cleaner output.
- `anno remap` now optionally adds `INFO` field of the location of the top N hits.
- An experimental tool `truvari segment` added to help SV association analysis.
- `vcf2df` now supports pulling `FORMAT` fields from multiple samples.
- `vcf2df` now adds `('_ref', '_alt')`, or `('_ref', '_het', '_hom')` for `INFO,Number=[R|G]` fields, respectively.
- Improved documentation, including http://truvari.readthedocs.io/ for developers.
- Increasing/diversifying test coverage exposed minor bugs which were fixed.
- `bench --no-ref --cSample` bug fixes.
- Minor usability feature implemented in `help_unknown_cmd`.
Sep 15, 2021
As Truvari's adoption and functionality grow, we decided to spend time working on sustainability and performance of the tool. Multiple Actions for CI/CD have been added. Many components have been refactored for speed, and other 'cruft' code has been removed. Some of these changes (particularly the switch to using edlib for sequence similarity) affect the results. Therefore, we've bumped to a new major release version.
- Working on speed improvements
- Added edlib as the default when calculating pctseq_sim, keeping Levenshtein as an option (`--use-lev`).
- `truvari bench` summary's gt_precision/gt_recall are replaced by gt_concordance, which is just the percent of TP-comp calls with a concordant genotype. `--no-ref` has better functionality. `--giabreport` is different.
- Added `--keep common` to `truvari collapse`, which allows one to choose to keep the allele with the highest MAC.
- `truvari collapse --hap` wasn't working correctly. The assumptions about the calls being phased weren't being properly used (e.g. don't collapse 1|1) and the NumCollapsed was being populated before the single-best match was chosen. The latter is a reporting problem, but the former had an effect on the results with ~3% of collapsed calls being mis-collapsed.
- `truvari anno trf` is now faster and simpler in its approach and what's reported, and hopefully more useful.
- `truvari anno grm` has min_size and regions arguments added.
- truv2df has become `truvari vcf2df` where the default is vcf conversion with options to run on a `truvari bench` output directory. It also allows a specific sample to be parsed with `--format` and better Number=A handling.
- NeighId added to `truvari anno numneigh`, which works like bedtools cluster.
- The method af_calc now makes MAC/AC.
- Added 'partial' to `truvari anno remap`.
- Added `truvari anno svinfo`.
- Removed `truvari stats` as `truvari vcf2df` is better and began building community-driven summaries.
- Ubiquitous single version.
- Added a Dockerfile and instructions for making a Truvari docker container.
- Code and repository cleaning.
- Github actions for automated pylint, testing, and releases to pypi.
- Preserving per-version documentation from the wiki in `docs/`.
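The gt_concordance metric introduced in this release is described as the share of TP-comp calls whose genotype agrees with the matched base call. The sketch below shows that arithmetic on plain tuples. The function name, the input layout of (base GT, comp GT) pairs, returning a fraction rather than a percentage, and ignoring phase by sorting alleles are all assumptions for illustration.

```python
def gt_concordance(tp_pairs):
    """Fraction of TP pairs with matching unphased genotypes, or None if empty."""
    if not tp_pairs:
        return None
    concordant = sum(
        1 for base_gt, comp_gt in tp_pairs
        if sorted(base_gt) == sorted(comp_gt)  # compare ignoring allele order
    )
    return concordant / len(tp_pairs)


pairs = [
    ((0, 1), (1, 0)),  # concordant once allele order is ignored
    ((1, 1), (0, 1)),  # discordant: hom-alt vs het
    ((0, 1), (0, 1)),
    ((1, 1), (1, 1)),
]
score = gt_concordance(pairs)
```

Multiplying the result by 100 gives the percentage form used in the description above.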
Jan 27, 2021
We've expanded and improved Truvari's annotations. We've added an SV "collapsing" tool. And we've added a way to turn VCFs into pandas DataFrames easily for downstream analysis/QC.
May 14, 2020
After performing a drastic code refactor, we were able to create several helper methods from Truvari's core functionality around SV comparisons and VCF manipulations. This reusable code gave us an opportunity to create tools relevant for SV analysis.
Truvari now contains multiple subcommands. In addition to the original benchmarking functionality (truvari bench), Truvari can generate SV relevant summary statistics, compute consistency of calls within VCFs, and we've begun to develop annotations for SVs. Details on these tools are on the WIKI.
We are committed to continually improving Truvari with the hopes of advancing the study and analysis of structural variation.
September 25th, 2019
Truvari has some big changes. In order to keep up with the retirement of Python 2.7 (https://pythonclock.org/), we're now only supporting Python 3.
Additionally, we now package Truvari so it and its dependencies can be installed directly. See Installation below. This will enable us to refactor the code for easier maintenance and reusability.
Finally, we now automatically report genotype comparisons in the summary stats.