Code Quality Benchmark - adrianzap/softwipe GitHub Wiki

To generate a benchmark, we executed softwipe on a collection of programs, most of which are bioinformatics tools from the area of evolutionary biology. Some of the tools listed below (genesis, raxml-ng, repeatscounter, hyperphylo) were developed in our lab. The table below contains the code quality scores. Note that the scores are subject to change, as we are still refining our scoring criteria and including more tools.

Softwipe scores for each category are assigned such that the "best" program in each category that is not an outlier obtains a 10 out of 10 score, and the "worst" program in each category that is not an outlier is assigned a 0 out of 10 score. An outlier is defined to be a value that lies outside of Tukey's fences.
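The scoring rule above can be illustrated with a minimal Python sketch. This is not softwipe's actual implementation: the fence factor k = 1.5 (the conventional choice for Tukey's fences), the quartile method, the linear interpolation between the worst and best non-outlier values, and the clamping of outliers to 0 or 10 are all assumptions made for illustration, and the function names are hypothetical.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) Tukey fences; k=1.5 is an assumed default."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def best_and_worst(rates, higher_is_better):
    """Pick the best and worst rate among the values inside the fences."""
    lower, upper = tukey_fences(rates)
    inliers = [r for r in rates if lower <= r <= upper]
    if higher_is_better:
        return max(inliers), min(inliers)
    return min(inliers), max(inliers)


def score(rate, best, worst):
    """Map a rate to 0..10 (assumed linear), clamping outliers."""
    if best == worst:
        return 10.0
    raw = 10.0 * (rate - worst) / (best - worst)
    return max(0.0, min(10.0, raw))


# Hypothetical warnings-per-LOC rates; lower is better, 1.0 is an outlier.
rates = [0.01, 0.02, 0.03, 0.04, 0.05, 1.0]
best, worst = best_and_worst(rates, higher_is_better=False)
scores = [score(r, best, worst) for r in rates]
```

In this sketch, the outlier at 1.0 does not stretch the scale: it is simply clamped to a score of 0, while the remaining programs are spread over the full range from 0 to 10.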

All code quality categories use relative scores. For instance, we calculate the number of compiler warnings per total Lines Of Code (LOC). Hence, we can use those relative scores to compare and rank the different programs in our benchmark. The overall score that is used for our ranking is simply the average over all score categories. You can find a detailed description of the scoring categories and the tools included in our benchmark below.

program	overall	relative score	compiler_and_sanitizer	assertions	cppcheck	clang_tidy	cyclomatic_complexity	lizard_warnings	unique	kwstyle	infer	test_count
genesis-0.24.0	9.0	9.1	9.9	8.7	8.4	9.2	9.0	9.4	8.2	8.2	N/A	10.0
fastspar	8.3	8.6	9.6	2.0	9.9	9.9	8.8	7.9	8.8	6.4	9.7	10.0
axe-0.3.3	7.6	7.6	9.4	1.2	6.6	9.3	6.2	7.6	8.4	9.8	N/A	10.0
pstl	7.5	7.1	10.0	0.4	8.0	5.6	9.3	9.9	6.3	8.4	N/A	10.0
raxml-ng_v1.0.1	7.5	7.8	9.9	4.2	6.6	9.0	7.9	6.6	4.0	9.2	N/A	10.0
kahypar	7.3	7.6	6.7	2.4	8.0	N/A	9.2	9.6	3.3	9.1	N/A	10.0
bindash-1.0	7.2	6.9	8.3	8.8	5.8	7.1	8.7	9.5	8.2	8.5	N/A	0.0
ExpansionHunter-4.0.2	7.2	7.3	8.7	1.8	8.6	9.4	8.9	9.1	0.4	7.9	N/A	10.0
ripser-1.2.1	6.9	6.7	10.0	6.3	6.4	2.4	8.9	9.1	8.6	9.9	7.1	0.0
naf-1.1.0/unnaf	6.8	7.3	9.9	4.0	9.8	10.0	6.9	7.5	7.2	3.3	9.5	0.0
virulign-1.0.1	6.8	7.0	9.1	3.4	9.4	9.0	7.3	5.8	7.5	9.3	N/A	0.0
naf-1.1.0/ennaf	6.8	6.8	9.9	10.0	9.4	10.0	7.2	6.7	0.0	5.2	9.0	0.0
glucose-3-drup	6.7	6.7	8.6	10.0	5.2	9.4	8.7	8.4	8.5	1.4	N/A	0.0
Treerecs-v1.2	6.7	6.6	5.8	1.8	6.7	8.6	9.0	9.0	1.6	7.5	N/A	10.0
dawg-1.2	6.6	6.6	10.0	0.0	6.3	10.0	8.4	8.1	7.9	9.1	N/A	0.0
RepeatsCounter	6.6	6.1	7.7	0.0	7.0	6.8	9.0	10.0	9.3	9.5	N/A	0.0
samtools-1.11	6.5	6.6	8.6	1.2	7.4	9.1	3.8	2.2	8.2	6.3	8.1	9.9
bpp-4.3.8	6.4	6.4	9.8	9.3	7.1	8.9	2.8	2.0	6.6	9.3	7.9	0.0
swarm-3.0.0	6.3	6.1	10.0	0.3	9.3	3.8	8.0	7.7	4.3	9.9	10.0	0.0
usher-0.3.2	6.3	6.4	8.9	2.1	7.4	9.3	7.5	7.5	7.5	6.4	N/A	0.0
ntEdit-1.2.3	6.1	6.0	8.4	0.0	7.1	9.7	7.9	6.7	3.8	7.7	9.4	0.0
prank-msa	5.9	6.2	5.3	5.1	9.9	9.0	7.0	6.6	1.4	5.8	9.0	0.0
IQ-TREE-2.0.6	5.9	5.5	2.3	2.5	4.7	7.8	8.2	7.7	5.3	6.6	N/A	7.7
emeraLD	5.7	5.5	4.2	0.0	9.4	8.4	6.3	5.3	9.0	8.6	N/A	0.0
dna-nn-0.1	5.6	5.4	7.9	4.1	6.8	6.0	6.7	5.0	6.1	7.8	N/A	0.0
openmp	5.5	5.4	5.8	0.9	0.2	1.5	8.1	7.3	7.6	8.3	N/A	10.0
HLA-LA	5.5	5.5	7.9	10.0	4.1	9.5	5.0	4.1	2.9	3.1	8.0	0.0
BGSA-1.0	5.4	5.0	7.3	0.0	0.2	10.0	7.5	6.8	8.2	9.4	5.1	0.0
minimap2-2.17	5.3	4.9	6.8	2.6	5.2	6.6	6.1	5.2	8.0	5.1	7.6	0.0
ngsTools/ngsLD	5.3	4.9	9.0	0.0	7.3	6.1	5.0	3.9	8.3	7.9	N/A	0.0
Seq-Gen-1.3.4	5.3	5.0	8.9	0.0	6.8	8.3	5.7	5.2	8.9	2.5	6.3	0.0
defor	5.3	5.2	0.1	0.0	6.1	9.4	6.9	6.4	9.0	9.4	N/A	0.0
copmem-0.2	5.2	5.2	10.0	0.2	7.6	8.6	8.5	7.8	4.2	4.5	0.3	0.0
phyml-3.3.20200621	5.2	5.3	9.6	5.5	5.0	8.1	4.3	2.7	5.9	3.7	6.8	0.0
dr_sasa_n	4.8	5.1	0.4	0.0	9.8	10.0	2.3	1.6	9.2	9.9	N/A	0.0
SF2	4.8	4.9	10.0	1.3	4.6	7.9	3.0	0.8	3.3	6.9	10.0	0.0
vsearch-2.15.1	4.7	4.4	7.1	0.0	8.2	1.1	5.0	3.9	5.6	9.7	6.6	0.0
clustal-omega-1.2.4	4.7	5.1	7.4	3.1	6.9	8.8	3.9	2.5	5.3	3.9	N/A	0.2
cellcoal-1.0.0	4.6	4.1	9.7	0.0	6.2	7.5	0.8	0.1	7.2	6.9	8.1	0.0
ms	4.6	4.6	8.4	0.0	0.0	10.0	6.2	5.3	6.4	0.0	9.6	0.0
MrBayes-3.2.7a	4.3	4.0	9.6	1.4	8.2	7.1	0.0	0.1	3.8	4.5	8.1	0.0
Gadget-2.0.7	4.2	4.2	10.0	0.0	0.0	10.0	0.4	0.1	5.4	9.1	N/A	3.0
prequal	4.1	4.6	2.4	5.9	0.3	9.9	6.0	4.0	1.0	2.8	8.8	0.0
crisflash	4.1	4.1	5.9	0.0	3.9	10.0	5.4	4.1	6.2	4.9	0.5	0.0
cryfa-18.06	3.9	4.1	6.2	2.0	0.0	9.7	5.9	5.5	6.0	0.0	N/A	0.0
athena-public-version-21.0	3.9	3.4	3.1	0.0	1.7	8.2	4.5	2.5	0.6	9.1	8.7	0.3
sumo	3.8	3.8	0.0	1.2	6.6	9.4	8.0	7.4	0.0	0.5	N/A	0.7
PopLDdecay	3.8	3.6	9.2	0.0	9.6	10.0	0.1	0.0	0.0	0.0	8.6	0.0
gargammel	3.8	3.4	10.0	0.0	8.4	6.4	0.0	0.1	0.9	3.4	9.1	0.0
mafft-7.475	3.7	3.0	9.3	0.0	6.4	7.8	0.3	0.4	0.7	6.5	4.6	0.8
covid-sim-0.13.0	2.8	2.6	7.5	0.0	5.2	0.0	0.0	0.0	7.3	0.3	N/A	4.9
INDELibleV1.03	2.5	2.3	6.1	0.0	0.7	9.3	0.7	0.8	6.7	0.0	0.5	0.0
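
As a small worked example, the overall column can be reproduced from the category scores of a single row. The sketch below assumes that categories marked N/A are left out of the average; the text does not state this explicitly, but it is consistent with the genesis-0.24.0 and fastspar rows above. The function name is hypothetical.

```python
def overall_score(category_scores):
    """Average the available category scores, skipping N/A (None) entries."""
    available = [s for s in category_scores if s is not None]
    return round(sum(available) / len(available), 1)


# Category scores of genesis-0.24.0 from the table above (infer is N/A).
genesis = [9.9, 8.7, 8.4, 9.2, 9.0, 9.4, 8.2, 8.2, None, 10.0]
# Category scores of fastspar from the table above.
fastspar = [9.6, 2.0, 9.9, 9.9, 8.8, 7.9, 8.8, 6.4, 9.7, 10.0]

print(overall_score(genesis))   # 9.0
print(overall_score(fastspar))  # 8.3
```

Because the table shows rounded scores, recomputing the average from the displayed values may differ in the last digit for some rows.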

Tools included

Bioinformatics-related tools:

indelible 1.03 simulates sequence data on phylogenetic trees paper
ms population genetics simulations paper
mafft 7.429 multiple sequence alignment paper
mrbayes 3.2.6 Bayesian phylogenetic inference paper
bpp 3.4 multispecies coalescent analyses paper
tcoffee multiple sequence alignment paper
prank 0.170427 multiple sequence alignment paper
sf (SweepFinder) population genetics paper
seq-gen 1.3.4 phylogenetic sequence evolution simulation paper
dawg 1.2 phylogenetic sequence evolution simulation github
repeatscounter evaluates quality of a data distribution for phylogenetic inference github
raxml-ng 0.8.1 phylogenetic inference paper
genesis 0.22.1 phylogeny library github
minimap 2.17-r943 pairwise sequence alignment paper
Clustal Omega 1.2.4 multiple sequence alignment paper
samtools 1.9 utilities for processing SAM (Sequence Alignment/Map) files paper
vsearch 2.13.4 metagenomics functions paper github
swarm 3.0.0 amplicon clustering paper github
phyml 3.3.20190321 phylogenetic inference paper
IQ-TREE 1.6.10 phylogenetic inference paper
cellcoal 1.0.0 coalescent simulation of single-cell NGS genotypes github
treerecs 1.0 species- and gene-tree reconciliation gitlab
HyperPhylo judicious hypergraph partitioning, for creating a data distribution for phylogenetic inference paper
HLA*LA - HLA (human leukocyte antigen) typing from linearly projected graph alignments paper
Dna-nn 0.1 implements a proof-of-concept deep-learning model to learn relatively simple features on DNA sequences paper
ntEdit 1.2.3 scalable genome sequence polishing paper
lemon framework for rapidly mining structural information from Protein Data Bank paper
DEFOR depth- and frequency-based somatic copy number alteration detector paper
naf 1.1.0 Nucleotide Archival Format for lossless reference-free compression of DNA sequences paper
ngsLD - Evaluating linkage disequilibrium using genotype likelihoods paper
dr_sasa 0.4b - Calculation of accurate interatomic contact surface areas for quantitative analysis of non-bonded molecular interactions paper
Crisflash software to generate CRISPR guide RNAs against genomes annotated with individual variation paper
BGSA 1.0 global sequence alignment toolkit paper
virulign 1.0.1 codon-correct alignment and annotation of viral genomes paper
PopLDdecay 3.40 tool for linkage disequilibrium decay analysis paper
fastspar 0.0.10 rapid and scalable correlation estimation for compositional data paper
ExpansionHunter 3.1.2 tool to analyze variation in short tandem repeat regions paper
bindash 1.0 fast genome distance estimation paper
copMEM 0.2 finding maximal exact matches via sampling both genomes paper
cryfa 18.06 secure encryption tool for genomic data paper
emeraLD rapid linkage disequilibrium estimation with massive datasets paper
axe 0.3.3 rapid sequence read demultiplexing paper
prequal detecting non-homologous characters in sets of unaligned homologous sequences paper
SCIPhI-0.1.7 mutation detection in tumor cells github
UShER a program that rapidly places new samples onto an existing phylogeny using maximum parsimony github
gargammel "a set of programs aimed at simulating ancient DNA fragments" github

Other tools:

KaHyPar hypergraph partitioning tool website
Athena++ magnetohydrodynamics paper
Gadget 2 simulations of cosmological structure formations paper
Candy Kingdom modular collection of SAT solvers and tools for structure analysis in SAT problems github
glucose-3-drup Glucose 3.0 (a SAT solver) with online DRUP proofs and proof traversal github
CovidSim 0.13.0 COVID-19 microsimulation model developed by the MRC Centre for Global Infectious Disease Analysis hosted at Imperial College, London. github
Eclipse SUMO traffic simulation package github
LLVM OpenMP github
LLVM Parallel-STL github
Ripser 1.2.1 "code for the computation of Vietoris–Rips persistence barcodes" github

Scoring categories

compiler and sanitizer: Here, we compile each benchmark tool using the clang compiler and count the number of warnings. We activate almost all warnings for this. We have weighted the warnings, such that each warning has a weight of 1, 2, or 3, where 3 is most dangerous (for instance, implicit type conversions that might result in precision loss are level 3 warnings). We calculate a weighted sum of clang warnings, where each warning that occurs in the compilation adds its level (1, 2, or 3) to the weighted sum. Additionally, we execute the tool with clang sanitizers (ASan and UBSan) and if the sanitizers find warnings, we add them to the weighted sum. Sanitizer warnings default to level 3. The compiler and sanitizer score is calculated from the weighted sum of warnings per total LOC.
assertions: The count of assertions (C-Style assert(), static_assert(), or custom assert macros, if defined) per total LOC.
cppcheck: The count of warnings found by the static code analyzer cppcheck per total LOC. Cppcheck categorizes its warnings; we have assigned each category a weight, similarly to the compiler warnings.
clang-tidy: The count of warnings found by the static code analyzer clang-tidy per total LOC. Clang-tidy categorizes its warnings; we have assigned each category a weight, similarly to the cppcheck and compiler warnings.
cyclomatic complexity: The cyclomatic complexity is a software metric to quantify the complexity/modularity of a program. See the Wikipedia article on cyclomatic complexity for details. We use the lizard tool to assess the cyclomatic complexity of our benchmark tools. Keep in mind that the above table does not contain the real cyclomatic complexity values, but the scores, which rate all tools relative to each other regarding their cyclomatic complexity.
lizard warnings: The number of functions that are considered too complex, relative to the total number of functions. Lizard counts a function as "too complex" if its cyclomatic complexity, its length, or its parameter count exceeds a certain threshold value.
unique rate: The fraction of the code that is unique; a higher amount of code duplication decreases this value. The unique rate is obtained using lizard.
kwstyle: The count of warnings found by the static code style analyzer KWStyle per total LOC. We configure KWStyle using the KWStyle.xml file that is delivered with softwipe.
infer: We weight the warnings found by the static analyzer Infer and use the weighted warnings per LOC rate to calculate a score.
test count: We relate the number of unit test LOC to the overall LOC count and compute the rate as test_code_loc/overall_loc. At the moment, the detection of unit test LOC is kept simple: every file that has the keyword "test" in its path is counted as a test code file.
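
Two of the rates described above, the weighted warnings per LOC and the test count rate, can be sketched in Python as follows. This is an illustration under stated assumptions, not softwipe's code: the function names and the example inputs are hypothetical, the "test" keyword match is assumed to be case-sensitive, and overall_loc is assumed to include the test code itself.

```python
def weighted_warning_rate(warning_levels, total_loc):
    """Sum the warning levels (1, 2, or 3) and divide by the total LOC."""
    return sum(warning_levels) / total_loc


def is_test_file(path):
    """Treat a file as test code if "test" occurs anywhere in its path."""
    return "test" in path  # case-sensitive match is an assumption


def unit_test_loc_rate(loc_by_file):
    """Compute test_code_loc / overall_loc from a {path: LOC} mapping."""
    overall_loc = sum(loc_by_file.values())
    test_code_loc = sum(
        loc for path, loc in loc_by_file.items() if is_test_file(path)
    )
    return test_code_loc / overall_loc


# Hypothetical input: four warnings with levels 3, 1, 2, 3 in 1000 LOC.
warning_rate = weighted_warning_rate([3, 1, 2, 3], total_loc=1000)

# Hypothetical project layout with LOC per file.
project = {
    "src/main.cpp": 700,
    "src/util.cpp": 100,
    "test/util_test.cpp": 200,
}
test_rate_value = unit_test_loc_rate(project)
```

Here the four warnings contribute a weighted sum of 9, and 200 of the 1000 lines are counted as test code because their path contains "test".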

Analysis tool versions

For the benchmark we used the following analysis tool versions:

clang 11.0.0
clang-tidy 5.0.1
cppcheck 2.1
lizard 1.17.7
kwstyle latest git version (25.02.2021)
infer 0.17.0

Absolute values

For comparability reasons, we provide the absolute values for all results from which the above table is derived. The following table contains each program's total lines of pure code (by which it is sorted), total number of functions, and the absolute results for each scoring category. Note that these are already weighted results; for example, a level 3 compiler warning is counted as 3 warnings here.

program	loc	functions	compiler	sanitizer	assertions	cppcheck	clang_tidy	cyclomatic_complexity	lizard_warnings	unique	kwstyle	infer	test_count
sumo	514811	23788	1664573	0	995	9585	9057	4.6	1285	0.7079	43493	N/A	2563
IQ-TREE-2.0.6	220709	10930	93864	0	852	6368	13767	4.2	527	0.9098	5827	N/A	10994
Treerecs-v1.2	171121	10189	40533	0	483	3140	6861	2.4	210	0.845	3314	N/A	64810
dr_sasa_n	146963	86	97039	0	0	182	22	11.5	15	0.9968	94	N/A	0
kahypar	109786	9732	20475	0	417	1207	N/A	1.7	72	0.8796	751	N/A	72051
MrBayes-3.2.7a	95597	962	1959	4	205	964	7969	22.6	287	0.8872	3984	242	0
openmp	91040	3530	21406	0	127	6600	22104	4.4	196	0.9479	1157	N/A	16895
raxml-ng_v1.0.1	87135	2545	645	0	572	1633	2602	4.7	181	0.8903	531	N/A	13049
samtools-1.11	78959	2321	6414	0	151	1125	2038	9.5	364	0.9626	2264	200	7626
mafft-7.475	77251	932	3275	0	0	1558	4849	17.8	226	0.81	2098	534	405
ExpansionHunter-4.0.2	72944	3945	5275	5	208	584	1296	2.8	74	0.7912	1157	N/A	18758
phyml-3.3.20200621	70845	1609	1800	1	596	1942	3762	9.0	235	0.9188	3301	298	0
athena-public-version-21.0	65302	1509	24518	1	3	2990	3325	8.8	229	0.8005	463	111	131
genesis-0.24.0	62886	3855	266	0	859	567	1472	2.4	49	0.9608	885	N/A	7658
bpp-4.3.8	41109	793	499	0	646	668	1314	10.8	129	0.9305	210	110	0
clustal-omega-1.2.4	34160	883	4970	0	162	576	1133	9.4	133	0.9106	1557	N/A	42
vsearch-2.15.1	24384	506	4039	0	0	242	6409	8.2	62	0.9142	65	107	0
prank-msa	24023	756	6334	0	188	12	660	6.0	54	0.8378	773	31	0
HLA-LA	23811	462	2817	0	1653	753	337	8.3	55	0.872	1217	62	0
usher-0.3.2	22140	849	1405	1	72	319	456	5.3	45	0.9463	610	N/A	0
covid-sim-0.13.0	13200	124	1857	0	0	350	9280	32.5	42	0.9434	1255	N/A	433
Gadget-2.0.7	12589	148	0	0	0	1534	4	16.9	47	0.9117	83	N/A	257
fastspar	11346	90	226	0	35	4	23	3.1	4	0.9779	310	4	9933
cellcoal-1.0.0	11000	66	189	0	0	229	793	14.7	21	0.9406	264	27	0
pstl	10380	1162	0	0	7	115	1303	1.6	2	0.9247	128	N/A	6617
INDELibleV1.03	9697	216	2150	0	0	543	199	14.9	45	0.9321	4252	139	0
minimap2-2.17	8841	339	1599	0	35	236	859	7.1	34	0.9569	334	28	0
swarm-3.0.0	7092	212	0	0	3	26	1204	4.6	10	0.8945	7	0	0
dawg-1.2	7058	256	0	0	0	146	0	3.9	10	0.9539	47	N/A	0
PopLDdecay	6557	57	292	0	0	16	3	19.5	20	0.4369	1418	12	0
SF2	5337	121	0	0	11	158	312	10.5	25	0.8789	129	0	0
glucose-3-drup	4772	479	390	0	149	126	78	3.3	16	0.9705	318	N/A	0
dna-nn-0.1	4768	210	574	1	30	85	541	6.4	22	0.923	82	N/A	0
ngsTools/ngsLD	4373	113	236	0	0	65	487	8.3	14	0.9643	69	N/A	0
Seq-Gen-1.3.4	3980	120	237	0	0	70	195	7.5	12	0.9828	222	19	0
gargammel	3444	17	0	0	0	31	357	48.7	5	0.8183	169	4	0
crisflash	3279	84	763	0	0	107	0	7.8	10	0.9238	128	47	0
copmem-0.2	3026	133	4	0	1	40	123	3.7	6	0.8939	125	48	0
axe-0.3.3	2781	60	94	0	5	53	54	7.0	3	0.9677	4	N/A	802
prequal	2600	99	1083	0	23	179	4	7.2	12	0.8228	139	4	0
ntEdit-1.2.3	2365	87	213	0	0	38	23	4.8	6	0.8867	42	2	0
cryfa-18.06	2216	74	473	5	7	370	20	7.3	7	0.9213	372	N/A	0
ms	2182	71	193	1	0	201	0	7.0	7	0.9263	641	1	0
emeraLD	1642	51	524	0	0	5	74	6.8	5	0.988	18	N/A	0
bindash-1.0	1622	88	152	0	23	38	133	3.2	1	0.963	19	N/A	0
naf-1.1.0/unnaf	1620	77	4	2	10	2	0	6.1	4	0.9415	80	1	0
naf-1.1.0/ennaf	1615	73	7	1	78	5	0	5.7	5	0.6041	60	2	0
BGSA-1.0	1405	30	216	0	0	100	0	5.3	2	0.9621	7	9	0
virulign-1.0.1	1149	46	56	0	6	4	33	5.6	4	0.9464	6	N/A	0
ripser-1.2.1	1053	105	0	0	10	21	221	2.8	2	0.9742	1	4	0
defor	695	27	602	0	0	15	11	6.2	2	0.9876	3	N/A	0
RepeatsCounter	243	19	30	2	0	4	22	2.4	0	1.0	1	N/A	0

How to create the benchmark

To calculate this benchmark, the results of all softwipe runs must be saved into a results directory that has one subdirectory for each tool that should be included in the benchmark. Most importantly, for each tool, the output of softwipe must be saved into a file called "softwipe_output.txt", which has to be located in the corresponding subdirectory for that tool. For example, the directory structure has to look like this:

results/
results/tool1/
results/tool1/softwipe_output.txt
results/tool2/
results/tool2/softwipe_output.txt
...

Then, the script calculate_score_table.py can be used to parse all the softwipe output files and generate a CSV file that contains all scores. The script requires the path to the results directory (results/ in our example). The script contains a list called FOLDERS that contains the names of all subdirectories that will be included in the benchmark (tool1, tool2, etc. in our example). To add a tool to or remove a tool from the benchmark, edit this list.
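
Before running the script, it can be useful to check that the results directory matches the expected layout. The following sketch is not part of calculate_score_table.py; the function name find_output_files is hypothetical, and the FOLDERS entries are the placeholder names from the example above. It only looks for a softwipe_output.txt file in each listed subdirectory.

```python
from pathlib import Path

# Placeholder subdirectory names taken from the example layout above.
FOLDERS = ["tool1", "tool2"]


def find_output_files(results_dir, folders):
    """Return ({tool: path to softwipe_output.txt}, [tools without output])."""
    found = {}
    missing = []
    for name in folders:
        candidate = Path(results_dir) / name / "softwipe_output.txt"
        if candidate.is_file():
            found[name] = candidate
        else:
            missing.append(name)
    return found, missing
```

Calling find_output_files("results", FOLDERS) returns the output files that exist and the names of the tools whose output is missing, so that incomplete runs can be spotted before the score table is generated.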

The script recalculates all scores from the rates, rather than parsing the scores directly. This is done so that softwipe doesn't need to be rerun for all tools if the scoring functions get changed. The script simply uses softwipe's scoring functions from scoring.py. These scoring functions use the values calculated by the compare_results.py script, which are the best/worst values that are not outliers, as mentioned above.