Fishtest Mathematics - official-stockfish/fishtest GitHub Wiki
This document outlines the core statistical models, testing methodologies, and algorithms employed by Fishtest for chess engine evaluation and parameter tuning.
Fishtest utilizes advanced models to accurately represent the outcome probabilities in chess engine matches:
-
Pentanomial Model:
- Fishtest employs the pentanomial model for analyzing engine match results.
- This model considers five outcomes (e.g., ll, ld, dd/wl, wd, ww), providing a more accurate representation than the traditional trinomial model (win, draw, loss).
- Benefit: Leads to a substantial saving of testing resources compared to the trinomial model.
-
Opening Book Bias Estimation:
- The difference between the results predicted by the pentanomial and trinomial models allows for an estimation of the Root Mean Square (RMS) value of biases present in the opening book.
- This calculation is based on the accounting identity [archive].
For efficient sequential testing (determining if a change is statistically significant as quickly as possible), Fishtest uses the Generalized Sequential Probability Ratio Test (GSPRT):
-
Core Principle (GSPRT):
- Fishtest uses the GSPRT, a generalization of the standard SPRT.
- In GSPRT, unknown parameters within the statistical hypotheses (H0 and H1) are replaced by their maximum likelihood estimates.
- It's shown (loc. cit.) that GSPRT asymptotically behaves like SPRT when the error probabilities approach zero.
- The asymptotic guarantees of GSPRT do not depend on the Elo differences being tested being small.
- Normalized Elo for Bounds:
-
Elo Estimation from GSPRT Results:
- To estimate Elo differences from completed GSPRT tests, Fishtest uses formula (6.1) in this document [archive].
- The underlying model approximates the GSPRT process as a random walk, specifically using formula (2.1) from this document [archive].
- This approximation essentially treats the GSPRT like an SPRT for the drift of a Brownian motion, where the infinitesimal variance is estimated from the sample data.
-
SPRT Calculator and Resource Estimation:
- The web-based SPRT calculator and the resource consumption estimation tables rely on formulas detailed in this document [archive].
Fishtest uses the Simultaneous Perturbation Stochastic Approximation (SPSA) algorithm for parameter tuning tasks.
To ensure the reliability of test results contributed by distributed workers, Fishtest uses a χ2 (Chi-squared) test to detect statistically anomalous workers whose results deviate significantly from the norm.
Many of the formulas and statistical models used in Fishtest have been validated through simulation: