Concepts Statistics - UBOdin/mimir GitHub Wiki
The mimir.statistics package includes modules for obtaining an assortment of statistical measures about data in the backend database.
FuncDeps
Tools for constructing an (approximate) functional dependency graph based on existing data in the database.
Todo: Describe FuncDeps
DetectSeries
The DetectSeries module heuristically detects columns that are likely to be meaningful sort orders on their respective tables. The current heuristics are:
- All columns with temporal types (Date and DateTime) are candidates
- Any numeric column is a candidate if it exhibits periodicity, meaning that its sorted values are separated by a constant step. For example, a column consisting of 1, 2, 3, 4, 5... is a candidate ordering.
- Approximate periodicity is also accepted according to either of the following two measures:
  - Coefficient of variation: Also known as the relative standard deviation (RSD), this is a standardized measure of the dispersion of a distribution, calculated as the ratio of the standard deviation to the mean (often expressed as a percentage). Mimir uses the RSD to evaluate whether a column follows a series: the column's values are sorted, the differences between adjacent values are computed, and the RSD of those differences indicates how closely the column follows an arithmetic progression (that is, how constant the common difference is). An RSD of 0 means all the differences are equal, and the score grows as they become less regular, so a lower score means the column is closer to a series.
  - R-squared value: Also known as the coefficient of determination, this is a statistical measure of how close the data are to a fitted regression line. Here it evaluates how far the actual values deviate from the values predicted by the estimated series. The R-squared score ranges from 0 to 1, and a higher score means the column is closer to a series.
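The two approximate measures can be illustrated with a minimal Python sketch. This is not Mimir's implementation: the function names `gap_rsd` and `series_r_squared` are hypothetical, the use of the population standard deviation is an assumption, and the regression is assumed to fit the sorted values against their rank (0, 1, 2, ...), which the description above does not spell out.

```python
from statistics import mean, pstdev


def gap_rsd(values):
    """RSD of the gaps between adjacent sorted values (hypothetical helper).

    Assumption: population standard deviation divided by the mean gap.
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two values")
    avg_gap = mean(gaps)
    if avg_gap == 0:
        raise ValueError("all values are identical, so the RSD is undefined")
    return pstdev(gaps) / avg_gap


def series_r_squared(values):
    """R-squared of a least-squares line through (rank, sorted value) pairs.

    Assumption: the "estimated series" is the line fitted against the rank.
    """
    ys = sorted(values)
    if len(ys) < 2:
        raise ValueError("need at least two values")
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if syy == 0:
        raise ValueError("all values are identical, so R-squared is undefined")
    # For simple linear regression, R-squared equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)


regular = [5, 3, 1, 4, 2]        # an exact series once sorted
irregular = [1, 2, 3, 50, 100]   # uneven gaps

print(gap_rsd(regular), series_r_squared(regular))
print(gap_rsd(irregular), series_r_squared(irregular))
```

For the exact series the gaps are all equal, so the RSD is 0 and the fitted line passes through every point. For the irregular column the RSD rises and R-squared falls. A detector along these lines would compare each score against a threshold, and the threshold values themselves are not specified here.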