Distributions - accord-net/framework GitHub Wiki

Probability distributions

Note: For a cool example showing the probability distributions in action, please see The Statistics Workbench article at CodeProject.

Basic usage

Let's say that in our application we would need to create a Normal (Gaussian) distribution with a specific mean and standard deviation. In order to do that, let's first import the framework's namespaces into our source file:

using Accord.Statistics;
using Accord.Statistics.Distributions.Univariate;

After that, lets create a Normal distribution with μ= 4 and σ= 4.2

// Create a new Normal distribution with mu: 4 and s: 4.2
var normal = new NormalDistribution(mean: 4, stdDev: 4.2);

Now we can query any property of the distribution as we would like:

double mean   = normal.Mean;              // 4.0   (mu)
double median = normal.Median;            // 4.0   (for the Gaussian, same as the mean)
double mode   = normal.Mode;              // 4.0   (for the Gaussian, same as the mean)
double stdDev = normal.StandardDeviation; // 4.2   (sigma) 
double var    = normal.Variance;          // 17.64 (sigma squared)

And we can also query its associated functions:

// The cumulative distribution function, or the probability that a real-valued 
// random variable will be found to have a value less than or equal to some x:
double cdf = normal.DistributionFunction(x: 1.4);        // 0.26794249453351904

// The probability density function, or the relative likelihood for a real-valued 
// random variable will be found to take on a given specific value of x:
double pdf = normal.ProbabilityDensityFunction(x: 1.4);  // 0.078423391448155175

// The log of the probability density function, useful for applications where
// precision is critical
double lpdf = normal.LogProbabilityDensityFunction(x: 1.4); // -2.5456330358182586

// The complementary distribution function, or the tail function, that gives the
// probability that a real-valued random variable will be found to have a value 
// greater than some x. This function is also known as the Survival function.
double ccdf = normal.ComplementaryDistributionFunction(x: 1.4); // 0.732057505466481

// The inverse distribution function, or the Quantile function, that is able to
// revert probability values back to the real value that produces that probability
double icdf = normal.InverseDistributionFunction(p: cdf); // 1.4

// The Hazard function, or the failure rate, the event rate at time t conditional 
// on survival until time t or later. Note that this function may only make sense
// when using time-defined distributions, such as the Poisson.
double hf = normal.HazardFunction(x: 1.4);            // 0.10712736480747137

// The cumulative hazard function, that gives how much the hazard 
// function accumulated over time until a given time instant x.
double chf = normal.CumulativeHazardFunction(x: 1.4); // 0.31189620872601354

// Every distribution has a friendly string representation
string str = normal.ToString(); // N(x; mu = 4, s2 = 17.64)

Note that those functions and measures are part of the general IDistribution interface. As such, no matter what distribution we create, the usage is always the same. For example, let's switch our Normal distribution for a discrete Poisson distribution instead:

// Create a new Poisson distribution with lambda = 0.7
PoissonDistribution poisson = new PoissonDistribution(0.7);

double mean = poisson.Mean;                // 0.7    (lambda) 
double median = poisson.Median;            // 1.0
double mode = poisson.Mode;                // 0.7    (lambda)  
double stdDev = poisson.StandardDeviation; // 0.836  [sqrt((lambda))]
double var = poisson.Variance;             // 0.7    (lambda) 

// The cumulative distribution function, or the probability that a real-valued 
// random variable will be found to have a value less than or equal to some x:
double cdf = poisson.DistributionFunction(k: 1);        // 0.84419501644539618

// The probability density function, or the relative likelihood for a real-valued 
// random variable will be found to take on a given specific value of x:
double pdf = poisson.ProbabilityMassFunction(k: 1);  // 0.34760971265398666

// The log of the probability density function, useful for applications where
// precision is critical
double lpdf = poisson.LogProbabilityMassFunction(k: 1); // -1.0566749439387324

// The complementary distribution function, or the tail function, that gives the
// probability that a real-valued random variable will be found to have a value 
// greater than some x. This function is also known as the Survival function.
double ccdf = poisson.ComplementaryDistributionFunction(k: 1); // 0.15580498355460382

// The inverse distribution function, or the Quantile function, that is able to
// revert probability values back to the real value that produces that probability
int icdf = poisson.InverseDistributionFunction(p: cdf); // 1

// The Hazard function, or the failure rate, the event rate at time t conditional 
// on survival until time t or later. Note that this function may only make sense
// when using time-defined distributions, such as the Poisson.
double hf = poisson.HazardFunction(x: 1);            // 2.2310564445595058

// The cumulative hazard function, that gives how much the hazard 
// function accumulated over time until a given time instant x.
double chf = poisson.CumulativeHazardFunction(x: 1); // 1.8591501591854034

// Every distribution has a friendly string representation
string str = poisson.ToString(); // Poisson(x; lambda = 0.7)

We can also create a multidimensional distribution in the same way. In the next example, let's see how we can create a multidimensional Gaussian distribution and compute some of its measures:

// Create a multivariate Gaussian distribution 
var dist = new MultivariateNormalDistribution
(
    // mean vector mu
    mean: new double[]  { 4, 2  },

    // covariance matrix sigma
    covariance: new double[,] 
    {
        { 0.3, 0.1 },
        { 0.1, 0.7 }
    }
);

// Common measures
double[] mean = dist.Mean;         // { 4, 2 }
double[] median = dist.Median;     // { 4, 2 }
double[] mode = dist.Mode;         // { 4, 2 }
double[,] cov = dist.Covariance;   // { { 0.3, 0.1 }, { 0.1, 0.7 } }
double[] var = dist.Variance;      // { 0.3, 0.7 } (diagonal from cov)
int dimensions = dist.Dimension;   // 2

// Probability density functions
double pdf1 = dist.ProbabilityDensityFunction(2, 5);    // 0.000000018917884164743237
double pdf2 = dist.ProbabilityDensityFunction(4, 2);    // 0.35588127170858852
double pdf3 = dist.ProbabilityDensityFunction(3, 7);    // 0.000000000036520107734505265
double lpdf = dist.LogProbabilityDensityFunction(3, 7); // -24.033158110192296

// Cumulative distribution function (for up to two dimensions)
double cdf = dist.DistributionFunction(3, 5);               // 0.033944035782101534
double ccdf = dist.ComplementaryDistributionFunction(3, 5); // 0.00016755510356109232

Estimating a distribution from data values

It might be the case that, instead of having specific values for the mean and standard deviation, we would like to estimate a Normal distribution directly from a set of values. For example, let's assume we have the following values:

// Suppose those are our observed values
double[] observations = { 0.10, 0.40, 2.00, 2.00 };

// Now, in order to fit a distribution to those values, 
// all we have to do is to create a base distribution:
var normal = new NormalDistribution();

// and then pass those values to its .Fit method:
normal.Fit(observations);

// Result is N(x; 1.125, 1.017)
string text = normal.ToString();

Now, let's suppose that some of our samples are actually more significant than others. Some of them have more weight. In this case, we can specify a real value associated with each of the observations giving how much weight each particular observation has in comparison to others. We can then pass both vectors, the observations and the weights, to the distribution fitting process as shown below.

// Suppose those are our observed values and their individual weights
double[] observations = { 0.10, 0.40, 2.00, 2.00 };
double[] weights = { 0.25, 0.25, 0.25, 0.25 };

// We create a base distribution:
var normal = new NormalDistribution();

// and then pass the values and weights to its .Fit method:
normal.Fit(observations, weights);

// Result is N(x; 1.125, 1.017)
string text = normal.ToString();

The following example shows how to fit one multidimensional Gaussian distribution to a set of observations. Unfortunately, those observations are linearly dependent and result in a non-positive definite covariance matrix. In order to circunvent this problem, we pass a fitting options object specifying that the estimation should be done in a robust manner:

// Suppose we would like to fit a multidimensional Normal distribution to
// those samples. Those samples are linearly dependent between themselves,
// and attempting to fit them directly might result in an exception.
double[][] observations = 
{
    new double[] { 1, 2 },
    new double[] { 2, 4 },
    new double[] { 3, 6 },
    new double[] { 4, 8 }
};

// Create a base distribution with two dimensions
var normal = new MultivariateNormalDistribution(2);

// Fit the distribution to the data
normal.Fit(observations, new NormalOptions()
{
    Robust = true // do a robust estimation
});

Generating new samples from an existing distribution

It is possible to generate new samples from any univariate distribution in the framework. Some distributions have specialized sample generation methods that are faster than others, but all of them can be sampled, be through a specialized method or using the default Quantile Function method.

The following example shows how to generate 1000000 samples from a Normal distribution with mean 2 and standard deviation 5.

// Create a Normal with mean 2 and sigma 5
var normal = new NormalDistribution(2, 5);

// Generate 1000000 samples from it
double[] samples = normal.Generate(1000000);

// Try to estimate a new Normal distribution from the 
// generated samples to check if they indeed match
var actual = NormalDistribution.Estimate(samples);

string result = actual.ToString("N2"); // N(x; mu = 2.01, s2 = 25.03)

The following example shows how to generate 1000000 samples from a Binomial distribution with 4 trials and a probability of success of 0.2.

// Create a Binomial with n = 4 and p = 0.2
var binomial = new BinomialDistribution(4, 0.2);

// Generate 1000000 samples from it
double[] samples = target.Generate(1000000).ToDouble();

// Try to estimate a new Binomial distribution from
// generated samples to check if they indeed match
var actual = new BinomialDistribution(4);
actual.Fit(samples);

string result = actual.ToString("N2"); // Binomial(x; n = 4.00, p = 0.20)
The following example shows how to generate 1000000 samples from a Multivariate Normal Distribution with mean vector [2, 6] and covariance matrix [2 1; 1 5].
// mean vector
double[] mu = { 2.0, 6.0 };

// covariance
double[,] cov = 
{
    { 2, 1 },
    { 1, 5 } 
};

// Create a multivariate Normal distribution
var normal = new MultivariateNormalDistribution(mu, cov);

// Generate 1000000 samples from it
double[][] samples = normal.Generate(1000000);

// Try to estimate a new Normal distribution from
// generated samples to check if they indeed match
var actual = MultivariateNormalDistribution.Estimate(samples);

Creating a new probability distribution

In the Accord.NET Framework, there are two ways to create a new probability distribution: using lambda functions or using inheritance. While the later is the prefered way to create new distributions since it is much easier to share those distributions accross different classes and projects, using lambda functions might easier for writing quick-and-dirty code when we would like to test something as fast as possible.

Using lambda functions

To create a new distribution using a lambda function we can use the GeneralContinuousDistribution class. This class can create new distributions based on a lambda function defining the CDF, the PDF, or both functions or the desired distribution.

  // Let's suppose we have a formula that defines a probability distribution
  // but we dont know much else about it. We don't know the form of its cumulative
  // distribution function, for example. We would then like to know more about
  // it, such as the underlying distribution's moments, characteristics, and 
  // properties.

  // Let's suppose the formula we have is this one:
  double mu = 5;
  double sigma = 4.2;

  Func<double, double> df = x =>
      1.0 / (sigma * Math.Sqrt(2 * Math.PI)) * Math.Exp(-Math.Pow(x - mu, 2) / (2 * sigma * sigma));

  // And for the moment, let's also pretend we don't know it is actually the
  // p.d.f. of a Gaussian distribution with mean 5 and std. deviation of 4.2.

  // So, let's create a distribution based _solely_ on the formula we have:
  var distribution = GeneralContinuousDistribution.FromDensityFunction(df);

  // Now, we can check everything that we can know about it:
  double mean = distribution.Mean;     // 5
  double median = distribution.Median; // 5
  double var = distribution.Variance;  // 17.64
  double mode = distribution.Mode;     // 5

  double cdf = distribution.DistributionFunction(x: 1.4); // 0.19568296915377595
  double pdf = distribution.ProbabilityDensityFunction(x: 1.4); // 0.065784567984404935
  double lpdf = distribution.LogProbabilityDensityFunction(x: 1.4); // -2.7213699972695058

  double ccdf = distribution.ComplementaryDistributionFunction(x: 1.4); // 0.80431703084622408
  double icdf = distribution.InverseDistributionFunction(p: cdf); // 1.3999999997024655

  double hf = distribution.HazardFunction(x: 1.4); // 0.081789351041333558
  double chf = distribution.CumulativeHazardFunction(x: 1.4); // 0.21776177055276186

Using inheritance

To create a new distribution using inheritance, first you need to know which kind of distribution you would like to create. You will need to create a new class and then choose one of the following classes to inheric from: UnivariateContinuousDistribution, UnivariateDiscreteDistribution, MultivariateContinuousDistribution or MultivariateDiscreteDistribution. Once you know what base class you have to choose, implement all virtual methods required by this class.

For example, let's suppose we would like to implement the Rademacher distribution, as given in Wikipedia.

namespace MyApplicationNamespace
{
    class RademacherDistribution
    {
    }
}

From the Wikipedia description, it is written:

"In probability theory and statistics, the Rademacher distribution (which is named after Hans Rademacher) is a discrete probability distribution where a random variate X has a 50% chance of being either +1 or -1.[1]"

Since this is a discrete probability distribution, now we know we would need to be inheriting from UnivariateDiscreteDistribution. So let's do it, right click the class name, and implement all of its virtual methods:

class RademacherDistribution : UnivariateDiscreteDistribution
{
    public override double Mean
    {
        get { throw new NotImplementedException(); }
    }

    public override double Variance
    {
        get { throw new NotImplementedException(); }
    }

    public override double Entropy
    {
        get { throw new NotImplementedException(); }
    }

    public override AForge.IntRange Support
    {
        get { throw new NotImplementedException(); }
    }

    public override double DistributionFunction(int k)
    {
        throw new NotImplementedException();
    }

    public override double ProbabilityMassFunction(int k)
    {
        throw new NotImplementedException();
    }

    public override string ToString(string format, IFormatProvider formatProvider)
    {
        throw new NotImplementedException();
    }

    public override object Clone()
    {
        throw new NotImplementedException();
    }
}

Now, all we have to do is to fill those missing fields:

class RademacherDistribution : UnivariateDiscreteDistribution
{
    public override double Mean
    {
        get { return 0; }
    }

    public override double Variance
    {
        get { return 1; }
    }

    public override double Entropy
    {
        get { return System.Math.Log(2); }
    }

    public override AForge.IntRange Support
    {
        get { return new AForge.IntRange(-1, +1); }
    }

    public override double DistributionFunction(int k)
    {
        if (k < -1) return 0;
        if (k >= 1) return 1;
        return 0.5;
    }

    public override double ProbabilityMassFunction(int k)
    {
        if (k == -1) return 0.5;
        if (k == +1) return 0.5;
        return 0.0;
    }

    public override string ToString(string format, IFormatProvider formatProvider)
    {
        return "Rademacher(x)";
    }

    public override object Clone()
    {
        return new RademacherDistribution();
    }
}

After the distribution has been created, we will be able to query it even for things we didn't implement, such as its Inverse Distribution Function, Hazard Function, and others:

var dist = new RademacherDistribution();

double median = dist.Median; // 0
double mode = dist.Mode;     // 0

double lpdf = dist.LogProbabilityMassFunction(k: -1);       // -0.69314718055994529
double ccdf = dist.ComplementaryDistributionFunction(k: 1); //  0.0

int icdf1 = dist.InverseDistributionFunction(p: 0); // -1
int icdf2 = dist.InverseDistributionFunction(p: 1); // +1

double hf = dist.HazardFunction(x: 0);            // 0.0
double chf = dist.CumulativeHazardFunction(x: 0); // 0.69314718055994529

string str = dist.ToString(); // Rademacher(x)

Guessing a plausible distribution for your data

Let's say you have a bunch of data, and you would like to discover which statistical distribution would be a good fit for it. The Accord.NET Framework allows you to rank statistical distributions according to they goodness-of-fit using, for example, Kolmogorov-Smirnov's suite of tests.

// Create a new distribution, such as a Poisson
var poisson = new PoissonDistribution(lambda: 0.42);

// Draw enough samples from it
double[] samples = poisson.Generate(1000000).ToDouble();

// Let's pretend we don't know from which distribution
// those sample come from, and create an analysis object
// to check it for us:
var analysis = new DistributionAnalysis(samples);

// Compute the analysis
analysis.Compute();

// Get the most likely distribution (first)
var mostLikely = analysis.GoodnessOfFit[0];

// The result should be Poisson(x; lambda = 0.420961)
var result = mostLikely.Distribution.ToString();

As you can see, the framework could correctly guess that the samples came from a Poisson distribution.

More advanced examples

Mixture distributions

The following example shows how to create a mixture distribution of any number of component distributions.

// Create a new mixture containing two Normal distributions
Mixture<NormalDistribution> mix = new Mixture<NormalDistribution>(
    new NormalDistribution(2, 1), new NormalDistribution(5, 1));

// Common measures
double mean = mix.Mean;     // 3.5
double median = mix.Median;   // 3.4999998506015895
double var = mix.Variance; // 3.25

// Cumulative distribution functions
double cdf = mix.DistributionFunction(x: 4.2);              // 0.59897597553494908
double ccdf = mix.ComplementaryDistributionFunction(x: 4.2); // 0.40102402446505092

// Probability mass functions
double pmf1 = mix.ProbabilityDensityFunction(x: 1.2); // 0.14499174984363708
double pmf2 = mix.ProbabilityDensityFunction(x: 2.3); // 0.19590437513747333
double pmf3 = mix.ProbabilityDensityFunction(x: 3.7); // 0.13270883471234715
double lpmf = mix.LogProbabilityDensityFunction(x: 4.2); // -1.8165661905848629

// Quantile function
double icdf1 = mix.InverseDistributionFunction(p: 0.17); // 1.5866611690305095
double icdf2 = mix.InverseDistributionFunction(p: 0.46); // 3.1968506765456883
double icdf3 = mix.InverseDistributionFunction(p: 0.87); // 5.6437596300843076

// Hazard (failure rate) functions
double hf = mix.HazardFunction(x: 4.2); // 0.40541978256972522
double chf = mix.CumulativeHazardFunction(x: 4.2); // 0.91373394208601633

// String representation:

// Mixture(x; 0.5 * N(x; mu; = 5, s2 = 1) + 0.5 * N(x; mu = 5, s2 = 1))
string str = mix.ToString(CultureInfo.InvariantCulture);

The next example shows how to estimate one mixture distribution directly from data.

var samples1 = new NormalDistribution(mean: -2, stdDev: 1).Generate(100000);
var samples2 = new NormalDistribution(mean: +4, stdDev: 1).Generate(100000);

// Mix the samples from both distributions
var samples = samples1.Concatenate(samples2);

// Create a new mixture distribution with two Normal components
var mixture = new Mixture<NormalDistribution>(
    new NormalDistribution(-1),
    new NormalDistribution(+1));

// Estimate the distribution
mixture.Fit(samples);

var a = mixture.Components[0].ToString("N2"); // N(x; mu = -2.00, s2 = 1.00)
var b = mixture.Components[1].ToString("N2"); // N(x; mu =  4.00, s2 = 1.02)

Gaussian mixture models

The following example shows how to create multivariate Gaussian mixture models, and how to convert them to multivariate Normal mixtures.

// Observations
double[][] samples =
{
    new double[] { 0, 1 },
    new double[] { 1, 2 }, 
    new double[] { 1, 1 },
    new double[] { 0, 7 },
    new double[] { 1, 1 },
    new double[] { 6, 2 },
    new double[] { 6, 5 },
    new double[] { 5, 1 },
    new double[] { 7, 1 },
    new double[] { 5, 1 }
};

// Create a new Gaussian mixture with 2 components
var gmm = new GaussianMixtureModel(components: 2);

// Compute the model (estimate)
double error = gmm.Compute(samples);

// Classify the first samples (they should belong to the same class)
int c0 = gmm.Gaussians.Nearest(samples[0]);
int c1 = gmm.Gaussians.Nearest(samples[1]);

// Classify the last samples (they should belong to the other class)
int c7 = gmm.Gaussians.Nearest(samples[7]);
int c8 = gmm.Gaussians.Nearest(samples[8]);

// Extract the multivariate Normal distribution from it
MultivariateMixture<MultivariateNormalDistribution> mixture = 
    gmm.ToMixtureDistribution();

Independent and joint distributions

The following example shows how to create a multivariate distribution by assuming that each dimension of the data is governed by uncorrelated univariate distributions.

// Observations
double[][] samples =
{
    new double[] { 0, 1 },
    new double[] { 1, 2 }, 
    new double[] { 1, 1 },
    new double[] { 0, 7 },
    new double[] { 1, 1 },
    new double[] { 6, 2 },
    new double[] { 6, 5 },
    new double[] { 5, 1 },
    new double[] { 7, 1 },
    new double[] { 5, 1 }
};

// Create a new 2D independent Normal distribution
var independent = new Independent<NormalDistribution>(
    new NormalDistribution(), new NormalDistribution());

// Estimate it!
independent.Fit(samples);

var a = independent.Components[0].ToString("N2"); // N(x1; mu = 3.20, s2 = 7.96)
var b = independent.Components[1].ToString("N2"); // N(x2; mu = 2.20, s2 = 4.40)

In the discrete case, this can be represented by a joint distribution.

Non-parametric distributions

The following example shows how to create a non-parametric distribution from the samples themselves, without assuming one particular form for the shape of the distribution. Those distributions include the Empirical Distribution, the Multivariate Empirical Distribution, and the Empirical Hazard Distribution.

// Observations
double[] samples = { 5, 5, 1, 4, 1, 2, 2, 3, 3, 3, 4, 3, 3, 3, 4, 3, 2, 3 };

// Create a new non-parametric Empirical distribution
var distribution = new EmpiricalDistribution(samples);

double mean = distribution.Mean;     // 3
double median = distribution.Median; // 2.9999993064186787
double var = distribution.Variance;  // 1.2941176470588236
double mode = distribution.Mode;     // 3

double chf = distribution.CumulativeHazardFunction(x: 4.2);           // 2.1972245773362191
double cdf = distribution.DistributionFunction(x: 4.2);               // 0.88888888888888884
double pdf = distribution.ProbabilityDensityFunction(x: 4.2);         // 0.181456280142802
double lpdf = distribution.LogProbabilityDensityFunction(x: 4.2);     // -1.7067405350495708
double hf = distribution.HazardFunction(x: 4.2);                      // 1.6331065212852196
double ccdf = distribution.ComplementaryDistributionFunction(x: 4.2); //0.11111111111111116
double icdf = distribution.InverseDistributionFunction(p: cdf);       // 4.1999999999999993
double smoothing = distribution.Smoothing;                            // 0.67595864392399474

string str = distribution.ToString(); // Fn(x; S)

Or in the multivariate case, using kernels:

// Suppose we have the following data, and we would
// like to estimate a distribution from this data

double[][] samples =
{
    new double[] { 0, 1 },
    new double[] { 1, 2 },
    new double[] { 5, 1 },
    new double[] { 7, 1 },
    new double[] { 6, 1 },
    new double[] { 5, 7 },
    new double[] { 2, 1 },
};

// Start by specifying a density kernel
IDensityKernel kernel = new EpanechnikovKernel(dimension: 2);

// Create a multivariate Empirical distribution from the samples
var dist = new MultivariateEmpiricalDistribution(kernel, samples);

// Common measures
double[] mean = dist.Mean;       // { 3.71, 2.00 }
double[] median = dist.Median;   // { 3.71, 2.00 }
double[] var = dist.Variance;    // { 7.23, 5.00 } (diagonal from cov)
double[,] cov = dist.Covariance; // { { 7.23, 0.83 }, { 0.83, 5.00 } }

// Probability mass functions
double pdf1 = dist.ProbabilityDensityFunction(new double[] { 2, 1 });    // 0.039131176997318849
double pdf2 = dist.ProbabilityDensityFunction(new double[] { 4, 2 });    // 0.010212109770266639
double pdf3 = dist.ProbabilityDensityFunction(new double[] { 5, 7 });    // 0.02891906722705221
double lpdf = dist.LogProbabilityDensityFunction(new double[] { 5, 7 }); // -3.5432541357714742