GSoC 2018 Application Akash Vaish: Improving Probability and Random Processes

Improving Probability and Random processes

About me

Contact Information

Name: Akash Vaish

University: Birla Institute of Technology and Science, Pilani, Goa Campus

Programme: M.Sc.(Hons.) Mathematics and B.E.(Hons.) Computer Science

E-mail: [email protected]

GitHub profile: akash9712

Time zone: IST (UTC+5:30)

Personal Background and programming setup:

I am a 3rd year undergraduate student of Mathematics and Computer science at BITS Pilani, Goa Campus, India.

I use Ubuntu 16.04 (LTS) and Sublime Text 3 for writing all of my code. For debugging, I use PuDB, which I discovered through Aaron Meurer's blog. I prefer PuDB over other debuggers because of its functional design and intuitive interface.

Being a student of both Mathematics and Computer Science, I have completed relevant coursework as part of my curriculum, including Real Analysis, Optimisation, Graph Theory, Measure and Integration, Numerical Analysis, Cryptography, Number Theory, Object Oriented Programming, and Logic in Computer Science. I am currently enrolled in courses on Data Structures and Algorithms, Database Management Systems, and a few others.

My study of Probability and Statistics is directly relevant to the Summer of Code project, while Operations Research has helped me develop an understanding of the practical applications of probability and statistics. Reliability, queueing models, inventory models, and introductory game theory were part of my Operations Research coursework.

I have also audited the following MOOCs for getting acquainted with Python:

Note: The aforementioned courses included weekly projects in Python, such as building a Tic-Tac-Toe player based on Monte Carlo methods, Conway's Game of Life, and a browser-based replica of the 2048 game. However, the course guidelines do not allow me to share the source code of these projects.

Prior to working in Python, my programming experience was confined to C, which is syntactically more demanding. By contrast, development in Python was delightfully intuitive. For example, the simplicity of building web crawlers with the Beautiful Soup library helped me automate web-related tasks, and writing small scripts like these eased the learning curve for me. One of my favourite features of Python is list comprehensions, which often make code very clean and concise.

My favourite SymPy feature is symbolic summation. It was one of the first features I used while trying to familiarize myself with the library. Being unacquainted with the underlying summation algorithms, I was amazed that the software could recognize and evaluate a convergent series:


>>> from sympy import Symbol, summation, oo
>>> x = Symbol('x')
>>> summation(1/(x**2), (x, 2, oo)).doit()
-1 + pi**2/6

I am fairly comfortable with Git and GitHub, having used these over the past months for contributing to SymPy and otherwise.

Contributions:

I was introduced to SymPy while searching for Python/C++ based projects that would deepen my knowledge, especially of the practical aspects of my Mathematics coursework, and I started contributing to the project in August 2017.

While trying to fix a bug in the ntheory module, I came across a novel way of checking whether a number is a perfect square. The approach greatly intrigued me, and discoveries like these have motivated me to keep contributing to the SymPy codebase.

The following are the merged and open pull requests I have created since then (listed in chronological order).

Merged

Open

The Project

Introduction:

Improvements to the statistics module were suggested as one of the prospective GSoC projects in 2017, when certain issues with the then-current version were pointed out. While some of these issues have since been fixed, there is still a lot of room for improving the existing features, as well as implementing new ones.

Project Idea:

The ideas page provides an adequate list of ideas for the summer project, which can be broadly divided into three parts:

  • Implementation of other probability spaces, random variables with compound probability distributions, multivariate distributions like the multivariate normal, etc.

  • Implementation of random processes (including random walks and Markov chains)

  • Exporting random variables to other libraries.

While the above implementations are fairly independent, it would be more systematic to work on one at a time, moving on to the next only after the previous one has been wrapped up. The project plan described in the sections below is based on this premise.

Motivation

My primary motivation for this project is my interest in the field of probability, along with the chance to make a significant contribution to an open-source library.

The importance of probability is increasingly recognised in most, if not all, areas of science. SymPy implements univariate probability spaces in great detail, but spaces involving multiple random variables are not implemented at all.

As a CAS, it is essential that SymPy implement these, along with random processes.

Implementation plans

STAGE 1

I plan to start the project by completing the implementation of discrete random variables in drv.py. This would involve porting the corresponding classes already implemented for continuous and finite random variables (in crv.py and frv.py respectively), and thus implementing things like conditional probability, a discrete product space, etc. This is necessary because, if the other ideas mentioned above are to be implemented for all three types of random variables, the basics must work properly.
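For instance, conditional queries already work for continuous random variables; the goal is analogous behaviour for the discrete case. A sketch of the target interface (the output shown is the expected value, by the memorylessness of the geometric distribution, not what SymPy currently returns):


>>> from sympy import S
>>> from sympy.stats import Geometric, P
>>> X = Geometric('X', S.Half)
>>> P(X > 4, X > 2)  # conditional probability P(X > 4 | X > 2)
1/4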

I started this part of the project in #14218, but could not move forward with it due to exams and work on this proposal. Completing this PR is the first thing I would like to do as part of this summer project.

STAGE 2

Joint Distributions

With discrete random variables brought on par with their finite and continuous counterparts, the next step would be to move the project forward by implementing joint probability distributions for any given number of variables.

This can be implemented for both continuous and discrete random variables, where the joint probability distribution may or may not be a piecewise function.

Integration of Piecewise functions is possible in SymPy, even when the limits involve symbols other than the variable of integration. This makes repeated integration over R^n possible. Thus, given a continuous joint distribution, it would be possible to calculate probabilities for a specified event. For the discrete analogue, piecewise_fold could be used when summing over the given domain.
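As a small illustration, the density used in the example further below can be marginalized and integrated over an event this way (the commented results are what the integrals evaluate to; the exact output shape may differ):


>>> from sympy import And, Piecewise, exp, integrate, oo, symbols
>>> x, y = symbols('x y')
>>> f = Piecewise((exp(-x), And(y > 0, y < x)), (0, True))
>>> marginal = integrate(f, (y, -oo, oo))  # x*exp(-x) for x > 0
>>> integrate(marginal, (x, 1, oo))        # P(X > 1)
2*exp(-1)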

A JointPSpace class will need to be implemented; the following is a basic working prototype. The method _check() is intended as an optional validity check on the given joint probability distribution: it raises an error if the distribution can be verified to be invalid, and returns None otherwise.


from sympy import Basic, S, integrate, oo

class JointPSpace(Basic):
    """
    Class for a joint probability distribution over R^n.
    """
    def symbols(self):
        return self.args[0]

    def domain(self):
        dimension = len(self.symbols())
        return S.Reals**dimension

    def density(self):
        return self.args[1]

    def _check(self):
        # Raises an error if the integral of the density over the
        # whole domain is not 1.
        dens = self.density()
        for sym in self.symbols():
            dens = integrate(dens, (sym, -oo, oo))
        if dens != 1:
            raise ValueError("Given distribution is not valid.")
...

Example:


>>> from sympy import Piecewise, exp, symbols
>>> x, y = symbols('x y')
>>> A = JointPSpace((x, y), Piecewise((0, y < 0), (exp(-x), y < x), (0, True)))
>>> A._check()
>>> A = JointPSpace((x, y), Piecewise((exp(-x), y < x), (0, True)))
>>> A._check()
Traceback (most recent call last):
...
ValueError: Given distribution is not valid.

Note: This is only a basic prototype of how the class could be created. The aim for the final implementation would be to create a function that builds these objects internally, like the currently implemented random variables, and, if the user wants, limits the domain of each RV. Also, instead of plain symbols, the user should be able to pass random symbols as parameters, in which case the domain would automatically be restricted as per each RV. This can be done by having JointPSpace inherit from ProductPSpace, instead of starting from scratch.

Joint distributions will then be completed by adding other functionalities such as

  • Marginal densities (a sketch of how these could be computed follows this list)

  • Conditional probabilities

  • Separate implementations of the class for continuous and discrete distributions, etc.
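For the first item above, the marginal density of one variable can be obtained by integrating the joint density over all the others. A minimal sketch, assuming the JointPSpace prototype above (the helper name marginal_density is hypothetical):


from sympy import integrate, oo

def marginal_density(jps, sym):
    # Integrate the joint density over every symbol except `sym`.
    # Hypothetical helper for the JointPSpace prototype above.
    dens = jps.density()
    for s in jps.symbols():
        if s != sym:
            dens = integrate(dens, (s, -oo, oo))
    return dens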

Joint distributions also lay down the groundwork for Compound probability distributions.

STAGE 3

Compound Distributions

SymPy already allows for the creation of compound random variables, and the density can be computed in some cases:


>>> from sympy import Symbol, simplify
>>> from sympy.stats import Normal, P, density
>>> x = Symbol('x')
>>> N = Normal('N', 0, 1)
>>> M = Normal('M', N, 2)
>>> simplify(density(M)(x))
sqrt(10)*exp(-x**2/10)/(10*sqrt(pi))

but raises an error in others:


>>> from sympy.stats import Poisson
>>> Y = Poisson('Y', 3)
>>> Z = Poisson('Z', Y)
>>> density(Z)(x)
...
NotImplementedError

Even in the cases where the density can be calculated, the probability of an event can't be:


>>> # M is the compound normal RV defined above
>>> P(M > 2)
...
RecursionError: maximum recursion depth exceeded while calling a Python object

Currently, if a given expression contains random variables belonging to more than one probability space, the combined probability space is taken to be a ProductPSpace:


    if all(rv.pspace == rvs[0].pspace for rv in rvs):
        return rvs[0].pspace
    # Otherwise make a product space
    return ProductPSpace(*[rv.pspace for rv in rvs])

(stats/rv.py)

which assumes each of those random variables to be independent. The density is then computed by multiplying the densities of each random variable present in the expression:


class ProductContinuousPSpace(ProductPSpace, ContinuousPSpace):
    ...
    def pdf(self):
        p = Mul(*[space.pdf for space in self.spaces])
        return p.subs(dict((rv, rv.symbol) for rv in self.values))

(stats/crv.py)

The issue with this algorithm is that it is applied whenever the given expression is multivariate, thus assuming that all the random variables involved are independent. This is not true for compound distributions.

To fix this, I plan to implement a separate class of probability space called CompoundPSpace. An object of this class would be created if the parameters of a given random variable are found to be distributed according to another random variable.

I couldn’t find an algorithm that computes the densities of compound random variables in general, but this can be implemented for certain known cases, as given here.
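Internally, these known cases could live in a small registry keyed by the outer distribution, the compounded parameter, and the latent distribution. A minimal sketch of the idea (the structure and names are my own assumptions; only a few classical results are listed):


# (outer distribution, compounded parameter, latent distribution) -> result
KNOWN_COMPOUND_CASES = {
    ('Normal', 'mean', 'Normal'): 'Normal',            # variances add
    ('Poisson', 'rate', 'Gamma'): 'NegativeBinomial',
    ('Binomial', 'p', 'Beta'): 'BetaBinomial',
}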

The check for random variables being passed as parameters can be performed while creating the random variable itself, in which case an object of the class CompoundPSpace would be returned.

For example, while creating a compounded continuous random variable:


def rv(symbol, cls, args):
    args = list(map(sympify, args))
    dist = cls(*args)
    dist.check(*args)
    if any(isinstance(arg, RandomSymbol) for arg in args):
        return CompoundPSpace(symbol, dist).value
    return SingleContinuousPSpace(symbol, dist).value

One example of a compound random variable whose density takes a standard form is a Normal RV whose mean is itself normally distributed. Consider the following mock-up of how this case could be dealt with:

class CompoundPSpace(PSpace):

    def __new__(cls, s, distribution):
        dist = _compute_compound_type(distribution)
        return Basic.__new__(cls, s, dist)

Here, _compute_compound_type is a special function whose aim is to find out what kind of distribution the compounding results in. This can be done in stages:

  1. Find out what the original distribution is. In the current example, it is Normal; say, N(M, phi**2).

  2. Find out how the distribution is compounded, i.e., which parameter is being compounded and what the latent distribution is. Here, it is the mean, with a Normal latent distribution; say, M ~ N(mu, sigma**2).

  3. Return the resultant distribution with the modified parameters. In our case, this is again Normal: N(mu, phi**2 + sigma**2).

I plan to implement compound RVs in a separate file, probably named something like compoundrv.py. A function can be implemented that considers the various known cases of compound distributions, with each case accounting for the type of the latent distribution of the parameter.

Example:


def compute_compound_type(distribution, **kwargs):
    ...
    if isinstance(distribution, NormalDistribution):
        mean, std = distribution.args
        if isinstance(mean, RandomSymbol) and not \
                isinstance(std, RandomSymbol):
            latent = density(mean)  # distribution of the latent mean
            if isinstance(latent, NormalDistribution):
                # N(M, phi**2) with M ~ N(mu, sigma**2) compounds to
                # N(mu, phi**2 + sigma**2). NormalDistribution takes a
                # standard deviation, hence the sqrt.
                return NormalDistribution(latent.args[0],
                                          sqrt(std**2 + latent.args[1]**2))
    ...

An unevaluated object would be returned if the type of distribution can’t be evaluated.

Dependence between random variables

SymPy already implements measures of dependence between random variables, such as the Pearson correlation coefficient and covariance. However, without the implementation of multivariate/compound distributions, these measures cannot be used except in elementary cases.

Example:


>>> from sympy import *
>>> from sympy.stats import *
>>> E = Exponential('E', 2)
>>> N = Normal('N', 0, 1)
>>> M = Normal('M', N, 2)
>>> correlation(E, E + N)
sqrt(5)/5
>>> correlation(N, M)
Traceback (most recent call last):
...
TypeError: s should have been string or Symbol

With the implementation of joint probability distributions, and the current implementation of dependency measures, it would be possible to create Joint distributions that imply dependence between random variables.

The dependence between RVs should be determinable even when one of them has been compounded with the other. This will be relatively easy to implement after the preceding classes are in place, and would not need any special treatment once dependence between RVs can be assumed.
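A sketch of the intended behaviour (the values in the comments follow from writing M = N + 2*Z with Z a standard normal; they are targets, not current output):


>>> N = Normal('N', 0, 1)
>>> M = Normal('M', N, 2)
>>> covariance(N, M)   # target: Var(N) = 1
>>> correlation(N, M)  # target: 1/sqrt(5), i.e. sqrt(5)/5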

STAGE 4

Random Processes

The ideas page mentions random walks and Markov chains as two random processes that can be implemented. The following sections describe what each of them could look like.

Markov Chains

The natural thing to do would be to use SymPy matrices to implement the transition matrix and initial state of a Markov chain. Consequently, state matrices can be computed after any given number of iterations.

Here’s a very crude prototype I wrote for the purpose of this proposal:


from sympy import Basic, Matrix, S, eye, ones, zeros, linsolve, symbols

class MarkovChain(Basic):

    def __new__(cls, state0, transitionMatrix, **kwargs):
        if not (isinstance(state0, Matrix) and
                isinstance(transitionMatrix, Matrix)):
            raise ValueError("state0 and transitionMatrix must be matrices.")
        # Similarly, raise errors when the sum of a row is not one in
        # the initial state or transition matrix, or when the
        # transition matrix is not square.
        return Basic.__new__(cls, state0, transitionMatrix)

    def initial_state(self):
        return self.args[0]

    def transition_matrix(self):
        return self.args[1]

    def state(self, n):
        assert n >= 0
        return self.initial_state()*self.transition_matrix()**n

    def stationary_state(self):
        """Return the stationary state if it exists, else raise an error."""
        T = self.transition_matrix()
        # Solve p*T = p, i.e. (T' - I)*p' = 0, together with the
        # condition that the entries of p sum to one.
        A = T.T - eye(T.rows)
        A = A.row_insert(A.rows, ones(1, T.rows))
        b = zeros(A.rows, 1)
        b[A.rows - 1, 0] = 1
        p = symbols('p0:%d' % T.rows)
        stationary_mat = linsolve((A, b), *p)
        if stationary_mat != S.EmptySet:
            return stationary_mat
        raise ValueError("No stationary state exists.")

The stationary_state method solves the equations obtained by requiring that an unknown state matrix (the one we wish to find) is unchanged on multiplication by the transition matrix, together with the condition that all entries of the state sum to one.
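A usage sketch of this prototype (the transition matrix is illustrative):


>>> from sympy import Matrix, Rational
>>> T = Matrix([[Rational(1, 2), Rational(1, 2)],
...             [Rational(1, 4), Rational(3, 4)]])
>>> M = MarkovChain(Matrix([[1, 0]]), T)
>>> M.state(1)
Matrix([[1/2, 1/2]])
>>> M.stationary_state()
{(1/3, 2/3)}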

As mentioned above, this is only a rough idea of what Markov chains could look like. During the course of the project, it can be extended to incorporate other features, such as higher-order Markov chains, or finding out whether a set of states is closed.

Random Walks

The aim would be to implement random walks in one, two, or three dimensions, since these are the most commonly used. I plan to implement them in the Cartesian space corresponding to the dimension of the walk.

Each random walk can be implemented as a separate class, or in the same class where the dimension is given as a parameter by the user. The user will also decide the step size, with the default being one.

The following is a functioning but quite naive prototype of a random walk in one dimension, implemented as a separate class. It assumes an initial position of 0 and an equal likelihood of moving in either direction.


import random

from sympy import Basic, S

class Random_walk(Basic):

    def __init__(self, step_size=1):
        self.step_size = step_size
        self._position = S.Zero
        self._path = [0]

    def position(self):
        """Return the current position."""
        return self._position

    def path(self):
        """Return the path taken to the current position."""
        return self._path

    def move(self):
        """Take a single step in a random direction."""
        if random.randint(0, 1):
            self._position += self.step_size
        else:
            self._position -= self.step_size
        self._update_path(self.position())

    def walk(self, n):
        """Take a walk of n steps."""
        for i in range(n):
            self.move()  # move() already records the new position

    def _update_path(self, n):
        self._path.append(n)


>>> A = Random_walk()
>>> A.position()
0
>>> A.move()
>>> A.position()
1
>>> A.walk(5)
>>> A.path()
[0, 1, 2, 3, 2, 1, 2]
>>> A.position()
2

Random walk classes could be extended with features like calculating the probability of taking a particular path, or letting each step be distributed according to a distribution of the user's choice. In particular, Gaussian random walks can be implemented, as sketched below. Random walks could also be integrated with the plotting module, to give a graphical representation of the walk.
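A minimal sketch of such a Gaussian extension, assuming the Random_walk prototype above (the class name and the use of random.gauss are my own choices):


import random

class Gaussian_random_walk(Random_walk):
    """Random walk whose step sizes are drawn from N(mu, sigma)."""

    def __init__(self, mu=0, sigma=1):
        Random_walk.__init__(self)
        self.mu = mu
        self.sigma = sigma

    def move(self):
        """Take a single normally distributed step."""
        self._position += random.gauss(self.mu, self.sigma)
        self._update_path(self.position())

Since move() is the only overridden method, walk() and path() work unchanged.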

STAGE 5

This would be the last stage of my summer project: making SymPy random variables exportable to PyStan and PyMC3, and wrapping the project up.

Because both these packages have Python interfaces, designing a code generator would essentially amount to translating SymPy RVs so that they can be fed into the external library's API.

Given below is a small example, where the convert_normal function converts a SymPy Normal RV to a PyMC3 RV.


import sympy.stats as ss
import pymc3 as pm

def convert_normal(normal_rv):
    """
    Convert a SymPy normal RV to a PyMC3 random variable and
    return the PyMC3 RV.
    """
    converter = pm.Model()
    name = str(normal_rv.args[0])
    # density(rv) returns the distribution object; its args are the
    # parameters. float() keeps non-integer parameters intact.
    mu, sd = (float(arg) for arg in ss.density(normal_rv).args)
    with converter:  # initialize the model context
        pm_normal = pm.Normal(name, mu=mu, sd=sd)
    return pm_normal

Usage:


>>> N = ss.Normal('N', 0, 1)
>>> x = convert_normal(N)
>>> type(x)
<class 'pymc3.model.FreeRV'>
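PyStan, by contrast, takes its model as Stan source code, so the analogous translator there would emit text rather than Python objects. A rough, hypothetical sketch for the normal case (the function name and the emitted model shape are my own assumptions):


import sympy.stats as ss

def normal_to_stan(normal_rv):
    """Emit a Stan parameter declaration and sampling statement
    for a SymPy normal RV (illustrative only)."""
    name = str(normal_rv.args[0])
    mu, sd = ss.density(normal_rv).args
    return ("parameters {{ real {n}; }}\n"
            "model {{ {n} ~ normal({mu}, {sd}); }}"
            .format(n=name, mu=mu, sd=sd))

Usage:


>>> print(normal_to_stan(ss.Normal('N', 0, 1)))
parameters { real N; }
model { N ~ normal(0, 1); }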

I would finally like to conclude the project by completing the documentation, and tying up various ends to make sure that the work done during the course of the project can be incorporated into the library.

Timeline

The following timeline is in alignment with what has been mentioned in the implementation details.

Please note that the timeline below is tentative, but I will try to adhere to it as strictly as possible.

April 23rd - May 14th: Community Bonding Period

The goal in this period would be to get to know my mentor and other organisation members well, and get a clear idea of the workflow during the project.

I will spend some time understanding the mathematical concepts involved in the implementations. If possible, I would also try to get some coding done, so as to stay ahead of schedule and avoid falling behind later.

Post May 14th: The coding begins!

Week 1 (May 14th - May 21st): Completing prerequisites

  • Complete drv.py, so that discrete random variables have the same level of functionality as offered by continuous and finite random variables.

  • The majority of focus will be on implementing conditional probability and product spaces for discrete random variables. This will allow for a comparatively trouble-free implementation of joint distributions.

Weeks 2, 3 and 4 (May 22nd - June 13th): Joint Distributions

  • I’ll start working on joint distributions, for which the discrete and continuous cases will have to be taken care of separately.

  • After the basic implementation of JointPSpace is working, I will move on to adding side features such as computing marginal densities and conditional probabilities for joint distributions. If needed, any well-known special cases can also be implemented along the way.

  • By the end of the 4th week, I will have a working joint probability space for any given set of random variables.

PHASE-1 EVALUATION:

As per the above schedule, I will have completed conditional probabilities for discrete random variables, and joint probabilities over any number of given random variables. I will also make sure to add the test cases and complete the documentation, so that a pull request can be formulated for the purpose of this evaluation.

Weeks 5, 6 and 7 (June 14th - July 5th): Compound Distributions and Dependency measures

I have allotted 3 weeks for this particular task because I feel it will be the most challenging part of the project, with several cases to take care of.

  • The first step is to alter the current mechanism, in which the random variables involved in compound distributions are incorrectly assumed to be independent, giving wrong results.

  • This would require a separate set of classes that identify whether the compound distribution matches a previously known case, and compute the resulting distribution if so.

  • This would again have to be taken care of separately for discrete and continuous random variables.

  • By the end of the 7th week, I'll have working compound distributions, accounting for all cases where a closed form of the compounded distribution can be calculated.

  • With joint probability and compound distributions wrapped up, it would be relatively straightforward to implement measures of dependency in random variables. This would also be facilitated by the already existing methods for calculating correlation and covariance.

Weeks 8 and 9 (July 6th - July 20th): Random Processes

  • Implementing dependency measures would complete the part of the project related to symbolic probability in SymPy. I would then move on to implementing random processes.

  • Weeks 8 and 9 would be dedicated to implementing Markov chains, random walks, and their variations.

  • I would split the time between the random processes by first completing the basics of each process, and only then moving on to the variations.

PHASE-2 EVALUATION:

As stated above, project work related to symbolic probability would be wrapped up before the end of week 7. I would send another PR at this point, which could be my submission for this evaluation period. Because of the date of evaluation, I am not sure if I would be able to include random processes in my PR, but I will try to ensure that at least the basic processes have been implemented by then.

Weeks 10 and 11 (July 21st - August 4th): Exporting RV expressions

  • I expect to design, within the first week, a translator that turns SymPy random variable expressions into expressions that can be used directly in PyStan.

Final Week

  • The last week would be reserved for completing any left over tasks, and tying up any loose ends present.

  • All documentation work will be completed by the end of this week, and I would send a final PR, bringing the coding work to an end.

  • I would submit the final report for all the work done during the summer at this time, along with notes on what could be done in the future.

Post GSoC

I would like to continue my affiliation with SymPy and the open-source community, and keep working towards the development of the library. In particular, I would be happy to work on issues related to the stats module (but not limited to it), or implement new features.

Time commitment

  • My end-semester exams begin on 1st May and continue till 11th May. Though this period overlaps with the community bonding period, I will try to ensure that it doesn't hinder the bonding in any way.

  • I have no other commitments for the summer and would be able to spend 40-50 hours per week. I will inform my mentor about any leave in advance, and make up for it by devoting sufficient extra time.

  • Having no other commitments, I would be able to devote extra time to the project in case I am running behind schedule.

  • My next semester begins on 1st August. However, due to a very flexible attendance policy and lighter coursework at the beginning of the semester, the timeline would not be affected.

References

Stats in SymPy (general):

Joint Distributions:

Compound Distributions:

Random Processes:

PyStan/PyMC3 Documentation:

Discussion on the SymPy mailing list:

Apart from the above mentioned online resources, I also referred to the following:

  • Previous years’ proposals listed on this page.

  • An Introduction to Probability Theory and its Applications (vol. 1 and 2) by William Feller

  • Principles of Random Walk by Frank Spitzer

  • Introduction to Probability and Statistics by J.S. Milton and J.C. Arnold
