GSoC 2020 proposal Arpan Chattopadhyay: Improving Assumptions Module - sympy/sympy GitHub Wiki

Google Summer of Code 2020 - Proposal

Improving Assumptions Module in SymPy

Me, the person:

Name: Arpan Chattopadhyay

University: Birla Institute of Technology and Science, Pilani

Email: [email protected]

GitHub: Arpan612 (Arpan Chattopadhyay)

Gitter: Arpan612

LinkedIn: Arpan Chattopadhyay

Time-zone: IST (UTC+5:30)

Age: 19

I prefer to be contacted via Gitter as I am very active there. However, other ways of communication are also welcome.

I am Arpan Chattopadhyay, pursuing B.E.(Hons.) Electrical and Electronics (2nd Year) at Birla Institute of Technology and Science, Pilani (BITS Pilani). I am highly interested in the fields of Symbolic Mathematics, Applied Mathematics, Machine Learning, and Mathematical Modeling. I would love to continue working in open source projects and hopefully, one day, create my team of programmers for building an open-source project as big as that of SymPy. I am very well versed in English and love interacting with people. Apart from coding, I love watching films, particularly mystery ones. I play and follow football as well. I believe I have been a good fit in the community culture and I hope to contribute even after the GSoC program gets over.

Me, the programmer:

I have been introduced to programming in Class 11 where we were taught C++ programming language as a part of our course curriculum. I started Python a month after joining college. I found great comfort and interest in coding with Python. I made some projects with Django, a framework for Python in the coding projects our college used to organize. I also got experience with Javascript, My SQL, and Assembly level programming in my college. I found great inspiration in collaborating with people and working on projects. At that time, I was introduced to the world of open source and the limitless possibilities it held for me. I was fortunate enough to take part in a Study Oriented Project which involved Symbolic Mathematics and the use of SymPy in Python. I found out about the various tasks which can be done by SymPy and was very impressed. Thus, I decided to finally give wings to my dream of working and collaborating in a large open-source project and SymPy was a natural choice, given the interest, it sparked in me.

OS: Ubuntu 16.04 LTS

Hardware Configuration: Intel i7 8th generation

Python: Version 3.7.4

Editor: Atom (Version 1.45.0) and Anaconda (Version 4.7.12)

My Project:

A core part of any CAS is the assumptions system. This is the subsystem used to record and infer properties of expressions, since virtually any algorithm will only work under some restrictions. Before 2009, we were using an assumption system which was very complex. The assumptions module, core and caching were designed in such an interdependent manner, segregating them individually became a really difficult task.Apart from this, the system had many performance issues. Another problem we faced was that only a handful of facts were implemented. Due to this fundamental nature of the assumption system being so compounded, changes and implementations which were needed on an immediate basis could not be made.

This problem escalated because of its proximity with SymPy core. As a response, building up of a new Assumptions module had begun. This was necessary since the existence of the entire SymPy library depended on it. Since this is a structural change in the library, it takes many years and GSoC projects to implement this. The biggest advantage I have now is that since so many people have worked on it, through their project reports, I have been able to understand to a great extent the structure of it and the precise challenges we face. This proposal aims to systematically identify the issues, formulate tasks and execute it. The old and the new assumption model is shown below.

Assuming assumptions

The main aim is to figure out a way to disconnect the assumptions from the SymPy core, within the time frame before the next release. As far as the SymPy core is concerned, we will use some logic and facts to implement calculation in convenient mode like .is_bounded, is_infinity, is_zero and so on.

The Old Assumptions is time-tested and fast while New Assumptions is slow but provides better logical inferences. We aim to maintain the speed as close to the old assumptions as possible while implementing all the new functionalities and logical inferences we wish to achieve.

Things to remember:

If you change the API, you'll have to modify everywhere that API is used. If you change what an assumption means (e.g., if you change oo.is_real), you'll have to go through everywhere that assumption and make sure that its use is still correct.

Even just making the assumptions more powerful (able to return True or False as opposed to None more often) can require some changes, because some code may take different paths. The final goal is to not have multiple assumptions systems, at least from a user perspective. We will probably have multiple inference layers, but these should be hidden from the user. To do this, we either need to remove one of the existing systems (already tried and failed), or merge the two, so that one calls the other and they just become different layers in the inference stack. So we try to make as much addition possible in new assumptions. The changes which are not scalable from old to new are used in the merged layer. Also the changes we wish to bring is quite huge.

The only way to get such a large change is to split it up into much smaller changes that can be merged into master as soon as possible. So, we make sure that the workflow also segregates PRs into specific functionalities and then we add it to master.

Project Details:

Phase 0: (Community Bonding)

I will begin by doing a thorough analysis of the implementation of the assumption module across the codebase. I will be noting down the areas and other modules where the effect of assumption change is either huge or very less. These notes will be key points used throughout the project to make the project sustainable. Discussion of Proof of concept for the new assumptions with the mentors. Identifying the areas here where work is complete and figuring out the areas where the work branches out to. Discussing the additional implementation we can pursue in the work already done and start working on it.

Discussion of the work done in the project GSoC 2019 Improving-Assumptions. Try to find out what key concepts did he focus on which improved the speed by this margin. Discuss further possibilities of improvement of speed of satask both by the approach used before and explore the possibility of some new approach for the task. The PR below is relevant for this.

https://github.com/sympy/sympy/pull/17144

Chalk out a plan for the scalable implementation of the SAT solvers. Figure out the conflicts we might face and try to figure out how to handle them.

For non-real expressions isn't the is_positive property always False? · Issue #16313 · sympy/sympy S(0)**real(!=0) should be (0 or zoo) and hence non-positive. · Issue #16332 · sympy/sympy

The above issues will be explored and discussed with mentors to get a better understanding of handlers and the codegen. The work left in the PR below will be completed. If any extension of the work is possible, a separate PR will be made to work on it.

Improved AskPositiveHandler and AskNegativeHandler · Issue #16213 · sympy/sympy

Phase 1:

We begin by removing the inconsistencies in both old and new assumption modules. The issues above give a good insight of the problems already we face.

Contradictory facts about infinities in the new assumptions · Issue #5976 · sympy/sympy https://code.google.com/archive/p/sympy/issues/2877 These examples show how the old assumption gave incorrect answers to simple algebra/logic questions. We figure out more of such inconsistencies and implement the correct functionality in the new assumptions module. Some wrong results in the new assumptions · Issue #7286 · sympy/sympy

In [6]: ask(Q.prime(x**y), Q.integer(x)&Q.integer(y)) Out[6]: False In [7]: ask(Q.integer(abs(x)), ~Q.integer(x)) Out[7]: False This is a broken part of the old assumption module we will fix.

Next we move on to satask. The current approach for satask creates too many unnecessary clauses. We will be investigating the facts which are not required for the query and fix it. For example, Q.positive(x) doesn't depend on facts like Q.diagonal(x) or Q.invertible(x) yet is used.

The logic module is essentially built around the assumptions system and is an integral part of the assumptions mechanism. Currently, it lacks the capability of forming complete FOL queries. This restricts the power of SymPy. The issue below highlights this task.

Universal quantification in sympy.logic · Issue #17446 · sympy/sympy

The following PR is relevant for the above task https://github.com/sympy/sympy/pull/7608

In the GSoC project last year, this work was started with an aim of working on the name conflicts with basic SymPy objects along with implementation of some functionalities. However, some files of assumptions, logic and printing had compatibility issues. Debugging is also required of the functionalities implemented. I will solve all these issues along with working on any new functionalities as per requirement. The work done last year is below.

https://github.com/sympy/sympy/pull/17069

We continue our work in satask by adding support for querying Relational expressions. Implementation of this task is far from over.

Currently, new assumptions don't support queries involving Relationals.

In new assumptions

ask(Q.is_true(x>0), Q.positive(x)) # => None

In old assumptions

x = Symbol('x', positive=True) x > 0 True

With a new system supporting Relationals it would reduce the user's dependency on old syntax for such queries. Currently, the queries like ask(Q.positive(x+y+z),assumptions) takes into account the assumptions on each variable and proceeds through a set of cases to arrive at the result. This system may give incorrect results with the ignorance.

of some cases. This is what the old assumptions is doing and the current implementation of new assumptions has followed.

Partial work on this issue is done and the link for which is below. We will complete the work in this PR and add additional functionalities.

https://github.com/sympy/sympy/pull/17392

Phase 2:

First, we complete the work done in querying Relational expressions. Now we start working on ‘ask’ handlers. The deduction system for ask was two parted. First it would use some basic logic deduction using a SAT solver and some known facts (like Implies(Q.positive, Q.real)). Second, each fact would have a handler object, which defines methods on how to determine its truth for specific classes given some assumptions, like class AskPositiveHandler(Handler): @staticmethod def Pow(expr, assumptions): This Returns True, False, or None if expr, which is a Pow, is positive, given the facts in assumptions.

We will keep the new assumption close to the old one. A two argument function.

Next, we work on ‘refine’ function. The idea behind refine() is that since assumptions are separate from objects, the simplifications would need to be done in a function, which takes the object and the assumptions. So it would work like refine(sqrt(x**2), Q.positive(x)) -> x. Refine also has a handler system similar to ‘ask’ to handle writing all these simplification routines. The traditional way this was done is in the core, meaning the simplification happens when the object is created, by checking the assumptions on x. It reduces speed to a good extent and makes simplification of function of functions difficult. This is important to boost the performance. Old assumptions are used heavily in the core for automatic simplification which slows things down. Refine is the new way of doing so which uses new assumptions. We can use refine to do the auto simplification and make it optional.

https://github.com/sympy/sympy/issues/17052

The above link contains issues which we will be fixing, as a starting point of our work in refine function.

Now, we work on implementing sat handlers for matrices. We will start off by migrating from handlers/matrices.py to sathandlers.py. We implement scalar matrices, non negative elements, real elements. I have seen mathematical problems involving complex terms in matrices as well. I will give an attempt in implementing something in this area as well. We will also try to implement a more generalised expression to make assumptions about all the elements of a matrix.

Phase 3:

We continue to remove assumptions from the core as much as possible. If assumptions stay in the core they add too much overhead in the computation. Perhaps, this is primarily because they are called too many times. We will explore the modules and the corresponding expressions where scope of removal from core and implementation of the new assumption module is possible. Having said that, most of the modules heavily use old assumptions and converting them to a new API would have some compatibility issues. If we assume that the API structure of old assumptions is better, it would be a loss to remove old assumptions syntax completely. In these modules, we will be programming the assumptions in such a manner that the old assumptions call the new assumptions, and thus, the better API is maintained. To sum up, this task follows a module by module approach. Modules which are more frequently used by users are worked on first. We attempt to implement new assumption syntax in its entirety, but if compatibility issues persist, we switch over to the mechanism of old assumptions calling the new assumptions, after discussion with the mentors.

Next, we focus on better interaction of inference engines and handlers. The old assumptions system uses backwards chaining: given an expression e and a property foo, in order to answer is_foo(e), it first tries calling the foo handler for e. If this is not successful, it looks at a list of all properties bar which, under some circumstances, imply foo. Then each of them is investigated in a recursive pattern. For example, they first use the handler then they start looking at properties implying it. In the current implementation of the new system, the handlers and logic inference are completely separate. For example, given an expression e, suppose we ask(Q.is_nonzero(e)). In the absence of global assumptions, the current implementation of ask is going to call the nonzero handler for e, and if this is inconclusive, terminate. Even if there is a conclusive is_positive handler, it is not going to be called and once the new system is in place, we will link more closely together the handlers and logic inference code because the current system is not as per expectations.

Now, we focus on the simplification and refinement code. Currently all auto simplification and refinement code is procedural and ad-hoc. We build a kind of general simplification engine responsible for simplifications. The attempt here is to build a sustainable code here which encodes simplification rules in a more declarative fashion. We attempt to do it using LogPy. This work replaces sympy's unify module with an external dependency. The work involves adding logpy to setup.py, using logpy from tuple methods, import variables from logpy directly. Here, compatibility may be a problem. I think the latest version of logpy will be compatible. Since the basic flow of work is the same, if logpy does not work out, we will use other libraries based on discussion with mentors.

Timeline: (As per the latest schedule sent by GSoC authorities)

Community Bonding Period: (Present - May 31)

Analysis of the implementation of the assumption module across the codebase.

Discussion of Proof of concept for the new assumptions with the mentors.

Discussion of the work done in the project GSoC 2019 Improving-Assumptions.

Discussion of the issues mentioned in project details.

Discussion of the proposed workflow and changing the order of work if required.

Phase 1: June 1 to June 7: (Week 1)

Remove the inconsistencies in both old and new assumption modules based on the workflow in project details.

PRs for the work done by June 7 latest.

June 8 to June 21: (Week 2 and 3)

Start exploring the logic model and FOL queries in satask. Study the codebase and all the implementation done by June 9.

Work on the compatibility issues and the PRs linked with the PR #17069 as mentioned.

PRs for the work done by June 15 latest.

Work on the missing implementations in PR #17069 and send a PR for additional implementation after discussion.

First set of PRs for the work done by June 19 latest.

June 22 to June 28: (Week 4)

Second set of PRs for the work done by June 22 latest. Add support for querying Relational expressions as mentioned in project details. Add missing functionalities in PR #17392. PR for the work done by June 26 latest. Add functionalities for querying Relational expressions.

June 29 to July 5: (Week 5)

Documentation of the work done in phase 1. Fixing bugs and issues on the same.

Phase 1 evaluation submission by July 1 latest Phase 2: July 6 to July 19: (Week 6 and 7)

PR for adding functionalities for querying Relational expressions by July 6 latest. Start the work for ‘ask’ handlers. Implementing basic logic deduction using a SAT solver and some known facts. PR for the work done by July 11 latest. Work on handler objects as mentioned in project details. PRs for the work done by July 17 latest

July 20 to August 2: (Week 8 and 9) Implementation and adding functionalities in refine function according to the workflow in project details. PRs for the work done by July 23 latest Implementing sat handlers for matrices as per project details.

Phase 2 evaluation submission by July 30 latest PRs for the work done in sat handlers by August 2 latest

August 3 to August 9: (Week 10) Remove assumptions from the core by a module by module approach. First set of PRs for the work done by August 6 latest. August 10 to August 23: (Week 11 and 12) Second set of PRs for the work done by August 10 latest. Implementing better interaction of inference engines and handlers. First set of PRs for the work done by August 14 latest.

Second set of PRs for the work done by August 17 latest. Working on Simplification and refinement code using LogPy. First set of PRs for the work done by August 21 latest. Second set of PRs for the work done by August 24 latest.

August 24 to August 31: (Week 13)

Finishing up documentation and blogs and work Report on SymPy wiki Checking for any issues or conflicts unattended Working on additional goals Submission for Final Evaluation by August 28 latest.

Additional Goals: (Time Permitting)

In computer science and mathematical logic, the satisfiability modulo theories (SMT) problem is a decision problem for logical formulas with respect to combinations of background theories expressed in classical first-order logic with equality. Some work is done in this area in a previous GSoC project. The aim here is to study the previous work done and solve the issues (mainly compatibility and syntax ones) and work out a plan with mentors for additional implementation of the same.