GSoC 2013 Application Tom Bachmann: Removing the old assumptions module

This application is work in progress. At this stage of writing, it should nonetheless convey a good idea of what I am planning to do.

Preliminaries

My name is Tom Bachmann and I study mathematics (fourth year) at the University of Cambridge, England. After the summer I will be doing a PhD with Fabien Morel in Munich, Germany.

email: [email protected] (remove the NOSPAM)

alternate email: [email protected]

IRC: ness (authenticated with freenode)

github: ness01

I am currently studying for my final exams. These will finish at the beginning of June. Until then, the amount of time I am able to invest is very limited. This is also why I am almost never online on IRC (but I am virtually always reachable by email). This will change once my exams are over.

Of course, if my proposal is accepted, I will be working full time (i.e. 40h/week) on the project. I do not plan on taking any vacations. (I will be traveling occasionally for short times, which might impact my connectivity for a few days, but not my productivity.)

Personal background

I am about to finish my four-year master's course at Cambridge. My main interests are algebraic geometry and homological/homotopical algebra. After the summer I will be pursuing a PhD in the field of A^1-homotopy theory. Most of these topics are extremely theoretical (although commutative algebra also has an interesting computational side). Nonetheless I would like to assure the reader that I am both interested in and capable of very practical software development.

In 2011, I participated in GSOC for sympy for the first time. My proposal to implement an integration algorithm using Meijer G-functions was accepted and successful. Here is my report. All code is currently in master and working.

In early 2012, I started the "agca" ("algebraic geometry and commutative algebra") module for sympy. It aims to leverage the low-level Groebner basis algorithms and package them into nice first-class objects, so as to create a system for computations in commutative algebra, similar to Singular or Macaulay.

In summer 2012, I participated again in GSOC, this time as a mentor for the category theory module. In parallel to that, I did a three-month internship at Google Switzerland, working on Google Calendar (I was doing backend work, mostly in Java).

Naturally, I have quite a lot of experience with both Python and git. (I am also experienced in C and C++, Java, and x86-64 assembler.) I do my coding on my x86-64 laptop, running Debian (testing). My editor of choice is vim (several years ago I decided to learn either vim or emacs; for some reason I tried vim, and stuck with it). One thing I like about Python is that it allows a programmer to turn ideas into code rather directly, as in for example statements like `if all(x > 4 for x in L if x.is_Integer)`. The most advanced Python language/library feature I remember using (off the top of my head) is custom decorators.

Patch requirement

Large bodies of my code are in master, so I obviously have a lot of merged pull requests. See here.

Project

A core part of any CAS is the assumptions system. This is the subsystem used to record and infer properties of expressions. For example, the expression `abs(x)` can be simplified to `x`, provided `x` is known to be a non-negative real number. As another example, the equation `a*x + b = 4` has precisely one solution, provided `a` is a non-zero element of a field. This is the "recording" part of any assumptions system. The other part is inference. For example, if `x` is a real number, then `x**2 + 1` is always a positive number. If the user has specified the first property ("`x` is real"), then the CAS should be able to figure out the second property ("`x**2 + 1` is positive") on its own.
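
For concreteness, here is how the two halves look in the old-style syntax that sympy users write today:

```python
>>> from sympy import Symbol
>>> x = Symbol('x', real=True)   # recording: x is declared real
>>> (x**2 + 1).is_positive       # inference: deduced automatically
True
```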

These examples make it clear that the assumptions system is an incredibly central part of the CAS, since virtually any algorithm will only work under some restrictions. It follows that significant changes to the assumptions system are, in all likelihood, very hard to implement. Sympy indeed has an assumptions system; in fact it has two: the "old" assumptions system and a "new" assumptions system. The aim of this project is to remove the old one.

Since it was claimed above that significant changes to the assumptions system are very difficult to do, and replacing the entire system certainly counts as a significant change, we should naturally question its necessity. The main issues with the current assumptions system can be summarized as follows:

  1. it is of limited expressiveness
  2. it is very hard to extend
  3. it is esoteric and under-documented
  4. it is very tightly coupled to the core
  5. it incurs significant performance overhead

These issues are interrelated, and I will explain them to some extent. It is important to recall the history of the current assumptions system: sympy has naturally always needed such a system, and basically the current system is the organic outgrowth of years of development (as far as I know). This explains the first four points. I have only cursory knowledge of the last issue mentioned above (performance impact). As far as I know, the story goes roughly like this: since the assumptions system is so tightly coupled to the core, (a) radical restructurings of the core for performance improvement are essentially impossible, and (b) in the bulk creation of "dumb" objects, setup of the assumptions becomes the dominant source of delay.

It is also important to keep in mind the strong suits of the old system: the current system is well-tested, easy to use, and highly optimized. It is also very resilient: the inference engine works in such a way that inference rules can be scattered all throughout sympy (typically near the types of objects they are about), can call each other recursively in any desired way, and still there will never be an infinite loop, and the derived conclusions are always consistent.

Any new system will have to be at least as good as this. Additionally, it has to address a significant subset of (1) to (5), in order to merit the large effort required to replace the old system. This is a formidable task indeed, which has been ongoing for four years now. It started (as far as I know) in 2009, with a GSOC project on assumptions. This project aimed to address issues (1) and (2). It was largely successful in doing so: the project created a highly expressive and very powerful new assumptions system. Replacing the old system by the new one was unsuccessful though, largely for performance reasons. In 2010 then, naturally, there was a GSOC project to improve the new assumptions system, and replace the old one. Vast improvements were implemented to the core algorithms underlying the new assumptions system (i.e. the SAT solver). However, replacing the old system still did not work out. The attempt hit a variety of practical problems, including caching, newly uncovered bugs, incompatibilities with the old system, and performance issues.

Since then, few people have dared touch the old system significantly, and little work has been done on the non-integrated new system (understandably). However, anyone who has ever written any amount of code for sympy has interacted with the assumptions system, and almost everyone has stumbled across problems related to the issues above.

The same is true for me, and I want to remedy the problem. I am in the fairly unique position of being both an experienced sympy developer, and having a lot of free time over the summer. I want to use this GSOC project to fund my working full-time on finally getting rid of the old assumptions system.

Deliverables

The main aim of this project is deceptively modest: replace the usage of the old assumptions system by the new system, remove the old system, and do not negatively impact performance or functionality in doing so. (Of course, as discussed above, this will also have a large positive impact throughout sympy.)

As discussed below, this is actually a highly nontrivial task, whose solution will require improvements throughout sympy. These include:

  • a redesigned caching system
  • optimization of the new assumptions code to remove linear scaling
  • new features for the new assumptions code, to achieve backwards compatibility
  • robustness improvements for the new assumptions

There is a second core part to this project, which is to very carefully document the new assumptions system.

Stretch goals

Hopefully it should be possible to make progress on at least one of these points (see discussion at the end for details):

  • dynamically defined assumptions
  • transforming the current autosimplification system of sympy into autorefinement
  • improving refine() using logpy
  • reorganize the handlers code
  • better interaction of inference engine and handlers

Discussion

Advantages of the new system

The new assumptions system differs from the old one quite drastically. In particular:

  • assumptions are stored differently
  • assumptions are represented differently
  • relations are represented differently
  • they use completely different inference methods.

Most of these differences are by design, and are what makes the new system more powerful. But the most important property of the new system is its easy extensibility. The old system is very hard to understand, and seemingly tiny changes tend to have enormously counterintuitive consequences. This is why the new system attempts to separate as cleanly as possible the various operations it provides:

  • a syntax for posing questions and specifying assumptions: `ask(question, assumptions)` and `Q.foobar(expr)`; also the logic connectives (`And`, `Equivalent`, etc.)
  • a logical inference engine (this is basically the SAT solver in the logic module)
  • a collection of relations, such as `Implies(Q.positive, Q.nonzero)`
  • a collection of specialised logic rules, e.g. for determining positivity of some expression numerically
  • a system for coupling together these operations to answer questions: the current implementation of `ask`
  • a system for coupling together the operations to simplify expressions: the current implementation of `refine`

What is important here is that this system is modular. The first four parts are (essentially) independent from the others. It should be understood that none of these parts is "perfect" in its current form, and probably not even better than the old system (this is certainly true for the last three points). Discussions as in pull request #1907 and issue 3687 (which does not work with the current implementation of the new system, but is no problem for the old one) make this clear.

The main aim of this project is not to implement the fancy new ideas we have for the new assumptions system. Instead I want to improve it to such an extent that it can replace the old system, and then get rid of the old system. Once the new system is in common use, I am confident it will be extended rapidly, thanks to its modular design.

Current shortcomings of the new system

The new assumptions system has the following conceptual and practical shortcomings compared to the old system:

  1. less robust than the existing system
  2. relies on global state, so is not side-effect free
  3. cannot express `Symbol('x', positive=True)` without appending to global state
  4. query time scales linearly with the amount of global state

The first point is only to be expected when comparing a new, essentially unused system to an old and evolved one. The old system uses the following measures to improve robustness: (a) inconsistency detection via aggressive forward-chaining, (b) recursion detection via caching.

(a) means the following: whenever a new fact about some expression becomes known, all facts about the expression which can now be proven are deduced on the spot. For example, suppose there is a special infinity object, and in its class definition we write `is_positive = True; is_real = False`. As soon as this object is instantiated, the assumptions system tries to deduce all facts, and finds a contradiction. The new system does not forward-chain aggressively, leading to issues such as 2877. In general, there is really no way around this. Even in the old system, if we replace the `is_positive` and `is_real` variables by methods `_eval_is_positive` and `_eval_is_real`, then whichever is asked first will take precedence (compare the discussion in the issue referenced). The underlying problem is that there is no (efficient) way to ensure that the specialised logic rules ("handlers") conform to our theory ("relations"). This does not mean that we should not make an effort; any way to detect such bugs without hurting performance is desirable. One strategy to catch most of the problems the old system catches would be to exhaustively test all handlers without additional assumptions (i.e. given any kind of sympy object `e`, for each predicate `Q.p`, run `ask(Q.p(e))`; thus build a table of e's static properties ("default assumptions") and check that it is consistent with the theory).
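
A rough sketch of this testing strategy follows. Here `PREDICATES`, `THEORY` and `consistent` are stand-ins: the real check would iterate over the predicate list of the assumptions module and use the SAT solver from `sympy.logic` for the consistency test.

```python
from sympy import ask

def check_default_assumptions(e, PREDICATES, THEORY, consistent):
    """Verify that e's handler-derived facts do not contradict the
    known relations between predicates (hypothetical helper)."""
    table = {}
    for p in PREDICATES:
        answer = ask(p(e))          # query without extra assumptions
        if answer is not None:
            table[p] = answer       # e's "default assumptions"
    if not consistent(THEORY, table):
        raise AssertionError(
            "handlers for %r contradict the theory: %r" % (e, table))
    return table
```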

To understand (b), observe that the old system stores (caches) any facts it ever proves. (This is OK since assumptions never change.) This then allows employing a neat trick for recursion detection: at the beginning of an inference, store `None` into the cache of the fact on the object. If the handlers check circular properties (e.g. the `positive` handler might check `is_negative` and vice versa), there still will be no infinite loop, since the second time we ask about any property the answer will just be `None` (i.e. "I don't know"). This technique should be employed by the new system as well. Instead of caching inferred facts forever, we can memoize them over the course of one `ask` invocation; this will both improve speed and avoid infinite loops. (See below for more details on caching.)
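
A minimal, self-contained sketch of this memoization trick (the handler signatures are made up for illustration):

```python
# Two deliberately circular handlers: `positive` consults `negative`
# and vice versa.  Without the guard below, this would loop forever.
HANDLERS = {
    'positive': lambda e, ask_: True if e > 0 else (
        False if ask_('negative', e) else None),
    'negative': lambda e, ask_: True if e < 0 else (
        False if ask_('positive', e) else None),
}

def ask(predicate, expr, _cache=None):
    if _cache is None:
        _cache = {}                 # fresh cache per top-level query
    key = (predicate, expr)
    if key in _cache:
        # either a previously derived fact, or None if this very
        # question is already being evaluated further up the stack
        return _cache[key]
    _cache[key] = None              # recursion guard: "I don't know"
    result = HANDLERS[predicate](expr, lambda p, e: ask(p, e, _cache))
    _cache[key] = result
    return result

assert ask('positive', 3) is True
assert ask('positive', -2) is False
assert ask('positive', 0) is None   # circular query bottoms out safely
```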

The second issue is almost by design: we want to disentangle assumptions from the core. But they still have to be stored somewhere. Always passing around all assumptions explicitly is quite cumbersome, and in any case would require drastic changes all throughout sympy. Hence the new system provides a list of global assumptions, which are always implicitly relied on. But this makes it essentially impossible to write side-effect-free functions involving sympy objects (one is never allowed to use the assumptions system, or call any function using the assumptions system). Since caching is used heavily throughout sympy, this is actually a really big issue.

Naively one might consider just flushing the cache upon any change to the global assumptions. It seems initially plausible that this should be a rare occurrence. But that is not so: currently, the only way to implement `Symbol('x', positive=True)` using the new assumptions is to alter global state, so any creation of new symbols with assumptions on them would cause cache flushing. This is very undesirable, since it makes the performance of an algorithm which relies on caching sensitive to changes in almost completely unrelated parts of sympy.

The problems caused by the `Symbol('x', positive=True)` pattern are exacerbated by the last issue mentioned above: if it is implemented naively (as described), then the global_assumptions will grow whenever such symbols are created (and not shrink again, at least naively), so because of the linear scaling characteristics of the current implementation of `ask()`, sympy will become unacceptably slow in situations the old system could handle just fine.

It seems that solving the last three issues requires a set of interrelated measures. First of all, the pervasiveness of caching in sympy has to be reduced. Specifically, we should remove the global, "always-on" cache and replace it by smaller, algorithm-specific caches, which are cleared upon completion of the algorithm. The algorithm then has to be written in such a way that the assumptions related to objects are irrelevant for caching. For example TODO gruntz, expand. Another example is the use of the pattern `@cacheit def sort_key`. Here it is even more clear that the caching can be done locally (on each object separately). But probably the sort key should not depend on global assumptions in the first place, so there is not actually a real problem with caching here. In total, there are 83 cached functions in sympy. Note that the removal of the global cache was identified as desirable a long time ago. It tends to create very subtle problems which are incredibly difficult to debug. In this project, I plan to do the minimal changes which allow the system to maintain performance and correctness in the presence of the new assumptions. Nonetheless, it should be a good start towards removing the cache altogether.
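
The following sketch shows what such an algorithm-specific cache could look like (all names are hypothetical): results are memoized only while a `local_cache()` block is active, and discarded when the algorithm completes, so stale global state cannot leak between unrelated computations.

```python
import functools
from contextlib import contextmanager

_cache_stack = []

def local_cacheit(func):
    """Like @cacheit, but only active inside a local_cache() block."""
    @functools.wraps(func)
    def wrapper(*args):
        if not _cache_stack:
            return func(*args)      # no algorithm running: no caching
        cache = _cache_stack[-1].setdefault(func, {})
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]
    return wrapper

@contextmanager
def local_cache():
    """Open a cache for the duration of one algorithm invocation."""
    _cache_stack.append({})
    try:
        yield
    finally:
        _cache_stack.pop()          # results cannot leak out
```

Algorithms like gruntz or expand would then wrap their top-level entry points in `with local_cache():`, so that clearing happens automatically on completion.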

As discussed in this email, evidence seems to suggest that at least in its current form, even algorithms which do not explicitly use caching heavily rely on it for performance. In particular, bulk object creation seems to be slow. This is expected in the old assumptions system. It is not clear if removing the performance penalty of aggressive forward chaining will be enough to speed things up acceptably; instead it seems likely that some level of general caching has to be retained.

Secondly, the `Symbol('x', positive=True)` pattern cannot be implemented naively (i.e. by appending the assumption to the global list). The reason is that if we do this, existing client code which repeatedly creates and destroys many small objects with assumptions on them will swamp the global assumptions list. Even if `ask()` is improved to depend logarithmically on the number of global assumptions, this would likely not yield acceptable performance. One might additionally store only weak references in the assumptions list (so as to allow it to shrink again), but since there are no guarantees on garbage collection in python, one should not rely on it for performance. Instead, it seems sensible to store the `positive` assumption on the symbol itself, at least for an interim period while we still allow this pattern. This simply seems to yield the best and most stable performance. Note that this associates extra data with sympy objects in a non-standard way, and so should be done cautiously. In particular, in much of sympy we expect `expr.func(*expr.args)` to return an equivalent of `expr`. This pattern does not work with the Symbol class, so it is OK to associate the data in a non-standard way there, but not in general.
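
A minimal sketch (not a proposal for the final API) of keeping the facts on the symbol itself, where `ask()` can collect them without consulting any global state:

```python
class Symbol(object):
    def __init__(self, name, **assumptions):
        self.name = name
        self._local_facts = assumptions   # e.g. {'positive': True}

def local_assumptions(symbols):
    """Gather per-symbol facts for ask(); `symbols` would be the
    free symbols of the expression being asked about."""
    return dict(((s, pred), value)
                for s in symbols
                for pred, value in s._local_facts.items())

x = Symbol('x', positive=True)
assert local_assumptions([x]) == {(x, 'positive'): True}
```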

So there would be several ways to associate facts with expressions:

  1. pass them directly to `ask()`
  2. store them in the global assumptions
  3. store them on the symbol

Traditionally, the only possibility was (3), so this is the only case where we have to deal with legacy clients. So for (1) and (2), we can impose any reasonable performance characteristics. In particular it is clear that method (1) cannot be used for bulk assumptions (i.e. assumptions passed to `ask()` directly can in principle be completely fresh, and must all be analyzed), and this should be documented. A sufficiently refined implementation of `ask()` (which indexes global assumptions by the symbols involved in them) can be logarithmic in the number of global assumptions + symbols they involve (for bounded size of the expression being asked about, and assuming good behaviour of the SAT solver). Hence (2) is OK to use with bulk assumptions as long as they are cleared consistently. But the new `assuming` context manager does precisely that, so we should make it clear that this is the correct way for client code to introduce assumptions (and in principle as many as they like).
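
A sketch of such an index (a hypothetical class, not existing code): assumptions are keyed by the symbols they mention, so a query only ever sees the facts that could possibly be relevant to it.

```python
from collections import defaultdict

class AssumptionIndex(object):
    def __init__(self, free_symbols):
        # free_symbols(obj) extracts the symbols a fact or expression
        # mentions; in sympy this would be the .free_symbols attribute
        self._free_symbols = free_symbols
        self._by_symbol = defaultdict(set)

    def add(self, fact):
        for sym in self._free_symbols(fact):
            self._by_symbol[sym].add(fact)

    def discard(self, fact):
        for sym in self._free_symbols(fact):
            self._by_symbol[sym].discard(fact)

    def relevant_to(self, expr):
        """Only these facts need to be handed to the SAT solver."""
        facts = set()
        for sym in self._free_symbols(expr):
            facts |= self._by_symbol[sym]
        return facts
```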

Plan

The above discussion makes it clear that this GSOC project will naturally consist of three phases:

  1. preparation: bring the new assumptions up to speed
  2. switchover: remove the old assumptions
  3. stretch goals

In phase (1), the old assumptions are still in place, and the improvements to the new system discussed above are implemented. That is:

  • improve robustness of the new system
  • implement new assumptions on symbols
  • improve the handling of global assumptions in `ask` to scale logarithmically
  • replace the global cache by local caches

None of these tasks is easy, but thankfully they are all isolated (the last one not quite as much as the others).

Likely the most demanding part of this project is phase (2). There have been numerous attempts at doing this (TODO links). These have made it clear that the new and old assumptions are subtly incompatible, and that such a switchover is bound to touch virtually all of sympy. In particular, it seems very unreasonable to expect to be able to do the entire switchover in one, reasonably-sized branch. Instead, it seems advisable to expend some effort into being able to use the new and old assumptions systems alternately, controlled by a simple switch in the code. Since by default objects will not be carrying any assumptions, this should not be difficult to implement (i.e. the assumptions system uses the Basic class, but Basic objects will be instantiated without needing to refer to the assumptions system). Then all the modules of sympy can be updated one after another, in such a way that they work with both the new and the old system. Eventually, all tests will pass with both the new and the old system, and the old one can be retired. The point of having such a switch is that it allows code fixing problems for the new system to be merged into master, without requiring it to fix all problems with the new system at once.

It should be made clear that this switch inside the code to use either the old or new system should exist for as short a period of time as possible, and never be exposed to users. It is purely a development tool.
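
For concreteness, the switch might look roughly as follows (the flag and helper names are made up):

```python
USE_NEW_ASSUMPTIONS = False    # flipped in test runs, never by users

def query_positive(expr):
    if USE_NEW_ASSUMPTIONS:
        from sympy import ask, Q
        return ask(Q.positive(expr))   # new-style query
    return expr.is_positive            # old-style attribute query
```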

Phase (3) will start reaping the benefits of the new system. TODO more detail?

Tentative schedule

In my experience, estimating a schedule for software projects is next to impossible. So the following is very tentative.

  • week 1 implement local caching (memoization) system, use it in ask; classify all uses of @cacheit (2 PRs)
  • week 2 use the new caching system in gruntz, expand (2 PRs)
  • week 3 fix remaining issues with disabled cache
  • week 4 remove linear scaling from ask (about 2 PRs)
  • week 5 add new assumptions to symbols (and have ask retrieve them), implement static consistency checking for handlers (2 PRs)
  • week 6 implement "the switch" (including deprecating @cacheit) and start fixing issues (2 PRs?)
  • weeks 7 - ? fix issues, performance problems; remove old assumptions
  • week ?+1 write detailed documentation of the new assumptions system (1 PR)
  • weeks ?+2 - end stretch goals

I realise this covers only half of the GSOC period, but I believe being more specific, or planning further ahead, would be delusional.

Stretch goals

The following ideas have come up during discussions of the new assumptions system. To some extent they motivate the entire "removal of the old assumptions" business: they illustrate that people have many ideas about which direction to take the assumptions system in, but none of them seem feasible in the old system. They are less well-defined than the plan above. My idea is that during phases one and two of this GSOC project, I will come to understand in more detail the possibilities and weaknesses of the new system, allowing me to refine the ideas stated below and assess their feasibility. Then additional, informed discussion can take place as to which of them to implement.

dynamically defined assumptions

The current assumptions system is not extensible without changing the core. In the current implementation of the new assumptions, adding new types of assumptions requires changing the assumptions module, because all assumptions are statically defined there.

We should allow clients to dynamically define new assumptions, and how they relate to old assumptions. We want it to be possible to define new assumptions without changing sympy (so as to increase reusability, and improve the status of non-sympy modules). But also we want modules of sympy to be able to define their own assumptions, without needing to change the assumptions module. This is because (a) maintaining a huge list of assumptions and facts within the assumptions module is going to be very difficult, and perhaps more importantly, (b) we do not want optional modules (e.g. matrix expressions) to slow down the startup of sympy.
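
A sketch of what a registration API could look like (none of these names exist in sympy): a module defines its own predicate, and its relations to existing ones, at import time, without the assumptions module having to know about it in advance.

```python
REGISTERED_PREDICATES = set()
REGISTERED_FACTS = []

def register_predicate(name, implies=()):
    """Declare a new predicate and record its implications for the
    inference engine (hypothetical API)."""
    REGISTERED_PREDICATES.add(name)
    for consequence in implies:
        REGISTERED_FACTS.append((name, consequence))

# e.g. the matrix expressions module, on import, could declare:
register_predicate('invertible', implies=('square',))
```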

transforming the current autosimplification system of sympy into autorefinement

One idea which has always been closely linked to the new assumptions system is abolishing, or drastically changing, autosimplification. For example, if `p` is a positive symbol, then `abs(p)` will automatically evaluate to `p`, which sometimes leads to undesirable behaviour. The grand idea to replace this seems to be to have only extremely minimal (if any) simplification done automatically, and everything else initiated by a new function called `refine`. In particular, `refine` should take as a second argument additional assumptions, which should be used in simplifying the expression.

Independent of whether or not we want to make as drastic a change as abolishing most autosimplification, once the old assumptions are removed, `refine` will become a first-class citizen of sympy. This is going to lead to code duplication: for example, the `abs` function knows how to simplify itself if the argument is positive. But `refine` also needs to know this, in case the positivity is passed in as an "after-the-fact" assumption.

One solution, which certainly seems useful in the short term, is to make refine use the existing autosimplification logic, by rebuilding the expression with new global assumptions:

```python
def refine(expr, *facts):
    with assuming(*facts):
        expr = rebuild_tree(expr)
    # do everything else we would normally do here
    return expr
```

This will make `refine` at least as powerful as the autosimplification. This is still not entirely satisfying, because simplification logic is now in two completely separate code locations. Further steps depend on where we want to go with autosimplification, and where we want the code to live. (For example, should `sin(pi)` automatically turn into `0`? If not, should `refine` do this? It doesn't sound like refinement to me.)

What can be done as an easy next step is to identify logic inside sympy classes (typically in `eval`) which definitely belongs to the refine module. Move the code there, and add a call to `refine(self)` (so using only the global assumptions) at the end of (e.g.) `eval`.
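
For example, a refine handler for `abs` might look roughly like this (sympy's refine module already contains code in this spirit); `eval` would keep only the assumption-free canonicalization and delegate the rest here:

```python
from sympy import ask, Q

def refine_abs(expr, assumptions):
    arg = expr.args[0]
    if ask(Q.nonnegative(arg), assumptions):
        return arg                  # abs(x) == x for x >= 0
    if ask(Q.nonpositive(arg), assumptions):
        return -arg                 # abs(x) == -x for x <= 0
    return expr                     # nothing to conclude
```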

improving refine() using logpy

Currently all autosimplification and refinement code is procedural and ad-hoc. It seems useful to try and encode simplification rules in a more declarative fashion, and let a kind of general simplification engine attempt to do the simplifications. This will likely lead to much more maintainable code.

One attempt to do this is in pull request 1996, using a logic programming system for python. I could investigate how much of our current code can be offloaded in this fashion.

reorganize the handlers code

Special rules/code for answering queries about objects is, essentially, indexed by both expression type and query. For example, methods of answering `ask(Q.positive(expr))` depend both on the fact that we want to know whether something is positive, and on what that something is.

In the old system, implementations of such functionality are grouped by object type: classes can implement `_eval_is_positive` and the like to supply extra information. This makes a lot of sense for classes like special functions (certainly we expect the code telling us when the gamma function is positive to be near the code telling us where it is real), but not so much for core data structures like `Add` and `Mul` (surely we expect the code telling us when a sum is positive to live near the code telling us when a product is positive).

In the (current implementation of the) new system, such code is grouped by query: all code to decide positivity lives in the `AskPositiveHandler`, all code to decide boundedness lives in the `AskBoundedHandler`, etc. This merely swaps which of the two groupings is awkward. (There is, however, support for having several handlers answer the positive query; this will just call them all in turn.)

So it seems useful to allow both patterns. While we are at it, it probably makes sense to think about where such code should live, and in what ways we want it to operate. (Clearly, extending `ask` to support both patterns as special cases of something more general seems easy enough, but this should be preceded by a discussion of what kinds of patterns we actually want to encourage.)
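
One possible shape for this (a hypothetical API, purely for illustration): handlers keyed by (predicate, expression type), so that code can be grouped in the source either by class or by query, whichever fits.

```python
HANDLER_TABLE = {}

def register_handler(predicate, expr_type):
    def decorator(func):
        HANDLER_TABLE.setdefault((predicate, expr_type), []).append(func)
        return func
    return decorator

@register_handler('positive', 'Mul')
def mul_positive(expr):
    """Grouped by query: lives next to the other sign rules for
    core data structures."""

@register_handler('positive', 'gamma')
def gamma_positive(expr):
    """Grouped by class: lives next to the rest of the gamma
    function's handlers."""
```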

better interaction of inference engine and handlers

The old assumptions system uses backward chaining: given an expression `e` and a property `foo`, in order to answer `is_foo(e)`, it first tries calling the `foo` handler for `e`. If this is not successful, it looks at a list of all properties `bar` which, under some circumstances, imply `foo`. Then, recursively, each of them is investigated (first using the handler, then looking at the properties implying it), etc.
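
A sketch of this strategy (the data structures are illustrative, not the old system's actual ones):

```python
HANDLERS = {}    # (type, property) -> direct rule for that class
IMPLIED_BY = {'nonzero': ['positive', 'negative']}

def backward_ask(e, prop, _seen=None):
    _seen = _seen if _seen is not None else set()
    if prop in _seen:
        return None                # already under investigation
    _seen.add(prop)
    handler = HANDLERS.get((type(e), prop))
    if handler is not None:
        result = handler(e)
        if result is not None:
            return result
    for stronger in IMPLIED_BY.get(prop, ()):
        if backward_ask(e, stronger, _seen):
            return True            # a property implying prop holds
    return None
```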

In the (current implementation of the) new system, the handlers and logic inference are completely separate. For example, given an expression `e`, suppose we call `ask(Q.nonzero(e))`. In the absence of global assumptions, the current implementation of `ask` is going to call the `nonzero` handler for `e`, and if this is inconclusive, terminate. Even if there is a conclusive `positive` handler, it is not going to be called.

This seems clearly undesirable, and once the new system is in place, we should investigate how to link the handlers and the logic inference code more closely together.
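
One possible coupling, sketched with made-up names: run every handler first, then close the results under the known implications, so that a conclusive `positive` handler settles a `nonzero` query even though no `nonzero` handler fired.

```python
IMPLICATIONS = {('positive', True): [('nonzero', True), ('real', True)]}

def ask_linked(pred, expr, handlers):
    known = {}
    for p, handler in handlers.items():
        verdict = handler(expr)
        if verdict is not None:
            known[p] = verdict
    changed = True
    while changed:                 # forward-chain to a fixed point
        changed = False
        for (p, v), consequences in IMPLICATIONS.items():
            if known.get(p) == v:
                for q, w in consequences:
                    if q not in known:
                        known[q] = w
                        changed = True
    return known.get(pred)

# with only a positive handler available, nonzero is still answered:
assert ask_linked('nonzero', 3, {'positive': lambda e: e > 0}) is True
```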