GSoC 2014 Application Aditya Shah SymPy Parsing Framework - sympy/sympy GitHub Wiki

#Personal Details

Name: Aditya Shah

University: Birla Institute of Technology and Science-Pilani

Email and Google Code: [email protected]

IRC Handle: adityashah30

GitHub: adityashah30

##Bio I am a third year undergraduate pursuing B.E. in Computer Science at Birla Institute of Technology and Science-Pilani, Goa Campus. I have covered a wide range of subjects ranging from Data Structures and Algorithms, Object oriented programming, Operating systems, Computer Architecture. I am currently pursuing Compiler construction which led to my interest in this particular project.

#Programming Details

##Platform and Editor Details I use an Ubuntu 12.04 LTS box as my primary work machine. I am quite familiar with the concept of version control especially Git. I use Windows 8.1 occasionally for developmental purposes (especially to test cross platform capabilities). I generally use Sublime Text 2 as my primary editor because of the immense repository of add-ons that enhance the capabilities of the editor greatly.

##Programming Experience I am quite proficient in C, C++, Java and Python for programming. I generally use Qt and PyQt for GUI development. As a part of my course work, I have used Java for Object Oriented Programming and Data Structures and Algorithms. I have used C/C++ for Operating Systems and Computer Networks. I have mostly used Python for academic research projects (done under the guidance of the faculty here at BITS). My first project was creation of a system that facilitates data migration between 2 incompatible database systems, where I used python for data pattern matching and data extraction. Regular expressions were extensively used in the project. My second project (which I am currently pursuing) is the creation of a NLP (Natural Language Processing) enabled Linux Super Shell, which uses feature extraction and vectorization of queries to generate the relevant output regarding to Linux command to be used. It allows any Linux newbie to learn Linux efficiently and quickly.

##Python and SymPy I have been using Python for over a year now and I have become quite proficient at it. I would say that the feature of Python that I miss the most in the other languages I use is the lambda statements borrowed from the functional language paradigm. Lambda statements allow very efficient code construction by locally defining non-recurring functions. I was introduced to SymPy by a friend, who too is working with Sympy. I admit that I am quite new to SymPy, but I am quite impressed by the range of functions that it can execute. I am drawn in particular to the parsing module of SymPy, having done projects on the subject myself.

###Contributions to SymPy All my past work for SymPy has been regarding improvement of the Parsing module present in SymPy, under the guidance of Aaron Meurer (@asmeurer) and Sergey Kirpichev (@skirpichev). Following of my PRs were merged-

Fixed a bug in mathematica parser module regarding its inability to parse functions when arithmetic expressions are passed as arguments. Issue: 1160 . PR: https://github.com/sympy/sympy/pull/2955
Added functionality to sympy_parser.py module to allow for conversion (“Sympification”) of standard Python lambda statements to their Lambda equivalents. Issue: 3051. PR: https://github.com/sympy/sympy/pull/2965

#The Project

##Project Abstract The project aims at standardizing the current parsing module present in SymPy. It allows the developer and the user to extend the immense computational power of SymPy by inclusion of code snippets and even entire programs (in time) written in contemporary CAS such as Mathematica or Math Spec Languages such as Latex, by developing a standard framework by which parsers for the aforementioned languages can be written efficiently. The project also aims at developing working parsers for Mathematica and Latex using the created framework. The last part of the project consists of a rudimentary natural language (English) parser for SymPy, which allows SymPy to gather functionality similar to that of exhibited by WolphramAlpha in the area of natural language query processing and producing relevant outputs.

##Why this project I am deeply impressed with the flawless code organization and implementation of SymPy, all except the parsing module. The current parsers for Mathematica and Maxima are more of a hack, as they don’t cover all the cases possible and they can parse and convert small expressions at best. So, in order to change this situation, according to me, a standard has to be enforced which will enforce completeness in the parsers and also enable rapid development of existing and new parsers.

##Qualifications I have completed a course on Theory of Computation, in which I learned all about Regular languages, Finite State Automata, Context Free Languages, Push Down Automata and Turing machines. So, I have a deep understanding of how languages are defined and how corresponding matching machines are created. I am currently pursuing a course on Compiler Construction, where I have learnt about the various parsing techniques used by the compilers and how they make use of standard algorithms to compile programs.

##Implementation Details Here I present the details of how the project will be implemented. The project comprises mainly of 4 parts.

Language Spec File
Spec to Grammar Converter
Parser Generator Framework (PGF)
Abstract Syntax Tree (AST) Processor

###Language Spec File

This file contains the details of the grammar to be processed by the PGF.
The grammar is specified in Extended Backus-Naur Form (EBNF).
It doesn’t contain the grammar in its entirety. The reason behind this is that I observed that most of the Math-Spec Languages tend to have similarities in their grammar which leads to unnecessary repetition of generic rules.
So here we define only the rules that differ from the predefined rules which are open to inspection and modulation whenever necessary. So the conflicting rules in the predefined file can be overridden and new rules can be added to the grammar.
This architecture allows for a flexible approach to grammar design by implementing the concept of code reusability.

###Spec to Grammar Converter

Most of the PGFs use python code to create the final parser. So we need to convert the simple spec file into equivalent python grammar.
This module also serves to analyze the spec file and combine the user’s spec file with the predefined spec file, while taking care of the overridden portions of the grammar as defined by the user.

###Parser Generator Framework

This module uses a preexisting parser generator such as “modgrammar” which takes grammar in form of python code and generates the parser that recognizes the specified grammar.
Here, I choose modgrammar because of its ability to recognize ambiguous grammars in case such a grammar is provided. (Modgrammar is a GLR grammar parser).

###AST Processor

This module processed the AST generates by the Parser and convert it to relevant tokens to pass to sympify module to be converted to SymPy expression.

###Final Block Diagram

Language Spec File ==> Spec to Grammar Converter + Predefined Spec File ==> Parser Generator Framework ==> Parser

###Parser in Action

Program ==> Parser ==> AST Processor ==> Sympify ==> SymPy Code

###Prototype API Here I demonstrate the essence of the framework using a very simple example.

####Predefined Spec File

@R1
E → E + T | T
@R2
T → T * F | F
@R3
F → (E) | id

####User's Language Spec File

@Name Sample_Parser
@Output grammar1.py
@Override R3
F → (E) | alphabet | num

####Spec To Grammar Converter The program combines both the grammar to yield the final grammar

@Name Sample_Parser
@Output grammar1.py
@R1
E → E + T | T
@R2
T → T * F | F
@R3
F → (E) | alphabet | num

The program actually generates the final grammar as “grammar1.py” file to be inputted to modgrammar.
Please note that this is not the final file to be passed an an input to PGF. The file which is passed as an input is a Python file which encapsulates the above grammar.
Also please note that, in the case that the user already knows the format of the input file to the PGF and doesn't want to be bothered with the Spec to Language Converter, he/she is at a complete liberty to do so. The user can directly write the input file (generally in Python) which then is passed as input to the PGF to generate the corresponding parser.

####Parser Generator Framework A PGF such a modgrammar takes in the “grammar1.py” and outputs a parser named “Sample_Parser.py” which recognizes the strings belonging to the grammar.

####AST Processor in Action

Suppose the string “(x*2) + 3” is given to the parser to convert to equivalent SymPy expression.
The parser generates the following AST (Abstract Syntax Tree):

					E
				E	+	T
				T		F
				F		num
			(	E	)	3
				T
			T	*	F
			F		num
		 alphabet    2
			x

This tree is then processed to finally generate the Sympy tokens by sympify.

(OP, (), (alphabet, 'x'), (OP, *), (num, 2), (OP, )), (OP, +), (num, 3)

These tokens are finally processed to generate the final SymPy expression

(Symbol('x')*2) + 3

###Final Implementations Using the above mentioned framework, I intend to create parsers for following

Mathematica
MathML
Latex
Natural Langauge (English)

##Timeline This is the tentative schedule that I intend to follow during the summer. This schedule is a bit pessimistic in nature to account for contingencies should any of them arise. This schedule has been created with a 40 hour week in mind. The slots have been created such that the work is compartmentalized and it corresponds to a PR from my side for evaluation and integration.

###Community Bonding Period Objectives-

Ensure that I am sufficiently acquainted with the concepts of Parsing and Code generation before I commence my coding tasks.
Chalk out basic structure of the modules and code organization.
Study the existing parsing module, “sympy_parser.py”, “sympy_tokenize.py” and “sympify.py” in great details and understand all the aspects of the code.

###Week 1 Objectives-

Research various aspects of syntax of Math-Spec languages.
Based on my research, formulate the generic set of grammatical rules.
This part concluded the formation of the Predefined Spec File to be provided with the SymPy codebase.

###Weeks 2-3 Objectives-

Implementation of Spec to Grammar Converter.
This part concludes with the successful creation of the python file that serves as input to the Parser Generator Framework (modgrammar).

###Week 4 Objectives-

Write unit tests for the Spec to Grammar Converter to ensure soundness and completeness.
Finish the documentation of Spec to Grammar Converter.
Buffer period to ensure PGF generates proper parser.

###Weeks 5-6 Objectives-

Implementation of AST Processor.
This part concludes with the successful generation of the final tokens to be passed to the existing SymPy parsing modules.

###MIDTERM EVALUATION

###Week 7 Objectives-

Write unit tests for the AST processor to ensure soundness and completeness.
Finish the documentation of AST processor.
Buffer period to ensure proper tokens are generated by the AST processor.

###Week 8 Objectives-

Creation of parser for Mathematica.
Write unit tests to ensure soundness and completeness.
Document the parser.

###Week 9 Objectives-

Creation of parser for MathML.
Write unit tests to ensure soundness and completeness.
Document the parser.

###Week 10 Objectives-

Creation of parser for Latex.
Write unit tests to ensure soundness and completeness.
Document the parser.

###Week 11 Objectives-

Creation of rudimentary Natural Language Parser.
Write unit tests to ensure soundness and completeness.
Document the parser.

###Week 12 Objectives-

Make sure that the code is bug free.
Make sure that tests are written properly.
Make sure that all the code is properly documented.

###Week 13 Objectives-

Buffer Period.
Rectify mistakes if found any.

##Post GSoC I intend to continue working in the field of parsing with SymPy after GSoC. I am particularly drawn to the study of Natural Language Processing and Natural Language Query analysis. I intend to contribute to the Natural Language Parser that I design during the GSoC project to cover more possibilities and take care of as many contingencies as possible.

##References

Compilers Principles, Techniques and Tools by Aho, Lam, Sethi and Ullman
http://pythonhosted.org/modgrammar/
http://library.wolfram.com/conferences/devconf99/lichtblau/Links/index_lnk_10.html
https://reference.wolfram.com/mathematica/tutorial/TheSyntaxOfTheMathematicaLanguage.html
http://www.w3.org/TR/MathML3/
http://en.wikibooks.org/wiki/LaTeX/Mathematics