ANTLR: Parsing SQL - go-sqlparser/current GitHub Wiki

Parsing SQL

https://tomassetti.me/parsing-sql/

Written by Gabriele Tomassetti
in ANTLR, Code processing, Parsing

The code for this tutorial is on GitHub: parsing-sql

Table of contents

Parsing SQL

SQL is a language to handle data in a relational database. If you worked with data you have probably worked with SQL.

It is in the same league of HTML: maybe you never learned it formally but you kinda know how to use it. That is great because if you know SQL, you know how to handle data. However, it has limitations and when you hit them your only course of action might be to work with a traditional programming language. This does not necessarily mean migrating away from SQL. You might need to move from one SQL dialect to another one or to analyze the SQL code you use. And to do that, you need to parse SQL, and that is what this article is about.

What are the limits of SQL? After a while you learn of a couple of issues:

  • there is not one SQL, but many variations of it. The SQLs implemented in SQLite, MySQL, PostgreSQL, etc. are all a bit different
  • you cannot do everything related to data in SQL. In some cases, you need a traditional programming language to work with the data

These issues became real problems when you need to make big changes. For instance, if you need to change the database used by a large application. It can also be a problem when you need to make transformations that SQL and your database cannot handle them. In that case, you have some transformations done in SQL (and run on the database) and some other in your source code. So you spend a lot of time working on glue code and around the limitations of SQL.

SQL can also be a constraint, simply for what it lacks: unit testing integrated with your source code and all the other tools that you can use with Java, C#, etc. SQL is an old language and not designed for large scale programming, it is not a language that developers will love. Even worse, they will not be very productive with it. And what happens if your application requires you to verify that something gets executed in a certain way, or to offer some guarantees? It can be a business need or a regulatory requirement. Then you need to parse SQL or to find a way to move from SQL world to your programming language world.

That is what this article is about: parsing SQL. We are going to see ready-to-use libraries and tools to parse SQL, and an example project in which we will build our own SQL parser.

What is SQL

SQL (Structured Query Language) is a domain-specific language designed to handle data in relational database. It is a declarative language, so you describe what you want to achieve (e.g., get me a row of data with id=5) and not how to achieve it (e.g. loop through the rows until id=5). It has a formal mathematical foundation in relational algebra and calculus. The language is an ISO/IEC standard that is periodically updated, the latest version is SQL2016. This means that there is a very formal and clear description of the language if you need it.

SQL is designed to handle many aspects of the life cycle of working with data: yes, you can query data, but you can also create the format of data (i.e., the schema of a table) and regulate access control to the data.

SQL Procedural Extensions

Actual SQL implementations come with procedural extensions that implement some form of procedural programming. These are even more varied from database engine to database engine. They were added to perform complex elaborations on the data directly on the database, so they can be as powerful as traditional programming languages. And this can also be an additional headache if you try to parse SQL. See, for instance, this is a PL/SQL (the procedural SQL from Oracle databases) example from Wikipedia.

DECLARE
    var NUMBER;
BEGIN
     /*N.B. for loop variables in pl/sql are new declarations, with scope only inside the loop */
     FOR var IN 0 .. 10 LOOP
          DBMS_OUTPUT.PUT_LINE(var);
     END LOOP;
     IF (var IS NULL) THEN
          DBMS_OUTPUT.PUT_LINE('var is null');
     ELSE
          DBMS_OUTPUT.PUT_LINE('var is not null');
     END IF;
END;

Resources

As we have seen, handling SQL can be a daunting task, so let's see a few resources to help you. They range from grammars to kick start your parsing efforts, to ready-to-use tools.

Knowledge

Official Sources

The latest official SQL standard is formally known as ISO/IEC 9075 SQL:2016. You can find every information you need about it in the official sources either the ISO or IEC websites. However, keep in mind that there are several documents describing the standard and you have to pay for each of them. This means, depending on the exchange rate, the total cost could be more than 2,000 USD.

Official References from Database Documentation

Other than the official SQL standard, you can look up the official documentation of the database producers, which contains a reference for their SQL implementation. Here is a list of the few major ones:

Grammars

There are a few (usually partial) grammars available in different formats. They can be a good starting point for getting where you need.

Libraries

There are many libraries to parse SQL in different languages. Some support different databases and different programming languages. Here it is a list of the most used ones. Unless otherwise noted, the libraries are released under an opensource license.

Multi-lingual and/or multi-databases

  • General SQL Parser is a commercial library that supports many databases (DB2, Greenplum, Hana, Hive, Impala, Informix, MySQL, Netezza, Oracle, PostgreSQL, Redshift SQL Server, Sybase, and Teradata) and languages (C#, VB.NET, Java, C/C++, Delphi, VB). It can validate SQL syntax, format SQL and work with the parse tree
  • JSqlParser parses an SQL statement and translates it into a hierarchy of Java classes. It is opensource, with a double LGPL and Apache license, and supports many databases: Oracle, SqlServer, MySQL, PostgreSQL, etc.

MySQL parsers

  • Pingcap parser is a MySQL parser in Go.
  • xwb1989/sqlparser is a MySQL parser for Go. This parser has been extracted from Vitess, a database clustering system for horizontal scaling of MySQL.
  • PHP SQL Parser is a (mainly) MySQL (non-validating) parser written in PHP. It can parse other SQL dialects with some modifications. It fully supports parsing the most used SQL statements, but it just returns some information on other statements.
  • The SQL Parser of phpmyadmin is a validating SQL lexer and parser with a focus on MySQL dialect. Given its use in PHPMyAdmin, it is certainly well tested.
  • js-sql-parser is SQL (only select) parser for JavaScript that parses the MySQL 5.7 version of SQL into an AST.

SQLite Parsers

  • sqlite-parser is a parser for SQLite v3 written in JavaScript that generates ASTs.

SQL Parsers

  • sql-parser is a parser for SQL written in pure JavaScript. It is not maintained anymore and it only supports some SELECT queries, but it is probably better than starting from scratch if you need to use JavaScript.
  • hyrise/sql-parser is a SQL parser for C++. It parses the given SQL query into C++ objects. It is developed together with Hyrise, an in-memory database, but it can be used on its own.
  • andialbrecht/sqlparseis a non-validating SQL parser for Python. It provides support for parsing, splitting and formatting SQL statements.
  • K2InformaticsGmbH/sqlparse is a production-ready SQL parser written in pure Erlang. It targets the Oracle PL/SQL dialect.
  • sqlparser-rs is SQL parser written in Rust. It supports the SQL-92, plus some addition for MS-SQL, PostgreSQL, and SQL:2011. The developers say: if you are assessing whether this project will be suitable for your needs, you'll likely need to experimentally verify whether it supports the subset of SQL that you need.
  • moz-sql-parser is a peculiar SQL parser in Python written by Mozilla. The primary objective of this library is to convert some subset of SQL-92 queries to JSON-izable parse trees.
  • queryparser is a parser written in Haskell for parsing and analysis of Vertica, Hive, and Presto SQL.

Tools

From SQL to a Programming Language or another SQL

  • IO64 offers a tool to convert PL/SQL to Java. We have also seen this tool in our previous article Convert PL/SQL code to Java
  • Ispirer is a company that provides a tool to perform database migration from different SQL dialects to another SQL dialect or a programming language: Ispirer MnMTK. We have seen it in our previous article Convert PL/SQL code to Java
  • SQLines SQL Converter is an open-source tool to convert a SQL dialect to a different SQL dialect. It is written in C++ and implements its own SQL parser.

How Hard Could it be to Parse a Declarative Language?

Declarative languages tend to have simpler structures than your average programming language. For instance, you do not see many nested lambdas used inside declarative languages. A program in a declarative language is usually a long list of simple statements. The issue is that any single statement can be fairly complicated in some cases. And that is what happens in SQL.

Furthermore, there are actually several versions of SQL. This is true both in the sense that each database can implement it differently, and that there are many versions of the SQL standard because the language is evolving.

As an example, you are certainly familiar with a SELECT statement. In its basic format, like SELECT name FROM customers WHERE id = 1 it is quite easy to parse. However, the complete statement can be fairly complex. This image represents the SELECT statement as implemented in SQLite.

So, to implement support for parsing the general case will certainly take some time.

Try to Parse for Your Application

The reality is that to implement support for parsing the whole SQL requires a lot of effort, probably as much as parsing any average programming language. Even more, if you need to parse many different SQL dialects. Then, if you need to parse SQL and cannot rely on a ready-to-use library, you should start with an analysis of your needs. You should try to understand first exactly what you need to parse and then try to parse in light of your application. This means both to parse just what you need and to structure the grammar in a way that makes your life easier.

For instance, if you just need to translate a series of SQL files just once, it may be acceptable to just ignore 40 complex statements. Maybe translating them manually would take less time than creating a large grammar to do it automatically. This does not work if you either have a lot of code or you need to do the translation repeatedly, not just once.

In this tutorial we are going to use this pragmatic approach: we are going to create a SQL grammar from scratch to parse just what we need and to ignore the rest.

Writing a (Very Partial) Grammar

In our example, we will parse a simple SQL file (an exported database) to generate classes that could represent that database. We are going to ignore everything else, even any exported data present in the file. That's it. The advantage of dealing with SQL files is that we do not need to interact with a database. In fact, the database might not even exist anymore.

Designing a Partial Grammar

To do that we are going to start with a grammar. We are going to write a very simple one, but there are a couple of things to keep in mind:

  • SQL is not case-sensitive, while ANTLR grammars by default are
  • Our input (the SQL files) will contain some data that we do not care about. We still need to parse the file successfully while ignoring the extraneous input

These are notable issues that are specific to SQL and to our approach; probably they would not appear with other languages or if we wanted to create a complete grammar.

Before starting, just to give you an idea of what we have to deal with, this is a sample SQL file.

.
.
.

.
.
.

Summary

In this article, we have seen how to parse SQL. In general, our advice is to:

  1. Consider if you can use existing tools or libraries to process SQL code. Some of them even support multiple SQL-dialects or multiple programming languages. If you can use any of them for your needs they should be your first option
  2. If you want to build a solution in-house you may consider starting from the few ANTLR grammars available for the major SQL databases. They can really help you get started in parsing SQL
  3. If you are stuck with a less common SQL database you might be in trouble. Parsing SQL from scratch it is going to be hard work: the language has a fairly simple structure, but it is large, it has many variations and in some cases even procedural extensions. In this article, we have seen a few tricks to start parsing even with a partial grammar. This can be a good way if you need to parse only small parts of the language

If none of these options work for you, and you need some commercial option supporting you in building a solution for your needs, we at Strumenta may be able to help. For example, through our commercial PL/SQL parser.