Understanding Fast parsers - tooltwist/xdata GitHub Wiki

Introduction

XData supports several data formats with "Fast" parsers. These are web-application-friendly parsers that, under some circumstances, can operate up to an order of magnitude faster than common XML and JSON parsers. While they are not suitable for every application, their use is encouraged wherever possible.

Why the Fast parsers are fast

When considering performance, there are three overheads to be considered:

Parsing the document.
JSON can be parsed very quickly, as it is a straightforward protocol [http://www.ietf.org/rfc/rfc4627.txt]. XML is slightly slower, but still relatively fast. In most cases, scanning the document is a relatively fast process.
Creating an object representation of the data.
In many cases, and especially for large documents, the time to convert to an object representation can be large. In some applications the overhead builds quickly, especially when many lists are loaded, or when each item in a list contains many fields. Often the application needs only a few of the fields in a JSON document, so creating a complete object for all the other data in the document is wasted effort.
Garbage collection.
A common mistake while benchmarking parsers is to simply measure the time to parse a document, without consideration for garbage collection. If a parser creates large numbers of objects and many references between objects, it can take considerable time for the garbage collector to clean up and release the memory.
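The benchmarking point can be illustrated with a small timing harness. This is a minimal sketch, not part of XData: the class `GcAwareTiming` and its methods are hypothetical names, and the workload is a stand-in for "parse a document into many objects". It uses the standard `java.lang.management` API to read the collectors' accumulated collection time, so that the garbage collection cost is reported next to the elapsed time rather than ignored.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not an XData class.
final class GcAwareTiming {

    // Sum of accumulated collection time (ms) across all collectors.
    // A collector may report -1 when the value is unavailable; those are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Runs the workload and returns { elapsed wall-clock ms, GC ms accrued during the run }.
    static long[] measure(Runnable workload) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        workload.run();
        System.gc(); // only a hint to the JVM; it may or may not collect here
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        return new long[] { elapsedMs, totalGcMillis() - gcBefore };
    }

    public static void main(String[] args) {
        // Stand-in workload: allocate many small objects, as an object-building parser would.
        long[] result = measure(() -> {
            List<int[]> nodes = new ArrayList<>();
            for (int i = 0; i < 200_000; i++) {
                nodes.add(new int[4]);
            }
        });
        System.out.println("elapsed ms: " + result[0] + ", GC ms: " + result[1]);
    }
}
```

The figures vary by JVM, collector and heap settings, so the sketch is only useful for comparing two parsers under the same conditions.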

For many parsers, the time to parse a single document does not seem like much, but in a highly loaded web environment the numbers quickly add up (page views x documents x objects in the document). In this environment the cost of instantiating millions of objects that will never be used, and then later garbage collecting them, can seriously degrade throughput.
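To make the multiplication concrete, here is a small worked calculation. The load figures are assumed purely for illustration and are not measurements of any real system; `AllocationEstimate` is a hypothetical name.

```java
// Illustrative arithmetic only; all input figures below are assumptions.
final class AllocationEstimate {

    // Objects per second = page views per second x documents per view x objects per document.
    static long objectsPerSecond(long pageViewsPerSecond, long documentsPerView, long objectsPerDocument) {
        return Math.multiplyExact(
                Math.multiplyExact(pageViewsPerSecond, documentsPerView),
                objectsPerDocument);
    }

    public static void main(String[] args) {
        // Assumed load: 500 page views/s, 4 documents per view, 2,000 objects per document,
        // of which the application reads only 20.
        long created = objectsPerSecond(500, 4, 2_000);
        long read = objectsPerSecond(500, 4, 20);
        System.out.println("objects created per second: " + created);
        System.out.println("objects actually read per second: " + read);
    }
}
```

Under these assumed figures a conventional parser would create 4,000,000 objects per second while the application reads only 40,000 of them; the remainder is allocated and then garbage collected without ever being used.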

DOM representation of Data
Figure 1. DOM Representation of a Document

To avoid this problem, XData's fast parsers leave the text representation of the JSON or XML data intact, and instead create an index into the string for each of the nodes in the document.

"Fast Parser" representation of Data
Figure 2. Fast Parser indexing into a Document

The index object contains a large array of pointers into the original document. In many cases a single index object is used, but a linked list is used as required. Values in the original document are instantiated as Java objects only at the time they are accessed.
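The indexing idea can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions, not XData's actual classes or API: `LazyIndexSketch` is a hypothetical name, the input is limited to a flat JSON object with unescaped string values, and the offset array is grown in place rather than chained as a linked list of index objects.

```java
import java.util.Arrays;

// Hypothetical sketch of offset-based indexing; not XData's implementation.
final class LazyIndexSketch {
    private final String text;            // original document, kept intact
    private int[] offsets = new int[16];  // 4 ints per field: keyStart, keyEnd, valStart, valEnd
    private int fields = 0;
    private int materialized = 0;         // how many values were turned into Java objects

    LazyIndexSketch(String text) {
        this.text = text;
        scan();
    }

    // Single pass over a flat object such as {"id":"42","name":"Ann"}.
    // Assumes string values only, with no escapes and no nesting.
    private void scan() {
        int i = text.indexOf('{') + 1;
        while (true) {
            int keyStart = text.indexOf('"', i);
            if (keyStart < 0) {
                break; // no more fields
            }
            int keyEnd = text.indexOf('"', keyStart + 1);
            int colon = keyEnd < 0 ? -1 : text.indexOf(':', keyEnd);
            int valStart = colon < 0 ? -1 : text.indexOf('"', colon);
            int valEnd = valStart < 0 ? -1 : text.indexOf('"', valStart + 1);
            if (valEnd < 0) {
                throw new IllegalArgumentException("unsupported or malformed input");
            }
            add(keyStart + 1, keyEnd, valStart + 1, valEnd);
            i = valEnd + 1;
        }
    }

    // Records four offsets; no String or node object is created for the field.
    private void add(int keyStart, int keyEnd, int valStart, int valEnd) {
        if (fields * 4 == offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        int base = fields * 4;
        offsets[base] = keyStart;
        offsets[base + 1] = keyEnd;
        offsets[base + 2] = valStart;
        offsets[base + 3] = valEnd;
        fields++;
    }

    // Compares the key in place and creates a String only for the requested value.
    String get(String key) {
        for (int f = 0; f < fields; f++) {
            int base = f * 4;
            int len = offsets[base + 1] - offsets[base];
            if (len == key.length() && text.regionMatches(offsets[base], key, 0, len)) {
                materialized++;
                return text.substring(offsets[base + 2], offsets[base + 3]);
            }
        }
        return null; // key not present
    }

    int fieldCount() { return fields; }

    int materializedCount() { return materialized; }

    public static void main(String[] args) {
        LazyIndexSketch doc =
                new LazyIndexSketch("{\"id\":\"42\",\"name\":\"Ann\",\"city\":\"Oslo\"}");
        System.out.println("fields indexed: " + doc.fieldCount());
        System.out.println("name = " + doc.get("name"));
        System.out.println("values materialized: " + doc.materializedCount());
    }
}
```

In this sketch, scanning three fields allocates nothing beyond the offset array, and reading `name` creates exactly one `String`. The linear key search keeps the example short; a real index would likely need a faster lookup and support for nested nodes.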

The end result? While parsing takes about the same time as with a normal parser, the cost of instantiating objects to contain data, and later garbage collecting them, is avoided for every field except those whose values are actually used.

Limitations of fast parsers

XData's fast parsers are read-only. Since they access the data in its unaltered String representation, there is no ability to insert data into the document.