XML Utilities - acfr/comma GitHub Wiki

Setup

expat

Most of the XML utilities in comma rely on the expat xml parser.

First, get and install the expat development libraries as appropriate for your operating system. For example, on Ubuntu 14.04 run: sudo apt-get install libexpat1-dev
For other systems, see http://expat.sourceforge.net/

Then enable and configure expat in CMake by setting comma_BUILD_EXPAT to ON and EXPAT_LIBRARY to the path of the installed expat library. If you are using ccmake on Ubuntu 14.04, the install path should be detected automatically.

Don't forget that you will need to rerun cmake to generate the build files.

Utilities

xml-grep

XML Grep uses the expat SAX parser to detect the start and end of chosen elements. Then it will output the contents of those elements to the standard output.
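The SAX approach described above can be sketched in a few lines of Python, whose standard library wraps the same expat parser. This is only an illustration of the technique, not the xml-grep source: the function name grep_elements and the sample document are made up for the example, and the sketch returns only the text inside matching elements, whereas the exact output format of xml-grep is not specified here.

```python
import xml.parsers.expat


def grep_elements(data, names):
    """Return the text content of every element whose tag is in `names`."""
    results = []   # finished element contents, in the order end tags are seen
    stack = []     # one text buffer per currently open matching element

    def start(tag, attrs):
        if tag in names:
            stack.append([])

    def chars(text):
        # Text belongs to every matching element that is currently open.
        for buf in stack:
            buf.append(text)

    def end(tag):
        if tag in names and stack:
            results.append("".join(stack.pop()))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end
    parser.Parse(data, True)   # True marks this as the final chunk of input
    return results


# Hypothetical sample input, standing in for new-york-zoo-stocktake.xml.
xml_doc = b"<zoo><lion>Leo</lion><zebra>Zed</zebra><penguin>Pip</penguin></zoo>"
print(grep_elements(xml_doc, {"lion", "zebra"}))   # ['Leo', 'Zed']
```

Because the handlers fire as the input streams past, nothing outside the currently open matching elements needs to be held in memory.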

Example 1) cat new-york-zoo-stocktake.xml | xml-grep lion zebra

Will output the content between any <lion>...</lion> or <zebra>...</zebra> pairs.

Example 2) xml-grep --source=new-york-zoo-stocktake.xml lion zebra

Will output the same content, but reads the file directly, which can be slightly more efficient.

Example 3) xml-grep --source=new-york-zoo-stocktake.xml --limit=10 penguin

Will output (up to) the first ten penguin records. This is useful if you have very large data sets. The parser will stop and the program will return once the limit is reached.

Example 4) xml-grep --source=new-york-zoo-stocktake.xml --range=11-20 penguin

Will output the second ten penguin records. You can use this to break a large data file into smaller accesses. However, this will still read and parse the first ten records. As such, calling the program repeatedly to segment a large file is an O(n^2) operation, which can quickly leave your system I/O bound.
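To see where the quadratic cost comes from, the following sketch counts how many records are parsed in total when a file is segmented by repeated --range calls. It is plain arithmetic, not a measurement of xml-grep, and it assumes each call stops parsing as soon as the last record of its range has been output. The function name is made up for the example.

```python
def records_parsed_by_ranges(total, chunk):
    """Records parsed when each chunk is fetched by a separate --range call."""
    parsed = 0
    for first in range(1, total + 1, chunk):
        last = min(first + chunk - 1, total)
        # A call for records first..last must still parse records 1..last,
        # and is assumed to stop as soon as `last` has been output.
        parsed += last
    return parsed


print(records_parsed_by_ranges(100, 10))      # 550, versus 100 for one pass
print(records_parsed_by_ranges(100000, 10))   # 500050000
```

For a fixed chunk size the total grows with the square of the number of records, which is why a single pass is preferable for large files.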

xml-split

Parses the input like xml-grep, looking for the given elements, but creates a directory named after each element. Into these directories it writes sequentially numbered files, one for each set of elements encountered.

This program was designed to break up massive XML files so that DOM-based XSLT processors do not have to load the entire tree into memory. Note that the program parses the input every time it is called.

Example 1) cat new-york-zoo-stocktake.xml | xml-split penguin

The first 1000 penguins will be in the file penguin/000001.xml, while the second 1000 penguins will be in the file penguin/000002.xml, and so on.

Example 2) xml-split --source=new-york-zoo-stocktake.xml --limit=10 penguin

The first 10 penguins in the file new-york-zoo-stocktake.xml will be output to a file penguin/000001.xml, while the second 10 penguins will be output to a file penguin/000002.xml; and so on.

Note: To protect existing data, the program will abort if the output directory already exists. For example, the following will fail because the directory penguin already exists: mkdir -p penguin && xml-split --source=new-york-zoo-stocktake.xml --limit=10 penguin

Note: If there are no matching elements in the input file, no directory will be created. So xml-split --source=new-york-zoo-stocktake.xml --limit=10 pterodactyl will output nothing.
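The file layout and the two notes above can be illustrated with a short Python sketch. It is not the xml-split implementation: the function split_records, its parameters, and the sample records are hypothetical, the default of 1000 records per file is inferred from Example 1, and how the real tool lays out records inside each file is not stated here (the sketch simply writes one record per line).

```python
import os
import tempfile


def split_records(records, element, limit=1000, root="."):
    """Write `records` into <root>/<element>/NNNNNN.xml, `limit` per file."""
    if not records:
        return []                      # no matches: no directory is created
    out_dir = os.path.join(root, element)
    os.mkdir(out_dir)                  # raises FileExistsError if it exists
    paths = []
    for n, first in enumerate(range(0, len(records), limit), start=1):
        path = os.path.join(out_dir, "%06d.xml" % n)
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(records[first:first + limit]) + "\n")
        paths.append(path)
    return paths


# Hypothetical records: 25 penguins split 10 per file gives three files.
with tempfile.TemporaryDirectory() as root:
    penguins = ["<penguin id='%d'/>" % i for i in range(25)]
    written = split_records(penguins, "penguin", limit=10, root=root)
    print([os.path.relpath(p, root) for p in written])
```

Refusing to reuse an existing directory means a second run can never silently mix its output with files left over from an earlier run.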

xml-map with xml-map-split

This pair of applications was created to provide the functionality of xml-grep / xml-split for large, mainly static data sources.

For example, suppose you have a multi-gigabyte XML file that specifies the baseline state of the roadway and cadastral boundaries of a large city, and thereafter you process deltas that request a change to a specific subset of that information, such as business names. Reprocessing the baseline each time would require loading a multi-gigabyte DOM tree into memory for XSLT transforms. These programs are designed as pre-processors to avoid that.

Example 1) xml-map --compact --source=megafile.xml > megafile.map

Will generate a byte map that specifies the xpath of each element in the input file.

Example 2) xml-map --compact --maxdepth=3 --source=megafile.xml > megafile.map

If the XML structure is detailed and complex, you may not need a map of elements below a certain depth, so --maxdepth lets you ignore them. The smaller map also leads to faster processing.

Example 3) xml-map-split megafile.xml megafile.map 1-1000 business > business/000001.xml

Will output the first thousand businesses to the named file. Note that this will parse the entire map, O(n), but will use seek to find its input and output only the business elements, O(log n).

However, for simplicity the program will attempt to read each element in its entirety. As such, if you have chosen a multi-gigabyte top-level element, the program will create a buffer large enough to load that entire element.
