csv - acfr/comma GitHub Wiki

Table of Contents: motivation, an example, 5-minutes tutorial, visiting, visiting traits, ascii, binary, streams, notes, class overview, utilities overview, utilities tutorials

motivation

Tabular and comma-separated values are a simple and portable data model for streams of sensor readings, market quotes, database records, etc. They form a trivial protocol for any homogeneous fixed-width data stream. The protocol decouples the csv data from the meaning of its fields, so different classes or utilities can assign different meanings to the same fields.

The other driving force is keeping arbitrary csv data human-readable, which is crucial for prototyping, debugging, and simple data operations, without loss of performance. Therefore, the csv library supports both ASCII and binary fixed-width data, with simple converters between them.

Thus, the idea is to:

keep the unstructured fixed-width data separate from their meaning: the data user assigns the meaning to the data ("pull model" for the meaning)
be able to tap into those flows at any point, either to redirect them or simply to see what is going on
more generally, represent the whole system operation as a set of csv data streams and transformations on them, which applies to many real-life use cases

an example

The csv-style filters can be put together into processing pipelines and mixed with standard utilities like cut, grep, sed, etc.

Say we have a sensor publishing timestamped 3D points in polar coordinates (for brevity, assume they are ASCII). The fields are t,range,bearing,elevation,scan,intensity, where one line looks like:

 20121006T122433,12.345,98.765,33.22,25,234

The pipeline below converts them to Cartesian coordinates and visualises them in real time. This is an actual pipeline; some of its utilities are implemented elsewhere, using the comma/csv library. The meaning of the fields is defined by the --fields command line option.

 netcat localhost 12345 | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar

Now suppose we have a log collected from this sensor and want to play it back:

 cat points.csv | csv-play --fields=t | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar

5-minutes tutorial

Try to copy the examples below into your source files, build and run them.

visiting

There are two basic visitor facades: csv::ascii and csv::binary.

visiting traits

Assume we have a structure foo with visiting traits defined (see visiting for details):

 // an arbitrary structure
 #include <string>
 #include <comma/visiting/traits.h>
 struct foo
 {
     std::string hello;
     std::string world;
     int index;
 };
 
 // the only routine part: define how we visit foo in some header
 // assume, hello.h contains traits definition for hello_type, same as below
 namespace comma { namespace visiting {
 template <> struct traits< foo >
 {
     // non-const visitor
     template < typename K, typename V > static void visit( const K& k, foo& f, V& v )
     {
         v.apply( "hello", f.hello );
         v.apply( "world", f.world );
         v.apply( "index", f.index );
     }
     // const visitor
     template < typename K, typename V > static void visit( const K& k, const foo& f, V& v )
     {
         v.apply( "hello", f.hello );
         v.apply( "world", f.world );
         v.apply( "index", f.index );
     }
 };
 } }
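
To show what a visitor does with such traits, here is a minimal self-contained sketch that does not use comma at all. The traits template, name_collector, and member_names are hypothetical stand-ins written for this illustration, and the key parameter of the real visit() signature is left out for brevity. The visitor simply records the name of every member it is shown.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the structure from the text; nothing here depends on comma.
struct foo
{
    std::string hello;
    std::string world;
    int index = 0;
};

// Hypothetical traits template, shaped like the traits above but without the key parameter.
template < typename T > struct traits;

template <> struct traits< foo >
{
    template < typename V > static void visit( const foo& f, V& v )
    {
        v.apply( "hello", f.hello );
        v.apply( "world", f.world );
        v.apply( "index", f.index );
    }
};

// Hypothetical visitor: ignores the values and records member names in visiting order.
struct name_collector
{
    std::vector< std::string > names;
    template < typename T > void apply( const char* name, const T& ) { names.push_back( name ); }
};

// Convenience wrapper: returns the member names of any type that has traits defined.
template < typename T > std::vector< std::string > member_names( const T& t )
{
    name_collector c;
    traits< T >::visit( t, c );
    assert( !c.names.empty() ); // a visited type is expected to expose at least one member
    return c.names;
}
```

Calling member_names( foo() ) yields hello, world, index. The point of the pattern is that the traits are written once per type, while any number of visitors (csv readers, writers, name collectors) can reuse them.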

ascii

For ASCII visiting:

 #include <iostream>
 #include <string>
 #include <comma/csv/ascii.h>
 #include "foo.h"
 
 int main( int ac, char** av )
 {
     foo f;
     {
         std::string s = "25,world,is,so,gloomy";
         // ascii csv visitor for given csv fields
         // the fields can be either flat: --fields=x,y,z
         // or hierarchical, e.g: --fields=from/x,from/y,from/z,to/x,to/y,to/z
         // where hierarchical fields can be collapsed, e.g: --fields=from,to means the same as above
         comma::csv::ascii< foo > ascii( "index,,,,world" );
         ascii.get( f, s );
         std::cerr << f.index << std::endl; // prints: 25
         std::cerr << f.world << std::endl; // prints: gloomy
         f.index = 99;
         ascii.put( f, s );
         std::cerr << s << std::endl; // prints: 99,world,is,so,gloomy
     }
     {
         std::string s = "88,world,is,so,happy";
         comma::csv::ascii< foo > ascii( ",,,,world" );
         ascii.get( f, s );
         std::cerr << f.index << std::endl; // still prints 99: index is not among the fields
         std::cerr << f.world << std::endl; // prints: happy
     }
 }
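
The way the field names pick columns can be illustrated without the library. The sketch below is not how comma::csv::ascii is implemented; split, get, and put are hypothetical helpers that hard-code the members of foo, only to show that named columns are read or overwritten while unnamed columns are left alone.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct foo { std::string hello; std::string world; int index = 0; };

// Split on a delimiter, keeping empty tokens: "index,,,,world" gives 5 tokens.
std::vector< std::string > split( const std::string& s, char delimiter = ',' )
{
    std::vector< std::string > tokens;
    std::string::size_type begin = 0;
    while( true )
    {
        const std::string::size_type end = s.find( delimiter, begin );
        if( end == std::string::npos ) { tokens.push_back( s.substr( begin ) ); break; }
        tokens.push_back( s.substr( begin, end - begin ) );
        begin = end + 1;
    }
    return tokens;
}

// Copy into f only the columns whose field name matches a member of foo.
void get( foo& f, const std::string& fields, const std::string& line )
{
    const std::vector< std::string > names = split( fields );
    const std::vector< std::string > values = split( line );
    assert( names.size() <= values.size() ); // every named column must be present in the line
    for( std::size_t i = 0; i < names.size(); ++i )
    {
        if( names[i] == "hello" ) { f.hello = values[i]; }
        else if( names[i] == "world" ) { f.world = values[i]; }
        else if( names[i] == "index" ) { f.index = std::stoi( values[i] ); }
        // empty or unknown names: the column is skipped
    }
}

// Return the line with the named columns overwritten from f; other columns stay unchanged.
std::string put( const foo& f, const std::string& fields, const std::string& line )
{
    const std::vector< std::string > names = split( fields );
    std::vector< std::string > values = split( line );
    assert( names.size() <= values.size() );
    for( std::size_t i = 0; i < names.size(); ++i )
    {
        if( names[i] == "hello" ) { values[i] = f.hello; }
        else if( names[i] == "world" ) { values[i] = f.world; }
        else if( names[i] == "index" ) { values[i] = std::to_string( f.index ); }
    }
    std::string result;
    for( std::size_t i = 0; i < values.size(); ++i ) { if( i > 0 ) { result += ','; } result += values[i]; }
    return result;
}
```

With fields "index,,,,world" and the line "25,world,is,so,gloomy", get() sets index to 25 and world to gloomy; after changing index to 99, put() returns "99,world,is,so,gloomy".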

binary

For binary, we need to specify not only fields, but also the binary format.

 #include <vector>
 #include <comma/csv/binary.h>
 #include "foo.h"
 // ...
 foo f;
 {
     // binary fixed-width visitor for integer, fixed-length string and two unsigned ints
     comma::csv::binary< foo > binary( "i,s[8],2ui", "index,world" );
     std::vector< char > buf( binary.size() );
     // put index in the first 4 bytes and f.world value in the next 8 bytes
     binary.put( f, &buf[0] );
 }
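
As a rough picture of the resulting fixed-width record, the sketch below packs the same two fields by hand. It assumes that i and ui are 4 bytes each, that values are stored in native byte order, and that a fixed-length string is truncated or zero-padded to its width; these are assumptions of the illustration rather than statements about the library, and put, get_index, and get_world are hypothetical helpers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Assumed layout for format "i,s[8],2ui": 4 + 8 + 2 * 4 = 20 bytes per record.
const std::size_t index_offset = 0;
const std::size_t world_offset = 4;
const std::size_t world_size = 8;
const std::size_t record_size = 20;

// Write index into the first 4 bytes and world into the next 8 bytes.
void put( std::int32_t index, const std::string& world, std::vector< char >& buf )
{
    assert( buf.size() >= record_size );
    std::memcpy( &buf[index_offset], &index, sizeof( index ) );
    // fixed-length string: zero the slot, then copy at most 8 bytes
    std::memset( &buf[world_offset], 0, world_size );
    std::memcpy( &buf[world_offset], world.data(), std::min( world.size(), world_size ) );
}

// Read the integer back from the first 4 bytes.
std::int32_t get_index( const std::vector< char >& buf )
{
    assert( buf.size() >= record_size );
    std::int32_t index;
    std::memcpy( &index, &buf[index_offset], sizeof( index ) );
    return index;
}

// Read the string back, stopping at the first zero byte or at the slot width.
std::string get_world( const std::vector< char >& buf )
{
    assert( buf.size() >= record_size );
    const char* begin = &buf[world_offset];
    const char* end = std::find( begin, begin + world_size, '\0' );
    return std::string( begin, end );
}
```

Because every record has the same size, a reader can step through a binary stream in 20-byte chunks without any delimiters, which is what makes the binary form fast to parse.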

streams

comma::csv::ascii and comma::csv::binary are used less often than the csv streams, which are useful for any utility that takes an input stream of csv data, does something to it, and outputs a csv stream of another type.

For convenience they use csv::options to specify fields, binary format, delimiter, and full_xpath flag.

For example, suppose we want to read Cartesian points and output points in polar coordinates. Assume we have point_xyz, point_polar, and conversions between them defined elsewhere.

 #include "point.h" // suppose, it defines point_xyz, etc
 #include <comma/csv/stream.h>
 int main( int, char** )
 {
     comma::csv::input_stream< point_xyz > is( std::cin );
     comma::csv::output_stream< point_polar > os( std::cout );
     while( std::cin.good() )
     {
         const point_xyz* p = is.read();
         if( !p ) { break; }
         os.write( p->to_polar() );
     }
 }

Assume now that the input has timestamp as the first field and that timestamp should be output with each point without change:

 #include "point.h" // suppose, it defines point_xyz, etc
 #include <comma/csv/stream.h>
 int main( int, char** )
 {
     comma::csv::options input_options( ",x,y,z" );
     comma::csv::options output_options( ",range,bearing,elevation" );
     comma::csv::input_stream< point_xyz > is( std::cin, input_options );
     comma::csv::output_stream< point_polar > os( std::cout, output_options );
     while( std::cin.good() )
     {
         const point_xyz* p = is.read();
         if( !p ) { break; }
         // substitute x,y,z for range,bearing,elevation, leaving timestamp unchanged
         // in the line just read is.last()
         os.ascii().write( p->to_polar(), is.last() );
     }
 }
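
The effect of the second example on a single ascii line can be sketched without the library. The code below assumes one particular polar convention (bearing measured in the x-y plane from the x axis, elevation measured from the x-y plane, both in radians); the point types, to_polar, and convert_line are hypothetical and only illustrate that the timestamp passes through untouched while the coordinates are replaced.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>

struct point_xyz { double x, y, z; };
struct point_polar { double range, bearing, elevation; };

// Assumed convention: bearing = atan2( y, x ), elevation = asin( z / range ), in radians.
point_polar to_polar( const point_xyz& p )
{
    point_polar q;
    q.range = std::sqrt( p.x * p.x + p.y * p.y + p.z * p.z );
    q.bearing = std::atan2( p.y, p.x );
    q.elevation = q.range > 0 ? std::asin( p.z / q.range ) : 0;
    return q;
}

// Convert one "t,x,y,z" line to "t,range,bearing,elevation"; the timestamp is copied verbatim.
std::string convert_line( const std::string& line )
{
    const std::string::size_type first_comma = line.find( ',' );
    assert( first_comma != std::string::npos ); // expect a timestamp followed by coordinates
    std::istringstream iss( line.substr( first_comma + 1 ) );
    point_xyz p;
    char c1 = 0;
    char c2 = 0;
    iss >> p.x >> c1 >> p.y >> c2 >> p.z;
    assert( !iss.fail() && c1 == ',' && c2 == ',' ); // expect exactly three numeric fields
    const point_polar q = to_polar( p );
    std::ostringstream oss;
    oss << line.substr( 0, first_comma ) << ',' << q.range << ',' << q.bearing << ',' << q.elevation;
    return oss.str();
}
```

For instance, convert_line( "20121006T122433,1,0,0" ) returns "20121006T122433,1,0,0": a point on the x axis has range 1 and zero bearing and elevation, and the timestamp is carried over unchanged.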

The binary streams work in the same way, except you need to specify binary format in csv::options. See csv utilities and unit tests for more generic usage examples. See doxygen documentation for details.

notes

When using the standard input/output streams, the comma::csv streams rely on the standard C++ streams not being synchronized with the standard C streams (i.e. std::cin.sync_with_stdio( false ); is called upon construction). If all input/output operations in the program are performed through the comma::csv streams, this should not be of any concern. However, if there are also input/output operations performed directly on the standard streams (e.g. std::getline( std::cin, str ); ), then the program must call std::cin.sync_with_stdio( false ); before any input/output operations are performed.
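
A minimal sketch of the ordering this note asks for, using only the standard library: switch off synchronisation first, then do any direct reads, and only then hand the stream to the csv classes. The helper names unsync_standard_streams and read_header are hypothetical.

```cpp
#include <cassert>
#include <iostream>
#include <string>

// Turn off synchronisation with C stdio; meant to be called once, before any input/output.
// Returns the previous setting, as std::ios_base::sync_with_stdio does.
bool unsync_standard_streams() { return std::ios_base::sync_with_stdio( false ); }

// Direct read that bypasses comma::csv streams, e.g. to consume a header line.
std::string read_header( std::istream& is )
{
    assert( is.good() ); // the sketch expects a readable stream
    std::string header;
    std::getline( is, header );
    return header;
}
```

In a program, main() would call unsync_standard_streams() as its first statement, then read_header( std::cin ) if needed, and construct the comma::csv streams afterwards.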

class overview

ascii: conversions between an ascii csv string and a class
binary: conversions between a binary fixed-width buffer and a class
input_stream,output_stream: strongly typed input/output streams of csv or binary fixed-width data
format: binary format definitions and operations
options: csv options
names: a field name visitor

This is just an overview of main classes. See doxygen documentation for details.

utilities overview

All utilities can handle both ascii csv and binary data in the same way, unless stated otherwise, e.g:

Play back ascii csv data with the timestamp in the 3rd field:

 cat timestamped.csv | csv-play --fields=,,t

Play back binary data with the timestamp in the 3rd field, where the fields are unsigned int, double, time, int:

 cat timestamped.bin | csv-play --fields=,,t --binary=ui,d,t,i

csv-analyse: try to guess binary size (width) of data in unknown binary stream
csv-bin-cut: same as standard linux cut, but on binary data
csv-bin-reverse: reverse byte order of given fields (to change endianness)
csv-blocks: operations on blocks of data based on the block field
csv-calc: column-wise operations on csv files: min, max, mean, stderr, diameter, etc
csv-cast: take binary in given format, output binary in another format
csv-crc: append/check/strip crc field
csv-eval: evaluate expression and append computed values to csv stream
csv-fields: convert comma-separated fields to field numbers (to combine with utilities like cut)
csv-from-bin: take binary in given format, output ascii csv
csv-interval: take intervals and separate them at points of overlap
csv-join: join two csv streams by one or more fields
csv-paste: same as linux paste, but works on ascii, binary, and constants
csv-play: play back by time field one or more inputs
csv-quote: take csv string, quote/unquote anything that is not a number (useful for scripting)
csv-repeat: periodically repeat the last input record after a given timeout
csv-reshape: perform reshaping operation(s) on input data e.g. concatenate
csv-size: take binary format, output record size (e.g. csv-size t,d,ui will output 20)
csv-sort: sort csv files by one or more fields
csv-split: split input by timestamp or other field
csv-select: output only rows that match the constraints on specific columns (todo: define a simple grammar for command-line boolean expressions)
csv-shuffle: append, remove, swap csv columns
csv-thin: randomly thin the input at given rate
csv-time-delay: add given time delay to a timestamp
csv-time-join: same as csv-join, but joins by a monotonic timestamp column and thus works on streams
csv-time-stamp: add system time as a timestamp column
csv-time: convert between a few time formats; see time formats for details and limits
csv-update: take two files or streams, output values of the first stream updated with values from the second; reminiscent of sql update; see csv-update --help for more
csv-to-bin: take ascii csv, output binary in given format
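
The record size that csv-size reports (20 for t,d,ui) can be reproduced with a small calculation. The sketch below is not the csv-size implementation; it covers only the type codes that appear on this page and assumes i and ui take 4 bytes, d and t take 8 bytes, s[n] takes n bytes, and a leading count such as 2ui repeats the type. The names element_size and record_size are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Size in bytes of one format element such as "ui", "2ui" or "s[8]".
std::size_t element_size( const std::string& element )
{
    assert( !element.empty() ); // empty elements are not valid in this sketch
    std::size_t pos = 0;
    std::size_t count = 0;
    // optional leading repeat count, e.g. the 2 in "2ui"
    while( pos < element.size() && std::isdigit( static_cast< unsigned char >( element[pos] ) ) )
    {
        count = count * 10 + static_cast< std::size_t >( element[pos] - '0' );
        ++pos;
    }
    if( pos == 0 ) { count = 1; }
    const std::string type = element.substr( pos );
    if( type == "i" || type == "ui" ) { return count * 4; }
    if( type == "d" || type == "t" ) { return count * 8; }
    if( type.size() > 3 && type.compare( 0, 2, "s[" ) == 0 && type.back() == ']' )
    {
        return count * std::stoul( type.substr( 2, type.size() - 3 ) );
    }
    throw std::invalid_argument( "type not covered by this sketch: " + element );
}

// Sum of element sizes over a comma-separated format string.
std::size_t record_size( const std::string& format )
{
    std::size_t size = 0;
    std::string::size_type begin = 0;
    while( true )
    {
        const std::string::size_type end = format.find( ',', begin );
        const bool last = end == std::string::npos;
        size += element_size( format.substr( begin, last ? std::string::npos : end - begin ) );
        if( last ) { break; }
        begin = end + 1;
    }
    return size;
}
```

Under these assumptions, record_size( "t,d,ui" ) gives 20, matching the csv-size example, and record_size( "ui,d,t,i" ) gives 24 for the csv-play format used earlier.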

todo: brush up the csv operations for completeness

utilities tutorials

tutorials