csv - acfr/comma GitHub Wiki
Tabular and comma-separated values are a simple and portable data model for streams of sensor readings, market quotes, database records, etc. It is a trivial protocol for any homogeneous fixed-width data stream. It decouples the csv data from the meaning of its fields, so different classes or utilities can assign different meanings to the same fields.
The other requirement is to keep arbitrary csv data human-readable, which is crucial for prototyping, debugging, and simple data operations, without losing performance. Therefore, the csv library supports both ASCII and binary fixed-width data, with simple converters between them.
Thus, the idea is to:
- keep the unstructured fixed-width data separate from their meaning: the data user assigns the meaning to the data ("pull model" for the meaning)
- be able to tap into those flows at any point, either to redirect them or to simply see what is going on
- more generally, represent the whole system operation as a bunch of csv data streams and transformations on them, which applies to very many real-life use cases
The csv-style filters can be put together into processing pipelines, mixed with standard utilities like cut, grep, sed, etc.
Say, we have a sensor publishing timestamped 3D points in polar coordinates (for brevity assume they are ASCII). The fields are t,range,bearing,elevation,scan,intensity where one line looks like:
20121006T122433,12.345,98.765,33.22,25,234
The pipeline below converts them to Cartesian coordinates and visualises them in real time. This is an actual pipeline, with some utilities implemented elsewhere using the comma/csv library. The meaning of the fields is defined in the --fields command line option.
netcat localhost 12345 | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar
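To make the role of --fields more concrete, here is a minimal standalone sketch of how a utility could map a field specification such as ",r,b,e" to column positions. It does not use comma and is not how comma implements it; split and field_indices are hypothetical helpers introduced only for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Split a string on a delimiter, keeping empty tokens,
// e.g. ",r,b,e" gives "", "r", "b", "e".
std::vector< std::string > split( const std::string& s, char delimiter = ',' )
{
    std::vector< std::string > tokens;
    std::string::size_type begin = 0;
    while( true )
    {
        const std::string::size_type end = s.find( delimiter, begin );
        if( end == std::string::npos ) { tokens.push_back( s.substr( begin ) ); break; }
        tokens.push_back( s.substr( begin, end - begin ) );
        begin = end + 1;
    }
    return tokens;
}

// Map each named field to its column index; unnamed (empty) columns are
// skipped, i.e. this utility assigns no meaning to them.
std::map< std::string, std::size_t > field_indices( const std::string& fields )
{
    std::map< std::string, std::size_t > indices;
    const std::vector< std::string > names = split( fields );
    for( std::size_t i = 0; i < names.size(); ++i )
    {
        if( !names[i].empty() ) { indices[ names[i] ] = i; }
    }
    return indices;
}
```

With the fields ",r,b,e" and the sample line above, the column looked up for r is the second one, i.e. 12.345; the timestamp, scan and intensity columns are simply not interpreted by this consumer.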
Now, suppose, we have a log collected from this sensor and want to play it back:
cat points.csv | csv-play --fields=t | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar
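The conversion step in the pipelines above could look roughly like the sketch below. The angle convention is an assumption made for this example (radians, bearing measured in the x-y plane from the x axis, elevation measured from the x-y plane) and is not taken from points-to-cartesian; the cartesian struct and to_cartesian function are hypothetical names.

```cpp
#include <cmath>

// Hypothetical output type for this sketch.
struct cartesian { double x; double y; double z; };

// Convert range, bearing, elevation to x, y, z.
// Assumed convention: angles in radians, bearing in the x-y plane from the
// x axis, elevation from the x-y plane towards the z axis.
cartesian to_cartesian( double range, double bearing, double elevation )
{
    cartesian c;
    c.x = range * std::cos( elevation ) * std::cos( bearing );
    c.y = range * std::cos( elevation ) * std::sin( bearing );
    c.z = range * std::sin( elevation );
    return c;
}
```

A real utility would apply such a function to the r,b,e columns of each record and pass the remaining columns through unchanged.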
Try to copy the examples below into your source files, build and run them.
There are two basic visitor facades: csv::ascii and csv::binary.
Assume, we have a structure foo with visiting traits defined (see visiting for details):
    // an arbitrary structure
    #include <string>
    #include <comma/visiting/traits.h>

    struct foo
    {
        std::string hello;
        std::string world;
        int index;
    };

    // the only routine part: define how we visit foo in some header
    // assume, hello.h contains traits definition for hello_type, same as below
    namespace comma { namespace visiting {

    template <> struct traits< foo >
    {
        // non-const visitor
        template < typename K, typename V > static void visit( const K& k, foo& f, V& v )
        {
            v.apply( "hello", f.hello );
            v.apply( "world", f.world );
            v.apply( "index", f.index );
        }

        // const visitor
        template < typename K, typename V > static void visit( const K& k, const foo& f, V& v )
        {
            v.apply( "hello", f.hello );
            v.apply( "world", f.world );
            v.apply( "index", f.index );
        }
    };

    } } // namespace comma { namespace visiting {
For ASCII visiting:
    #include <iostream>
    #include <string>
    #include <comma/csv/ascii.h>
    #include "foo.h"

    int main( int ac, char** av )
    {
        foo f;
        {
            std::string s = "25,world,is,so,gloomy";
            // ascii csv visitor for given csv fields
            // the fields can be either flat: --fields=x,y,z
            // or hierarchical, e.g: --fields=from/x,from/y,from/z,to/x,to/y,to/z
            // where hierarchical fields can be collapsed, e.g: --fields=from,to means the same as above
            comma::csv::ascii< foo > ascii( "index,,,,world" );
            ascii.get( f, s );
            std::cerr << f.index << std::endl; // prints: 25
            std::cerr << f.world << std::endl; // prints: gloomy
            f.index = 99;
            ascii.put( f, s );
            std::cerr << s << std::endl; // prints: 99,world,is,so,gloomy
        }
        {
            std::string s = "88,world,is,so,happy";
            // only the 5th column is given a meaning here; index is not among the fields
            comma::csv::ascii< foo > ascii( ",,,,world" );
            ascii.get( f, s );
            std::cerr << f.index << std::endl; // still prints 99
            std::cerr << f.world << std::endl; // prints: happy
        }
    }
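To illustrate what ascii.put does to the line in the example above (the visited fields are rewritten, the other columns stay as they were), here is a minimal standalone sketch. It does not use comma and is not its implementation; split_line, join_line and put_column are hypothetical helpers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split an ascii csv line into columns, keeping empty columns.
std::vector< std::string > split_line( const std::string& line, char delimiter = ',' )
{
    std::vector< std::string > columns;
    std::string::size_type begin = 0;
    for( std::string::size_type end = line.find( delimiter ); end != std::string::npos; end = line.find( delimiter, begin ) )
    {
        columns.push_back( line.substr( begin, end - begin ) );
        begin = end + 1;
    }
    columns.push_back( line.substr( begin ) );
    return columns;
}

// Join columns back into a single line.
std::string join_line( const std::vector< std::string >& columns, char delimiter = ',' )
{
    std::string line;
    for( std::size_t i = 0; i < columns.size(); ++i )
    {
        if( i > 0 ) { line += delimiter; }
        line += columns[i];
    }
    return line;
}

// Overwrite one column of a line, leaving all the other columns untouched.
std::string put_column( const std::string& line, std::size_t index, const std::string& value, char delimiter = ',' )
{
    std::vector< std::string > columns = split_line( line, delimiter );
    if( index >= columns.size() ) { columns.resize( index + 1 ); } // pad with empty columns if the line is too short
    columns[index] = value;
    return join_line( columns, delimiter );
}
```

For instance, put_column( "25,world,is,so,gloomy", 0, "99" ) returns "99,world,is,so,gloomy", which mirrors the output of the first block of the example.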
For binary, we need to specify not only fields, but also the binary format.
    #include <sstream>
    #include <vector>
    #include <comma/csv/binary.h>
    #include "foo.h"

    // ...
    foo f;
    {
        // binary fixed-width visitor for integer, fixed-length string and two unsigned ints
        comma::csv::binary< foo > binary( "i,s[8],2ui", "index,world" );
        std::vector< char > buf( binary.size() );
        // put index in the first 4 bytes and f.world value in the next 8 bytes
        binary.put( f, &buf[0] );
    }
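As a rough illustration of how a format string such as "i,s[8],2ui" determines the fixed record width, the sketch below computes the record size for a handful of element types. It does not use comma; element_size and record_size are hypothetical helpers, and the element sizes (4 bytes for i and ui, 8 bytes for d and t, n bytes for s[n]) are assumptions chosen to be consistent with the examples on this page.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Size in bytes of a single element type. Only a few types are covered;
// the sizes are assumptions for this sketch.
std::size_t element_size( const std::string& type )
{
    if( type == "i" || type == "ui" ) { return 4; }
    if( type == "d" || type == "t" ) { return 8; }
    if( type.size() > 3 && type.compare( 0, 2, "s[" ) == 0 && type[ type.size() - 1 ] == ']' )
    {
        return std::strtoul( type.substr( 2, type.size() - 3 ).c_str(), nullptr, 10 );
    }
    throw std::invalid_argument( "unsupported type: '" + type + "'" );
}

// Record size for a comma-separated format; each element is an optional
// leading count followed by a type, e.g. "2ui" means two unsigned ints.
std::size_t record_size( const std::string& format )
{
    std::size_t total = 0;
    std::string::size_type begin = 0;
    while( begin <= format.size() )
    {
        std::string::size_type end = format.find( ',', begin );
        if( end == std::string::npos ) { end = format.size(); }
        const std::string token = format.substr( begin, end - begin );
        std::string::size_type digits = 0;
        while( digits < token.size() && std::isdigit( static_cast< unsigned char >( token[digits] ) ) ) { ++digits; }
        const std::size_t count = digits == 0 ? 1 : std::strtoul( token.substr( 0, digits ).c_str(), nullptr, 10 );
        total += count * element_size( token.substr( digits ) );
        begin = end + 1;
    }
    return total;
}
```

Under these assumptions, record_size( "i,s[8],2ui" ) gives 4 + 8 + 2 * 4 = 20 bytes, and record_size( "t,d,ui" ) also gives 20.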
comma::csv::ascii and comma::csv::binary are used less often than the csv streams, which are useful for any utility that takes an input stream of csv data, does something to it, and outputs a csv stream of another type.
For convenience they use csv::options to specify fields, binary format, delimiter, and full_xpath flag.
E.g. we want to read Cartesian points and output points in polar coordinates. Assume, we have point_xyz, point_polar, and conversions between them defined elsewhere.
    #include <iostream>
    #include "point.h" // suppose, it defines point_xyz, etc
    #include <comma/csv/stream.h>

    int main( int, char** )
    {
        comma::csv::input_stream< point_xyz > is( std::cin );
        comma::csv::output_stream< point_polar > os( std::cout );
        while( std::cin.good() )
        {
            const point_xyz* p = is.read(); // null pointer when there is nothing more to read
            if( !p ) { break; }
            os.write( p->to_polar() );
        }
    }
Assume now that the input has timestamp as the first field and that timestamp should be output with each point without change:
    #include <iostream>
    #include "point.h" // suppose, it defines point_xyz, etc
    #include <comma/csv/stream.h>

    int main( int, char** )
    {
        comma::csv::options input_options( ",x,y,z" );
        comma::csv::options output_options( ",range,bearing,elevation" );
        comma::csv::input_stream< point_xyz > is( std::cin, input_options );
        comma::csv::output_stream< point_polar > os( std::cout, output_options );
        while( std::cin.good() )
        {
            const point_xyz* p = is.read(); // null pointer when there is nothing more to read
            if( !p ) { break; }
            // substitute x,y,z for range,bearing,elevation, leaving timestamp unchanged
            // in the line just read is.last()
            os.ascii().write( p->to_polar(), is.last() );
        }
    }
The binary streams work in the same way, except you need to specify binary format in csv::options. See csv utilities and unit tests for more generic usage examples. See doxygen documentation for details.
When using the standard input/output streams, the comma::csv streams rely on the standard C++ streams not being synchronized with the standard C streams, i.e. std::cin.sync_with_stdio( false ); is called upon construction. If all input/output operations in the program are performed through the comma::csv streams, this should not be of any concern. However, if there are also input/output operations performed directly on the stream (e.g. std::getline( std::cin, str );), then std::cin.sync_with_stdio( false ); has to be called in the program before any input/output operations are performed.
- ascii: conversions between an ascii csv string and a class
- binary: conversions between a binary fixed-width buffer and a class
- input_stream,output_stream: strongly typed input/output streams of csv or binary fixed-width data
- format: binary format definitions and operations
- options: csv options
- names: a field name visitor
All utilities can handle both ascii csv and binary data in the same way, unless stated otherwise, e.g:
Play back ascii csv data with the timestamp in the 3rd field:
cat timestamped.csv | csv-play --fields=,,t
Play back binary data with the timestamp in the 3rd field, where the fields are unsigned int, double, time, int:
cat timestamped.bin | csv-play --fields=,,t --binary=ui,d,t,i
- csv-analyse: try to guess binary size (width) of data in unknown binary stream
- csv-bin-cut: same as standard linux cut, but on binary data
- csv-bin-reverse: reverse byte order of given fields (to change endianness)
- csv-blocks: operations on blocks of data based on the block field
- csv-calc: column-wise operations on csv files: min, max, mean, stderr, diameter, etc
- csv-cast: take binary in given format, output binary in another format
- csv-crc: append/check/strip crc field
- csv-eval: evaluate expression and append computed values to csv stream
- csv-fields: convert comma-separated fields to field numbers (to combine with utilities like cut)
- csv-from-bin: take binary in given format, output ascii csv
- csv-interval: take intervals and separate them at points of overlap
- csv-join: join two csv streams by one or more fields
- csv-paste: same as linux paste, but works on ascii, binary, and constants
- csv-play: play back by time field one or more inputs
- csv-quote: take csv string, quote/unquote anything that is not a number (useful for scripting)
- csv-repeat: periodically repeat the last input record after a given timeout
- csv-reshape: perform reshaping operation(s) on input data e.g. concatenate
- csv-size: take binary format, output record size (e.g. csv-size t,d,ui will output 20)
- csv-sort: sort csv files by one or more fields
- csv-split: split input by timestamp or other field
- csv-select: output only rows that match the constraints on specific columns (todo: define a simple grammar for command-line boolean expressions)
- csv-shuffle: append, remove, swap csv columns
- csv-thin: randomly thin the input at given rate
- csv-time-delay: add given time delay to a timestamp
- csv-time-join: same as csv-join, but joins by a monotonic timestamp column and thus works on streams
- csv-time-stamp: add system time as a timestamp column
- csv-time: convert between a few time formats; see time formats for details and limits
- csv-update: take two files or streams, output values of the first stream updated with values from the second; reminiscent of sql update; see csv-update --help for more
- csv-to-bin: take ascii csv, output binary in given format
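To give a feel for what converters like csv-to-bin and csv-from-bin do, here is a minimal standalone sketch for one fixed layout, "ui,d" (a 4-byte unsigned int followed by an 8-byte double). It is not the implementation of these utilities: to_bin and from_bin are hypothetical functions, the layout is hard-coded, and values are stored in host byte order.

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Convert an ascii line like "5,2.5" to a 12-byte record with layout "ui,d".
std::vector< char > to_bin( const std::string& line )
{
    const std::string::size_type comma = line.find( ',' );
    if( comma == std::string::npos ) { throw std::invalid_argument( "expected two fields, got: '" + line + "'" ); }
    const std::uint32_t id = static_cast< std::uint32_t >( std::strtoul( line.substr( 0, comma ).c_str(), nullptr, 10 ) );
    const double value = std::strtod( line.substr( comma + 1 ).c_str(), nullptr );
    std::vector< char > buf( sizeof( id ) + sizeof( value ) );
    std::memcpy( &buf[0], &id, sizeof( id ) );                    // bytes 0..3
    std::memcpy( &buf[0] + sizeof( id ), &value, sizeof( value ) ); // bytes 4..11
    return buf;
}

// Convert a 12-byte "ui,d" record back to an ascii line.
std::string from_bin( const std::vector< char >& buf )
{
    std::uint32_t id;
    double value;
    if( buf.size() != sizeof( id ) + sizeof( value ) ) { throw std::invalid_argument( "unexpected record size" ); }
    std::memcpy( &id, &buf[0], sizeof( id ) );
    std::memcpy( &value, &buf[0] + sizeof( id ), sizeof( value ) );
    std::ostringstream oss;
    oss << id << ',' << value; // default stream precision for the double
    return oss.str();
}
```

Because every record has the same width, a binary stream can be cut into records without looking for delimiters, which is what makes the binary form cheap to process while the ascii form stays readable.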