csv - acfr/comma GitHub Wiki

motivation

Tabular and comma-separated values are a simple and portable data model for streams of sensor readings, market quotes, database records, etc. It is a trivial protocol for any homogeneous fixed-width data stream. It decouples the csv data from the meaning of its fields, so different classes or utilities can assign different meanings to the same fields.

The other driving force is keeping arbitrary csv data human-readable, which is crucial for prototyping, debugging, and simple data operations, without loss of performance. Therefore, the csv library supports both ASCII and binary fixed-width data, with simple converters between them.

Thus, the idea is to:

  • keep the unstructured fixed-width data separate from their meaning: the data user assigns the meaning to the data ("pull model" for the meaning)
  • be able to tap into those flows at any point, either to redirect them or simply to see what is going on
  • more generally, represent the whole system operation as a bunch of csv data streams and transformations on them, which applies to very many real-life use cases

an example

The csv-style filters can be put together into processing pipelines and mixed with standard utilities like cut, grep, sed, etc.

Say, we have a sensor publishing timestamped 3D points in polar coordinates (for brevity assume they are ASCII). The fields are t,range,bearing,elevation,scan,intensity where one line looks like:

 20121006T122433,12.345,98.765,33.22,25,234

The pipeline below converts them to Cartesian coordinates and visualises them in real time. This is an actual pipeline, with some utilities implemented elsewhere using the comma/csv library. The meaning of the fields is defined by the --fields command line option.

 netcat localhost 12345 | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar

Now suppose we have a log collected from this sensor and want to play it back:

 cat points.csv | csv-play --fields=t | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar
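How a --fields string assigns meaning to columns can be illustrated with a small standalone sketch. It does not use the comma library: split and value_of are hypothetical helpers written only for this page, and an empty field name is assumed to mean "this column is not used", as in the pipelines above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// split a delimited string, keeping empty tokens (hypothetical helper, not part of comma)
std::vector< std::string > split( const std::string& s, char delimiter = ',' )
{
    std::vector< std::string > tokens;
    std::string::size_type begin = 0;
    while( true )
    {
        const std::string::size_type end = s.find( delimiter, begin );
        if( end == std::string::npos ) { tokens.push_back( s.substr( begin ) ); break; }
        tokens.push_back( s.substr( begin, end - begin ) );
        begin = end + 1;
    }
    return tokens;
}

// return the value in the column that the fields string labels with the given name
// (hypothetical helper; the name is assumed to be present in fields)
std::string value_of( const std::string& name, const std::string& fields, const std::string& line )
{
    const std::vector< std::string > names = split( fields );
    const std::vector< std::string > values = split( line );
    for( std::size_t i = 0; i < names.size(); ++i )
    {
        if( names[i] != name ) { continue; }
        assert( i < values.size() ); // the line must have at least as many columns as named fields
        return values[i];
    }
    assert( false && "field name not found" );
    return std::string();
}
```

With the sample line above, value_of( "b", ",r,b,e", line ) and value_of( "bearing", "t,range,bearing,elevation,scan,intensity", line ) both pick the third column: the data stay the same, only the names chosen by the user differ.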

5-minute tutorial

Try to copy the examples below into your source files, build and run them.

visiting

There are two basic visitor facades: csv::ascii and csv::binary.

visiting traits

Assume we have a structure foo with visiting traits defined (see visiting for details):

 // an arbitrary structure
 #include <string>
 #include <comma/visiting/traits.h>
 struct foo
 {
     std::string hello;
     std::string world;
     int index;
 };
 
 // the only routine part: define how we visit foo in some header
 // assume, hello.h contains traits definition for hello_type, same as below
 namespace comma { namespace visiting {
 template <> struct traits< foo >
 {
     // non-const visitor
     template < typename K, typename V > static void visit( const K& k, foo& f, V& v )
     {
         v.apply( "hello", f.hello );
         v.apply( "world", f.world );
         v.apply( "index", f.index );
     }
     // const visitor
     template < typename K, typename V > static void visit( const K& k, const foo& f, V& v )
     {
         v.apply( "hello", f.hello );
         v.apply( "world", f.world );
         v.apply( "index", f.index );
     }
 };
 } }
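To see what such traits make possible, here is a minimal standalone sketch of a visitor walking a structure. It deliberately does not include any comma headers: the traits template (with the key parameter dropped for brevity), the name_collector visitor, and field_names are simplified stand-ins written for this page, not the library's actual classes.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct foo
{
    std::string hello;
    std::string world;
    int index;
};

// simplified stand-in for the visiting traits (not the comma definition)
template < typename T > struct traits;

template <> struct traits< foo >
{
    template < typename V > static void visit( const foo& f, V& v )
    {
        v.apply( "hello", f.hello );
        v.apply( "world", f.world );
        v.apply( "index", f.index );
    }
};

// hypothetical visitor: records member names in the order they are visited
struct name_collector
{
    std::vector< std::string > names;

    template < typename T > void apply( const char* name, const T& )
    {
        assert( name != nullptr );
        names.push_back( name );
    }
};

// field names of any type that has traits defined
template < typename T > std::vector< std::string > field_names( const T& t = T() )
{
    name_collector collector;
    traits< T >::visit( t, collector );
    return collector.names;
}
```

Here field_names< foo >() returns hello, world, index. The point of the pattern is that the same visit function can serve visitors doing quite different jobs (collecting names, parsing, serialising) without foo knowing about any of them.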

ascii

For ASCII visiting:

 #include <iostream>
 #include <string>
 #include <comma/csv/ascii.h>
 #include "foo.h"
 
 int main( int ac, char** av )
 {
     foo f;
     {
         std::string s = "25,world,is,so,gloomy";
         // ascii csv visitor for given csv fields
         // the fields can be either flat: --fields=x,y,z
         // or hierarchical, e.g: --fields=from/x,from/y,from/z,to/x,to/y,to/z
         // where hierarchical fields can be collapsed, e.g: --fields=from,to means the same as above
         comma::csv::ascii< foo > ascii( "index,,,,world" );
         ascii.get( f, s );
         std::cerr << f.index << std::endl; // prints: 25
         std::cerr << f.world << std::endl; // prints: gloomy
         f.index = 99;
         ascii.put( f, s );
         std::cerr << s << std::endl; // prints: 99,world,is,so,gloomy
     }
     {
         std::string s = "88,world,is,so,happy";
         // only the 5th column is mapped (to foo::world); index is not in the fields
         comma::csv::ascii< foo > ascii( ",,,,world" );
         ascii.get( f, s );
         std::cerr << f.index << std::endl; // still prints 99
         std::cerr << f.world << std::endl; // prints: happy
     }
 }
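The behaviour of put in the example above (only the mapped columns are overwritten, the rest of the line is left alone) can be mimicked with a few lines of plain C++. This is a rough sketch for illustration only: put_column is a hypothetical helper, not a comma function, and it addresses the column by a zero-based number rather than by field name.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// replace the value in the given zero-based column of a comma-separated line,
// leaving all the other columns as they are (hypothetical helper, not part of comma);
// note: std::getline drops a trailing empty column, which is good enough for a sketch
std::string put_column( const std::string& line, std::size_t column, const std::string& value )
{
    std::vector< std::string > values;
    std::istringstream iss( line );
    for( std::string v; std::getline( iss, v, ',' ); ) { values.push_back( v ); }
    assert( column < values.size() ); // the column has to exist in the line
    values[column] = value;
    std::string result;
    for( std::size_t i = 0; i < values.size(); ++i ) { result += ( i == 0 ? "" : "," ) + values[i]; }
    return result;
}
```

For instance, put_column( "25,world,is,so,gloomy", 0, "99" ) gives 99,world,is,so,gloomy, the same line as in the first block of the example.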

binary

For binary, we need to specify not only fields, but also the binary format.

 #include <vector>
 #include <comma/csv/binary.h>
 #include "foo.h"
 // ...
 foo f;
 {
     // binary fixed-width visitor for integer, fixed-length string and two unsigned ints
     comma::csv::binary< foo > binary( "i,s[8],2ui", "index,world" );
     std::vector< char > buf( binary.size() );
     // put index in the first 4 bytes and f.world value in the next 8 bytes
     binary.put( f, &buf[0] );
 }
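To make the fixed-width layout concrete, the sketch below packs the same two values by hand: a 4-byte integer, then an 8-byte string, inside a 20-byte record (4 + 8 + 2 x 4 for i,s[8],2ui). It does not use comma; pack, unpack_index, and unpack_world are hypothetical helpers, and host byte order and zero padding of short strings are assumptions of this sketch rather than statements about the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// pack a 4-byte integer followed by a fixed-length 8-byte string into a record;
// the remaining bytes are left zeroed (hypothetical helper, not part of comma)
std::vector< char > pack( std::int32_t index, const std::string& world, std::size_t record_size = 20 )
{
    const std::size_t string_size = 8;
    assert( record_size >= sizeof( index ) + string_size );
    std::vector< char > buf( record_size, 0 );
    std::memcpy( &buf[0], &index, sizeof( index ) ); // bytes 0..3
    // bytes 4..11; longer strings are truncated, shorter ones stay zero-padded
    std::memcpy( &buf[0] + sizeof( index ), world.data(), std::min( world.size(), string_size ) );
    return buf;
}

// read the integer back from the first 4 bytes
std::int32_t unpack_index( const std::vector< char >& buf )
{
    std::int32_t index;
    assert( buf.size() >= sizeof( index ) );
    std::memcpy( &index, &buf[0], sizeof( index ) );
    return index;
}

// read the string back from the next 8 bytes
std::string unpack_world( const std::vector< char >& buf )
{
    const std::size_t offset = sizeof( std::int32_t ), string_size = 8;
    assert( buf.size() >= offset + string_size );
    const std::string s( &buf[offset], string_size );
    return s.substr( 0, s.find( '\0' ) ); // drop the zero padding, if any
}
```

Because every record has the same size and every field sits at a known offset, a reader can jump straight to any field of any record without parsing, which is where the performance of the binary form comes from.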

streams

comma::csv::ascii and comma::csv::binary are used directly less often than the csv streams, which are useful for any utility that takes an input stream of csv data, does something to it, and outputs a csv stream of another type.

For convenience they use csv::options to specify fields, binary format, delimiter, and full_xpath flag.

E.g. we want to read Cartesian points and output points in polar coordinates. Assume, we have point_xyz, point_polar, and conversions between them defined elsewhere.

 #include <iostream>
 #include "point.h" // suppose, it defines point_xyz, etc
 #include <comma/csv/stream.h>
 int main( int, char** )
 {
     comma::csv::input_stream< point_xyz > is( std::cin );
     comma::csv::output_stream< point_polar > os( std::cout );
     while( std::cin.good() )
     {
         const point_xyz* p = is.read();
         if( !p ) { break; } // nothing was read: stop
         os.write( p->to_polar() );
     }
 }

Assume now that the input has timestamp as the first field and that timestamp should be output with each point without change:

 #include <iostream>
 #include "point.h" // suppose, it defines point_xyz, etc
 #include <comma/csv/stream.h>
 int main( int, char** )
 {
     comma::csv::options input_options( ",x,y,z" );
     comma::csv::options output_options( ",range,bearing,elevation" );
     comma::csv::input_stream< point_xyz > is( std::cin, input_options );
     comma::csv::output_stream< point_polar > os( std::cout, output_options );
     while( std::cin.good() )
     {
         const point_xyz* p = is.read();
         if( !p ) { break; } // nothing was read: stop
         // substitute x,y,z for range,bearing,elevation, leaving timestamp unchanged
         // in the line just read is.last()
         os.ascii().write( p->to_polar(), is.last() );
     }
 }

The binary streams work in the same way, except you need to specify binary format in csv::options. See csv utilities and unit tests for more generic usage examples. See doxygen documentation for details.
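The pass-through idea from the timestamped example (replace x,y,z with range,bearing,elevation, keep the first field untouched) can also be shown without the library. The sketch below is plain C++ under stated assumptions: polar, to_polar, and convert are hypothetical names, and the angle convention (bearing = atan2( y, x ), elevation = asin( z / range ), both in radians) is an assumption of this sketch, not necessarily the one used by the actual utilities.

```cpp
#include <cassert>
#include <cmath>
#include <sstream>
#include <string>

struct polar { double range, bearing, elevation; };

// Cartesian to polar; the angle convention is an assumption of this sketch
polar to_polar( double x, double y, double z )
{
    const double range = std::sqrt( x * x + y * y + z * z );
    const polar p = { range, std::atan2( y, x ), range > 0 ? std::asin( z / range ) : 0.0 };
    return p;
}

// convert one ascii line "t,x,y,z" to "t,range,bearing,elevation";
// the first field is copied as is, whatever it contains
std::string convert( const std::string& line )
{
    std::istringstream iss( line );
    std::string t;
    double x, y, z;
    char comma;
    std::getline( iss, t, ',' );
    iss >> x >> comma >> y >> comma >> z;
    assert( !iss.fail() ); // expect exactly three numbers after the first field
    const polar p = to_polar( x, y, z );
    std::ostringstream oss;
    oss << t << ',' << p.range << ',' << p.bearing << ',' << p.elevation;
    return oss.str();
}
```

Note that convert never interprets the timestamp: it is just an opaque first column, which is the same decoupling of data and meaning that the streams provide through csv::options.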

notes

When using the standard input/output streams, the comma::csv streams rely on the standard C++ streams not being synchronized with the standard C streams (i.e. std::cin.sync_with_stdio( false ); is called upon construction). If all input/output operations in the program are performed through the comma::csv streams, this should not be of any concern. However, if there are also input/output operations performed directly on the standard streams (e.g. std::getline( std::cin, str ); ), then std::cin.sync_with_stdio( false ); has to be called in the program before any input/output operations are performed.

class overview

  • ascii: conversions between an ascii csv string and a class
  • binary: conversions between a binary fixed-width buffer and a class
  • input_stream,output_stream: strongly typed input/output streams of csv or binary fixed-width data
  • format: binary format definitions and operations
  • options: csv options
  • names: a field name visitor

This is just an overview of the main classes. See doxygen documentation for details.

utilities overview

All utilities can handle both ascii csv and binary data in the same way, unless stated otherwise, e.g:

Play back ascii csv data with the timestamp in the 3rd field:

 cat timestamped.csv | csv-play --fields=,,t

Play back binary data with the timestamp in the 3rd field, where the fields are unsigned int, double, time, int:

 cat timestamped.bin | csv-play --fields=,,t --binary=ui,d,t,i

  • csv-analyse: try to guess binary size (width) of data in an unknown binary stream
  • csv-bin-cut: same as standard linux cut, but on binary data
  • csv-bin-reverse: reverse byte order of given fields (to change endianness)
  • csv-blocks: operations on blocks of data based on the block field
  • csv-calc: column-wise operations on csv files: min, max, mean, stderr, diameter, etc
  • csv-cast: take binary in given format, output binary in another format
  • csv-crc: append/check/strip crc field
  • csv-eval: evaluate expression and append computed values to csv stream
  • csv-fields: convert comma-separated fields to field numbers (to combine with utilities like cut)
  • csv-from-bin: take binary in given format, output ascii csv
  • csv-interval: take intervals and separate them at points of overlap
  • csv-join: join two csv streams by one or more fields
  • csv-paste: same as linux paste, but works on ascii, binary, and constants
  • csv-play: play back by time field one or more inputs
  • csv-quote: take csv string, quote/unquote anything that is not a number (useful for scripting)
  • csv-repeat: periodically repeat the last input record after a given timeout
  • csv-reshape: perform reshaping operation(s) on input data e.g. concatenate
  • csv-select: output only rows that match given constraints on specific columns (todo: define a simple grammar for command-line boolean expressions)
  • csv-shuffle: append, remove, swap csv columns
  • csv-size: take binary format, output record size (e.g. csv-size t,d,ui will output 20)
  • csv-sort: sort csv files by one or more fields
  • csv-split: split input by timestamp or other field
  • csv-thin: randomly thin the input at given rate
  • csv-time-delay: add given time delay to a timestamp
  • csv-time-join: same as csv-join, but joins by a monotonic timestamp column and thus works on streams
  • csv-time-stamp: add system time as a timestamp column
  • csv-time: convert between a few time formats; see time formats for details and limits
  • csv-update: take two files or streams, output values of the first stream updated with values from the second; reminiscent of sql update; see csv-update --help for more
  • csv-to-bin: take ascii csv, output binary in given format
todo: brush up the csv operations for completeness
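As an illustration of what a utility like csv-size computes, the sketch below derives the record size from a format string. It is not the csv-size implementation: element_size and record_size are hypothetical helpers, and only the types that appear on this page are handled, with sizes inferred from the examples above (i and ui: 4 bytes, d and t: 8 bytes, s[N]: N bytes, an optional leading count such as 2ui).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

// size in bytes of a single format element such as "ui", "2ui" or "s[8]"
// (hypothetical helper; covers only the types used on this page)
std::size_t element_size( const std::string& element )
{
    std::size_t pos = 0, count = 0;
    while( pos < element.size() && std::isdigit( static_cast< unsigned char >( element[pos] ) ) )
    {
        count = count * 10 + ( element[pos] - '0' );
        ++pos;
    }
    if( pos == 0 ) { count = 1; } // no multiplier given
    const std::string type = element.substr( pos );
    if( type == "i" || type == "ui" ) { return count * 4; }
    if( type == "d" || type == "t" ) { return count * 8; }
    // otherwise expect a fixed-length string, e.g. s[8]
    assert( type.size() > 3 && type.compare( 0, 2, "s[" ) == 0 && type[ type.size() - 1 ] == ']' );
    return count * std::stoul( type.substr( 2, type.size() - 3 ) );
}

// total record size for a comma-separated format string
std::size_t record_size( const std::string& format )
{
    std::size_t size = 0;
    std::istringstream iss( format );
    for( std::string element; std::getline( iss, element, ',' ); ) { size += element_size( element ); }
    return size;
}
```

With these assumed sizes, record_size( "t,d,ui" ) is 20, matching the csv-size example in the list, and record_size( "i,s[8],2ui" ) is also 20, the buffer size in the binary example.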

utilities tutorials

tutorials
