csv - acfr/comma GitHub Wiki
Tabular and comma-separated values are a simple and portable data model for streams of sensor readings, market quotes, database records, etc. It is a trivial protocol for any homogeneous fixed-width data stream. It decouples the csv data from the meaning of its fields, so different classes or utilities can assign different meanings to the same fields.
The other design force is keeping arbitrary csv data human-readable, which is crucial for prototyping, debugging, and simple data operations, without losing performance. Therefore, the csv library supports both ASCII and binary fixed-width data, with simple converters between them.
Thus, the idea is to:
- keep the unstructured fixed-width data separate from its meaning: the data user assigns the meaning to the data (a "pull model" for the meaning)
- be able to tap into those flows at any point, either to redirect them or simply to see what is going on
- more generally, represent the whole system operation as a bunch of csv data streams and transformations on them, which covers very many real-life use cases
The csv-style filters can be put together into processing pipelines, mixed with standard utilities like cut, grep, sed, etc.
Say, we have a sensor publishing timestamped 3D points in polar coordinates (for brevity, assume the data is ASCII). The fields are t,range,bearing,elevation,scan,intensity, where one line looks like:
20121006T122433,12.345,98.765,33.22,25,234
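Because the records are plain csv, standard utilities already apply to them; for instance, extracting just the range and bearing fields (fields 2 and 3) of the record above with cut:

```shell
# sample record: t,range,bearing,elevation,scan,intensity
echo '20121006T122433,12.345,98.765,33.22,25,234' | cut -d, -f2,3
# outputs: 12.345,98.765
```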
The pipeline below converts them to Cartesian coordinates and visualises them in real time. This is an actual pipeline; the utilities are implemented elsewhere using the comma/csv library. The meaning of the fields is defined by the --fields command line option.
netcat localhost 12345 | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar
Now suppose we have a log collected from this sensor and want to play it back:
cat points.csv | csv-play --fields=t | points-to-cartesian --fields=,r,b,e | view-points --fields=,x,y,z,block,scalar
Try to copy the examples below into your source files, build and run them.
There are two basic visitor facades: csv::ascii and csv::binary.
Assume we have a structure foo with visiting traits defined (see visiting for details):
// an arbitrary structure
#include <comma/visiting/traits.h>
struct foo
{
    std::string hello;
    std::string world;
    int index;
};
// the only routine part: define in some header (e.g. foo.h) how we visit foo
namespace comma { namespace visiting {

template <> struct traits< foo >
{
    // non-const visitor
    template < typename K, typename V > static void visit( const K& k, foo& f, V& v )
    {
        v.apply( "hello", f.hello );
        v.apply( "world", f.world );
        v.apply( "index", f.index );
    }
    // const visitor
    template < typename K, typename V > static void visit( const K& k, const foo& f, V& v )
    {
        v.apply( "hello", f.hello );
        v.apply( "world", f.world );
        v.apply( "index", f.index );
    }
};

} } // namespace comma { namespace visiting {
For ASCII visiting:
#include <comma/csv/ascii.h>
#include "foo.h"
int main( int ac, char** av )
{
    foo f;
    {
        std::string s = "25,world,is,so,gloomy";
        // ascii csv visitor for given csv fields
        // the fields can be either flat, e.g: --fields=x,y,z
        // or hierarchical, e.g: --fields=from/x,from/y,from/z,to/x,to/y,to/z
        // where hierarchical fields can be collapsed, e.g: --fields=from,to means the same as above
        comma::csv::ascii< foo > ascii( "index,,,,world" );
        ascii.get( f, s );
        std::cerr << f.index << std::endl; // prints: 25
        std::cerr << f.world << std::endl; // prints: gloomy
        f.index = 99;
        ascii.put( f, s );
        std::cerr << s << std::endl; // prints: 99,world,is,so,gloomy
    }
    {
        std::string s = "88,world,is,so,happy";
        comma::csv::ascii< foo > ascii( ",,,,world" );
        ascii.get( f, s );
        std::cerr << f.index << std::endl; // still prints: 99
        std::cerr << f.world << std::endl; // prints: happy
    }
}
For binary data, we need to specify not only the fields, but also the binary format.
#include <sstream>
#include <vector>
#include <comma/csv/binary.h>
#include "foo.h"
// ...
foo f;
f.index = 25;
f.world = "gloomy";
{
    // binary fixed-width visitor: a 4-byte int, an 8-byte fixed-length string, and two unsigned ints
    comma::csv::binary< foo > binary( "i,s[8],2ui", "index,world" );
    std::vector< char > buf( binary.size() );
    // put f.index in the first 4 bytes and f.world in the next 8 bytes
    binary.put( f, &buf[0] );
}
comma::csv::ascii and comma::csv::binary are used less often than the csv streams, which are useful for any utility that takes an input stream of csv data, does something to it, and outputs a csv stream of another type.
For convenience they use csv::options to specify fields, binary format, delimiter, and full_xpath flag.
E.g., suppose we want to read Cartesian points and output points in polar coordinates. Assume we have point_xyz, point_polar, and conversions between them defined elsewhere.
#include "point.h" // suppose it defines point_xyz, point_polar, etc
#include <comma/csv/stream.h>
int main( int, char** )
{
    comma::csv::input_stream< point_xyz > is( std::cin );
    comma::csv::output_stream< point_polar > os( std::cout );
    while( std::cin.good() )
    {
        const point_xyz* p = is.read();
        if( !p ) { break; }
        os.write( p->to_polar() );
    }
}
Assume now that the input has a timestamp as the first field and that the timestamp should be output with each point unchanged:
#include "point.h" // suppose it defines point_xyz, point_polar, etc
#include <comma/csv/stream.h>
int main( int, char** )
{
    comma::csv::options input_options;
    input_options.fields = ",x,y,z";
    comma::csv::options output_options;
    output_options.fields = ",range,bearing,elevation";
    comma::csv::input_stream< point_xyz > is( std::cin, input_options );
    comma::csv::output_stream< point_polar > os( std::cout, output_options );
    while( std::cin.good() )
    {
        const point_xyz* p = is.read();
        if( !p ) { break; }
        // substitute x,y,z with range,bearing,elevation in the line just read ( is.last() ),
        // leaving the timestamp unchanged
        os.ascii().write( p->to_polar(), is.last() );
    }
}
The binary streams work in the same way, except that you need to specify the binary format in csv::options. See the csv utilities and unit tests for more generic usage examples, and the doxygen documentation for details.
When using the standard input/output streams, the comma::csv streams rely on the standard C++ streams not being synchronized with the standard C streams (std::cin.sync_with_stdio( false ) is called upon construction). If all input/output in the program goes through the comma::csv streams, this is of no concern. However, if input/output is also performed directly on the standard streams (e.g. std::getline( std::cin, str )), then std::cin.sync_with_stdio( false ) must be called before any input/output operations are performed.
- ascii: conversions between an ascii csv string and a class
- binary: conversions between a binary fixed-width buffer and a class
- input_stream,output_stream: strongly typed input/output streams of csv or binary fixed-width data
- format: binary format definitions and operations
- options: csv options
- names: a field name visitor
All utilities can handle both ascii csv and binary data in the same way, unless stated otherwise, e.g:
Play back ascii csv data with the timestamp in the 3rd field:
cat timestamped.csv | csv-play --fields=,,t
Play back binary data with the timestamp in the 3rd field, where the fields are unsigned int, double, time, and int:
cat timestamped.bin | csv-play --fields=,,t --binary=ui,d,t,i
- csv-analyse: try to guess binary size (width) of data in unknown binary stream
- csv-bin-cut: same as standard linux cut, but on binary data
- csv-bin-reverse: reverse byte order of given fields (to change endianness)
- csv-blocks: operations on blocks of data based on the block field
- csv-calc: column-wise operations on csv files: min, max, mean, stderr, diameter, etc
- csv-cast: take binary in given format, output binary in another format
- csv-crc: append/check/strip crc field
- csv-eval: evaluate expression and append computed values to csv stream
- csv-fields: convert comma-separated fields to field numbers (to combine with utilities like cut)
- csv-from-bin: take binary in given format, output ascii csv
- csv-interval: take intervals and separate them at points of overlap
- csv-join: join two csv streams by one or more fields
- csv-paste: same as linux paste, but works on ascii, binary, and constants
- csv-play: play back by time field one or more inputs
- csv-quote: take csv string, quote/unquote anything that is not a number (useful for scripting)
- csv-repeat: periodically repeat the last input record after a given timeout
- csv-reshape: perform reshaping operation(s) on input data e.g. concatenate
- csv-select: output only rows that match given constraints on specific columns (todo: define a simple grammar for command-line boolean expressions)
- csv-size: take binary format, output record size (e.g. csv-size t,d,ui will output 20)
- csv-sort: sort csv files by one or more fields
- csv-split: split input by timestamp or other field
- csv-shuffle: append, remove, swap csv columns
- csv-thin: randomly thin the input at given rate
- csv-time-delay: add given time delay to a timestamp
- csv-time-join: same as csv-join, but joins by a monotonically increasing timestamp column and thus works on streams
- csv-time-stamp: add system time as a timestamp column
- csv-time: convert between a few time formats; see time formats for details and limits
- csv-to-bin: take ascii csv, output binary in given format
- csv-update: take two files or streams, output values of the first stream updated with values from the second; reminiscent of sql update; see csv-update --help for more