python csv module - acfr/comma GitHub Wiki

intro

Refer to csv for the motivation behind the module, and to snark (and comma) for dummies, on Ubuntu/Debian for enabling the comma python modules.

The csv module provides basic functionality for processing csv-style ascii and binary streams in python. Its implementation relies on the numpy package both for describing data structures and for reading and writing csv streams.

Two classes are provided:

  • struct: creating data structures
  • stream: reading and writing csv-style streams

creating data structures

basic

An object of type comma.csv.struct represents the meaning and type of data contained in a csv stream. As such, the user is required to provide field names and their numpy types when creating these objects. For instance, a struct for representing timestamped coordinates of a 3d point can be created like this:

import comma
event_t = comma.csv.struct( 't,x,y,z', 'datetime64[us]', 'float64', 'float64', 'float64' )

where the first argument describes the fields of a csv stream ('t,x,y,z') and the following four arguments specify the numpy types of the fields (or strings corresponding to numpy types). For more details on numpy types, consult the following pages:

http://docs.scipy.org/doc/numpy/user/basics.rec.html

http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html

Alternatively, one can use the comma.csv.format.to_numpy function to convert a comma format string to numpy types (comma.csv.format.from_numpy performs the opposite conversion):

import comma
event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
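
As an aside, from_numpy converts numpy types back to a comma format string; a minimal round-trip sketch (the unpacked-argument call style of from_numpy is an assumption):

import comma

types = comma.csv.format.to_numpy( 't,3d' )     # numpy types for a timestamp and three doubles
print( comma.csv.format.from_numpy( *types ) )  # expected to print the original comma format: t,3d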

Note that event_t represents a type; it does not contain any data. It can be used in place of a numpy dtype, for instance, to instantiate numpy arrays:

import comma
import numpy as np
event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
events = np.empty( 10, dtype=event_t ) # create an array of 10 objects of type event_t.dtype
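
The fields of such an array can be read and assigned like those of any other numpy structured array; continuing the example above, a minimal sketch:

events['t'] = np.datetime64( '2015-01-01T00:00:00.000000' )  # broadcast one timestamp to all 10 events
events['x'] = np.arange( 10, dtype='float64' )               # x-coordinates 0.0, 1.0, ..., 9.0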

hierarchical

Types defined with comma.csv.struct can be used alongside numpy types to describe the fields of other data structures. For instance, the definition of record_t below uses the structs event_t and observer_t:

import comma
event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )

A csv stream for data of type record_t will have fields observer/name,observer/id,event/t,event/x,event/y,event/z and format s[10],ui,t,3d. Complex hierarchical types can be created this way.
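
Fields of a hierarchical struct are accessed level by level, one index per level of the hierarchy. A minimal sketch, continuing the example above:

import numpy as np

records = np.empty( 5, dtype=record_t )                     # an array of 5 records of type record_t
records['observer']['id'] = np.arange( 5, dtype='uint32' )  # assign observer ids 0 to 4
records['event']['x'] = 1.0                                 # set the x-coordinate of every event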

creating and manipulating struct objects

A 1d numpy array of the dtype defined by record_t can be created with

records = record_t( size=10 )

where size is the number of records in the array (note that the fields of records are not assigned any particular values). When size is omitted, a 1d array containing a single record is created. Such an array can be converted to a tuple as follows:

record = record_t()
record_t.to_tuple( record )

For example,

import comma

color_t = comma.csv.struct( 'r,g,b', *comma.csv.format.to_numpy( '3ub' ) )
color = color_t()
color['r'] = 1
color['g'] = 2
color['b'] = 3
color_t.to_tuple( color )

yields

(1, 2, 3)

defining streams

simple streams

The following command defines a csv stream for data of type record_t:

record_stream = comma.csv.stream( record_t )

By default, it is an ascii stream with entries delimited by a comma. The delimiter can be changed with the delimiter keyword, e.g.

record_stream = comma.csv.stream( record_t, delimiter='|', precision=4 )

The precision of floating point output is controlled by the precision keyword, which defaults to 12.

For a binary stream, set the binary keyword to True, e.g.

record_stream = comma.csv.stream( record_t, binary=True )

By default, the source of the stream is stdin and the target is stdout. To read data from a file and/or write to another file, set the source and target keywords to suitable file objects, e.g.

record_stream = comma.csv.stream( record_t, binary=True, source=open( 'input.bin', 'rb' ), target=open( 'output.bin', 'wb' ), flush=True )

The flush keyword used above ensures the output stream is flushed on every record (by default, the output is buffered). Note that flushing may negatively impact performance and hence should only be used on relatively slow streams.

streams with extra fields or mixed fields

An input stream whose csv data exactly matches the fields of a struct type is simple. However, it is common for streams to have more fields than required, or to carry the expected fields in a different order. To deal with such streams, specify the fields keyword (and the format keyword if the stream is binary) when creating the stream. For instance,

import comma
event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )

fields = ',event/x,event/y,event/z,event/t,observer/name,,,observer/id'
format = ','.join( comma.csv.format.to_numpy( 't,3d,t,s[10],2i,ui' ) )
record_stream = comma.csv.stream( record_t, fields=fields, format=format )

defines a binary stream of format t,3d,t,s[10],2i,ui (specified by the format keyword), where the positions of the expected fields are given by the fields keyword and empty entries mark fields to be skipped. Omitting the format keyword creates an ascii stream. It is also possible to specify the binary format via the binary keyword, in which case format is ignored; both variants are sketched below.
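
# ascii stream with the same field layout (format omitted)
record_stream = comma.csv.stream( record_t, fields=fields )

# equivalent binary stream, with the format passed via the binary keyword
record_stream = comma.csv.stream( record_t, fields=fields, binary=format )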

setting stream parameters with command-line options

comma provides comma.csv.add_options( parser, defaults={} ), a simple mechanism for setting stream parameters via command-line options. An example of its use in code:

import comma
import argparse

parser = argparse.ArgumentParser()
comma.csv.add_options( parser )
args = parser.parse_args()

class stream:
    def __init__( self, args ):
        self.args = args
        self.csv_options = dict( full_xpath=False,
                                 flush=self.args.flush,
                                 delimiter=self.args.delimiter,
                                 precision=self.args.precision )

and on the command-line:

$ myprogram --delimiter :

comma.csv.add_options() adds options for:

-f <names>, --fields <names>    field names of input stream
-b <format>, --binary <format>  format for binary stream (default: ascii)
-d <char>, --delimiter <char>   csv delimiter of ascii stream (default: ,)
--precision <precision>         floating point precision of ascii output (default: 12)
--flush, --unbuffered           flush stdout after each record (stream is unbuffered)

It also adds help text for the options, which is shown when calling <program name> -h.
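
Putting it together, a minimal sketch of passing the parsed options on to a stream (assuming record_t is defined as before; the attribute names below are an assumption following the long option names):

import argparse
import comma

parser = argparse.ArgumentParser()
comma.csv.add_options( parser )
args = parser.parse_args()

# map the parsed command-line options onto the corresponding stream keywords
record_stream = comma.csv.stream( record_t,
                                  fields=args.fields,
                                  binary=args.binary,
                                  delimiter=args.delimiter,
                                  precision=args.precision,
                                  flush=args.flush )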

reading and writing data

reading

The following command will read a single record from record_stream and save it in a 1d numpy array record:

record = record_stream.read( size=1 )

In general, the size keyword tells read() how many records to read. The dtype of the elements of record is the same as record_t.dtype, and the number of elements is at most size (it is less than size if the stream ends before the requested number of records is read). When size is omitted, read() reads one record if the stream is unbuffered (flush=True), or attempts to read as many records as needed to fill a 64Kb buffer if the stream is buffered (flush=False, the default behaviour). When reading from a file, all records can be read at once by using size=-1. For instance,

record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ) )
records = record_stream.read( size=-1 )

will read all records from file input.csv.

processing and writing

The records contained in records can be manipulated like any other numpy array. For instance, the following commands add 1 second to the time and subtract 0.1 from the x-coordinate of every event recorded in 'input.csv':

import comma
import numpy

record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ), target=open( 'output.csv', 'w' ) )
records = record_stream.read( size=-1 )
records['event']['t'] += numpy.timedelta64( 1, 's' )
records['event']['x'] -= 0.1

The modified records can then be written to 'output.csv':

record_stream.write( records )

iterating over input stream records

To iterate over an input stream, make use of iter() as follows:

import comma
import numpy

record_stream = comma.csv.stream( record_t )
for records in record_stream.iter():
  records['event']['t'] += numpy.timedelta64( 1, 's' )
  records['event']['x'] -= 0.1
  record_stream.write( records )

iter() accepts the size keyword with the same meaning as in read(). By default, it tries to read many records at once. To read records one by one, use iter( size=1 ).

In the example above, the same stream is used for input and output. If several streams are used, e.g. one for reading and another for writing, the flush keyword should be applied consistently to all streams to ensure uniformly buffered or unbuffered operation, as in the sketch below.
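
For instance, a minimal sketch of a filter reading from stdin and writing to a file, with flush applied consistently to both streams:

import comma

input_stream = comma.csv.stream( record_t, flush=True )
output_stream = comma.csv.stream( record_t, flush=True, target=open( 'output.csv', 'w' ) )

for records in input_stream.iter():  # with flush=True, iter() reads one record at a time
  output_stream.write( records )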

comments

the effect of the size keyword

Suppose the input stream contains the following

0
0
0
0
0
0
0
0

Then, running the following code

import comma

point_t = comma.csv.struct( 'x', 'float64' )
record_stream = comma.csv.stream( point_t )

for i,points in enumerate( record_stream.iter( size=3 ), start=1 ):
  points['x'] += i
  record_stream.write( points )

reads points in batches of three and, therefore, yields

1
1
1
2
2
2
3
3

Note that the last batch of points contains only two elements, since the input stream ends after eight records.

arrays as part of records

Suppose the input stream starts like this

20150101T000000.123456,0,1,2,3,4,5
20150101T000001.123456,0,1,2,3,4,5
20150101T000002.123456,0,1,2,3,4,5
...

where the six numbers represent values of a 2x3 matrix. Then the matrix can be read into a record containing a 2d numpy array by using numpy type '(2,3)float64'. For instance, running the following code

import comma

event_t = comma.csv.struct( 't,signal', 'datetime64[us]', '(2,3)float64' )
stream = comma.csv.stream( event_t )

event = stream.read( size=1 )
event['signal'] += [ [0,-1,-2], [-3,-4,-5] ]
stream.write( event )

yields

20150101T000000.123456,0.0,0.0,0.0,0.0,0.0,0.0
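
Since event['signal'] is an ordinary numpy array of shape (1, 2, 3) (one record, holding a 2x3 matrix), individual matrix entries can be accessed with the usual numpy indexing; a minimal sketch, continuing the example above:

matrix = event['signal'][0]  # the 2x3 matrix of the first (and only) record read
value = matrix[1, 2]         # the element in row 1, column 2 of the matrix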

concise fields

Normally, the full xpath of each field needs to be provided to identify a field in a hierarchical struct. This may become tedious if there are many fields in a deep hierarchy. Fortunately, if several adjacent fields follow the order used in the definition of the struct, only the name of their parent needs to be given; it is automatically expanded to the individual fields. For instance,

import comma
coordinates_t = comma.csv.struct( 'x,y', 'float64', 'float64' )
orientation_t = comma.csv.struct( 'yaw', 'float64' )
position_t = comma.csv.struct( 'coordinates,orientation', coordinates_t, orientation_t )
timestamped_position_t = comma.csv.struct( 't,position', 'datetime64[us]', position_t )

input_stream = comma.csv.stream( timestamped_position_t, fields='position,t' )

defines a stream with fields position/coordinates/x,position/coordinates/y,position/orientation/yaw,t.

Alternatively, if the leaf names of the required fields' xpaths are unambiguous, the leaves can be used instead of the full xpaths, provided that the full_xpath keyword is set to False (by default, it is True). For instance,

import comma
coordinates_t = comma.csv.struct( 'x,y', 'float64', 'float64' )
orientation_t = comma.csv.struct( 'yaw', 'float64' )
position_t = comma.csv.struct( 'coordinates,orientation', coordinates_t, orientation_t )
timestamped_position_t = comma.csv.struct( 't,position', 'datetime64[us]', position_t )

input_stream = comma.csv.stream( timestamped_position_t, fields='x,y,yaw,t', full_xpath=False )

defines the same stream as the one above.

tied input and output streams

It is sometimes desirable to pass the records from the input stream through to the output with some extra fields attached at the end. This is accomplished with the tied keyword, as illustrated in the example below.

Copy and save the following code in a file called attach-min-max:

#!/usr/bin/python
import comma
import numpy as np

point_t = comma.csv.struct( 'x,y,z', 'float64', 'float64', 'float64' )
event_t = comma.csv.struct( 't,coordinates', 'datetime64[us]', point_t )

fields = ',coordinates/y,coordinates/z,,,t,coordinates/x,,'
format = ','.join( comma.csv.format.to_numpy( 'i,d,d,s[3],s[7],t,d,ui,ui' ) )
input_stream = comma.csv.stream( event_t, fields=fields, format=format )

output_t = comma.csv.struct( 'min,max', 'float64', 'float64' )
output_stream = comma.csv.stream( output_t, binary=True, tied=input_stream )

for events in input_stream.iter():
  output = np.empty( events.size, dtype=output_t )
  output['min'] = np.min( events['coordinates'].view( '3float64' ), axis=1 )
  output['max'] = np.max( events['coordinates'].view( '3float64' ), axis=1 )
  output_stream.write( output )

and create a file called input.csv with the following content:

-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20

Then, executing

chmod u+x attach-min-max
cat input.csv | csv-to-bin i,2d,s[3],s[7],t,d,2ui | ./attach-min-max | csv-from-bin i,2d,s[3],s[7],t,d,2ui,2d

yields

-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20,-2,3
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20,-3,4
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20,-4,5

missing fields

If some of the expected fields are not present in the input stream, the missing fields are populated with zero values (a blank string for string types and the zero epoch for the time type). For instance, given the input stream

1
2
3

the following code

import comma

t = comma.csv.struct( 's,x,y,t', 'S2', 'i4', 'i4', 'datetime64[us]' )
s = comma.csv.stream( t, fields='x' )

for r in s.iter():
  s.write( r )

yields

,1,0,19700101T000000
,2,0,19700101T000000
,3,0,19700101T000000

If the missing fields should be populated with default values other than zero, these can be specified with the default_values keyword when creating the stream. For example,

import comma

t = comma.csv.struct( 's,x,y,t', 'S2', 'i4', 'i4', 'datetime64[us]' )
s = comma.csv.stream( t, fields='x', default_values={ 'y': 1, 't': '20150102T123456' } )

for r in s.iter():
  s.write( r )

yields

,1,1,20150102T123456
,2,1,20150102T123456
,3,1,20150102T123456

local time zone

This section is relevant to those using numpy version 1.10 or earlier. In numpy 1.11.0, the behaviour of datetime64 objects, which the comma csv module uses to represent timestamps, changed: they no longer carry time zone information, and time is always interpreted as UTC. Nevertheless, it is still preferable to set the time zone to UTC explicitly to ensure compatibility with systems running earlier versions of numpy.

By default, time imported from the input stream is converted to the local time zone. For instance, feeding input stream

20140101T000000
20150101T000000

to

import comma

t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )

for r in s.iter( size=1 ):
  print( r['t'][0] )

yields

2014-01-01T00:00:00.000000+1100
2015-01-01T00:00:00.000000+1100

where the local time zone offset is +11 hours. A convenience function to change the time zone used by python is provided in the comma.csv.time module; it needs to be invoked before reading from a stream. For instance, feeding the same input stream to

import comma

comma.csv.time.zone( 'UTC' ) # set time zone to UTC

t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )

for r in s.iter( size=1 ):
  print( r['t'][0] )

yields

2014-01-01T00:00:00.000000+0000
2015-01-01T00:00:00.000000+0000

Note that the write() function of the stream class ignores the time zone and, therefore,

for r in s.iter( size=1 ):
  s.write( r )

will yield the same output regardless of time zone.

performance

For binary streams, python utilities using comma.csv.stream are comparable to, or a few times slower than, the equivalent C++ utilities, provided that a large enough size is used (the default is usually sufficient). For ascii streams, the performance of comma.csv.stream is quite poor, so it is feasible only for small files. A further degradation in performance (by a factor of a few) may be expected if size=1 is used, as illustrated in the sketch below.
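
In practice, this means preferring the default batched reads over record-by-record processing; a minimal sketch of the two styles (assuming stream is a comma.csv.stream as in the earlier examples):

# fast: by default, iter() reads a large batch of records per iteration
for records in stream.iter():
  stream.write( records )

# slower by a factor of a few: one record per iteration
for record in stream.iter( size=1 ):
  stream.write( record )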
