python csv module - acfr/comma GitHub Wiki
Refer to the csv page for the motivation behind the module, and to "snark (and comma) for dummies, on Ubuntu/Debian" for enabling the comma python modules.
The csv module provides basic functionality for processing csv-style ascii and binary streams in python. Its implementation relies on the numpy package for describing data structures and for reading and writing csv streams.
Two classes are provided:
- `struct`: creating data structures
- `stream`: reading and writing csv-style streams
An object of type `comma.csv.struct` represents the meaning and type of data contained in a csv stream. As such, the user is required to provide field names and their numpy types when creating these objects. For instance, a struct representing timestamped coordinates of a 3d point can be created like this:
```python
import comma
event_t = comma.csv.struct( 't,x,y,z', 'datetime64[us]', 'float64', 'float64', 'float64' )
```
where the first argument describes the fields of a csv stream ('t,x,y,z') and the following four arguments specify the numpy types of the fields (either as numpy types or as strings naming them). For more details on numpy types, consult the following pages:

- http://docs.scipy.org/doc/numpy/user/basics.rec.html
- http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html
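Under the hood, such a struct is built on a numpy structured dtype. A plain-numpy sketch of the equivalent dtype (illustration only; `comma.csv.struct` adds field-path handling on top of this):

```python
import numpy as np

# Plain-numpy equivalent of the event_t fields above.
event_dtype = np.dtype([('t', 'datetime64[us]'),
                        ('x', 'float64'), ('y', 'float64'), ('z', 'float64')])
print(event_dtype.names)     # ('t', 'x', 'y', 'z')
print(event_dtype.itemsize)  # 32: four 8-byte fields per record
```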
Alternatively, one can use the `comma.csv.format.to_numpy` function to convert a comma format string to numpy types (`comma.csv.format.from_numpy` does the opposite conversion):
```python
import comma
event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
```
Note that `event_t` represents a type; it does not contain any data. It can be used in place of a numpy dtype, for instance, to instantiate numpy arrays:
```python
import comma
import numpy as np

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
events = np.empty( 10, dtype=event_t ) # create an array of 10 objects of type event_t.dtype
```
Types defined with `comma.csv.struct` can be used along with numpy types to describe the types of fields in other data structures. For instance, the definition of `record_t` below uses the previously defined `event_t` and `observer_t`:
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )
```
A csv stream for data of type `record_t` will have fields `observer/name,observer/id,event/t,event/x,event/y,event/z` and format `s[10],ui,t,3d`. Complex hierarchical types can be created this way.
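The nesting itself is plain numpy. A numpy-only sketch of the nested `record_t` layout (illustration only; `comma.csv.struct` flattens such nesting into slash-separated field names like `observer/id`):

```python
import numpy as np

# Nested structured dtype mirroring record_t = struct( 'observer,event', ... )
observer_dtype = np.dtype([('name', 'S10'), ('id', 'uint32')])
event_dtype = np.dtype([('t', 'datetime64[us]'),
                        ('x', 'float64'), ('y', 'float64'), ('z', 'float64')])
record_dtype = np.dtype([('observer', observer_dtype), ('event', event_dtype)])
r = np.zeros(1, dtype=record_dtype)
r['observer']['id'] = 42          # nested access, analogous to observer/id
print(r['observer']['id'][0])     # 42
```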
A 1d numpy array of the dtype defined by `record_t` can be created with

```python
records = record_t( size=10 )
```

where `size` is the number of records in the array (note that the fields of `records` are not assigned any particular values). When `size` is omitted, a 1d array with a single record is created. Such an array can be converted to a tuple as follows:

```python
record = record_t()
record_t.to_tuple( record )
```
For example:

```python
color_t = comma.csv.struct( 'r,g,b', *comma.csv.format.to_numpy( '3ub' ) )
color = color_t()
color['r'] = 1
color['g'] = 2
color['b'] = 3
color_t.to_tuple( color )  # returns (1, 2, 3)
```
The following command defines a csv stream for data of type `record_t`:

```python
record_stream = comma.csv.stream( record_t )
```
By default, it is an ascii stream with entries delimited by a comma. The delimiter can be changed with the `delimiter` keyword, e.g.

```python
record_stream = comma.csv.stream( record_t, delimiter='|', precision=4 )
```

The precision of floating point output is controlled by the `precision` keyword, which defaults to 12.
For a binary stream, set the `binary` keyword to `True`, e.g.

```python
record_stream = comma.csv.stream( record_t, binary=True )
```
By default, the source of the stream is stdin and the target is stdout. To read data from a file and/or write to another file, set the `source` and `target` keywords to suitable file objects, e.g.

```python
record_stream = comma.csv.stream( record_t, binary=True, source=open( 'input.bin', 'r' ), target=open( 'output.bin', 'w' ), flush=True )
```
The `flush` keyword used above ensures the output stream is flushed on every record (by default, the output is buffered). Note that flushing may negatively impact performance and hence should only be used on relatively slow streams.
An input stream whose csv data exactly matches the fields of a `struct` type is simple to define. However, it is common for streams to have more fields than required, or to have the expected fields in a different order. To handle such streams, specify the appropriate `fields` keyword (and `format` if the stream is binary) when creating the stream. For instance,
```python
import comma

event_t = comma.csv.struct( 't,x,y,z', *comma.csv.format.to_numpy( 't,3d' ) )
observer_t = comma.csv.struct( 'name,id', *comma.csv.format.to_numpy( 's[10],ui' ) )
record_t = comma.csv.struct( 'observer,event', observer_t, event_t )
fields = ',event/x,event/y,event/z,event/t,observer/name,,,observer/id'
format = ','.join( comma.csv.format.to_numpy( 't,3d,t,s[10],2i,ui' ) )
record_stream = comma.csv.stream( record_t, fields=fields, format=format )
```
defines a binary stream of format `t,3d,t,s[10],2i,ui` (specified by the `format` keyword) where the positions of the expected fields are given by the `fields` keyword. Omitting the `format` keyword would create an ascii stream. It is also possible to specify the binary format via the `binary` keyword, in which case `format` is ignored.
comma provides `comma.csv.add_options( parser, defaults={} )`, a simple mechanism for setting stream parameters via command-line options. An example of its use in code:
```python
import comma
import argparse

parser = argparse.ArgumentParser()
comma.csv.add_options( parser )
args = parser.parse_args()

class stream:
    def __init__( self, args ):
        self.args = args
        self.csv_options = dict( full_xpath=False,
                                 flush=self.args.flush,
                                 delimiter=self.args.delimiter,
                                 precision=self.args.precision )
```
and on the command line:

```
$ myprogram --delimiter :
```
`comma.csv.add_options()` adds options for:

```
-f <names>, --fields <names>      field names of input stream
-b <format>, --binary <format>    format for binary stream (default: ascii)
-d <char>, --delimiter <char>     csv delimiter of ascii stream (default: ,)
--precision <precision>           floating point precision of ascii output (default: 12)
--flush, --unbuffered             flush stdout after each record (stream is unbuffered)
```

It also adds help text for these options, shown when the program is called with `-h`.
The following command reads a single record from `record_stream` and saves it in a 1d numpy array `record`:

```python
record = record_stream.read( size=1 )
```
In general, the `size` keyword tells `read()` how many records to read. The dtype of elements in `record` is the same as `record_t.dtype` and the number of elements is at most `size` (it will be less than `size` if the stream ends before the specified number of records is read). When `size` is omitted, `read()` reads one record if the stream is unbuffered (`flush=True`), or attempts to read as many records as are needed to fill the 64Kb buffer if the stream is buffered (`flush=False`, the default behaviour). When reading from a file, all records can be read at once by using `size=-1`. For instance,
```python
record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ) )
records = record_stream.read( size=-1 )
```

will read all records from the file input.csv.
The records contained in `records` can be manipulated like any numpy array. For instance, one can apply the following commands:
```python
import numpy

record_stream = comma.csv.stream( record_t, source=open( 'input.csv', 'r' ), target=open( 'output.csv', 'w' ) )
records = record_stream.read( size=-1 )
records['event']['t'] += numpy.timedelta64( 1, 's' )
records['event']['x'] -= 0.1
record_stream.write( records )
```
To iterate over an input stream, use `iter()` as follows:
```python
record_stream = comma.csv.stream( record_t )
for records in record_stream.iter():
    records['event']['t'] += numpy.timedelta64( 1, 's' )
    records['event']['x'] -= 0.1
    record_stream.write( records )
```
`iter()` accepts the `size` keyword with the same meaning as in `read()`. By default, it will try to read many records at once. To read records one by one, use `iter( size=1 )`.
In the example above, the same stream is used for input and output. If several streams are used, e.g. one for reading and another for writing, the `flush` keyword should be applied consistently to all streams to ensure uniformly buffered or unbuffered operation.
Suppose the input stream contains the following:

```
0
0
0
0
0
0
0
0
```
Then, running the following code

```python
import comma

point_t = comma.csv.struct( 'x', 'float64' )
record_stream = comma.csv.stream( point_t )
for i, points in enumerate( record_stream.iter( size=3 ), start=1 ):
    points['x'] += i
    record_stream.write( points )
```
reads points in batches of three and, therefore, yields

```
1
1
1
2
2
2
3
3
```
Note that the last batch of `points` contains only two elements.
Suppose the input stream starts like this:

```
20150101T000000.123456,0,1,2,3,4,5
20150101T000001.123456,0,1,2,3,4,5
20150101T000002.123456,0,1,2,3,4,5
...
```

where the six numbers represent the values of a 2x3 matrix. Then the matrix can be read into a record containing a 2d numpy array by using the numpy type `'(2,3)float64'`. For instance, running the following code
```python
import comma

event_t = comma.csv.struct( 't,signal', 'datetime64[us]', '(2,3)float64' )
stream = comma.csv.stream( event_t )
event = stream.read( size=1 )
event['signal'] += [ [0,-1,-2], [-3,-4,-5] ]
stream.write( event )
```
yields

```
20150101T000000.123456,0.0,0.0,0.0,0.0,0.0,0.0
```
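The `'(2,3)float64'` subarray type is plain numpy; a numpy-only sketch of the same layout (illustration only, without the comma stream machinery):

```python
import numpy as np

# Structured dtype with a 2x3 subarray field, as in the event_t example above.
event_dtype = np.dtype([('t', 'datetime64[us]'), ('signal', '(2,3)float64')])
e = np.zeros(1, dtype=event_dtype)
e['signal'] += [[0, 1, 2], [3, 4, 5]]
print(e['signal'].shape)  # (1, 2, 3): one record, each holding a 2x3 matrix
```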
Normally, the full xpath of each field needs to be provided to identify the field in a hierarchical struct. This may become tedious if there are many fields with a deep hierarchy. Fortunately, if several adjacent fields follow the order used in the definition of the struct, only the field name of the parent needs to be given; it will automatically be expanded to the individual fields. For instance,
```python
import comma

coordinates_t = comma.csv.struct( 'x,y', 'float64', 'float64' )
orientation_t = comma.csv.struct( 'yaw', 'float64' )
position_t = comma.csv.struct( 'coordinates,orientation', coordinates_t, orientation_t )
timestamped_position_t = comma.csv.struct( 't,position', 'datetime64[us]', position_t )
input_stream = comma.csv.stream( timestamped_position_t, fields='position,t' )
```
defines a stream with fields `position/coordinates/x,position/coordinates/y,position/orientation/yaw,t`.
Alternatively, if the leaves of the full xpaths of the required fields are unambiguous, the leaves can be used instead of the full xpath fields, provided that the `full_xpath` keyword is set to `False` (by default, it is `True`). For instance,
```python
import comma

coordinates_t = comma.csv.struct( 'x,y', 'float64', 'float64' )
orientation_t = comma.csv.struct( 'yaw', 'float64' )
position_t = comma.csv.struct( 'coordinates,orientation', coordinates_t, orientation_t )
timestamped_position_t = comma.csv.struct( 't,position', 'datetime64[us]', position_t )
input_stream = comma.csv.stream( timestamped_position_t, fields='x,y,yaw,t', full_xpath=False )
```
defines the same stream as the one above.
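The leaf-name matching can be pictured with a plain-Python sketch (a hypothetical illustration, not the module's actual implementation): map each unambiguous leaf name to its full xpath, mirroring the example above.

```python
# Hypothetical sketch of what full_xpath=False does conceptually.
full_fields = ['position/coordinates/x', 'position/coordinates/y',
               'position/orientation/yaw', 't']
by_leaf = {path.split('/')[-1]: path for path in full_fields}
print([by_leaf[leaf] for leaf in ['x', 'y', 'yaw', 't']])
```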
It is sometimes desirable to pass the records from the input stream to the output with some extra fields attached at the end. This is accomplished with the `tied` keyword, as illustrated in the example below.
Copy and save the following code in a file called attach-min-max:
```python
#!/usr/bin/python
import comma
import numpy as np

point_t = comma.csv.struct( 'x,y,z', 'float64', 'float64', 'float64' )
event_t = comma.csv.struct( 't,coordinates', 'datetime64[us]', point_t )
fields = ',coordinates/y,coordinates/z,,,t,coordinates/x,,'
format = ','.join( comma.csv.format.to_numpy( 'i,d,d,s[3],s[7],t,d,ui,ui' ) )
input_stream = comma.csv.stream( event_t, fields=fields, format=format )
output_t = comma.csv.struct( 'min,max', 'float64', 'float64' )
output_stream = comma.csv.stream( output_t, binary=True, tied=input_stream )
for events in input_stream.iter():
    output = np.empty( events.size, dtype=output_t )
    output['min'] = np.min( events['coordinates'].view( '3float64' ), axis=1 )
    output['max'] = np.max( events['coordinates'].view( '3float64' ), axis=1 )
    output_stream.write( output )
```
and create a file called input.csv with the following content:

```
-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20
```
Then, executing

```
chmod u+x attach-min-max
cat input.csv | csv-to-bin i,2d,s[3],s[7],t,d,2ui | ./attach-min-max | csv-from-bin i,2d,s[3],s[7],t,d,2ui,2d
```
yields

```
-1,1,-2,abc,nothing,20130101T010100.123456,3,10,20,-2,3
-2,1,-3,def,nothing,20140202T020200.123456,4,10,20,-3,4
-3,1,-4,ghi,nothing,20150303T030300.123456,5,10,20,-4,5
```
If some of the expected fields are not present in the input stream, the missing fields will be populated with zero values (a blank string for string types and the zero epoch for the time type). For instance, on the input stream
```
1
2
3
```
the following code

```python
import comma

t = comma.csv.struct( 's,x,y,t', 'S2', 'i4', 'i4', 'datetime64[us]' )
s = comma.csv.stream( t, fields='x' )
for r in s.iter():
    s.write( r )
```
yields

```
,1,0,19700101T000000
,2,0,19700101T000000
,3,0,19700101T000000
```
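These zero defaults mirror numpy's own zero values for each type, which can be seen with plain numpy (illustration only):

```python
import numpy as np

# Zero-filled structured array: empty bytes for strings, 0 for integers,
# and the 1970 epoch for datetime64 -- matching the stream output above.
t = np.zeros(1, dtype=[('s', 'S2'), ('x', 'i4'), ('t', 'datetime64[us]')])
print(t['s'][0], t['x'][0], t['t'][0])
```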
If the desired default values for the missing fields are different from zero, they can be specified with the `default_values` keyword when creating the stream. For example,
```python
import comma

t = comma.csv.struct( 's,x,y,t', 'S2', 'i4', 'i4', 'datetime64[us]' )
s = comma.csv.stream( t, fields='x', default_values={ 'y': 1, 't': '20150102T123456' } )
for r in s.iter():
    s.write( r )
```

yields

```
,1,1,20150102T123456
,2,1,20150102T123456
,3,1,20150102T123456
```
This section is relevant to those using numpy version 1.10 or earlier. In numpy 1.11.0, the behaviour of datetime64 objects, which the comma csv module uses to represent timestamps, changed: they no longer carry time zone info and time is always interpreted as UTC. Nevertheless, it is still preferable to set the time zone to UTC explicitly to ensure compatibility with systems using earlier versions of numpy.
By default, time imported from the input stream is converted to the local time zone. For instance, feeding the input stream

```
20140101T000000
20150101T000000
```
to

```python
import comma

t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )
for r in s.iter( size=1 ):
    print r['t'][0]
```
yields

```
2014-01-01T00:00:00.000000+1100
2015-01-01T00:00:00.000000+1100
```
where the time offset of the local time zone is +11 hours. A convenience function to change the time zone used by python is provided in the `comma.csv.time` module and needs to be invoked before reading from a stream. For instance, feeding the same input stream to
```python
import comma

comma.csv.time.zone( 'UTC' ) # set time zone to UTC
t = comma.csv.struct( 't', 'datetime64[us]' )
s = comma.csv.stream( t )
for r in s.iter( size=1 ):
    print r['t'][0]
```
yields

```
2014-01-01T00:00:00.000000+0000
2015-01-01T00:00:00.000000+0000
```
Note that the `write()` function of the stream class ignores the time zone; therefore,

```python
for r in s.iter( size=1 ):
    s.write( r )
```

will yield the same output regardless of the time zone.
Python utilities using `comma.csv.stream` are comparable to, or a few times slower than, the equivalent c++ utilities on binary streams, provided a large enough read size is used (the default size is usually sufficient). On ascii streams, the performance of `comma.csv.stream` is quite poor, so it is feasible only for small files. A further degradation in performance (by a factor of a few) may be expected if `size=1` is used.