readCsv - cfsimplicity/spreadsheet-cfml GitHub Wiki
Reads a CSV file and processes and/or returns the data.
Performance-optimized for large files, with a range of options available via the Apache Commons CSV library.
By default the method returns an object to which further optional calls can be chained, before finally calling execute()
to process the file and (if required) return the data.
readCsv( filepath )[...optional configuration calls].execute();
Chainable? No.
Reading a CSV file
filepath = "c:/temp/my.csv";
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.execute();
Return value
An ordered struct with 2 keys:
columns
: array of header/column names (empty if none specified)
data
: array of row value arrays (native Java arrays by default; see below)
The format of the struct makes it simple to convert the result to a query should you wish:
filepath = "c:/temp/my.csv";
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.withFirstRowIsHeader()
.execute();
resultAsQuery = DeserializeJson( SerializeJson( result ), false );
Row values are JAVA arrays by default
Important: For performance reasons, each row is returned or processed as a native Java array rather than a CFML array. This means that you may not be able to use "member functions" such as row.Find( "string" )
or other CFML operations on them, depending on your engine. Standard CFML array functions should generally be fine, e.g. ArrayFind( row, "string" ).
If you prefer to work with CFML arrays you can override the default as follows:
filepath = "c:/temp/my.csv";
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.processRowsAsJavaArrays( false )
.execute();
Note that this may affect performance.
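Either way, the returned data can be looped over with standard CFML constructs. A minimal sketch (the file path and the searched-for value are purely illustrative):

filepath = "c:/temp/my.csv";
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.execute();
// standard array functions work on the Java-backed rows
for( row in result.data ){
	if( ArrayFindNoCase( row, "example" ) ){
		// handle the matching row
	}
}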
Specifying predefined formats
Predefined formats are preset Commons CSV configuration combinations optimized for different contexts, such as tab-delimited data.
filepath = "c:/temp/myTabDelimited.csv";
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.withPredefinedFormat( "TDF" )
.execute();
If not specified, the DEFAULT predefined format will be used.
Other configuration options
withAllowMissingColumnNames( boolean )
Whether to allow missing column names in the header line. Default: true. Commons CSV documentation
withAutoFlush( boolean )
Whether to flush on close. Default: true. Commons CSV documentation
withCommentMarker( required string marker )
Sets the comment start marker to the specified character. Commons CSV documentation
withDelimiter( required string delimiter )
Sets the delimiter character. Commons CSV documentation
To set tab as the delimiter you can use any of the following values as parameters: "#Chr( 9 )#", "\t", "tab", "TAB"
withDuplicateHeaderMode( required string value )
Sets the duplicate header names behavior. Possible values: "ALLOW_ALL", "ALLOW_EMPTY", "DISALLOW". Commons CSV documentation
withEscapeCharacter( required string character )
Sets the escape character. Commons CSV documentation
withFirstRowIsHeader( boolean state )
Whether to use the first row as the header and exclude it from being processed as part of the data. Default: true.
withHeader( required array header )
Manually sets the row values which will be detected as the header. To auto-detect the header from the first row, use withFirstRowIsHeader()
(see above). Commons CSV documentation
withIgnoreEmptyLines( boolean )
Sets the empty line skipping behavior: true to ignore empty lines between records, false to translate empty lines to empty records. Default: true. Commons CSV documentation
withIgnoreHeaderCase( boolean )
Sets the parser case mapping behavior: true for case-insensitive access to name/value mappings, false to leave the mapping as is. Default: true. Commons CSV documentation
withIgnoreSurroundingSpaces( boolean )
Sets the parser trimming behavior: true to remove the surrounding spaces, false to leave the spaces as is. Default: true. Commons CSV documentation
withNullString( required string value )
Converts strings equal to the given nullString to null when reading records. Commons CSV documentation
withQuoteCharacter( required string character )
Sets the quote character. Commons CSV documentation
withSkipFirstRows( required numeric numberOfRowsToSkip )
Ignores the specified number of rows at the start of the file. Should be a positive integer.
withSkipHeaderRecord( boolean )
Sets whether to skip the header record. Default: true. Commons CSV documentation
withTrailingDelimiter( boolean )
Sets whether to add a trailing delimiter. Default: true. Commons CSV documentation
withTrim( boolean )
Sets whether to trim leading and trailing blanks. Default: true. Commons CSV documentation
Filtering rows
To exclude certain rows in your CSV file from processing (or to include only certain rows), use withRowFilter()
to supply a User Defined Function (UDF) which accepts the array of row values and returns true if the row should be included. For example, to skip any row that contains the word "tobacco" in any of its columns:
filter = function( row ){
return !ArrayFindNoCase( row, "tobacco" );
};
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.withRowFilter( filter )
.execute();
Important: As described above, by default rows are native Java arrays rather than CFML arrays, so your UDF should avoid using "member functions". For example, instead of:
return !row.FindNoCase( "tobacco" );
use:
return !ArrayFindNoCase( row, "tobacco" );
Accessing column headers
If your CSV has headers and you have specified this in your call, the array of headers/columns will be passed as a second argument to your UDF. For example:
filter = function( row, columns ){
// return a boolean
};
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.withFirstRowIsHeader()
.withRowFilter( filter )
.execute();
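The columns argument lets a filter locate a value by column name rather than position. A minimal sketch, assuming a hypothetical "status" column (remember rows are Java arrays by default, so standard functions and index access are used rather than member functions):

filter = function( row, columns ){
	var statusIndex = ArrayFindNoCase( columns, "status" );
	// keep the row if there is no "status" column, or its value isn't "archived"
	return ( statusIndex == 0 ) || ( row[ statusIndex ] != "archived" );
};
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.withFirstRowIsHeader()
.withRowFilter( filter )
.execute();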
Passing a UDF/Closure to process rows
Rather than return the CSV data into an array for subsequent processing, you may wish to process each row directly as it is read from the file. This is especially suited to very large files as it avoids the need to load the data into memory.
You can do this by using withRowProcessor()
to pass in a UDF which will be executed on each row, and which accepts the current row values, the current row number, and the array of headers/columns if they are defined.
processor = function( row, rowNumber, columns ){
//insert into DB or whatever
};
spreadsheet.readCsv( filepath )
.withRowProcessor( processor )
.withFirstRowIsHeader()
.execute();
You can perform row processing and return the processed data if you wish.
processor = function( row, rowNumber, columns ){
//insert into DB or whatever
};
result = spreadsheet.readCsv( filepath )
.intoAnArray()
.withFirstRowIsHeader()
.withRowProcessor( processor )
.execute();