XML configuration - lgeorgieff/csv-processor GitHub Wiki
CsvProcessor works with XML configuration files, so before coding you have to define a CSV workflow. The XML schema is provided as an XSD file named "csv.config.xsd" in the CsvProcessor project. If you use the binary version, the XSD file is embedded in the DLL.
The configuration file, like CsvProcessor itself, is organized as a single CSV job containing a chain of CSV workflows, each of which contains a list of CSV tasks. Before defining any workflows or tasks you have to define column-definitions, i.e. the definition of a single CSV row. The individual elements are described below. Additionally, you can consult the documentation contained in the XSD file.
The element "csv-job" is the root element of your configuration file. It is located in the namespace "http://ztt.fh-worms.de/georgieff/csv/" and contains the following child elements:
You have to define a "column-definitions" element for each workflow that uses a different CSV format.
Such a definition is used when a CSV file is read within a workflow by a "read-task" or when the result of a previous workflow is passed as input to a following workflow. The file that is read, or the input that is passed, must therefore correspond to the defined column-definitions.
Each "column-definitions" element must have an attribute named "name", which is used to reference that particular "column-definitions" element from a specific workflow.
A "column-definitions" element contains a list of "column" elements that define the individual columns of a line in a CSV file. The "column" elements must be in the same order as the columns in the file. A "column" element contains the attributes
- "name" which defines the name that is used when referencing the column values/cells in the code
- "from" which must correspond to the name of a cell of the CSV header line. If this attribute is not set, the value set as "name" is used instead.
For example, a file with the header line "col1, col2, col3" must be declared as
<column-definitions name="my-col-defs">
<column name="column 1" from="col1"/>
<column name="column 2" from="col2"/>
<column name="col3"/>
</column-definitions>
A "column-mappings" element is used for writing a CSV line from a workflow/task into a file. It defines which columns are written into the file and which are not.
Each "column-mappings" element must have an attribute named "name", which is used to reference that particular "column-mappings" element from a specific write task.
A "column" element is used within a "column-mappings" element to define the columns that are written back into a file. A "column" element contains the attributes:
- "ref" which references the column with the same name in the "column-definitions" element
- "as" which defines the final column name
To write the data read in the section "column-definitions" in reversed order and with different column titles, you have to define the following "column-mappings" element.
Resulting header line: column a, column b, column c
<column-mappings name="my-col-mappings">
<column ref="col3" as="column a"/>
<column ref="column 2" as="column b"/>
<column ref="column 1" as="column c"/>
</column-mappings>
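To make the effect of such a mapping concrete, here is a minimal F# sketch. It is not CsvProcessor code: the function `applyMappings` and the representation of lines and mappings as lists of string pairs are assumptions made only for illustration.

```fsharp
// Hypothetical model of "my-col-mappings": (ref, as) pairs in output order.
let mappings = [ "col3", "column a"; "column 2", "column b"; "column 1", "column c" ]

// One line as (column name, value) pairs, named as in "my-col-defs".
let line = [ "column 1", "1"; "column 2", "2"; "col3", "3" ]

// Look up every referenced column and emit it under its new title.
// Columns that no mapping references are simply not written.
let applyMappings (mappings: (string * string) list) (line: (string * string) list) =
    let values = Map.ofList line
    mappings |> List.map (fun (reference, title) -> title, Map.find reference values)

// applyMappings mappings line
// evaluates to [("column a", "3"); ("column b", "2"); ("column c", "1")]
```

The output order follows the order of the "column" elements in the mapping, not the order of the columns in the input line.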
A workflow wraps several tasks for processing a CSV file. Three kinds of tasks are available for this purpose: read-task, write-task and generic-task. A job can contain several workflows that are organized as a chain, e.g. workflow A may read a CSV file, workflows B and C can each use the results of workflow A and process them, and finally workflow D can merge the results of B and C and write them back into a file. Concurrent workflows are executed in parallel as different threads. In contrast, tasks within a workflow must form a single chain and cannot define multiple paths.
A workflow must have the attributes "name" and "column-definitions"; the latter references a specific "column-definitions" element. The optional attribute "previous-workflows" contains a list of the names of those workflows whose results are used as input for this workflow.
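The chain A, B, C, D described above can be pictured with a small F# sketch that groups workflows into waves, where every workflow in a wave only depends on workflows from earlier waves. This is only an illustration of the dependency idea: the `workflows` list, the `waves` function and the pair representation are hypothetical, and the sketch makes no claim about how CsvProcessor actually schedules its threads.

```fsharp
// Hypothetical model: workflow name paired with its "previous-workflows" names.
let workflows =
    [ "A", []
      "B", [ "A" ]
      "C", [ "A" ]
      "D", [ "B"; "C" ] ]

// Group workflows into waves: a workflow joins a wave once all of its
// predecessors have finished in earlier waves.
let rec waves (finished: Set<string>) (pending: (string * string list) list) : string list list =
    if List.isEmpty pending then
        []
    else
        let isReady (_, previous) =
            previous |> List.forall (fun p -> Set.contains p finished)
        let ready, blocked = List.partition isReady pending
        if List.isEmpty ready then
            failwith "cyclic or unknown previous-workflows reference"
        let names = List.map fst ready
        names :: waves (Set.union finished (Set.ofList names)) blocked

// waves Set.empty workflows evaluates to [["A"]; ["B"; "C"]; ["D"]]
```

Workflows that end up in the same wave, here B and C, do not depend on each other and are therefore candidates for concurrent execution.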
A "read-task" must be used as the first task of a workflow to generate the data for the next task. A "read-task" is defined by several child elements:
<read-task name="csv-reader">
<file path="..\..\Identify Hyponymous Collections.csv"/>
<split char=","/>
<quote char='"'/>
<meta-quote char='\'/>
<trim-whitespace-start value="true"/>
<trim-whitespace-end value="true"/>
<read-multi-line value="true"/>
</read-task>
The read-task above reads the file "..\..\Identify Hyponymous Collections.csv" relative to your execution directory. The split character that separates columns is set to ',', the quotation character for quoting ',' is set to '"', and the meta-quote character, which allows quoting the character '"', is set to '\'. If you want to use the meta-quote character as an ordinary character you have to quote it with itself, i.e. "\\" means "\". All column values are trimmed of whitespace on the left and the right side. If "read-multi-line" is set to true, a logical CSV line may be spread over multiple physical lines. Note: the file that is read must correspond to the "column-definitions" referenced by the "read-task"'s parent workflow.
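The interplay of split, quote and meta-quote characters can be sketched in a few lines of F#. This is a simplified stand-in written for this page, not the parser used by CsvProcessor: the function name `splitLine` is hypothetical, multi-line values are not handled, and trimming is always applied on both sides.

```fsharp
// Hypothetical helper, not part of CsvProcessor: splits one physical line
// into cells using a split, quote and meta-quote character.
let splitLine (split: char) (quote: char) (meta: char) (line: string) : string list =
    let cell = System.Text.StringBuilder()
    let cells = ResizeArray<string>()
    let mutable inQuotes = false
    let mutable escaped = false
    for c in line do
        if escaped then
            // The character after a meta-quote is always taken literally.
            cell.Append(c) |> ignore
            escaped <- false
        elif c = meta then
            escaped <- true
        elif c = quote then
            // Quote characters toggle quoted mode and are not part of the value.
            inQuotes <- not inQuotes
        elif c = split && not inQuotes then
            // An unquoted split character ends the current cell.
            cells.Add(cell.ToString().Trim())
            cell.Clear() |> ignore
        else
            cell.Append(c) |> ignore
    cells.Add(cell.ToString().Trim())
    List.ofSeq cells

// Input text:  a, "b, c", say \"hi\", back\\slash
let cells = splitLine ',' '"' '\\' "a, \"b, c\", say \\\"hi\\\", back\\\\slash"
// cells = ["a"; "b, c"; "say \"hi\""; "back\\slash"], i.e. four cells
```

The quoted comma stays inside the second cell, the meta-quoted '"' characters survive as literal quotes, and the doubled meta-quote yields a single backslash.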
A "write-task" must be used as the last task within a workflow. Note that if a workflow contains a "write-task", the workflow's result is None. The following example demonstrates the usage of a "write-task":
<write-task name="printer" column-mappings="my-col-mappings">
<file path="tmp.txt" mode="Append"/>
<split char=" "/> <!-- tab -->
<quote char='"'/>
<meta-quote char='\'/>
</write-task>
The write-task above writes the file ".\tmp.txt" relative to your execution directory and appends the generated data to it. You can also change the mode value, e.g. to create a new file. The "split" and "meta-quote" definitions are equivalent to the corresponding elements of the "read-task". The resulting file has the form defined by the "column-mappings" element that is referenced via the attribute "column-mappings".
A "generic-task" must be used after a "read-task" to consume data. If a "generic-task" is the last task within a "workflow", its results are used as the workflow's results. A "generic-task" contains either a "line-operation" or a "document-operation" element whose "identifier" attribute names the actual operation, which must be implemented and registered in code. The following two examples demonstrate the usage of "generic-task"s:
Line operations process single lines until the entire document is processed.
The following XML configuration demonstrates the definition of a "generic-task" using a line operation.
<generic-task name="make-upper-case">
<line-operation identifier="upper-case-transform"/>
</generic-task>
let private makeUpperCase (line: Line) : option<Line> =
    // Upper-case every cell value and keep the line (Some means "not filtered").
    line
    |> List.map (fun (cell: ICell) ->
        { Cell.Name = cell.Name; Cell.Value = cell.Value.ToUpper() } :> ICell)
    |> Some

GenericTask.RegisterOperation("upper-case-transform", makeUpperCase)
The signature is Line -> option<Line>, where each Line is a list of ICell instances containing the properties:
- "Name" containing the column name
- "Value" containing the actual value
If you want to filter out a line, return the value "None".
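A filtering line operation could look like the following sketch. To keep it runnable on its own, `ICell` and `Line` are declared here as stand-ins whose shape is assumed from the description above; in a real project you would use the types shipped with CsvProcessor instead. The operation name `dropIncomplete` and the identifier "drop-incomplete" are hypothetical.

```fsharp
// Stand-in types so the sketch runs on its own (assumed shape, not the
// library's actual declarations).
type ICell =
    abstract Name: string
    abstract Value: string

type Line = ICell list

// Hypothetical line operation: drop every line whose "col3" cell is empty
// or contains only whitespace; all other lines pass through unchanged.
let dropIncomplete (line: Line) : option<Line> =
    let isEmpty (cell: ICell) =
        cell.Name = "col3" && System.String.IsNullOrWhiteSpace(cell.Value)
    if List.exists isEmpty line then None else Some line

// With the library referenced, registration would presumably follow the
// pattern of the upper-case example:
// GenericTask.RegisterOperation("drop-incomplete", dropIncomplete)
```

Returning the unchanged input wrapped in Some keeps the line, so a pure filter never needs to construct new cells.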
Document operations process entire documents, i.e. all lines at once.
The following XML configuration demonstrates the definition of a "generic-task" using a document operation.
<generic-task name="mask-space">
<document-operation identifier="space-transform"/>
</generic-task>
let private maskSpace (document: Lines) : Lines =
    // Replace every space in every cell of every line with an underscore.
    document
    |> List.map (fun (line: Line) ->
        line
        |> List.map (fun (cell: ICell) ->
            { Cell.Name = cell.Name; Cell.Value = cell.Value.Replace(' ', '_') } :> ICell))

GenericTask.RegisterOperation("space-transform", maskSpace)
The signature is Lines -> Lines.
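A document operation is the natural choice whenever the decision about one line depends on other lines. The following sketch removes duplicate lines, which a line operation cannot do because it only ever sees one line at a time. As before, `ICell`, `Line` and `Lines` are stand-in declarations with an assumed shape, and `dropDuplicates` and the identifier "drop-duplicates" are hypothetical names.

```fsharp
// Stand-in types so the sketch runs on its own (assumed shape, not the
// library's actual declarations).
type ICell =
    abstract Name: string
    abstract Value: string

type Line = ICell list
type Lines = Line list

// Hypothetical document operation: keep only the first line for each
// distinct combination of cell values, preserving the original order.
let dropDuplicates (document: Lines) : Lines =
    document
    |> List.distinctBy (fun (line: Line) -> line |> List.map (fun cell -> cell.Value))

// With the library referenced, registration would presumably follow the
// pattern of the space-transform example:
// GenericTask.RegisterOperation("drop-duplicates", dropDuplicates)
```

Lines are compared by their cell values only, so two lines with equal values count as duplicates even if they are distinct objects.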