CSV Input - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
The CSV Input module reads a corpus of documents from a CSV file.
The module type to use is `inputCSV`.
Name | Description | Optional | Default |
---|---|---|---|
`source` | Path to the input CSV file | No | |
`output` | Path to the output corpus JSON file | No | |
`fields` | Document attributes to read from the input file | No | |
`name` | Corpus name | Yes | Name of the module in the YAML configs |
- `source` is relative to the source directory and should point to a `.csv` file
- `output` is relative to the data directory and should point to a `.json` file
- `fields` lets you provide a list of key-value pairs. Keys are the names of document attributes as saved in the corpus file. Values are the names of the corresponding attributes (columns) in the input CSV file.
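To illustrate how a `fields` mapping of this kind relates CSV columns to document attributes, here is a minimal Python sketch. It is not the module's actual implementation: the `map_row` helper, the `FIELDS` dictionary, and the one-row sample are assumptions made for illustration, and dropping unmapped columns is a choice of this sketch only.

```python
import csv
import io

# Hypothetical mapping mirroring the `fields` option:
# key = attribute name in the corpus, value = column name in the CSV.
FIELDS = {"title": "title", "date": "year", "text": "description"}

# One-row sample CSV (illustrative data only).
SAMPLE = '''title, year, description, source
"title of first doc", 2014, "description of first doc", "portfolio 1"
'''


def map_row(row, fields):
    """Build a document's attributes by renaming CSV columns per `fields`."""
    return {attr: row[column] for attr, column in fields.items()}


# skipinitialspace=True ignores the space that follows each comma.
reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
documents = [map_row(row, FIELDS) for row in reader]

print(documents[0])
# {'title': 'title of first doc', 'date': '2014', 'text': 'description of first doc'}
```

In this sketch the `source` column is left out simply because it does not appear among the mapping's values.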
The input is a CSV file, `dataset.csv`, with documents organised in rows:
title, year, description, source
"title of first doc", 2014, "description of first doc", "portfolio 1"
...
The CSV input module (named `csvCorpusReader`) can be configured as follows:
csvCorpusReader:
  type: inputCSV
  name: my_corpus
  source: dataset.csv
  output: corpus.json
  fields:
    title: title
    date: year
    text: description
The module will produce a JSON file, `corpus.json`, containing a structured corpus to use in subsequent modules:
{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "title": "title of first doc",
        "date": "2014",
        "text": "description of first doc"
      }
    },
    ...
  ]
}
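As a rough illustration of how a corpus file with this shape could be consumed outside the pipeline, the Python sketch below parses a one-document corpus and iterates over it. The `iter_documents` helper is hypothetical, and reading `d` as the container for the attributes mapped through `fields` is an interpretation based only on the example above.

```python
import json

# One-document corpus with the same shape as the example corpus.json
# (inlined here so the sketch is self-contained).
CORPUS_JSON = '''
{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "title": "title of first doc",
        "date": "2014",
        "text": "description of first doc"
      }
    }
  ]
}
'''


def iter_documents(corpus):
    """Yield (id, attributes) pairs; "d" is assumed to hold the mapped attributes."""
    for doc in corpus["documents"]:
        yield doc["id"], doc["d"]


corpus = json.loads(CORPUS_JSON)
pairs = list(iter_documents(corpus))

for doc_id, attrs in pairs:
    print(doc_id, attrs["title"])
```

With a real file, `json.loads(CORPUS_JSON)` would be replaced by `json.load` on the opened `corpus.json`.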