CSV Input - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

The CSV Input module reads a corpus of documents from a CSV file.

Module Parameters

The module type to use is inputCSV.

| Name | Description | Optional | Default |
| --- | --- | --- | --- |
| source | Path to the input CSV file | No | |
| output | Path to the output corpus JSON file | No | |
| fields | Document attributes to read from the input file | No | |
| name | Corpus name | Yes | Name of the module in the YAML configs |

  • source is relative to the source directory and should point to a .csv file
  • output is relative to the data directory and should point to a .json file
  • fields is a list of key-value pairs. Each key is the name of a document attribute as it is saved in the corpus file. Each value is the name of the corresponding attribute (column header) in the input CSV file.
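
The fields mapping described above can be pictured with a short Python sketch. This is only an illustration of the renaming idea, not the pipeline's own code: the function name read_documents is hypothetical, and dropping columns that are not listed in fields is an assumption based on the example later on this page, where the unmapped source column does not appear in the output.

```python
import csv
import io


def read_documents(csv_text, fields):
    """Return one dict per CSV row, keeping only the mapped attributes.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    # skipinitialspace tolerates the ", " separators used in the example file
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    documents = []
    for row in reader:
        # key = attribute name in the corpus, column = header name in the CSV
        documents.append({key: row[column] for key, column in fields.items()})
    return documents


csv_text = (
    "title, year, description, source\n"
    '"title of first doc", 2014, "description of first doc", "portfolio 1"\n'
)
fields = {"title": "title", "date": "year", "text": "description"}

docs = read_documents(csv_text, fields)
print(docs)
```

Here the CSV column year is stored under the corpus attribute date, and description under text. Values are kept as strings, since the csv module does not convert types.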

Example

Input

The input is a CSV file, dataset.csv, with documents organised in rows:

title, year, description, source
"title of first doc", 2014, "description of first doc", "portfolio 1" 
...

Module Configurations

The CSV input module (named csvCorpusReader) can be configured as follows:

csvCorpusReader:
  type: inputCSV
  name: my_corpus
  source: dataset.csv
  output: corpus.json
  fields:
    title: title
    date: year
    text: description

Output

The module will produce a JSON file, corpus.json, containing a structured corpus to use in subsequent modules:

{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "title": "title of first doc",
        "date": "2014",
        "text": "description of first doc"
      }
    },
    ...
  ]
}
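
As a rough sketch of how a later step might consume this file, the following Python snippet parses a corpus with the structure shown above and indexes the document attributes by id. The helper name documents_by_id is hypothetical, and the corpus is embedded as a string here so the snippet runs on its own. In practice it would be loaded from corpus.json in the data directory.

```python
import json

# A one-document corpus with the same structure as the example output
corpus_text = """
{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "title": "title of first doc",
        "date": "2014",
        "text": "description of first doc"
      }
    }
  ]
}
"""


def documents_by_id(corpus):
    """Map each document id to its attribute dict (the "d" object)."""
    return {doc["id"]: doc["d"] for doc in corpus["documents"]}


corpus = json.loads(corpus_text)
by_id = documents_by_id(corpus)
print(corpus["name"], by_id["1"]["title"])
```

The attribute names under "d" are the keys chosen in the fields configuration, so downstream code refers to date and text rather than the original CSV headers.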