The CSV Input module reads a corpus of documents from a CSV file.

Module Parameters

The module type to use is inputCSV.

Name	Description	Optional	Default
`source`	Path to the input CSV file	No
`output`	Path to the output corpus JSON file	No
`fields`	Document attributes to read from the input file	No
`name`	Corpus name	Yes	Name of the module in the YAML configs

`source` is relative to the source directory and should point to a .csv file.
`output` is relative to the data directory and should point to a .json file.
`fields` takes a list of key-value pairs. Each key is the name of a document attribute as saved in the corpus file; each value is the name of the corresponding column in the input CSV file.
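The key-value direction of `fields` is easy to get backwards, so here is a minimal Python sketch of the mapping. It is not the pipeline's actual implementation: the helper `map_row` and the inline sample data are hypothetical and only illustrate that keys name the corpus attributes while values name the CSV columns.

```python
import csv
import io


def map_row(row, fields):
    """Build one document's attributes from a CSV row.

    `fields` maps corpus attribute names (keys) to CSV column names (values).
    Hypothetical helper for illustration only.
    """
    return {attr: row[column] for attr, column in fields.items()}


# Hypothetical sample input; columns not listed in `fields` are ignored.
sample = io.StringIO(
    "title,year,description,source\n"
    '"title of first doc",2014,"description of first doc","portfolio 1"\n'
)
fields = {"title": "title", "date": "year", "text": "description"}

reader = csv.DictReader(sample)
docs = [map_row(row, fields) for row in reader]
print(docs[0])
# {'title': 'title of first doc', 'date': '2014', 'text': 'description of first doc'}
```

In this sketch the `source` column is dropped because no key in `fields` points to it, and the `year` column is stored under the attribute name `date`.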

Example

Input

The input is a CSV file, dataset.csv, with documents organised in rows:

title, year, description, source
"title of first doc", 2014, "description of first doc", "portfolio 1"
...

Module Configurations

The CSV input module (named csvCorpusReader) can be configured as follows:

csvCorpusReader:
  type: inputCSV
  name: my_corpus
  source: dataset.csv
  output: corpus.json
  fields:
    title: title
    date: year
    text: description

Output

The module will produce a JSON file, corpus.json, containing a structured corpus to use in subsequent modules:

{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "title": "title of first doc",
        "date": "2014",
        "text": "description of first doc"
      }
    },
    ...
  ]
}
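As a rough illustration of how the corpus file could be consumed outside the pipeline, the Python sketch below parses a corpus with the shape shown above and collects the per-document attributes. The function `load_documents` is hypothetical, and the keys `id`, `i`, and `d` are taken from the example only; their exact semantics are an assumption here.

```python
import json

# Inline stand-in for corpus.json, following the structure of the example output.
corpus_text = """
{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "title": "title of first doc",
        "date": "2014",
        "text": "description of first doc"
      }
    }
  ]
}
"""


def load_documents(text):
    """Return the corpus name and the list of per-document attribute dicts.

    Hypothetical helper: assumes each document keeps its attributes under "d".
    """
    corpus = json.loads(text)
    return corpus["name"], [doc["d"] for doc in corpus["documents"]]


name, documents = load_documents(corpus_text)
print(name, len(documents))
print(documents[0]["title"])
```

To read a real file instead of the inline string, the text could be obtained with `open("corpus.json", encoding="utf-8").read()`, assuming the file sits in the current working directory.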

CSV Input - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
