Inputs - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

Input Modules

Input modules convert input data from a variety of source formats (PDF, CSV, etc.) into a standardised Corpus file in JSON format. This file, and the schema it follows, is then read by the other modules.

The Input modules are all contained within the input package.

List of Input Modules

There are currently 6 input modules available:

  • CSV Input which reads document data structured in a CSV file
  • PDF Input which parses a collection of PDF files in a directory
  • TXT Input which reads documents from a text (.txt) file or from a directory of text files
  • BIB Input which reads items from a BibTeX (.bib) file
  • HTML Input which crawls text from HTML pages using a list of URLs provided in a CSV file
  • GTR Input which is specific to the Gateway to Research (GtR) API: it reads document data from a CSV file, which must contain a GtR project ID, and also crawls GtR's website for additional data
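
To make the idea of a standardised Corpus file more concrete, here is a minimal sketch of what a CSV-style input step might do: read rows, skip those without text, and emit one JSON record per document. It is an illustration only and not the pipeline's implementation. The function name `csv_to_corpus`, the keys `docId`, `text`, `fields` and `corpus`, and the `text_column` parameter are all assumptions made for this example; the actual Corpus schema may differ.

```python
import csv
import io
import json


def csv_to_corpus(csv_text, text_column="text"):
    """Turn CSV content into a list of document records.

    The record layout (docId / text / fields) is hypothetical and
    only illustrates the idea of a standardised corpus.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    corpus = []
    for index, row in enumerate(reader):
        text = (row.get(text_column) or "").strip()
        if not text:
            continue  # skip rows that carry no document text
        corpus.append({
            "docId": str(index),  # row position used as a stand-in ID
            "text": text,
            # keep every other column as document metadata
            "fields": {k: v for k, v in row.items() if k != text_column},
        })
    return corpus


# Small in-memory example; a real module would read from a file path.
sample = "title,text\nFirst,Hello world\nSecond,\nThird,Topic models\n"
corpus = csv_to_corpus(sample)
corpus_json = json.dumps({"corpus": corpus}, indent=2)
```

The other input modules would differ mainly in how they obtain the text (parsing PDFs, reading .bib entries, crawling URLs), while producing the same kind of output so that later modules only need to understand one format.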