Inputs - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
Input modules convert input data in various formats (PDF, CSV, etc.) into a standardised Corpus file in JSON format. This file (and its schema) is then read by the other modules.
The Input modules are all contained within the input package.
There are currently 6 input modules available:
- CSV Input which reads document data structured in a CSV file
- PDF Input which parses a collection of PDF files in a directory
- TXT Input which reads documents from a text (.txt) file or from a directory of text files
- BIB Input which reads items from a BibTeX (.bib) file
- HTML Input which crawls text from HTML pages using a list of URLs provided in a CSV file
- GTR Input which is specific to the Gateway to Research (GtR) API: it reads document data from a CSV file, which must contain a GtR project ID, and also crawls GtR's website for additional data
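
As a rough illustration of what "formatting input data into a standardised JSON corpus" can look like, the sketch below turns CSV rows into a list of document objects. It is written in Python purely for illustration and is not the pipeline's own code. The function name `csv_to_corpus` and the field names `corpus`, `docId`, `text` and `metadata` are hypothetical and do not reflect the actual Corpus schema.

```python
import csv
import io
import json


def csv_to_corpus(csv_text, text_field):
    """Convert CSV rows into a JSON-serialisable corpus (hypothetical schema)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for index, row in enumerate(reader):
        documents.append({
            "docId": index,            # hypothetical field: position of the row in the file
            "text": row[text_field],   # the column holding the document text
            # every other column is kept as document metadata
            "metadata": {k: v for k, v in row.items() if k != text_field},
        })
    return {"corpus": documents}


# Small in-memory CSV used only for demonstration.
sample = (
    "title,abstract\n"
    "Paper A,Topic models in practice\n"
    "Paper B,Crawling web pages\n"
)
corpus = csv_to_corpus(sample, text_field="abstract")
print(json.dumps(corpus, indent=2))
```

The point of a single intermediate format is that later modules only need to understand one structure, whichever input module produced it.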