PDF Input - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki

The PDF Input module reads a corpus of documents from a directory of PDF files.

Module Parameters

The module type to use is inputPDF.

Name        Description                               Optional  Default
source      Path to the input directory of PDF files  No
output      Path to the output corpus JSON file       No
name        Corpus name                               Yes       Name of the module in the YAML configs
splitPages  Split each file into subsets of X pages   Yes       0 (no splitting)
  • source is relative to the source directory and should point to a directory containing .pdf documents
    • If .pdf documents are contained within subdirectories, those are read too; the subdirectory name is used to set the folder attribute of the documents in the output corpus JSON file
  • output is relative to the data directory and should point to a .json file
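
The directory traversal described above can be sketched in Python. This is an illustration only, not the pipeline's own code: the function name list_pdfs is hypothetical, and it assumes that folder is the name of the immediate parent directory and is left empty for files placed directly in the source directory.

```python
from pathlib import Path


def list_pdfs(source_dir):
    """Return (folder, file) pairs for every .pdf found under source_dir."""
    root = Path(source_dir)
    entries = []
    for path in sorted(root.rglob("*.pdf")):
        # Assumption: "folder" is the immediate parent directory name,
        # and is empty for files directly inside the source directory.
        folder = "" if path.parent == root else path.parent.name
        # "file" is the file name without its .pdf extension,
        # matching the "file" attribute in the example output.
        entries.append((folder, path.stem))
    return entries
```

For the dataset directory shown in the example below, such a function would return pairs like ("2021", "document1") and ("2022", "document57").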

Example

Input

The input is a directory, dataset, with documents organised as follows:

dataset
 |-- 2021
 |    |-- document1.pdf
 |    |-- document2.pdf
 |    |-- ...
 |-- 2022
      |-- document57.pdf
      |-- document58.pdf
      |-- ...

Module Configurations

The PDF input module (named pdfCorpusReader) can be configured as follows:

pdfCorpusReader:
  type: inputPDF
  name: my_corpus
  source: dataset
  output: corpus.json
  splitPages: 5
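
How the defaults from the parameter table apply to a configuration like this one can be sketched in Python. The function resolve_params is hypothetical and only mirrors the table (source and output required, name defaulting to the module name, splitPages defaulting to 0); the pipeline's actual validation may differ.

```python
def resolve_params(module_name, config):
    """Fill in optional inputPDF parameters with their documented defaults."""
    for required in ("source", "output"):
        if required not in config:
            raise ValueError(
                f"{module_name}: missing required parameter '{required}'"
            )
    return {
        "source": config["source"],
        "output": config["output"],
        # Default: the name of the module in the YAML configs.
        "name": config.get("name", module_name),
        # Default: 0, meaning no splitting.
        "splitPages": config.get("splitPages", 0),
    }
```

With only source and output set, the corpus name would fall back to pdfCorpusReader and no page splitting would take place.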

Output

The module will produce a JSON file, corpus.json, containing a structured corpus to use in subsequent modules:

{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "text": "text found in pdf document",
        "folder": "2021",
        "file": "document1",
        "subDoc": "0",
        "pages": "1-5"
      }
    },
    ...
  ]
}

If splitPages is not defined or is set to 0, subDoc and pages are not generated in the document data.
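
The relationship between splitPages, subDoc and pages can be sketched in Python. This is a minimal illustration under stated assumptions, not the module's implementation: the function page_ranges is hypothetical, subDoc is assumed to be a zero-based index (consistent with "0" and "1-5" in the example above), and the last subset is assumed to be shorter when the page count is not a multiple of splitPages.

```python
def page_ranges(num_pages, split_pages):
    """Return (subDoc, pages) pairs for a document with num_pages pages."""
    if split_pages <= 0:
        # No splitting: subDoc and pages are not generated.
        return []
    ranges = []
    for sub_doc, start in enumerate(range(1, num_pages + 1, split_pages)):
        # Assumption: the final subset is truncated at the last page.
        end = min(start + split_pages - 1, num_pages)
        ranges.append((str(sub_doc), f"{start}-{end}"))
    return ranges


print(page_ranges(12, 5))
# [('0', '1-5'), ('1', '6-10'), ('2', '11-12')]
```

Under these assumptions, a 12-page file with splitPages: 5 would yield three entries in the corpus, each carrying its own subDoc and pages values.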
