The PDF Input module reads a corpus of documents from a directory of PDF files.

Module Parameters

The module type to use is inputPDF.

Name	Description	Optional	Default
`source`	Path to the input directory of PDF files	No
`output`	Path to the output corpus JSON file	No
`name`	Corpus name	Yes	Name of the module in the YAML configs
`splitPages`	Split each file into sub-documents of X pages	Yes	0 (no splitting)

source is relative to the source directory and should point to a directory containing .pdf documents.
- If .pdf documents are contained within subdirectories, they are read too, and the subdirectory name is used to set the folder attribute of those documents in the output corpus JSON file.
output is relative to the data directory and should point to a .json file.
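
The folder rule described above can be illustrated with a minimal Python sketch. This is not the module's implementation: the function name `list_pdfs` is hypothetical, and two behaviours are assumptions made for illustration only, namely that PDFs sitting directly in the source directory get an empty folder value, and that nested subdirectories contribute only the name of the immediate parent directory.

```python
from pathlib import Path


def list_pdfs(source_dir):
    """Return (folder, file) pairs for every .pdf file under source_dir.

    Assumption: `folder` is the name of the immediate parent directory,
    or an empty string for PDFs sitting directly in source_dir.
    `file` is the file name without its .pdf extension.
    """
    root = Path(source_dir)
    entries = []
    for path in sorted(root.rglob("*.pdf")):
        if not path.is_file():
            continue  # skip directories whose name happens to end in .pdf
        folder = "" if path.parent == root else path.parent.name
        entries.append((folder, path.stem))
    return entries
```

For a source directory with a 2021 subdirectory holding document1.pdf, the returned list would include the pair `("2021", "document1")`.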

Example

Input

The input is a directory, dataset, with documents organised as follows:

dataset
 |-- 2021
 |    |-- document1.pdf
 |    |-- document2.pdf
 |    |-- ...
 |-- 2022
      |-- document57.pdf
      |-- document58.pdf
      |-- ...

Module Configurations

The PDF input module (named pdfCorpusReader) can be configured as follows:

pdfCorpusReader:
  type: inputPDF
  name: my_corpus
  source: dataset
  output: corpus.json
  splitPages: 5

Output

The module will produce a JSON file, corpus.json, containing a structured corpus to use in subsequent modules:

{
  "name": "my_corpus",
  "documents": [
    {
      "id": "1",
      "i": 1,
      "d": {
        "text": "text found in pdf document",
        "folder": "2021",
        "file": "document1",
        "subDoc": "0",
        "pages": "1-5"
      }
    },
    ...
  ]
}
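
A corpus with this structure can be inspected with a few lines of Python, for instance to check how many documents came from each subdirectory. The sketch below is only an illustration: the helper name `count_by_folder` is hypothetical, and the inlined sample (including its second document) is made up to mirror the shape of the example rather than taken from real output.

```python
import json
from collections import Counter


def count_by_folder(corpus):
    """Count documents per `folder` value in a loaded corpus dictionary."""
    return Counter(doc["d"].get("folder", "") for doc in corpus["documents"])


# Illustrative sample shaped like the example output (not real module output).
sample = json.loads("""
{
  "name": "my_corpus",
  "documents": [
    {"id": "1", "i": 1, "d": {"text": "...", "folder": "2021", "file": "document1"}},
    {"id": "2", "i": 2, "d": {"text": "...", "folder": "2022", "file": "document57"}}
  ]
}
""")

print(sample["name"], dict(count_by_folder(sample)))
```

In practice the dictionary would come from `json.load` on the file given as output, here corpus.json in the data directory.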

If splitPages is not defined or is set to 0, subDoc and pages are not generated in the document data.
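
To make the splitting concrete, here is a minimal Python sketch of how subDoc and pages values could be derived for one file. It is consistent with the example above (subDoc "0" covering pages "1-5" when splitPages is 5), but the rest is an assumption: the function name `split_page_ranges` is hypothetical, and the handling of a shorter final chunk is a guess, not the module's documented behaviour.

```python
def split_page_ranges(page_count, split_pages):
    """Return a list of (subDoc, pages) string pairs for one PDF file.

    An empty list means no splitting, in which case the document would
    carry neither attribute.
    Assumptions: subDoc counts from 0, pages are 1-based inclusive ranges,
    and the last chunk is simply shorter when pages do not divide evenly.
    """
    if split_pages <= 0:
        return []
    ranges = []
    for index, start in enumerate(range(1, page_count + 1, split_pages)):
        end = min(start + split_pages - 1, page_count)
        ranges.append((str(index), f"{start}-{end}"))
    return ranges
```

Under these assumptions, a 12-page file with splitPages set to 5 would give three sub-documents: ("0", "1-5"), ("1", "6-10") and ("2", "11-12").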

PDF Input - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
