PDF Input - Strategic-Futures-Lab/Topic_Mapping_Pipeline GitHub Wiki
The PDF Input module reads a corpus of documents from a directory of PDF files.
The module type to use is `inputPDF`.
| Name | Description | Optional | Default |
|---|---|---|---|
| `source` | Path to the input directory of PDF files | No | |
| `output` | Path to the output corpus JSON file | No | |
| `name` | Corpus name | Yes | Name of the module in the YAML configs |
| `splitPages` | Split each file into subsets of X pages | Yes | 0 (no splitting) |
- `source` is relative to the source directory and should point to a directory containing `.pdf` documents
  - If `.pdf` documents are contained within subdirectories, those will be read too; the subdirectory name will be used to set the `folder` attribute of documents in the output corpus JSON file
- `output` is relative to the data directory and should point to a `.json` file
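As a rough illustration of the subdirectory rule above, the following Python sketch walks a source directory and derives a `folder` value from each PDF's parent directory. It is not the pipeline's own code: the function name `list_pdfs` is hypothetical, and leaving `folder` empty for files that sit directly in the source directory is an assumption made for this example.

```python
from pathlib import Path


def list_pdfs(source_dir):
    """Return (folder, file) pairs for every .pdf under source_dir.

    Hypothetical helper, not part of the pipeline. Assumption: files that
    sit directly in source_dir get an empty folder value.
    """
    root = Path(source_dir)
    entries = []
    for path in sorted(root.rglob("*.pdf")):
        # Name of the immediate parent directory, or "" at the top level.
        folder = path.parent.name if path.parent != root else ""
        # File name without the .pdf extension, as in the example output.
        entries.append((folder, path.stem))
    return entries
```

Calling `list_pdfs("dataset")` on a layout like the one in the example below would yield pairs such as `("2021", "document1")`.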
The input is a directory, `dataset`, with documents organised as follows:

    dataset
    |-- 2021
    |   |-- document1.pdf
    |   |-- document2.pdf
    |   |-- ...
    |-- 2022
        |-- document57.pdf
        |-- document58.pdf
        |-- ...

The PDF input module (named `pdfCorpusReader`) can be configured as follows:

    pdfCorpusReader:
      type: inputPDF
      name: my_corpus
      source: dataset
      output: corpus.json
      splitPages: 5

The module will produce a JSON file, `corpus.json`, containing a structured corpus to use in subsequent modules:

    {
      "name": "my_corpus",
      "documents": [
        {
          "id": "1",
          "i": 1,
          "d": {
            "text": "text found in pdf document",
            "folder": "2021",
            "file": "document1",
            "subDoc": "0",
            "pages": "1-5"
          }
        },
        ...
      ]
    }

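The corpus file can also be inspected directly. The Python sketch below relies only on the structure shown above and counts documents per `folder`. The helper name `count_by_folder` and the inline sample data are illustrative, not part of the pipeline.

```python
import json
from collections import Counter


def count_by_folder(corpus):
    """Count documents per folder in a corpus dict shaped like the example.

    Hypothetical helper. Documents without a "folder" key are grouped
    under the empty string.
    """
    return Counter(doc["d"].get("folder", "") for doc in corpus["documents"])


# Inline sample mirroring the structure above; for a real file, call
# json.load() on the opened corpus.json instead.
sample = json.loads("""
{
  "name": "my_corpus",
  "documents": [
    {"id": "1", "i": 1, "d": {"text": "a", "folder": "2021", "file": "document1"}},
    {"id": "2", "i": 2, "d": {"text": "b", "folder": "2021", "file": "document2"}},
    {"id": "3", "i": 3, "d": {"text": "c", "folder": "2022", "file": "document57"}}
  ]
}
""")
print(count_by_folder(sample))
```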
If `splitPages` is not defined or is set to 0, `subDoc` and `pages` are not generated in the document data.
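To make the effect of `splitPages` concrete, here is a minimal Python sketch of how page ranges such as `"1-5"` could be derived. It assumes 1-based inclusive page numbers and a `subDoc` counter starting at 0, which matches the example output, and it assumes the last subset simply holds the remaining pages. The pipeline's actual behaviour may differ, and `split_pages` is a hypothetical name.

```python
def split_pages(total_pages, split):
    """Return (subDoc, pages) string pairs for a document.

    Assumptions (not confirmed by the pipeline): pages are 1-based and
    inclusive, subDoc starts at 0, and the final subset may be shorter.
    With split <= 0 no splitting happens and an empty list is returned.
    """
    if split <= 0:
        return []
    ranges = []
    for sub_doc, start in enumerate(range(1, total_pages + 1, split)):
        end = min(start + split - 1, total_pages)
        ranges.append((str(sub_doc), f"{start}-{end}"))
    return ranges


print(split_pages(12, 5))  # [('0', '1-5'), ('1', '6-10'), ('2', '11-12')]
```

Under these assumptions, a 12-page file with `splitPages: 5` would produce three sub-documents, the last covering pages 11 to 12.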