MTE Parsers - wkiri/MTE GitHub Wiki
MTE Parser Indexer
Introduction
The MTE Parser Indexer contains 1 base parser and 7 parsers created for different purposes.
TODO: UPDATE the following links once the parser scripts are merged from the parser
to the master
branch.
Base Parser: The base parser that all parsers should inherit.
TIKA Parser: The TIKA parser utilizes Apache TIKA service to convert PDF files to text files.
ADS Parser: The ADS parser utilizes the search API of the Astrophysics Data System (ADS) to extract information including title, author, primary author, affiliation, publication venue, and publication date.
CoreNLP Parser: The CoreNLP parser utilizes the Named Entity Recognition (NER) sub-module of the Stanford CoreNLP package to categorize words into named entities (e.g., target, mineral, element)
JSRE Parser: The JSRE parser utilizes the Java Simple Relation Extraction (JSRE) toolkit to extract relations between named entities.
Paper Parser: The Paper parser is a generic parser suitable for papers from all publication venues. The Paper parser is implemented to augment/remove contents (e.g., translate some UTF8 punctuation to ASCII, remove hyphenation at the end of lines, etc.) general to all papers.
LPSC Parser: The LPSC parser is created for the two-page abstract from Lunar and Planetary Science Conference (LPSC). It utilizes regular expression matches to remove contents specific (e.g., abstract id, conference header) to the LPSC abstract.
JGR Parser: The JGR parser is created for the papers from Journal of Geophysical Research.
The class diagram of the parsers is shown below:
Usage
- TIKA Parser
>>> python tika_parser.py -h
usage: tika_parser.py [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
[-l LOG_FILE] [-p TIKA_SERVER_URL]
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
Path to input file
-li IN_LIST, --in_list IN_LIST
Path to input list
-o OUT_FILE, --out_file OUT_FILE
Path to output JSON file
-l LOG_FILE, --log_file LOG_FILE
Log file that contains processing information. It is
default to ./tika-parser-log.txt unless otherwise
specified.
-p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
Tika server URL
Note that the -p
TIKA_SERVER_URL argument is optional. The following command is an example of using TIKA parser:
python tika_parser.py -li /PATH/TO/LIST/OF/PDF/FILES -o /PATH/TO/OUTPUT/JSONL/FILE -l /PATH/TO/OUTPUT/LOG/FILE
- ADS Parser
>>> python ads_parser.py -h
usage: ads_parser.py [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE [-l LOG_FILE]
[-p TIKA_SERVER_URL] [-a ADS_URL] [-t ADS_TOKEN]
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
Path to input file
-li IN_LIST, --in_list IN_LIST
Path to input list
-o OUT_FILE, --out_file OUT_FILE
Path to output JSON file
-l LOG_FILE, --log_file LOG_FILE
Log file that contains processing information. It is
default to ./ads-parser-log.txt unless otherwise
specified.
-p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
Tika server URL
-a ADS_URL, --ads_url ADS_URL
ADS RESTful API. The ADS RESTful API should not need
to be changed frequently unless someting at the ADS is
changed.
-t ADS_TOKEN, --ads_token ADS_TOKEN
The ADS token, which is required to use the ADS
RESTful API. The token was obtained using the
instructions at https://github.com/adsabs/adsabs-dev-
api#access. The ADS token should not need to be
changed frequently unless something at the ADS is
changed.
The example command is shown below:
python ads_parser.py -li /PATH/TO/LIST/OF/PDF/FILES -o /PATH/TO/OUTPUT/JSONL/FILE -l /PATH/TO/OUTPUT/LOG/FILE
- CoreNLP Parser
>>> python corenlp_parser.py -h
usage: corenlp_parser.py [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
[-l LOG_FILE] [-p TIKA_SERVER_URL]
[-c CORENLP_SERVER_URL] [-n NER_MODEL] [-a ADS_URL]
[-t ADS_TOKEN]
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
Path to input file
-li IN_LIST, --in_list IN_LIST
Path to input list
-o OUT_FILE, --out_file OUT_FILE
Path to output JSON file
-l LOG_FILE, --log_file LOG_FILE
Log file that contains processing information. It is
default to ./corenlp-parser-log.txt unless otherwise
specified.
-p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
Tika server URL
-c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
CoreNLP Server URL
-n NER_MODEL, --ner_model NER_MODEL
Path to a Named Entity Recognition (NER) model
-a ADS_URL, --ads_url ADS_URL
ADS RESTful API. The ADS RESTful API should not need
to be changed frequently unless someting at the ADS is
changed.
-t ADS_TOKEN, --ads_token ADS_TOKEN
The ADS token, which is required to use the ADS
RESTful API. The token was obtained using the
instructions at https://github.com/adsabs/adsabs-dev-
api#access. The ADS token should not need to be
changed frequently unless something at the ADS is
changed.
The example command is shown below:
python ads_parser.py -li /PATH/TO/LIST/OF/PDF/FILES -o /PATH/TO/OUTPUT/JSONL/FILE -l /PATH/TO/OUTPUT/LOG/FILE -n /PATH/TO/TRAINED/NER/MODEL
- JSRE Parser
>>> python jsre_parser.py -h
usage: jsre_parser.py [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
[-l LOG_FILE] [-p TIKA_SERVER_URL]
[-c CORENLP_SERVER_URL] [-n NER_MODEL] [-jr JSRE_ROOT]
-jm JSRE_MODEL [-jt JSRE_TMP_DIR] [-a ADS_URL]
[-t ADS_TOKEN]
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
Path to input file
-li IN_LIST, --in_list IN_LIST
Path to input list
-o OUT_FILE, --out_file OUT_FILE
Path to output JSON file
-l LOG_FILE, --log_file LOG_FILE
Log file that contains processing information. It is
default to ./jsre-parser-log.txt unless otherwise
specified.
-p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
Tika server URL
-c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
CoreNLP Server URL
-n NER_MODEL, --ner_model NER_MODEL
Path to a Named Entity Recognition (NER) model
-jr JSRE_ROOT, --jsre_root JSRE_ROOT
Path to jSRE installation directory. Default is
/proj/mte/jSRE/jsre-1.1
-jm JSRE_MODEL, --jsre_model JSRE_MODEL
Path to jSRE model
-jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
Path to a directory for jSRE to temporarily store
input and output files. Default is /tmp
-a ADS_URL, --ads_url ADS_URL
ADS RESTful API. The ADS RESTful API should not need
to be changed frequently unless someting at the ADS is
changed.
-t ADS_TOKEN, --ads_token ADS_TOKEN
The ADS token, which is required to use the ADS
RESTful API. The token was obtained using the
instructions at https://github.com/adsabs/adsabs-dev-
api#access. The ADS token should not need to be
changed frequently unless something at the ADS is
changed.
The example command is shown below:
python jsre_parser.py -li /PATH/TO/LIST/OF/PDF/FILES -o /PATH/TO/OUTPUT/JSONL/FILE -l /PATH/TO/OUTPUT/LOG/FILE -n /PATH/TO/TRAINED/NER/MODEL -jr /PATH/TO/TRAINED/JSRE/MODEL
- Paper Parser
>>> python paper_parser.py -h
usage: paper_parser.py [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
[-l LOG_FILE] [-p TIKA_SERVER_URL]
[-c CORENLP_SERVER_URL] [-n NER_MODEL] [-jr JSRE_ROOT]
-jm JSRE_MODEL [-jt JSRE_TMP_DIR] [-a ADS_URL]
[-t ADS_TOKEN]
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
Path to input file
-li IN_LIST, --in_list IN_LIST
Path to input list
-o OUT_FILE, --out_file OUT_FILE
Path to output JSON file
-l LOG_FILE, --log_file LOG_FILE
Log file that contains processing information. It is
default to ./paper-parser-log.txt unless otherwise
specified.
-p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
Tika server URL
-c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
CoreNLP Server URL
-n NER_MODEL, --ner_model NER_MODEL
Path to a Named Entity Recognition (NER) model
-jr JSRE_ROOT, --jsre_root JSRE_ROOT
Path to jSRE installation directory. Default is
/proj/mte/jSRE/jsre-1.1
-jm JSRE_MODEL, --jsre_model JSRE_MODEL
Path to jSRE model
-jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
Path to a directory for jSRE to temporarily store
input and output files. Default is /tmp
-a ADS_URL, --ads_url ADS_URL
ADS RESTful API. The ADS RESTful API should not need
to be changed frequently unless someting at the ADS is
changed.
-t ADS_TOKEN, --ads_token ADS_TOKEN
The ADS token, which is required to use the ADS
RESTful API. The token was obtained using the
instructions at https://github.com/adsabs/adsabs-dev-
api#access. The ADS token should not need to be
changed frequently unless something at the ADS is
changed.
The example command is shown below:
python paper_parser.py -li /PATH/TO/LIST/OF/PDF/FILES -o /PATH/TO/OUTPUT/JSONL/FILE -l /PATH/TO/OUTPUT/LOG/FILE -n /PATH/TO/TRAINED/NER/MODEL -jr /PATH/TO/TRAINED/JSRE/MODEL
- LPSC parser
python lpsc_parser.py -h
usage: lpsc_parser.py [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE
[-l LOG_FILE] [-p TIKA_SERVER_URL]
[-c CORENLP_SERVER_URL] [-n NER_MODEL] [-jr JSRE_ROOT]
-jm JSRE_MODEL [-jt JSRE_TMP_DIR] [-a ADS_URL]
[-t ADS_TOKEN]
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
Path to input file
-li IN_LIST, --in_list IN_LIST
Path to input list
-o OUT_FILE, --out_file OUT_FILE
Path to output JSON file
-l LOG_FILE, --log_file LOG_FILE
Log file that contains processing information. It is
default to ./lpsc-parser-log.txt unless otherwise
specified.
-p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
Tika server URL
-c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
CoreNLP Server URL
-n NER_MODEL, --ner_model NER_MODEL
Path to a Named Entity Recognition (NER) model
-jr JSRE_ROOT, --jsre_root JSRE_ROOT
Path to jSRE installation directory. Default is
/proj/mte/jSRE/jsre-1.1
-jm JSRE_MODEL, --jsre_model JSRE_MODEL
Path to jSRE model
-jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
Path to a directory for jSRE to temporarily store
input and output files. Default is /tmp
-a ADS_URL, --ads_url ADS_URL
ADS RESTful API. The ADS RESTful API should not need
to be changed frequently unless someting at the ADS is
changed.
-t ADS_TOKEN, --ads_token ADS_TOKEN
The ADS token, which is required to use the ADS
RESTful API. The token was obtained using the
instructions at https://github.com/adsabs/adsabs-dev-
api#access. The ADS token should not need to be
changed frequently unless something at the ADS is
changed.
The example command is shown below:
python lpsc_parser.py -li /PATH/TO/LIST/OF/PDF/FILES -o /PATH/TO/OUTPUT/JSONL/FILE -l /PATH/TO/OUTPUT/LOG/FILE -n /PATH/TO/TRAINED/NER/MODEL -jr /PATH/TO/TRAINED/JSRE/MODEL
- JGR Parser
python jgr_parser.py -h
usage: jgr_parser.py [-h] (-i IN_FILE | -li IN_LIST) -o OUT_FILE [-l LOG_FILE]
[-p TIKA_SERVER_URL] [-c CORENLP_SERVER_URL]
[-n NER_MODEL] [-jr JSRE_ROOT] -jm JSRE_MODEL
[-jt JSRE_TMP_DIR] [-a ADS_URL] [-t ADS_TOKEN]
optional arguments:
-h, --help show this help message and exit
-i IN_FILE, --in_file IN_FILE
Path to input file
-li IN_LIST, --in_list IN_LIST
Path to input list
-o OUT_FILE, --out_file OUT_FILE
Path to output JSON file
-l LOG_FILE, --log_file LOG_FILE
Log file that contains processing information. It is
default to ./jgr-parser-log.txt unless otherwise
specified.
-p TIKA_SERVER_URL, --tika_server_url TIKA_SERVER_URL
Tika server URL
-c CORENLP_SERVER_URL, --corenlp_server_url CORENLP_SERVER_URL
CoreNLP Server URL
-n NER_MODEL, --ner_model NER_MODEL
Path to a Named Entity Recognition (NER) model
-jr JSRE_ROOT, --jsre_root JSRE_ROOT
Path to jSRE installation directory. Default is
/proj/mte/jSRE/jsre-1.1
-jm JSRE_MODEL, --jsre_model JSRE_MODEL
Path to jSRE model
-jt JSRE_TMP_DIR, --jsre_tmp_dir JSRE_TMP_DIR
Path to a directory for jSRE to temporarily store
input and output files. Default is /tmp
-a ADS_URL, --ads_url ADS_URL
ADS RESTful API. The ADS RESTful API should not need
to be changed frequently unless someting at the ADS is
changed.
-t ADS_TOKEN, --ads_token ADS_TOKEN
The ADS token, which is required to use the ADS
RESTful API. The token was obtained using the
instructions at https://github.com/adsabs/adsabs-dev-
api#access. The ADS token should not need to be
changed frequently unless something at the ADS is
changed.
The example command is shown below:
python jgr_parser.py -li /PATH/TO/LIST/OF/PDF/FILES -o /PATH/TO/OUTPUT/JSONL/FILE -l /PATH/TO/OUTPUT/LOG/FILE -n /PATH/TO/TRAINED/NER/MODEL -jr /PATH/TO/TRAINED/JSRE/MODEL