Scopus Module

This page gives a walk-through of using NeteLabs' open-source Scopus parser.

All scripts related to downloading, storing, and processing Scopus XML files can be found in the Scopus directory of the ERNIE repository.

Downloading and Updating

The Scopus ETL pipeline has two main jobs: Scopus_download and Scopus_update.

The Scopus_download job is triggered by an incoming email; it uses scopus_update_email_parser.py to scan the email for a URL. Once the URL is found, the script downloads the linked file and saves it to a specified directory. This is part of an automated process that leverages Jenkins to kick off the pipeline whenever an email containing a download URL is received.
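For illustration, here is a minimal Python sketch of such an email-to-URL step. The function names (extract_url, save_from_url) and the command-line interface are assumptions for this example; the actual scopus_update_email_parser.py may be organized differently.

```python
import re
import sys
import urllib.parse
import urllib.request
from email import message_from_binary_file
from pathlib import Path

# NOTE: hypothetical sketch; names and structure are assumptions, not the
# real scopus_update_email_parser.py.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_url(email_path):
    """Return the first URL found in any text/plain part of a raw email file."""
    with open(email_path, "rb") as fh:
        msg = message_from_binary_file(fh)
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True).decode(errors="replace")
            match = URL_PATTERN.search(body)
            if match:
                return match.group(0)
    return None

def save_from_url(url, target_dir):
    """Download the file behind the URL into target_dir and return its path."""
    name = Path(urllib.parse.urlparse(url).path).name or "scopus_update.zip"
    target = Path(target_dir) / name
    urllib.request.urlretrieve(url, target)
    return target

if __name__ == "__main__":
    found = extract_url(sys.argv[1])              # path to the raw email file
    if found:
        print(save_from_url(found, sys.argv[2]))  # target directory
```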

The Scopus_update job is triggered by the Scopus_download job; it uses load.sh and process_pub_zips.sh to extract all publication ZIP files from the specified working directory to a temporary directory, process the extracted ZIP files one by one, and parse the XML files and update the database in parallel.
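The shell scripts define the exact behavior; the following Python sketch only illustrates the overall pattern of one-ZIP-at-a-time extraction with parallel XML parsing. parse_and_load is a hypothetical stand-in for the real parser/loader.

```python
import multiprocessing
import sys
import tempfile
import zipfile
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def parse_and_load(xml_path):
    """Placeholder for the real XML parser / database loader."""
    print(f"parsing {xml_path}")

def process_zip(zip_path, scratch_root):
    """Extract one publication ZIP to a temp dir, then parse its XMLs in parallel."""
    with tempfile.TemporaryDirectory(dir=scratch_root) as tmp:
        with zipfile.ZipFile(zip_path) as zf:
            zf.extractall(tmp)
        xml_files = sorted(Path(tmp).rglob("*.xml"))
        # XML files within a single ZIP are parsed in parallel
        with ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as pool:
            list(pool.map(parse_and_load, xml_files))

def main(work_dir, scratch_root):
    # ZIP files are handled one by one to bound temporary disk usage
    for zip_path in sorted(Path(work_dir).glob("*.zip")):
        process_zip(zip_path, scratch_root)

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```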

Several customizable command-line parameters are available for these scripts. The options are documented within the scripts themselves and can also be listed by running a script with the -h (help) option.
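As a purely hypothetical illustration of how such options are typically wired up (and why -h works), here is an argparse skeleton; the flags shown are invented for this example and are not the real options of load.sh or process_pub_zips.sh.

```python
import argparse

# Illustrative only: these flags are hypothetical. Run the actual scripts
# with -h to see their real options.
parser = argparse.ArgumentParser(description="Process Scopus publication ZIPs")
parser.add_argument("-d", "--work-dir", default=".",
                    help="directory containing publication ZIP files")
parser.add_argument("-j", "--jobs", type=int, default=4,
                    help="number of parallel parser processes")
args = parser.parse_args()  # argparse provides -h/--help automatically
print(args)
```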

Tables

The updated Scopus data is stored in 22 tables in our Postgres database; the tables are linked to one another through primary and foreign keys. Detailed information about each table can be found in scopus_tables.sql.
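As a sketch of how those key relationships can be used, the query below joins two tables through a shared key. The table and column names (scopus_publications, scopus_authors, scp, auid) and the database name are assumptions for illustration; consult scopus_tables.sql for the actual schema.

```python
import psycopg2

# Assumed schema for illustration only: scopus_publications keyed by scp,
# with scopus_authors carrying an scp foreign key back to it.
QUERY = """
SELECT pub.scp, au.auid
  FROM scopus_publications pub
  JOIN scopus_authors au ON au.scp = pub.scp  -- FK back to the publication PK
 LIMIT 10;
"""

with psycopg2.connect(dbname="ernie") as conn:  # database name is assumed
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for row in cur.fetchall():
            print(row)
```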

Entity-Relationship Diagram

Refer to the entity-relationship diagram (ERD) below for all the tables and corresponding columns created by the scripts above: