LexisNexis IPDD XML Parsers - NETESOLUTIONS/ERNIE GitHub Wiki
This page gives a walk-through of using NeteLabs' open-source Lexis/Nexis IPDD parsers.
All scripts related to downloading, storing, and processing patent-related XML files can be found in the LexisNexis directory of the ERNIE repository.
API access to Lexis/Nexis Intellectual Property Data Direct (IPDD) requires the following credentials:
- Service reference
- User name
- Password
- OS: CentOS 7 Linux
- RDBMS: Postgres 12
- ETL: Python 3 and Bash
- (Optional) Continuous Integration: Jenkins
Once the IPDD credentials have been obtained, this script can be used to retrieve documents (XML files) in batches; each batch is determined incrementally, based on previously received data (if any).
- To parse data from the XML files, we first need to ensure that the database has all the required tables. In Postgres, running this tables script will create all of them. Currently, there are 15 Lexis/Nexis patent-related tables.
- Once the tables have been created, the data-definition language (DDL) can be installed. This consists of all the SQL stored procedures that are called each time new data is downloaded to parse the XML files and update the corresponding tables. The stored procedures used to update each of the 15 tables can be found in this directory.
- After the procedures have been saved, this parser script can be used to call them and parse the data.
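The parsing step reduces to extracting fields from each XML document so the stored procedures can load them into the tables. The sketch below uses Python's standard-library XML parser; the element names are placeholders, since the real IPDD schema defines its own tags.

```python
import xml.etree.ElementTree as ET


def parse_patent(xml_text):
    """Extract a few illustrative fields from a patent document.

    The tag names here are hypothetical -- the actual mapping from
    IPDD elements to the 15 tables lives in the stored procedures.
    """
    root = ET.fromstring(xml_text)
    return {
        "doc_number": root.findtext("doc-number"),
        "title": root.findtext("title"),
    }


sample = "<patent><doc-number>US123456</doc-number><title>Widget</title></patent>"
```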
Assuming that the IPDD credentials have been obtained, all the tables have been created, and the stored procedures and parser scripts are in place, we can use the following scripts to (a) download data and (b) update tables.
Both scripts accept several customizable command-line parameters. These options are documented within the scripts and can also be listed by passing the help option, -h, when executing them.
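A command-line interface of this shape can be sketched with `argparse`; the flags below are illustrative assumptions, not the scripts' actual options.

```python
import argparse


def build_parser():
    """Illustrative argument parser; the real download/update scripts
    define their own flags, so -d and -v here are placeholders."""
    parser = argparse.ArgumentParser(
        description="Download or parse Lexis/Nexis IPDD XML files"
    )
    parser.add_argument("-d", "--directory", default=".", help="working directory")
    parser.add_argument("-v", "--verbose", action="store_true", help="verbose output")
    return parser
```

Running a script built this way with `-h` prints the generated usage text, which is the behavior described above.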
This downloading shell script is used to
- create a folder in the working directory that will store the downloaded files.
- call the API script above to download XML documents (ZIP files) in batches.
- download any new files, if available, via `lftp`.
This updating shell script is used to
- unzip the downloaded files and move them to a temporary processing directory.
- call the parser script to read data into the tables.
- update a text file, `processed.log`, that logs the names of all processed XML files.
- add any files that failed to get processed into a separate `failed` directory.
Note: This script utilizes GNU parallel to update the tables in parallel.
NeteLabs uses Jenkins to automate the download and update processes at regular intervals; both scripts can also be executed manually.
On the rare occasion that specific records need to be deleted (e.g., when patents have become obsolete), this ad-hoc delete script can be used to specify the records to remove.
Refer to the entity-relationship diagram (ERD) below for all the tables, and their corresponding columns, created by the scripts above: