LexisNexis IPDD XML Parsers - NETESOLUTIONS/ERNIE GitHub Wiki
This page gives a walk-through of using NeteLabs' open-source Lexis/Nexis IPDD parsers.
All scripts related to downloading, storing, and processing patent-related XML files can be found in the LexisNexis directory of the ERNIE repository.
API access to Lexis/Nexis Intellectual Property Data Direct (IPDD) requires the following credentials:
- Service reference
- User name
- Password
- OS: CentOS 7 Linux
- RDBMS: Postgres 12
- ETL: Python 3 and Bash
- (Optional) Continuous Integration: Jenkins
Once the IPDD credentials have been obtained, this script can be used to retrieve documents (XML files) in batches; each batch is determined incrementally, based on previously received data (if any).
- To parse data from the XML files, we first need to ensure that the database has all the required tables. In Postgres, running this tables script will create all of them. Currently, there are 15 Lexis/Nexis patent-related tables.
- Once the tables have been created, the data-definition language (DDL) can be installed. This consists of all the SQL stored procedures that are called each time new data is downloaded to parse the XML files and update the corresponding tables. The stored procedures used to update each of the 15 tables can be found in this directory.
- After the procedures have been saved, this parser script can be used to call them and parse the data.
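The parsing step reduces to extracting fields from each XML document so the stored procedures can load them into the tables. The sketch below uses Python's standard-library XML parser; the element names are placeholders, since the real IPDD schema defines its own tags.

```python
import xml.etree.ElementTree as ET


def parse_patent(xml_text):
    """Extract a few illustrative fields from a patent document.

    The tag names here are hypothetical -- the actual mapping from
    IPDD elements to the 15 tables lives in the stored procedures.
    """
    root = ET.fromstring(xml_text)
    return {
        "doc_number": root.findtext("doc-number"),
        "title": root.findtext("title"),
    }


sample = "<patent><doc-number>US123456</doc-number><title>Widget</title></patent>"
```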
Assuming that the IPDD credentials have been obtained, all the tables have been created, and the stored procedures and parser scripts are in place, we can use the following scripts to (a) download data and (b) update tables.
Both scripts accept several customizable command-line parameters. These options are documented within the scripts and can also be listed by passing the help option, -h, when executing them.
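A command-line interface of this shape can be sketched with `argparse`; the flags below are illustrative assumptions, not the scripts' actual options.

```python
import argparse


def build_parser():
    """Illustrative argument parser; the real download/update scripts
    define their own flags, so -d and -v here are placeholders."""
    parser = argparse.ArgumentParser(
        description="Download or parse Lexis/Nexis IPDD XML files"
    )
    parser.add_argument("-d", "--directory", default=".", help="working directory")
    parser.add_argument("-v", "--verbose", action="store_true", help="verbose output")
    return parser
```

Running a script built this way with `-h` prints the generated usage text, which is the behavior described above.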
This downloading shell script is used to
- create a folder in the working directory that will store the downloaded files.
- call the API script above to download XML documents (ZIP files) in batches.
- download any new files, if available, via `lftp`.
This updating shell script is used to
- unzip the downloaded files and move them to a temporary processing directory.
- call the parser script to read data into the tables.
- update a text file, `processed.log`, that logs the names of all processed XML files.
- add any files that failed to get processed into a separate `failed` directory.
Note: This script utilizes GNU parallel to update the tables in parallel.
NeteLabs uses Jenkins to automate the download and update processes at regular intervals; both scripts can also be executed manually.
On the rare occasion that specific records need to be deleted (e.g., when patents have become obsolete), this ad-hoc delete script can be used to specify the records to remove.
Refer to the entity-relationship diagram (ERD) below for all the tables, and their corresponding columns, created by the scripts above: