Data Integrity Testing - OpenAPC/openapc-de GitHub Wiki

Background

Since most of the metadata submitted to OpenAPC has been manually created at some point in its life cycle, it will inevitably contain errors. Furthermore, even data imported from external sources like CrossRef cannot be relied on to be correct or up-to-date in all cases. We address this problem by employing a software test suite which checks the whole dataset for potential errors on a regular basis.

Technical details

The test script is written in Python and based on the pytest testing framework. Upon execution, the script imports both the OpenAPC core data file and the offsetting file and sends every entry through a set of test functions. When the run finishes, a report lists all errors encountered.
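The flow described above (read every entry, pass it through a set of test functions, report the errors) can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the actual pytest-based script; the function names, the error format and the two-column sample data are assumptions made for this example only.

```python
import csv
import io


def check_publisher_present(row):
    """Return an error message, or None if the row passes (hypothetical check)."""
    value = row.get("publisher") or ""
    if value in ("", "NA"):
        return "publisher is empty or NA"
    return None


def check_is_hybrid_boolean(row):
    """Return an error message unless is_hybrid is TRUE or FALSE (hypothetical check)."""
    if row.get("is_hybrid") not in ("TRUE", "FALSE"):
        return "is_hybrid must be TRUE or FALSE"
    return None


# The set of test functions every entry is sent through.
CHECKS = [check_publisher_present, check_is_hybrid_boolean]


def run_checks(csv_text, checks=CHECKS):
    """Run every check on every CSV line and collect (line number, message) pairs."""
    errors = []
    reader = csv.DictReader(io.StringIO(csv_text))
    # Line 1 is the header, so the first data line is line 2.
    for line_number, row in enumerate(reader, start=2):
        for check in checks:
            message = check(row)
            if message is not None:
                errors.append((line_number, message))
    return errors


# Invented sample data, reduced to two columns for brevity.
SAMPLE = "publisher,is_hybrid\nExample Press,TRUE\nNA,maybe\n"

for line_number, message in run_checks(SAMPLE):
    print(f"line {line_number}: {message}")
```

Keeping each rule in its own small function makes it easy to add or remove checks without touching the loop that walks the file.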

There are two work modes for the test suite. First, it can simply be called from the command line to verify data integrity in the local git repository (this should always be done before pushing any changes to the APC data files back to GitHub!). Second, it is called automatically whenever a push or pull request occurs in the OpenAPC repository by hooking into a continuous integration service (Travis, in our case). In that case the test suite is executed on a remote server and the results are reported to the OpenAPC team via mail/Slack integration. A small widget on the OpenAPC README page also shows the latest test status.

Test cases

The following tests are applied to every article (that is, every CSV line) in the OpenAPC core data file and the offsetting file:

Standalone tests

These tests are independent of other lines in the file:

(syntax) Every line must consist of exactly 18 columns.
(content) The columns publisher and journal_full_title may not be empty or NA, and they may not contain leading or trailing whitespace.
(content) The columns is_hybrid, indexed_in_crossref and doaj must either be TRUE or FALSE.
(content) The column doi must either be NA or contain a valid DOI (checked against a regular expression).
(content) If the column doi is NA, the column url may not be NA.
(content) The column issn may not be empty or NA. Its content must represent an ISSN, which is checked for both syntactic correctness (regular expression) and semantic correctness (ISSN check digit calculation). The other ISSN fields (issn_print, issn_electronic and issn_l) may be NA, but if they contain a value, it must pass the same checks.
(content) The column euro must contain a valid numerical value (dot (".") as decimal point, no thousands separator) which must be larger than 0. Entries from the offsetting file skip this test.
(logical*) If the column doaj is TRUE, the column is_hybrid must be FALSE.
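Two of the standalone content tests can be illustrated with a short sketch. The ISSN check digit calculation follows the standard ISSN scheme (weights 8 down to 2 on the first seven digits, modulo 11, with X standing for 10). The function names and both regular expressions are assumptions made for this example and are not necessarily the ones used in the OpenAPC test suite.

```python
import re

# Syntactic form of an ISSN: four digits, a hyphen, three digits, then a digit or X.
ISSN_PATTERN = re.compile(r"[0-9]{4}-[0-9]{3}[0-9X]")

# Assumed form of a valid euro value: digits, optionally a dot and more digits.
EURO_PATTERN = re.compile(r"[0-9]+(\.[0-9]+)?")


def is_valid_issn(issn):
    """Check an ISSN syntactically and by recomputing its check digit."""
    if not ISSN_PATTERN.fullmatch(issn):
        return False
    digits = issn.replace("-", "")
    # Weight the first seven digits with 8, 7, ..., 2 and sum the products.
    total = sum(int(d) * w for d, w in zip(digits[:7], range(8, 1, -1)))
    remainder = (11 - total % 11) % 11
    expected = "X" if remainder == 10 else str(remainder)
    return digits[7] == expected


def is_valid_euro(value):
    """Check for a positive number with a dot as decimal point and no thousands separator."""
    if not EURO_PATTERN.fullmatch(value):
        return False
    return float(value) > 0


print(is_valid_issn("0378-5955"))  # check digit matches
print(is_valid_issn("0378-5954"))  # check digit does not match
print(is_valid_euro("1500.00"))    # dot as decimal point
print(is_valid_euro("1,500.00"))   # thousands separator is rejected
```

The syntactic test runs first so that the check digit arithmetic only ever sees input of the expected shape.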

Interdependent tests

These are consistency tests which check a line against all other lines.

(duplicate) Two entries may not share the same content in the doi column (except for NA).
(consistency*) If two entries share the same content in either the issn, issn_print or issn_electronic columns, the columns publisher, journal_full_title, is_hybrid and issn_l must also be identical.
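The duplicate test compares each line against all others, which can be done in a single pass by remembering where each DOI was seen. The following is a minimal sketch under the assumption that rows are available as dictionaries and that DOIs are compared as exact strings; the function name and the sample DOIs are invented for illustration.

```python
from collections import defaultdict


def find_duplicate_dois(rows):
    """Map every DOI that occurs more than once to the line numbers where it occurs."""
    seen = defaultdict(list)
    for line_number, row in enumerate(rows, start=1):
        doi = row["doi"]
        # NA marks a missing DOI and is allowed to repeat.
        if doi != "NA":
            seen[doi].append(line_number)
    return {doi: lines for doi, lines in seen.items() if len(lines) > 1}


# Invented sample rows, reduced to the doi column.
rows = [
    {"doi": "10.1000/a"},
    {"doi": "NA"},
    {"doi": "10.1000/a"},
    {"doi": "NA"},
]
print(find_duplicate_dois(rows))
```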

The tests marked with a (*) are not entirely accurate and will return false positives in some cases (a journal's title, publisher or hybrid status is not always consistent over time). In such cases an ISSN can be whitelisted to skip some of those tests.
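The consistency test and the whitelisting mechanism might look roughly like the sketch below. It rests on several assumptions: rows are dictionaries, ISSNs are compared within the same column only, each line is compared against the first line that used the same ISSN, and a whitelisted ISSN skips the consistency comparison entirely (a simplification of "some of those tests"). All names are hypothetical.

```python
ISSN_COLUMNS = ("issn", "issn_print", "issn_electronic")
COMPARED_COLUMNS = ("publisher", "journal_full_title", "is_hybrid", "issn_l")


def find_inconsistencies(rows, whitelist=frozenset()):
    """Return (first line, current line, ISSN, column) for every mismatch found."""
    first_seen = {}  # (column, ISSN) -> (line number, row) of the first occurrence
    problems = []
    for line_number, row in enumerate(rows, start=1):
        for column in ISSN_COLUMNS:
            issn = row[column]
            # Missing values and whitelisted ISSNs are not compared.
            if issn == "NA" or issn in whitelist:
                continue
            key = (column, issn)
            if key not in first_seen:
                first_seen[key] = (line_number, row)
                continue
            other_line, other_row = first_seen[key]
            for name in COMPARED_COLUMNS:
                if row[name] != other_row[name]:
                    problems.append((other_line, line_number, issn, name))
    return problems


# Invented sample rows: same ISSN, different publisher.
base = {"issn": "0000-0001", "issn_print": "NA", "issn_electronic": "NA",
        "journal_full_title": "J", "is_hybrid": "TRUE", "issn_l": "0000-0001"}
rows = [dict(base, publisher="A"), dict(base, publisher="B")]

print(find_inconsistencies(rows))
print(find_inconsistencies(rows, whitelist={"0000-0001"}))
```

Whitelisting by ISSN keeps the exception narrow: only journals known to have changed title, publisher or hybrid status are exempted, while all other entries are still checked.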