4. Additional data validation features - grahamjevon/ReG GitHub Wiki
While the core function of ReG is to generate catalogue references, it also includes several validation features to help you spot and correct problems with your data.
Unexpected hierarchy data
For catalogue references to be calculated, every value in the level column must match a term in the configured hierarchy. ReG checks this and, if it finds a discrepancy, notifies the user and offers two options to proceed.
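The check described above can be sketched in a few lines of Python. This is only an illustration, not ReG's actual code: the function name `find_unexpected_terms`, the example hierarchy, and the 1-based row numbering are all assumptions.

```python
def find_unexpected_terms(levels, hierarchy):
    """Return {term: [row numbers]} for level values missing from the hierarchy."""
    known = set(hierarchy)
    unexpected = {}
    # Row numbers are 1-based here; this numbering is an assumption.
    for row_number, term in enumerate(levels, start=1):
        if term not in known:
            unexpected.setdefault(term, []).append(row_number)
    return unexpected


# Hypothetical configured hierarchy and level column.
hierarchy = ["Fonds", "Series", "File", "Item"]
levels = ["Fonds", "Series", "Files", "File", "Specimen"]

unexpected = find_unexpected_terms(levels, hierarchy)
print(unexpected)
```

An empty result would mean every value in the level column matches the hierarchy and references can be calculated.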
Option 1: Rename unexpected terms
First, users can rename any unexpected terms. This is useful for correcting typographical errors, such as in this example, where “Files” should be “File”.
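Conceptually, renaming amounts to applying a mapping from each unexpected term to its corrected form. The following is a minimal sketch under that assumption; `rename_terms` and the `renames` mapping are hypothetical names, not part of ReG.

```python
def rename_terms(levels, renames):
    """Replace each level value found in `renames`; leave all others unchanged."""
    return [renames.get(term, term) for term in levels]


# Hypothetical level column containing the typographical error "Files".
levels = ["Fonds", "Series", "Files", "File"]
corrected = rename_terms(levels, {"Files": "File"})
print(corrected)
```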
Option 2: Build a one-off hierarchy
Alternatively, users can create a one-off hierarchy that matches the terms in the dataset. In the following example, the unexpected hierarchical term “Specimen” is a bona fide term. It is just not part of the configured hierarchy.
Rather than forcing the user to quit the program and amend the configuration file, ReG lets them set the hierarchy within the program.
This hierarchy is not saved for future sessions; it is used only on this one occasion. If the user wants “Specimen” to be recognised in future, the configuration file will also need to be updated.
Single child records
To avoid redundant information, it is sometimes advisable for an archivist to eliminate single child records from a collection. ReG will identify any such records (e.g. a "Series" that contains only one "File") and warn you. It will:
- Tell you how many single child records have been found per level
- Tell you the row numbers where these can be found in the original dataset (to help you manually validate the records)
- Export a subset of data containing just the single child records and their parents (to help you manually validate the records)
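One way to picture the detection step is shown below. This sketch assumes the rows are listed in hierarchical order (each record follows its parent) and that all level values have already been validated against the hierarchy. The function names and the 1-based row numbering are assumptions, not ReG's own implementation.

```python
from collections import Counter


def find_parents(levels, hierarchy):
    """Return, for each row, the index of its parent row (None for top-level rows)."""
    depth = {term: i for i, term in enumerate(hierarchy)}
    parents = []
    stack = []  # indices of currently open ancestors, shallowest first
    for i, term in enumerate(levels):
        # Close any open record at the same depth or deeper than this row.
        while stack and depth[levels[stack[-1]]] >= depth[term]:
            stack.pop()
        parents.append(stack[-1] if stack else None)
        stack.append(i)
    return parents


def find_single_children(levels, hierarchy):
    """Return (parent index, child index) pairs where the parent has exactly one child."""
    children = {}
    for i, parent in enumerate(find_parents(levels, hierarchy)):
        if parent is not None:
            children.setdefault(parent, []).append(i)
    return [(p, kids[0]) for p, kids in children.items() if len(kids) == 1]


# Hypothetical dataset: the third series contains only one file.
hierarchy = ["Fonds", "Series", "File", "Item"]
levels = ["Fonds", "Series", "File", "File", "Series", "File", "File", "Series", "File"]

pairs = find_single_children(levels, hierarchy)
per_level = Counter(levels[child] for _, child in pairs)  # count per level
row_numbers = [child + 1 for _, child in pairs]           # 1-based row numbers
subset = sorted({i for pair in pairs for i in pair})      # children plus their parents
print(dict(per_level), row_numbers, subset)
```

The three values printed at the end correspond loosely to the three items in the list above: a count per level, the row numbers, and the rows that would go into the exported subset.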
It will offer three options to proceed:
- Delete single child records
- Delete the parents of single child records
- Keep the single child records and/or their parents
These decisions can be made in two ways:
- Applied to all applicable records in the whole dataset
- Applied to all applicable records on a level-by-level basis (e.g. keep all single child files, but delete all single child items).
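The two ways of applying a decision can be sketched as a default choice plus optional per-level overrides. The decision labels (`"keep"`, `"delete_child"`, `"delete_parent"`), the function name `rows_to_drop`, and the example data are all hypothetical.

```python
def rows_to_drop(pairs, levels, default="keep", per_level=None):
    """Return the set of row indices to delete.

    `pairs` holds (parent index, child index) tuples for single child records.
    `default` applies to the whole dataset; `per_level` overrides it for
    single children at a given level.
    """
    per_level = per_level or {}
    drop = set()
    for parent, child in pairs:
        choice = per_level.get(levels[child], default)
        if choice == "delete_child":
            drop.add(child)
        elif choice == "delete_parent":
            drop.add(parent)
        # "keep" leaves both rows in place
    return drop


# Hypothetical data: row 2 is a single child file, row 5 a single child item.
levels = ["Fonds", "Series", "File", "Series", "File", "Item", "File"]
pairs = [(1, 2), (4, 5)]

# One decision for the whole dataset.
print(rows_to_drop(pairs, levels, default="delete_child"))
# Level by level: keep single child files, delete single child items.
print(rows_to_drop(pairs, levels, per_level={"File": "keep", "Item": "delete_child"}))
```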
Depending on how the user chooses to proceed, ReG will produce one of three results, each of which affects the rows that remain and the structure of the generated references. In this example, the third series in the original dataset contains a single child: a single file.
The most notable result is option B, where the parent was deleted. At face value, the single child now appears to be a sibling of the files from the second series. But the reference indicates that this file is part of a different branch within the tree structure.
This is more clearly illustrated by the following three tree diagrams.
This functionality means that ReG will help you spot any single child records that you may otherwise have been unaware of.
But it also gives you a means of creating an appropriate hierarchical structure when cataloguing in a spreadsheet. If you intentionally insert dummy parents for otherwise single child records, ReG can generate references that map the appropriate tree structure and then remove the dummy parent records in one seamless process.
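The reason a deleted parent still leaves its mark on the child's reference is that references are generated from the full tree before any rows are removed. The sketch below illustrates that ordering only. The reference format (a prefix followed by slash-separated sibling numbers), the prefix `ABC`, and the function name are assumptions and may not match the format ReG is configured to produce.

```python
def generate_references(parents, prefix="ABC"):
    """Build a reference for each row from its parent's reference and sibling position."""
    refs = []
    counters = {}  # number of children seen so far for each parent
    for parent in parents:
        if parent is None:
            refs.append(prefix)
        else:
            counters[parent] = counters.get(parent, 0) + 1
            refs.append(f"{refs[parent]}/{counters[parent]}")
    return refs


# Hypothetical dataset: one fonds, three series; the third series has a single file.
levels = ["Fonds", "Series", "File", "File", "Series", "File", "File", "Series", "File"]
parents = [None, 0, 1, 1, 0, 4, 4, 0, 7]  # index of each row's parent

refs = generate_references(parents)

# Delete the parent of the single child (row index 7) after references exist.
remaining = [(levels[i], refs[i]) for i in range(len(levels)) if i != 7]
files = [ref for level, ref in remaining if level == "File"]
print(files)
```

In the remaining rows, the last file directly follows the files of the second series, but its reference still begins with `ABC/3`, placing it in a separate branch. A dummy parent inserted on purpose would behave in the same way.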