MetaArchive QA Tools - MetaArchive/public-documentation GitHub Wiki
These scripts help validate the quality of files in a collection before it is preserved. The scripts are available here: https://github.com/MetaArchive/metaarchive-qa-tools
The find-bad-files.py script identifies hidden files and flags file names that contain problematic characters such as spaces:
- Prior to creating a Bag of collection data, it is important to isolate any hidden files, config files, .DS_Store files, tilde-prefixed backups, and other likely-undesired files whose contents are often modified on endpoints during a transfer or ingest.
- Inclusion of these files and their checksums in a BagIt manifest file will result in a failed validation of the Bag at some future point.
- This is because each endpoint will modify such files for its own localized management purposes. When the contents change, the checksums change, and the BagIt Library will flag these mismatches during a future export/recovery.
- In addition to isolating problematic system files, the find-bad-files.py script flags file names that contain spaces so that the spaces can be replaced with underscores, in accordance with best practices for file naming conventions.
- Of special significance, because LOCKSS uses HTTP to retrieve Bagged collection files at a URL, any spaces in file names will be URL-encoded (a space becomes %20), and the file will be stored with inserted characters that were not present in the file name at the source URL.
- Upon an export of Bagged collections these encodings will be present and differ from what was recorded in the original BagIt manifest.
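The effect of URL encoding on a file name can be seen with a minimal Python sketch. The file name below is hypothetical, and the standard library's `urllib.parse.quote` is used only to illustrate percent-encoding in general; it is not a description of how LOCKSS itself encodes names.

```python
from urllib.parse import quote, unquote

# Hypothetical file name containing spaces.
original = "annual report 2012.pdf"

# Percent-encode the name as it would appear in a URL path.
encoded = quote(original)

print(encoded)                       # annual%20report%202012.pdf
print(encoded == original)           # False: the stored name no longer matches
print(unquote(encoded) == original)  # True: decoding recovers the original
```

A name that already follows the rules, such as `annual_report_2012.pdf`, passes through `quote` unchanged, which is why replacing spaces with underscores avoids the mismatch.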
The find-bad-files.py script is used during Phase 2, Step 2 of Getting Started with BagIt for MetaArchive.
The find-bad-files.py script can be obtained from our MetaArchive GitHub repository here: https://github.com/MetaArchive/metaarchive-qa-tools. Usage is documented in the README, which is also provided below.
This tool will recursively scan a directory for filenames that violate a set of naming standards meant to prevent problems when ingesting collections into LOCKSS over HTTP. The relative paths of matching files will be printed to standard output. No files are moved or modified by this tool.
If -v or --verbose is provided, the reasons for the results being matched will be printed with the output.
python find-bad-files.py [-h|--help] [-v|--verbose] <directory>
- Characters must be URL-safe. For our purposes, we strictly limit the characters present in filenames to letters, numbers, dots (.), hyphens (-), and underscores (_).
- Filenames must start with a letter or number. This prevents inclusion of various hidden files, config files, .DS_Store files, tilde-prefixed backups, and other likely-undesired files.
- Filenames must not equal "Thumbs.db".
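The three rules above can be sketched in Python as follows. This is an illustration only, not the code of find-bad-files.py: the function names `violations` and `scan` are hypothetical, "letters" and "numbers" are assumed to mean ASCII letters and digits, and only file names (not directory names) are checked.

```python
import os
import re

def violations(filename):
    """Return the reasons a file name breaks the rules (empty list if none)."""
    reasons = []
    # Rule 1: only letters, numbers, dots, hyphens, and underscores (ASCII assumed).
    if not re.fullmatch(r"[A-Za-z0-9._-]+", filename):
        reasons.append("contains characters that are not URL-safe")
    # Rule 2: the first character must be a letter or number.
    if not re.match(r"[A-Za-z0-9]", filename):
        reasons.append("does not start with a letter or number")
    # Rule 3: the Windows thumbnail cache file is never wanted.
    if filename == "Thumbs.db":
        reasons.append('equals "Thumbs.db"')
    return reasons

def scan(directory):
    """Yield (relative path, reasons) for each offending file; nothing is modified."""
    for root, _dirs, files in os.walk(directory):
        for name in files:
            reasons = violations(name)
            if reasons:
                yield os.path.relpath(os.path.join(root, name), directory), reasons
```

For example, `violations("my file.txt")` reports the character rule, `violations(".DS_Store")` reports the first-character rule, and `violations("report_2012.pdf")` returns an empty list.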
This tool allows you to compare a HashCUS.txt manifest generated by LOCKSS with a BagIt MD5 manifest from the same title to check for any discrepancies.
Once you have your HashCUS.txt and manifest-md5.txt files ready, use either the HTML/JS graphical tool or the Python command-line interface script provided to run the comparison.
- Open the lockss-manifest-validate.html file using Firefox, Chrome, IE, or any reliable web browser not listed here.
- Use the form on the page to select your HashCUS.txt and manifest-md5.txt files from their location on your hard drive.
- Click Compare. You will soon see an alert window indicating how many records were compared, and how many errors were found.
- Click OK. In the 'Output' box, you will see a detailed log with information about which records had errors, as well as some additional statistics.
When invoked, the lockss-manifest-validate.py script will output a log (similar to the one written by the HTML/JS version of the tool) containing a detailed report of the comparison results.
python lockss-manifest-validate.py [-h|--help] <HashCUS> <manifest-md5>
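The kind of comparison this tool performs can be sketched in Python as shown below. This is not the logic of lockss-manifest-validate.py. The function names are hypothetical, the BagIt manifest is assumed to hold one `<checksum> <path>` pair per line, and because the layout of HashCUS.txt is not described here, the sketch assumes its records have already been reduced to a dictionary mapping each path to its MD5 checksum, keyed the same way as the BagIt manifest.

```python
def parse_bagit_manifest(text):
    """Map each file path to its checksum, assuming '<checksum> <path>' lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first run of whitespace only, so paths may contain spaces.
        checksum, path = line.split(None, 1)
        entries[path] = checksum.lower()
    return entries

def compare(bag, lockss):
    """Return (missing, unexpected, mismatched) paths between two path->MD5 dicts."""
    missing = sorted(set(bag) - set(lockss))       # in the Bag, not in LOCKSS
    unexpected = sorted(set(lockss) - set(bag))    # in LOCKSS, not in the Bag
    mismatched = sorted(
        path for path in set(bag) & set(lockss) if bag[path] != lockss[path]
    )
    return missing, unexpected, mismatched

# Hypothetical sample data for illustration only.
manifest_text = (
    "d41d8cd98f00b204e9800998ecf8427e  data/a.txt\n"
    "900150983cd24fb0d6963f7d28e17f72  data/b.txt\n"
)
lockss_records = {
    "data/a.txt": "d41d8cd98f00b204e9800998ecf8427e",
    "data/b.txt": "ffffffffffffffffffffffffffffffff",
    "data/c%20d.txt": "900150983cd24fb0d6963f7d28e17f72",
}

missing, unexpected, mismatched = compare(
    parse_bagit_manifest(manifest_text), lockss_records
)
print(missing, unexpected, mismatched)
```

In this sample, `data/b.txt` is reported as mismatched and `data/c%20d.txt` as unexpected, the latter being the kind of entry that URL-encoded file names would produce.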