History of Ingestion into Figgy - pulibrary/geniza GitHub Wiki
- First Figgy Issue posted. Files had been copied to the Isilon (by Roel Munoz?) and comprised two directories:
  - pudl/geniza
  - pudl/geniza2
A preliminary analysis of these directories suggested they had been dumped from folders on one or more PC hard drives: the folder names suggested they were themselves agglomerations of downloads from some other source. This raw data set was enormous: over 3 terabytes comprising 20,000 tif files and 38,000 dng (digital negative) files. The full provenance of these files has never been established.
The files constitute a genizah of their very own. The directories appear to be hard-drive dumps, with haphazard directory names. Most filenames adhere to a shelf-mark-based naming convention. About half are tifs; the other half are .dng.
One subset of files is different. NEH_Geniza comprises 5 deliveries in six directories. Some directories contain items organized by volume; each of these is subdivided into images of individual fragments and images of sheets. Others contain tiffs of MSS pages (bound and unbound), and one directory contains images of oversized items, subdivided into various stages of stitching. These must be processed differently from the rest.
There were two spreadsheets (also of unknown provenance) containing some metadata: one had been compiled for a Magic Grant; the other was some sort of working list from a project at the McGraw Center.
- First-pass file conversion and re-arrangement. Shell scripts were generated (via a Python script) to convert the dng files to tif using dcraw. The shell commands were of the form:
mkdir -p /mnt/diglibdata/pudl/gniza_working/converted/ena_2709_to_ena_3234
dcraw -c -v -T -w -o 2 ena_2709_to_ena_3234/ENA_3022_ruler.dng > /mnt/diglibdata/pudl/gniza_working/converted/ena_2709_to_ena_3234/ENA_3022_ruler.tiff
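The original generator script is not reproduced here. As a rough illustration only, a Python script that emits command pairs of the form shown above might look like the following sketch. The function names `conversion_commands` and `build_script` are hypothetical, and the sketch assumes each dng path is given relative to the source directory; the destination root and the dcraw flags are copied from the example command.

```python
import shlex
from pathlib import PurePosixPath

# Destination root copied from the example command; everything else is illustrative.
CONVERTED_ROOT = PurePosixPath("/mnt/diglibdata/pudl/gniza_working/converted")


def conversion_commands(dng_path, converted_root=CONVERTED_ROOT):
    """Return the (mkdir, dcraw) shell commands for one dng file.

    dng_path is assumed to be relative to the source directory, e.g.
    "ena_2709_to_ena_3234/ENA_3022_ruler.dng".
    """
    src = PurePosixPath(dng_path)
    # Mirror the source sub-directory under the converted root.
    out_dir = converted_root / src.parent
    out_file = out_dir / (src.stem + ".tiff")
    mkdir_cmd = "mkdir -p " + shlex.quote(str(out_dir))
    # dcraw writes the TIFF to stdout (-c -T), so redirect it to the target file.
    dcraw_cmd = "dcraw -c -v -T -w -o 2 {} > {}".format(
        shlex.quote(str(src)), shlex.quote(str(out_file))
    )
    return mkdir_cmd, dcraw_cmd


def build_script(dng_paths):
    """Join the command pairs for many dng files into one shell script."""
    lines = ["#!/bin/sh"]
    for path in sorted(dng_paths):
        lines.extend(conversion_commands(path))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_script(["ena_2709_to_ena_3234/ENA_3022_ruler.dng"]), end="")
```

In the dcraw invocation, `-c` writes to standard output, `-v` prints progress messages, `-T` produces TIFF rather than PPM, `-w` applies the camera white balance, and `-o 2` selects the Adobe RGB output colorspace. Quoting paths with `shlex.quote` is a precaution added in this sketch for filenames containing spaces or shell metacharacters.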