History of Ingestion into Figgy - pulibrary/geniza Wiki



  • First Figgy Issue posted. Files had been copied to the Isilon (by Roel Munoz?) and comprised two directories:
    1. pudl/geniza
    2. pudl/geniza2

    A preliminary analysis of these directories suggested they had been dumped from folders on one or more PC hard drives; the folder names indicated that they were themselves agglomerations of downloads from some other source. This raw data set was enormous: over 3 terabytes, comprising 20,000 tif files and 38,000 dng (digital negative) files. The full provenance of these files has never been established.

    The files constitute a genizah of their very own. The directories appear to be hard-drive dumps, with haphazard directory names. Most filenames adhere to a shelf-mark-based naming convention. About half are tifs; the other half are .dng.

    One subset of files is different. NEH_Geniza comprises 5 deliveries in six directories. Some directories contain items organized by volume; each of these directories is subdivided into images of individual fragments and images of sheets. Others are tiffs of MSS pages (bound and unbound); one directory contains images of oversized items, subdivided into various stages of stitching. These must be processed differently from the rest.

    There were two spreadsheets (also of unknown provenance) containing some metadata: one had been compiled for a Magic Grant; the other was some sort of working list from a project at the McGraw Center.


  • First-pass file conversion and re-arrangement. Shell scripts were generated (via a Python script) to convert dng files to tif using dcraw. The shell commands were of the form:
    mkdir -p /mnt/diglibdata/pudl/gniza_working/converted/ena_2709_to_ena_3234
    dcraw -c -v -T -w -o 2 ena_2709_to_ena_3234/ENA_3022_ruler.dng > /mnt/diglibdata/pudl/gniza_working/converted/ena_2709_to_ena_3234/ENA_3022_ruler.tiff
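
The original generator script is not preserved on this page. The following is only a minimal sketch of how commands of that form could be produced, not the script actually used. The function name `dcraw_commands` and its parameters are hypothetical; the dcraw flags are copied from the example above (`-c` writes to standard output, `-v` is verbose, `-T` writes TIFF, `-w` uses the camera white balance, `-o 2` selects the Adobe RGB output colorspace). It assumes the generated script is run from the source root, since the example uses a relative input path.

```python
import shlex
from pathlib import Path


def dcraw_commands(source_root, dest_root):
    """Return shell command lines converting each .dng under source_root to .tiff.

    Hypothetical helper: input paths are written relative to source_root,
    so the resulting script is assumed to be run from that directory.
    """
    source_root = Path(source_root)
    dest_root = Path(dest_root)
    lines = []
    made_dirs = set()
    # rglob is case-sensitive on Linux, so files named *.DNG would be missed.
    for dng in sorted(source_root.rglob("*.dng")):
        rel = dng.relative_to(source_root)
        out_dir = dest_root / rel.parent
        # Emit one mkdir per output directory, mirroring the source layout.
        if out_dir not in made_dirs:
            lines.append(f"mkdir -p {shlex.quote(str(out_dir))}")
            made_dirs.add(out_dir)
        out_file = out_dir / (dng.stem + ".tiff")
        # shlex.quote guards against spaces or shell metacharacters in names.
        lines.append(
            f"dcraw -c -v -T -w -o 2 {shlex.quote(str(rel))} "
            f"> {shlex.quote(str(out_file))}"
        )
    return lines
```

Writing the returned lines to a file, one per line, would give a shell script that can be reviewed before it is run, which seems a reasonable precaution for a data set of this size.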