file organization - pulibrary/geniza Wiki

File Arrangement

The files in the cairogeniza filesystem have been organized according to arrangements and shelfmark conventions found in several references:

  • Lerner, Heidi G., and Seth Jerchower. “The Penn/Cambridge Genizah Fragment Project: Issues in Description, Access, and Reunification.” Cataloging & Classification Quarterly 42, no. 1 (May 8, 2006): 21–39. https://doi.org/10.1300/J104v42n01_04.
  • Choueka, Yaacov. “Computerizing the Cairo Genizah: Aims, Methodologies and Achievements.” Ginzei Qedem, no. 8 (2012): 9–30.
  • Zinger, Oded. “Finding a Fragment in a Pile of Geniza: A Practical Guide to Collections, Editions, and Resources.” Jewish History 32, no. 2–4 (December 2019): 279–309. https://doi.org/10.1007/s10835-019-09314-6.

The files came to us in several distinct tranches, having been digitized and assembled through several processes at several institutions. Each tranche had its own organization and file-naming convention. We were asked to determine the shelfmark for each object from those varied (and often inconsistent) conventions, so that the shelfmark could be associated with the digital object at ingest.

We wrote several scripts to regularize the names and to organize the objects hierarchically. Here are some of the pattern-matching and renaming rules we compiled (there were considerably more):

    add_rule(/^(?<lib>ENA|NS|MS)_(?<id>[^_]+)_0*(?<leaf>\d+)_[rv]\.tiff?$/,
             ->(m) { "#{m[:lib]} #{m[:id]}.#{m[:leaf]}" },
             ->(m) { File.join(m[:lib], m[:id], m[:leaf].rjust(3, "0")) })

    add_rule(/^(?<lib>ENA|NS|MS)_(?<id>[^_]+)_ruler\.tiff?$/,
             ->(m) { "" },
             ->(m) { File.join(m[:lib], m[:id]) })

    add_rule(/^(?<lib>ENA|NS|MS)_(?<id>[^_]+)_0*(?<leaf>\d+)\.tiff?$/,
             ->(m) { "#{m[:lib]} #{m[:id]}.#{m[:leaf]}" },
             ->(m) { File.join(m[:lib], m[:id], m[:leaf].rjust(3, "0")) })

    add_rule(/^(?<lib>ENA|NS|MS)_(?<series>[^_]+)_(?<id>[^_]+)_0*(?<leaf>\d+)_[rv]\.tiff?$/,
             ->(m) { "#{m[:lib]} #{m[:series]} #{m[:id]}.#{m[:leaf]}" },
             ->(m) { File.join(m[:lib], m[:series], m[:id], m[:leaf].rjust(3, "0")) })

    add_rule(/^(?<lib>ENA|NS|MS)_(?<series>[^_]+)_(?<id>[^_]+)_ruler\.tiff?$/,
             ->(m) { "" },
             ->(m) { File.join(m[:lib], m[:series], m[:id]) })

    add_rule(/^(?<lib>ENA|NS|MS)_(?<id>[^_]+)_(?<sub>[AB])_0*(?<leaf>\d+)\.tiff?$/,
             ->(m) { "#{m[:lib]} #{m[:id]}.#{m[:sub]}.#{m[:leaf]}" },
             ->(m) { File.join(m[:lib], m[:id], m[:sub]) })

    add_rule(/^(?<lib>ENA|NS|MS)_(?<id>[^_]+)_(?<sub>[AB])_ruler\.tiff?$/,
             ->(m) { "" },
             ->(m) { File.join(m[:lib], m[:id], m[:sub]) })

The rules below take two arguments rather than three: a pattern for an already regularized name (an example precedes most rules as a comment) and a lambda that builds the target directory under `dest`.

    # ENA 2826
    add_rule(/^(?<lib>ENA|NS)\s+(?<id>\d+)$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0")) })

    # ENA 2056.1
    add_rule(/^(?<lib>ENA|NS)\s+(?<id>\d+)\.0*(?<leaf>\d+(-\d+)?)$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0")) })

    # ENA 1822A.1
    add_rule(/^(?<lib>ENA|NS)\s+(?<id>\d+[A-Za-z])\.(?<leaf>\d+\w?)$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0")) })

    # ENA 4096a
    add_rule(/^(?<lib>ENA)\s+(?<id>\d+[a-z]\d?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0")) })

    # ENA 1235.17.sleeve2
    add_rule(/^(?<lib>ENA)\s+(?<id>\d+)\.0*(?<leaf>\d+)\.(?<rest>.*)$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0"), m[:rest]) })

    # ENA 2727.18d
    add_rule(/^(?<lib>ENA)\s+(?<id>\d+)\.0*(?<leaf>\d+\w*)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0")) })

    # ENA 4096.e1
    add_rule(/^(?<lib>ENA)\s+(?<id>\d+)\.0*(?<leaf>\w+\d*)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0")) })

    add_rule(/^(?<lib>ENA|NS)\s+(?<id>\w?\d+)uler.*?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0")) })

    add_rule(/^(?<lib>ENA)\s+(?<id>\d+)\.(?<sub>[A-Z])\.(?<leaf>\d+)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:sub], m[:leaf].rjust(6, "0")) })

    add_rule(/^(?<lib>ENA)\s+(?<id>\d+)\.(?<sub>[A-Z])uler.*?$/,
             ->(m) { File.join(dest, m[:lib], m[:id], m[:sub]) })

    add_rule(/^(?<lib>ENA)\s+(?<sub>NS)\s+(?<id>I)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:sub], m[:id]) })

    add_rule(/^(?<lib>ENA)\s+(?<sub>NS)\s+(?<id>I)\.0*(?<leaf>\d+\w?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:sub], m[:id], m[:leaf].rjust(6, "0")) })

    add_rule(/^(?<lib>ENA)\s+(?<sub>NS)\s+(?<id>\d+)\.0*(?<leaf>\d+\w?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:sub], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0")) })

    # ENA NS 13.1-2
    add_rule(/^(?<lib>ENA)\s+(?<sub>NS)\s+(?<id>\d+)\.\.?0*(?<leaf>\d+(-\d+)?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:sub], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0")) })

    # ENA NS 60
    add_rule(/^(?<lib>ENA)\s+(?<sub>NS)\s+(?<id>\d+\w?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:sub], m[:id].rjust(6, "0")) })

    add_rule(/^(?<lib>Schechter|KE|Krengel)\.(?<id>\d+\w?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0")) })

    add_rule(/^(?<lib>Schechter|KE|Krengel)uler\.(?<id>\d+\w?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0")) })

    # MS.L501.19; MS.4607.6
    add_rule(/^(?<lib>MS)\.(?<id>[A-Z]?\d+\w?)\.0*(?<leaf>\d+\w?(\.\d+)?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:leaf].rjust(6, "0")) })

    # MS.L596uler.1
    add_rule(/^(?<lib>MS)\.(?<id>\w?\d+)uler.*?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0")) })

    # MS.L590.Vol.1.1
    add_rule(/^(?<lib>MS)\.(?<id>\w?\d+)\.(?<vol>Vol\.\d)\.(?<leaf>\d+)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), m[:vol], m[:leaf].rjust(6, "0")) })

    # MS.10809
    add_rule(/^(?<lib>MS)\.(?<id>\d+\w?)(uler.*)?$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0")) })

    # MS.10674..fol..5
    add_rule(/^(?<lib>MS)\.(?<id>[A-Z]?\d+\w?)\.+fol\.+(?<fol>\d+\w*)$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), "fol" + m[:fol]) })

    # MS.L590.Vol..3..fol..14
    add_rule(/^(?<lib>MS)[_.](?<id>[A-Z]?\d+\w?)[_.]+Vol[_.]+(?<vol>\d+)[_.]+fol[_.]+(?<fol>\d+)$/,
             ->(m) { File.join(dest, m[:lib], m[:id].rjust(6, "0"), "Vol" + m[:vol], "fol" + m[:fol]) })

In general, the names could be broken down into library, series, subcollection, identifier, and leaf. Not all names contained all these elements; many contained additional characters. We iterated over these names, writing matching rules, until we could match every filename.
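
The page does not show how `add_rule` is defined or how the rules were applied, so the following is only a minimal sketch of one way the iteration could work. It assumes the three-argument form (pattern, shelfmark lambda, path lambda) and a first-match-wins policy; `RULES`, `classify`, and `unmatched` are hypothetical names introduced for illustration.

```ruby
# Hypothetical rule registry; the project's real add_rule is not shown on this page.
RULES = []

# Register a filename pattern with two lambdas: one builds the shelfmark,
# the other builds the relative directory path.
def add_rule(pattern, shelfmark, path)
  RULES << { pattern: pattern, shelfmark: shelfmark, path: path }
end

# Return [shelfmark, path] from the first rule that matches, or nil if none does.
def classify(filename)
  RULES.each do |rule|
    m = rule[:pattern].match(filename)
    return [rule[:shelfmark].call(m), rule[:path].call(m)] if m
  end
  nil
end

# Names that no rule matches yet; reviewing this list suggests the next rule to write.
def unmatched(filenames)
  filenames.reject { |name| classify(name) }
end

# One rule copied from the list above.
add_rule(/^(?<lib>ENA|NS|MS)_(?<id>[^_]+)_0*(?<leaf>\d+)_[rv]\.tiff?$/,
         ->(m) { "#{m[:lib]} #{m[:id]}.#{m[:leaf]}" },
         ->(m) { File.join(m[:lib], m[:id], m[:leaf].rjust(3, "0")) })

names = ["ENA_2056_001_r.tiff", "ENA_2056_ruler.tiff"]
p classify(names[0])   # => ["ENA 2056.1", "ENA/2056/001"]
p unmatched(names)     # => ["ENA_2056_ruler.tiff"]
```

With only this one rule registered, the ruler image is reported as unmatched, which is the cue to add the corresponding `_ruler` rule and run the pass again.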

At the bottom level was always a directory named with the object identifier (the fragment number or, occasionally, the page number), containing two image files: one recto and one verso.
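
A check like the following could confirm that a leaf directory holds the expected pair. It is a sketch only: `recto_verso_pair?` is a hypothetical helper, and it assumes the `_r` / `_v` filename suffixes seen in the examples on this page.

```ruby
# Hypothetical helper: true when the list holds exactly one recto and one verso image.
# Assumes filenames end in _r.tif(f) or _v.tif(f), as in the examples on this page.
def recto_verso_pair?(filenames)
  # Extract "r" or "v" from each name; names without the suffix yield nil.
  sides = filenames.map { |name| name[/_([rv])\.tiff?\z/, 1] }
  filenames.size == 2 && sides.compact.sort == %w[r v]
end

p recto_verso_pair?(["ENA_2056_001_r.tif", "ENA_2056_001_v.tif"])  # => true
p recto_verso_pair?(["ENA_2056_001_r.tif"])                        # => false
```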

The transformation rules added leading zeros to numeric subdirectories so that they would sort properly.
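
A small illustration of why the padding matters, using the same `rjust(6, "0")` call that appears in the rules; `pad_component` is a hypothetical wrapper and the sample identifiers are made up for the example.

```ruby
# Hypothetical wrapper around the rjust call used in the rules.
def pad_component(name, width = 6)
  name.rjust(width, "0")
end

ids = %w[10 2 1822 93]

# A plain string sort compares character by character, so "10" sorts before "2".
p ids.sort                                  # => ["10", "1822", "2", "93"]

# Padding to a fixed width makes lexicographic order agree with numeric order.
p ids.map { |id| pad_component(id) }.sort   # => ["000002", "000010", "000093", "001822"]
```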

Here are some examples:

- ENA_2056_001_r.tiff -> ENA/002056/000001/ENA_2056_001_r.tif

- ENA_NS_I_93a_v.tiff -> ENA/NS/I/00093a/ENA_NS_I_093a_v.tif