Home - Git-Lit/git-lit GitHub Wiki

Welcome to the git-lit wiki!

We'll add more structure to this later, but for now it's just a single page to record various interesting/useful bits of information.

Developer Tips

Quickstart

git clone git@github.com:Git-Lit/git-lit.git
cd git-lit
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

If you use MacPorts and get a complaint about an old version of the libxml2 dynamic library when importing lxml, the following command may help:

export DYLD_LIBRARY_PATH=/opt/local/lib

This can break other things (such as git), however, so a better solution is to prefix only the commands that need it with the definition, e.g.:

DYLD_LIBRARY_PATH=/opt/local/lib python stats.py -r data
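The same per-command scoping can be done from a Python script. The following is only a minimal sketch: the helper name run_with_macports_libs is hypothetical and not part of git-lit, and it assumes the MacPorts libraries live in /opt/local/lib as in the command above.

```python
import os
import subprocess
import sys


def run_with_macports_libs(args, lib_dir="/opt/local/lib"):
    """Run one command with DYLD_LIBRARY_PATH set for that child process only.

    Hypothetical helper, not part of git-lit. The parent environment is
    copied, so the calling shell or script is left unchanged.
    """
    env = dict(os.environ)
    env["DYLD_LIBRARY_PATH"] = lib_dir
    return subprocess.run(args, env=env, check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # Equivalent in spirit to the shell one-liner above.
    result = run_with_macports_libs([sys.executable, "stats.py", "-r", "data"])
    print(result.stdout)
```

Because the variable is set on a copy of the environment, other tools started from the same session (git, for instance) never see it.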

File naming conventions

The zip files, as delivered from the British Library, live in the data directory (samples only) and the structure looks like:

data/000000037/000000037_0_1-42pgs__944211_dat.zip
data/000000196/000000196_0_1-164pgs__1031646_dat.zip
data/000000206/000000206_0_1-256pgs__594984_dat.zip
data/000000216/000000216_1_1-318pgs__632698_dat.zip

The file name format is:

{book id}_{volume}_{version?}-{page count}pgs__{?unknown?}_dat.zip

  • book id is the Aleph system number (sysnum) of the catalog record for the original. This is different from the sysnum associated with the catalog record for the electronic resource created by the scanning.
  • volume is 0 for single volume editions or 1-N for N volume editions
  • version is always 1. My guess is that it's to allow for rescans, but this should be confirmed with BL.
  • page count is per volume
  • unknown has not been identified yet; it doesn't appear to be a length or a date
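The format above can be sketched as a regular expression. This is an illustration only, not code from the repository: the names ZIP_NAME, ZipName, and parse_zip_name are made up for this example, and the pattern assumes every field is a run of digits, as in the four sample names.

```python
import re
from typing import NamedTuple, Optional

# Assumption: all fields are plain digit runs, as in the sample file names.
ZIP_NAME = re.compile(
    r"^(?P<book_id>\d+)_(?P<volume>\d+)_(?P<version>\d+)"
    r"-(?P<page_count>\d+)pgs__(?P<unknown>\d+)_dat\.zip$"
)


class ZipName(NamedTuple):
    book_id: str      # kept as a string to preserve leading zeros
    volume: int       # 0 for single-volume editions, 1-N otherwise
    version: int      # always 1 in the samples seen so far
    page_count: int   # pages in this volume
    unknown: str      # meaning not yet identified, so left uninterpreted


def parse_zip_name(path: str) -> Optional[ZipName]:
    """Parse a British Library zip name; return None if it does not match."""
    name = path.rsplit("/", 1)[-1]  # drop any leading directories
    m = ZIP_NAME.match(name)
    if m is None:
        return None
    return ZipName(m["book_id"], int(m["volume"]), int(m["version"]),
                   int(m["page_count"]), m["unknown"])
```

For example, parse_zip_name("data/000000216/000000216_1_1-318pgs__632698_dat.zip") gives book id "000000216", volume 1, version 1, and a page count of 318.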

The zip file contains a {book id}_metadata.xml file at the top level, which holds limited metadata in MODS format. The OCR text is in the ALTO subdirectory, using one of the following naming schemes:

ALTO/000000216_01_000001.xml
ALTO/000000216_01_000002.xml

ALTO/000000206_000001.xml
ALTO/000000206_000002.xml
ALTO/000000206_000003.xml

The first example is volume 1 of a multi-volume scan and the second is a single volume scan.
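Both ALTO naming schemes can be handled by one pattern with an optional volume field. The sketch below is an assumption-laden illustration rather than project code: parse_alto_name and list_alto_pages are hypothetical names, and the pattern assumes the page number is always six digits, as in the examples above.

```python
import re
import zipfile
from typing import List, Optional, Tuple

# Assumption: the page number is always six digits; the volume part is
# present only for multi-volume scans.
ALTO_NAME = re.compile(
    r"^ALTO/(?P<book_id>\d+)_(?:(?P<volume>\d+)_)?(?P<page>\d{6})\.xml$"
)


def parse_alto_name(name: str) -> Optional[Tuple[str, Optional[int], int]]:
    """Return (book_id, volume or None, page) for an ALTO member name."""
    m = ALTO_NAME.match(name)
    if m is None:
        return None
    volume = int(m["volume"]) if m["volume"] is not None else None
    return m["book_id"], volume, int(m["page"])


def list_alto_pages(archive: zipfile.ZipFile) -> List[str]:
    """List the ALTO page files in an open zip archive, in page order."""
    pages = [n for n in archive.namelist() if parse_alto_name(n) is not None]
    return sorted(pages, key=lambda n: parse_alto_name(n)[2])
```

A caller would open one of the delivered archives with zipfile.ZipFile(path) and pass it to list_alto_pages; the {book id}_metadata.xml file at the top level is skipped because it does not match the pattern.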