Home - AdventistMediaMinistries/KeyfilePaperlessDocumentStorageExtraction GitHub Wiki

Introduction

The project consists of a Ruby gem, a handful of Ruby utilities, and a library of code that the utilities use. The library could also serve as the building blocks for additional extractions.

Utilities

Prerequisites: For best performance, a recent version of JRuby is recommended. JRuby is capable of true multithreading, which matters when dealing with millions of files. The software should also work under stock Ruby, but unless the database is very small, extractions will be much slower. For example, on one particular dataset, system Ruby took more than 8 hours, while JRuby took less than 3.
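To illustrate why true multithreading helps here, the following is a minimal sketch of spreading per-file work across a fixed pool of threads. It is not the project's actual code: the method name `process_in_parallel`, the `worker_count` parameter, and the block passed in are all hypothetical. Under JRuby the workers can run on separate cores; under stock Ruby the same code runs, but CPU-bound work is serialized by the interpreter lock.

```ruby
# Hypothetical helper, not part of the gem: run a block over every path
# using a fixed number of worker threads and collect the results.
def process_in_parallel(paths, worker_count: 4, &block)
  queue = Queue.new
  paths.each { |path| queue << path }
  queue.close # once closed and drained, pop returns nil and workers stop

  results = Queue.new
  workers = Array.new(worker_count) do
    Thread.new do
      while (path = queue.pop)
        results << [path, block.call(path)]
      end
    end
  end
  workers.each(&:join)

  # Drain the thread-safe results queue into a { path => result } hash.
  Array.new(results.size) { results.pop }.to_h
end
```

A call might look like `process_in_parallel(paths, worker_count: 8) { |path| File.size(path) }`, with the block standing in for whatever per-file extraction step is needed.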

diff

Use to subtract one index file from another.
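As a rough sketch of what subtracting index files means, assume an index file is plain text with one entry per line (this wiki page does not specify the format, so that is an assumption). The method name `subtract_index` is hypothetical and is not the utility's real interface.

```ruby
require "set"

# Hypothetical sketch: return the entries of the first index file that do
# not appear in the second. Assumes one entry per line.
def subtract_index(minuend_path, subtrahend_path)
  exclude = File.readlines(subtrahend_path).map(&:chomp).to_set
  File.readlines(minuend_path).map(&:chomp).reject { |entry| exclude.include?(entry) }
end
```

The second file is loaded into a set so each lookup is constant time, which matters when an index lists millions of entries.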

pf

Extract a folder of documents into a date-based folder structure, without regard to the original document organization. Multi-page documents are kept together, but there is no other structure.
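A date-based layout can be pictured with the small sketch below. The year/month/day nesting, the method name `dated_destination`, and the example file name are assumptions made for illustration; the layout pf actually produces may differ.

```ruby
require "date"

# Hypothetical sketch: build an output path of the form
# root/YYYY/MM/DD/filename from a document's date.
def dated_destination(root, date, filename)
  File.join(root, date.strftime("%Y"), date.strftime("%m"), date.strftime("%d"), filename)
end
```

For example, `dated_destination("out", Date.new(2014, 3, 7), "doc.tif")` gives `out/2014/03/07/doc.tif`.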

pm

Extract a top-level metadata file and everything under it. Document organization is preserved.

ex

Extract everything referenced by the passed index file.

sherlock

A general-purpose utility that can:

  • Decode single metadata or index files (same as pm above)
  • Filter various Keyfile types from an indiscriminate list of files
  • Search a folder of Keyfile data for files containing a particular byte pattern (produces a list)
  • Make a list of every Keyfile document in a particular folder (this is how we get the index files referenced above)
  • Verify that all the files listed in an index file actually exist
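Two of the tasks above, the byte-pattern search and the index verification, can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions: the method names `files_containing` and `missing_files` are hypothetical, the index file is assumed to hold one path per line, and each file is read fully into memory, which may not suit very large files.

```ruby
# Hypothetical sketch: list every regular file under `folder` whose raw
# bytes contain `pattern`.
def files_containing(folder, pattern)
  needle = pattern.b # compare as raw bytes, not as encoded text
  Dir.glob(File.join(folder, "**", "*")).select do |path|
    File.file?(path) && File.binread(path).include?(needle)
  end
end

# Hypothetical sketch: return the entries of an index file (assumed to be
# one path per line) that do not exist on disk.
def missing_files(index_path)
  File.readlines(index_path)
      .map(&:chomp)
      .reject(&:empty?)
      .reject { |path| File.exist?(path) }
end
```

An empty array from `missing_files` would mean every path listed in the index was found.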

Source Code

Click for a high-level description of the source code and its objects.