Building a signature file with ROY
This guide describes how to use the roy tool to build custom signature files.
Note: When sf runs, it defaults to a standard signature file (usually default.sig). You can choose a custom signature file with the -sig flag, e.g. sf -sig custom.sig FILE.
Example
Getting started
Install
roy is bundled with the Windows releases. Simply copy the executable into a location in your PATH.
If you are on Ubuntu or OS X, roy is installed with the Homebrew and Ubuntu packages.
If you are on a different OS, you can compile roy yourself if you have Go installed. Just use go install github.com/richardlehane/siegfried/cmd/roy@latest (current Go releases require the @latest suffix when installing outside a module).
Setup
Home directory
In order to build a signature file, roy needs to know the location of the source signature files (e.g. DROID, DROID container, PRONOM reports, and Tika and freedesktop.org MIME-info files). It also needs to know where the signature file should be loaded from and saved to. The sf tool and roy share a home directory where all this information is normally located. You can find your home directory by invoking either tool with the -help flag. You can set a custom home directory for both tools with the -home flag.
Quick setup: Copy and extract the contents of the latest data.zip file on the releases page into your home directory. You can then skip the rest of this section and start modifying signature files.
DROID and container signature files
If you want to build your own PRONOM signature file, you'll need copies of recent DROID signature and container files in your home directory. You can find these signature files on the TNA's website: http://www.nationalarchives.gov.uk/aboutapps/pronom/droid-signature-files.htm. By default, the latest files in your home directory will be used. You can apply the -droid or -container flags to choose specific versions.
PRONOM reports
In normal usage, the DROID file is only used as a list of all current puids. The XML Report files from the TNA's PRONOM website are preferred as the primary sources for building the signature. To build a signature in this way you need copies of all these reports. The roy harvest command will download them for you into a reports directory within your home directory. This runs fairly quickly, but on a slow connection it may time out before completion. In that case, extend the limit with the -timeout flag (e.g. roy harvest -timeout 5m). You can select a different reports directory with the -reports flag.
You can, if you prefer, build a PRONOM signature file without reports, using only the DROID file. To do this, use the -noreports flag.
Apache Tika and freedesktop.org MIME-info signature files
If you want to build a MIME-info signature file, you'll need a copy of the latest signature files from Apache Tika or freedesktop.org.
Apache Tika MIME-info files (tika-mimetypes.xml) are available for download with the Tika source code: http://tika.apache.org/download.html.
The freedesktop.org MIME-info files are available at: https://freedesktop.org/wiki/Software/shared-mime-info/.
Some other software projects maintain their own, custom MIME-info signature files, for example UFRaw (UFRaw's MIME-info files can be found within the project's source code e.g. v0.22). You can use any valid MIME-info file as a source when building signatures.
Library of Congress FDDs
To build a Library of Congress FDD signature file, download the latest fddXML.zip from the Library of Congress website at: http://www.loc.gov/preservation/digital/formats/fdd/fdd_xml_info.shtml
Format sets (optional step)
Format sets are a convenience mechanism to support signature customisation with roy. The -limit and -exclude flags take comma-separated lists of format IDs to limit a signature to a selection of formats or to exclude a selection. The sets feature makes these flags a bit more functional by allowing commands like:
roy build -limit @pdf
or
roy build -exclude @pdfa,@pdfx,@pdfe
Sets can also be used with the -extend and -extendc flags for adding lists of format extensions to your signature. E.g. roy build -extend @exponential-decay,@archivematica.
The sets feature works like a macro: it looks at any JSON files in a sets directory (within your siegfried "home" directory) for definitions of format sets. Any formats with the '@' prefix are expanded to the contents of those sets. This expansion is recursive: you can include sets within larger sets. You can also refer to sets across set files (so you could create a separate 'office.json' file that has references to the pdf sets). Here is an example of a sets file for pdf.
To use this feature, you'll need to create that sets directory and add your own format sets there. You can also copy sets files from the siegfried repository (contributions welcome).
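As a rough illustration of the paragraph above, the sketch below creates a sets directory and writes a small sets file. The home directory path, the file name office.json, and the set names are all hypothetical, and the JSON layout (an object mapping each set name to a list of format IDs or '@'-prefixed set references) is an assumption based on the description above; check it against the sets files in the siegfried repository before relying on it.

```shell
#!/bin/sh
# Hypothetical home directory; substitute the one reported by "sf -help".
SF_HOME="${SF_HOME:-${TMPDIR:-/tmp}/siegfried-home}"
SETS_DIR="$SF_HOME/sets"
mkdir -p "$SETS_DIR"

# Assumed layout: a JSON object mapping each set name to a list of
# format IDs or "@"-prefixed references to other sets.
# The set names and the selection of puids are illustrative only.
cat > "$SETS_DIR/office.json" <<'EOF'
{
    "pdf-early": ["fmt/14", "fmt/15"],
    "pdfa-1": ["fmt/95", "fmt/354"],
    "office-docs": ["@pdf-early", "@pdfa-1"]
}
EOF

echo "wrote $SETS_DIR/office.json"
```

With a file like this in place, a command such as roy build -limit @office-docs would be expected to expand, recursively, to the four format IDs listed.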
Setup... in six steps
So, to recap, if you want to build your own signature file you need to:
- identify where your home directory is located (or select a custom one with the -home flag)
- copy DROID and container files from the TNA's website into that home directory
- invoke the roy harvest command to download PRONOM reports
- download the Apache Tika and freedesktop.org MIME-info files
- download the Library of Congress FDD signatures
- (optional) create a sets directory and create/copy format sets there.
Build
Once you've done all that, simply invoking roy build is enough to create a new signature file. This will build a default.sig file identical to the signature file distributed by the siegfried update service (sf -update). The default signature file contains a single identifier based on the latest release of the PRONOM database.
A MIME-info signature file
By default, the roy build command assumes that you are creating a PRONOM signature file. To build a MIME-info signature file instead, use the -mi flag with the name of the MIME-info signature file:
e.g. roy build -mi tika-mimetypes.xml
As a convenience, you can just use "tika" instead of "tika-mimetypes.xml" and "freedesktop" instead of "freedesktop.org.xml". The -mi flag also works with the roy add command (which is described further below):
e.g. roy add -mi freedesktop
A Library of Congress FDD signature file
To build an FDD signature file, run:
roy build -loc
or roy add -loc
Where FDD signatures reference PRONOM IDs, PRONOM signatures are imported into the LOC identifier. You can override this behaviour, so that only LOC magic is used, with the -nopronom flag, i.e. roy build -loc -nopronom
A Wikidata signature file
The Wikidata identifier implements harvest and build routines. To build a Wikidata signature file with the defaults, run the following:
Harvest
roy harvest -wikidata
Build
roy build -wikidata
There are a few different ways to work with either of these capabilities which are documented more thoroughly in the documentation for the Wikidata identifier.
Customisable
roy has a number of options for further customising your signature files.
Here are the flags you can apply:
roy build -bof 16000
(set a maximum beginning of file offset for byte sequence matching)
roy build -eof 8000
(set a maximum end of file offset for byte sequence matching)
roy build -noeof
(trim end of file segments from byte signatures)
roy build -nobyte
(build an identifier without byte signatures)
roy build -nocontainer
(build an identifier without container signatures)
roy build -notext
(build an identifier without a text matcher)
roy build -noname
(build an identifier without a filename matcher)
roy build -nomime
(build an identifier without a MIME matcher)
roy build -noxml
(build an identifier without an XML matcher)
roy build -noreports
(build an identifier using the DROID file alone and not PRONOM XML reports)
roy build -limit fmt/1,fmt/2,fmt/3
(limit the identifier to certain formats)
roy build -exclude fmt/1,fmt/2,fmt/3
(exclude formats from the identifier)
roy build -extend custom-fmt1.xml,custom-fmt2.xml
(add custom signatures in DROID format e.g. using this utility. Custom signatures should be placed in a custom directory within your home directory)
roy build -multi single
(build an identifier that is guaranteed to return a single result. In the event of a tie, UNKNOWN is returned with a descriptive warning)
roy build -multi conclusive
(the default mode, applies weights and returns only the strongest result(s))
roy build -multi positive
(in this mode, all strong results are returned. A strong result is one based on an internal signature, such as a byte, container, RIFF or XML match. Weights and priorities are still applied in order to return early from matching wherever possible, i.e. this mode does not affect speed.)
roy build -multi comprehensive
(identical to positive except that weights and priorities are ignored during matching - like exhaustive, this mode will slow things down)
roy build -multi exhaustive
(build an identifier that ignores format weights and returns all results. This will slow things down but can be useful for debugging, e.g. alongside sf -debug FILE)
roy build -extend custom-fmt1.xml -extendc custom-container-fmt1.xml
(add custom signatures in DROID container format. The DROID container format doesn't include format details such as name and mimetype, so these need to be provided in a matching normal DROID extension file. Read this post for more information. Custom signatures should be placed in a custom directory within your home directory)
roy build -droid DroidSignatureFile_V10.xml -noreports
(specify a particular DROID file. The -noreports flag is useful if you don't have matching PRONOM reports for older versions)
roy build -container container-signature-2010.xml
(specify a container signature file)
roy build -mi tika-mimetypes.xml
(build a MIMEInfo identifier with the supplied MIMEInfo signature file. You can use "tika" or "freedesktop" as aliases for "tika-mimetypes.xml" and "freedesktop.org.xml" respectively.)
Naming your signature file and your identifier
All of the commands above will work, but they will overwrite your default.sig file. Since many of these constraints alter the way that files are identified, it is best practice to use a different signature name and a different identifier name.
For example:
roy build -name speedy -bof 131072 speedy.sig
The last part, speedy.sig, is the signature name, and the -name flag names the identifier.
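To show how a named signature file is then used, here is a minimal sketch that builds speedy.sig and passes it to sf with the -sig flag described at the top of this guide. The run helper and the sample file name are hypothetical additions for illustration; the helper prints each command and only executes it when the tool is actually installed.

```shell
#!/bin/sh
SIG="speedy.sig"
SAMPLE="sample.pdf"   # hypothetical file to identify

# Illustrative helper (not part of roy or sf): print the command,
# then execute it only if the program is available on this machine.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

# Build a separately named signature file, leaving default.sig untouched.
run roy build -name speedy -bof 131072 "$SIG"

# Identify a file with the custom signature file instead of the default.
run sf -sig "$SIG" "$SAMPLE"
```

Keeping the file name in one variable makes it harder to build one signature file and then accidentally identify with another.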
Describing your modifications
When roy builds your signature file, it automatically populates a "details" field with information about all the modifications you have made. This information goes into the provenance block at the beginning of sf results. To provide your own description, override this "details" field with the -details flag.
E.g. roy build -exclude @pdf -details "Sorry posterity... I don't care about provenance!" evil.sig
One signature file, multiple identifiers
A single signature file can contain one or more identifiers.
Identifiers are sets of format signatures with a common identity. When you run the sf tool, all the identifiers are listed in the "provenance" block at the head of the results and each identifier will report its results for every file matched. Siegfried's design means you can add additional identifiers without incurring significant additional runtime cost (i.e. a second identifier won't double the matching time). The main purpose of this feature is to enable support for additional signature formats. But you may want to build signature files with multiple identifiers for other reasons: for example, to view changes in signature files over time or to test the effects of various signature customisations on sample files.
To create a signature file with multiple identifiers, first use roy build to create a signature file with one identifier and then use roy add to add further identifiers. roy add accepts the same arguments as roy build: the only difference is that roy build creates a new signature file while roy add adds a new identifier to an existing signature file.
For example:
roy build -name latest -nocontainer history.sig
(build a signature file with the latest version of DROID but without containers)
roy add -name "version 10" -droid DroidSignatureFile_Version10.xml -noreports -nocontainer history.sig
(add an additional identifier with an older signature file. Use -noreports if you don't have old PRONOM XML reports lying around.)
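The same build-then-add pattern can combine different signature formats in one file. The sketch below is one possible arrangement, assuming the PRONOM, Tika and Library of Congress sources are already in your home directory; the function name, the identifier names and the file name combined.sig are hypothetical.

```shell
#!/bin/sh
# Illustrative wrapper: one PRONOM identifier, then a Tika MIME-info
# identifier and a Library of Congress FDD identifier added to the same file.
build_combined() {
    sig="$1"
    # "roy build" creates the file; each "roy add" appends an identifier.
    # The && chain stops at the first command that fails.
    roy build -name pronom "$sig" &&
    roy add -name tika -mi tika "$sig" &&
    roy add -name loc -loc "$sig"
}

# Only attempt the build where roy is actually installed.
if command -v roy >/dev/null 2>&1; then
    build_combined combined.sig
else
    echo "roy not found; skipping build"
fi
```

Running sf -sig combined.sig FILE against the result should then list all three identifiers in the provenance block and report one result per identifier.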
Inspecting your handiwork
roy has an inspect command for viewing the contents of signature files; see Inspect and Debug.