Tips and Tricks - richardlehane/siegfried GitHub Wiki

Most of these tips and tricks require roy to have been set up. If you haven't done this yet, follow the setup steps for roy that are listed on this page.

Getting absolute file names in your results

The 'filename' field in sf results is based on the file path that you give as input to sf. For example, if you give sf a relative path as input, such as sf docs, you'll get relative paths in your results like "filename: docs\myfile.txt".

You might prefer absolute paths here instead, such as "filename: c:\Users\richardl\docs\myfile.txt". To get absolute paths you need to give absolute paths as input, e.g. sf c:\Users\richardl\docs.

Rather than type out the full path in your terminal, when you've already navigated to the right folder, you can use the following shortcuts to get absolute file names in your results:

sf %cd%\docs [on Windows]
sf $PWD/docs [on unix]
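If you're scripting this on unix, a small helper can resolve any relative directory (including ones like ../docs) to an absolute path before handing it to sf. This is only a sketch: abs_dir is a made-up helper name, not something that ships with sf.

```shell
# abs_dir is a hypothetical helper (not part of sf): print the absolute path of a directory.
abs_dir() {
  # Run cd in a subshell so the caller's working directory is left alone.
  # CDPATH is cleared so cd cannot jump somewhere unexpected.
  (CDPATH= cd -- "$1" 2>/dev/null && pwd)
}
```

You could then run sf "$(abs_dir docs)"; the quotes keep paths that contain spaces intact.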

Piping unknowns to file or tika

The sf -log unknown command returns a list of all the files that sf cannot identify. This list can be used as the input for other identification tools such as file and tika.

For example, on a Linux or Mac command line (this requires the GNU version of xargs), you could do:

sf -log unknown,stdout DIR | xargs -d '\n' file

sf -log unknown,stdout DIR | xargs -d '\n' -n 1 java -jar tika-app.jar -d
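The -d option is specific to GNU xargs. On systems whose xargs lacks it, one common workaround is to turn the newlines into NUL bytes and use -0 instead. A sketch, assuming no file name contains a newline; each_line is a made-up helper name:

```shell
# each_line is a hypothetical helper: read newline-separated paths on stdin
# and run the given command once per path.
each_line() {
  # tr converts each newline to a NUL byte so xargs -0 treats every line as
  # exactly one argument, even if it contains spaces or quotes.
  tr '\n' '\0' | xargs -0 -n 1 "$@"
}
```

With that in place, the two commands above would become sf -log unknown,stdout DIR | each_line file and sf -log unknown,stdout DIR | each_line java -jar tika-app.jar -d.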

Filtering and deep scanning PDF files

PRONOM is geared towards finding single, canonical identifiers for files. In certain circumstances, however, we might like to know all the matching signatures for a particular file type e.g. a PDF/X file will also have a PDF version (e.g. 1.6) and we might want that additional information.

We can get all matching signatures with sf if we turn off signature prioritisation (the system that tells us which signatures take priority over other signatures) with roy build -nopriority. But scanning in nopriority mode can take a long time. We can speed this up if we limit the signature file to only the PDF signatures:

roy build -limit @pdf -nopriority -name pdfscan pdfscan.sig

Running this with sf -sig pdfscan.sig DIR will give us all the matching signatures for PDFs in that directory. But what happens if that directory also contains other file types? In that case, we'll end up with many UNKNOWNs that we might not want in our output. Let's filter these out by creating a second signature file that just isolates PDF files:

roy build -limit @pdfcore -noeof -name pdffilter pdffilter.sig

The @pdfcore set doesn't contain wildcard PDF signatures like PDF/A, so it runs much faster than a full PDF signature set. We can make it even faster by removing end-of-file portions from signatures (using -noeof). We can then run this filter over our directory, piping all the PDF files to a second sf process that does the full PDF scan (the -log known,stdout option means we just emit filenames to stdout, and the -f - flag at the end of the second command reads that list from stdin):

sf -sig pdffilter.sig -log known,stdout DIR | sf -sig pdfscan.sig -f -
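If you use this filter-then-scan shape a lot, it can be wrapped in a small shell function. This is just a sketch: filter_then_scan is a made-up name, and it only uses the sf flags already shown above.

```shell
# filter_then_scan is a hypothetical helper, not part of sf.
# Usage: filter_then_scan FILTER_SIG SCAN_SIG DIR [extra sf flags for the second pass]
filter_then_scan() {
  filter_sig=$1
  scan_sig=$2
  dir=$3
  shift 3
  # The first sf emits only the names of files the filter signature matched;
  # the second sf reads that list from stdin (-f -) and does the real scan.
  sf -sig "$filter_sig" -log known,stdout "$dir" | sf -sig "$scan_sig" "$@" -f -
}
```

The command above would then be filter_then_scan pdffilter.sig pdfscan.sig DIR.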

Using jq to isolate errors and warnings

jq is a command line tool for processing JSON. Since JSON is one of the outputs supported by sf, you can use jq to process sf output.

For example, the following jq command will isolate all the error and warning information from sf results:

sf -json DIR | jq '.files[] | select(.errors!="" or .matches[].warning!="") | .filename, .errors, .matches[].warning'

This is useful for post-processing results files. To access errors and warnings during a scan do:

sf -log e,w -json DIR > my_results.json

A simpler way to post-process a results file to inspect errors and warnings is to use sf -replay.

E.g. sf -replay -log e,w my_results.json
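One thing to watch with the jq filter earlier in this section: because .matches[].warning produces one value per match, a file with several matches that all carry warnings can be emitted more than once. If you only want each affected file name once, jq's any function is one way to do it. The sketch below runs against a made-up sample that is trimmed to the fields the filter touches (real sf output has more); flagged_files is a hypothetical name.

```shell
# Hypothetical sample data, cut down to the fields used by the filter.
sample='{"files":[
  {"filename":"a.txt","errors":"","matches":[{"id":"x-fmt/111","warning":""}]},
  {"filename":"b.bin","errors":"","matches":[{"id":"UNKNOWN","warning":"no match"}]},
  {"filename":"c.doc","errors":"empty source","matches":[{"id":"UNKNOWN","warning":"no match"},{"id":"UNKNOWN","warning":"no match"}]}
]}'

# flagged_files (hypothetical helper): read sf JSON on stdin and print each
# file name once if the file has an error or at least one match warning.
flagged_files() {
  jq -r '.files[] | select(.errors != "" or any(.matches[]; .warning != "")) | .filename'
}

# Only run the demo when jq is installed.
if command -v jq >/dev/null 2>&1; then
  printf '%s\n' "$sample" | flagged_files
fi
```

Against real results you would pipe sf in instead of the sample, e.g. sf -json DIR | flagged_files.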

Using mlr to do two-pass scanning

If you are doing very large scans, and you want to tune your signature file for better performance, the best approach is usually to narrow the scope of your signature file with the -limit flag.

For example, at my workplace we recently received a consignment that we knew contained hundreds of thousands of tiff images. This job might take hours with a default signature file but is much quicker with a signature file limited to the tiff formats:

roy build -limit @tiff -name tiff tiff.sig

If we run roy inspect tiff.sig we see that the resulting signature file has automatically inferred a BOF limit of 4096 and an EOF limit of 4101 bytes... so this should run really quickly! Any file that isn't a tiff will be reported as UNKNOWN, and in this kind of scenario this is a good thing because it prompts us to take a closer look at any of the file types we weren't expecting.

Let's run a first pass over our consignment, choosing csv output so we can do further processing with mlr:

sf -csv -sig tiff.sig DIR > 1pass.csv

Miller (mlr) is a command-line tool for CSV that works a lot like jq. If you're on a Mac, you can install it with brew. We can use the following command to filter out any files with unknown formats:

mlr --csv --rs lf filter '$id == "UNKNOWN"' then cut -f filename 1pass.csv

The --rs lf argument tells mlr that the input csv has UNIX line endings. We apply a cut command to the results of our filter so that we just return filenames.

If we strip the header line from our output (with the awk command below), we can pipe the list of our unknown files back to a second sf process, this time testing against the full PRONOM signature set:

mlr --csv --rs lf filter '$id == "UNKNOWN"' then cut -f filename 1pass.csv | awk '{if(NR>1)print}' | sf -csv -f - > 2pass.csv

Because this second scan is only testing the small number of non-tiff files in our consignment, we can save quite a bit of time.
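If Miller isn't available, roughly the same filter can be sketched with awk. Note this swaps in a different technique: it is a naive comma split, so it assumes no field (file names included) contains a comma or a newline, which Miller's real CSV parsing doesn't require. unknown_files is a made-up helper name.

```shell
# unknown_files (hypothetical helper): print the filename column of every row
# whose id column is UNKNOWN. NOT a general CSV parser: it assumes no field
# contains a comma or newline.
unknown_files() {
  awk -F, '
    # Header row: map each column name to its position, then skip it
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data rows: emit the file name when the id is UNKNOWN
    $(col["id"]) == "UNKNOWN" { print $(col["filename"]) }
  ' "$1"
}
```

Because the header row is skipped inside awk, the output can go straight to the second pass: unknown_files 1pass.csv | sf -csv -f - > 2pass.csv.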

Note: in this example mlr is doing all the work, but a shorter way to get the same result might be to redirect -log unknown when running sf, e.g. sf -csv -sig tiff.sig -log unknown DIR > 1pass.csv 2> unknowns.txt. You could then get the second pass by simply doing sf -csv -f - < unknowns.txt.

Finally, we can use mlr to merge both passes into a single result file:

  1. create a new file based on the first pass that only contains known formats (i.e. tiffs):

mlr --csv --rs lf filter '$id != "UNKNOWN"' 1pass.csv > 1pass_known.csv

  2. concatenate those knowns with our second pass:

mlr --csv --rs lf cat 1pass_known.csv 2pass.csv > results.csv

Tip: you can also use sf -replay to concatenate results files. -replay can take multiple results files as input. To achieve the same as above simply do: sf -replay -csv 1pass_known.csv 2pass.csv > results.csv
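After merging, it can be worth checking that no file was lost or duplicated along the way: the merged file should have the same number of data rows as the first pass. A rough sketch, assuming one identifier per file and no field with an embedded newline (both helper names are made up):

```shell
# data_rows (hypothetical helper): count the lines after the CSV header.
# Assumes no field contains an embedded newline.
data_rows() {
  tail -n +2 "$1" | wc -l | tr -d ' '
}

# check_merge FIRST_PASS MERGED: succeed only if both files have the same
# number of data rows.
check_merge() {
  [ "$(data_rows "$1")" -eq "$(data_rows "$2")" ]
}
```

For example: check_merge 1pass.csv results.csv || echo "row counts differ".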

Re-scanning your repo after a PRONOM update

OK, the team at TNA have just released a big new update of PRONOM ... yay! ... however, the last time you scanned your massive repository you were on PRONOM version 77, and it will take days to do a full scan with fresh signatures.

How about doing that full scan with a signature file that contains only the signatures that have been added or changed since you last did a scan?

This will be a lot quicker and will also pinpoint all the changes. The changes set is designed for this use case.

To build a signature file with all changes since version 77 do:

roy build -limit @78,@79,@81,@82 -name changes changes.sig

If you're eagle eyed you may have noticed I skipped @80 - that is because there is no version 80 of PRONOM.
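If you'd rather not type the list out by hand, the -limit argument can be generated. This is a sketch only: version_sets is a made-up helper, and it hard-codes just the one gap mentioned above (80); any other gap would need adding.

```shell
# version_sets FROM TO (hypothetical helper): print "@FROM,...,@TO" for use
# with roy build -limit, skipping 80 because there is no PRONOM version 80.
version_sets() {
  v=$1
  out=
  while [ "$v" -le "$2" ]; do
    if [ "$v" -ne 80 ]; then
      # Prefix a comma only when out already holds an earlier entry
      out="${out:+$out,}@$v"
    fi
    v=$((v + 1))
  done
  printf '%s\n' "$out"
}
```

The build command then becomes roy build -limit "$(version_sets 78 82)" -name changes changes.sig.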

What I'd recommend doing next is to use this changes.sig file as a filter over your repository and only doing a full identification for files that get a match. Assuming you've run sf -update to update your main pronom.sig signature file, this command is:

sf -sig changes.sig -log known,stdout DIR | sf -f -

Re-scanning your repo after a PRONOM update Part II (viewing history)

We could get even more information when re-scanning our repository by using a signature file that contains multiple identifiers, with each identifier representing a different PRONOM release.

The roy build command defaults to the latest available PRONOM signatures so we can build version 82 with this command:

roy build -name v82 history.sig

To add the full history since version 77, we'd then do (using the -noreports flag since we want to build straight from the DROID files):

roy add -droid DROID_SignatureFile_V81.xml -noreports -container container-signature-20150218.xml -name v81 history.sig

roy add -droid DROID_SignatureFile_V79.xml -noreports -container container-signature-20140923.xml -name v79 history.sig

roy add -droid DROID_SignatureFile_V78.xml -noreports -container container-signature-20140923.xml -name v78 history.sig

roy add -droid DROID_SignatureFile_V77.xml -noreports -container container-signature-20140717.xml -name v77 history.sig
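Since these four roy add commands differ only in the version number and the container file date, they can be generated from a small table. The sketch below only prints the commands; history_cmds is a made-up name and the version/date pairs are copied from the commands above.

```shell
# history_cmds (hypothetical helper): print one roy add command per
# "version container-date" pair listed in the here-document.
history_cmds() {
  while read -r ver date; do
    printf 'roy add -droid DROID_SignatureFile_V%s.xml -noreports -container container-signature-%s.xml -name v%s history.sig\n' "$ver" "$date" "$ver"
  done <<'EOF'
81 20150218
79 20140923
78 20140923
77 20140717
EOF
}
```

Run history_cmds on its own to review the commands, or history_cmds | sh to execute them.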

We could then apply this history file to the subset of files in our repository that are affected by PRONOM changes since version 77 (that we can identify using the filter signature defined in the section above):

sf -sig changes.sig -log known,stdout DIR | sf -sig history.sig -f -

Running this will tell us, for every file affected by changes to PRONOM, what the original (v77) identification was and how each subsequent release (until v82) identifies the file.

Using multiple identifiers to peek within DOCX files

If we trick sf into thinking a DOCX file is a zip file, we can use the sf -z flag to recurse through the contents of that file. This can be handy, for example, to identify whether DOCX files contain multimedia or other embedded content.

First off, let's create a filter signature that will just isolate DOCX files in a directory:

roy build -limit fmt/412,fmt/494 -name docxfilter docxfilter.sig

Our DOCX analyzer signature will contain the full set of PRONOM signatures:

roy build docxanalyzer.sig

Let's add a second identifier that can only identify ZIP files (and will therefore think that DOCX files, as well as any other zip-based formats, are ZIPs):

roy add -limit x-fmt/263 -name unzipper docxanalyzer.sig

We can now use this command to do a deep scan of all DOCX files in a directory (adding the -z flag to trigger zip scanning):

sf -sig docxfilter.sig -log known,stdout DIR | sf -sig docxanalyzer.sig -z -f -

An Empty Identifier

There may be occasions where you'd like to access some of the ancillary functionality of sf (listing contents of directories, listing contents of compressed files, computing file hashes, etc.) but not perform any identification.

Doing this sort of thing will be quicker with an empty identifier. You can create such a signature file by adding a non-puid limit, e.g.:

roy build -limit null -name empty empty.sig

Note: due to a roy bug, you'll need to add the -noreports flag to this command unless you are on sf 1.3.1 or later.

Another way to do this is to do an -exclude on all the puids. The all.json sets file is your friend here:

roy build -exclude @all -name empty empty.sig

Either way, you can now access the ancillary functions of sf and skip the identification step, e.g.:

sf -sig empty.sig -z -csv -hash crc DIR

Use xxd to understand basis

If you are on Mac or Linux, you can use xxd (a command line hex dump tool) to interpret sf basis output.

E.g. given the basis (for a PNG match):

basis: "extension match; byte match at [[[0 16]] [[37 4]] [[18353 12]]]"

You can view the matching bytes with:

xxd -l 16 FILE

xxd -s 37 -l 4 FILE

xxd -s 18353 -l 12 FILE

The -s flag seeks to the given offset, and the -l flag returns the given length in bytes.
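To save working out the xxd commands by hand, the offset and length pairs can be pulled out of a basis string with grep. This is a sketch that assumes every "number space number" pair in the string is an offset followed by a length, as in the PNG example; basis_to_xxd is a made-up name, and it prints the commands rather than running them.

```shell
# basis_to_xxd BASIS FILE (hypothetical helper): print one xxd command for
# each "offset length" pair found in the basis string.
basis_to_xxd() {
  # grep -o emits each matching pair on its own line; read splits it in two
  printf '%s\n' "$1" | grep -oE '[0-9]+ [0-9]+' | while read -r off len; do
    printf 'xxd -s %s -l %s %s\n' "$off" "$len" "$2"
  done
}
```

For offset 0 this prints xxd -s 0 -l 16 FILE, which is equivalent to the xxd -l 16 FILE form used above.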

Identifying plain text encodings

Siegfried runs its text matching algorithm if a file has a ".txt" extension or if it will otherwise be reported as UNKNOWN. Text matching doesn't run when a file has been matched by its extension because we don't want to override potentially more precise identifications (like javascript or css) with a generic plain text identification.

However, there may be cases where you would like to run text matching anyway, for example because you are uncertain that the extension-based matches you are getting are accurate or if you'd like information about the type of text encoding (ASCII, UTF8 etc.) used in text-based formats.

You can force text matching to run in all cases by adding a second, text-only identifier to your signature file:

roy add -limit x-fmt/111 -name "text only"

This will result in two identifications for each file: 1) a PRONOM identification and 2) a text identification. Encoding information, if text is detected, will be in the basis field of the text identification.

You can also run sf as a standalone text identification tool by building a text-only signature file:

roy build -limit x-fmt/111 -name "text only" text.sig

sf -sig text.sig DIR

This tip was inspired by a query from Ross Spencer: https://github.com/richardlehane/siegfried/issues/67