Tips and Tricks - richardlehane/siegfried GitHub Wiki
Most of these tips and tricks require roy to have been set up. If you haven't done this yet, follow the setup steps for roy that are listed on this page.
Getting absolute file names in your results
The 'filename' field in sf results is based on the file path that you give as input to sf. For example, if you give sf a relative path as input, such as sf docs, you'll get relative paths in your results like "filename: docs\myfile.txt".
You might prefer absolute paths here instead, such as "filename: c:\Users\richardl\docs\myfile.txt". To get absolute paths you need to give absolute paths as input, e.g. sf c:\Users\richardl\docs.
Rather than type out the full path in your terminal, when you've already navigated to the right folder, you can use the following shortcuts to get absolute file names in your results:
sf %cd%\docs [on Windows]
sf $PWD/docs [on unix]
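In a script, the same idea can be wrapped in a small helper. This is only a minimal sketch for unix shells: abs_dir is a hypothetical name (it is not part of sf), and it assumes the directory already exists.

```shell
# Hypothetical helper (not part of sf): print the absolute path of a directory.
# The cd runs in a subshell, so the caller's working directory is unchanged.
abs_dir() {
  (cd -- "$1" && pwd)
}

# Example usage, assuming sf is installed and ./docs exists:
#   sf "$(abs_dir docs)"
```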
Piping unknowns to file or tika
The sf -log unknown command returns a list of all the files that sf cannot identify. This list can be used as the input for other identification tools such as file and tika.
For example, on a Linux or Mac command line (you need the GNU version of xargs), you could do:
sf -log unknown,stdout DIR | xargs -d '\n' file
sf -log unknown,stdout DIR | xargs -d '\n' -n 1 java -jar tika-app.jar -d
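If GNU xargs is not available (the -d option is a GNU extension), a plain shell loop can stand in for it. The following is a sketch rather than a tested recipe for sf itself; each_line is a hypothetical helper name, and it assumes one filename per line with no newlines inside filenames.

```shell
# Hypothetical helper: run a command once per input line,
# passing the line as the final argument (spaces in names are preserved).
each_line() {
  while IFS= read -r path; do
    if [ -n "$path" ]; then
      # </dev/null stops the command from consuming the remaining list
      "$@" "$path" </dev/null
    fi
  done
}

# Example usage, assuming sf, file and tika-app.jar are available:
#   sf -log unknown,stdout DIR | each_line file
#   sf -log unknown,stdout DIR | each_line java -jar tika-app.jar -d
```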
Filtering and deep scanning PDF files
PRONOM is geared towards finding single, canonical identifiers for files. In certain circumstances, however, we might like to know all the matching signatures for a particular file type e.g. a PDF/X file will also have a PDF version (e.g. 1.6) and we might want that additional information.
We can get all matching signatures with sf if we turn off signature prioritisation (the system that tells us which signatures take priority over other signatures) with roy build -nopriority. But scanning in nopriority mode can take a long time. We can speed this up if we limit the signature file to only the PDF signatures:
roy build -limit @pdf -nopriority -name pdfscan pdfscan.sig
Running this with sf -sig pdfscan.sig DIR will give us all the matching signatures for PDFs in that directory. But what happens if that directory also contains other file types? In that case, we'll end up with many UNKNOWNs that we might not want in our output. Let's filter these out by creating a second signature file that just isolates PDF files:
roy build -limit @pdfcore -noeof -name pdffilter pdffilter.sig
The @pdfcore set doesn't contain wildcard PDF signatures like PDF/A, so it runs much faster than a full PDF signature set. We can make it even faster by removing end-of-file portions from signatures (using -noeof). We can then run this filter over our directory, piping all the PDF files to a second sf process that does the full PDF scan (the -log known,stdout flag means we just emit filenames to stdout, and the -f - flag at the end of the second command reads that list of files from stdin):
sf -sig pdffilter.sig -log known,stdout DIR | sf -sig pdfscan.sig -f -
Using jq to isolate errors and warnings
jq is a command line tool for processing JSON. Since JSON is one of the outputs supported by sf, you can use jq to process sf output.
For example, the following jq command will isolate all the error and warning information from sf results:
sf -json DIR | jq '.files[] | select(.errors!="" or .matches[].warning!="") | .filename, .errors, .matches[].warning'
This is useful for post-processing results files. To access errors and warnings during a scan do:
sf -log e,w -json DIR > my_results.json
A simpler way to post-process a results file to inspect errors and warnings is to use sf -replay. E.g.
sf -replay -log e,w my_results.json
Using mlr to do two-pass scanning
If you are doing very large scans, and you want to tune your signature file for better performance, the best approach is usually to narrow the scope of your signature file with the -limit flag.
For example, at my workplace we recently received a consignment that we knew contained hundreds of thousands of tiff images. This job might take hours with a default signature file but is much quicker with a signature file limited to the tiff formats:
roy build -limit @tiff -name tiff tiff.sig
If we run roy inspect tiff.sig we see that the resulting signature file has automatically inferred a BOF limit of 4096 and an EOF limit of 4101 bytes, so this should run really quickly! Any file that isn't a tiff will be reported as UNKNOWN, and in this kind of scenario this is a good thing because it prompts us to take a closer look at any of the file types we weren't expecting.
Let's run a first pass over our consignment, choosing csv output so we can do further processing with mlr:
sf -csv -sig tiff.sig DIR > 1pass.csv
Miller (mlr) is a command-line tool for CSV that works a lot like jq. If you're on a Mac, you can install it with brew. We can use the following command to pick out any files with unknown formats:
mlr --csv --rs lf filter '$id == "UNKNOWN"' then cut -f filename 1pass.csv
The --rs lf argument tells mlr that the input csv has UNIX line endings. We apply a cut command to the results of our filter so that we just return filenames.
If we strip the header line from our output (with the awk command below), we can pipe the list of our unknown files back to a second sf process, this time testing against the full PRONOM signature set:
mlr --csv --rs lf filter '$id == "UNKNOWN"' then cut -f filename 1pass.csv | awk '{if(NR>1)print}' | sf -csv -f - > 2pass.csv
Because this second scan only tests the small number of non-tiff files in our consignment, we can save quite a bit of time.
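The awk step in that pipeline only drops the CSV header row. A shorter, equivalent spelling is sketched here; strip_header is a hypothetical name used purely for illustration.

```shell
# Hypothetical helper: drop the first line (the CSV header) from stdin.
# awk prints a line whenever the pattern is true, so NR > 1 skips line 1.
strip_header() {
  awk 'NR > 1'
}

# Example usage, assuming mlr and sf are installed and 1pass.csv exists:
#   mlr --csv --rs lf filter '$id == "UNKNOWN"' then cut -f filename 1pass.csv | strip_header | sf -csv -f - > 2pass.csv
```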
Note: In this example mlr is doing all the work, but a shorter way to get the same result might be to redirect -log unknown when running sf. E.g. sf -csv -sig tiff.sig -log unknown DIR > 1pass.csv 2> unknowns.txt. You could then get the second pass by simply doing sf -csv -f - < unknowns.txt.
Finally, we can use mlr to merge both passes into a single result file:
- create a new file based on the first pass that only contains known formats (i.e. tiffs):
mlr --csv --rs lf filter '$id != "UNKNOWN"' 1pass.csv > 1pass_known.csv
- concatenate those knowns with our second pass:
mlr --csv --rs lf cat 1pass_known.csv 2pass.csv > results.csv
Tip: you can also use sf -replay to concatenate results files. -replay can take multiple results files as input. To achieve the same as above simply do: sf -replay -csv 1pass_known.csv 2pass.csv > results.csv
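If neither tool is to hand, a plain awk sketch can do the same concatenation. It assumes every input file starts with the same single-line header row; merge_csv is a hypothetical helper name, not an sf or mlr feature.

```shell
# Hypothetical helper: concatenate CSV files, keeping only the first header.
# FNR is the line number within the current file, NR the overall line number,
# so "FNR == 1 && NR != 1" matches the header of every file except the first.
merge_csv() {
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}

# Example usage with the files from the steps above:
#   merge_csv 1pass_known.csv 2pass.csv > results.csv
```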
Re-scanning your repo after a PRONOM update
OK, the team at TNA have just released a big new update of PRONOM ... yay! ... however, the last time you scanned your massive repository you were on PRONOM version 77 and it will take days to do a full scan with fresh signatures.
How about doing that full scan with a signature file that contains only the signatures that have been added or changed since you last did a scan?
This will be a lot quicker and will also pinpoint all the changes. The changes set is designed for this use case.
To build a signature file with all changes since version 77 do:
roy build -limit @78,@79,@81,@82 -name changes changes.sig
If you're eagle-eyed you may have noticed I skipped @80 - that is because there is no version 80 of PRONOM.
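For longer gaps between scans, typing the set list by hand gets tedious. The sketch below builds the -limit value from a version range and a list of versions to skip; changes_limit is a hypothetical helper, and it is up to you to know which version numbers do not exist.

```shell
# Hypothetical helper: print "@first,...,@last", omitting any skipped versions.
# Usage: changes_limit FIRST LAST [SKIP...]
changes_limit() {
  first=$1
  last=$2
  shift 2
  out=""
  v=$first
  while [ "$v" -le "$last" ]; do
    skip=0
    for s in "$@"; do
      if [ "$s" -eq "$v" ]; then skip=1; fi
    done
    if [ "$skip" -eq 0 ]; then
      # ${out:+$out,} adds a comma only when out is already non-empty
      out="${out:+$out,}@$v"
    fi
    v=$((v + 1))
  done
  printf '%s\n' "$out"
}

changes_limit 78 82 80   # prints @78,@79,@81,@82

# Example usage, assuming roy is set up:
#   roy build -limit "$(changes_limit 78 82 80)" -name changes changes.sig
```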
What I'd recommend doing next is to use this changes.sig file as a filter over your repository and only do a full identification for files that get a match. Assuming you've run sf -update to update your main pronom.sig signature file, this command is:
sf -sig changes.sig -log known,stdout DIR | sf -f -
Re-scanning your repo after a PRONOM update Part II (viewing history)
We could get even more information when re-scanning our repository by using a signature file that contains multiple identifiers, with each identifier representing a different PRONOM release.
The roy build command defaults to the latest available PRONOM signatures so we can build version 88 with this command:
roy build -name v88 history.sig
To add the full history since version 77, we'd then do (using the -noreports flag since we want to build straight from the DROID files):
roy add -droid DROID_SignatureFile_V81.xml -noreports -container container-signature-20150218.xml -name v81 history.sig
roy add -droid DROID_SignatureFile_V79.xml -noreports -container container-signature-20140923.xml -name v79 history.sig
roy add -droid DROID_SignatureFile_V78.xml -noreports -container container-signature-20140923.xml -name v78 history.sig
roy add -droid DROID_SignatureFile_V77.xml -noreports -container container-signature-20140717.xml -name v77 history.sig
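Since these four commands differ only in version number and container file, they can be generated from a small table. This is a dry-run sketch: history_add_cmds is a hypothetical helper that only prints the commands, using the same file names as above, so you can review them before running anything.

```shell
# Hypothetical helper: print one "roy add" command per PRONOM release.
# Each table row is: <version> <container signature file>
history_add_cmds() {
  while read -r version container; do
    printf 'roy add -droid DROID_SignatureFile_V%s.xml -noreports -container %s -name v%s history.sig\n' \
      "$version" "$container" "$version"
  done <<'EOF'
81 container-signature-20150218.xml
79 container-signature-20140923.xml
78 container-signature-20140923.xml
77 container-signature-20140717.xml
EOF
}

history_add_cmds

# To execute rather than print (assumes roy and the DROID files are in place):
#   history_add_cmds | sh
```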
We could then apply this history file to the subset of files in our repository that are affected by PRONOM changes since version 77 (that we can identify using the filter signature defined in the section above):
sf -sig changes.sig -log known,stdout DIR | sf -sig history.sig -f -
Running this will tell us, for every file affected by changes to PRONOM, what the original (v77) identification was and how each subsequent release (until v82) identifies the file.
Using multiple identifiers to peek within DOCX files
If we trick sf into thinking a DOCX file is a zip file, we can use the sf -z flag to recurse through the contents of that file. This can be handy, for example, to identify whether DOCX files contain multimedia or other embedded content.
First off, let's create a filter signature that will just isolate DOCX files in a directory:
roy build -limit fmt/412,fmt/494 -name docxfilter docxfilter.sig
Our DOCX analyzer signature will contain the full set of PRONOM signatures:
roy build docxanalyzer.sig
Let's add a second identifier that can only identify ZIP files (and will therefore think that DOCX files, as well as any other zip-based formats, are ZIPs):
roy add -limit x-fmt/263 -name unzipper docxanalyzer.sig
We can now use this command to do a deep scan of all DOCX files in a directory (adding the -z flag to trigger zip scanning):
sf -sig docxfilter.sig -log known,stdout DIR | sf -sig docxanalyzer.sig -z -f -
An Empty Identifier
There may be occasions where you'd like to access some of the ancillary functionality of sf (listing contents of directories, listing contents of compressed files, computing file hashes, etc.) but not perform any identification.
Doing this sort of thing will be quicker with an empty identifier. You can create such a signature file by adding a non-puid limit, e.g.:
roy build -limit null -name empty empty.sig
[Note: due to a roy bug you'll need to add the -noreports flag to this line unless you are on sf 1.3.1 or later.]
Another way to do this is to do an -exclude on all the puids. The all.json sets file is your friend here:
roy build -exclude @all -name empty empty.sig
Either way, you can now access the ancillary functions of sf and skip the identification step, e.g.:
sf -sig empty.sig -z -csv -hash crc DIR
Use xxd to understand basis
If you are on Mac or Linux, you can use xxd (a command line hex dump tool) to interpret sf basis output.
E.g. given the basis (for a PNG match):
basis: "extension match; byte match at [[0 16] [37 4] [18353 12]]"
You can view the matching bytes with:
xxd -l 16 FILE
xxd -s 37 -l 4 FILE
xxd -s 18353 -l 12 FILE
The -s flag seeks to the given offset, and the -l flag returns the given length in bytes.
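If xxd isn't installed, the POSIX od tool can show the same bytes. This is a minimal sketch: basis_bytes is a hypothetical helper name, and it treats each basis pair as offset and length, as in the xxd commands above.

```shell
# Hypothetical helper: print LENGTH bytes of FILE starting at OFFSET as hex.
# Usage: basis_bytes FILE OFFSET LENGTH
basis_bytes() {
  # -A n: no address column, -t x1: one hex byte at a time,
  # -j: skip to offset, -N: number of bytes to read
  od -A n -t x1 -j "$2" -N "$3" "$1" | tr -d ' \n'
  echo
}

# Example usage for the PNG basis above:
#   basis_bytes FILE 0 16
#   basis_bytes FILE 37 4
#   basis_bytes FILE 18353 12
```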
Identifying plain text encodings
Siegfried runs its text matching algorithm if a file has a ".txt" extension or if it will otherwise be reported as UNKNOWN. Text matching doesn't run when a file has been matched by its extension because we don't want to override potentially more precise identifications (like javascript or css) with a generic plain text identification.
However, there may be cases where you would like to run text matching anyway, for example because you are uncertain that the extension-based matches you are getting are accurate or if you'd like information about the type of text encoding (ASCII, UTF8 etc.) used in text-based formats.
You can force text matching to run in all cases by adding a second, text-only identifier to your signature file:
roy add -limit x-fmt/111 -name "text only"
This will result in two identifications for each file: 1) a PRONOM identification and 2) a text identification. Encoding information, if text is detected, will be in the basis field of the text identification.
You can also run sf as a standalone text identification tool by building a text-only signature file:
roy build -limit x-fmt/111 -name "text only" text.sig
sf -sig text.sig DIR
This tip was inspired by a query from Ross Spencer: https://github.com/richardlehane/siegfried/issues/67