Inspect and Debug - richardlehane/siegfried GitHub Wiki
This guide describes how you can use the roy tool to inspect file format signatures and signature files, and the sf -log slow and sf -log debug commands for debugging identification processes.
If you haven't already set up the roy tool, follow the instructions here
Inspect signature files
roy inspect
(view contents of the default signature file)
roy inspect SIGNATURE_FILE
(view contents of a particular signature file)
Tip: the output of the inspect command can be verbose, so redirect it into a file for analysis, e.g. roy inspect > inspect.txt
When inspecting signature files, you'll see:
- the raw content of the extension matcher (just a map of file extensions to formats)
- the full set of priorities (the slice at the end of the output)
- the container matchers (including their component byte matchers)
- and the primary byte matcher (that contains the main PRONOM signatures).
The section describing the primary byte matcher is probably the most useful, and looks like this:
BYTE MATCHER
BOF seqs: 731
EOF seqs: 87
BOF frames: 1
EOF frames: 0
Total Tests: 1159
Complete Tests: 891
Incomplete Tests: 693
Left Tests: 53
Right Tests: 527
Maximum Left Distance: 2032
Maximum Right Distance: 2558
Maximum BOF Distance: -1
Maximum EOF Distance: 131069
In order to interpret this information, it is necessary to understand how siegfried processes signatures.
During processing, siegfried splits signatures into segments, flattens the patterns in those segments into simple strings as much as possible (so they can be identified by the Aho-Corasick string matching algorithm), and stores references to any follow-up tests that must be satisfied to fully match those segments.
The BOF and EOF seq figures above tell you how many sets of simple strings there are.
Some signatures have no simple strings in them at all (e.g. .tar) and must be individually pattern-matched: these go into the BOF and EOF frames counts.
The test figures tell you how many individual follow-up tests there are ('complete' tests are satisfied immediately, because the string match alone is sufficient; 'incomplete' tests need following up, either to the right or left of the matching string).
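As a rough illustration of the difference between complete and incomplete tests, here is a minimal Python sketch. It is not siegfried's implementation: the FollowUp class and the scan function are hypothetical names, a plain bytes.find loop stands in for the Aho-Corasick matcher, and the left and right tests are simplified to bytes that sit immediately next to the matched string.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    """Follow-up test attached to a simple string (hypothetical structure)."""
    name: str
    left: bytes = b""   # bytes required immediately before the string
    right: bytes = b""  # bytes required immediately after the string

    @property
    def complete(self) -> bool:
        # A complete test needs no follow-up: the string match is sufficient.
        return not self.left and not self.right


def scan(data: bytes, seqs: dict) -> list:
    """Return (test name, offset) for each string hit whose follow-up passes."""
    matches = []
    for seq, test in seqs.items():
        start = data.find(seq)
        while start != -1:
            end = start + len(seq)
            left_ok = data[:start].endswith(test.left)
            right_ok = data[end:].startswith(test.right)
            if test.complete or (left_ok and right_ok):
                matches.append((test.name, start))
            start = data.find(seq, start + 1)
    return matches


# Made-up strings, chosen only to show one test of each kind.
seqs = {
    b"%PDF": FollowUp("complete-example"),
    b"BIM": FollowUp("incomplete-example", left=b"8", right=b"\x04"),
}
print(scan(b"%PDF-1.4 ... 8BIM\x04", seqs))
```

In this sketch the first string is accepted as soon as it is found, while the second is accepted only if the bytes on both sides also match.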
The BOF and EOF distances are the calculated maximum distances to which we must scan in order to satisfy all signatures (-1 is no limit).
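If you have redirected the inspect output to a file, the summary figures can be pulled into a dictionary for comparison between signature files. The following is a small sketch that assumes the saved text uses the "Name: number" layout shown in the excerpt above; parse_byte_matcher is a hypothetical helper, not part of roy.

```python
def parse_byte_matcher(text: str) -> dict:
    """Collect the 'Name: number' lines of a BYTE MATCHER summary into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        # Skip headings and any line whose value is not a (possibly negative) integer.
        if sep and value.strip().lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


# A few lines copied from the excerpt above.
sample = """BYTE MATCHER
BOF seqs: 731
EOF seqs: 87
Maximum BOF Distance: -1
Maximum EOF Distance: 131069
"""

stats = parse_byte_matcher(sample)
for key in ("Maximum BOF Distance", "Maximum EOF Distance"):
    limit = stats[key]
    # -1 means there is no limit, as described above.
    print(key, "->", "no limit" if limit == -1 else f"{limit} bytes")
```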
Inspect PRONOM signatures
roy inspect PUID
(view a PRONOM signature)
You will need to have a copy of the relevant PRONOM XML report file in your reports directory for this command to work. You can use the
roy harvest
command to download all PRONOM XML reports.
This command outputs siegfried's internal representation of the PRONOM byte signatures for a format. For example, the output of roy inspect fmt/41 is:
Raw JPEG Stream:
(F B:0 seq ffd8ff | WW E:0-65536 seq ffd9)
(F B:0 seq ffd8ffed | F P:2 seq "Photoshop 3.0\x008BIM" | WW E:0-16000 seq ffd9)
In order to interpret this information, it is necessary to understand how siegfried represents file format signatures. Internally, siegfried represents signatures as ordered lists of frames that enclose patterns.
A pattern is a test against which a byte stream can be matched. The simplest pattern is a sequence (the 'seq' above) and this is just a string equality test. Other patterns used in PRONOM are 'r' for byte range, 'not' for the inversion of a pattern, 'c' for a choice of patterns, and 'l' for an ordered list of patterns.
A frame is positional information about where a pattern must match. There are different types of frames: fixed frames ('F' in the example above) must occur at exact offsets; window frames ('WW') must occur within a range of offsets; wild frames ('W') have no offset limits; and wild-min ('WM') frames have a lower but not upper offset limit. Frames can be anchored to the beginning of the file (B:0 above means at 0 offset from the BOF), the end of the file (e.g. E:0 above), to a previous frame (e.g. P:2 above), or to a succeeding frame (these are represented by S:).
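To make the notation concrete, the sketch below splits one line of inspect output into its frames. It is an illustration only: the Frame tuple and parse_signature function are hypothetical, the text layout is inferred solely from the fmt/41 example above, and wild ('W') and wild-min ('WM') frames are not handled because this guide does not show how they are printed.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    kind: str      # 'F' (fixed) or 'WW' (window) in the example above
    anchor: str    # 'B' (BOF), 'E' (EOF), 'P' (previous) or 'S' (succeeding)
    min_off: int   # smallest permitted offset from the anchor
    max_off: int   # largest permitted offset (equal to min_off for fixed frames)
    pattern: str   # the pattern text, e.g. 'seq ffd8ff'


def parse_signature(line: str) -> list:
    """Split '(F B:0 seq ffd8ff | WW E:0-65536 seq ffd9)' into Frame tuples."""
    frames = []
    # Assumes ' | ' never occurs inside a quoted pattern.
    for part in line.strip().strip("()").split(" | "):
        kind, position, pattern = part.split(None, 2)
        anchor, _, offsets = position.partition(":")
        low, _, high = offsets.partition("-")
        frames.append(Frame(kind, anchor, int(low), int(high) if high else int(low), pattern))
    return frames


for frame in parse_signature("(F B:0 seq ffd8ff | WW E:0-65536 seq ffd9)"):
    print(frame)
```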
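Putting this all together, we see that there are two signatures for fmt/41. The first signature has the sequence 'ffd8ff' (hex values: inspect will display bytes as printable ASCII if possible, but otherwise reverts to hex) at the very start of the file and the sequence 'ffd9' within 65536 bytes (plus the 2 bytes of the sequence's length) of the end of the file. The second signature has the sequence 'ffd8ffed' at the very start of the file, the string "Photoshop 3.0\x008BIM" at an offset of 2 bytes from the previous frame, and the sequence 'ffd9' within 16000 bytes of the end of the file.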
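The first fmt/41 signature is simple enough to check by hand. The sketch below applies it to a byte string in Python; matches_first_fmt41 is a hypothetical name and this is not how siegfried evaluates frames internally, only a restatement of the signature's meaning.

```python
def matches_first_fmt41(data: bytes, max_eof: int = 65536) -> bool:
    """Apply (F B:0 seq ffd8ff | WW E:0-65536 seq ffd9) to a byte string."""
    bof_seq = bytes.fromhex("ffd8ff")
    eof_seq = bytes.fromhex("ffd9")
    # Fixed frame: the sequence must sit at offset 0 from the beginning of file.
    if not data.startswith(bof_seq):
        return False
    # Window frame: the sequence may end between 0 and max_eof bytes before the
    # end of file, so the region searched is max_eof plus the sequence length.
    window = data[-(max_eof + len(eof_seq)):]
    return eof_seq in window


# Synthetic data, not real JPEG files.
near_end = bytes.fromhex("ffd8ff") + b"\x00" * 100 + bytes.fromhex("ffd9")
too_far = bytes.fromhex("ffd8ff") + bytes.fromhex("ffd9") + b"\x00" * 70000
print(matches_first_fmt41(near_end), matches_first_fmt41(too_far))
```

The second call fails because the only 'ffd9' lies 70000 bytes from the end of the data, outside the 65536 byte window.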
Inspect partial matches
roy inspect INDEX
(this inspect function is used in conjunction with sf -log debug and sf -log slow. In debug mode sf reports all the raw hits from the byte matcher. In slow mode sf reports the indexes of signatures which are delaying an earlier identification. You can identify which PUIDs are responsible for debug or slow hits by inspecting their indexes with roy).
For example, the output of roy inspect 270 is:
Results at 270: pronom: fmt/131 (identifies results reported by -log slow)
Hits at 270: pronom: fmt/41 (identifies hits reported by -log debug)
Results have one associated PUID. Hits have one or many associated PUIDs.
Slow mode
sf can be run in slow mode when scanning files or directories:
sf -log slow,stdout file.ext | DIR
When running in this mode, sf will report any signatures that are delaying early identification of a file's format. Use the roy inspect command to determine which PRONOM formats are represented by the result indexes returned by this command.
An example
Running sf on large XML files can take a long time as full file scans are necessary. Why can't sf identify these files sooner? sf -log slow can help in this type of analysis. For large files, sf will report the indexes of any signatures it is waiting on at a certain point in the scan (around 500,000 bytes from the beginning of file). sf also reports whether a signature is "potentially excludable": this means that the signature could be ruled out if its segments are more closely inspected (this is a data point that might inform future enhancements to sf but probably won't have much utility beyond that use). When a match is finally made, that will be reported too. Below is an example of the slow output for a large XML file.
We can see that three signatures cause us to wait. You can identify these signatures by giving the index number to roy inspect. For example, roy inspect 191 tells us that the first signature is fmt/91. You can run roy inspect fmt/91 to get the name associated with this PUID and view its byte signature.
It turns out that all three "slow" XML signatures are SVG signatures that are looking for <svg> at any point in the file (after wildcards). This quick analysis points to improvements that could be made to the PRONOM database to speed up all XML file matching.
Debug mode
sf can be run in debug mode when scanning files or directories:
sf -log debug,stdout file.ext
When running in this mode, sf will report any hits reported by the byte matchers (the container byte matchers as well as the primary byte matcher). You can use the roy inspect command to determine which formats are partially matched by these hits (but only for hits from the primary byte matcher; container hits cannot be traced in this way).
An example
sf -log debug has a number of different uses. It can be used to identify 'noisy' signatures (signatures that cause a lot of false positives and slow down identification for all formats). I use it to debug siegfried itself. It also provides a head start when following up on format identification failures. For example, consider this JPEG below:
PRONOM's JPEG signatures currently don't match this image. sf gives this result, marking the file as unknown but suggesting possibilities of fmt/42, fmt/43, fmt/44, fmt/41, fmt/112, x-fmt/390, x-fmt/391, x-fmt/398, or fmt/645, based on the extension.
So what is going wrong?
If we identify the file using the -log debug flag we get these partial matches:
Understanding strikes
Strikes are hits reported either by the BOF and EOF multiple string matchers or the BOF and EOF frame matchers. The frame matchers are a fallback for signatures that can't be found using simple string matching alone (e.g. .tar): for these we just do sequential pattern matching. Strikes from frame matchers are marked "frame hit". All of the strikes above are string matches, as they are marked "sequence hit".
The Offset value provides the offset of the hit, calculated either from the beginning or end of the file. The Length is the length of the string match.
The index value points to the follow-up test for the sequence. The number following (in brackets) is the sequence's position within a group of sequences; "0" is first. For example, consider a signature that has "ABC", followed by a number of random bytes, and then "DEF". Siegfried's string matching algorithm will store these two strings as a pair and will only report "DEF" hits if "ABC" has already matched. These follow-up indexes can be used with roy inspect to identify which actual signatures are matched by the follow-up tests.
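The "ABC" then "DEF" behaviour can be mimicked with a short Python sketch. This is an assumption-laden toy, not siegfried's matcher: paired_strikes is a hypothetical function, it handles a single pair of strings, and it scans byte by byte rather than using Aho-Corasick.

```python
def paired_strikes(data: bytes, first: bytes, second: bytes) -> list:
    """Return (position in group, offset) strikes for a pair of sequences.

    The second sequence is reported only after the first has matched.
    """
    if not first or not second:
        raise ValueError("sequences must be non-empty")
    strikes = []
    first_seen = False
    i = 0
    while i < len(data):
        if data.startswith(first, i):
            strikes.append((0, i))      # position 0: first sequence in the group
            first_seen = True
            i += len(first)
        elif data.startswith(second, i):
            if first_seen:
                strikes.append((1, i))  # position 1: reported only after position 0
            i += len(second)
        else:
            i += 1
    return strikes


# The leading "DEF" is ignored because "ABC" has not been seen yet.
print(paired_strikes(b"DEF..ABC..DEF", b"ABC", b"DEF"))
```

The first number in each tuple plays the role of the bracketed position described above, with 0 for the first sequence in the group.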
Getting to the bottom of things...
Returning to our example, the very first strike 270[0] is the hit we are expecting. If we do roy inspect 270, fmt/41 (Raw JPEG Stream) is the result. If we view the signature (using roy inspect fmt/41), we see that the first signature has two segments: a BOF segment (which is this strike) and an EOF segment. The last three strikes reported are for EOF segments but none of these are for fmt/41. So it seems that the problem with this JPEG is that it is lacking the EOF sequence ffd9 within a maximum offset of 65536 from the end of the file.
We can confirm this by rebuilding our signature file and excluding EOF segments (roy build -noeof). Running sf with this customised signature file correctly identifies the image.
Mystery solved!
It turns out that the EOF ffd9 sequence does appear in this image, but it appears at a point too far from the end of file to register (at file offset 1050297, 140730 bytes from the end of file). You can use a hex editor to locate the sequence. A bit of web research reveals that this is an issue that Jay Gattuso has previously identified. Following Jay's post, the PRONOM team added the 65536 buffer to this signature... perhaps an even larger maximum EOF offset is necessary? One way to calculate an appropriate new EOF offset would be to create a test signature with a wild EOF offset for this sequence and run sf over a large sample of similar files. Siegfried reports offsets in the "basis" field of its results and you could use the largest reported offset to update this signature.
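As an alternative to a hex editor, a few lines of Python can report where the last ffd9 sits relative to the end of a file. This is a sketch under stated assumptions: last_marker_distance is a hypothetical helper, and the distance is measured from the end of the marker to the end of the data, which may not be the exact convention behind the figures quoted above.

```python
def last_marker_distance(data: bytes, marker: bytes = bytes.fromhex("ffd9")):
    """Return (offset from BOF, bytes between marker end and EOF), or None."""
    offset = data.rfind(marker)
    if offset == -1:
        return None
    return offset, len(data) - (offset + len(marker))


# Synthetic data; for a real file you could read the bytes first, for example
# with pathlib.Path("image.jpg").read_bytes() (the path is a placeholder).
sample = bytes.fromhex("ffd8ff") + bytes.fromhex("ffd9") + b"\x00" * 10
print(last_marker_distance(sample))
```

A second value larger than the signature's maximum EOF offset would explain why the EOF segment never registers.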