Identifying file formats - richardlehane/siegfried GitHub Wiki

Scanning commands

Siegfried is a command line tool. If you aren't a frequent user of command prompts, it may be helpful to start by reading one of these guides: Windows, OSX or Linux

Example

asciicast

Scanning files

Scanning a file with siegfried just involves navigating to the right directory (using the cd command) and running:

sf file.ext

Scanning directories

You can scan the whole contents of directories by providing a directory rather than a file as the first argument:

sf DIR

By default, siegfried will descend down into all the subdirectories of that directory. You may not want this (especially if it is a large directory) and can prevent it with a -nr flag (for "no recurse") like so:

sf -nr DIR

Tip: Because sf is run from the command line, you can use the standard ctrl-C key combination to kill the command if you accidentally start descending down a really big directory tree.

Scanning a list of files and/or directories

You can also provide a list of files or directories to scan, e.g.:

sf myfile1.doc mydir myfile3.txt   

Saving output to files

Siegfried's output prints nicely in terminals but oftentimes you'll want to keep the scan results. To do this, simply redirect (>) the output to a results file:

sf file.ext (or DIR) > my_results.yaml

JSON, CSV and DROID CSV

I gave my_results a .yaml extension because the default output format is YAML. You can switch the output to JSON, CSV or DROID CSV with the -json, -csv and -droid flags:

sf -json DIR > my_results.json
sf -csv DIR > my_results.csv
sf -droid DIR > my_results.csv

Scanning archive formats (zip, tar, gzip, warc, arc)

By default, siegfried does not scan within archive formats. To scan within the contents of zip, tar, gzip, warc or arc files use the -z flag:

sf -z file.ext
sf -z DIR

Calculating checksums (md5, sha1, sha256, sha512, crc)

To include file hashes with your identification results, use the -hash flag:

sf -hash md5 file.ext
sf -hash sha1 DIR

Scanning content piped to stdin

If you use - instead of a file or directory argument, then sf will scan from stdin. This allows you to do things like:

cat myfile.doc | sf -

You can optionally pass a filename with the -filename flag, in order to enable filename/extension matching:

cat myfile.doc | sf -filename myfile.doc -  

Scanning a list of files and directories

The -f flag causes sf to scan a newline separated list of files and directories. E.g. sf -f myfiles.txt will scan each of the files and directories listed in myfiles.txt.

You can combine this with the - argument to read from a list piped to stdin. This allows you to do things like:

find */*.doc | sf -f -    [on Mac or Linux]
dir /b /s *.doc | sf -f - [on Win]

Throttling directory scans

If you use the -throttle [Duration] flag, sf will pause between files when scanning directories. For example:

sf -throttle 50ms DIR

will pause for 50 milliseconds between each file scan.

This flag can be useful if you encounter bandwidth issues running sf.

Continue on error

You can use the continue on error flag (-coe) to prevent sf halting when it encounters a fatal file error. This can be useful e.g. for scans over unreliable networks.

sf -coe DIR

File modified datetime in UTC

By default, the file modified field is formatted according to your system's local time. You can have this converted to UTC by using the -utc flag.

sf -utc DIR

Scanning many files at once

If you use the -multi [Number] flag, sf will scan up to that number of files at once. For example:

sf -multi 256 DIR

will scan up to 256 files at once.

The default for -multi is 1 as the flag can slow down matching (especially when IO is the bottleneck - e.g. spinning disks or network scans). Depending on your platform, you may find you get dramatic improvements in speed by using the -multi flag and it is worth experimenting with. For example, on the lab PC at my workplace (which has SSD hard drives, a fast processor, and a lot of RAM) using -multi 256 speeds up a scan of the Govdocs Selected Corpus (31 Gb) from 6m41s to 36s.

Note: The -multi flag can't be stacked with the -z flag (if you attempt to, sf will provide a warning and automatically drop down to single scan mode).

Saving scan settings

You can save and load frequently used scan settings with the -setconf and -conf flags.

To save a default configuration, execute the sf command with the -setconf flag and any other desired combination of flags (e.g. -csv -multi 32):

sf -setconf -multi 32 -hash sha1 -csv

This will save your default configuration in a sf.conf file within your siegfried home directory.

Default configurations are loaded each time you run sf. Any other flags you use when running sf will either override or add to that configuration. I.e. running sf -json . with the above default configuration would override the -csv flag and expand to sf -multi 32 -hash sha1 -json .. You can reset your default configuration by running sf -setconf.

You can save multiple named configurations by using the -conf flag. E.g. the following command saves a configuration to a server.conf file:

sf -setconf -serve :5138 -z -hash md5 -conf server.conf    

You can then invoke sf with that named configuration with:

sf -conf server.conf

Help

If you forget a command or option, use sf -help for a list of options.

Working with siegfried output

Logging

The -log flag reports progress, time, errors, warnings, knowns, unknowns, and slow and debug information to either stderr or stdout.

For example, if you're scanning a large directory, you might like to see the progress of your scan. You can do this with -log progress:

sf -log progress -csv DIR > my_results.csv

This command reports progress to stderr (the default output).

The -log flag takes the following options:

progress OR p
time OR t
error OR err OR e
warning OR warn OR w
known OR k
unknown OR u
chart OR c
debug OR d
slow OR s
stdout OR out OR o

You can combine any of these options in comma-separated strings e.g. -log e,w reports all errors and warnings to stderr. -log p,t,e reports progress, errors and time elapsed. -log u,o reports unknowns to stdout (when you direct -log to stdout it replaces the normal result output).

In addition to those specific options, you can also include format IDs (e.g. fmt/1) and sets (e.g. @pdf) in your log string. These can be combined with the normal logging options. E.g. -log u,fmt/1,fmt/2,@pdf,o reports to stdout unknowns, any files that identify as fmt/1 or fmt/2, and any files that identify as one of the formats in the @pdf set.

Knowns, Unknowns, Formats and Sets

The -log known and -log unknown commands output lists of files that are either recognised or not recognised.

The -log fmt/1,fmt/2 and -log @pdf,@tiff commands output lists of files that are recognised as having those specific format IDs or belonging to the provided sets.

One use for these commands is in combination with a modified signature file (see Building a signature file with roy). For example, you could create a signature file that only recognises pdf formats with:

roy build -limit @pdf -name pdf_only pdf.sig

Using sf -log known you could then filter all the pdf files in a given directory for further processing by some other command, such as tika:

sf -sig pdf.sig -log known,stdout . > temp.out && java -jar tika-app.jar -t -i . -o ~/local/out -fileList temp.out

Another, slightly slower, way to accomplish the same task would be to use your regular signature file and log the PDF formats directly:

sf -log @pdf,stdout . > temp.out && java -jar tika-app.jar -t -i . -o ~/local/out -fileList temp.out

You might also want to send a list of unknowns to the file command:

sf -log unknown ~/local/files 2> temp.out && file -f temp.out

You can even pipe results from these commands back to sf itself. For example, you might run a full identification over all the non-pdf files in a directory:

sf -sig pdf.sig -log unknown,stdout . | sf -f -

Replaying a scan from results file(s)

You can use the -replay command to simulate a scan from one or more results files (CSV, JSON, YAML or even DROID and Fido results). E.g.

sf -replay myresults.yaml

Or:

sf -replay myresults.yaml moreresults.csv

The value of this command is apparent when used with additional flags. E.g. you can use -replay to convert a YAML results file to CSV:

sf -replay -csv myresults.yaml > myresults.csv

Or you can use -replay to concatenate results files together:

sf -replay myresults.yaml moreresults.csv > allresults.yaml

Another use case for -replay is to use logging functions to interactively explore results files. For example:

sf -replay -log chart,o myresults.yaml // view a chart of formats
sf -replay -log error,warn,o myresults.yaml // view warning and errors
sf -replay -log unknown,o myresults.yaml // view unknowns 
sf -replay -log @pdf,o myresults.yaml moreresults.csv // view all PDFs across two results files

This clip shows how the replay command can be used:

asciicast

Comparing results

The roy tool's compare sub-command allows you to view the difference between multiple results file (in any of the sf formats as well as Droid and FIDO results). This sub-command outputs the differences into a CSV file for analysis e.g. in MS Excel.

For example:

roy compare myresults1.yaml myresults2.json droid-results.csv fido-results.csv > comparison.csv

This example does a four-way comparison between two sf results files, a DROID results file and a FIDO results file.

By default this sub-command joins the different results files based on file paths within those results. Sometimes results can't be joined in this way (e.g. because DROID gives absolute paths where the other tools give relative paths). You can use the -join flag with the roy compare sub-command to change the way results are joined.

Options are:

roy compare -join 0 myresults1.yaml myresults2.json // join on full path (default)
roy compare -join 1 myresults1.yaml myresults2.json // join on (local) filename only
roy compare -join 2 myresults1.yaml myresults2.json // join on filename and size
roy compare -join 3 myresults1.yaml myresults2.json // join on filename and modified
roy compare -join 4 myresults1.yaml myresults2.json // join on filename and hash
roy compare -join 5 myresults1.yaml myresults2.json // join on hash only

Interpreting the output

Siegfried output example

Technical provenance fields

Note: the JSON and CSV outputs have identical fields to the YAML output, except that the CSV output omits the technical provenance block. The DROID output mimics the TNA's DROID tool's CSV export.

The first block of information in siegfried output gives a technical provenance for the scan.

This includes information about siegfried (version number), about the date and time of the scan, about the signature file (name and date created), and about the identifiers within that signature file. The default signature file (default.sig) includes a single identifier named "pronom". In the "details" field for an identifier you'll see the versions of DROID signature files used to create the identifier as well as any modifications made to it (e.g. limited BOF, extensions etc.). No modifications are made to the default signature file's PRONOM identifier.

File fields

The second block of information in siegfried ouput describes the file being scanned.

This includes the file's name, size (in bytes), last modified date, and any errors siegfried encountered in attempting to read the file. Treat any errors reported here as red flags warranting further investigation. File errors may prevent matching occurring altogether or they may only affect certain matching processes. For example, a badly structured zip or Microsoft Compound file will prevent prevent container matching and generate a file error but the byte matcher will still report its results.

Identification fields

The third block of information in siegfried output is a list of matches reported by the identifiers within the signature file. All identifiers will return at least one match (which may have the special value "UNKNOWN") and may report multiple matches (if there are multiple matches returned that have equal weighting).

For each match you will see:

  • the name of the identifier returning the match (just "pronom" if you are using the default signature file)
  • the format ID (a unique identifier or the special value "UNKNOWN")
  • the format's name, version and MIME type
  • the basis for the match
  • and any warnings.

The basis field gives a technical justification for why the format has matched. This includes the names of the matchers (extension, container, byte and text matchers) that have triggered the result. If it is a byte matcher result, you will also see comma separated pairs that describe the offsets and lengths of matching segments (signatures may have one or more segments that must be satisfied). If it is a container matcher, you will see the names of matching sub-files as well as the output of any byte matchers that are applied to those sub-files. If it is a text matcher, you will see the character encoding detected.

For example:

'extension match; container name CompObj with byte match at 77, 20; name WordDocument with name only' 

This basis value tells us that file in question matched on extension and triggered a container match, due to the sub-files "CompObj" and "WordDocument", with a byte match for the "CompObj" stream.

The warning field reports any warnings reported by the identifiers during matching. These aren't strictly errors but may still warrant further investigation. A common warning is for "UNKNOWN" files. The warning text for "UNKNOWN" files will list any potential matches based on extension that the byte matcher has excluded.

For help in debugging ID warnings: Inspect and Debug