Wikidata identifier - richardlehane/siegfried GitHub Wiki

The Wikidata identifier uses the file format signatures in Wikidata that are compatible with Siegfried, or can be made so. You download the data to build a new identifier, then use that identifier to scan the objects in your collection.

For developers, the exposed parts of the integration API are documented on go.dev, alongside the rest of the Siegfried interfaces.

For users, a basic understanding of the Roy tool and how to build identifiers will be helpful, as will an understanding of Siegfried's identification capabilities.

Further information about the identifier is summarized in the following sections.


Overview

If you already know Siegfried and how to configure its defaults, the commands below are all you need to use the Wikidata integration.

Harvesting

Harvest a Wikidata signature file as follows: roy harvest -wikidata

The file which this creates can be found at $HOME/<user>/siegfried/wikidata/wikidata-definitions-<wikidata-version>.

The SPARQL query used to generate version 1.0.0 of the Wikidata identifier should be in the Wikidata SPARQL module here.

Version 1.0.0 of the identifier used the following query:

# Return all file format records from Wikidata.
# 
select distinct ?uri ?uriLabel ?puid ?extension ?mimetype ?encodingLabel ?referenceLabel ?date ?relativityLabel ?offset ?sig
where
{
  ?uri wdt:P31/wdt:P279* wd:Q235557.               # Return records of type File Format.
  optional { ?uri wdt:P2748 ?puid.      }          # PUID is used to map to PRONOM signatures proper.
  optional { ?uri wdt:P1195 ?extension. }
  optional { ?uri wdt:P1163 ?mimetype.  }
  optional { ?uri p:P4152 ?object;                 # Format identification pattern statement.
    optional { ?object pq:P3294 ?encoding.   }     # We don't always have an encoding.
    optional { ?object ps:P4152 ?sig.        }     # We always have a signature.
    optional { ?object pq:P2210 ?relativity. }     # Relativity to beginning or end of file.
    optional { ?object pq:P4153 ?offset.     }     # Offset relative to the relativity.
    optional { ?object prov:wasDerivedFrom ?provenance;
       optional { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], <<lang>>". }
}
order by ?uri
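As a rough illustration of the shape of the harvested data, and not of Roy's own harvesting code, the sketch below flattens a response in the standard SPARQL JSON results format into plain dictionaries keyed by the ?fields of the query above. The sample response and the function name flatten_bindings are made up for the example.

```python
import json

# Field names taken from the select clause of the query above.
FIELDS = ["uri", "uriLabel", "puid", "extension", "mimetype", "encodingLabel",
          "referenceLabel", "date", "relativityLabel", "offset", "sig"]


def flatten_bindings(response_text):
    """Turn SPARQL JSON results into a list of {field: value} dictionaries.

    Optional fields that are unbound in a row are returned as None.
    """
    response = json.loads(response_text)
    rows = []
    for binding in response["results"]["bindings"]:
        rows.append({f: binding.get(f, {}).get("value") for f in FIELDS})
    return rows


# Hypothetical, hand-written sample response (not real harvest output).
SAMPLE = json.dumps({
    "head": {"vars": FIELDS},
    "results": {"bindings": [{
        "uri": {"type": "uri", "value": "http://www.wikidata.org/entity/Q27881556"},
        "uriLabel": {"type": "literal", "value": "FLAC"},
        "extension": {"type": "literal", "value": "flac"},
    }]},
})

rows = flatten_bindings(SAMPLE)
print(rows[0]["uriLabel"], rows[0]["puid"])
```

Because most of the query is wrapped in optional clauses, any consumer of the rows has to expect missing values, which is why unbound fields are mapped to None here.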

Changing the WikiBase URI

If you have access to another WikiBase implementation that can serve Wikidata compatible file format information you can change the URI from where the information is harvested as follows:

  • roy harvest -wikidataendpoint http://sparql.example.com

Changing the Wikidata results language

The Wikidata results can be returned in a different language. Where a translation is available for a file format or a format's native language was something other than English this can be useful for finding the information you need. A different language can be returned using the following command:

  • roy harvest -wikidata -lang <two-letter-language-code>

E.g. German (DE)

  • roy harvest -wikidata -lang de

An example format with a German translation is "Microsoft Shortcut" or "Dateiverknüpfung": http://www.wikidata.org/entity/Q1109779

E.g.

filename : 'shortcut.lnk'
filesize : 8
modified : 2021-04-18T17:07:20+02:00
errors   : 
matches  :
  - ns      : 'wikidata'
    id      : 'Q1109779'
    format  : 'Dateiverknüpfung'
    URI     : 'http://www.wikidata.org/entity/Q1109779'
    mime    : 'application/x-ms-shortcut'
    basis   : 'byte match at 0, 8'
    source  : 'Wikidata reference is empty'
    warning : 'extension mismatch'

Where a translation doesn't exist for a file format, the translation will fall back, for now, to English (EN).
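The version 1.0.0 query in the Harvesting section contains a <<lang>> placeholder in its label service line. A minimal sketch of how such a placeholder might be filled from a two-letter language code is shown below. This is an assumption about the mechanism, not Roy's actual code, and fill_language is a hypothetical helper.

```python
# Label service line as it appears in the version 1.0.0 query.
LABEL_SERVICE = (
    'service wikibase:label '
    '{ bd:serviceParam wikibase:language "[AUTO_LANGUAGE], <<lang>>". }'
)


def fill_language(query, lang="en"):
    """Replace the <<lang>> placeholder with a two-letter language code."""
    lang = lang.strip().lower()
    if len(lang) != 2 or not lang.isalpha():
        raise ValueError(f"expected a two-letter language code, got {lang!r}")
    return query.replace("<<lang>>", lang)


print(fill_language(LABEL_SERVICE, "DE"))
```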

Building

Siegfried's binary representation of a signature file is called an Identifier. The identifier must be compiled. It can be compiled using many different combinations discussed in the Roy documentation. We will focus on Wikidata with PRONOM and Wikidata without PRONOM.

PRONOM

Wikidata is built with PRONOM by default. The build looks for PRONOM identifiers (PUIDs) in the Wikidata dataset. A Wikidata record with a PUID might not have a signature of its own where PRONOM does, so combining the two information sources lets the identifier recognize more formats than Wikidata alone.

  • roy build -wikidata

No PRONOM

To build a Wikidata identifier without PRONOM, for example, to test your Wikidata developed signatures more easily:

  • roy build -wikidata -nopronom

Logging

Roy outputs information specific to the Wikidata identifier, which shows how much of the information in Wikidata you can expect to use. The example below shows 192 records with signatures, meaning the Wikidata identifier on its own can identify up to 192 file formats through binary pattern matching.

{
  "AllSparqlResults": 13187,
  "CondensedSparqlResults": 4582,
  "SparqlRowsWithSigs": 2927,
  "RecordsWithPotentialSignatures": 196,
  "FormatsWithBadHeuristics": 4,
  "RecordsWithSignatures": 192,
  "MultipleSequences": 11,
  "AllLintingMessages": [
    "Use the `-wikidataDebug` flag to build the identifier to see linting messages"
  ],
  "AllLintingMessageCount": 134,
  "RecordCountWithLintingMessages": 116
}

Linting can help you in developing file format signatures in Wikidata and identifying errors in that process. It is described below.
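The build summary above can be read programmatically. The sketch below loads the example figures (with the linting messages omitted) and works out how many candidate records did not end up with a usable signature. In this example that number equals FormatsWithBadHeuristics. Whether that relationship holds for every build is an assumption, and dropped_records is a hypothetical helper.

```python
import json

# Figures copied from the example summary above (linting messages omitted).
SUMMARY = json.loads("""
{
  "AllSparqlResults": 13187,
  "CondensedSparqlResults": 4582,
  "SparqlRowsWithSigs": 2927,
  "RecordsWithPotentialSignatures": 196,
  "FormatsWithBadHeuristics": 4,
  "RecordsWithSignatures": 192,
  "MultipleSequences": 11,
  "AllLintingMessageCount": 134,
  "RecordCountWithLintingMessages": 116
}
""")


def dropped_records(summary):
    """Records that looked like they had a signature but could not be used."""
    return (summary["RecordsWithPotentialSignatures"]
            - summary["RecordsWithSignatures"])


dropped = dropped_records(SUMMARY)
print(dropped, dropped == SUMMARY["FormatsWithBadHeuristics"])
```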

Linting

Linting helps Wikidata signature developers identify potential issues in Wikidata signatures. The technique used in version 1.9.x of Siegfried is not perfect. We anticipate stronger schema checking against the Wikidata data source using ShEx in time.

For now, when you build using the following parameters, you will see additional "linting" information to help you identify records in Wikidata that can be improved with your attention:

  • roy build -wikidatadebug
  "AllLintingMessages": [
    "Linting: WARNING no encoding: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance date: URI: http://www.wikidata.org/entity/Q98843338 Critical: false",
    "Linting: ERROR bad heuristic: URI: http://www.wikidata.org/entity/Q1109779 Critical: true",
    "Linting: ERROR blank node returned for offset: URI: http://www.wikidata.org/entity/Q26546575 Critical: false",
    "Linting: WARNING no relativity: URI: http://www.wikidata.org/entity/Q939636 Critical: false",
  ],
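When a build produces many linting messages it can help to group them by record. The sketch below parses messages of the form shown above with a regular expression and groups them by URI. The pattern is inferred from these six example messages only, so treat it as an assumption rather than a specification of Roy's output.

```python
import re
from collections import defaultdict

# Pattern inferred from the example messages above (an assumption).
LINT_RE = re.compile(
    r"^Linting: (?P<level>WARNING|ERROR) (?P<kind>.+?): "
    r"URI: (?P<uri>\S+) Critical: (?P<critical>true|false)$"
)

MESSAGES = [
    "Linting: WARNING no encoding: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance: URI: http://www.wikidata.org/entity/Q4839791 Critical: false",
    "Linting: WARNING no provenance date: URI: http://www.wikidata.org/entity/Q98843338 Critical: false",
    "Linting: ERROR bad heuristic: URI: http://www.wikidata.org/entity/Q1109779 Critical: true",
    "Linting: ERROR blank node returned for offset: URI: http://www.wikidata.org/entity/Q26546575 Critical: false",
    "Linting: WARNING no relativity: URI: http://www.wikidata.org/entity/Q939636 Critical: false",
]


def parse_lint(message):
    """Split one linting message into level, kind, uri and critical."""
    match = LINT_RE.match(message)
    if match is None:
        raise ValueError(f"unrecognized linting message: {message!r}")
    fields = match.groupdict()
    fields["critical"] = fields["critical"] == "true"
    return fields


def group_by_uri(messages):
    """Map each record URI to the list of issues reported for it."""
    grouped = defaultdict(list)
    for message in messages:
        parsed = parse_lint(message)
        grouped[parsed["uri"]].append(f'{parsed["level"]} {parsed["kind"]}')
    return dict(grouped)


grouped = group_by_uri(MESSAGES)
for uri, issues in grouped.items():
    print(uri, issues)
```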

The individual linting messages are described in more detail below:

ERROR bad heuristic

A bad heuristic means that information vital to understanding a signature is missing. Wikidata is the first place to look for the inconsistency. An example might be a file format listed with two BOF sequences but no offset to describe how one relates to the other. Alternatively, the weakness might be in the code, where it is not yet sophisticated enough to work with the data available to it. If you believe a heuristic can be created for the information in Wikidata, please open a new Siegfried issue.

ERROR blank node returned

A blank node can be returned for any field that Roy/Siegfried anticipates using. A blank node error is returned for a field that has been deliberately listed in Wikidata but for which there has been no value supplied, e.g. the author recognizes there should be something but does not know what that something is. Roy cannot work with this field/value as it is incomplete. The best way to remedy this is to complete the record in Wikidata.

WARNING no encoding

A no-encoding warning exists in Roy at present because Wikidata can record signatures using multiple encodings, e.g. hexadecimal or ASCII. If an encoding isn't specified, Roy will try to parse or convert the data to hexadecimal. If that works, no error is thrown and the signature can be used. The warning is a signal to the signature developer to rectify the issue in the Wikidata record.
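To make the fallback concrete, here is a minimal sketch of "try hexadecimal first, otherwise convert". It is not Roy's implementation; the function name to_hex and the encoding labels "hexadecimal" and "ascii" are assumptions chosen for the example.

```python
def to_hex(signature, encoding=None):
    """Return the signature as a lowercase hexadecimal string.

    encoding is "hexadecimal", "ascii", or None when Wikidata states nothing.
    """
    if encoding == "ascii":
        return signature.encode("ascii").hex()
    try:
        # bytes.fromhex raises ValueError for anything that is not valid hex.
        return bytes.fromhex(signature).hex()
    except ValueError:
        if encoding == "hexadecimal":
            raise  # the record claims hex but the value is not hex
        # No encoding stated: fall back to treating the value as ASCII text.
        return signature.encode("ascii").hex()


print(to_hex("664C6143"))          # already hexadecimal
print(to_hex("fLaC"))              # not hexadecimal, converted from ASCII
print(to_hex("CAFE", "ascii"))     # explicit encoding removes the ambiguity
```

A value such as CAFE is valid as both hexadecimal and ASCII, so a guess can silently go the wrong way. That ambiguity is a good reason to record the encoding in Wikidata rather than rely on any fallback.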

WARNING no provenance

A no-provenance warning indicates that the signature information in Wikidata has no listed reference. A default value will be used in the output by Siegfried. The remedy is to find a suitable provenance for the information in Wikidata and edit the record directly.

WARNING no provenance date

A no-provenance-date warning indicates that the signature information in Wikidata has no listed date for its reference. No value will be used by Siegfried. The remedy is to find a suitable provenance date for the information in Wikidata and edit the record directly.

WARNING no relativity

A no-relativity warning indicates that the signature information in Wikidata has no listed relativity value, e.g. it is not listed as BOF (beginning of file) or EOF (end of file). A default value of BOF will be used by Siegfried. The remedy is to determine the correct relativity for the signature and edit the record in Wikidata directly.
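The defaults described in the warnings above can be summarized in a short sketch. The record layout, the helper apply_defaults and the default source text (borrowed from the 'Wikidata reference is empty' value in the earlier example output) are assumptions for illustration, not Roy's internal data model.

```python
DEFAULT_RELATIVITY = "BOF"
# Assumption: this is the text seen in the "source" field of the example output.
DEFAULT_SOURCE = "Wikidata reference is empty"


def apply_defaults(record):
    """Fill in documented defaults and report which warnings would apply."""
    result = dict(record)
    warnings = []
    if not result.get("relativity"):
        result["relativity"] = DEFAULT_RELATIVITY
        warnings.append("no relativity")
    if not result.get("reference"):
        result["reference"] = DEFAULT_SOURCE
        warnings.append("no provenance")
    if not result.get("date"):
        result["date"] = None  # no value is substituted for a missing date
        warnings.append("no provenance date")
    return result, warnings


completed, warnings = apply_defaults({"sig": "664C6143"})
print(completed)
print(warnings)
```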

Inspect

Signatures can be inspected on a case-by-case basis. To look up a Wikidata identifier and see the signature compiled into the identifier, you can do the following:

  • roy inspect -wikidata <Wikidata-QID>

E.g. FLAC (Free Lossless Audio Codec)

  • roy inspect -wikidata Q27881556
FORMAT INFO: NAME: 'FLAC'
MIMETYPE: 'AUDIO/X-OGG; AUDIO/X-FLAC; AUDIO/FLAC'
SOURCES: 'GARY KESSLER'S FILE SIGNATURE TABLE (SOURCE DATE: 2017-08-08) PRONOM (OFFICIAL (FMT/279))'
QID: (Q27881556)
globs: *.flac, *.oga
sigs: (B:0 seq "fLaC\x00\x00\x00\"")
      (B:0..4 seq "fLaC\x00\x00\x00\"")
superiors: none
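For intuition, the two sigs lines above can be imitated with a simple byte comparison. The sketch assumes that B:0 means "at offset 0 from the beginning of the file" and that B:0..4 means "at any offset from 0 to 4". That reading of the inspect notation is an assumption, and Siegfried's real matching engine works differently from this loop.

```python
# The sequence from the inspect output; the escaped quote \" is byte 0x22.
FLAC_MAGIC = b"fLaC\x00\x00\x00\x22"


def matches_bof(data, sequence, min_offset=0, max_offset=None):
    """Return True if sequence occurs at an offset in [min_offset, max_offset]."""
    if max_offset is None:
        max_offset = min_offset
    for offset in range(min_offset, max_offset + 1):
        if data[offset:offset + len(sequence)] == sequence:
            return True
    return False


exact = FLAC_MAGIC + b"audio frames..."
shifted = b"\x00\x00" + FLAC_MAGIC + b"audio frames..."

print(matches_bof(exact, FLAC_MAGIC))            # fixed offset 0
print(matches_bof(shifted, FLAC_MAGIC))          # fixed offset 0
print(matches_bof(shifted, FLAC_MAGIC, 0, 4))    # offset range 0..4
```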

Scanning

Once an identifier is built, Siegfried does not require any special invocation to use Wikidata. A standard command might be sf <your-file-name>. The result, e.g. for img.bmp, will look something like the following:

---
siegfried   : 1.9.1
scandate    : 2020-11-15T22:24:56-05:00
signature   : default.sig
created     : 2020-11-15T22:24:35-05:00
identifiers : 
  - name    : 'wikidata'
    details : 'wikidata-definitions-1.0.0 (2020-11-15)'
---
filename : 'img.bmp'
filesize : 35
modified : 2020-11-15T22:26:09-05:00
errors   : 
matches  :
  - ns       : 'wikidata'
    id       : 'Q27596325'
    format   : 'Windows Bitmap, version 4'
    URI      : 'http://www.wikidata.org/entity/Q27596325'
    mime     : 
    basis    : 'extension match bmp; byte match at 0, 35'
    source   : 'PRONOM (Wikidata) (source date: 2017-08-08)'
    warning  : 
    software : 
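If you post-process results, the basis field is a convenient place to check how a match was made. The sketch below splits a basis string into its parts and pulls out the number pairs from any byte match. The meaning of the two numbers is not stated here, so they are returned as an uninterpreted pair, and parse_basis is a hypothetical helper.

```python
import re

BYTE_MATCH_RE = re.compile(r"byte match at (\d+), (\d+)")


def parse_basis(basis):
    """Split a basis string into its parts and extract byte-match number pairs."""
    parts = [part.strip() for part in basis.split(";") if part.strip()]
    byte_matches = [(int(a), int(b)) for a, b in BYTE_MATCH_RE.findall(basis)]
    return {"parts": parts, "byte_matches": byte_matches}


result = parse_basis("extension match bmp; byte match at 0, 35")
print(result["parts"])
print(result["byte_matches"])
```

An empty byte_matches list would suggest the identification rested on something weaker than a signature, such as the file extension alone.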

Custom Wikibase

Siegfried can use format definitions harvested from a custom Wikibase.

Wikibase is an extension of MediaWiki, the software best known for running Wikipedia. Wikidata is a Wikibase, but it is also possible to run custom Wikibase instances, e.g. wikibase.cloud.

Your Wikibase needs to be configured so as to satisfy the fields in the following SPARQL request:

PREFIX p: <http://wikibase.example.com/prop/>
PREFIX pd: <http://wikibase.example.com/prop/direct/>
PREFIX ps: <http://wikibase.example.com/prop/statement/>
PREFIX pq: <http://wikibase.example.com/prop/qualifier/>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX ref: <http://wikibase.example.com/prop/reference/>
PREFIX wd: <http://wikibase.example.com/entity/>
PREFIX ws: <http://wikibase.example.com/entity/statement/>

select distinct ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig
where {
  	  ?uri pd:P9 wd:Q1.
	  optional { ?uri pd:P6 ?extension. }
	  optional { ?uri pd:P7 ?mimetype.  }
	  optional { ?uri p:P8 ?object;
	    optional { ?object pq:P2 ?encoding.   }
	    optional { ?object ps:P8 ?sig.        }
	    optional { ?object pq:P3 ?relativity. }
	    optional { ?object pq:P10 ?offset.     }
	    optional { ?object prov:wasDerivedFrom ?provenance;
	       optional { ?provenance ref:P4 ?reference;
	                              ref:P5 ?date.
	                }
	    }
	  }
   service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
order by ?uri

NOTE: In the request above, the values beginning with ? must remain the same; these are the 'fields'. The property and entity identifiers, e.g. pd:P6, pq:P3, etc., must be replaced with the respective values from your custom instance.
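The PREFIX lines in the request above all hang off one base URL, so they can be generated rather than edited by hand. The sketch below assumes your Wikibase uses the same path layout as the wikibase.example.com example (prop/, prop/direct/, entity/ and so on). Check this against your own instance, as the layout and the helper prefix_block are assumptions.

```python
# Path suffixes copied from the PREFIX lines in the example request.
PREFIX_PATHS = {
    "p": "prop/",
    "pd": "prop/direct/",
    "ps": "prop/statement/",
    "pq": "prop/qualifier/",
    "ref": "prop/reference/",
    "wd": "entity/",
    "ws": "entity/statement/",
}
# The provenance namespace is a W3C URI and does not depend on your instance.
PROV_PREFIX = "PREFIX prov: <http://www.w3.org/ns/prov#>"


def prefix_block(base_url):
    """Build the PREFIX lines for a Wikibase served from base_url."""
    base = base_url.rstrip("/") + "/"
    lines = [f"PREFIX {name}: <{base}{path}>" for name, path in PREFIX_PATHS.items()]
    lines.append(PROV_PREFIX)
    return "\n".join(lines)


block = prefix_block("http://wikibase.example.com")
print(block)
```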

If you are able to generate results for each of the fields (those marked ?) in your modified version of the query above, then you can harvest this information using roy. To do this, you need to store the SPARQL in Siegfried's %HOME% folder as wikibase.sparql.

The harvest command then might be:

./roy harvest \
   -wikidata \
   -wikidataendpoint http://wikibase.example.com:8834/proxy/wdqs/bigdata/namespace/wdq/sparql? \
   -wikibaseurl http://wikibase.example.com/

Roy cannot interpret this information without a little help. At runtime it needs to be able to know which Wikibase IRIs are for PRONOM, BOF, and EOF values.

In Siegfried's %HOME% folder, create a wikibase.json file. It will look something like this:

{
 "PronomProp": "http://wikibase.example.com/entity/Q2",
 "BofProp": "http://wikibase.example.com/entity/Q3",
 "EofProp": "http://wikibase.example.com/entity/Q4"
}

Again, replace the base URL with that of your Wikibase instance.
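A small sketch for producing the contents of wikibase.json from a base URL and three entity IDs is shown below. The key names come from the example above. The helper wikibase_config and the assumption that entities live under <base>/entity/ are illustrative, and the output is printed rather than written so that you can place it in the right folder yourself.

```python
import json


def wikibase_config(base_url, pronom_qid, bof_qid, eof_qid):
    """Return the wikibase.json mapping for the given Wikibase entities."""
    entity = base_url.rstrip("/") + "/entity/"
    return {
        "PronomProp": entity + pronom_qid,
        "BofProp": entity + bof_qid,
        "EofProp": entity + eof_qid,
    }


config = wikibase_config("http://wikibase.example.com", "Q2", "Q3", "Q4")
print(json.dumps(config, indent=1))
```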

With this information in place you simply need to run Roy's build command.

./roy build -wikidata

Roy interprets the harvest results, determines that they come from a custom Wikibase, and applies the custom properties that you have set up.

A special note on using the PRONOM entity for your signatures

By far the most flexible option for writing signatures for Wikidata or Wikibase is PRONOM regular expression syntax, at least until we can invent something more user friendly that is also capable of looking at container formats.

When you are creating your custom Wikibase you will need to create an encoding property, and a PRONOM internal signature entity. That way you can state that your signatures are written using PRONOM syntax.

PRONOM syntax is described on ffdev.info in the Guide section.

PRONOM syntax can incorporate plain hexadecimal sequences for more static magic numbers. The regular expression syntax comes into its own when you require added variability. PRONOM has described over 1500 file format signatures using this syntax.

Using the custom Wikibase functionality for Wikidata

An added bonus of being able to customize queries for your own Wikibase is that you can use the same technique to customize a query for Wikidata. There are three principles:

  1. The foundational shape of the SPARQL query (especially ?fields) must still exist after customization.
  2. Any part of the SPARQL query can be modified to filter the results further.
  3. The shape of the existing SPARQL query can be added to, e.g. adding a new ?predicate to the query to make it more precise.
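The first principle can be checked mechanically before harvesting. The sketch below looks at the select clause of a customized query and reports any of the expected ?fields that have gone missing. The field list is taken from the custom Wikibase query earlier on this page. The helper missing_fields is hypothetical and is not something Roy provides.

```python
import re

# Fields selected by the custom Wikibase query earlier on this page.
REQUIRED_FIELDS = ["uri", "uriLabel", "puid", "extension", "mimetype", "encoding",
                   "referenceLabel", "date", "relativity", "offset", "sig"]


def missing_fields(query):
    """Return the required ?fields that are absent from the select clause."""
    # Drop whole-line comments so words inside them cannot confuse the search.
    code = "\n".join(line for line in query.splitlines()
                     if not line.lstrip().startswith("#"))
    match = re.search(r"select\s+(?:distinct\s+)?(.*?)\bwhere\b", code,
                      re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("no select ... where clause found")
    selected = set(re.findall(r"\?(\w+)", match.group(1)))
    return [field for field in REQUIRED_FIELDS if field not in selected]


CUSTOM_QUERY = """
# Custom query example where ?sig was removed by mistake.
select distinct ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset
where { ?uri pd:P9 wd:Q1. }
"""

print(missing_fields(CUSTOM_QUERY))
```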

Some use cases may include:

  • Removing excess information from signature files that aren't required for your use-case.
  • Outputting information that has a clearer provenance, e.g. file format signatures from your own institution.

Three queries are shown below for:

  1. Returning TrID-only results.
  2. Returning only Raster Graphics Formats in a signature file.
  3. Returning only file formats edited after 1 January 2023.

First, the basic configuration that you will need is as follows:

  • Connection string: to connect to Wikidata using the custom service you will need to use the following URLs:

    roy harvest -wikidata -wikidataendpoint https://query.wikidata.org/sparql? -wikibaseurl https://www.wikidata.org/w/api.php

  • You will use a wikibase.json file with the following json:

    {
       "PronomProp": "http://www.wikidata.org/entity/Q35432091",
       "BofProp": "http://www.wikidata.org/entity/Q35436009",
       "EofProp": "http://www.wikidata.org/entity/Q1148480"
    }

    NB. We're only able to use values using PRONOM regular expression syntax initially. This also covers hexadecimal but requires a PRONOM internal signature (Q35432091) encoding.

TrID example SPARQL

The result of this query is a signature file with only TrID signatures. In wikibase.sparql you will need the following SPARQL:

# Return all file format records from Wikidata.
# 
# Custom query example:
# 
# All formats must have a signature.
# All signatures must come from the TrID Q41799265 reference.
# 
# NB. Keep in mind all optional fields as they increase the
# number of fields where schemas aren't consistent across entries.
# 
SELECT DISTINCT ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig WHERE {
  ?uri (wdt:P31/(wdt:P279*)) wd:Q235557.
  OPTIONAL { ?uri wdt:P2748 ?puid. }
  OPTIONAL { ?uri wdt:P1195 ?extension. }
  OPTIONAL { ?uri wdt:P1163 ?mimetype. }
  ?uri p:P4152 ?object.
  ?object ps:P4152 ?sig;
    prov:wasDerivedFrom ?provenance.
  ?provenance pr:P248 wd:Q41799265, ?reference.  # <-- modified to return TrID only, and TrID's reference label.
  OPTIONAL { ?provenance pr:P813 ?date. }
  OPTIONAL { ?object pq:P3294 ?encoding. }
  OPTIONAL { ?object pq:P2210 ?relativity. }
  OPTIONAL { ?object pq:P4153 ?offset. }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
ORDER BY (?uri)

NOTE: See the comment marked # <--. We require the TrID reference (wd:Q41799265), but keep ?reference so that ?referenceLabel is still output.

Example output after building the Wikidata identifier with roy build -wikidata -nopronom (the format was picked at random from the dataset):

---
siegfried   : 1.9.2
scandate    : 2022-09-07T11:55:20+02:00
signature   : default.sig
created     : 2022-09-07T11:55:18+02:00
identifiers :
  - name    : 'wikidata'
    details : 'wikidata-definitions-3.0.0 (2022-09-07)'
---
filename : 'trid-example-skeleton'
filesize : 6
modified : 2022-09-07T11:55:14+02:00
errors   :
matches  :
  - ns        : 'wikidata'
    id        : 'Q100137240'
    format    : 'VariCAD Drawing'
    URI       : 'http://www.wikidata.org/entity/Q100137240'
    permalink : 'https://www.wikidata.org/w/api.php/w/index.php?oldid=1423314911&title=Q100137240'
    mime      : 'application/octet-stream'
    basis     : 'byte match at 0, 3 (TrID)'
    warning   : 'extension mismatch'

Raster graphics SPARQL

The result of this query is a signature file with only entries for raster-graphics file-formats. In wikibase.sparql you will need the following SPARQL:

# Return all file format records from Wikidata.
# 
# Custom query example:
# 
# Formats must be an instance of, or subclass of raster-graphics file-format.
# 
# 
select distinct ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig
where
{
  ?uri wdt:P31/wdt:P279* wd:Q235557.
  ?uri wdt:P31/wdt:P279* wd:Q105599390.    # <-- line added to return instance/sub-class of raster-graphics file-format
  optional { ?uri wdt:P2748 ?puid.      }
  optional { ?uri wdt:P1195 ?extension. }
  optional { ?uri wdt:P1163 ?mimetype.  }
  optional { ?uri p:P4152 ?object;
    optional { ?object pq:P3294 ?encoding.   }
    optional { ?object ps:P4152 ?sig.        }
    optional { ?object pq:P2210 ?relativity. }
    optional { ?object pq:P4153 ?offset.     }
    optional { ?object prov:wasDerivedFrom ?provenance;
       optional { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  service wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE], en". }
}
order by ?uri

NOTE: See the comment marked # <--. We add a new line that filters by instance-of/subclass-of raster-graphics file-format.

Example output after building the Wikidata identifier with roy build -wikidata -nopronom (the format was picked at random from the dataset):

---
siegfried   : 1.9.2
scandate    : 2022-09-07T12:30:19+02:00
signature   : default.sig
created     : 2022-09-07T12:30:16+02:00
identifiers :
  - name    : 'wikidata'
    details : 'wikidata-definitions-3.0.0 (2022-09-07)'
---
filename : 'raster-example-skeleton'
filesize : 10
modified : 2022-09-07T12:29:20+02:00
errors   :
matches  :
  - ns        : 'wikidata'
    id        : 'Q1143961'
    format    : 'JBIG2'
    URI       : 'http://www.wikidata.org/entity/Q1143961'
    permalink : 'https://www.wikidata.org/w/api.php/w/index.php?oldid=1526516378&title=Q1143961'
    mime      :
    basis     : 'byte match at 0, 8 (Gary Kessler''s File Signature Table (source date: 2017-08-08))'
    warning   : 'extension mismatch'

File formats edited after 1 January 2023

# Return all file format records from Wikidata.
# 
# Custom query example:
# 
# Formats that have been edited since 1 January 2023.
# 
# 
select distinct ?uri ?uriLabel ?puid ?extension ?mimetype ?encoding ?referenceLabel ?date ?relativity ?offset ?sig
where
{
  ?uri wdt:P31/wdt:P279* wd:Q235557.
  optional { ?uri wdt:P2748 ?puid.      }
  optional { ?uri wdt:P1195 ?extension. }
  optional { ?uri wdt:P1163 ?mimetype.  }
  optional { ?uri p:P4152 ?object;
    optional { ?object pq:P3294 ?encoding.   }
    optional { ?object ps:P4152 ?sig.        }
    optional { ?object pq:P2210 ?relativity. }
    optional { ?object pq:P4153 ?offset.     }
    optional { ?object prov:wasDerivedFrom ?provenance;
       optional { ?provenance pr:P248 ?reference;
                              pr:P813 ?date.
                }
    }
  }
  ?uri schema:dateModified ?modifiedDate
  FILTER (?modifiedDate >= "2023-01-01T00:00:00"^^xsd:dateTime)
  service wikibase:label { bd:serviceParam wikibase:language "en". }
}

NB. This is a good query to use when you want to test recent changes to a file format record in Wikidata without downloading the entire dataset. A useful tip, however, is to back up your local copy of the full dataset first, e.g. on Windows C:\Users\Username\siegfried\wikidata\wikidata-definitions-x.x.x, so that you can avoid downloading it again next time around.
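The backup tip can be scripted in a few lines. The sketch below copies a definitions file into a backup folder with a .bak suffix. The demonstration uses a temporary directory and a placeholder file, so substitute the real path to your own wikidata-definitions file. The helper name and the .bak naming are assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def backup_definitions(definitions, backup_dir):
    """Copy a definitions file into backup_dir, appending .bak to its name."""
    source = Path(definitions)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / (source.name + ".bak")
    shutil.copy2(source, target)  # copy2 also preserves the modification time
    return target


# Demonstration with a placeholder file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "wikidata-definitions-x.x.x"
    original.write_text("placeholder")
    copy = backup_definitions(original, Path(tmp) / "backup")
    copied_ok = copy.read_text() == "placeholder"

print(copy.name, copied_ok)
```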

Terminology

  • QID: Q-identifier. Identifiers for items in Wikidata have a Q prefix.
  • Wikibase: A semantic web platform on which Wikidata is built.
  • Roy: Tool to harvest and build (compile) Siegfried identifiers from sources such as Wikidata.
  • Siegfried: A consumer of the Roy built identifier, returning file-format information based on binary pattern matching.