Media Types (aka. MIME) - KinoKabaret/Kinoit-Quasar-Skeleton GitHub Wiki

INTRO

We prefer the term media-type to MIME-type, not only because it is clearer, but also because it is the term the specification uses. Indeed, IANA calls them "Media Types" and says: "RFC2046 specifies that Media Types (formerly known as MIME types) and Media Subtypes will be assigned and listed by the IANA."

To be clear, we NEED to know exactly what type of file we are dealing with before we begin the transcoding process.

e.g. { "media-type" : "audio/mpeg" }

METHOD

Undertaking media-type detection is not a trivial task. Generally speaking, there are six methods of determining media-type (going from least to most trustworthy):

resource name
content type hint
metadata
magic bytes
container analysis
testing

Resource Name

First and foremost, the "resource name" (or, more precisely, its extension) is not a trustworthy description of a file's media-type. It is trivial to rename a file so that it ends in .mp3 when it is really an image/jpeg, or anything else for that matter. For obvious reasons, such a file will not play back in an <audio> tag in HTML. However, the resource name can still be used as an indicator, for reasons explained below in Testing.

Content Type Hint

If a server provides a file, it will often also tell the client what type of file it is. As a service, however, we generally do not have this luxury. In a hostile environment, this type of hint can be even less trustworthy than a resource name.

Metadata

Another approach to getting at the media-type is to use the file metadata often present in media assets. There are a wide variety of tools for doing this, such as:

exiv2
exiftool
mediainfo
ffprobe
gm identify
mplayer -identify
lesspipe

The problem with metadata, however, is that it is not always present, and because it can be edited by whoever supplies the file, it is not ultimately trustworthy. It can, however, be used as an indicator together with the resource name to eliminate false positives or uncertainty in many cases. See Testing below for more information.

Magic Bytes

The first part of a file header often uses a sequence of bytes specific to the media type. (But not always! I'm looking at you, .obj and .stl!) One of the most common tools for reading these magic bytes is the humble file program. Unfortunately, many versions of the Unix file command (e.g. < 1:5.19-2, which older Debian installations still use) fail to detect some mp3 files. One would assume that:

file -b --mime-type "pretty_much_any.mp3"

would return something like audio/mpeg, but for the affected files it returns application/octet-stream instead. This has been fixed in 1:5.22+, but now there is a report that some mp4 containers are also being described as application/octet-stream. Indeed, just today (June 2015), with a current version of file, a video was determined to be audio/mp4!

Another option is to use xdg-mime (which is part of the freedesktop.org xdg-utils, so it targets Linux desktops and is probably a show-stopper for Mac users).

xdg-mime query filetype "pretty_much_any.mp3"

To date, this yields the best results. We are still on the lookout for something better / cross-platform (that is not Java based), so feel free to drop us a line.
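To make the idea of magic bytes concrete, here is a minimal Python sketch that compares the first bytes of a file against a handful of well-known signatures. The table is a tiny illustrative sample and the function name sniff_media_type is hypothetical; this is not how file or xdg-mime are implemented, and a real detector needs a far larger database.

```python
# A few well-known signatures found at offset 0 of a file.
# Illustrative sample only, not a complete database.
SIGNATURES = [
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"ID3", "audio/mpeg"),  # only MP3 files that begin with an ID3v2 tag
]


def sniff_media_type(path):
    """Guess a media type from the leading bytes, ignoring the file name."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    for magic, media_type in SIGNATURES:
        if header.startswith(magic):
            return media_type
    # Nothing matched: fall back to the generic "unknown binary" type.
    return "application/octet-stream"
```

A JPEG that has been renamed to something.mp3 is still reported as image/jpeg here, because the name is never consulted. Note that an MP3 without an ID3v2 tag starts directly with an audio frame rather than a fixed string, so this table would miss it, which hints at why mp3 detection is harder than it looks.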

At any rate, once we have detected a media-type, we can compare the resource names (extensions) allowed for that type with the resource name we have been given. Depending on our level of paranoia, a mismatch is an opportunity either to advance to testing or to throw an error.
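The comparison between a detected media-type and the given resource name could be sketched as follows. This assumes Python's standard mimetypes table is an acceptable source for "allowed" extensions, which is a choice made for this example; the function name extension_matches is hypothetical.

```python
import mimetypes
import os


def extension_matches(resource_name, detected_type):
    """Return True if the file's extension is registered for detected_type."""
    extension = os.path.splitext(resource_name)[1].lower()
    # All extensions the standard library associates with this media type.
    allowed = mimetypes.guess_all_extensions(detected_type)
    return extension in allowed
```

With this, extension_matches("song.mp3", "audio/mpeg") is true, while a file named song.mp3 whose content was detected as image/jpeg is a mismatch, and the caller decides whether that means more testing or an error.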

Container Analysis

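Many different types of documents (whether MS Office or LibreOffice based) share the same magic bytes because they share a container format, even though their media-types (the various application/vnd.* types) differ. Modern office documents, for instance, are ZIP archives. This means that the container has to be opened and analysed further. It seems that xdg-mime is able to take care of this for us as well.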
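As a rough illustration of what "opening the container" can mean for ZIP-based office documents, the Python sketch below looks for the "mimetype" entry that OpenDocument files carry and for the "[Content_Types].xml" manifest found in Office Open XML files. The function name and the size guard are assumptions made for this example, it only covers three OOXML document kinds, and it is not a description of what xdg-mime does internally.

```python
import io
import zipfile

OOXML_PREFIX = "application/vnd.openxmlformats-officedocument."
# Top-level folder inside the archive -> media type suffix.
OOXML_FOLDERS = {
    "word/": "wordprocessingml.document",
    "xl/": "spreadsheetml.sheet",
    "ppt/": "presentationml.presentation",
}


def sniff_zip_container(data):
    """Classify ZIP-based documents by entries inside the archive.

    Returns None if data is not a ZIP archive at all.
    """
    if not zipfile.is_zipfile(io.BytesIO(data)):
        return None
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        names = archive.namelist()
        # OpenDocument: the media type is stored in an entry named "mimetype".
        if "mimetype" in names:
            # Guard against absurdly large entries in hostile input.
            if archive.getinfo("mimetype").file_size <= 255:
                return archive.read("mimetype").decode("ascii", "replace").strip()
        # Office Open XML: manifest plus a format-specific folder.
        if "[Content_Types].xml" in names:
            for folder, suffix in OOXML_FOLDERS.items():
                if any(name.startswith(folder) for name in names):
                    return OOXML_PREFIX + suffix
    # A ZIP archive, but not one of the document kinds checked above.
    return "application/zip"
```

The value read from the "mimetype" entry is still supplied by whoever made the file, so it should be checked against the accepted types like any other claim.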

Testing

So what if all else fails and the only thing you know about the file is that its media-type is application/octet-stream? One school of thought would say: throw an error, log the results and invite the user to try another file. Another possibility is to make a best guess and try to transcode the file in a sandbox with limited resources and low system priority (perhaps in a VM). Here is where the resource name and metadata come into play. Using these two bits of information, it is possible to identify which program could be used to test the file for its media-type. Anyway, it's never too late to throw an error.
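The "limited resources and low system priority" part of such a test run might look like the Python sketch below. It assumes a Unix-like host (the resource module is not available on Windows), the limits shown are arbitrary example values, and run_restricted is a hypothetical helper. It only lowers priority and caps CPU and wall-clock time; it does not isolate the process, so a VM or container would still be needed for real sandboxing.

```python
import os
import resource
import subprocess

CPU_SECONDS = 10  # example value, tune to the expected transcoding cost


def _limit_child():
    """Runs in the child process just before the command is executed."""
    os.nice(10)  # lower the scheduling priority
    soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    if hard == resource.RLIM_INFINITY:
        limit = CPU_SECONDS
    else:
        limit = min(CPU_SECONDS, hard)
    # Lower only the soft limit; the hard limit is left unchanged.
    resource.setrlimit(resource.RLIMIT_CPU, (limit, hard))


def run_restricted(command, timeout_seconds=30):
    """Run command (a list of arguments) under limits.

    Returns True only if the command exits with status 0 in time.
    """
    try:
        completed = subprocess.run(
            command,
            preexec_fn=_limit_child,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.SubprocessError, OSError):
        # Timed out, could not be started, or the limits could not be applied.
        return False
    return completed.returncode == 0
```

The command passed in would be whichever test program the resource name and metadata point to; a False result is the cue to throw the error after all.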

Problems

There are two main classes of problems with media-types:

Attack Vectors
    steganographic attack vectors (like gifar [gif + java])
    png decompression overflow
    out of bounds payloads
    something else dastardly
Off Spec Resources
    non-spec'ed resources (.obj)
    historical resources (.nbm)
    hipster resources (.gifv)
    brand new media-types

At the end of the day, we use a whitelist to accept only certain types of files. As files are tested against the system, they are added to the list manually. Furthermore, we prefer to offer transcoded files to the user, which generally saves bandwidth and prevents several of the more common attack vectors, since out-of-bounds data will not be transformed into the new asset.
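In its simplest form, a manually curated whitelist is just a set lookup, as in the Python sketch below. The entries and the names ACCEPTED_MEDIA_TYPES and is_accepted are hypothetical examples, not the project's actual list.

```python
# Hypothetical whitelist; entries are added by hand once a type has been tested.
ACCEPTED_MEDIA_TYPES = {
    "audio/mpeg",
    "image/jpeg",
    "image/png",
}


def is_accepted(media_type):
    """Return True if media_type is on the whitelist.

    Parameters such as "; charset=..." are dropped, and the comparison
    ignores case, since media type names are case-insensitive.
    """
    normalized = media_type.split(";", 1)[0].strip().lower()
    return normalized in ACCEPTED_MEDIA_TYPES
```

Anything that is not explicitly listed, including application/octet-stream, is rejected by default.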

Open an issue if there is something you need to have transcoded that is not recognized.

[Note: This text has been copied from the ito-suite project.]