
Zoompf's Web Performance Blog

Note: Archived Content

This is the archived version of the Zoompf blog. Since our acquisition by Rigor, all our new research and posts on web performance are being published on The Rigor Blog.

Content Detection: A wretched hive of scum and villainy

 Billy Hoffman on October 24, 2014. Category: Uncategorized

A few days ago, I wrote about how the BBC was serving a BMP image as a JPEG. Several people have asked me why this would work at all. The file extension and the MIME type both implied it was a JPEG, but the file wasn’t a JPEG. How could the browser even render this image in the first place?

The answer is content detection.

This is not the content you are looking for

How does a web browser know how to interpret what the body bytes of an HTTP response mean? Is this an image? If so, what type? Is this text? If so, how are the characters encoded into bytes?

HTTP provides, in theory, a clean and logical mechanism to communicate this information: the Content-Type header. This header uses MIME types to identify the content. Unfortunately, there are numerous reasons the Content-Type MIME information can be wrong, including:

  1. In all major web servers, by default, the Content-Type value for static content is driven by the file extension. So logo.jpg is served as Content-Type: image/jpeg, regardless of the actual image format of logo.jpg.

  2. The value of the Content-Type header for different pieces of content is typically controlled via the server configuration. The people who create the content are often not the people who configure the server. This separation means you might save your HTML document with the character set UTF-8, while the server is configured to use Content-Type: text/html; charset=ISO-8859-4 for HTML files.

  3. MIME types must be approved by a committee! Until they are approved, you have to use an x- prefix to denote that a MIME type is experimental. This leads to at least two MIME types for every file format: the MIME type used while experimental, and the later official MIME type. JavaScript started as text/x-javascript and moved to text/javascript. But it gets worse because…

  4. No one actually agrees on MIME types. JavaScript’s MIME type? Technically, it’s application/javascript, and before that application/x-javascript. Complicating things, for many years Microsoft decided to call their JavaScript JScript and used the text/jscript MIME type instead. Oh, and the committee that standardizes HTML? It’s a different committee than the one that approves MIME types, which, just to make things fun, is also different than the committee that standardizes HTTP. And that first group insists the MIME type for JavaScript is text/javascript despite what the second group says, and puts that in all its standards. This leaves the HTTP group in an awkward position about saying which standard to follow!

    Now what about JSON? Technically it’s a subset of JavaScript. Should it get its own MIME type, even though it’s JavaScript, just because of how you use it? What about when you fetch JSON as JSONP via a <script src> tag? About here is where a front-end application developer just makes up a MIME type for whatever format their API is using. And so far we have only talked about all the different MIME values used for a single type of web content!

  5. If a web server doesn’t know what MIME type to use, it will default to something generic like application/octet-stream. This is a nice way of telling the browser “here is an opaque blob of bytes, you figure it out.”
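Points 1 and 5 above can be illustrated with a minimal C++ sketch. The function name MimeTypeFromExtension and the tiny lookup table are hypothetical, made up for illustration; real servers ship far longer tables and their own configuration formats. The point is only that the bytes of the file are never inspected, so a BMP saved as logo.jpg is still labeled image/jpeg.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical helper: choose a Content-Type from the file extension alone,
// roughly the way a default static-file configuration does.
std::string MimeTypeFromExtension(const std::string& path) {
  // Illustrative table only; real servers ship much longer lists.
  static const std::unordered_map<std::string, std::string> kTypes = {
      {"jpg", "image/jpeg"}, {"jpeg", "image/jpeg"}, {"png", "image/png"},
      {"gif", "image/gif"},  {"html", "text/html"},  {"js", "text/javascript"},
  };

  const std::string::size_type dot = path.find_last_of('.');
  if (dot == std::string::npos) return "application/octet-stream";

  // Lower-case the extension so LOGO.JPG and logo.jpg are treated alike.
  std::string ext = path.substr(dot + 1);
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });

  // Unknown extension: fall back to the generic "opaque bytes" type.
  const auto it = kTypes.find(ext);
  return it != kTypes.end() ? it->second : "application/octet-stream";
}
```

Calling MimeTypeFromExtension("logo.jpg") yields image/jpeg no matter what logo.jpg actually contains, and an unrecognized name such as data.xyz falls through to application/octet-stream.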

“It’s not my fault!”

As we can see, the Content-Type header is not a dependable mechanism the browser can use to determine file type. Like me, you might be thinking: “Well, people should be using Content-Type correctly! The website is the one that is messed up and misconfigured. The browser should just throw an error so the website can fix their problem!”

While a noble idea, this is a terrible approach in practice. Think about this from the user’s perspective. Let’s say Firefox decided to only use Content-Type to determine the file format, and displayed an error whenever the declared type didn’t match the actual content. Now, imagine a user, Melanie, tries to visit a site with Firefox. This site has misconfigured MIME types, so Firefox displays an error or doesn’t fully render the page. Which of the following does Melanie think has happened?

  1. The website she is visiting must be misconfigured. They need to get their act together.

  2. Firefox is a crappy browser! It doesn’t work for cutecutekitties.com! They need to get their act together.

That’s right, Melanie thinks it’s number 2. Firefox can rightly know it is the website that’s messed up, but that is kind of like Han Solo trying to explain why the Millennium Falcon doesn’t work:

Han: It’s not my fault!
Leia: No Lightspeed?
Han: It’s not my fault!

It doesn’t matter whose fault it is, it’s a bad user experience.

(If this “if you aren’t doing it right it shouldn’t work at all” philosophy sounds familiar, that’s because you have heard this story before. It is one of the primary reasons well-formed XHTML never caught on with the larger web development community. You had a passionate group who created a standard with noble intentions. However, they insisted that XHTML documents had to be valid and correct, and that browsers should throw a warning and not display anything if the XHTML page had any problems. This too was a terrible user experience, and made people think something was wrong with the browsers. The browsers, understandably, said “screw this”, and would parse and render XHTML as HTML, unless you really really forced them to. The rather depressing lesson from this is that, for a web browser, being “correct” is never as important as rendering something.)

Browsers needed to find another way to detect what format was being used without relying exclusively on the MIME type supplied by the server. The solution was content detection.

Content Detection

Content detection means the browser must look at an opaque blob of bytes and try to figure out what type of content it is. This can be a very challenging task, especially for text responses like HTML and CSS. However, for binary content like images, it is fairly straightforward.

Binary content like images can be identified by looking for well-known byte signatures, also called magic numbers. For example, PNG images always start with the byte sequence 89 50 4E 47 0D 0A 1A 0A. GIF images start with the string GIF87a or GIF89a. Fonts (WOFF, OTF, EOT), ZIP files, Flash objects, and other binary files common on the web all have well-known magic numbers. Wikipedia has a great list of common magic numbers, including the magic numbers for most types of web content.
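As a rough illustration of the PNG and GIF signatures just described, here is a small C++ sketch. The helper names StartsWith, LooksLikePng, and LooksLikeGif are invented for this example and do not come from any browser's source; the only facts relied on are the two byte signatures given above.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Returns true when `data` begins with the `len` bytes stored at `magic`.
bool StartsWith(const std::string& data, const char* magic, std::size_t len) {
  return data.size() >= len && std::memcmp(data.data(), magic, len) == 0;
}

// PNG signature: 89 50 4E 47 0D 0A 1A 0A (eight bytes).
bool LooksLikePng(const std::string& data) {
  static const char kPng[] = "\x89PNG\x0D\x0A\x1A\x0A";
  return StartsWith(data, kPng, sizeof(kPng) - 1);  // -1 drops the NUL
}

// GIF files start with one of two six-character version strings.
bool LooksLikeGif(const std::string& data) {
  return StartsWith(data, "GIF87a", 6) || StartsWith(data, "GIF89a", 6);
}
```

Only the first few bytes of the response are needed, which is why this kind of check is cheap enough to run on every image a page loads.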

Determining the file type for some arbitrary blob of bytes is as simple as looking for these magic numbers. The source code for the content detection function in Chrome is open, available, and rather readable. Here is the relevant snippet:

static const MagicNumber kMagicNumbers[] = {
  // Source: HTML 5 specification
  MAGIC_NUMBER("application/pdf", "%PDF-")
  MAGIC_NUMBER("application/postscript", "%!PS-Adobe-")
  MAGIC_NUMBER("image/gif", "GIF87a")
  MAGIC_NUMBER("image/gif", "GIF89a")
  MAGIC_NUMBER("image/png", "\x89" "PNG\x0D\x0A\x1A\x0A")
  MAGIC_NUMBER("image/jpeg", "\xFF\xD8\xFF")
  MAGIC_NUMBER("image/bmp", "BM")
  // Source: Mozilla
  MAGIC_NUMBER("text/plain", "#!")  // Script
  MAGIC_NUMBER("text/plain", "%!")  // Script, similar to PS
  MAGIC_NUMBER("text/plain", "From")
  MAGIC_NUMBER("text/plain", ">From")
  // Chrome specific
  MAGIC_NUMBER("application/x-gzip", "\x1F\x8B\x08")
  MAGIC_NUMBER("audio/x-pn-realaudio", "\x2E\x52\x4D\x46")
  MAGIC_NUMBER("video/x-ms-asf",
      "\x30\x26\xB2\x75\x8E\x66\xCF\x11\xA6\xD9\x00\xAA\x00\x62\xCE\x6C")
  MAGIC_NUMBER("image/tiff", "I I")
  MAGIC_NUMBER("image/tiff", "II*")
  MAGIC_NUMBER("image/tiff", "MM\x00*")
  MAGIC_NUMBER("audio/mpeg", "ID3")
  // TODO(abarth): we don't handle partial byte matches yet
  // MAGIC_NUMBER("video/mpeg", "\x00\x00\x01\xB")
  // MAGIC_NUMBER("audio/mpeg", "\xFF\xE")
  // MAGIC_NUMBER("audio/mpeg", "\xFF\xF")
  MAGIC_NUMBER("application/zip", "PK\x03\x04")
  MAGIC_NUMBER("application/x-rar-compressed", "Rar!\x1A\x07\x00")
  MAGIC_NUMBER("application/x-msmetafile", "\xD7\xCD\xC6\x9A")
  MAGIC_NUMBER("application/octet-stream", "MZ")  // EXE

You can see they are building a list of common web content like PDF, GIF, PNG, JPEG, and their corresponding magic numbers. Look at the line MAGIC_NUMBER("image/bmp", "BM"). This is defining that Windows BMP images start with the characters BM!

And now we arrive at the answer to the original question: How can a browser render an image if the MIME type is incorrect? Chrome was able to render the image from the BBC’s website because it took the response, ignored the Content-Type header, and used its content detection function to look for magic numbers in the image file. Chrome discovered that it did not contain the magic number for JPEG images, but it did contain the magic number for a BMP image. Chrome makes a note of this, and passes along the image data with the correct image format information down the chain, until eventually it is rendered properly for the user.
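The flow described in this paragraph might be sketched as follows. This is a simplified illustration, not Chrome's actual code: the Signature struct and the EffectiveImageType function are hypothetical names, the table holds only a handful of the image signatures from the Chrome list above, and a real browser applies more rules when deciding whether to sniff at all.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical record pairing a MIME type with its leading byte signature.
struct Signature {
  const char* mime_type;
  const char* magic;
  std::size_t length;
};

// A few image signatures from the Chrome table above; not the full list.
static const Signature kSignatures[] = {
    {"image/gif", "GIF87a", 6},
    {"image/gif", "GIF89a", 6},
    {"image/png", "\x89PNG\x0D\x0A\x1A\x0A", 8},
    {"image/jpeg", "\xFF\xD8\xFF", 3},
    {"image/bmp", "BM", 2},
};

// Hypothetical helper: prefer the type sniffed from the leading bytes, and
// fall back to the server's declared Content-Type when nothing matches.
std::string EffectiveImageType(const std::string& declared_type,
                               const std::string& body) {
  for (const Signature& sig : kSignatures) {
    if (body.size() >= sig.length &&
        std::memcmp(body.data(), sig.magic, sig.length) == 0) {
      return sig.mime_type;
    }
  }
  return declared_type;
}
```

With a body that begins with BM and a declared type of image/jpeg, EffectiveImageType returns image/bmp, which mirrors what happened with the BBC image.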

Conclusions

The web is a very dirty place. Servers are misconfigured, files are saved and named incorrectly, and web browsers are forced to deal with it all. We’ve seen that browsers cannot trust the Content-Type header and instead must try to determine the file type of content using content detection. So far, we have only covered content detection for binary files like images or fonts. This is fairly easy, since browsers can just look for magic numbers to determine the format. Imagine trying to determine if a string of text is HTML or JavaScript! Content detection for text is very tough indeed!

If you think it’s cool to learn about how browsers do content detection, then you’ll love our new Zoompf Alerts Beta. Zoompf Alerts continuously scans your website throughout the day, looking for specific front-end performance issues, and alerts you when new problems are introduced. We just launched the public beta of Zoompf Alerts and you can join Zoompf Alerts now for free!
