Detecting and Optimizing Mismatched Image Formats
Serving the wrong image type can really hurt web performance. An improperly saved image can waste bandwidth and delay page load. In this blog post, I’ll show how you can use the file and grep commands to quickly, easily, and automatically find mismatched images: files that look like one image type but are actually another format.
A few months back I wrote about an image on the BBC’s website whose MIME type and file extension said it was a JPEG image but in fact was a 484 KB BMP image. This is a performance issue as BMP images are unsuitable for use on the web.
I got a lot of questions about how the browser could even render this kind of mismatched image, so I wrote a follow-up piece about content detection. In short, most binary files like images contain so-called magic numbers: a sequence of bytes that is unique to a specific file format. Browsers largely ignore the file extension and the MIME type of a response and instead look for different magic numbers to determine what kind of file it is and how it should be rendered.
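To make the idea of magic numbers concrete, here is a minimal sketch that peeks at the first few bytes of a file and compares them with the well-known signatures for JPEG, PNG, GIF, and BMP. The function names magic_bytes and sniff_format are made up for illustration, and this only approximates what a browser does; real content sniffing checks more formats and more bytes.

```shell
#!/bin/sh
# Print the first 8 bytes of a file as space-separated lowercase hex.
magic_bytes() {
    head -c 8 "$1" | od -An -tx1 | tr '\n' ' ' | tr -s ' ' | sed 's/^ //; s/ $//'
}

# Guess an image format from the leading bytes alone (hypothetical helper).
sniff_format() {
    case "$(magic_bytes "$1")" in
        "ff d8 ff"*)                 echo JPEG ;;
        "89 50 4e 47 0d 0a 1a 0a"*)  echo PNG ;;
        "47 49 46 38"*)              echo GIF ;;   # ASCII "GIF8" (GIF87a or GIF89a)
        "42 4d"*)                    echo BMP ;;   # ASCII "BM"
        *)                           echo unknown ;;
    esac
}

# Usage: sniff_format bbc.jpg
```

Notice that the file name never enters into the decision. Only the bytes do.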
This status quo is great for the end user. If a designer misnames the file or if the IT team misconfigures the MIME type, a visitor’s browser will still be able to display everything correctly. But unfortunately the browser’s behavior masks the fundamental problem: the BBC wants to use a JPEG but is serving a BMP by mistake. Everything works, except that the image is about 10x bigger than it needs to be, wasting bandwidth and slowing down page load times.
In my original article I demonstrated using a hex editor to see inside the file to show how it was indeed the wrong image format. While using a hex editor works well for a small number of files, that process is rather technical and doesn’t scale well for dozens or hundreds of images. Luckily there is a solution.
File to the rescue
Most Unix-like systems, including OS X, Linux, and Cygwin on Windows, include the file command. file uses a database of magic numbers to figure out the actual format of a file. You can see the output in the screenshot below, where I run file on the contents of a directory.
file has revealed a lot of great data. We see Windows binaries, an HTML document, and various images. We also see additional metadata about the files, like the dimensions of the images. If you look closely, you will also see that the bbc.jpg image is identified as PC Bitmap, which is another way of saying a Windows Bitmap (BMP) image. So file can help us identify files that were saved incorrectly. But right now, that information is buried. What we want is a way to call attention to files whose extension does not match the detected file type.
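The free-form descriptions are nice to read but awkward to compare in a script. As a rough sketch, many versions of file also accept -b (leave out the file name) and --mime-type (print only a MIME type such as image/jpeg). The wrapper name detected_type is hypothetical, and the sketch assumes your copy of file supports both options.

```shell
#!/bin/sh
# Print only the MIME type that `file` detects from a file's contents.
# Assumes a `file` that understands -b and --mime-type.
detected_type() {
    file -b --mime-type -- "$1"
}

# Usage: detected_type bbc.jpg
```

For a genuine JPEG this prints image/jpeg. For a bitmap saved with a .jpg name it prints a bitmap type instead; the exact string for BMP can vary between versions of file.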
Enter grep, the sidekick
To pull out this information, let’s use the grep command. grep is a super handy program that lets you match or filter any input and display the results. For example, you can use grep to filter a text file and only display lines of text that contain the word “awesome”. We can use grep to highlight when the file extension doesn’t match the file type using the following code snippet:
file *.jpg *.jpeg | grep -v JPEG
This runs file on all files with a .jpg or .jpeg file extension and passes the output to grep. We then use grep with the -v option, which inverts the matching. Basically, we are telling grep to display any line of text that doesn’t contain the word “JPEG”. Since we told file to only look at files that have a common JPEG extension, every line of output should be identified as a JPEG image and contain the word “JPEG”. Our filter means that the only output will be files that have a .jpg or .jpeg file extension but are not actually JPEG images. You can see this below:
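As a text-only illustration of that filter, here is a small sketch. The sample lines are invented stand-ins for file output, not real results, and not_format is a hypothetical helper name. It also tightens the match a little: a bare grep -v JPEG would hide a mismatched file whose name happens to contain “JPEG”, so this version looks for the format name right after the colon that separates the file name from the description. It assumes the description starts with text like “JPEG image data”, which is what common versions of file print.

```shell
#!/bin/sh
# Drop lines whose description starts with "<FORMAT> image data";
# everything that survives is a suspected imposter.
not_format() {
    grep -v ":[[:space:]]*$1 image data"
}

# Invented sample lines standing in for real `file` output.
printf '%s\n' \
    'photo.jpg:       JPEG image data, JFIF standard 1.01' \
    'JPEG-banner.jpg: PC bitmap, Windows 3.x format' \
    'bbc.jpg:         PC bitmap, Windows 3.x format' |
    not_format JPEG
# Prints the JPEG-banner.jpg and bbc.jpg lines; photo.jpg is filtered out.
```

With real files, the equivalent would be file *.jpg *.jpeg | not_format JPEG.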
Awesome! That shows us JPEG imposters, but what about other image formats? We can do the same thing.
Running file *.png | grep -v PNG will find .png files that are not really PNG images. Running file *.gif | grep -v GIF will find .gif files that are not really GIF images. We can execute these commands all at once by separating them with semicolons, like this:
file *.jpg *.jpeg | grep -v JPEG; file *.gif | grep -v GIF; file *.png | grep -v PNG
Using this approach, I was able to detect 2 mismatched image files, as shown below:
This mismatched image format detector is a great way to quickly find problem images on your website. Just run it against the different image directories for your website. It is also a great script to add to a build process like Grunt, so you can verify that all of your images are in the correct format.
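For a build step, it helps if the check walks subdirectories and reports failure through its exit status. Here is one possible sketch, not a finished tool. The function names expected_type and check_images are hypothetical, it compares MIME types rather than grepping descriptions, and it assumes a file that supports -b and --mime-type and paths that contain no newlines.

```shell
#!/bin/sh
# Map a lowercase file extension to the MIME type we expect to detect.
expected_type() {
    case "$1" in
        jpg|jpeg) echo image/jpeg ;;
        png)      echo image/png ;;
        gif)      echo image/gif ;;
        *)        return 1 ;;   # not an extension we check
    esac
}

# Walk a directory tree, print every mismatch, and return 1 if any were found.
check_images() {
    status=0
    while IFS= read -r path; do
        # Lowercase the extension so photo.JPG is checked like photo.jpg.
        ext=$(printf '%s' "${path##*.}" | tr '[:upper:]' '[:lower:]')
        want=$(expected_type "$ext") || continue
        got=$(file -b --mime-type -- "$path")
        if [ "$got" != "$want" ]; then
            echo "$path: expected $want, detected $got"
            status=1
        fi
    done <<EOF
$(find "$1" -type f)
EOF
    return "$status"
}

# Usage in a build script: check_images ./images || exit 1
```

Feeding the loop from a here-document instead of a pipe keeps it in the current shell, so the status variable is still set when the function returns.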
Mismatched image files are very tricky to detect. This is because the browser does what it is supposed to do and renders the image, even if it has the wrong file extension or MIME type. Since the image renders, it is not obvious that there is a performance problem. However, like the BBC, you could be wasting bandwidth and increasing page load times. Using file and grep, we can detect files that were saved in the wrong format.