Unsuitable Image Formats for Websites

Posted: April 18, 2012 at 4:53 pm

As I mentioned in our How Fast Is … USPS.com video and blog post yesterday, I discovered a few TIFF files on the US Postal Service’s website. I thought a follow up post about images suitable for use on the web was in order. According to the awesome HTTP Archive, the most common image formats on the web are PNG, GIF, and JPEG:

Usage of Image formats as of 4/2012

Together these images can handle the most common image use cases on the web:

  • GIF – simple animations
  • PNG – Figures, diagrams, screen shots, basic images
  • JPEG – Photographs

These formats are so common and entrenched that the processors and SoCs used in mobile devices are hardware optimized for things like JPEG decoding or the checksums used by PNG’s DEFLATE compression scheme. And frankly, until there is a format which accomplishes these common use cases better than an existing image type, this will not change. This “what we have works fine” mentality is a big reason APNG and other GIF animation replacements have failed to take off. They are not significantly better at solving the simple animation use case, so they do not get adopted.

Unsuitable Image Formats

If these are the most common and suitable formats to solve various image use cases on the web, then what about other image formats? What makes up that “1% other” on the chart from the HTTP Archive? By and large, these other image formats have one of the following characteristics that make them unsuitable for use on the web:

  1. They are not natively supported by most browsers.
  2. They do not compress graphical data, do not compress graphical data by default, or compress graphical data using inefficient or obsolete methods.
  3. They contain additional information not relevant for display on a monitor.

Characteristic 1 without a doubt kills off almost every image format you have heard of. Characteristics 2 and 3 make those image formats that are compatible wasteful. So, while there are dozens of image file formats that are unsuitable for the web, the types of unsuitable image formats you do find on the web tend to be compatible (or were compatible at some point in the past), but are wasteful and bloated in size. We will focus on 3 types of image format that meet these criteria.

TIFF Images

The TIFF images I found on USPS.com are a good example of this. At one time, browsers did support them. But today TIFF images cannot be natively rendered by modern web browsers and require a plug-in to be displayed. TIFF images are not compressed by default. TIFF images are also widely used in the publishing industry, so they often use alternative color spaces like CMYK and contain printing information such as spacing, layout, and density. All of this makes TIFF images unsuitable for use on the web.

BMP Images

BMP images are another example of an unsuitable image format for use on the web. BMP images do not compress graphical data. While a primitive form of RLE compression is defined in the specification, this is not widely supported and rarely used. Even if it were, RLE achieves poor compression ratios compared to a more modern scheme such as DEFLATE. Images that are hundreds of kilobytes when saved as BMP are only tens of kilobytes when saved as a PNG or JPEG. With such huge sizes, BMP is definitely not a suitable image format for the web.
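To get a feel for the size gap, here is a minimal Python sketch. It does not read or write real BMP or PNG files. It builds a made-up 200x200 image as raw 24-bit pixel data (roughly what an uncompressed BMP stores after its header) and runs it through DEFLATE using the standard zlib module. The image contents, dimensions, and names are assumptions for illustration only.

```python
import zlib

# Hypothetical test image: a white canvas with a blue square, the kind of
# flat-color figure or diagram that often ends up saved as a BMP.
WIDTH, HEIGHT = 200, 200
WHITE = bytes([255, 255, 255])
BLUE = bytes([0, 0, 255])

def build_raw_pixels():
    """Return raw 24-bit pixel data, 3 bytes per pixel, row by row."""
    rows = []
    for y in range(HEIGHT):
        row = bytearray()
        for x in range(WIDTH):
            inside = 50 <= x < 150 and 50 <= y < 150
            row += BLUE if inside else WHITE
        rows.append(bytes(row))
    return b"".join(rows)

raw = build_raw_pixels()
deflated = zlib.compress(raw, 9)  # DEFLATE at the highest compression level

print("raw pixel bytes:", len(raw))
print("after DEFLATE:  ", len(deflated))
```

Real photographs will not shrink anywhere near this much with DEFLATE alone, which is why JPEG exists, but for simple graphics the difference is dramatic.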

XBM Images

Besides TIFF and BMP, a final image format encountered on the Internet which is unsuitable is XBM. Ironically, XBM was the very first image format used on the web and support started with the Mosaic web browser. Subsequent web browsers supported XBM to be compatible with early websites. Sadly, XBM is, to put it nicely, a horribly, horribly, horribly designed image format. If you can imagine all the badness of the DOM’s document.cookie “interface” somehow packaged into an image format, you’d get XBM.

Each XBM file is, quite literally, valid C source code, defining a byte array and the byte values to populate it. In addition to not being natively compressed, this approach creates a large number of security holes as the format essentially says:

Hey web browser, I’m totally untrusted content from some random 3rd party you found on the Internet, but I’d like you to allocate an array in memory that is X bytes big, and then give me a pointer to it. Now I’m going to shove some number of bytes (who knows how many!) directly into memory starting at that address.

Security vulnerabilities like buffer overflows, heap overflows, and uninitialized memory leakage cropped up everywhere! It was so bad, that to fight the flood of issues Microsoft removed support for the image format back in XP Service Pack 2. Even in late-2008, before I stopped doing security research, I quite by accident discovered an information leakage vulnerability in Firefox using XBM images. XBM is exceedingly rare today, though Zoompf has encountered several websites with them through our free scans and from customers using our WPO product.
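To make the problem concrete, here is a rough Python sketch of the bounds check a careful XBM reader needs. The sample file, the parse_xbm name, and the regular expressions are my own illustration, not code from any browser. The point is simply that the declared width and height and the number of bytes actually supplied are two independent claims, and nothing in the format forces them to agree.

```python
import re

# A tiny hypothetical XBM file: it really is just C source code.
SAMPLE_XBM = """
#define demo_width 16
#define demo_height 2
static unsigned char demo_bits[] = {
   0xff, 0x00, 0x0f, 0xf0 };
"""

def _declared(text, suffix):
    """Read the integer from a '#define name_<suffix> N' line."""
    match = re.search(r"#define\s+\w+_" + suffix + r"\s+(\d+)", text)
    if match is None:
        raise ValueError("missing " + suffix + " declaration")
    return int(match.group(1))

def parse_xbm(text):
    """Return (width, height, pixel_bytes), refusing inconsistent files."""
    width = _declared(text, "width")
    height = _declared(text, "height")
    body = text[text.index("{") + 1 : text.index("}")]
    data = bytes(int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]{1,2}", body))
    # Each row is padded to a whole number of bytes.
    expected = ((width + 7) // 8) * height
    if len(data) != expected:
        raise ValueError("header promises %d bytes, file supplies %d" % (expected, len(data)))
    return width, height, data

print(parse_xbm(SAMPLE_XBM))

# A file that lies about its size is rejected instead of trusted.
lying_xbm = SAMPLE_XBM.replace("demo_height 2", "demo_height 200")
try:
    parse_xbm(lying_xbm)
except ValueError as err:
    print("rejected:", err)
```

Skip that one comparison, in either direction, and you get exactly the overflows and uninitialized memory leaks described above.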

Why Do Unsuitable Image Formats Get Used?

If these formats are a poor fit for web usage, why do they get used on websites? There are a few reasons:

  • Legacy. Once upon a time, that image format worked. The site is old, and nobody fixed it. This is common with XBM and BMP.
  • Mistake. People just make mistakes. Sometimes they didn’t know they shouldn’t use one of the image types. Usually someone didn’t intend to use an unsuitable image format. 9 times out of 10, when Zoompf detects a BMP image, it has a file extension of .jpg or .png. The creator intended to save the image as a JPEG, but the actual image format that was used was BMP.
  • Actually meant to use it. If you want to provide a downloadable version of your logo that is high resolution and suitable for printing, you probably do want to publish it as TIFF. If you have a collection of wallpaper images to use as a desktop background, you probably want to offer them as BMP. Having a TIFF or a BMP on your website isn’t a bad thing as long as it is downloadable. The performance problem starts when you use these image formats inside of an <img src> or a CSS background: style, because there are faster, better, and more efficient image formats you should be using.

Conclusions

You should only be using PNG, JPEG, and GIF images on production websites. All other image formats are either not supported by browsers, or are not optimized and efficient for use as images on a website. Remember, just because a filename has a .png, .jpg, or .gif extension does not mean it’s a PNG, JPEG, or GIF image! Review the images used by your website and 3rd party content to ensure you are using the proper image formats.
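Checking the real format is not hard, because each of these formats starts with a recognizable signature. The following Python sketch is one way to do it. The function names are hypothetical, and the XBM test (looking for a leading #define) is only a heuristic, since XBM is plain text with no true magic number.

```python
import os

# Leading bytes ("magic numbers") for the formats discussed in this post.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"BM", "BMP"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
    (b"#define", "XBM"),    # heuristic only: XBM is C source text
]

# What each web-friendly extension claims the file is.
EXTENSION_FORMATS = {".png": "PNG", ".jpg": "JPEG", ".jpeg": "JPEG", ".gif": "GIF"}

def sniff_image_format(data):
    """Guess the image format from the first bytes of the file."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"

def extension_matches(filename, data):
    """True if the file extension agrees with the sniffed format."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_FORMATS.get(ext) == sniff_image_format(data)

# A BMP saved with a .jpg extension, the most common mistake we see.
fake_jpeg = b"BM" + b"\x00" * 52
print(sniff_image_format(fake_jpeg), extension_matches("logo.jpg", fake_jpeg))
```

In practice you would read the first dozen or so bytes of each image file on disk (or of each image response) and pass them to sniff_image_format.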

Want to see what performance problems your website has? Unsuitable Web Image (TIFF), Unsuitable Web Image (BMP), and Unsuitable Web Image (XBM) are just 3 of the nearly 400 performance issues Zoompf detects when testing your web applications. You can get a free performance scan of your website now and take a look at our Zoompf WPO product at Zoompf.com today!

Poor Choices are Ruining the Web

Posted: February 21, 2012 at 8:18 pm

A recent article by John Naughton has sparked a debate inside the web design and web development community. Are designers, with their image heavy designs, ruining the web?

The answer is yes, but it’s not why you think. It’s not because designers use big images or even that they use a lot of images. It’s because they are creating and using images poorly.

I get it. Your name is Chris and you spell it Criss. You have a Mac. You wear trendy clothes. You are an awesome web designer. I’m happy to have you on the team.

The problem is you suck at it. Not your art or your design. You are probably awesome at that. Believe me, I’m hardly qualified to judge art. What you suck at is taking your beautiful design and delivering it to your website’s visitors in an efficient way which creates an excellent experience. And that’s what’s ruining the web.

Now I want to see your awesome design. I truly do. Because you have a gift that I don’t: you can take ideas and visualize how they should look. Even better, you have the skill to express those design ideas in ways that I can only imagine. And often, all I can do is imagine, because when I visit your webpage, all I see is a white screen as my browser downloads megabytes of content.

As Aaron Gustafson says:

"Graphic designers are not ruining the web, but a lack of web professionalism is. Without proper training and an appreciation of the ramifications of each decision that goes into building a website, you more than likely won’t make the right decision regarding optimising the user experience. This isn’t print and it’s not television – bandwidth is a factor."

I couldn’t agree more with Aaron. What we need here is professionalism and ownership. Performance is not someone else’s responsibility. Performance is your responsibility. Your job doesn’t stop when you create that PSD file. You are creating a User Experience, which is far more than the visual characteristics of your design. You are responsible for the experience of actually engaging with the design.

This is why I’m so glad this discussion started in the middle of our Lose the Wait series. As we have seen, you can lose webpage wait time by shedding page weight. And images are contributing the largest portion of total page weight.

We at Zoompf have made posts and given presentations all about this in the past. There is a lot that can be done to make sure the experience you give to your visitors reflects all the effort and skill that went into the design.

But let’s apply some focus. Forget all the things that can be done to optimize images. Let’s focus on a single thing. It’s easy to do. It’s obvious to check if it has been done. It has serious and immediate effects. It’s even something you can do right now.

Removing Image Bloat

Image files can contain all sorts of data inside of them that has nothing to do with the rendering of the image. This should not be news. In fact, I even wrote about it last Friday. While the types of non-graphical data present vary with each file format, a few examples are:

  • Unused palette entries
  • Embedded thumbnails
  • Meta data
  • Comments
  • Application settings
  • Camera information

Take a photo of something? Edit it in Photoshop? Well now it has an embedded thumbnail in it hitching along and taking up space. Be careful now! You might accidentally post naked pictures of yourself to the Internet. Talk about a user experience.

So how much can this help? Is this a tempest in a teapot? No. Research shows the average savings by losslessly optimizing an image is 15-20%. That means roughly 1 byte out of every 5 to 7 bytes of an image is wasted bloat.

All of this sounds kind of amateurish, doesn’t it? Is this something novices do, or are professional designers at real websites doing this too?

Let’s try an experiment. Go check out Best Buy’s website. See those images? The images are bloated by 32% with this non-graphical gunk. 9 months ago, Twitter had the same problem.

Fixing the Problem

There are two things we need to think about to fix this problem: finding a way to get rid of the bloat now, and finding a way to make sure we get rid of it consistently in the future.

The first part is easy. All the tools to optimize images are free. Even better, Chris Sullo of Nikto fame wrote Site Crunch, a script that lets you automatically run image optimization tools over your entire website.
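If you are curious what these tools are actually removing, here is a minimal Python sketch for one narrow case: dropping text and timestamp chunks from a PNG. This is not how any particular optimizer is implemented, and real tools do far more (recompressing pixel data, trimming palettes). The function names and the list of chunk types to drop are my own assumptions. I deliberately leave other ancillary chunks alone because some of them, such as transparency or gamma, affect how the image renders.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Chunk types that only carry comments, metadata, or timestamps.
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"tIME"}

def make_chunk(chunk_type, payload):
    """Build one PNG chunk: length, type, payload, CRC-32 of type + payload."""
    crc = zlib.crc32(chunk_type + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + chunk_type + payload + struct.pack(">I", crc)

def strip_png_metadata(data):
    """Return the PNG with metadata chunks removed; pixel data is untouched."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    kept = [PNG_SIGNATURE]
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        if pos + 8 > len(data):
            raise ValueError("truncated chunk header")
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        end = pos + 12 + length  # 4 length + 4 type + payload + 4 CRC
        if chunk_type not in METADATA_CHUNKS:
            kept.append(data[pos:end])
        pos = end
    return b"".join(kept)

# Demo: a 1x1 grayscale PNG carrying a 500-byte comment.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # filter byte + one gray pixel
bloated = (PNG_SIGNATURE
           + make_chunk(b"IHDR", ihdr)
           + make_chunk(b"tEXt", b"Comment\x00" + b"x" * 500)
           + make_chunk(b"IDAT", idat)
           + make_chunk(b"IEND", b""))
lean = strip_png_metadata(bloated)
print(len(bloated), "->", len(lean))
```

For production use, stick with the established optimization tools. This sketch is only meant to show that the "bloat" is ordinary, separable data sitting next to the pixels.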

The second part is more challenging. Bloated images get on a website, even when the designer knows better, because of your processes, or lack thereof. Marketing needs this image now, so it goes out the door fast and it doesn’t get optimized. Designers optimize their images, but the product catalog images they get from a 3rd party don’t get optimized. Organizations need clear policies or procedures about how images and other assets are placed on a production site. Optimization should be incorporated into that process. That is how to fix the problem long term. Ideally, it becomes an automated step in the publish-to-production process. How to do that is another post in and of itself.

Summary

I love designers. You do things I could never do and make the web a better place. But when you create your design without thinking about the other half of the equation, the actual experience of getting that content, you sell yourself and your design short. And that is what’s ruining the web.

Want to see what performance problems your website has? Unoptimized GIFs, PNGs, and JPEGs are just 3 of the nearly 400 performance issues Zoompf detects when testing your web applications. You can get a free performance scan of your website now. Need more performance goodness? Try our Zoompf WPO product.

Lose the Wait: HTTP Compression

Posted: February 10, 2012 at 4:25 pm

One of the ways you can improve website performance is to reduce the amount of data that needs to get delivered to the client. An easy way to reduce the amount of data sent to a client is to compress the content and then transfer it to the client. This can be done with HTTP compression. Despite being a surprisingly simple feature of HTTP, there are numerous challenges which must be addressed to properly use HTTP compression. These challenges are:

  1. Ensuring you are only compressing compressible content.
  2. Ensuring you are not wasting resources trying to compress uncompressible content.
  3. Selecting the correct compression scheme for your visitors.
  4. Configuring the web server properly so compressed content is sent to capable clients.

In this post, part of our Lose the Wait performance series, I will discuss each of these issues and demonstrate how to configure your web server to implement HTTP compression properly.

Compressing Compressible Things

Let’s start out easy. What should HTTP compression get applied to? The answer is simple: Any content which is not already natively compressed.

Notice I didn’t say "text resources." Text resources, like HTML, CSS, and JavaScript certainly should be compressed because they are not natively compressed file formats. Unfortunately, most people seem to focus on these 3 types of files. In fact, a quick web search shows that most of the top results for ".htaccess compress" include instructions only on compressing HTML, CSS, and JavaScript files. This just reinforces what I’ve said before; you have to be careful where your advice comes from.

Here is a list of common text resource types on the web which should be served with HTTP compression:

  • XML. XML is structured text used in standalone files (like Flash’s crossdomain.xml or Google’s sitemap.xml) or as a data format wrapper for API calls.
  • JSON. JSON is a subset of JavaScript used as a data format wrapper for API calls.
  • News feeds. Both RSS and Atom feeds are XML documents.
  • HTML Components (HTC). HTC files are a proprietary Internet Explorer feature which package markup, style, and code information used for CSS behaviors. HTC files are often used by polyfills such as Pie or iepngfix.htc to fix various problems with IE or to back port modern functionality.
  • Plain Text. Plain text files can come in many forms, from README and LICENSE files, to Markdown files. All should be compressed.
  • Robots.txt. Robots.txt is a specific text file used to tell search engines what parts of the website to crawl. Robots.txt is often forgotten since it is not usually accessed by humans and does not appear in JavaScript-based web analytics logs. Since robots.txt is repeatedly accessed by search engine crawlers and can be quite large, it can consume large amounts of bandwidth without your knowledge.

ICO

As I said, HTTP compression isn’t just for text resources and should be applied to all non-natively compressed file formats. What do I mean by this?

As an example, let’s look at ICO files. ICO is an image format originally used for icon images on Windows. The format, as it is in use today, was created over 20 years ago for Windows 3.0. Today, ICO files are used on the web as Favicons for a website, usually displayed in the address bar or browser tab. While modern browsers allow other file formats besides ICO, support is not universal. Many sites continue to use ICO files as Favicons for compatibility reasons.

Despite being an image, ICO files are not natively compressed. ICO images are actually a primitive version of a BMP image. Neither ICO nor BMP image formats are natively compressed. While you can (and should) avoid using BMP images on your website, you can’t do this with ICO files. Be sure to configure your web server to serve ICO images with HTTP compression.

SVG

SVG images are another example of an image format which is not natively compressed. SVG images are just XML documents, but they have a different MIME type and file extension. This means, while someone might remember to compress XML documents, they forget to compress SVG documents.

You might be using SVG images on your website and not even know it. This is because of a feature of SVG images, SVG fonts, which allow SVG files to contain font glyphs used to render text. These SVG image-that’s-really-a-font files can be referenced in CSS using the @font-face syntax much like an OTF or WOFF font file. Divya Manian has written a comprehensive post about the pros and cons of SVG fonts. For the purposes of this discussion the main take-away from her post is that, until iOS 5, SVG fonts were the only type of custom font supported by iPhone, iPad, and iPod Touch.

Font support is, to put it nicely, a giant mess. Font libraries abstract this away from the web developer and serve the correct format, including SVG fonts, to the correct browser. This means your website can be using SVG without you even knowing it. Remember to serve your SVG files using HTTP compression.

Compressing already compressed content

Another mistake developers make with HTTP compression is using it on content that is already natively compressed. Applying compression to something that is already compressed doesn’t help improve performance. In fact, it can hurt performance in two ways.

First, HTTP compression has a cost. The web server has to take the content, compress it, and then send it to the client. If the content cannot be compressed further, you are just wasting CPU doing a meaningless task.

Secondly, applying HTTP compression to something that’s already compressed doesn’t make it smaller. In fact, the overhead of adding headers, compression dictionaries, and checksums to the response body actually makes it bigger, as shown in the figure below:

Do websites actually do this? Yes, and it’s more common than you would think. I used Zoompf WPO to examine Fox News. Fox News is the 40th most visited website in the United States. As you can see, Fox News is mistakenly applying HTTP compression to PNG images.

This not only wastes CPU, but also increases the size of the PNG images delivered to Fox News visitors by a few dozen bytes:

Zoompf actually has two different checks for this issue. The first check "Compressed Content served with HTTP compression" alerts you that you are wasting CPU time compressing something that is already compressed. The second check, "Bigger with HTTP Compression" identifies content that is actually larger when served using HTTP compression.

Both of these problems usually are the result of a configuration problem with the web server or an inline network device. Something in your environment is applying HTTP compression to all outbound content instead of only content that should be compressed.
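You can reproduce the "bigger with HTTP compression" effect yourself in a few lines of Python. In this sketch, random bytes from os.urandom stand in for an already compressed file such as a PNG or JPEG. That substitution is an assumption on my part, but it is a fair one: well-compressed data looks essentially random to a second compressor. The sample markup is made up.

```python
import gzip
import os

# Repetitive markup: the kind of content HTTP compression is meant for.
text_body = b"<li class=\"item\">Hello, world</li>\n" * 200

# Stand-in for an already compressed image (high-entropy bytes).
compressed_like = os.urandom(len(text_body))

for label, body in (("repetitive markup", text_body),
                    ("already compressed", compressed_like)):
    encoded = gzip.compress(body)
    change = len(encoded) - len(body)
    print("%-20s %6d -> %6d bytes (%+d)" % (label, len(body), len(encoded), change))
```

The markup shrinks dramatically, while the high-entropy body comes out slightly larger because GZIP still has to add its header, trailer, and block bookkeeping.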

GZIP Vs. DEFLATE

So far, we have talked about HTTP compression as if it is an opaque or atomic feature. But that is not the case. HTTP simply defines a mechanism for a web client and web server to agree on a compression scheme that can be used to transmit content. This is accomplished using the Accept-Encoding and Content-Encoding headers. There are two commonly used HTTP compression schemes on the web today: DEFLATE and GZIP.

DEFLATE is a patent-free compression algorithm for lossless data compression. There are numerous open source implementations of the algorithm. The standard implementation library most people use is zlib. The zlib library provides functions for compressing and decompressing data using DEFLATE/INFLATE. The zlib library also provides a data format, confusingly named zlib, which wraps DEFLATE compressed data with a header and a checksum.

GZIP is another compression library which compresses data using DEFLATE. In fact, most implementations of GZIP actually use the zlib library internally to conduct DEFLATE/INFLATE compression operations. GZIP produces its own data format, confusingly named GZIP, which wraps DEFLATE compressed data with a header and a checksum.

Unfortunately, the HTTP/1.1 RFC does a poor job when describing the allowable compression schemes for the Accept-Encoding and Content-Encoding headers. It defines Content-Encoding: gzip to mean that the response body is composed of the GZIP data format (GZIP headers, deflated data, and a checksum). It also defines Content-Encoding: deflate but, despite its name, this does not mean the response body is a raw block of DEFLATE compressed data. According to RFC-2616, Content-Encoding: deflate means the response body is:

[the] "zlib" format defined in RFC 1950 [31] in combination with the "deflate" compression mechanism described in RFC 1951 [29].

So, DEFLATE, and Content-Encoding: deflate, actually means the response body is composed of the zlib format (zlib header, deflated data, and a checksum).
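The three formats are easy to tell apart once you see them side by side. Python's zlib module can produce all of them by changing the wbits argument: a negative value gives raw DEFLATE (RFC 1951), a plain value gives the zlib wrapper (RFC 1950), and adding 16 gives the GZIP wrapper (RFC 1952). The helper name and sample payload below are just for illustration.

```python
import zlib

def deflate_with(wbits, payload):
    """Compress payload with DEFLATE, using the wrapper selected by wbits."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(payload) + compressor.flush()

payload = b"HTTP compression is simple. " * 50

raw_deflate = deflate_with(-15, payload)  # no wrapper at all
zlib_format = deflate_with(15, payload)   # what "Content-Encoding: deflate" should mean
gzip_format = deflate_with(31, payload)   # what "Content-Encoding: gzip" means

for name, blob in (("raw DEFLATE", raw_deflate),
                   ("zlib format", zlib_format),
                   ("GZIP format", gzip_format)):
    print("%-12s %4d bytes, starts with %s" % (name, len(blob), blob[:2].hex()))
```

The zlib format begins with a 2-byte header (the first byte is 0x78 for the default 32 KB window) and ends with an Adler-32 checksum. The GZIP format begins with the bytes 1f 8b and ends with a CRC-32 and the uncompressed length. Raw DEFLATE has neither, which is exactly why a client cannot reliably tell what it has been handed.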

This "deflate the identifier doesn’t mean raw DEFLATE compressed data" idea was rather confusing. Early versions of Microsoft’s IIS web server were programmed to return raw DEFLATE compressed data for Accept-Encoding: deflate requests instead of a zlib formatted response. And naturally versions of Internet Explorer at the time expected responses with a Content-Encoding: deflate header to have raw DEFLATE response bodies.

As Mark Adler, one of the authors of zlib, explains in this Stack Overflow thread:

However early Microsoft servers would incorrectly deliver raw deflate for "Deflate" (i.e. just RFC 1951 data without the zlib RFC 1950 wrapper). This caused problems, browsers had to try it both ways, and in the end it was simply more reliable to only use GZIP.

As Mark says, browsers receiving Content-Encoding: deflate had to handle two possible situations: the response body is raw DEFLATE data, or the response body is zlib wrapped DEFLATE. So, how well do modern browsers handle raw DEFLATE or zlib wrapped DEFLATE responses? Verve Studios put together a test suite and tested a huge number of browsers. The results are not good.

All those fractional results in the table mean the browser handled raw-DEFLATE or zlib-wrapped-DEFLATE inconsistently, which is really another way of saying "It’s broken and doesn’t work reliably." This seems to be a tricky bug that browser creators keep re-introducing into their products. Safari 5.0.2? No problem. Safari 5.0.3? Complete failure. Safari 5.0.4? No problem. Safari 5.0.5? Inconsistent and broken.

Sending raw DEFLATE data is just not a good idea. As Mark says "[it's] simply more reliable to only use GZIP."
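For the curious, the "try it both ways" logic a client ends up needing looks roughly like the Python sketch below. This is my own illustration of the idea, not code from any browser: attempt the zlib-wrapped interpretation first, and if the header check fails, fall back to treating the body as raw DEFLATE.

```python
import zlib

def decode_deflate_body(body):
    """Decode a 'Content-Encoding: deflate' body of unknown flavor."""
    try:
        # What the HTTP/1.1 RFC says it should be: zlib-wrapped DEFLATE.
        return zlib.decompress(body)
    except zlib.error:
        # What some servers actually sent: raw DEFLATE with no wrapper.
        return zlib.decompress(body, -zlib.MAX_WBITS)

original = b"hello hello hello " * 20

wrapped = zlib.compress(original)
raw_compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw = raw_compressor.compress(original) + raw_compressor.flush()

print(decode_deflate_body(wrapped) == original)
print(decode_deflate_body(raw) == original)
```

A guess-and-retry decoder like this is the kind of thing that is easy to get subtly wrong, which fits the inconsistent browser results above. GZIP needs no such guessing.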

It should also be noted that all browsers that support DEFLATE also support GZIP, but not all browsers that support GZIP support DEFLATE. Some browsers, such as Android, don’t include deflate in their Accept-Encoding request header. Since you are going to have to configure your web server to use GZIP anyway, you might as well avoid the whole mess with Content-Encoding: deflate.

Luckily, avoiding DEFLATE isn’t all that difficult.

The Apache module which handles all HTTP compression is mod_deflate. Despite its name, mod_deflate does not support deflate at all. It’s impossible to get a stock version of Apache 2 to send either raw DEFLATE or zlib wrapped DEFLATE. Nginx, like Apache, does not support deflate at all. It will only send GZIP compressed responses. Sending an Accept-Encoding: deflate request header will result in an uncompressed response.

Microsoft’s IIS web server can send both gzip and deflate responses and you can enable or disable each scheme individually. For IIS6, you can edit the metabase to disable DEFLATE support. For IIS7, you can disable DEFLATE support by editing the DEFLATE compression scheme section in the <schemes> element of the <httpCompression> element of the various IIS7 .config files.

Both Zoompf’s free and commercial products have a check built-in, “Obsolete Compression Format”, which will detect if your web server is sending content compressed with DEFLATE.
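If you are writing your own server-side code rather than relying on the web server, the "GZIP or nothing" policy is simple to express. Here is a minimal Python sketch. The choose_encoding function is hypothetical, and it is deliberately simplified: it honors q=0 but ignores the "*" wildcard and does not rank encodings by preference.

```python
def choose_encoding(accept_encoding):
    """Return 'gzip' if the client accepts it, otherwise 'identity'.

    'deflate' is never chosen, even when the client offers it.
    """
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        if token.strip().lower() not in ("gzip", "x-gzip"):
            continue
        quality = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality > 0:
            return "gzip"
    return "identity"

for header in ("gzip, deflate", "deflate", "gzip;q=0, deflate", ""):
    print(repr(header), "->", choose_encoding(header))
```

A client that only offers deflate simply gets an uncompressed response, which mirrors the Apache and Nginx behavior described above.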

Netscape 4 and Internet Explorer 6 Are Screwing You. Again.

So by now you should have your web server configured to:

  1. Properly compress what needs to be compressed.
  2. Avoid compressing already compressed content.
  3. Use only GZIP.

Now you need to ensure that your configuration is not actually excluding perfectly capable browsers.

While HTTP compression is a mature feature today, there were some problems early on. Netscape 4 only supported HTTP compression for HTML documents even though it sent an Accept-Encoding: deflate, gzip for all requests. Serving it HTTP compressed CSS or JS documents would make it crash. For reasons that aren’t quite clear, the developers of Apache decided to address this client-side bug with a server-side fix. They added the following seemingly harmless line into the Apache configuration file:

BrowserMatch ^Mozilla/4 gzip-only-text/html

Any browser calling itself Mozilla/4 would only receive HTTP compressed HTML files. Since Apache was and is the most popular web server on the Internet, this caused enormous problems which still affect us today.

First of all, this was the middle of the browser wars and Internet Explorer 4, Internet Explorer 5 and even Internet Explorer 6 all identified themselves as Mozilla/4 in their User-Agent strings. But these browsers could accept HTTP compression for non-HTML responses. Trying to patch around one buggy browser caused another to be slow! Since IE6 would ultimately achieve over 95% market share, it was a problem that IE6 would download webpages more slowly from Apache than from other web servers. To resolve this, the Apache developers were forced to add another configuration directive:

BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html

This line means: if the User-Agent has MSIE in it, then turn off the no-gzip and gzip-only-text/html options, thereby instructing Apache to use HTTP compression for all responses if IE asked for it. And all was good, until it wasn’t.

You see, IE6 on Windows XP also had multiple problems with HTTP compression. Most of these issues dealt with compressed CSS or JavaScript files being cached as compressed items which were then read from the cache as if they were not HTTP compressed. So again another Mozilla/4 browser had problems with compression, and so again the Apache developers had to "fix" the issue with another configuration directive:

BrowserMatch \bMSIE\s6 gzip-only-text/html

This directive instructed the web server to only send compressed content for HTML responses if the browser was IE6. While this dealt with the majority of the issues, some of these bugs caused so many extreme edge-case problems that, for reliability reasons, larger sites would disable HTTP compression for IE6 entirely:

BrowserMatch \bMSIE\s6 no-gzip

Eventually Microsoft fixed these issues with hot fixes and, comprehensively, with Windows XP Service Pack 2. But this created a fragmentation problem, where some IE6 browsers could handle HTTP compression for all content, and some could not. Another rule was added in an attempt to serve compressed content to IE6 browsers that had SP2 installed. This was done by looking for the poorly named SV1 identifier in IE6’s User-Agent string:

BrowserMatch "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1" !no-gzip !gzip-only-text/html

This chain of "deny this, but not this, unless it’s this, but not if it is also this" directives made configuring a web server to properly serve compressed documents to the appropriate browsers difficult and prone to error. Since these bug/solution cycles happened numerous times over several years, these configuration directives mutated. Blog posts from 2004 would tell you to do one thing and blog posts from 2006 would say another. Much like a child’s game of telephone, shortcomings, errors, missing edge cases, and missing corner cases were magnified as people reused old configuration files and shared the "correct" advice. Even today, many of the top Google search results for configuring HTTP compression for Apache using mod_deflate contain different and incorrect directives.
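To see how order-dependent these rules are, it helps to simulate them. The Python sketch below is a rough model of how a list of BrowserMatch directives is applied, not Apache's actual code: each pattern is searched against the User-Agent in order, and each action either sets an environment variable or, with a leading "!", unsets it. The variable names are written in lowercase, as in Apache's mod_deflate documentation. The three rules come from this post, while the sample User-Agent strings and function names are my own.

```python
import re

# Three of the directives discussed above, in the order they were added.
DIRECTIVES = [
    (r"^Mozilla/4", ["gzip-only-text/html"]),
    (r"\bMSI[E]", ["!no-gzip", "!gzip-only-text/html"]),
    (r"\bMSIE\s6", ["gzip-only-text/html"]),
]

def evaluate(user_agent, directives=DIRECTIVES):
    """Apply each matching directive in order and return the variables set."""
    env = set()
    for pattern, actions in directives:
        if re.search(pattern, user_agent):
            for action in actions:
                if action.startswith("!"):
                    env.discard(action[1:])
                else:
                    env.add(action)
    return env

def compression_policy(user_agent, directives=DIRECTIVES):
    """Summarize what the resulting variables mean for compression."""
    env = evaluate(user_agent, directives)
    if "no-gzip" in env:
        return "none"
    if "gzip-only-text/html" in env:
        return "html only"
    return "all content"

SAMPLE_AGENTS = {
    "Netscape 4": "Mozilla/4.08 [en] (WinNT; U)",
    "IE6": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "IE7": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Firefox": "Mozilla/5.0 (Windows NT 6.1; rv:10.0) Gecko/20100101 Firefox/10.0",
}

for name, agent in SAMPLE_AGENTS.items():
    print("%-10s -> %s" % (name, compression_policy(agent)))
```

Try reordering the list, or dropping the middle rule, and watch IE7 silently lose compression for everything except HTML. That is precisely the kind of mistake that spread as these snippets were copied from blog to blog.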

As I wrote in Advice on Trusting Advice, it all comes down to where you get your advice from. Follow the advice on this top search result and IE9+ gets no compression at all. Follow the advice on this top search result and IE6 gets no compression at all. Follow the advice from this search result and no version of IE will get anything using HTTP compression, except for IE7. Follow advice from IBM, and no version of IE will ever get a non-HTML file using HTTP compression.

Depending on which directives were used, and how the match criteria are configured, you ended up with several possible scenarios:

  • HTTP compression is completely disabled for all Mozilla/4 browsers.
  • HTTP compression is completely disabled for IE6
  • HTTP compression is completely disabled for IE6 except SV1
  • HTTP compression is completely disabled for all versions of IE
  • HTTP compression is completely disabled for all versions of IE, except IE6 (so no compression for IE > 6)
  • HTTP compression for non-HTML files is disabled for all Mozilla/4 browsers.
  • HTTP compression for non-HTML files is disabled for IE6
  • HTTP compression for non-HTML files is disabled for IE6 except SV1
  • HTTP compression for non-HTML files is disabled for all versions of IE
  • HTTP compression for non-HTML files is disabled for all versions of IE, except IE6 (so no compression for IE > 6)

Apache makes it quite easy to mess this up. Nginx is much easier. It completely ignores the old Netscape 4 browsers and does not attempt to work around them. It also has a very simple mechanism to avoid sending compressed content to bad versions of IE6. You don’t need to manually define "this is good" and "this is bad" regexes, which helps you avoid making a mistake.

In practice, you should just not even try to work around these problematic browsers. The problem browsers have all been updated or patched. Even the most recent of the affected browsers, IE6, was fixed nearly a decade ago. Even on platforms that are no longer supported, this issue has been fixed. You should review your configuration file and remove any browser filtering code used for HTTP compression.

Hopefully this section has also taught you that fixing a client-side bug with a server-side fix is rarely a good or sustainable idea. As I discussed in The Big Performance Improvement in IE9 No One is Talking About, this approach of using the User-Agent as a factor in content generation forced the widespread use of the Vary: User-Agent header. The Vary header used in this manner effectively nullifies shared caching, which reduces the overall performance of the web.

Extension Vs. MIME Type

It is important to review how your web server is configured to compress content. Most web servers allow you to specify either a list of file extensions to compress, or a list of MIME types to compress, or both. Be careful to review this list.

Let’s say you have configured your application to serve text/javascript responses using compression. Are you sure that’s the only MIME type your application uses when serving JavaScript files? What about text/x-javascript or application/x-javascript or application/javascript? What MIME type does your API serve for JSON responses? text/json? application/json? Something else? How about HTML? Are all of your HTML files using text/html? Do you have some sections from the XHTML days which use other MIME types like application/xhtml+xml or text/xhtml or application/xhtml? Is all of the markup generated by your application served using a single and consistent MIME type? And let’s not forget about the code you didn’t write. What MIME type does that opaque charting library use to send data to the client? Or that auto-completing textbox widget you got from Github?

If you are configuring the web server to use compression using file extensions, did you get all of them? .htm or .html or is it something else? What about your 404 handler? A request happens for the non-existent file /foo/bar.jpg. Since the file extension is not explicitly defined as something that should be compressed (or, being an image, is explicitly defined not to be compressed), the 404 response isn’t sent with compression.

Care must be taken when configuring your web server to ensure that uncompressed content is not slipping through due to a missing file extension or MIME type declaration.
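One way to catch a missing declaration is to compare the MIME types your application actually emits against the list your server is configured to compress. The Python sketch below is only an illustration: the `COMPRESSED_TYPES` set and the function names are hypothetical, and the list of observed types would have to come from your own logs or a crawl. It reports every observed type that is not in the compression list, so you still need to decide which of those are text-like content and which (images, for example) are correctly left out.

```python
# Hypothetical compression list, as it might be transcribed from a server config.
COMPRESSED_TYPES = {"text/html", "text/css", "text/javascript"}


def normalize(content_type):
    """Drop parameters such as '; charset=utf-8' and lowercase the MIME type."""
    return content_type.split(";", 1)[0].strip().lower()


def uncovered_types(observed_content_types, compressed_types=COMPRESSED_TYPES):
    """Return the observed MIME types that the compression list does not mention."""
    configured = {normalize(t) for t in compressed_types}
    observed = {normalize(t) for t in observed_content_types}
    return sorted(observed - configured)


# Content-Type values as an application might send them (made up for this example).
observed = [
    "text/html; charset=utf-8",
    "application/javascript",
    "Application/JSON",
]
for mime_type in uncovered_types(observed):
    print("not in the compression list:", mime_type)
```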

Properly Configuring HTTP Compression

So, given all these challenges, how should you go about configuring HTTP compression properly?

To see where you might have made a mistake configuring your server, you need something to compare it to. I am a big fan of the .htaccess file from the HTML5 Boilerplate Project. This is an Apache configuration file specifically crafted for web performance optimizations. It provides a great starting point for implementing HTTP compression properly. It also serves as a nice guide to compare to an existing web server configuration to verify you are following best practices. At the very least, the HTML5 Boilerplate .htaccess file provides a comprehensive list of common web content which should or should not get served using HTTP compression.

Getting a good starting point is only half the battle. The configuration for HTTP compression on a web server only works when it matches the application running on that server. Even the HTML5 Boilerplate configuration file can fail you if there is a discrepancy between the file extensions and MIME types in the configuration file and those used by your application. It’s easy to forget or overlook a MIME type or a file extension that your application uses. To ensure your application matches your configuration, the best thing to do is carefully review:

  1. How is your web server configured to map MIME types to content or file extensions?
  2. How is your web server configured to compress content relative to those MIME types or extensions?
  3. How are your application’s filenames and extensions structured?
  4. How does your application change or override a response’s MIME type?
  5. What third party libraries use MIME types?

Once you think you have properly configured the web server, you need to validate it. Web Sniffer is a great, free, web-based tool that lets you make individual HTTP requests and see the responses. Web Sniffer gives you some control over the User-Agent and Accept-Encoding headers to ensure that compressed content is delivered properly. Hurl is another web-based HTTP tool you can use. It allows for more control than Web Sniffer, but requires you to manually enter more information to get the same results.

Hurl and Web Sniffer only test a single page at a time. Zoompf’s free scan and Zoompf WPO can be used to scan multiple pages to verify no uncompressed content is slipping through.
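If you would rather script this check yourself, the same idea fits in a few lines of Python using only the standard library. This is a rough sketch, not a replacement for the tools above: the function names are my own, it looks only at the Content-Encoding response header, and it treats error responses such as 404s the same as any other response, because those should be compressed too.

```python
import urllib.error
import urllib.request


def is_compressed(headers):
    """True if the Content-Encoding header names gzip or deflate."""
    lowered = {name.lower(): value for name, value in headers.items()}
    tokens = [t.strip() for t in lowered.get("content-encoding", "").lower().split(",")]
    return "gzip" in tokens or "deflate" in tokens


def fetch_headers(url):
    """Request a URL while advertising gzip support and return the response headers."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return dict(response.headers.items())
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx, but the error object still carries the headers.
        return dict(error.headers.items())


# Example (needs network access; the URL is a placeholder):
#   print(is_compressed(fetch_headers("http://example.com/foo/bar.jpg")))
```

Looping `fetch_headers` over a list of your own URLs, including a deliberately missing one, gives a quick first pass before reaching for a full scan.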

Conclusions

As this post shows, there are many challenges which must be overcome to properly configure HTTP compression. Make sure all non-natively compressed content is served using HTTP compression. Don’t waste load time, CPU cycles, and bandwidth compressing content that is already compressed. Only use GZIP compression to ensure compatibility. Don’t try to work around old browsers since it is easy to make a mistake and end up not delivering compressed content to a capable browser. Review your application code and server configuration to make sure the application’s content and structure matches your HTTP compression settings. Don’t forget about compressing 404s. Finally, don’t just assume your configuration works. Use a tool to validate that it works.

Want to see what performance problems your website has? Content Served Without Compression, Compressed Content Served with Compression, Bigger With Compression, and Obsolete Compression Format are just 4 of the nearly 400 performance issues Zoompf detects when testing your web applications. You can get a free performance scan of your website now and take a look at our Zoompf WPO product at Zoompf.com today!

Performance Questions to Ask Hosting Providers: Secure Website Access

Posted: December 14, 2009 at 3:06 pm

(This is the third article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article, “What control do I have over the web server?” and the second article “What access do you provide to web server logs?” are also available.)

So far in this series we have talked a lot about questions to ask hosting providers to make sure you can configure your website for performance and access the raw traffic logs of your website to spot performance problems. All of this is moot of course if you cannot get content onto your website. That’s why this post of “Questions to ask a hosting provider” is all about:

“Can I Securely Communicate With My Website?”

It has happened to everyone. You are out at a coffee shop, a client site, or at a conference and you need to make changes to your website. Perhaps you need to upload a few new PHP files or some images. Perhaps you need to update your web server configuration to set up a new email address for an event. Perhaps you simply saw something cool and want to write a WordPress post. But can you do any of these things securely using a public network? This question is best answered with an analogy.

Imagine you are at a formal cocktail party. You drift from room to room, through a sea of lavishly dressed party goers and dine on mouth-watering morsels served on silver trays by waiters in white gloves. As you approach a side table of crystal champagne glasses you overhear bits and pieces of the conversations around you.

  • “We cannot wait. It should be a lovely vacation and it’s the perfect time for us to get away for a week.”
  • “That’s right, with the nanny! Walked right in on them! And he tried to say that she was only choking!”
  • “Chris starts there next spring, just like his father.”

Well attended cocktail parties are loud and noisy. It’s almost impossible not to hear what everyone else is saying! Of course we are taught that to be polite we should ignore the conversations other people are having unless we are involved. You are on the honor system not to eavesdrop.

Public networks such as wireless networks are just like cocktail parties. Your wireless card is like a party guest. It broadcasts out to the room when it “speaks” and “listens” to everyone within range to hear a response. Like a real party guest, wireless cards are supposed to ignore any conversations they overhear that are not meant for them. They do this by dropping the data and not bubbling it up to the computer. However, nothing forces network devices to ignore data they receive that is not meant for them. In fact, all networking devices (not just wireless devices) can be placed into “Promiscuous Mode” where any data they receive, even data that is not addressed to them, is received and bubbled up to the computer to process. This allows any networking device to become a giant listening device that hears and records all the information on the network! Promiscuous mode is not some evil hacker trick. It’s a fully intended feature of networking devices that has many legitimate uses.

Diagram showing how clients in a wireless network hear each others' traffic

But wait! I use Encryption!

“The conference wireless network or the coffee shop’s wireless network is encrypted. They tell me they use something called WPA2 with a key of a million bits! I’m secure, right?”

No, you are not secure.

Let’s go back to the cocktail party analogy. The hosts don’t want just anyone coming into their party and drinking all their fine wines. So they place a bouncer at the door of the party. Only people that know the password are allowed into the party. If you know the password you get into the party and can listen to all the other guests. If you do not know the password you remain outside the building and cannot hear anything that is going on inside.

Encrypted wireless networks are just cocktail parties with bouncers. You need the “password” to join the wireless network. Once you are connected you can listen to everyone else’s traffic just like before because on the network everyone is using the same password to transmit and receive their data. (This is the only scalable solution. Otherwise the wireless network administrator would have to create a new, unique password for each and every person that joins the network). In other words, an encrypted network uses the password solely to protect and restrict “access” to the network. It does nothing to protect the users of the network from themselves or from each other.

The Danger of Sniffing (packets)

So What! Who cares if someone can listen to my network traffic. It’s not a big deal. After all they will just see the blog content I was about to post anyway. Unfortunately this is not true. Using any system that requires a username and a password on a wireless network? You may have shouted to the entire cocktail party that username and password. And chances are you use that same username and password somewhere else on the Internet. Like your bank. Or an online store. Are you already logged into a system like Gmail or your WordPress administration panel? You are shouting your HTTP Cookies to the entire cocktail party. Someone can steal your HTTP session cookies and use session hijacking to access Gmail or WordPress as if they were you without needing your username and password. Next thing you know you are on The Wall Of Sheep!

Secure Communications With Your Website

Remember: network encryption protects networks and application encryption protects applications! You need to make sure you are using encrypted application protocols to properly protect yourself. What protocols you use and how you use them will vary with different use cases.

Uploading Content

How do you upload content to your website? If the answer is FTP you are in trouble. FTP sends usernames and passwords in the clear. You need an encrypted file transfer mechanism like SFTP or SCP. If you have shell access to your web server using SSH you also have the ability to use either SFTP or SCP as they are simply subsets of the functionality of SSH. By default most hosting companies provide an insecure file transfer system like FTP. Ask if they provide (for free) a secure file transfer system like SFTP or SCP. Make sure they understand you don’t need full SSH functionality and are only interested in secure file transfer. If this is not available you might need to upgrade your account or purchase an add-on to get SSH access for your website.

Writing Content

Do you use a web interface to write content for your blog platform or CMS system? Does it use SSL? Check the address bar. Does it start with https? If not you are not using SSL. Do you write your content using other software? Does that software directly publish the content to your blog using a web API like RSD or XMLRPC? Does that use SSL? Check the settings and see if you are using “https” to access the API interface. If you are not using SSL to communicate with these web resources then anyone can capture your username and password or cookies (which are just as good as your username and password).
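Checking the scheme of every endpoint your tools talk to is easy to automate. The Python sketch below assumes you have collected the URLs yourself (admin pages, API endpoints, and so on); the function name and the example URLs are made up for illustration.

```python
from urllib.parse import urlsplit


def insecure_endpoints(urls):
    """Return the URLs that are not accessed over https."""
    return [url for url in urls if urlsplit(url).scheme.lower() != "https"]


# Hypothetical endpoints a blogger might have configured.
endpoints = [
    "https://example.com/wp-admin/",
    "http://example.com/xmlrpc.php",
]
for url in insecure_endpoints(endpoints):
    print("credentials or cookies would travel in the clear:", url)
```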

Website Administration

How do you administer your website? Do you use a web interface like cPanel? These web administration interfaces are most common in shared hosting environments and typically run on a different hostname or an odd port number. Ask the hosting provider if they offer SSL access to the interface. Hosting providers often get confused and think you want to create an SSL certificate for your website. While this would secure a CMS you configure like WordPress (see previous use case) it does not help you secure the web administration interface because that is often running on a separate system. Make sure they understand you want secure access to their interface, not your website. This discussion may take several emails back and forth but most hosting providers are willing to supply SSL access to cPanel or other administration interfaces.

Summary

In conclusion, the questions about secure communications you should ask your hosting provider are:

  • “Do you provide a secure file transfer mechanism like SFTP or SCP? Is it provided for free or is it extra? If you don’t, do you offer SSH access to the web server? Is it free?”
  • “If you provide a web-based website administration interface like cPanel do you provide access to it using SSL?”
  • “Do you provide an SSL certificate for my CMS? What is the cost?”

How to judge their answers will vary from person to person based on need. Personally, a secure file transfer mechanism is a requirement. Too many times have I needed to upload a presentation, PDF, or file to my website from a public network at a conference or client site. If you are a heavy blogger, secure access to your content management system is going to be critical. After all, it is difficult to write a blog post about an event from the event if you cannot securely access your blog to write the post!

Browser Performance Problem with CSS "print" Media Type

Posted: December 7, 2009 at 3:02 pm

I ran across an article today that shocked me. Geert De Deckere wrote how you can save an HTTP request by combining the CSS files for the print and screen media types.

Wait, I thought. What? Why do I need to do this? What behavior is this correcting? I was very confused. Maybe you are too.

CSS allows you to define styling information for different media types. CSS could tell a browser that is rendering to a TV screen to style the same content differently from a browser rendering on a mobile phone. The HTML content is all the same. Media types simply define which style rules apply for which devices. CSS also defines a print media type which is the style to use when styling a page that is being printed. Browsers should be smart about only downloading the style sheet with the media type for the device they are rendering. Firefox on my laptop should not fetch the mobile.css style sheet whose media type is handheld. And luckily browsers are smart and don’t download CSS files for media types that they don’t support.

Except for the CSS media type print.

Geert’s article and advice were predicated on the claim that web browsers will download external style sheets with the print media type even if you don’t print the page. Is this true? To find out I built a quick test page:

<html>
<head>
  <title>CSS Media Tests</title>
  <link rel="stylesheet" type="text/css" href="screen.css" media="screen" />
  <link rel="stylesheet" type="text/css" href="print.css" media="print" />
</head>
<p>
  Hello!
  <img src="new-logo.png">
</p>
</html>

Wow! Geert’s claim was true! All major desktop browsers that I tested (Firefox 3.5, IE 7, Chrome 3.0 and Safari 4.0) will download external style sheets whose media type is print even if you don’t print the page! This hurts performance for no good reason. Your browser must make one HTTP request and download screen.css. But then your browser has to make an additional HTTP request to download a file full of content that it does not need. Worst of all, the browser will not start rendering the page until it has grabbed the completely unused print.css file!

This is very silly behavior. Especially given that virtually none of your website visitors are going to print any of your web pages, unless you are a website like Google Maps. Try to remember: “when was the last time you printed a web page?” Unfortunately all 4 browsers I tested downloaded print.css even though I never printed the page. Firefox, IE, and Chrome all downloaded print.css in order as if it was an external CSS file whose media type was screen. Looking through a proxy the request order was:

  1. css-media-test.html
  2. screen.css
  3. print.css
  4. new-logo.png

Safari 4.0 however, downloaded the content in this order:

  1. css-media-test.html
  2. screen.css
  3. new-logo.png
  4. print.css

Safari was smart enough to defer downloading print.css but still downloaded it. I do not know if Safari delayed firing the window.onload event until after print.css downloaded or not. WebPageTest confirms that IE does not start rendering the page until print.css is downloaded. The fact that Firefox and Chrome both requested content in the same order as IE leads me to think they also delay rendering.

Possible Solution?

Geert proposed a solution to this problem. He recommends combining the two external CSS files into a single CSS file and using @media directives inside the CSS file to separate the style info for screen from the style info for print. You end up with a single CSS file that looks like this:

@media screen {
  /* contents of screen.css here */
}
@media print {
  /* contents of print.css here */
}

This solution does not sit well with me. Yes, by combining the two CSS files and using @media directives you can remove an HTTP request. You now only have to download a single CSS file whose size will be smaller than the sum of the two original file sizes because a single large file will compress better. However, your visitors still have to download a large amount of CSS content. 30-40% of that content is printer-centric style information which no one will ever actually use anyway, and the browser will not start to render the page until all this useless data has been downloaded. (Interestingly enough, Zoompf’s free Web Performance Scan checks style blocks and CSS files for @media directives and recommends you break them into separate style sheets to prevent unnecessary rules from being downloaded. I had to modify the check to allow @media print directives when I found this solution.)

A Different Solution

I believe there is a different and perhaps better solution. You can defer downloading print.css by using JavaScript to dynamically add a <LINK> tag pointing to the external CSS file with the print media type after the page has loaded! This solution means the browser only needs to make 1 HTTP request before it can start drawing the page, and less CSS content needs to be downloaded. This will have a faster “Time to Render” than a single CSS file as less data is downloaded. The extremely small number of people who do print your web page will still get the style sheet necessary for them to print. You can also use a <NOSCRIPT> tag in the <HEAD> to link to print.css. This means anyone who has JavaScript turned off will take the performance hit all of your visitors are currently taking and request both external style sheets. The deferring print.css solution looks like this:

<html>
<head>
  <title>CSS Media Tests</title>
  <link rel="stylesheet" type="text/css" href="screen.css" media="screen" />
  <noscript>
    <link rel="stylesheet" type="text/css" href="print.css" media="print" />
  </noscript>
</head>
<p>
  Hello!
  <img src="new-logo.png">
</p>
<script>
  window.onload = function() {
    var cssNode = document.createElement('link');
    cssNode.type = 'text/css';
    cssNode.rel = 'stylesheet';
    cssNode.href = 'print.css';
    cssNode.media = 'print';
    document.getElementsByTagName("head")[0].appendChild(cssNode);
  }
</script>
</html>

You reduce initial request count and download size at the cost of greater complexity and more markup. This code could be improved. A more scalable solution would be for the JavaScript code to look in the <HEAD> and parse any <LINK> tags inside of a <NOSCRIPT> with a print media type and create new LINK elements dynamically.

Solving the problem

A summary of the problems and the two solutions appears below. This table assumes two CSS files (screen.css and print.css), each 30 kilobytes in size, and a combined CSS file (all.css) whose size is 55 kilobytes.

Method | HTTP Requests before “Start Render” | CSS Downloaded before “Start Render” | HTTP Requests after “Onload” | Content Downloaded after “Onload”
No Optimization | 2 | 60 Kb | 0 | 0 Kb
Single CSS file all.css | 1 | 55 Kb | 0 | 0 Kb
Deferring print.css | 1 | 30 Kb | 1 | 30 Kb

Which solution works best will vary with your situation. The status quo is 2 HTTP requests to deliver 60 Kb of content before the browser can start rendering. A single CSS file reduces that to 1 HTTP request and 55 Kb of content before the browser can start rendering. Deferring print.css also only requires 1 HTTP request before page load but only sends 30 Kb before the browser can start rendering. If you have a small print.css file it might be better to use a single CSS file with @media directives. The overhead of serving a single larger CSS file containing unused style data and the delay that adds until the browser can start rendering might be so small it does not matter. However, if you have a larger print.css file, deferring the print.css download until after page load would provide a great performance benefit.

The moral of the story here is that the browser creators need to remove this performance defect from their code. Ideally the print CSS media type data should not be downloaded until the print dialog box appears, either from user action or using window.print() in JavaScript. The next best solution would be for the browser to automatically defer the downloading of “print” CSS media type data until after the page has downloaded. In the meantime, you can use either the single CSS file solution or the deferring print.css solution to make your web pages load faster!

Want to see what performance problems you have? An appropriately placed <LINK> tag and proper use of CSS @media directives are just two of the 200+ performance issues Zoompf detects while assessing your web applications for performance. You can sign up for a free mini web performance assessment at Zoompf.com today!

The Challenge of Dynamically Generating Static Content

Posted: December 7, 2009 at 12:17 am

Time and time again I see people using PHP or some other application logic to try and hack around some issue they are facing. We saw this in our previous post Questions to Ask Hosting Providers: Web Server Configuration where people would use PHP to emulate mod_deflate or mod_expires. Andrew King, in his book Website Optimization talks about wrapping developer comments in CSS or JavaScript files in <?php ?> tags and using the PHP interpreter to remove them. People use PHP to combine CSS or JavaScript resources together. And today I read an article from the always awesome Chris Coyier over at css-tricks.com about using PHP to emulate CSS variables.

Don’t get me wrong. I was actually bemoaning the lack of variables in CSS two days before Chris wrote his article. (Actually, what we really want is more like C/C++ macros but that’s another story). Anyone who has tried to implement CSS sprites, change margins or element sizes, or modify color values knows what a pain it is to go through a CSS file and type the same thing over and over.

Using PHP to solve this problem, or any of the other problems listed above, makes perfect sense at first. Because it makes things easy. Because you are all being lazy. You are using a runtime mechanism to try and simplify your life.

Stop Being Lazy!

Now, under normal circumstances programmers should be lazy! After all your very job is to create something that does work for you! Unfortunately in this case your laziness is harming the performance of your application. Using application logic to dynamically generate static content at runtime is a massively bad idea. Consider these 4 consequences:

  • You take an order of magnitude performance hit for invoking the application tier instead of just serving a flat static file from the file system.
  • Since the web server is not serving a static file, there will be no Last-Modified header sent by default. That means no conditional GETs and no 304 responses which means lots of bandwidth consumption.
  • PHP, like virtually all application tiers, produces a chunked response. This is because the web server has no idea what the content length will be because it is dynamically generated. Dynamically generated chunked responses will not send the Accept-Ranges header. This means no pausing or resuming or error recovery. The entire resource must be re-downloaded.
  • Chunked encoding is not supported with HTTP/1.0, so any HTTP/1.0 device (like every caching proxy ever made) has to flip into “store and forward” mode where it downloads the entire response before passing it along.

And as if all these downsides for invoking the application tier were not enough, we have my personal favorite: Web Security! As someone who professionally broke into computer systems for many years, when I see:

http://example.com/combine.php?files=a.js|b.js|c.js

I get very excited. Think about what a resource combiner script does. “Hey website, I’m going to give you a list of files on your hard drive, and I want you to read them off the disk, one at a time, and dump their raw contents into a response and send it to me!” Jackpot baby! This is what we call a Local File Inclusion vulnerability just waiting to happen. The developer has not so much created a resource combiner as they have provided me with a rudimentary remote file download service! I immediately do something like this:

http://example.com/combine.php?files=db.inc

In about 45 seconds I have downloaded the /etc/passwd file, your httpd.conf, your .htaccess, your raw MySQL database, your app config files filled with user names, passwords, and database connection strings, and each PHP file to retrieve all your source code. Or worse, I perform remote file inclusion, thereby injecting a PHP shell, which allows me to completely take over your website! (BTW: Roughly one in every 3 PHP resource combiner scripts I have seen contains these security vulnerabilities. Beware where you get your source code!)
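For anyone auditing a script like this, here is a rough Python sketch of the kind of check a file-serving combiner would need before touching the disk. It is an illustration only, not a vetted security control: the function name, the `ALLOWED_EXTENSIONS` set, and the directory layout are all assumptions of mine. It resolves the requested name, refuses anything that lands outside the static directory, and refuses any extension that is not on a short allowlist.

```python
import os

# Assumption for this sketch: only these file types may ever be combined.
ALLOWED_EXTENSIONS = {".js", ".css"}


def resolve_static_file(base_dir, requested_name):
    """Return the real path of a requested file, or raise ValueError if it is not allowed."""
    base = os.path.realpath(base_dir)
    # realpath collapses "../" segments and follows symlinks before we compare.
    candidate = os.path.realpath(os.path.join(base, requested_name))
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError("path escapes the static directory: %r" % requested_name)
    if os.path.splitext(candidate)[1].lower() not in ALLOWED_EXTENSIONS:
        raise ValueError("file type is not allowed: %r" % requested_name)
    return candidate
```

With this check, a request for db.inc fails on the extension test and a request for ../../etc/passwd fails on the directory test. Generating the combined file ahead of time from a fixed list of files avoids accepting file names from the URL at all.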

The Fundamental Problem

The fundamental problem in all of these examples is that developers are getting lazy and are using PHP code to do something at runtime that should have been done earlier.

Properly Generating Static Content

Great! So what is a web developer to do? Go back to the dark ages where you cannot leverage all that great application logic in the generation of your content? I want my CSS variables and I want them now! Notice I never said you cannot dynamically generate static content! I just said you should not dynamically generate static content at runtime! Want CSS variables? Want to use a PHP script to combine resources or minify or whatever? Go ahead and do it! Just do it ahead of time. You can run your PHP script from the command line, produce your CSS file, complete with all the correct CDN paths and color values, and upload that to your website. And this isn’t just for PHP. Use Perl, Python, Ruby, Java, or whatever. You can even do it in QBASIC!

'CSSGEN.BAS - kicking it old school
CDN$ = "http://zoompf.com/"
LOGO$ = "includes/logo.png"
PRINT ".logo {"
PRINT " background: url("; CDN$ + LOGO$ + ");"
PRINT "}"

And the output:

.logo {
 background: url(http://zoompf.com/includes/logo.png);
}

(That’s right. I totally just used QBasic 1.1 from DOS 5.0 to automate publishing a web application on 64bit Vista. Oh yeah!)
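For a slightly less retro take, the same publish-time idea can be sketched in Python with the standard library's string.Template. The variable names, the template, and the function names here are my own illustration, reusing the CDN and logo values from the QBasic example. You would run something like this from a build or deploy step, never from a request handler.

```python
from string import Template

# Values mirroring the QBasic example; treat them as placeholders.
VARIABLES = {"cdn": "http://zoompf.com/", "logo": "includes/logo.png"}

CSS_TEMPLATE = Template(""".logo {
  background: url(${cdn}${logo});
}
""")


def build_css(variables=VARIABLES):
    """Fill in the template and return the finished, fully static CSS text."""
    return CSS_TEMPLATE.substitute(variables)


def write_css(path, variables=VARIABLES):
    """Write the generated CSS to disk so the web server can serve it as a flat file."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_css(variables))
```

Calling `write_css("site.css")` once at publish time leaves an ordinary static file for the web server to serve.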

The moral of the story is never make the user pay for your laziness. Do not use the application tier of a website to dynamically generate static content at runtime. Instead do it at publishing time or even do it in a daily or hourly cron job. This approach allows you all the advantages of using application logic without drastically reducing the very web performance you were trying to improve in the first place!

Browser Performance Issues with Charsets

Posted: December 3, 2009 at 2:35 am

Failing to define a character set, or defining it in the wrong place, can cause poor performance for your website’s visitors. In this post we will discuss character sets and how best to define them to avoid web performance problems.

At their core, HTML documents are just a series of bytes. The character set (or charset) for an HTML document tells your web browser how it should process those bytes to construct characters. The browser then interprets those characters to render the web page. The 2 most common ways to tell the web browser what charset to use for an HTML page are by specifying it in the HTTP Content-Type header or by using a <META> tag to emulate an HTTP Content-Type header. When the web content author is the same person as the web server administrator it is possible to directly configure the web server to use the appropriate charset for the appropriate URLs. In this world of virtual hosts, Content Management Systems, and blogs this is rarely the case anymore. As such more and more web developers are using <META> tags to define the charset for HTML documents.

This leads to a Chicken-and-the-Egg problem. The HTML document contains text which tells the browser how to read the document. Hmmm. So how does the browser read the document without a charset? While it varies with browser and version, most assume a Latin alphabet charset like US-ASCII or ISO-8859-1 (Latin-1). The browser then reads the HTML document using this charset, scanning for charset information. At this point one of three things happens:

  1. There is no <META> tag with charset information.
  2. There is a <META> tag with a charset and it’s what the browser guessed.
  3. There is a <META> tag with a charset, but it’s a different charset than the browser guessed.

If there is no charset information the browser is in an odd position. At this point most browsers attempt some type of charset detection. With several years of web security experience, believe me when I tell you that in theory this is an awesome idea but in practice this is a horrible idea. Web browsers or servers trying to “fix” broken data is the root of a number of nasty web security vulnerabilities (such as UTF-7 XSS attacks and various other injection evasions). Regardless, no charset information of any kind forces the browser to do more processing, which can produce a very small performance hit at best and a hacked website at worst.

If there is a <META> tag whose charset is the same as what the browser guessed there is no issue. Nothing else needs to occur.

If there is a <META> tag and it specifies a charset different than the assumed charset the browser has a problem. It has already interpreted some amount of the HTML document but it was the wrong charset. That information is all bad. The document needs to be reprocessed using the correct charset. So right now at best you are talking about a small performance penalty as the browser has to reparse the beginning of the HTML document.

But it can get worse! This is because browsers don’t scan the entire HTML document looking for a charset. They want to start rendering content! If they don’t see a charset defined “near the top” of the HTML document they start rendering content and executing JavaScript using the assumed charset. (“Near the top” varies from browser to browser, which we will discuss in a minute.) But once the browser gets going interpreting and executing content and then finds a <META> tag with charset information, it’s in a real bind. Because now it has already been executing code, requesting other resources, and rendering content using the wrong charset! Those URLs could be wrong, that JavaScript could have syntax errors, or the CSS rules could be misspelled, all because the browser read them using the wrong charset information.
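To make the wrong-charset problem concrete, here is a small Python sketch showing one byte turning into two different characters depending on the charset used to read it. The byte value and the two charsets (windows-1252 and windows-1251, which Python calls cp1252 and cp1251) are picked purely for illustration; they are not necessarily the ones any particular browser or test page uses.

```python
import unicodedata


def describe(raw_byte, charset):
    """Decode a single byte with the given charset and return the character's Unicode name."""
    return unicodedata.name(raw_byte.decode(charset))


raw = b"\x80"  # one byte taken from an imaginary HTML document
for charset in ("cp1252", "cp1251"):
    print(charset, "reads it as", describe(raw, charset))
```

The bytes on the wire never change. Only the interpretation does, which is why a URL or a string literal parsed under the wrong charset can come out as something the author never wrote.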

“Near the top” for Firefox 3.5 means within the first 2048 bytes. If Firefox does not detect charset information in the first 2048 bytes of an HTML document (and no charset was defined in the HTTP headers) it starts rendering the page and executing script using an assumed charset (I did not investigate other browsers). Consider this example web page adapted from a Simon Pieters test case. It contains some JavaScript, whitespace, and starting just after 2048 bytes, a <META> tag defining the charset. In Firefox the JavaScript runs and pops an alert box showing a Euro sign. After 2048 bytes there is a <META> tag changing the charset from the assumed Latin-1. Firefox has to reprocess and re-render the page, which executes the JavaScript again, with a Cyrillic character appearing in the alert box this time.

It is also interesting to see what the browser does if it has already made a request. If Firefox has already requested a URL and then detects a new charset, the URL must be re-requested. Consider this example page. Here JavaScript makes a request to a nonexistent image from www.google.com (we include the alert box to create a delay in this simple test case to ensure Firefox has already started fetching the resource). The URL contains a character that changes based on the charset, so it must be re-requested. Using an HTTP proxy we see the browser made 2 requests to 2 different URLs (with URL encoding to encode the characters being sent).


Note: it appears that Firefox does not try to re-request a URL if the change in the charset did not change the meaning of the URL. If you modify the 2nd example to request “abc.gif” it does not appear that Firefox fetches this twice. More testing is needed here.

So there you have it. Browsers take a performance hit of varying severity when you fail to specify the charset near the very top of your HTML document. Always make sure to include some type of character set information so the browser does not waste time auto-detecting one. This can slightly help performance and avoid security vulnerabilities. If you are using <META> tags to specify the character set information of your web pages, make sure to place them as high in the <HEAD> of your HTML document as possible. The W3C standard specifically mentions this problem and solution. For Firefox, you only need 2048 bytes before the <META> charset tag to cause this problem. A <SCRIPT> tag, a <STYLE> tag, an HTML comment, or even a <META> description tag or long <META> keywords tag can easily consume 2048 bytes. While other browsers may be more tolerant and allow a larger window, they would still take the performance hit of having to reparse the byte stream. For these reasons Zoompf recommends you place the <META> charset tag as the first element inside of the <HEAD> of your HTML document to avoid any performance problems.
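
If you want to check your own pages for this, a rough Python sketch follows. The function name `charset_declared_early` is hypothetical, the regular expression is a loose approximation of what a browser's prescan does, and the 2048-byte window is simply the Firefox 3.5 figure from above.

```python
import re

# 2048 bytes is the Firefox 3.5 figure discussed above; other browsers may differ.
PRESCAN_WINDOW = 2048

# Loose pattern covering <meta charset="..."> as well as the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=[^>]*>", re.IGNORECASE)

def charset_declared_early(html: bytes, window: int = PRESCAN_WINDOW) -> bool:
    """Return True if a <META> charset tag ends within the first `window` bytes."""
    match = META_CHARSET.search(html)
    return match is not None and match.end() <= window

good_page = b'<html><head><meta charset="utf-8"><title>ok</title></head></html>'
# A long comment pushes the <META> tag past the window.
bad_page = (b"<html><head><!--" + b" " * 2048 + b"-->"
            b'<meta charset="utf-8"></head></html>')

print(charset_declared_early(good_page))  # True
print(charset_declared_early(bad_page))   # False
```

A check like this only looks at the HTML bytes. A charset sent in the HTTP Content-Type header would make the <META> tag position irrelevant.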

Want to see what performance problems you have? An appropriately placed <META> charset tag is just one of the 200+ performance issues Zoompf detects while assessing your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!

Expanding Rezipping

Posted: December 1, 2009 at 12:33 am

This post is a follow-up to the previous post “Rezipping Web Resources for Fun and Profit.” In that article, we showed that many common web files, such as MS Office documents, Silverlight applications, Java Applets, and more, are really just Zip files with a special structure of files inside. By rezipping a file (unzipping the contents and rezipping those contents using a higher compression level) web developers can reduce the size of those files by 5-30%!

An obvious, but less useful expansion of rezipping is to extend it to other compression types, namely GZip compressed files or BZip2 compressed files. We can use 7-zip’s command line version 7za to accomplish this. It looks something like this:

    //gunzip the file into temporary directory
    7za X -tgzip original.gz -o"c:\tmp\"
    //regzip using maximum compression
    7za A -tgzip -mx9 new.gz "c:\tmp\original"

This approach can be extended to BZip2 using the “-tbzip2” switch. I collected a few samples of GZip archives and, using rezipping, was able to reduce their size by an average of 5.03%, as shown in the table below.

| Archive | Original Size (bytes) | Rezipped Size (bytes) | % Savings |
|---|---|---|---|
| bochs-2.4.2.tar.gz | 4,035,010 | 3,879,123 | 3.863% |
| dojo-release-1.3.2.tar.gz | 2,618,493 | 2,471,078 | 5.630% |
| expsummarytalk.ps.gz | 130,247 | 121,528 | 6.694% |
| httpd-2.2.14.tar.gz | 6,684,081 | 6,420,948 | 3.937% |
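
The same decompress-and-recompress idea can be sketched in a few lines of Python. This is only an illustration: the function name `regzip` is hypothetical, and Python's `gzip` module uses zlib rather than 7-Zip's DEFLATE implementation, so the savings will not match the table above.

```python
import gzip

def regzip(data: bytes, level: int = 9) -> bytes:
    """Decompress a GZip archive in memory and recompress it at the given level."""
    original = gzip.decompress(data)
    # mtime=0 keeps the output reproducible; the original timestamp is dropped.
    return gzip.compress(original, compresslevel=level, mtime=0)

# Made-up, log-like sample data compressed quickly at level 1.
payload = "".join(
    f"line {i}: GET /item{i * 7919 % 1013}/color/red/ 200\n" for i in range(5000)
).encode("ascii")
fast = gzip.compress(payload, compresslevel=1)

smaller = regzip(fast)
print(f"{len(fast):,} bytes -> {len(smaller):,} bytes")
```

The contents are untouched. Decompressing either archive yields the same bytes.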

Using rezipping on GZip or BZip2 archives is unfortunately less beneficial than on Zip files. This is because so many files that are served or downloaded on the web use Zip files as a wrapper. Finding ways to optimize Zip files lets you optimize a dozen other file types on the web. These files are either directly loaded and executed by the browser (like Silverlight or Applets) or are very common downloadable content like documents or presentations. However, I know of no web content that uses a GZip file or BZip2 file as a wrapper file. While downloadable programs, source code, or other archives might use GZip or BZip2, you will not find any widely deployed document or content format that uses these as the wrapper file. This limits the usefulness of rezipping GZip or BZip2 archives.

As mentioned in the last post, one positive note is that while no widely deployed web files use GZip as a wrapper, many files contain raw GZip or DEFLATE streams. Flash files use GZip to compress the contents of the SWF tags. PDFs use DEFLATE to compress text streams. This means that, with a little parsing and some glue code, proven tools like 7-zip should be usable to reduce the size of other files that are very common on the web today!

Rezipping Web Resources for Fun and Profit

Posted: November 30, 2009 at 4:48 pm

One large area of web performance optimization is reducing the size of your content. Most people know about obvious techniques like HTTP compression, minifying, or removing extra data from images. However there is one size-reduction technique that does not seem to be common knowledge for most web performance junkies: Rezipping.


Let us start with a little background. Zip archives consist of multiple compressed files that are packaged together into a single file. Zip archives are compressed using the DEFLATE compression algorithm. DEFLATE supports different compression levels from 1 to 9. These compression levels provide a trade-off between the CPU and memory resources used to create the Zip file and the size of the resulting Zip file. Using a higher compression level consumes more resources, but you end up with a smaller file. Most Zip programs tend to create Zip archives using a compression level of 5 or 7. While this can be a good trade-off, as the file is created quickly and is reasonably compressed, it will not produce the smallest file possible.
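
To get a feel for the trade-off, here is a small Python sketch that compresses the same data at three DEFLATE levels using the standard `zlib` module. The sample markup is made up for illustration, and the exact sizes and timings will vary from machine to machine.

```python
import time
import zlib

# Made-up HTML-like sample data; real files will compress differently.
sample = "".join(
    f"<li class='item'>Product {i} costs ${i * 37 % 500}.99</li>\n"
    for i in range(20000)
).encode("utf-8")

sizes = {}
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(sample, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    sizes[level] = len(compressed)
    print(f"level {level}: {len(compressed):,} bytes in {elapsed_ms:.1f} ms")
```

On typical text-heavy input the higher levels take longer and produce smaller output, which is exactly the trade-off described above.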

Now all that is well and good. But why should frontend web developers care about Zip file optimization? Simple: Many of the most common files on the Internet are actually Zip files. By creating methods to make smaller Zip files we are actually optimizing multiple different types of web files. Optimizing these files will reduce bandwidth consumption and server load while improving page load times.

These “Files that don’t end in .zip but really are Zip Files” use the Zip file format as a kind of wrapper to collect all the bits and pieces that really make up the file and store them in a single compressed unit. For example, Silverlight applications have a XAP file extension. However, a Silverlight application is just a Zip file containing compiled byte code, resources like images and sounds, and other configuration. Java Applets contained in JAR files are Zip files. All of Microsoft Office’s OOXML documents (DOCX, XLSX, PPTX, etc.) are Zip files. All of OpenOffice.org’s ODF documents (ODT, ODP, ODS, etc.) are Zip files. You can rename any of these types of files to “.zip” and open them with any Zip program.
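
You can also check this programmatically instead of renaming files. The sketch below uses Python's standard `zipfile` module; the helper name `looks_like_zip` and the stand-in "document" built in memory are assumptions for illustration, not real Office output.

```python
import io
import zipfile

def looks_like_zip(data: bytes) -> bool:
    """Return True if the bytes form a Zip archive, whatever the file extension says."""
    return zipfile.is_zipfile(io.BytesIO(data))

# Build a stand-in for a .docx or .xap: a Zip archive with a couple of members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("[Content_Types].xml", "<Types/>")
    archive.writestr("word/document.xml", "<document>Hello</document>")
fake_docx = buffer.getvalue()

not_a_zip = b"<html>just a web page, not a Zip archive</html>"

print(looks_like_zip(fake_docx))  # True
print(looks_like_zip(not_a_zip))  # False
```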

Since all of these common web files are simply Zip files, we can optimize them to improve web performance and operational costs. This is where rezipping comes in. Rezipping is the process of recompressing a Zip file to create a smaller file. The process is simple: you take any Zip file, unzip the contents, and then rezip the contents at a higher compression level. To accomplish this, I am using the command line version of 7zip. 7zip’s implementation of the DEFLATE compressor is generally considered to compress files better than other Zip programs by 5% to 10%. The process looks like this:

    //unzip the contents of the original zip into a temporary directory
    7za.exe X original.zip -o"c:\tmp\"
    //rezip using maximum compression
    7za.exe A -mx9 new.zip "c:\tmp\*"

To see how much this could help web performance, I downloaded several samples of different types of Zip files from the Internet.

Silverlight

| Name | Original Size (bytes) | Rezipped Size (bytes) | % Improvement |
|---|---|---|---|
| cached – SilverlightApplication1.xap | 3,972 | 3,899 | 1.84% |
| Everything-SilverlightApplication1.xap | 825,801 | 782,594 | 5.23% |
| Examples.CS.xap | 4,752,262 | 3,376,411 | 28.95% |
| GeoReference.xap | 388,898 | 288,977 | 25.69% |
| HoldemSimulatorUI.xap | 1,280,714 | 1,243,967 | 2.87% |
| ImageGallery_v25_9458063489vC.xap | 18,226 | 17,538 | 3.77% |
| SilverlightControl.xap | 678,995 | 557,791 | 17.85% |

On average, rezipping reduces a Silverlight application by 12.32%. This is quite good given that XAP files can contain many binary files like images or sounds that will not be recompressed. Some files created from Visual Studio saw an improvement of more than 25%! Also notice that “ImageGallery_v25” is the Silverlight application used by Bing to change Bing’s background image. This heavily served file could be slimmed by nearly 4% simply by rezipping the XAP file!

Microsoft Excel Documents

| Name | Original Size (bytes) | Rezipped Size (bytes) | % Improvement |
|---|---|---|---|
| Listedescourselearning.xlsx | 55,618 | 40,753 | 26.73% |
| ParticipatingMembers.xlsx | 170,382 | 123,275 | 27.65% |
| PartnerReadinessAndTrainingFY09.xlsx | 26,673 | 21,349 | 19.96% |
| PermissionTemplate.xlsx | 22,570 | 15,969 | 29.25% |
| Presentation_Skills_Providers.xlsx | 33,092 | 27,144 | 17.97% |

On average, rezipping Excel files saves about 25%. This makes sense, as most Excel spreadsheets contain predominantly text and not incompressible binary data.

Microsoft PowerPoint Documents

| Name | Original Size (bytes) | Rezipped Size (bytes) | % Improvement |
|---|---|---|---|
| AMP 8.0 Project Kickoff Template v1.2 07102009.pptx | 112,637 | 96,753 | 14.10% |
| CL01.pptx | 1,918,440 | 1,692,785 | 11.76% |
| CL02.pptx | 5,872,228 | 5,448,818 | 7.21% |
| EC2.pptx | 123,137 | 100,013 | 18.78% |
| MSDN_Admin_08.pptx | 2,006,091 | 1,862,496 | 7.16% |
| SharePoint_Buzz.pptx | 2,123,778 | 2,040,234 | 3.93% |
| speedgeeks-20091026.pptx | 3,408,365 | 3,271,384 | 4.02% |
| SupportingDistributedTeamwork.pptx | 2,454,360 | 2,387,257 | 2.73% |

On average rezipping PowerPoint files saves about 9%. This can vary widely depending on the number of images that are contained inside the PPTX file as images are not recompressed (more on that in another article).

Microsoft Word Documents

| Name | Original Size (bytes) | Rezipped Size (bytes) | % Improvement |
|---|---|---|---|
| ASC_3.0_Demo_Image_Release_Notes.docx | 431,220 | 412,034 | 4.45% |
| implementationchecklist.docx | 126,981 | 120,075 | 5.44% |
| MSCOM_Virtualizes_MSDN_TechNet_on_Hyper-V.docx | 115,230 | 89,572 | 22.27% |
| CompProposal.docx | 25,548 | 21,395 | 16.26% |
| Web content redline 2009-10-28.docx | 201,304 | 180,868 | 10.15% |
| WindowsSharePointServicesDatasheet.docx | 198,837 | 172,082 | 13.46% |

On average rezipping Word documents saves about 12%.

Conclusions

Always use rezipping! Stop sending bytes down the pipe you don’t have to! The savings you receive from rezipping are driven by the contents of the Zip file. Files with a large number of binary objects that will not be compressed further (like images) will see a smaller improvement. Also note that higher compression levels increase the time and memory needed to compress data, but they do not increase the time it takes to decompress data. This is because all the work is in finding out what can be reduced during compression, not in recreating the original data during decompression. There is no reason not to use rezipping.
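
For readers who would rather script the unzip-and-rezip steps than run 7-Zip by hand, here is a minimal Python sketch. The function name `rezip` is hypothetical, and Python's `zipfile` module relies on zlib, not 7-Zip's DEFLATE implementation, so expect smaller savings than the tables above. It also does not handle formats that require particular members to be stored uncompressed.

```python
import io
import zipfile

def rezip(data: bytes, level: int = 9) -> bytes:
    """Rewrite every member of a Zip archive using DEFLATE at the given level."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(data)) as src, \
         zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED, compresslevel=level) as dst:
        for info in src.infolist():
            # Reusing the ZipInfo keeps each member's name and timestamp.
            dst.writestr(info, src.read(info), zipfile.ZIP_DEFLATED, level)
    return out.getvalue()

# Stand-in for a document created by a tool that favors speed (level 1).
xml = "".join(
    f"<row id='{i}'><cell>{i * 7919 % 1013}</cell></row>\n" for i in range(5000)
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED, compresslevel=1) as archive:
    archive.writestr("word/document.xml", xml)
original = buffer.getvalue()

smaller = rezip(original)
print(f"{len(original):,} bytes -> {len(smaller):,} bytes")
```

Because the member names and contents are unchanged, the rezipped file opens in the same applications as the original.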

By rezipping your files you can reduce the size of your content. This reduces bandwidth consumption and server load while improving page load times! There is more work to be done. There are a number of web files that contain raw DEFLATE streams, like Flash files, WOFF font files, SVGZ, and more. All of these could be redeflated using a compression level of 9 to make smaller, faster files. Stay tuned as we investigate this more.

Performance Questions to Ask Hosting Providers: Log File Access

Posted: November 25, 2009 at 2:39 pm

(This is the second article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article in the series is here)

Last time we covered the most important question you should ask a hosting provider: What control do I have over the web server? This time we will be showcasing another important question to ask a hosting provider:

“What Access Do You Provide to Web Server Logs?”


The main reason you want access to log files from the web server is to learn how visitors are accessing your content. This will reveal a wealth of knowledge about the raw traffic patterns of your web application and expose various performance issues and limitations. Often these performance issues will not be detected by page-based performance tools like Yahoo’s YSlow or Google’s Page Speed.

Web server logs come in many different formats. Usually they are large text files where every request is logged on its own line. Several pieces of data about each request are logged in different fields on the line, separated by spaces or commas depending on the format. The information typically logged for each request is:

  • URL requested.
  • IP address of the visitor.
  • Date and time request was received.
  • The program or browser used to request the URL. This is called the User-Agent.
  • The Referring webpage (if any).
  • HTTP version used to request the page.
  • Status code of the response.
  • Size of the body of the response.
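
As a rough illustration, the fields above can be pulled out of a single log line with a regular expression. This sketch assumes the Apache "combined" log format; IIS, NCSA, and custom formats will need a different pattern, and the sample line is made up.

```python
import re

# Assumption: Apache "combined" log format.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line: str):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # A "-" size means no body was sent (common for 304 responses).
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields

sample = ('203.0.113.7 - - [25/Nov/2009:14:39:00 -0500] "GET /robots.txt HTTP/1.1" '
          '200 2326 "-" "ExampleBot/1.0"')
entry = parse_line(sample)
print(entry["url"], entry["status"], entry["size"])  # /robots.txt 200 2326
```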

Log files are a very granular view of your web traffic. Sometimes it can be difficult to see the forest for the trees. For example, what pages did user XYZ visit, in what order, and how long did the user stay on each page? It is usually very difficult to get this information from logs alone because web server logs only track users by a specific IP address. To provide a larger view and answer questions like those listed above, web developers use web analytics packages like Omniture, Hitbox, or Google Analytics. Web analytics packages use cookies and JavaScript to gather detailed information about your visitors, the capabilities of their browsers, and their actions through your web application. Web analytics packages are simple to add to a website. Typically all that is involved is inserting a block of JavaScript at the end of each HTML page. This is very easy to do on templated or dynamically generated websites. So if web analytics provides you with a “bigger picture” and richer data than web server logs, that begs the question:

Are Web Analytics Reports Good Enough?

Actually, no. Web analytics reports are not good enough. Web analytics data abstracts away the raw traffic of your web application, which can hide several important problems. Web analytics packages only track visitor requests and activity for HTML files that are served with a 200 status code. Out of the box, here are things that most web analytics packages do not track:

  • All requests to non-HTML resources.
    • Images
    • JavaScript
    • Style sheets
    • Feeds (RSS, Atom)
    • RIA files (HTC, Flash, Silverlight, Java, etc)
    • Access files (robots.txt, sitemap.xml, crossdomain.xml, etc)
    • Documents (PDF, Office docs, Zip files)
    • Other resources (Fonts, Cursors)
  • Most error pages (404, 5xx, etc).
  • Conditional requests that return “304 Not Modified.”
  • Requests from non browser User-Agents (spiders, mash-ups, etc).
  • Users who have JavaScript disabled for accessibility or (more commonly) security reasons.

This valuable information is completely missed if you are only using web analytics data to understand your traffic. Consider the valuable questions you can answer with web server logs:

  • Where are your redirects? Which can be removed to decrease page load time?
  • What web resources are using the most bandwidth? This is calculated by simply adding up all the body sizes that are returned for a resource and sorting. Can you reduce the size of these files somehow using compression, minification, or by removing meta data?
  • What are the most requested resources on your website? Can you use caching or other methods to minimize the number of times the resource is requested? If you cannot cache those files because they are dynamically generated, can you add programming logic to use a Last-Modified header to reduce bandwidth? Can you remove resources like external JavaScript or CSS files that are referenced but not actually used by that web page?
  • How often does your static content actually change? This is calculated by counting the 304s for a resource. Perhaps you can use a longer Expires time on your content.
  • How often are search engine crawlers visiting your site? Are there crawlers that are missing? You should submit your website to their indexes.
  • Are crawlers finding your high value content? Perhaps you should be using a sitemap or modifying your existing one.
  • Do crawlers request a large amount of low value content? Do crawlers “get stuck” on part of your website? Perhaps you need to fix your robots.txt.
  • Which web resources are you still getting a lot of requests for that no longer exist? You should use a redirect that points to the correct content.
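
Several of the questions above reduce to simple counting once the log is parsed. The Python sketch below works on a handful of hypothetical, already-parsed records (URL, status code, body size in bytes) to show the idea. A real log would supply hundreds of thousands of these.

```python
from collections import Counter, defaultdict

# Hypothetical pre-parsed records: (url, status, body_size_in_bytes).
records = [
    ("/logo.jpg", 200, 48_000),
    ("/logo.jpg", 304, 0),
    ("/logo.jpg", 304, 0),
    ("/robots.txt", 200, 250_000),
    ("/robots.txt", 200, 250_000),
    ("/old-page.html", 404, 512),
]

bandwidth = defaultdict(int)  # total body bytes served per URL
hits = Counter()              # total requests per URL
not_modified = Counter()      # 304 responses per URL
missing = Counter()           # 404 responses per URL

for url, status, size in records:
    bandwidth[url] += size
    hits[url] += 1
    if status == 304:
        not_modified[url] += 1
    elif status == 404:
        missing[url] += 1

# Sort URLs by total bytes served, largest first.
top_bandwidth = sorted(bandwidth.items(), key=lambda item: item[1], reverse=True)
print(top_bandwidth[0])      # ('/robots.txt', 500000)
print(hits.most_common(1))   # [('/logo.jpg', 3)]
```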

Real Life Examples

At Zoompf we have detected and solved numerous performance issues just by examining a client’s web server log files. Here are some of our more interesting stories:

  • A social networking client used their robots.txt file to prevent crawlers from indexing content from new, untrusted members that could be content spammers. As such their robots.txt was 250 kilobytes! Because the file was updated so often, all of the search crawlers would request it multiple times a day. These factors resulted in the client using 5 gigabytes of bandwidth a month just to serve its robots.txt file! Turning on HTTP compression for text files reduced this by 70%. The client is currently implementing the use of <META> tags and rel="nofollow" attributes to limit search engine indexing for the web pages of untrusted users. This will result in even higher performance savings.
  • A software client found that by far their most popular pages were for their product documentation. These pages were constructed using a PHP templating system. However these files never changed once that version of the product shipped. The client moved to pre-rendering the web pages to static HTML files and using a far future Expires header on the HTML files. This drastically improved performance and reduced bandwidth consumption and server load.
  • An ecommerce client discovered crawlers were requesting all the available colors for each item they sold. For example, the crawlers would visit /item1/, /item1/color/red/, /item1/color/blue/, and so on. By creating a robots.txt rule to prevent crawlers from requesting every color for every item, the client reduced their bandwidth by nearly 80% while still having their important content indexed.
  • A client discovered that their most important content was not getting indexed. A developer had copied a code snippet from the Internet into the top of their template to solve a CSS problem they were having. Unfortunately this code snippet also included a <META> tag telling search engines not to index the web page.
  • A client learned that its logo had not changed for over 4 years and that it was also an uncompressed BMP file, even though the logo had a JPEG extension. They changed the logo to a proper image format for the web and increased the logo’s Expires time.
  • A client discovered that no one had ever requested their iPhone application image. They removed the <LINK> tag and reduced the size of all of their HTML pages.
  • A client discovered their favicon.ico file was consuming huge amounts of bandwidth. This is because it contained multiple versions of the same icon at different dimensions. Removing all but the 16 by 16 pixel version from the ICO file reduced file size by 97%.
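
The mislabeled logo in the list above is the kind of problem a few lines of code can catch. This Python sketch compares a file's extension against the well-known leading "magic" bytes of common web image formats. The function names are hypothetical, the signature list is deliberately short, and the sample bytes are stand-ins, not real images.

```python
# Leading bytes that identify common image formats.
SIGNATURES = [
    (b"\xff\xd8\xff", "jpeg"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
    (b"BM", "bmp"),
    (b"\x00\x00\x01\x00", "ico"),
]

def sniff_image_type(data: bytes):
    """Return the format name suggested by the leading bytes, or None if unknown."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return None

def extension_matches(filename: str, data: bytes) -> bool:
    """Return True if the file extension agrees with the sniffed image type."""
    ext = filename.rsplit(".", 1)[-1].lower()
    ext = {"jpg": "jpeg"}.get(ext, ext)
    return sniff_image_type(data) == ext

fake_bmp = b"BM" + bytes(52)                  # BMP header stand-in
fake_jpeg = b"\xff\xd8\xff\xe0" + bytes(16)   # JPEG header stand-in

print(extension_matches("logo.jpg", fake_bmp))    # False
print(extension_matches("photo.jpg", fake_jpeg))  # True
```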

Typical Log Access

If you have your own server, access to the web server logs is usually unrestricted. However, in most shared hosting environments you will not have direct access. The typical web server log options you have are:

  • The raw Apache, IIS, or NCSA log files in a directory outside of your web root that you can access using FTP or sFTP. This is the ideal case.
  • The raw Apache, IIS, or NCSA log files placed directly in your web root. While this provides you with the raw logs anyone on the internet can also access your log files. This is a security risk as log files can often contain sensitive data like credentials or “hidden” areas of your web application. Talk with your hosting provider about moving the location of the web logs.
  • An option through a web-based website administration system like cPANEL that lets you download the raw log file.
  • An option or interface in the web admin system that lets you view or download a specially formatted version of the logs.

If you cannot access the raw log files, don’t panic. As long as the log file contains the following information you will have all the data you need:

  • URL requested
  • Date and time request was received
  • The program or browser used to request the URL. This is called the User-Agent.
  • Status code of the response
  • Size of the body of the response

Another question to ask hosting providers is not only “what information is in the log file” but also “how much time does the log file cover?” You can imagine how, in large shared hosting environments, log files can quickly grow to hundreds of megabytes for potentially thousands of customers. Hosting providers often limit the log file in different ways, including:

  • Record only a week of traffic and replace the log with a new empty file every week.
  • Limit the total size of the log file. Each new entry removes an entry from the start of the log.
  • Provide a nightly copy of the log file for all the traffic the site received that day. These copies are usually removed after a certain amount of time.

If you do have a time window, make sure to grab a copy of the log file before it expires. Some interfaces like cPANEL offer a scheduling service that can email you the log files or place them in a special location that you can then download from. You can schedule an FTP download or use wget or curl to download these log files.

Processing Log Files

Depending on how much log data you have, you might want to concatenate your log files together until you have a big enough sample. At Zoompf we suggest collecting a sample of between 500,000 and 1,000,000 requests, or a week’s worth of web traffic, whichever is larger. Programs like awstats are very helpful for processing and provide reports with your most popular and least popular files, largest files in terms of bandwidth, and other data already broken out. Directly processing the logs yourself allows you to discover more detailed data and is not as hard as you would think. Some basic regular expressions can make it very easy to gather metrics like “show all of the 304s, 404s, 500s, etc.”
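
As an example of how little code this takes, the following Python sketch counts status codes across a few made-up log lines with a single regular expression. It assumes the status code directly follows the quoted request line, as it does in Apache common and combined style logs. Other formats would need a different pattern.

```python
import re
from collections import Counter

# Assumption: the status code follows the closing quote of the request line.
STATUS = re.compile(r'" (\d{3}) ')

# Made-up sample lines; in practice these would be read from the log file.
log_lines = [
    '203.0.113.7 - - [25/Nov/2009:14:39:00 -0500] "GET / HTTP/1.1" 200 5120',
    '203.0.113.7 - - [25/Nov/2009:14:39:01 -0500] "GET /logo.jpg HTTP/1.1" 304 -',
    '198.51.100.2 - - [25/Nov/2009:14:40:10 -0500] "GET /old.html HTTP/1.1" 404 512',
    '198.51.100.2 - - [25/Nov/2009:14:41:30 -0500] "GET /logo.jpg HTTP/1.1" 304 -',
]

status_counts = Counter()
for line in log_lines:
    match = STATUS.search(line)
    if match:
        status_counts[match.group(1)] += 1

for status, count in sorted(status_counts.items()):
    print(status, count)
```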

Remember, examining your web logs is a key technique for discovering and solving performance problems with your web applications. Those pretty graphs from Google Analytics or other web analytics packages are simply not good enough to detect performance issues and bottlenecks. You need access to the information about all the requests the web server is processing. Make sure you ask your hosting provider how you can access the raw web server log files. Find out how much web traffic data the logs contain and how you can easily collect this data so you can analyze it. If your hosting provider does not provide this, you should consider that a deal breaker and find another provider.