Zoompf's Web Performance Blog

HTTP Compression use by Alexa Top 1000

 Billy Hoffman on May 2, 2012. Category: Optimization

Yesterday, frontend madman and performance nut Paul Irish reached out to me asking if I had any stats on the use of HTTP compression.

I've written a bunch about the benefits of HTTP compression, as well as the challenges in implementing it. Surprisingly, I realized that, no, I did not have any figures about HTTP compression usage by major sites. The most recent stats I had were from the talk I gave during Velocity 2011's Ignite sessions. The more I thought about it, the more I saw there were lots of interesting stats to gather beyond a raw "X number of sites use compression" figure. I decided to survey the top 1,000 Alexa websites and get a deeper understanding of how they are using HTTP compression. The results reveal how large websites manage their content and apply the most basic web performance optimization.

Methodology

How could I get data about how HTTP compression is used by the top websites? I could certainly pull from Zoompf's database of free scans or our paid customers, but this is not a great sample. It contains lots of the same sites scanned over and over again, and all the scans are made by people who know about performance and are actively trying to improve it. It didn't really matter anyway: sadly, not all of the Alexa Top 1,000 websites are Zoompf customers (yet!), so I wouldn't even have the data.

My next thought was to use the awesome HTTP Archive. After all, they've done the hard work of fetching all the content and provide HAR files for each site. Sadly, the HTTP Archive can't help me here for a few reasons. The first is that HAR, while a helpful format, has shortcomings. The biggest is that the response bytes are not included in the HAR file. This is a critical requirement: I wanted to determine what types of content are or are not getting compressed, and since the MIME type in the Content-Type header is not reliable at all, I would need the response bytes to determine the content type.

Another reason for needing the full responses, including headers and body bytes, is the type of analysis I needed to do. Understanding how top websites use HTTP compression is much more involved than simply checking whether, say, an HTML file is lacking a Content-Encoding header. For example, HTTP compression can make certain responses larger. To determine if a website is using compression properly, I needed to determine whether responses that were not served using HTTP compression would truly be smaller if they were compressed. This means that I needed the response bytes to compress and see the result.

In the end, there was no getting around the fact that I needed to download the actual content from Alexa's top websites. So how did I do it?

I started by downloading Alexa’s Top 1,000,000 sites list. This list is downloadable as a CSV file, so I needed to strip out just the hostnames, and I only wanted the first 1,000. This was accomplished with the following Linux/Unix/Cygwin commands piped together:

$ cat top-1m.csv | cut -d"," -f2 | head -n 1000 > top-1000-sites.txt

This gave me a list of just the host names of the top 1,000 sites on Alexa's list. Next, I used an internal tool we have at Zoompf for running our performance scanner against a list of hosts. For each host, the scanner visits the home page and downloads all dependent resources like CSS, JavaScript, images, fonts, Flash and more. This simulates exactly what a visitor's web browser would do when accessing the main page for the site. I ran the bulk scanner on the list of the Alexa top 1,000 sites to download all this content. This took about an hour and a half from my dev box. Then I analyzed the data. Basically, this involved opening the data for each scan and examining each response. In total, I examined 90,517 responses served from 4,597 distinct hostnames.
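
To give a feel for what that crawl looks like, here is a minimal sketch. This is not our actual scanner; the regex-based resource extraction and function name below are my own simplifications. It just fetches each home page, pulls out dependent resource URLs, and records every response for later analysis.

# Rough approximation of the bulk scan: fetch each home page plus its
# dependent resources and note whether each response was compressed.
# (Hypothetical sketch; not the actual Zoompf scanner.)
import re
import requests
from urllib.parse import urljoin

# Naive extraction of src/href references from the home page HTML
RESOURCE_ATTR = re.compile(r'(?:src|href)\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)

def fetch_site(hostname):
    """Fetch the home page and its dependent resources, returning all responses."""
    responses = []
    page = requests.get("http://" + hostname, timeout=30)
    responses.append(page)
    for ref in set(RESOURCE_ATTR.findall(page.text)):
        try:
            responses.append(requests.get(urljoin(page.url, ref), timeout=30))
        except requests.RequestException:
            pass  # skip resources that fail to load
    return responses

if __name__ == "__main__":
    with open("top-1000-sites.txt") as f:
        for host in (line.strip() for line in f if line.strip()):
            for resp in fetch_site(host):
                # Content-Encoding tells us whether the response was compressed on the wire
                print(resp.url, resp.status_code,
                      resp.headers.get("Content-Encoding", "-"))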

First I needed to determine which responses were compressible. This means I was looking for responses whose format is not natively compressed, and so could be compressed with HTTP compression. As I discussed in the Lose the Wait: HTTP Compression post, this includes more than just text responses like HTML or CSS. Luckily we have built a pretty large database of common web file formats internally at Zoompf, which includes an attribute for whether the format is natively compressed. I simply examined the bytes, determined the file format, and did a quick check to see if it was natively compressed. I also took various measurements, including how much content could be compressed, what its type was, and where it came from.
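
Our format database is internal, but the idea is simple enough to sketch: sniff the real format from the leading bytes rather than trusting the Content-Type header, and treat already-compressed formats as not worth gzipping. The handful of magic numbers below is just an illustrative sample, not the full database:

# Illustrative only: a tiny subset of file signatures for formats that are
# already natively compressed.
NATIVELY_COMPRESSED_SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG",
    b"\xff\xd8\xff": "JPEG",
    b"GIF87a": "GIF",
    b"GIF89a": "GIF",
    b"\x1f\x8b": "GZIP",
    b"PK\x03\x04": "ZIP-based (e.g. OOXML)",
    b"wOFF": "WOFF",
}

def is_natively_compressed(body):
    """Return True if the response bytes look like an already-compressed format."""
    return any(body.startswith(sig) for sig in NATIVELY_COMPRESSED_SIGNATURES)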

Big Questions To Answer

I wanted to use all this data to answer some big questions. Specifically, I wanted to know:

  • How much of the content was compressible, and how much was actually getting compressed?
  • What types of files get compressed more or less often than others? For example, are most people compressing HTML but forgetting CSS?
  • How are websites doing as a whole? Are some websites applying compression perfectly and some completely failing? Or is it that all websites are overlooking one or two little things?

Ultimately, I wanted to use these answers to try to draw broader conclusions about how performance optimization is implemented at big companies, and why.

The Findings

[Pie chart: compressible vs. non-compressible content]

Of the 90,517 responses I examined, only 14,316 responses (15.81%) were compressible. This is an interesting stat, because it goes to show how much of the web is dominated by binary content like images. This is why I’m a big proponent of image optimization, and it’s nice to see the topic of image optimization on the radar of more mainstream tech bloggers like Daring Fireball’s John Gruber and Jeremy Keith. As I said in my Take it all off: Lossy Image Optimizations talk at Velocity 2011, a 20% reduction in image size has more of an effect on total content size than an 80% reduction of text content.

Let's dig into those compressible responses. There are three categories a compressible response can fall into (see the code sketch after the list):

  1. Properly Compressed Responses – A response that could be HTTP compressed, and was in fact served to Zoompf using HTTP compression. For example, a CSS file which is served with HTTP compression. This is a good category, since the content owner is optimally serving the content.
  2. Properly Uncompressed Responses – A response which is not natively compressed, but compressing it makes the response larger. For example, a small HTML file which is actually larger when served compressed. This is a good category, since the content owner is optimally serving the content.
  3. Responses Missing Compression – A response which is not natively compressed and would be smaller if HTTP compression were used, but which is not compressed. For example, an SVG image served without compression, since SVG images are not natively compressed. This is a bad category, because the content owner is inefficiently delivering content to the client.
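
To make the classification concrete, here is a rough sketch of the decision, assuming we already have each response's headers (as a plain dict) and its raw, decoded body bytes; is_natively_compressed() is the illustrative helper from the Methodology section above:

import gzip

def classify(headers, body):
    """Bucket a response into one of the categories above (rough sketch)."""
    if is_natively_compressed(body):
        return "not applicable"        # e.g. JPEG, PNG, WOFF: already compressed
    encoding = headers.get("Content-Encoding", "")
    if "gzip" in encoding or "deflate" in encoding:
        return "Properly Compressed"   # served with HTTP compression
    # Not compressed on the wire: would gzip actually have helped?
    if len(gzip.compress(body)) < len(body):
        return "Missing Compression"   # would have been smaller if compressed
    return "Properly Uncompressed"     # compression would have made it larger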

You can see the breakdown of responses in the graph and table below:

[Chart: how compressible content is served]

Type                       # of Responses
Properly Compressed        8825
Properly Uncompressed      2144
Missing Compression        3347

So, across all Alexa Top 1,000 websites, 23.37% of compressible content is not getting compressed. The median savings if compression were used would be 4.4 kilobytes, which is a median savings of 62.3%. That's quite a bit of savings, considering HTTP compression has been a well known, supported, and recommended optimization for over 15 years.

How are the sites themselves doing? Of the Alexa top 1,000 websites, 642 of them are serving at least one item that is missing HTTP compression. In other words, 64% of the Alexa Top 1,000 are not properly applying HTTP compression.

The median number of responses missing HTTP compression per website was 2, so it’s not like the occasional, single response is getting through without compression. This is a larger issue.

To try to understand why content is slipping through, we first need to know which types of content aren't being compressed. The table below shows this.

File Type        Total Responses    Missing Compression    % Missing Compression
JavaScript       5469               1161                   21.23%
HTML             4090               857                    20.95%
CSS              2849               495                    17.37%
ICO              541                377                    69.69%
Generic Text     451                38                     8.43%
RSS              427                156                    36.53%
EOT              180                87                     48.33%
SVG              130                100                    76.92%
Atom             60                 28                     46.67%
TTF              42                 25                     59.52%
BMP              17                 12                     70.59%
OTF              10                 10                     100.00%
Generic Bin      7                  1                      14.29%

(Generic Binary and Generic Text are responses whose format Zoompf could not determine. This typically indicates an incorrect MIME type, and we could not conclusively determine a type by examining the first 500 characters or so. Manually spot checking revealed that most of these responses were JavaScript files served with an incorrect MIME type which had a mix of HTML tags in them that confused our scanner. For the purposes of analysis, I will ignore these and not count them as any other file type.)

This table tells us lots of interesting things about how content is compressed by the Alexa Top 1,000:

  1. Approximately 20% of all HTML, JavaScript and CSS files are served without compression. Major websites are having real problems compressing even the most basic and common types of compressible content.
  2. JavaScript is the single largest source of compressible content, yet it is served compressed less often than CSS or HTML. I believe this is due to the widespread use of 3rd party libraries and widgets which are served from a website you don’t control. While people can configure their own sites to compress content, 3rd parties serving their JavaScript files appear to be using compression less often than other sites. Based on their URLs, the majority of JavaScript resources missing compression appear to be for analytics scripts.
  3. Lesser known compressible file types are forgotten far more often than HTML, JavaScript, or CSS. While there are fewer instances of these formats on the web, they are more than twice as likely to be uncompressed.
  4. Atom feeds are not very popular.
  5. Someone is using BMP images on an Alexa Top Website? Wow. At least some of them are being served with compression.
  6. SVG images are present largely as fallback for web fonts.

Another interesting area is 404s. 404s are often overlooked because, while the request might be for a file like logo.png, the response is HTML. If your server is not configured properly, it will see the file extension .png and not apply compression, even though the response is compressible text. Of the 1,513 responses which had a 404 status code, 490 were not compressed. In other words, 32.4% of all 404 handlers in the Alexa Top 1,000 are not using HTTP compression properly.
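
If you want to spot-check your own 404 handler, a quick way is to request a path that doesn't exist but has an "image" extension, advertise gzip support, and see whether the HTML error page comes back compressed (the URL below is just a made-up example):

import requests

resp = requests.get(
    "http://www.example.com/does-not-exist-12345.png",   # hypothetical URL
    headers={"Accept-Encoding": "gzip, deflate"},
)
print("Status:", resp.status_code)                                # expect 404
print("Content-Type:", resp.headers.get("Content-Type"))          # usually text/html
print("Content-Encoding:", resp.headers.get("Content-Encoding"))  # None means not compressed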

As I wrote about before, the use of Content-Encoding: deflate is incredibly problematic and broken. Luckily I did not see it in wide use in the Alexa Top 1,000. Of the 8,825 responses I saw which were using HTTP compression, only 23 were using DEFLATE. What is troubling is that, based on the HTTP response headers, virtually all of these deflate responses came from a Juniper Networks DX series load balancer or web accelerator. A company like Juniper should know better.

See For Yourself

I'm a big believer in transparency, so here are my raw data files so you can review them yourself. In fact, if you work at an Alexa Top 1,000 site, there is even a text file listing exactly which URLs are missing compression!

What Does This All Mean?

This data supports much of what I discussed in the Lose The Wait: HTTP Compression post. Specifically:

  • HTTP compression, though easy in theory, is not properly implemented in practice. The majority of Alexa Top 1,000 websites are not completely implementing HTTP compression.
  • The most common compressible content (HTML, CSS, and JavaScript) is not properly compressed 20% of the time. This is most likely due to incorrectly configuring the web server to apply HTTP compression based on file extensions or MIME types that are incorrect or missing.
  • Less common text formats, like RSS and XML, are more than twice as likely to be served uncompressed. People are forgetting about these files, and common configuration examples on the web exclude them.
  • Non-natively compressed file formats, such as ICO, SVG, and various font files, are more than twice as likely to be served without HTTP compression. People are forgetting about these files, and common configuration examples on the web exclude them.
  • Nearly 1/3 of all 404 handlers do not use HTTP compression. This figure is over 50% higher than the 20% of regular, non-404 HTML files served without HTTP compression. This is most likely caused by web servers configured to use the requested URL's file extension to decide whether HTTP compression should be used for the response.

This data also reinforces a great quote from Mike Belshe, one of the creators of SPDY, about optional features:

“Experience shows that if we make features optional, we lose them altogether due to implementations that don’t implement them, bugs in implementations, and bugs in the design.” – Mike Belshe

Compression was an optional afterthought for HTTP, and so 20 years later we still have problems using HTTP compression appropriately.

The Million Dollar Question

What was startling when I really dug into the numbers was who seemed to be having the most trouble, and how widespread it was. A number of very large websites with obviously capable staffs were not compressing the majority of their content. For example, major news sites like The Washington Post, ABC News, The New York Post, CNBC, Sky in the UK, and NPR all served 80%-90% of their compressible resources without compression. Even The Pirate Bay has 52 RSS feeds referenced on its main page using <link> tags, and none of them are compressed. The Japanese social networking site Ameblo has an RSS feed of user activity that's over 4 megs.

So why is this happening? The guys who run The Pirate Bay are technical geniuses. The IT departments and budgets for ABC or The Washington Post are huge. How is it that the biggest and most popular websites in the world, who have the most to gain and the most to lose from web performance, and who are the best equipped, staffed, and funded to solve these problems, can’t seem to solve these problems?

This is the million dollar question. And it is literally the million dollar question, because it’s costing these websites millions of dollars.

The short answer is "They don't know they have a problem." I know this is true because whenever I tell one of these websites about the problem, their immediate answer is "We aren't? Crap, we should. Let's fix that." And then they do.

Frankly, this reaction opens up an entirely different and very scary box. Because "They don't know they have a problem" is just another way of saying "testing for front-end performance issues is very immature, even at the largest organizations." That is an incredibly huge problem for our industry, with many different facets. It's too big a topic to stick at the tail end of this post, so I'll put all my thoughts on this in a future blog post.

More Testing

More testing is clearly needed. The results for font files are new and interesting to me. While I know fonts like WOFF are natively compressed, it is interesting that OTF and others can be compressed using HTTP compression. This indicates either no native compression, or a compression scheme poor enough that a second pass of DEFLATE makes it smaller. Also, the use of HTTP compression should be something that is charted over time, to see if we are improving. My last figure, from 2010, showed 78% of Alexa Top 1,000 sites having at least one resource served without HTTP compression. Perhaps this should be added to the HTTP Archive.

If you want to find out whether your site is properly applying HTTP compression, Zoompf offers a free performance scan of your website. HTTP compression, and the various implementation issues surrounding it, are just a few of the nearly 400 performance issues Zoompf detects when testing your web applications. You can also take a look at our Zoompf WPO product.

Comments

    Martijn
    May 2, 2012 at 8:32 pm

    “Perhaps this should be added to the HTTP Archive.”

    You are not the first to think this, others have put up a suggestion akin to this on the HTTP Archive issue tracker:
    1. https://code.google.com/p/httparchive/issues/detail?id=14 (Oct 12, 2010; by Steve Souders)
    2. https://code.google.com/p/httparchive/issues/detail?id=273 (Dec 27, 2011; by Andres Riancho)

    Thought I’d leave this as a comment here, it would be very interesting to see statistics on compression.

    May 3, 2012 at 2:28 am

    Excellent Martijn, thanks for letting me know! I reached out to Steve Souders this afternoon and am looking at the HTTP Archive code on GitHub to understand how this might work.
