Lose the Wait: HTTP Compression

Posted: February 10, 2012 at 4:25 pm

One of the ways you can improve website performance is to reduce the amount of data that needs to get delivered to the client. An easy way to do this is to compress the content before transferring it to the client. This can be done with HTTP compression. Despite being a surprisingly simple feature of HTTP, there are numerous challenges which must be addressed to use HTTP compression properly. These challenges are:

  1. Ensuring you are only compressing compressible content.
  2. Ensuring you are not wasting resources trying to compress uncompressible content.
  3. Selecting the correct compression scheme for your visitors.
  4. Configuring the web server properly so compressed content is sent to capable clients.

In this post, part of our Lose the Wait performance series, I will discuss each of these issues and demonstrate how to configure your web server to implement HTTP compression properly.

Compressing Compressible Things

Let’s start out easy. What should HTTP compression get applied to? The answer is simple: Any content which is not already natively compressed.

Notice I didn’t say "text resources." Text resources, like HTML, CSS, and JavaScript certainly should be compressed because they are not natively compressed file formats. Unfortunately, most people seem to focus on these 3 types of files. In fact, a quick web search shows that most of the top results for ".htaccess compress" include instructions only on compressing HTML, CSS, and JavaScript files. This just reinforces what I’ve said before; you have to be careful where your advice comes from.

Here is a list of common text resource types on the web which should be served with HTTP compression:

  • XML. XML is structured text used in standalone files (like Flash’s crossdomain.xml or Google’s sitemap.xml) or as a data format wrapper for API calls.
  • JSON. JSON is a subset of JavaScript used as a data format wrapper for API calls.
  • News feeds. Both RSS and Atom feeds are XML documents.
  • HTML Components (HTC). HTC files are a proprietary Internet Explorer feature which package markup, style, and code information used for CSS behaviors. HTC files are often used by polyfills such as Pie or iepngfix.htc to fix various problems with IE or to back port modern functionality.
  • Plain Text. Plain text files can come in many forms, from README and LICENSE files, to Markdown files. All should be compressed.
  • Robots.txt. Robots.txt is a specific text file used to tell search engines what parts of the website to crawl. Robots.txt is often forgotten since it is not usually accessed by humans and does not appear in JavaScript-based web analytics logs. Since robots.txt is repeatedly accessed by search engine crawlers and can be quite large, it can consume large amounts of bandwidth without your knowledge.

ICO

As I said, HTTP compression isn’t just for text resources and should be applied to all non-natively compressed file formats. What do I mean by this?

As an example, let’s look at ICO files. ICO is an image format originally used for icon images on Windows. The format, as it is in use today, was created over 20 years ago for Windows 3.0. Today, ICO files are used on the web as Favicons for a website, usually displayed in the address bar or browser tab. While modern browsers allow other file formats besides ICO, support is not universal. Many sites continue to use ICO files as Favicons for compatibility reasons.

Despite being an image, ICO files are not natively compressed. ICO images are actually a primitive version of a BMP image. Neither ICO nor BMP image formats are natively compressed. While you can (and should) avoid using BMP images on your website, you can’t do this with ICO files. Be sure to configure your web server to serve ICO images with HTTP compression.

SVG

SVG images are another example of an image format which is not natively compressed. SVG images are just XML documents, but they have a different MIME type and file extension. This means that, while someone might remember to compress XML documents, they forget to compress SVG documents.

You might be using SVG images on your website and not even know it. This is because of a feature of SVG images, SVG fonts, which allows SVG files to contain font glyphs used to render text. These SVG image-that’s-really-a-font files can be referenced in CSS using the @font-face syntax much like an OTF or WOFF font file. Divya Manian has written a comprehensive post about the pros and cons of SVG fonts. For the purposes of this discussion, the main take-away from her post is that, until iOS 5, SVG fonts were the only type of custom font supported by iPhone, iPad, and iPod Touch.

Font support is, to put it nicely, a giant mess. Font libraries abstract this away from the web developer and serve the correct format, including SVG fonts, to the correct browser. This means your website can be using SVG without you even knowing it. Remember to serve your SVG files using HTTP compression.

Compressing Already Compressed Content

Another mistake developers make with HTTP compression is using it on content that is already natively compressed. Applying compression to something that is already compressed doesn’t help improve performance. In fact, it can hurt performance in two ways.

First, HTTP compression has a cost. The web server has to take the content, compress it, and then send it to the client. If the content cannot be compressed further, you are just wasting CPU doing a meaningless task.

Secondly, applying HTTP compression to something that’s already compressed doesn’t make it smaller. In fact, the overhead of adding headers, compression dictionaries, and checksums to the response body actually makes it bigger, as shown in the figure below:
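The size penalty is easy to reproduce yourself. Here is a minimal Python sketch, assuming seeded pseudo-random bytes are a fair stand-in for already compressed data such as a PNG; the helper name gzip_size_change and both sample payloads are made up for this illustration.

```python
import gzip
import random


def gzip_size_change(payload: bytes) -> int:
    """Return how many bytes gzip adds (positive) or saves (negative)."""
    return len(gzip.compress(payload)) - len(payload)


# Repetitive markup stands in for a typical text resource.
text_like = b"<li class='item'>Hello, world</li>\n" * 200

# Assumption: seeded pseudo-random bytes stand in for natively compressed
# content (PNG, JPEG, ZIP). There is no redundancy left for DEFLATE to remove.
rng = random.Random(2012)
compressed_like = bytes(rng.getrandbits(8) for _ in range(4096))

print("text-like payload:      ", gzip_size_change(text_like), "bytes")
print("compressed-like payload:", gzip_size_change(compressed_like), "bytes")
```

The text-like payload shrinks dramatically, while the compressed-like payload comes out slightly larger than it went in, because the GZIP header, block framing, and checksum have nothing to pay for them.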

Do websites actually do this? Yes, and it’s more common than you would think. I used Zoompf WPO to examine Fox News. Fox News is the 40th most visited website in the United States. As you can see, Fox News is mistakenly applying HTTP compression to PNG images.

This not only wastes CPU, but also increases the size of the PNG images delivered to Fox News visitors by a few dozen bytes:

Zoompf actually has two different checks for this issue. The first check "Compressed Content served with HTTP compression" alerts you that you are wasting CPU time compressing something that is already compressed. The second check, "Bigger with HTTP Compression" identifies content that is actually larger when served using HTTP compression.

Both of these problems usually are the result of a configuration problem with the web server or an inline network device. Something in your environment is applying HTTP compression to all outbound content instead of only content that should be compressed.

GZIP Vs. DEFLATE

So far, we have talked about HTTP compression as if it is an opaque or atomic feature. But that is not the case. HTTP simply defines a mechanism for a web client and web server to agree on a compression scheme that can be used to transmit content. This is accomplished using the Accept-Encoding and Content-Encoding headers. There are two commonly used HTTP compression schemes on the web today: DEFLATE and GZIP.

DEFLATE is a patent-free compression algorithm for lossless data compression. There are numerous open source implementations of the algorithm. The standard implementation library most people use is zlib. The zlib library provides functions for compressing and decompressing data using DEFLATE/INFLATE. The zlib library also provides a data format, confusingly named zlib, which wraps DEFLATE compressed data with a header and a checksum.

GZIP is another compression library which compresses data using DEFLATE. In fact, most implementations of GZIP actually use the zlib library internally to perform DEFLATE/INFLATE compression operations. GZIP produces its own data format, confusingly named GZIP, which wraps DEFLATE compressed data with a header and a checksum.

Unfortunately, the HTTP/1.1 RFC does a poor job when describing the allowable compression schemes for the Accept-Encoding and Content-Encoding headers. It defines Content-Encoding: gzip to mean that the response body is composed of the GZIP data format (GZIP headers, deflated data, and a checksum). It also defines Content-Encoding: deflate but, despite its name, this does not mean the response body is a raw block of DEFLATE compressed data. According to RFC-2616, Content-Encoding: deflate means the response body is:

[the] "zlib" format defined in RFC 1950 [31] in combination with the "deflate" compression mechanism described in RFC 1951 [29].

So, DEFLATE, and Content-Encoding: deflate, actually means the response body is composed of the zlib format (zlib header, deflated data, and a checksum).
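To make the three formats less abstract, here is a small Python sketch that produces all of them from the same input using the standard zlib module, whose wbits parameter selects the wrapper. The helper name deflate_with_wrapper and the sample body are assumptions for this example only.

```python
import zlib


def deflate_with_wrapper(data: bytes, wbits: int) -> bytes:
    """Compress data with DEFLATE; wbits selects which wrapper zlib emits."""
    compressor = zlib.compressobj(6, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


body = b"<p>HTTP compression</p>" * 50

raw_deflate = deflate_with_wrapper(body, -15)  # RFC 1951: no header, no checksum
zlib_format = deflate_with_wrapper(body, 15)   # RFC 1950: what "deflate" is defined to mean
gzip_format = deflate_with_wrapper(body, 31)   # GZIP format: what "gzip" means

for name, blob in [("raw", raw_deflate), ("zlib", zlib_format), ("gzip", gzip_format)]:
    print(f"{name:5} {len(blob):4} bytes, starts with {blob[:2].hex()}")
```

The zlib and GZIP outputs carry the same DEFLATE data as the raw output, just with a different header in front and a different checksum behind. A GZIP body always starts with the bytes 1f 8b, which is a handy thing to look for when inspecting a response by hand.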

This "deflate the identifier doesn’t mean raw DEFLATE compressed data" idea was rather confusing. Early versions of Microsoft’s IIS web server were programmed to return raw DEFLATE compressed data for Accept-Encoding: deflate requests instead of a zlib formatted response. And naturally versions of Internet Explorer at the time expected responses with a Content-Encoding: deflate header to have raw DEFLATE response bodies.

As Mark Adler, one of the authors of zlib, explains in this Stack Overflow thread:

However early Microsoft servers would incorrectly deliver raw deflate for "Deflate" (i.e. just RFC 1951 data without the zlib RFC 1950 wrapper). This caused problems, browsers had to try it both ways, and in the end it was simply more reliable to only use GZIP.

As Mark says, browsers receiving Content-Encoding: deflate had to handle two possible situations: the response body is raw DEFLATE data, or the response body is zlib wrapped DEFLATE. So, how well do modern browsers handle raw DEFLATE or zlib wrapped DEFLATE responses? Verve Studios put together a test suite and tested a huge number of browsers. The results are not good.

All those fractional results in the table mean the browser handled raw-DEFLATE or zlib-wrapped-DEFLATE inconsistently, which is really another way of saying "It’s broken and doesn’t work reliably." This seems to be a tricky bug that browser creators keep re-introducing into their products. Safari 5.0.2? No problem. Safari 5.0.3? Complete failure. Safari 5.0.4? No problem. Safari 5.0.5? Inconsistent and broken.

Sending raw DEFLATE data is just not a good idea. As Mark says "[it's] simply more reliable to only use GZIP."
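For a sense of what "try it both ways" looks like on the client side, here is a hedged Python sketch. It is not how any particular browser implements this; decode_deflate_body is a hypothetical helper that simply tries the zlib-wrapped interpretation first and falls back to raw DEFLATE.

```python
import zlib


def decode_deflate_body(body: bytes) -> bytes:
    """Decode a 'Content-Encoding: deflate' body that may or may not be zlib-wrapped."""
    try:
        # What RFC 2616 specifies: zlib header + DEFLATE data + checksum.
        return zlib.decompress(body)
    except zlib.error:
        # What early IIS sent: bare DEFLATE data with no wrapper at all.
        return zlib.decompress(body, wbits=-zlib.MAX_WBITS)


original = b"Lose the Wait: HTTP compression. " * 20

# A spec-compliant body, and a raw body produced with negative wbits.
wrapped = zlib.compress(original)
raw_compressor = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
raw = raw_compressor.compress(original) + raw_compressor.flush()

print(decode_deflate_body(wrapped) == original)
print(decode_deflate_body(raw) == original)
```

Every client needs this guessing step for deflate, and none of them need it for GZIP. That asymmetry is the practical argument for sending GZIP only.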

It should also be noted that all browsers that support DEFLATE also support GZIP, but not all browsers that support GZIP support DEFLATE. Some browsers, such as Android, don’t include deflate in their Accept-Encoding request header. Since you are going to have to configure your web server to use GZIP anyway, you might as well avoid the whole mess with Content-Encoding: deflate.

Luckily, avoiding DEFLATE isn’t all that difficult.

The Apache module which handles all HTTP compression is mod_deflate. Despite its name, mod_deflate does not support deflate at all. It’s impossible to get a stock version of Apache 2 to send either raw DEFLATE or zlib wrapped DEFLATE. Nginx, like Apache, does not support deflate at all. It will only send GZIP compressed responses. Sending an Accept-Encoding: deflate request header will result in an uncompressed response.
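The GZIP-only behavior described above can be sketched in a few lines of Python. This is an illustration of the policy, not the code Apache or Nginx actually run, and the function names accepted_codings and choose_content_encoding are invented for the example.

```python
def accepted_codings(accept_encoding: str) -> dict:
    """Parse an Accept-Encoding header into {coding: qvalue}."""
    codings = {}
    for item in accept_encoding.split(","):
        token, _, params = item.strip().partition(";")
        token = token.strip().lower()
        if not token:
            continue
        q = 1.0
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed qvalue as "not acceptable"
        codings[token] = q
    return codings


def choose_content_encoding(accept_encoding: str) -> str:
    """Return 'gzip' if the client accepts it, otherwise 'identity'."""
    codings = accepted_codings(accept_encoding)
    q = codings.get("gzip", codings.get("*", 0.0))
    return "gzip" if q > 0 else "identity"


for header in ["gzip, deflate", "deflate", "gzip;q=0, deflate", ""]:
    print(repr(header), "->", choose_content_encoding(header))
```

Note that a client which only offers deflate gets an uncompressed response, which matches the behavior described for Apache and Nginx.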

Microsoft’s IIS web server can send both gzip and deflate responses, and you can enable or disable each scheme individually. For IIS6, you can edit the metabase to disable DEFLATE support. For IIS7, you can disable DEFLATE support by editing the DEFLATE compression scheme section in the <schemes> element of the <httpCompression> element of the various IIS7 .config files.

Both Zoompf’s free and commercial products have a check built-in, “Obsolete Compression Format”, which will detect if your web server is sending content compressed with DEFLATE.

Netscape 4 and Internet Explorer 6 Are Screwing You. Again.

So by now you should have your web server configured to:

  1. Properly compress what needs to be compressed.
  2. Avoid compressing already compressed content.
  3. Only use GZIP.

Now you need to ensure that your configuration is not actually excluding perfectly capable browsers.

While HTTP compression is a mature feature today, there were some problems early on. Netscape 4 only supported HTTP compression for HTML documents even though it sent an Accept-Encoding: deflate, gzip for all requests. Serving it HTTP compressed CSS or JS documents would make it crash. For reasons that aren’t quite clear, the developers of Apache decided to address this client-side bug with a server-side fix. They added the following seemingly harmless line into the Apache configuration file:

BrowserMatch ^Mozilla/4 gzip-only-text/html

Any browser calling itself Mozilla/4 would only receive HTTP compressed HTML files. Since Apache was and is the most popular web server on the Internet, this caused enormous problems which still affect us today.

First of all, this was the middle of the browser wars and Internet Explorer 4, Internet Explorer 5 and even Internet Explorer 6 all identified themselves as Mozilla/4 in their User-Agent strings. But these browsers could accept HTTP compression for non-HTML responses. Trying to patch around one buggy browser caused another to be slow! Since IE6 would ultimately achieve over 95% market share, it was a problem that IE6 would download webpages more slowly from Apache than from other web servers. To resolve this, the Apache developers were forced to add another configuration directive:

BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html

This line means: if the User-Agent has MSIE in it, then turn off the no-gzip and gzip-only-text/html options, thereby instructing Apache to use HTTP compression for all responses if IE asked for it. And all was good, until it wasn’t.

You see, IE6 on Windows XP also had multiple problems with HTTP compression. Most of these issues dealt with compressed CSS or JavaScript files being cached as compressed items which were then read from the cache as if they were not HTTP compressed. So again another Mozilla/4 browser had problems with compression, and so again the Apache developers had to "fix" the issue with another configuration directive:

BrowserMatch \bMSIE\s6 gzip-only-text/html

This directive instructed the web server to only send compressed content for HTML responses if the browser was IE6. While this dealt with the majority of the issues, some of these bugs caused so many extreme edge-case problems that, for reliability reasons, larger sites would disable HTTP compression for IE6 entirely:

BrowserMatch \bMSIE\s6 no-gzip

Eventually Microsoft fixed these issues with hot fixes and, comprehensively, with Windows XP Service Pack 2. But this created a fragmentation problem, where some IE6 browsers could handle HTTP compression for all content, and some could not. Another rule was added in an attempt to serve compressed content to IE6 browsers that had SP2 installed. This was done by looking for the poorly named SV1 identifier in IE6’s User-Agent string:

BrowserMatch "^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1" !no-gzip !gzip-only-text/html
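It helps to trace how these directives stack up for a given browser. The Python sketch below is a simplified model of BrowserMatch, assuming only that each matching directive sets its variables (or unsets them when prefixed with "!") in file order. It uses mod_deflate’s lowercase variable names, and the User-Agent strings are representative examples I picked for illustration rather than an exhaustive list.

```python
import re

# (pattern, variables to set or, with a leading "!", to unset), in file order.
DIRECTIVES = [
    (r"^Mozilla/4", ["gzip-only-text/html"]),
    (r"\bMSI[E]", ["!no-gzip", "!gzip-only-text/html"]),
    (r"\bMSIE\s6", ["gzip-only-text/html"]),
]

# The later SP2 rule, with the parenthesis and dots escaped.
SV1_RULE = (
    r"^Mozilla/4\.0 \(compatible; MSIE 6\.0; Windows NT 5\.1; SV1",
    ["!no-gzip", "!gzip-only-text/html"],
)


def compression_flags(user_agent: str, directives=DIRECTIVES) -> set:
    """Apply each matching directive in order and return the variables left set."""
    flags = set()
    for pattern, variables in directives:
        if re.search(pattern, user_agent):
            for variable in variables:
                if variable.startswith("!"):
                    flags.discard(variable[1:])
                else:
                    flags.add(variable)
    return flags


USER_AGENTS = {
    "Netscape 4": "Mozilla/4.7 [en] (WinNT; I)",
    "IE6": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "IE6 SP2": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)",
    "IE8": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)",
}

for name, user_agent in USER_AGENTS.items():
    print(f"{name:10} {sorted(compression_flags(user_agent))}")
```

An empty set means the browser gets compression for everything. Reorder the list, or drop one entry, and a different set of browsers silently loses compression, which is exactly how the copy-and-paste configurations described next went wrong.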

This chain of "deny this, but not this, unless it’s this, but not if it is also this" directives made configuring a web server to properly serve compressed documents to the appropriate browsers difficult and prone to error. Since these bug/solution cycles happened numerous times over several years, these configuration directives mutated. Blog posts from 2004 would tell you to do one thing and blog posts from 2006 would say another. Much like a child’s game of telephone, shortcomings, errors, missing edge cases, and missing corner cases were magnified as people reused old configuration files and shared the "correct" advice. Even today, many of the top Google search results for configuring HTTP compression for Apache using mod_deflate contain different and incorrect directives.

As I wrote in Advice on Trusting Advice, it all comes down to where you get your advice from. Follow the advice on this top search result and IE9+ gets no compression at all. Follow the advice on this top search result and IE6 gets no compression at all. Follow the advice from this search result and no version of IE will get anything using HTTP compression, except for IE7. Follow the advice from IBM and no version of IE will ever get a non-HTML file using HTTP compression.

Depending on which directives were used, and how the match criteria were configured, you ended up with several possible scenarios:

  • HTTP compression is completely disabled for all Mozilla/4 browsers.
  • HTTP compression is completely disabled for IE6.
  • HTTP compression is completely disabled for IE6 except SV1.
  • HTTP compression is completely disabled for all versions of IE.
  • HTTP compression is completely disabled for all versions of IE except IE6 (so no compression for IE > 6).
  • HTTP compression for non-HTML files is disabled for all Mozilla/4 browsers.
  • HTTP compression for non-HTML files is disabled for IE6.
  • HTTP compression for non-HTML files is disabled for IE6 except SV1.
  • HTTP compression for non-HTML files is disabled for all versions of IE.
  • HTTP compression for non-HTML files is disabled for all versions of IE except IE6 (so no compression for IE > 6).

Apache makes it quite easy to mess this up. Nginx is much easier. It completely ignores the old Netscape 4 browsers and does not attempt to work around them. It also has a very simple mechanism to avoid sending compressed content to bad versions of IE6. You don’t need to manually define "this is good" and "this is bad" regexes, which helps you avoid making a mistake.

In practice, you should not even try to work around these problematic browsers. The problem browsers have all been updated or patched. Even the most recent of the affected browsers, IE6, was fixed nearly a decade ago. Even on platforms that are no longer supported, this issue has been fixed. You should review your configuration file and remove any browser filtering code used for HTTP compression.

Hopefully this section has also taught you that fixing a client-side bug with a server-side fix is rarely a good or sustainable idea. As I discussed in The Big Performance Improvement in IE9 No One is Talking About, this approach of using the User-Agent as a factor in content generation forced the widespread use of the Vary: User-Agent header. The Vary header used in this manner effectively nullifies shared caching, which reduces the overall performance of the web.

Extension Vs. MIME Type

It is important to review how your web server is configured to compress content. Most web servers allow you to specify either a list of file extensions to compress, or a list of MIME types to compress, or both. Be careful to review this list.

Let’s say you have configured your application to serve text/javascript responses using compression. Are you sure that’s the only MIME type your application uses when serving JavaScript files? What about text/x-javascript or application/x-javascript or application/javascript? What MIME type does your API serve for JSON responses? text/json? application/json? Something else? How about HTML? Are all of your HTML files using text/html? Do you have some sections from the XHTML days which use other MIME types like application/xhtml+xml or text/xhtml or application/xhtml? Is all of the markup generated by your application served using a single and consistent MIME type? And let’s not forget about the code you didn’t write. What MIME type does that opaque charting library use to send data to the client? Or that auto-completing textbox widget you got from GitHub?

If you are configuring the web server to use compression using file extensions, did you get all of them? .htm or .html or is it something else? What about your 404 handler? A request happens for the non-existent file /foo/bar.jpg. Since the file extension is not explicitly defined as something that should be compressed (or, being an image, is explicitly defined not to be compressed), the 404 response isn’t sent with compression.

Care must be taken when configuring your web server to ensure that uncompressed content is not slipping through due to a missing file extension or MIME type declaration.
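One way to think about the MIME type side of this is as a simple membership test that has to tolerate aliases, letter case, and parameters like charset. The Python sketch below is an illustration under those assumptions; the COMPRESSIBLE_TYPES set is a starting point drawn from the types discussed in this post, not a complete or authoritative list, and should_compress is a made-up helper name.

```python
# Illustrative, not exhaustive: extend this with whatever your application emits.
COMPRESSIBLE_TYPES = {
    "text/html", "application/xhtml+xml",
    "text/css",
    "text/javascript", "text/x-javascript",
    "application/javascript", "application/x-javascript",
    "application/json", "text/json",
    "text/xml", "application/xml",
    "application/rss+xml", "application/atom+xml",
    "text/plain",
    "text/x-component",                          # HTC files
    "image/svg+xml",
    "image/x-icon", "image/vnd.microsoft.icon",  # ICO files
}


def should_compress(content_type: str) -> bool:
    """Decide from a Content-Type header value whether to compress the response."""
    # Drop parameters such as "; charset=UTF-8" and normalise case before matching.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in COMPRESSIBLE_TYPES


for header in ["text/html; charset=UTF-8", "application/x-javascript", "image/png"]:
    print(header, "->", should_compress(header))
```

Deciding on the response’s MIME type rather than the requested file extension also covers cases like the 404 page for /foo/bar.jpg, since that response is HTML no matter what the URL looks like.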

Properly Configuring HTTP Compression

So, given all these challenges, how should you go about configuring HTTP compression properly?

To see where you might have made a mistake configuring your server, you need something to compare it to. I am a big fan of the .htaccess file from the HTML5 Boilerplate Project. This is an Apache configuration file specifically crafted for web performance optimizations. It provides a great starting point for implementing HTTP compression properly. It also serves as a nice guide to compare to an existing web server configuration to verify you are following best practices. At the very least, the HTML5 Boilerplate .htaccess file provides a comprehensive list of common web content which should or should not get served using HTTP compression.

Getting a good starting point is only half the battle. The configuration for HTTP compression on a web server only works when it matches the application running on that server. Even the HTML5 Boilerplate configuration file can fail you if there is a discrepancy between the file extensions and MIME types in the configuration file and those used by your application. It’s easy to forget or overlook a MIME type or a file extension that your application uses. To ensure your application matches your configuration, the best thing to do is carefully review:

  1. How is your web server configured to map MIME types to content or file extensions?
  2. How is your web server configured to compress content relative to those MIME types or extensions?
  3. How are your application’s filenames and extensions structured?
  4. How does your application change or override a response’s MIME type?
  5. What third party libraries use MIME types?

Once you think you have properly configured the web server, you need to validate it. Web Sniffer is a great, free, web-based tool that lets you make individual HTTP requests and see the responses. Web Sniffer gives you some control over the User-Agent and Accept-Encoding header to ensure that compressed content is delivered properly. Hurl is another web-based HTTP tool you can use. It allows for more control than Web Sniffer, but requires you to manually enter more information to get the same results:

Hurl and Web Sniffer only test a single page at a time. You can use Zoompf’s free scan or Zoompf WPO to scan multiple pages and verify no uncompressed content is slipping through.
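If you would rather script a spot check yourself, something along these lines works with nothing but Python’s standard library. Treat it as a rough sketch: build_probe and probe are hypothetical helper names, the default User-Agent string is made up, and the URL in the comment is a placeholder.

```python
import urllib.request


def build_probe(url: str, accept_encoding: str = "gzip",
                user_agent: str = "compression-probe/0.1") -> urllib.request.Request:
    """Build a request that advertises the given Accept-Encoding and User-Agent."""
    return urllib.request.Request(url, headers={
        "Accept-Encoding": accept_encoding,
        "User-Agent": user_agent,
    })


def probe(url: str, **kwargs) -> dict:
    """Fetch url and report how the server chose to encode the response."""
    with urllib.request.urlopen(build_probe(url, **kwargs), timeout=10) as response:
        # urllib does not decompress the body, so this is the encoded size.
        body = response.read()
        return {
            "status": response.status,
            "content_type": response.headers.get("Content-Type"),
            "content_encoding": response.headers.get("Content-Encoding", "identity"),
            "vary": response.headers.get("Vary"),
            "body_bytes": len(body),
        }


# Example (placeholder URL, not executed here):
#   print(probe("https://www.example.com/robots.txt"))
#   print(probe("https://www.example.com/robots.txt", accept_encoding="deflate"))
```

Running it once with gzip and once with deflate against the same URL shows whether the server sticks to GZIP. Keep in mind that urlopen raises urllib.error.HTTPError for 4xx and 5xx responses, so checking a 404 page means catching that exception and reading its headers instead.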

Conclusions

As this post shows, there are many challenges which must be overcome to properly configure HTTP compression. Make sure all non-natively compressed content is served using HTTP compression. Don’t waste load time, CPU cycles, and bandwidth compressing content that is already compressed. Only use GZIP compression to ensure compatibility. Don’t try to work around old browsers since it is easy to make a mistake and end up not delivering compressed content to a capable browser. Review your application code and server configuration to make sure the application’s content and structure matches your HTTP compression settings. Don’t forget about compressing 404s. Finally, don’t just assume your configuration works. Use a tool to validate that it works.

Want to see what performance problems your website has? Content Served Without Compression, Compressed Content Served with Compression, Bigger With Compression, and Obsolete Compression Format are just 4 of the nearly 400 performance issues Zoompf detects when testing your web applications. You can get a free performance scan of your website now and take a look at our Zoompf WPO product at Zoompf.com today!

Performance Questions to Ask Hosting Providers: Secure Website Access

Posted: December 14, 2009 at 3:06 pm

(This is the third article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article, “What control do I have over the web server?” and the second article “What access do you provide to web server logs?” are also available.)

So far in this series we have talked a lot about questions to ask hosting providers to make sure you can configure your website for performance and access the raw traffic logs of your website to spot performance problems. All of this is moot of course if you cannot get content onto your website. That’s why this post of “Questions to ask a hosting provider” is all about:

“Can I Securely Communicate With My Website?”

It has happened to everyone. You are out at a coffee shop, a client site, or at a conference and you need to make changes to your website. Perhaps you need to upload a few new PHP files or some images. Perhaps you need to update your web server configuration to set up a new email address for an event. Perhaps you simply saw something cool and want to write a WordPress post. But can you do any of these things securely using a public network? This question is best answered with an analogy.

Imagine you are at a formal cocktail party. You drift from room to room, through a sea of lavishly dressed party goers and dine on mouth-watering morsels served on silver trays by waiters in white gloves. As you approach a side table of crystal champagne glasses you overhear bits and pieces of the conversations around you.

  • “We cannot wait. It should be a lovely vacation and it’s the perfect time for us to get away for a week.”
  • “That’s right, with the nanny! Walked right in on them! And he tried to say that she was only choking!”
  • “Chris starts there next spring, just like his father.”

Well attended cocktail parties are loud and noisy. It’s almost impossible not to hear what everyone else is saying! Of course we are taught that to be polite we should ignore the conversations other people are having unless we are involved. You are on the honor system not to eavesdrop.

Public networks such as wireless networks are just like cocktail parties. Your wireless card is like a party guest. It broadcasts out to the room when it “speaks” and “listens” to everyone within range to hear a response. Like a real party guest, wireless cards are supposed to ignore any conversations they overhear that are not meant for them. They do this by dropping the data and not bubbling it up to the computer. However, nothing forces network devices to ignore data they receive that is not meant for them. In fact, all networking devices (not just wireless devices) can be placed into “Promiscuous Mode” where any data they receive, even data that is not addressed to themselves, is received and bubbled up to the computer to process. This allows any networking device to become a giant listening device that hears and records all the information on the network! Promiscuous mode is not some evil hacker trick. It’s a fully intended feature of networking devices that has many legitimate uses.

Diagram showing how clients in a wireless network hear each others' traffic

But wait! I use Encryption!

“The conference wireless network or the coffee shop’s wireless network is encrypted. They tell me they use something called WPA2 with a key of a million bits! I’m secure, right?”

No, you are not secure.

Let’s go back to the cocktail party analogy. The hosts don’t want just anyone coming into their party and drinking all their fine wines. So they place a bouncer at the door of the party. Only people that know the password are allowed into the party. If you know the password you get into the party and can listen to all the other guests. If you do not know the password you remain outside the building and cannot hear anything that is going on inside.

Encrypted wireless networks are just cocktail parties with bouncers. You need the “password” to join the wireless network. Once you are connected you can listen to everyone else’s traffic just like before because on the network everyone is using the same password to transmit and receive their data. (This is the only scalable solution. Otherwise the wireless network administrator would have to create a new, unique password for each and every person that joins the network). In other words, an encrypted network uses the password solely to protect and restrict “access” to the network. It does nothing to protect the users of the network from themselves or from each other.

The Danger of Sniffing (packets)

So what! Who cares if someone can listen to my network traffic? It’s not a big deal. After all, they will just see the blog content I was about to post anyway. Unfortunately, this is not true. Using any system that requires a username and a password on a wireless network? You may have shouted that username and password to the entire cocktail party. And chances are you use that same username and password somewhere else on the Internet. Like your bank. Or an online store. Are you already logged into a system like Gmail or your WordPress administration panel? You are shouting your HTTP Cookies to the entire cocktail party. Someone can steal your HTTP session cookies and use session hijacking to access Gmail or WordPress as if they were you without needing your username and password. Next thing you know you are on The Wall Of Sheep!

Secure Communications With Your Website

Remember: network encryption protects networks and application encryption protects applications! You need to make sure you are using encrypted application protocols to properly protect yourself. What protocols you use and how you use them will vary with different use cases.

Uploading Content

How do you upload content to your website? If the answer is FTP you are in trouble. FTP sends usernames and passwords in the clear. You need an encrypted file transfer mechanism like SFTP or SCP. If you have shell access to your web server using SSH you also have the ability to use either SFTP or SCP as they are simply subsets of the functionality of SSH. By default most hosting companies provide an insecure file transfer system like FTP. Ask if they provide (for free) a secure file transfer system like SFTP or SCP. Make sure they understand you don’t need full SSH functionality and are only interested in secure file transfer. If this is not available you might need to upgrade your account or purchase an add-on to get SSH access for your website.

Writing Content

Do you use a web interface to write content for your blog platform or CMS system? Does it use SSL? Check the address bar. Does it start with https? If not you are not using SSL. Do you write your content using other software? Does that software directly publish the content to your blog using a web API like RSD or XMLRPC? Does that use SSL? Check the settings and see if you are using “https” to access the API interface. If you are not using SSL to communicate with these web resources then anyone can capture your username and password or cookies (which are just as good as your username and password).
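If you keep a list of the URLs your tools are configured to use, the "does it start with https?" check is easy to automate. This is a tiny sketch with invented names and placeholder URLs; it only inspects the URL scheme and says nothing about whether the certificate behind it is valid.

```python
from urllib.parse import urlsplit


def insecure_endpoints(endpoints: dict) -> list:
    """Return the names of endpoints whose URL does not use the https scheme."""
    return [name for name, url in endpoints.items()
            if urlsplit(url).scheme.lower() != "https"]


# Placeholder URLs for illustration only.
my_endpoints = {
    "admin panel": "http://blog.example.com/wp-admin/",
    "publishing API": "https://blog.example.com/xmlrpc.php",
}

print(insecure_endpoints(my_endpoints))
```

Anything this flags is a place where your username, password, or cookies would cross the network in the clear.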

Website Administration

How do you administer your website? Do you use a web interface like cPanel? These web administration interfaces are most common in shared hosting environments and typically run on a different hostname or an odd port number. Ask the hosting provider if they offer SSL access to the interface. Hosting providers often get confused and think you want to create an SSL certificate for your website. While this would secure a CMS you configure like WordPress (see previous use case) it does not help you secure the web administration interface because that is often running on a separate system. Make sure they understand you want secure access to their interface, not your website. This discussion may take several emails back and forth but most hosting providers are willing to supply SSL access to cPanel or other administration interfaces.

Summary

In conclusion, the questions about secure communications you should ask your hosting provider are:

  • “Do you provide a secure file transfer mechanism like SFTP or SCP? Is it provided for free or is it extra? If you don’t, do you offer SSH access to the web server? Is it free?”
  • “If you provide a web-based website administration interface like cPanel do you provide access to it using SSL?”
  • “Do you provide an SSL certificate for my CMS? What is the cost?”

How to judge their answers will vary from person to person based on need. Personally, a secure file transfer mechanism is a requirement. Too many times have I needed to upload a presentation, PDF, or file to my website from a public network at a conference or client site. If you are a heavy blogger, secure access to your content management system is going to be critical. After all, it is difficult to write a blog post about an event from the event if you cannot securely access your blog to write the post!

The Challenge of Dynamically Generating Static Content

Posted: December 7, 2009 at 12:17 am
Time and time again I see people using PHP or some other application logic to try and hack around some issue they are facing. We saw this in our previous post Questions to Ask Hosting Providers: Web Server Configuration where people would use PHP to emulate mod_deflate or mod_expires. Andrew King, in his book Website Optimization talks about wrapping developer comments in CSS or JavaScript files in <?php ?> tags and using the PHP interpreter to remove them. People use PHP to combine CSS or JavaScript resources together. And today I read an article from the always awesome Chris Coyier over at css-tricks.com about using PHP to emulate CSS variables.

Don’t get me wrong. I was actually bemoaning the lack of variables in CSS two days before Chris wrote his article. (Actually, what we really want is more like C/C++ macros but that’s another story). Anyone who has tried to implement CSS sprites, change margins or element sizes, or modify color values knows what a pain it is to go through a CSS file and type the same thing over and over.

Using PHP to solve this problem, or any of the other problems listed above, makes perfect sense at first. Because it makes things easy. Because you are all being lazy. You are using a runtime mechanism to try and simplify your life.

Stop Being Lazy!

Now, under normal circumstances programmers should be lazy! After all your very job is to create something that does work for you! Unfortunately in this case your laziness is harming the performance of your application. Using application logic to dynamically generate static content at runtime is a massively bad idea. Consider these 4 consequences:

  • You take an order of magnitude performance hit for invoking the application tier instead of just serving a flat static file from the file system.
  • Since the web server is not serving a static file, there will be no Last-Modified header sent by default. That means no conditional GETs and no 304 responses which means lots of bandwidth consumption.
  • PHP, like virtually all application tiers, produces a chunked response. This is because the web server has no idea what the content length will be because it is dynamically generated. Dynamically generated chunked responses will not send the Accept-Ranges header. This means no pausing or resuming or error recovery. The entire resource must be re-downloaded.
  • Chunked encoding is not supported with HTTP/1.0, so any HTTP/1.0 device (like every caching proxy ever made) has to flip into “store and forward” mode where it downloads the entire response before passing it along.
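The conditional GET mentioned in the second point can be sketched in a few lines. This is only an illustration of the decision a web server makes for a static file, not how Apache or any particular server implements it; the function name and the sample timestamp are made up for the example.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_status(last_modified: datetime, if_modified_since: Optional[str]) -> int:
    """Return 304 if the client's cached copy is still current, else 200."""
    if if_modified_since is None:
        return 200  # first visit: nothing cached to validate
    try:
        cached = parsedate_to_datetime(if_modified_since)
    except (TypeError, ValueError):
        return 200  # unparseable header: fall back to a full response
    if cached.tzinfo is None:
        # A "-0000" offset parses as a naive datetime; treat it as UTC.
        cached = cached.replace(tzinfo=timezone.utc)
    # HTTP dates have one-second resolution, so drop microseconds.
    if last_modified.replace(microsecond=0) <= cached:
        return 304
    return 200


# Hypothetical modification time of a static CSS file on disk.
file_mtime = datetime(2009, 12, 1, 8, 30, 0, tzinfo=timezone.utc)
# The value a server would send as Last-Modified and a browser would
# echo back as If-Modified-Since on the next visit.
header = format_datetime(file_mtime, usegmt=True)

print(conditional_status(file_mtime, None))
print(conditional_status(file_mtime, header))
```

A static file always has a modification time on disk, so the server can answer the second request with an empty 304. A script that builds the response on every request has no such timestamp unless the developer adds one by hand.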

And as if all these downsides of invoking the application tier were not enough, we have my personal favorite: Web Security! As someone who professionally broke into computer systems for many years, when I see:

http://example.com/combine.php?files=a.js|b.js|c.js

I get very excited. Think about what a resource combiner script does. “Hey website, I’m going to give you a list of files on your hard drive, and I want you to read them off the disk, one at a time, and dump their raw contents into a response and send it to me!” Jackpot baby! This is what we call a Local File Inclusion vulnerability just waiting to happen. The developer has not so much created a resource combiner as they have provided me with a rudimentary remote file download service! I immediately do something like this:

http://example.com/combine.php?files=db.inc

In about 45 seconds I have downloaded the /etc/passwd file, your httpd.conf, your .htaccess, your raw MySQL database, your app config files filled with user names, passwords, and database connection strings, and each PHP file to retrieve all your source code. Or worse, I perform remote file inclusion, thereby injecting a PHP shell, which allows me to completely take over your website! (BTW: Roughly one in every 3 PHP resource combiner scripts I have seen contains these security vulnerabilities. Beware where you get your source code!)
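To make the problem concrete, here is a minimal sketch of the kind of check a combiner would need before touching the disk. It is written in Python rather than PHP, the web root and the extension allow-list are hypothetical, and it is not a complete defense (it does not follow symlinks, for example). It only shows why a raw file name taken from the query string cannot be trusted.

```python
import posixpath

# Hypothetical allow-list: the only file types a combiner should ever serve.
ALLOWED_EXTENSIONS = {".js", ".css"}


def resolve_safely(web_root: str, requested: str) -> str:
    """Resolve a requested name under web_root, or raise ValueError."""
    # join() lets an absolute name replace the root, and normpath()
    # collapses any "../" segments, so check the result, not the input.
    candidate = posixpath.normpath(posixpath.join(web_root, requested))
    if not candidate.startswith(web_root.rstrip("/") + "/"):
        raise ValueError(f"path escapes web root: {requested!r}")
    if posixpath.splitext(candidate)[1] not in ALLOWED_EXTENSIONS:
        raise ValueError(f"file type not allowed: {requested!r}")
    return candidate


for name in ["a.js", "db.inc", "../../../etc/passwd", "/etc/passwd"]:
    try:
        print("serve ", resolve_safely("/var/www/static", name))
    except ValueError as error:
        print("reject", error)
```

Of the four names, only a.js passes both checks. A combiner that skips either check will happily hand over the other three.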

The Fundamental Problem

The fundamental problem in all of these examples is that developers are getting lazy and are using PHP code to do something at runtime that should have been done earlier.

Properly Generating Static Content

Great! So what is a web developer to do? Go back to the dark ages where you cannot leverage all that great application logic in the generation of your content? I want my CSS variables and I want them now! Notice I never said you cannot dynamically generate static content! I just said you should not dynamically generate static content at runtime! Want CSS variables? Want to use a PHP script to combine resources or minify or whatever? Go ahead and do it! Just do it ahead of time. You can run your PHP script from the command line, produce your CSS file, complete with all the correct CDN paths and color values, and upload that to your website. And this isn’t just for PHP. Use Perl, Python, Ruby, Java, or whatever. You can even do it in QBASIC!

    'CSSGEN.BAS - kicking it old school
    CDN$ = "http://zoompf.com/"
    LOGO$ = "includes/logo.png"
    PRINT ".logo {"
    PRINT " background: url("; CDN$ + LOGO$ + ");"
    PRINT "}"

And the output:

[Image: qbasic-css-gen]

(That’s right. I totally just used QBasic 1.1 from DOS 5.0 to automate publishing a web application on 64-bit Vista. Oh yeah!)

The moral of the story is never make the user pay for your laziness. Do not use the application tier of a website to dynamically generate static content at runtime. Instead do it at publishing time or even do it in a daily or hourly cron job. This approach allows you all the advantages of using application logic without drastically reducing the very web performance you were trying to improve in the first place!
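For a slightly less retro take, the same publish-time idea might look like this in Python. Treat it as a sketch: the variable names, the brand color, and the output file name are made up for illustration, and the CDN and logo paths simply mirror the QBasic example.

```python
from pathlib import Path
from string import Template

# Hypothetical values; only the CDN host and logo path come from the
# QBasic example, the color is invented for illustration.
VARIABLES = {
    "cdn": "http://zoompf.com/",
    "logo": "includes/logo.png",
    "brand_color": "#336699",
}

# The "CSS with variables" that developers want to write by hand.
CSS_TEMPLATE = Template(
    ".logo {\n"
    "  background: url(${cdn}${logo});\n"
    "}\n"
    "a {\n"
    "  color: ${brand_color};\n"
    "}\n"
)


def render_css(variables: dict) -> str:
    """Expand the template once, at publish time."""
    return CSS_TEMPLATE.substitute(variables)


def publish(output_path: Path, variables: dict) -> None:
    """Write a flat .css file that the web server can serve statically."""
    output_path.write_text(render_css(variables), encoding="utf-8")


print(render_css(VARIABLES))
```

Calling publish(Path("site.css"), VARIABLES) from a deploy script or a cron job produces an ordinary static file, so the web server keeps its Last-Modified header, its conditional GETs, and its range support.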

Performance Questions to Ask Hosting Providers: Log File Access

Posted: November 25, 2009 at 2:39 pm

(This is the second article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article in the series is here)

Last time we covered the most important question you should ask a hosting provider: What control do I have over the web server? This time we will be showcasing another important question to ask a hosting provider:

“What Access Do You Provide to Web Server Logs?”

The main reason you want access to log files from the web server is to learn how visitors are accessing your content. This will reveal a wealth of knowledge about the raw traffic patterns of your web application and expose various performance issues and limitations. Often these performance issues will not be detected by page-based performance tools like Yahoo’s YSlow or Google’s Page Speed.

Web server logs come in many different formats. Usually they are large text files where every request is logged on its own line. Several pieces of data about each request are logged in different fields on the line, separated by a delimiter (typically a space or a comma, depending on the format). Typically the information that is logged for each request is:

  • URL requested.
  • IP address of the visitor.
  • Date and time request was received.
  • The program or browser used to request the URL. This is called the User-Agent.
  • The Referring webpage (if any).
  • HTTP version used to request the page.
  • Status code of the response.
  • Size of the body of the response.

Log files are a very granular view of your web traffic. Sometimes it can be difficult to see the forest for the trees. For example, what pages did user XYZ visit, in what order, and how long did the user stay on each page? It is usually very difficult to get this information from logs alone because web server logs only track users by a specific IP address. To provide a larger view and answer questions like those listed above, web developers use web analytics packages like Omniture, Hitbox, or Google Analytics. Web analytics packages use cookies and JavaScript to gather detailed information about your visitors, the capabilities of their browsers, and their actions through your web application. Web analytics packages are simple to add to a website. Typically all that is involved is inserting a block of JavaScript at the end of each HTML page. This is very easy to do on templated or dynamically generated websites. So if web analytics provides you with a “bigger picture” and richer data than web server logs, that begs the question:

Are Web Analytics Reports Good Enough?

Actually, no. Web analytics reports are not good enough. Web analytics data abstracts away the raw traffic of your web application, which can hide several important problems. Web analytics packages only track visitor requests and activity for HTML files that are served with a 200 status code. Out of the box, here are things that most web analytics packages do not track:

  • All requests to non-HTML resources.
    • Images
    • JavaScript
    • Style sheets
    • Feeds (RSS, Atom)
    • RIA files (HTC, Flash, Silverlight, Java, etc)
    • Access files (robots.txt, sitemap.xml, crossdomain.xml, etc)
    • Documents (PDF, Office docs, Zip files)
    • Other resources (Fonts, Cursors)
  • Most error pages (404, 5xx, etc).
  • Conditional requests that return “304 Not Modified.”
  • Requests from non browser User-Agents (spiders, mash-ups, etc).
  • Users who have JavaScript disabled for accessibility or (more commonly) security reasons.

This valuable information is completely missed if you are only using web analytics data to understand your traffic. Consider the valuable questions you can answer with web server logs:

  • Where are your redirects? Which can be removed to decrease page load time?
  • What web resources are using the most bandwidth? This is calculated by simply adding up all the body sizes that are returned for a resource and sorting. Can you reduce the size of these files somehow using compression, minification, or by removing meta data?
  • What are the most requested resources on your website? Can you use caching or other methods to minimize the number of times the resource is requested? If you cannot cache those files because they are dynamically generated, can you add programming logic to use a Last-Modified header to reduce bandwidth? Can you remove resources like external JavaScript or CSS files that are referenced but not actually used by that web page?
  • How often does your static content actually change? This is calculated by counting the 304s for a resource. Perhaps you can use a longer Expires time on your content.
  • How often are search engine crawlers visiting your site? Are there crawlers that are missing? You should submit your website to their indexes.
  • Are crawlers finding your high value content? Perhaps you should be using a sitemap, or modifying the one you have.
  • Do crawlers request a large amount of low value content? Do crawlers “get stuck” on part of your website? Perhaps you need to fix your robots.txt.
  • Which web resources are you still getting a lot of requests for that no longer exist? You should use a redirect that points to the correct content.
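Several of these questions come down to simple grouping and counting once the log lines are parsed. Here is a minimal sketch, assuming the entries have already been reduced to (URL, status code, body size) tuples; the sample records are invented.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed records: (url, status, body_bytes).
records = [
    ("/robots.txt", 200, 256000),
    ("/robots.txt", 200, 256000),
    ("/css/site.css", 200, 5120),
    ("/css/site.css", 304, 0),
    ("/css/site.css", 304, 0),
    ("/old-page.html", 404, 512),
    ("/old-page.html", 404, 512),
    ("/promo", 301, 0),
]


def bandwidth_by_url(records):
    """Add up the body sizes per URL and sort, largest consumer first."""
    totals = defaultdict(int)
    for url, _status, size in records:
        totals[url] += size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def urls_by_status(records, wanted):
    """Count requests per URL whose status code is in the wanted set."""
    counts = Counter(url for url, status, _size in records if status in wanted)
    return counts.most_common()


print("Bandwidth:    ", bandwidth_by_url(records))
print("Redirects:    ", urls_by_status(records, {301, 302}))
print("Not modified: ", urls_by_status(records, {304}))
print("Missing:      ", urls_by_status(records, {404}))
```

The same two helpers cover the redirect, bandwidth, 304, and missing-resource questions. Crawler questions would additionally need the User-Agent field.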

Real Life Examples

At Zoompf we have detected and solved numerous performance issues just by examining a client’s web server log files. Here are some of our more interesting stories:

  • A social networking client used their robots.txt file to prevent crawlers from indexing content from new, untrusted members that could be content spammers. As such their robots.txt was 250 kilobytes! Because the file was updated so often all of the search crawlers would request it multiple times a day. These factors resulted in the client using 5 gigs of bandwidth a month just to serve its robots.txt file! Turning on HTTP compression for text files reduced this by 70%. The client is currently implementing the use of <META> tags and rel=”nofollow” attributes to limit search engine indexing for the web pages of untrusted users. This will result in even higher performance savings.
  • A software client found that by far their most popular pages were for their product documentation. These pages were constructed using a PHP templating system. However these files never changed once that version of the product shipped. The client moved to pre-rendering the web pages to static HTML files and using a far future Expires header on the HTML files. This drastically improved performance and reduced bandwidth consumption and server load.
  • An ecommerce client discovered crawlers were requesting all the available colors for each item they sold. For example, the crawlers would visit /item1/, /item1/color/red/, /item1/color/blue/, and so on. By creating a robots.txt rule to prevent crawlers from requesting every color for every item the client reduced their bandwidth by nearly 80% while still having their important content indexed.
  • A client discovered that their most important content was not getting indexed. A developer had copied a code snippet from the Internet into the top of their template to solve a CSS problem they were having. Unfortunately this code snippet also included a <META> tag telling search engines not to index the web page.
  • A client learned that its logo had not changed for over 4 years and it was also an uncompressed BMP file even though the logo had a JPEG extension. They changed the logo to a proper image format for the web and increased the logo’s Expires time.
  • A client discovered that no one had ever requested their iPhone application image. They removed the <LINK> tag and reduced the size of all of their HTML pages.
  • A client discovered their favicon.ico file was consuming huge amounts of bandwidth. This is because it contained multiple versions of the same icon at different dimensions. Removing all but the 16 by 16 pixel version from the ICO file reduced file size by 97%.

Typical Log Access

If you have your own server, access to the web server logs is usually unrestricted. However, in most shared hosting environments you will not have direct access. Typical web server log options you have are:

  • The raw Apache, IIS, or NCSA log files in a directory outside of your web root that you can access using FTP or SFTP. This is the ideal case.
  • The raw Apache, IIS, or NCSA log files placed directly in your web root. While this provides you with the raw logs, anyone on the Internet can also access your log files. This is a security risk as log files can often contain sensitive data like credentials or “hidden” areas of your web application. Talk with your hosting provider about moving the location of the web logs.
  • An option through a web-based website administration system like cPanel that lets you download the raw log file.
  • An option or interface in the web admin system that lets you view or download a specially formatted version of the logs.

If you cannot access the raw log files don’t panic. As long as the log file contains the following information you will have all the data you need:

  • URL requested
  • Date and time request was received
  • The program or browser used to request the URL. This is called the User-Agent.
  • Status code of the response
  • Size of the body of the response

Another question to ask hosting providers is not only “what information is in the log file” but also “how much time does the log file cover?” You can imagine how, in large shared hosting environments, log files can quickly grow to hundreds of megabytes for potentially thousands of customers. Hosting providers often limit the log file in different ways including:

  • Record only a week of traffic and replace the log with a new empty file every week.
  • Limit the total size of the log file. Each new entry removes an entry from the start of the log.
  • Provide a nightly copy of the log file for all the traffic the site received that day. These copies are usually removed after a certain amount of time.

If you do have a time window, make sure you grab a copy of the log file before it is replaced. Some interfaces like cPanel offer a scheduling service that can email you the log files or place them in a special location that you can then download from. You can schedule an FTP download or use wget or curl to download these log files.

Processing Log Files

Depending on how much log data you have, you might want to concatenate your log files together until you have a big enough sample. At Zoompf we suggest collecting a sample of 500,000 to 1,000,000 requests, or a week’s worth of web traffic, whichever is larger. Programs like awstats are very helpful for processing and provide reports with your most popular and least popular files, largest files in terms of bandwidth, and other data already broken out. Directly processing the logs yourself allows you to discover more detailed data and is not as hard as you would think. Some basic regular expressions can make it very easy to gather metrics like “show all of the 304s, 404s, 500s, etc.”
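As a sketch of how little code this takes, the following reads several log files as one stream and tallies the status codes with a single regular expression. It assumes an Apache-style layout where the status code follows the quoted request line, and it assumes that older logs may have been rotated into .gz files; adjust both assumptions to match what your hosting provider gives you.

```python
import gzip
import re
from collections import Counter
from pathlib import Path
from typing import Iterable, Iterator

# Assumes an Apache-style line: ... "GET /url HTTP/1.1" 200 5120 ...
STATUS_PATTERN = re.compile(r'" (\d{3}) (?:\d+|-)')


def read_lines(paths: Iterable[Path]) -> Iterator[str]:
    """Yield lines from several log files as if they were one big file."""
    for path in paths:
        # Assumption: rotated logs are gzip-compressed and end in .gz.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            yield from handle


def count_statuses(lines: Iterable[str]) -> Counter:
    """Tally how many requests returned each status code."""
    counts = Counter()
    for line in lines:
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

A call such as count_statuses(read_lines(sorted(Path("logs").glob("access.log*")))) then gives you the totals for the whole sample. The logs directory and the file naming pattern here are hypothetical.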

Remember, examining your web logs is a key technique for discovering and solving performance problems with your web applications. Those pretty graphs from Google Analytics or other web analytics packages are simply not good enough to detect performance issues and bottlenecks. You need access to the information about all the requests the web server is processing. Make sure you ask your hosting provider how you can access the raw web server log files. Find out how much web traffic data the logs contain and how you can easily collect this data so you can analyze it. If your hosting provider does not provide this you should consider that a deal breaker and find another provider.

Performance Questions to Ask Hosting Providers: Web Server Configuration

Posted: November 17, 2009 at 3:45 pm

Hosting a web application can be annoying and time consuming. There is the cost of the hardware. There is the time spent configuring, administering, and patching the operating system, web server, and other software. There is the security risk of exposing a machine to the Internet. So it’s no surprise that many people and companies use a 3rd party hosting provider to host their web application and manage the infrastructure. The choice of a hosting provider should not be made lightly. You no longer have full control over the machine running your web application. If you are interested in creating high performance web applications, you must ensure that you don’t give up control over the features that you need to make your web application run as fast as possible.

This is the first in a series of articles about performance questions you should ask a hosting provider. While hosting providers do offer dedicated hosting (where your application runs on a single machine all by itself), the vast majority of people choose shared hosting environments. While we will be referencing hosting services that use the Apache web server, all of the advice in this series is applicable to Windows hosting as well.

Without a doubt the first and most important question you should ask a hosting provider is:

“What Control Do I Have Over Web Server Configuration?”

This question is critical. Many of the easiest and most impactful performance improvements you can make to your web application, such as HTTP compression and caching, are configured at the web server level. You should start off by asking what modules are installed already. The Apache modules most relevant to performance are:

  • mod_deflate (for Apache 2) or mod_gzip (for Apache 1) – This module enables HTTP compression.
  • mod_expires – This module enables HTTP caching.
  • mod_rewrite – This module enables on-the-fly URL rewriting which is very helpful when maintaining and updating resources while using far future caching.

All of these modules are installed with the typical default installation of Apache. While this depends on the platform and the distribution they are almost always present by default. Sometimes web hosting companies will compile their own version of Apache from source to maximize performance for their particular server machines. Often they will remove modules to save space and time. If you find a hosting provider like this explain to them that you would like these modules installed. Tell them this is a reasonable request as these modules are part of the default installation of Apache. You should be able to convince them to turn these modules on for you. If not, this is a deal breaker and you should not use that hosting provider. The vast majority of hosting providers offer these modules even at the lowest pricing tiers.

Even if the hosting provider offers these modules, you should ask them for a list of all available modules as well as what their policy is for enabling new modules. While mod_deflate, mod_expires, and mod_rewrite are the most helpful modules from a performance point of view, there might be other modules, such as mod_cband or mod_bw, that you might want to use for performance reasons.

Once you know what you can configure on the web server your next question should be “how do I configure it?” In most shared hosting environments you will not have access to the main Apache configuration file httpd.conf but usually can control the web server through the use of .htaccess files. This is the best solution since it allows you to directly configure the web server. You simply edit the .htaccess file in the root directory for your web application and upload it to the hosting provider.

Some hosting providers supply you with a web interface to control web server configuration, typically through a web administration system like cPanel. If this is the case ask to see examples of the interface. It could be simply a web form that allows you to edit a raw .htaccess file. It could be a more structured web interface with check boxes to turn on modules or forms to add new rules. Be very wary of any type of web-based server configuration. The interface will limit what you are able to configure. If a web interface is available ask if you can still manually upload your own .htaccess file to control the web server. If you cannot do this your ability to configure the web server will be severely limited. If the web interface does not provide the functionality you need you should not use that hosting provider. In general you should not use hosting providers that only offer web-based server configuration.

Bad Idea: Hacking Around Limits

Some developers like to point out that you can use server side application logic to compress content or implement caching for static resources like images or JavaScript files or CSS files. This means you don’t have to have access to the web server to configure things like HTTP compression or caching. Unfortunately this actually hurts performance more than it helps! With this method, PHP (or some other application logic layer) is invoked for all requests. Remember that the vast majority of requests are for static content and do not hit the application layer. The overhead of invoking PHP dozens if not hundreds of times for a page load removes any performance benefit of compressing or caching. We will explore this method more in a future post. For now you should completely avoid it. Never use application code to hack around the blatant shortcomings of a hosting provider.

Remember when choosing a hosting provider the single most important performance question you can ask is “how do I configure the web server?” In our next post we will explore more performance questions you should ask when choosing your hosting provider.