March 8, 2010

META Refresh Nullifies Caching for IE6 and IE7

There has been some interesting discussion recently on the mailing list for Google’s Page Speed performance tool. Brian Brophy rediscovered a critical performance bug in Internet Explorer that Joseph Smarr had found nearly 3 years ago. Both Internet Explorer 6 and 7 are affected by this bug. IE8 is not affected.

To summarize, the bug is this: When a site uses a <META> refresh tag to send the visitor to a URL, IE6 and IE7 treat that as if the user had clicked the “Refresh” or “Reload” button on the browser. This means IE does not use any items that are in the cache and instead re-requests everything on that page. In short, for IE6 and IE7, a <META> refresh will nullify any HTTP caching.

It’s best to see an example. Let’s say we have a page, start.html, which contains a <META> refresh tag that redirects to main.html. The <META> refresh tag looks like this: <META http-equiv="Refresh" content="0;main.html">. Let’s say main.html has 3 images on it. All of those images are served with a far future Expires header. This means repeat visitors should have all 3 images referenced by main.html cached. Here is what happens:

  • The visitor clicks a link to start.html.
  • start.html uses a <META> refresh to send the visitor to main.html.
  • Visitor’s IE browser fetches main.html.
  • Visitor’s IE browser does not use the cached images. Instead it sends 3 conditional GET requests to the web server for the 3 images with If-Modified-Since headers.
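Pages like start.html can be found mechanically by looking for the <META> refresh tag. The following is only a minimal Python sketch of that idea, not how any particular tool implements it; the class and function names are hypothetical and chosen for illustration.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects the content attribute of every <META http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").strip().lower() == "refresh":
            self.refreshes.append(attributes.get("content") or "")


def find_meta_refreshes(html_text):
    """Return the content values of all META refresh tags in html_text."""
    finder = MetaRefreshFinder()
    finder.feed(html_text)
    return finder.refreshes


# The start.html page from the example above (hypothetical minimal markup).
start_html = '<html><head><META http-equiv="Refresh" content="0;main.html"></head></html>'
print(find_meta_refreshes(start_html))  # ['0;main.html']
```

An empty result means the page contains no META refresh at all.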

There were already several reasons not to use a <META> tag to perform a refresh. Zoompf Check #99 (one of the first checks we wrote) flags web pages that use a <META> tag for redirects. Originally we flagged META refreshes because they are a bloated and oversized solution, and because of all the problems <META> refreshes cause with web crawlers and accessibility. Zoompf’s remediation advice was to use an HTTP redirect, and we flagged this as a low severity issue. In light of these IE performance problems, we have changed the severity to high (which is the same severity as not using caching at all).
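For readers who have not set one up before, an HTTP redirect replaces the start.html page with a response that carries a Location header. Below is a minimal sketch written as a Python WSGI application. The REDIRECTS table and the redirect_app name are assumptions made for illustration; in practice the redirect is usually a one-line rule in the web server's own configuration.

```python
# Hypothetical mapping of old paths to their new locations.
REDIRECTS = {"/start.html": "/main.html"}


def redirect_app(environ, start_response):
    """WSGI application that answers known paths with a permanent redirect."""
    target = REDIRECTS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not found"]
    # The browser follows the Location header as a normal navigation.
    start_response(
        "301 Moved Permanently",
        [("Location", target), ("Content-Length", "0")],
    )
    return [b""]
```

To try it locally, the application can be handed to make_server from Python's standard wsgiref.simple_server module.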

Want to see what performance problems your website has? META Refresh Tag Used As Redirect is just one of the 300+ web performance issues Zoompf detects when scanning your web applications. Get your instant free web performance assessment at Zoompf.com today!

January 28, 2010

Cruft inside Microsoft Word HTML files

We were recently on-site with a client helping them fix some issues when we happened to see this directory containing some HTML files.

Well that’s odd. Why do some of those HTML files have one icon and different HTML files have another icon? We examined the source code for one of the HTML files with the odd icon and saw this:

Turns out these HTML files were created by Microsoft Word! Because of a series of different web designs and designers over a number of years, as well as a healthy bit of editing by the marketing department, 1 in 4 web pages of our client’s current website were created or modified using Microsoft Word!

As we scrolled through the HTML file we saw large amounts of extra data that no normal web browser would ever interpret. A little research explained it for us. Microsoft allows you to save a document as an HTML file. They also want you to be able to open an HTML file that was created using Microsoft Office and resume editing it just like a normal document. Since Microsoft Office has all sorts of features that HTML and CSS don’t, this extra data allows Office to preserve certain information inside the HTML file between edits.

Some of the data stored is obvious: when the document was created and by whom, who made what edits when, paragraph count, etc. Other less obvious data such as VML, DHTML behaviors, column and page spacing, Word styling information, embedded object data, and more is also stored inside the file. All of this Office-specific data is stored inside the HTML file and is wrapped inside of special conditional comments such as <!--[if gte mso 9]>. This hides the content from other programs that read the HTML. Furthermore, Word isn’t the only Office program that inserts this extra data into HTML files. Excel does too.

Keep in mind we are not talking about the general bloat that WYSIWYG HTML editors tend to add. Bloat such as empty <P> tags, large numbers of &nbsp; entries, table-based layouts, and overly long style attributes are all hallmarks of WYSIWYG editors. However, this is beyond that. This is extra data that is used exclusively by Office and is completely ignored by all web browsers that don’t support conditional comments (in other words, any program besides Internet Explorer). In fact, the data is ignored by Internet Explorer as well, since the conditional comments apply only to Microsoft Office and not to any version of IE.

So we have a bunch of useless cruft inside of these HTML files. Not a big deal right? Unfortunately all this useless data has a cost. Of the files we sampled we found that 20-35% of the HTML content was Microsoft Office specific data. That means 20-35% of the bytes going down the pipe to a user are completely wasted for these files.

Cleaning Up the Cruft

Luckily Word includes an option that allows you to save a filtered HTML file. A filtered HTML file will not contain any of this useless Microsoft Office specific data. Under “Save As” you want to select the “Web Page, Filtered” option as shown below.

If you don’t happen to have a copy of Office around (or you have a few hundred HTML files to clean) you can still remove this useless content. Since all of this extra data is stored in conditional comments that are looking for the “mso” user agent you can easily write a regular expression to remove it. In fact you should create a script that detects and removes this extra data and include it as part of your publishing process.
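As a starting point for such a script, here is a minimal Python sketch. It assumes the Office data always sits in comments of the form <!--[if ... mso ...]> ... <![endif]-->, as in the example above; the function names are hypothetical, and other Office residue (for instance mso- style properties) is deliberately left alone.

```python
import re

# Matches conditional comments whose condition mentions "mso",
# e.g. <!--[if gte mso 9]> ... <![endif]-->
MSO_COMMENT = re.compile(
    r"<!--\[if[^\]]*\bmso\b[^\]]*\]>.*?<!\[endif\]-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_office_data(html_text):
    """Return html_text with all Office-only conditional comments removed."""
    return MSO_COMMENT.sub("", html_text)


def wasted_fraction(html_text):
    """Fraction of the document's bytes taken up by Office-only comments."""
    original = len(html_text.encode("utf-8"))
    cleaned = len(strip_office_data(html_text).encode("utf-8"))
    return (original - cleaned) / original if original else 0.0


sample = (
    "<p>Hello</p>"
    "<!--[if gte mso 9]><xml><w:WordDocument></w:WordDocument></xml><![endif]-->"
    "<p>World</p>"
)
print(strip_office_data(sample))  # <p>Hello</p><p>World</p>
```

Conditional comments aimed at Internet Explorer itself, such as <!--[if IE 6]>, do not mention mso and are not touched. Running wasted_fraction over a directory of files gives the kind of percentage quoted above.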

But I Would Never edit HTML with Word!

I’m sure that you wouldn’t. But do you create all the content for your organization’s websites? Do you hand vet every piece of content before it goes out the door? At Zoompf, we have clients with over a hundred web properties, produced by hundreds of individual content providers, both internal and external, who report into dozens of different departments. You better believe stuff like this slips through the cracks all the time.

There is also a huge install base and large user base of WYSIWYG HTML editors. Microsoft just sold $4.75 billion worth of Office in Q2 of Fiscal Year 2010 alone! Adobe’s Creative Suite with DreamWeaver is wildly popular, as are other WYSIWYG tools. And that is not to mention the 15 years or so of legacy content on the Internet already that was written using who knows what kind of tool or coding standard.

So yes, the ideal should be “we should never write bloated web pages.” However the reality is “this happens and we need tools and processes to ensure we do not publish bloated web pages.” Checking for bloated web pages produced by tools like Microsoft Word is part of what web performance optimization is about.

Want to see what performance problems your website has? Finding unfiltered Microsoft Office HTML documents is just one of the 200+ performance issues Zoompf detects when checking your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!

December 3, 2009

Browser Performance Issues with Charsets

Failing to define a character set, or defining it in the wrong place, can cause poor performance for your website’s visitors. In this post we will discuss character sets and how best to define them to avoid web performance problems.

At their core, HTML documents are just a series of bytes. The character set (or charset) for an HTML document tells your web browser how it should process those bytes to construct characters. The browser then interprets those characters to render the web page. The 2 most common ways to tell the web browser what charset to use for an HTML page are by specifying it in the HTTP Content-Type header or by using a <META> tag to emulate an HTTP Content-Type header. When the web content author is the same person as the web server administrator it is possible to directly configure the web server to use the appropriate charset for the appropriate URLs. In this world of virtual hosts, Content Management Systems, and blogs this is rarely the case anymore. As such more and more web developers are using <META> tags to define the charset for HTML documents.
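The two forms carry the same value: the HTTP header reads Content-Type: text/html; charset=utf-8, and the tag form is <META http-equiv="Content-Type" content="text/html; charset=utf-8">. As a small illustration, the Python sketch below pulls the charset out of such a value using the standard library's email.message module; the function name is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() lowercases the value and returns None if absent.
    return message.get_content_charset()


print(charset_from_content_type("text/html; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/html"))  # None
```

Because the <META> tag's content attribute uses the same syntax as the header, the same function works for both.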

This leads to a Chicken-and-the-Egg problem. The HTML document contains text which tells the browser how to read the document. Hmmm. So how does the browser read the document without a charset? While it varies with browser and version, most assume a Latin alphabet charset like US-ASCII or ISO-8859-1 (Latin-1). The browser then reads the HTML document using this charset, scanning for charset information. At this point one of three things happens:

  1. There is no <META> tag with charset information.
  2. There is a <META> tag with a charset and it’s what the browser guessed.
  3. There is a <META> tag with a charset, but it’s a different charset than the browser guessed.
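The three outcomes can be sketched in a few lines of Python. This is a simplified illustration, not what any browser actually does: the pattern is deliberately loose, the assumed default is hard-coded to iso-8859-1, and charset aliases (for example latin1) are not normalized. All names are hypothetical.

```python
import re

# Loose pattern covering both <meta charset="x"> and
# <meta http-equiv="Content-Type" content="text/html; charset=x">.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE
)


def classify(document_bytes, assumed="iso-8859-1"):
    """Say which of the three cases applies to the raw bytes of a page."""
    match = META_CHARSET.search(document_bytes)
    if match is None:
        return "no charset: browser must auto-detect"
    declared = match.group(1).decode("ascii").lower()
    if declared == assumed:
        return "match: nothing else to do"
    return "mismatch: reparse as " + declared


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1251"></head></html>'
print(classify(page))  # mismatch: reparse as windows-1251
```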

If there is no charset information the browser is in an odd position. At this point most browsers attempt some type of charset detection. Having spent several years in web security, believe me when I tell you that in theory this is an awesome idea but in practice it is a horrible one. Web browsers or servers trying to “fix” broken data is the root of a number of nasty web security vulnerabilities (such as UTF-7 XSS attacks and various other injection evasions). Regardless, no charset information of any kind forces the browser to do more processing, which can produce a very small performance hit at best and a hacked website at worst.

If there is a <META> tag whose charset is the same as what the browser guessed there is no issue. Nothing else needs to occur.

If there is a <META> tag and it specifies a charset different than the assumed charset, the browser has a problem. It has already interpreted some amount of the HTML document, but it did so using the wrong charset. That information is all bad. The document needs to be reprocessed using the correct charset. So right now at best you are talking about a small performance penalty as the browser has to reparse the beginning of the HTML document.

But it can get worse! This is because browsers don’t scan the entire HTML document looking for a charset. They want to start rendering content! If they don’t see a charset defined “near the top” of the HTML document they start rendering content and executing JavaScript using the assumed charset. (“Near the top” varies from browser to browser, which we will discuss in a minute.) But once the browser gets going interpreting and executing content and then finds a <META> tag with charset information, it’s in a real bind. Because now it has already been executing code, requesting other resources, and rendering content using the wrong charset! Those URLs could be wrong, that JavaScript could have syntax errors, or the CSS rules could be misspelled, all because the browser read them using the wrong charset information.

“Near the top” for Firefox 3.5 means within the first 2048 bytes. If Firefox does not detect charset information in the first 2048 bytes of an HTML document (and no charset was defined in the HTTP headers) it starts rendering the page and executing script using an assumed charset (I did not investigate other browsers). Consider this example web page adapted from a Simon Pieters test case. It contains some JavaScript, whitespace, and starting just after 2048 bytes, a <META> tag defining the charset. In Firefox the JavaScript runs and pops an alert box showing a Euro sign. After 2048 bytes there is a <META> tag changing the charset from the assumed Latin-1. Firefox has to reprocess and re-render the page, which executes the JavaScript again, with a Cyrillic character appearing in the alert box this time.
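A page can be checked against this window before it is published. The Python sketch below is a rough illustration only: it assumes the <META> charset declaration must fall entirely inside the first 2048 bytes (the Firefox 3.5 figure above), uses a loose pattern rather than a real HTML parser, and its names are hypothetical.

```python
import re

# Loose pattern for the start of a charset declaration inside a <META> tag.
META_CHARSET = re.compile(rb"<meta[^>]*?charset\s*=", re.IGNORECASE)
FIREFOX_WINDOW = 2048  # bytes, the Firefox 3.5 figure quoted above


def charset_declared_in_time(document_bytes, window=FIREFOX_WINDOW):
    """True if a <META> charset declaration ends within the first `window` bytes."""
    match = META_CHARSET.search(document_bytes)
    return match is not None and match.end() <= window


early = b'<html><head><meta charset="utf-8"><title>x</title></head></html>'
late = b"<html><head><!--" + b" " * 2048 + b'--><meta charset="utf-8"></head></html>'
print(charset_declared_in_time(early))  # True
print(charset_declared_in_time(late))   # False
```

A page with no <META> charset at all also returns False, which is the right answer only when the charset is not supplied in the HTTP headers either.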

It is also interesting what the browser does if it has already made a request. If Firefox has already requested a URL and then detects a new charset, the URL must be re-requested. Consider this example page. Here JavaScript makes a request for a nonexistent image from www.google.com (we include the alert box to create a delay in this simple test case to ensure Firefox has already started fetching the resource). The URL contains a character that changes based on the charset, so it must be re-requested. Using an HTTP proxy we see the browser made 2 requests to 2 different URLs (with URL encoding to encode the characters being sent).

Note: it appears that Firefox does not try to re-request a URL if the change in charset did not change the meaning of the URL. If you modify the 2nd example to request “abc.gif” it does not appear that Firefox fetches this twice. More testing is needed here.
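To see why the same bytes can turn into two different URLs, here is a small Python sketch. It is an illustration under assumptions, not a reproduction of the test page: we assume the assumed charset behaves like windows-1252 (cp1252) and the declared one is windows-1251 (cp1251), and we percent-encode as UTF-8, which is only one of the ways a browser may encode a URL.

```python
from urllib.parse import quote

raw = b"\x80"  # a single non-ASCII byte sitting inside a URL in the page source

as_western = raw.decode("cp1252")   # assumed charset: the Euro sign
as_cyrillic = raw.decode("cp1251")  # declared charset: a Cyrillic letter

# The two characters differ, so the percent-encoded URLs differ too.
print(quote(as_western))   # %E2%82%AC
print(quote(as_cyrillic))  # %D0%82

# Pure ASCII bytes decode identically under both charsets, so the URL is unchanged.
print(b"abc.gif".decode("cp1252") == b"abc.gif".decode("cp1251"))  # True
```

This is consistent with the observation in the note above: a URL such as “abc.gif” means the same thing under either charset, so there is nothing to re-request.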

So there you have it. Browsers take a performance hit of varying severity when you fail to specify the charset near the very top of your HTML document. Always make sure to include some type of character set information so the browser does not waste time auto-detecting one. This can slightly help performance and avoid security vulnerabilities. If you are using <META> tags to specify the character set information of your web pages, make sure to place it as high in the <HEAD> of your HTML document as possible. The W3C standard specifically mentions this problem and solution. For Firefox, you only need 2048 bytes before the <META> charset tag to cause this problem. A <SCRIPT> tag, a <STYLE> tag, an HTML comment, or even a <META> description tag or long <META> keywords tag can easily consume 2048 bytes. While other browsers may be more tolerant and allow a larger window, they would still take the performance hit of having to reparse the byte stream. For these reasons Zoompf recommends you place the <META> charset tag as the first element inside of the <HEAD> of your HTML document to avoid any performance problems.

Want to see what performance problems you have? An appropriately placed <META> charset tag is just one of the 200+ performance issues Zoompf detects while assessing your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!