
Zoompf's Web Performance Blog

Note: Archived Content

This is the archived version of the Zoompf blog. Since our acquisition by Rigor, all our new research and posts on web performance are being published on The Rigor Blog

Browser Performance Issues with Charsets

 Billy Hoffman on December 3, 2009. Category: Optimization

Failing to define a character set, or defining it in the wrong place, can cause poor performance for your website’s visitors. In this post we will discuss character sets and how best to define them to avoid web performance problems.

At their core, HTML documents are just a series of bytes. The character set (or charset) for an HTML document tells your web browser how it should process those bytes to construct characters. The browser then interprets those characters to render the web page. The two most common ways to tell the web browser what charset to use for an HTML page are specifying it in the HTTP Content-Type header and using a <META> tag to emulate an HTTP Content-Type header. When the web content author is also the web server administrator, it is possible to configure the web server directly to send the appropriate charset for each URL. In this world of virtual hosts, Content Management Systems, and blogs, this is rarely the case anymore. As such, more and more web developers are using <META> tags to define the charset for HTML documents.
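To make the "series of bytes" point concrete, here is a minimal Python sketch (not part of the original post). It decodes the same two bytes under three charsets. The byte values and the choice of charsets are illustrative assumptions, and Python's codec names stand in for the charset labels a browser would see.

```python
# The same two bytes, interpreted under three different charsets.
# The bytes and the charset list are illustrative choices only.
raw = b"\xc3\xa9"

interpretations = {
    charset: raw.decode(charset)
    for charset in ("utf-8", "iso-8859-1", "windows-1251")
}

for charset, text in interpretations.items():
    # The bytes never change; only the characters built from them do.
    print(f"{charset:>12}: {text!r} ({len(text)} character(s))")
```

The bytes on the wire are identical in all three cases. Only the charset decides which characters the browser ends up with.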

This leads to a chicken-and-egg problem. The HTML document contains text which tells the browser how to read the document. Hmmm. So how does the browser read the document without a charset? While it varies with browser and version, most assume a Latin alphabet charset like US-ASCII or ISO-8859-1 (also known as Latin-1). The browser then reads the HTML document using this charset, scanning for charset information. At this point one of three things happens:

  1. There is no <META> tag with charset information.
  2. There is a <META> tag with a charset and it’s what the browser guessed.
  3. There is a <META> tag with a charset, but it’s a different charset than the browser guessed.

If there is no charset information the browser is in an odd position. At this point most browsers attempt some type of charset detection. After several years of working in web security, believe me when I tell you that this is an awesome idea in theory but a horrible idea in practice. Web browsers or servers trying to “fix” broken data is the root of a number of nasty web security vulnerabilities (such as UTF-7 XSS attacks and various other injection evasions). Regardless, the absence of any charset information forces the browser to do more processing, which can produce a very small performance hit at best and a hacked website at worst.
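For readers who have not seen a UTF-7 XSS payload, the following Python sketch (an illustration, not taken from the original post) shows why charset guessing is dangerous. The payload string is a made-up example. The same bytes contain no angle brackets when read as Latin-1, yet decode to a script tag when read as UTF-7.

```python
# Hypothetical payload: plain ASCII bytes with no "<" or ">" in them,
# so a filter that assumes an ASCII-compatible charset sees nothing to escape.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

as_latin1 = payload.decode("iso-8859-1")  # what the filter assumed
as_utf7 = payload.decode("utf-7")         # what a guessing browser might pick

print("Latin-1:", as_latin1)
print("UTF-7:  ", as_utf7)
```

Declaring the charset explicitly takes the guess, and with it this whole class of surprise, out of the browser's hands.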

If there is a <META> tag whose charset is the same as what the browser guessed there is no issue. Nothing else needs to occur.

If there is a <META> tag and it specifies a charset different from the assumed charset, the browser has a problem. It has already interpreted some amount of the HTML document, but using the wrong charset. That information is all bad. The document needs to be reprocessed using the correct charset. So at best you are looking at a small performance penalty as the browser has to reparse the beginning of the HTML document.

But it can get worse! Browsers don’t scan the entire HTML document looking for a charset. They want to start rendering content! If they don’t see a charset defined “near the top” of the HTML document, they start rendering content and executing JavaScript using the assumed charset (“near the top” varies from browser to browser, which we will discuss in a minute). But once the browser gets going interpreting and executing content and then finds a <META> tag with charset information, it’s in a real bind, because it has already been executing code, requesting other resources, and rendering content using the wrong charset! Those URLs could be wrong, that JavaScript could have syntax errors, or the CSS rules could be misspelled, all because the browser read them using the wrong charset information.

“Near the top” for Firefox 3.5 means within the first 2048 bytes. If Firefox does not detect charset information in the first 2048 bytes of an HTML document (and no charset was defined in the HTTP headers), it starts rendering the page and executing script using an assumed charset (I did not investigate other browsers). Consider this example web page adapted from a Simon Pieters test case. It contains some JavaScript, whitespace, and, starting just after 2048 bytes, a <META> tag defining the charset. Firefox executes the JavaScript and pops up an alert box showing a Euro sign. It then reaches the <META> tag, which changes the charset from the assumed Latin-1. Firefox has to reprocess and re-render the page, which executes the JavaScript again, this time with a Cyrillic character appearing in the alert box.
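If you want to check your own pages against that 2048-byte window, a rough Python sketch along these lines may help. It is only an approximation of what a browser does: the regular expression is a simplification that does not skip comments or script bodies, and the function names are hypothetical. The window size is the Firefox 3.5 figure reported above and may not apply to other browsers.

```python
import re

# Window reported above for Firefox 3.5; other browsers may use another size.
PRESCAN_BYTES = 2048

# Simplified pattern: a charset=... value somewhere inside a <meta ...> tag.
META_CHARSET = re.compile(
    rb"<meta\b[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)

def find_meta_charset(html: bytes):
    """Return (charset, end_offset) for the first <meta> charset, or None."""
    match = META_CHARSET.search(html)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower(), match.end()

def declared_in_window(html: bytes, window: int = PRESCAN_BYTES) -> bool:
    """True if the charset declaration ends within the first `window` bytes."""
    found = find_meta_charset(html)
    return found is not None and found[1] <= window

if __name__ == "__main__":
    # A made-up page that pushes the <meta> tag past the window with padding.
    page = b"<html><head><script>" + b" " * 2048 + b"</script>"
    page += b'<meta charset="utf-8"></head></html>'
    print(find_meta_charset(page), declared_in_window(page))
```

Because the check works on raw bytes rather than decoded text, it mirrors the situation the browser is in before it knows the charset.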

It is also interesting what the browser does if it has already made a request. If Firefox has already requested a URL and then detects a new charset, the URL must be re-requested. Consider this example page. Here JavaScript makes a request for a nonexistent image from www.google.com (we include the alert box to create a delay in this simple test case to ensure Firefox has already started fetching the resource). The URL contains a character that changes based on the charset, so it must be re-requested. Using an HTTP proxy we see the browser made 2 requests to 2 different URLs (with URL encoding to encode the characters being sent).


Note: it appears that Firefox does not try to re-request a URL if the change in charset did not change the meaning of the URL. If you modify the 2nd example to request “abc.gif” it does not appear that Firefox fetches this twice. More testing is needed here.
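The two-request behavior can be approximated in a few lines of Python. This is a sketch under stated assumptions, not a reproduction of the test page: it assumes the image path contains a single 0x80 byte, that the guessed charset is windows-1252 and the declared one is windows-1251, and that the decoded path is percent-encoded as UTF-8. The path and the helper name are hypothetical.

```python
from urllib.parse import quote

def requested_url(path_bytes: bytes, charset: str) -> str:
    """Decode the raw path with `charset`, then percent-encode it as UTF-8."""
    return quote(path_bytes.decode(charset))

# Hypothetical image path whose one non-ASCII byte depends on the charset.
raw_path = b"/img/\x80.gif"
first = requested_url(raw_path, "windows-1252")   # guessed charset
second = requested_url(raw_path, "windows-1251")  # charset from the <META> tag
print(first, second, first != second)

# A pure-ASCII path means the same thing under both charsets.
plain = b"/abc.gif"
same = requested_url(plain, "windows-1252") == requested_url(plain, "windows-1251")
print(same)
```

Under these assumptions the first path produces two different URLs, while the pure-ASCII path produces the same URL both times, which is consistent with the “abc.gif” observation in the note above.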

So there you have it. Browsers take a performance hit of varying severity when you fail to specify the charset near the very top of your HTML document. Always make sure to include some type of character set information so the browser does not waste time auto-detecting one. This can slightly help performance and avoid security vulnerabilities.

If you are using <META> tags to specify the character set information of your web pages, make sure to place them as high in the <HEAD> of your HTML document as possible. The W3C standard specifically mentions this problem and solution. For Firefox, you only need 2048 bytes before the <META> charset tag to cause this problem. A <SCRIPT> tag, a <STYLE> tag, an HTML comment, or even a <META> description tag or long <META> keywords tag can easily consume 2048 bytes. While other browsers may be more tolerant and allow a larger window, they would still take the performance hit of having to reparse the byte stream. For these reasons Zoompf recommends you place the <META> charset tag as the first element inside the <HEAD> of your HTML document to avoid any performance problems.

Want to see what performance problems you have? An appropriately placed <META> charset tag is just one of the 200+ performance issues Zoompf detects while assessing your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!
