Failing to define a character set, or defining it in the wrong place, can cause poor performance for your website’s visitors. In this post we will discuss character sets and how best to define them to avoid web performance problems.
At their core, HTML documents are just a series of bytes. The character set (or charset) for an HTML document tells your web browser how it should process those bytes to construct characters. The browser then interprets those characters to render the web page. The two most common ways to tell the web browser what charset to use for an HTML page are specifying it in the HTTP Content-Type header and using a <META> tag to emulate an HTTP Content-Type header. When the web content author is the same person as the web server administrator, it is possible to configure the web server directly to send the appropriate charset for the appropriate URLs. In this world of virtual hosts, Content Management Systems, and blogs, this is rarely the case anymore. As a result, more and more web developers are using <META> tags to define the charset for HTML documents.
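To make the "series of bytes" point concrete, here is a minimal Python sketch. The sample word is an arbitrary assumption chosen for illustration; it is not from any real page. It shows the same bytes turning into different characters depending on which charset is used to read them.

```python
# The same bytes, read with two different charsets.
raw = "café".encode("utf-8")          # 5 bytes: the "é" takes two bytes in UTF-8

as_utf8 = raw.decode("utf-8")         # read with the charset the author intended
as_latin1 = raw.decode("iso-8859-1")  # read with a wrongly assumed charset

print(len(raw))    # 5
print(as_utf8)     # café
print(as_latin1)   # cafÃ©
```

The bytes never change; only the interpretation does, which is why the browser needs to know the charset before it can trust what it has read.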
This leads to a chicken-and-egg problem. The HTML document contains text which tells the browser how to read the document. Hmmm. So how does the browser read the document without a charset? While it varies with browser and version, most assume a Latin alphabet charset such as US-ASCII or ISO-8859-1 (also known as Latin-1). The browser then reads the HTML document using this charset, scanning for charset information. At this point one of three things happens:
- There is no <META> tag with charset information.
- There is a <META> tag with a charset and it’s what the browser guessed.
- There is a <META> tag with a charset, but it’s a different charset than the browser guessed.
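The three outcomes above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify`, the regular expression, and the default guess of ISO-8859-1 are all hypothetical, and real browsers use a considerably more elaborate scan than a single regex.

```python
import re

# Hypothetical, simplified pattern: finds charset=... inside a <META> tag.
# It works on raw bytes because the tag itself is plain ASCII in the
# Latin-alphabet charsets a browser is likely to guess.
META_CHARSET = re.compile(rb"""<meta[^>]*charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)

def classify(document: bytes, guessed: str = "iso-8859-1") -> str:
    """Return which of the three cases applies to this document."""
    match = META_CHARSET.search(document)
    if match is None:
        return "no charset declared"
    declared = match.group(1).decode("ascii").lower()
    if declared == guessed.lower():
        return "declared charset matches the guess"
    return "declared charset differs from the guess"

print(classify(b"<html><head><title>Hi</title></head></html>"))
print(classify(b'<head><meta charset="ISO-8859-1"></head>'))
print(classify(b'<head><meta http-equiv="Content-Type" '
               b'content="text/html; charset=utf-8"></head>'))
```

Only the third case forces the browser to throw away work it has already done.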
If there is no charset information, the browser is in an odd position. At this point most browsers attempt some type of charset detection. With several years of web security experience, believe me when I tell you that this is an awesome idea in theory but a horrible idea in practice. Web browsers or servers trying to “fix” broken data is the root of a number of nasty web security vulnerabilities (such as UTF-7 XSS attacks and various other injection evasions). Regardless, having no charset information of any kind forces the browser to do more processing, which can produce a very small performance hit at best and a hacked website at worst.
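To illustrate why charset guessing is dangerous, here is a small Python sketch of the idea behind a UTF-7 XSS attack. The payload is a made-up example, and the sketch only shows how the decoding behaves; it does not model any particular browser’s detection logic.

```python
# A hypothetical payload containing no "<" or ">" bytes at all, so a
# filter that looks for angle brackets would let it through.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

as_ascii = payload.decode("ascii")  # what a naive filter sees
as_utf7 = payload.decode("utf-7")   # what a client that guesses UTF-7 sees

print(as_ascii)  # +ADw-script+AD4-alert(1)+ADw-/script+AD4-
print(as_utf7)   # <script>alert(1)</script>
```

If the page declares its charset explicitly, the client has no reason to guess, and the payload stays harmless text.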
If there is a <META> tag whose charset is the same as what the browser guessed there is no issue. Nothing else needs to occur.
If there is a <META> tag and it specifies a charset different from the assumed charset, the browser has a problem. It has already interpreted some amount of the HTML document, but it did so using the wrong charset. That information is all bad. The document needs to be reprocessed using the correct charset. So at best you are talking about a small performance penalty, as the browser has to reparse the beginning of the HTML document.
Note: it appears that Firefox does not try to re-request a URL if the change in the charset did not change the meaning of the URL. If you modify the 2nd example to request “abc.gif” it does not appear that Firefox fetches this twice. More testing is needed here.
So there you have it. Browsers take a performance hit of varying severity when you fail to specify the charset near the very top of your HTML document. Always make sure to include some type of character set information so the browser does not waste time auto-detecting one. This can slightly help performance and avoid security vulnerabilities.
If you are using <META> tags to specify the character set information of your web pages, make sure to place it as high in the <HEAD> of your HTML document as possible. The W3C standard specifically mentions this problem and solution. For Firefox, you only need 2048 characters before the <META> charset tag to cause this problem. A <SCRIPT> tag, a <STYLE> tag, an HTML comment, or even a <META> description tag or long <META> keywords tag can easily consume 2048 bytes. While other browsers may be more tolerant and allow a larger window, they would still take the performance hit of having to reparse the byte stream. For these reasons Zoompf recommends you place the <META> charset tag as the first element inside the <HEAD> of your HTML document to avoid any performance problems.
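If you want to check your own pages, the recommendation can be turned into a rough Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the regex is a simplification that does not skip comments or scripts, and the 2048 figure is simply the Firefox number quoted above, treated conservatively as "the declaration must finish within the first 2048 bytes".

```python
import re

FIREFOX_WINDOW = 2048  # figure quoted above for Firefox; other browsers may differ

# Simplified pattern: a <META> tag that contains a charset declaration.
META_CHARSET = re.compile(rb"<meta[^>]*charset\s*=", re.IGNORECASE)

def charset_offset(document: bytes):
    """Byte offset where the charset declaration ends, or None if absent."""
    match = META_CHARSET.search(document)
    return None if match is None else match.end()

def declared_early_enough(document: bytes, window: int = FIREFOX_WINDOW) -> bool:
    """True if a <META> charset declaration falls inside the first `window` bytes."""
    offset = charset_offset(document)
    return offset is not None and offset <= window

good = b'<html><head><meta charset="utf-8"><title>t</title></head></html>'
bad = (b"<html><head><!-- " + b"x" * 3000 +
       b' --><meta charset="utf-8"></head></html>')

print(declared_early_enough(good))  # True
print(declared_early_enough(bad))   # False
```

In the second document, a single long comment is enough to push the declaration out of the window.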
Want to see what performance problems you have? A misplaced <META> charset tag is just one of the 200+ performance issues Zoompf detects while assessing your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!