March 8, 2010

META Refresh Nullifies Caching for IE6 and IE7

There has been some interesting discussion recently on the mailing list for Google’s Page Speed performance tool. Brian Brophy rediscovered a critical performance bug in Internet Explorer that Joseph Smarr had found nearly 3 years ago. Both Internet Explorer 6 and 7 are affected by this bug. IE8 is not affected.

To summarize, the bug is this: When a site uses a <META> refresh tag to send the visitor to a URL, IE6 and IE7 treat that as if the user had clicked the “Refresh” or “Reload” button on the browser. This means IE does not use any items that are in the cache and instead re-requests everything on that page. In short, for IE6 and IE7, a <META> refresh will nullify any HTTP caching.

The word "META" written on a luggage tag

It’s best to see an example. Let’s say we have a page, start.html, which contains a <META> refresh tag that redirects to main.html. The <META> refresh tag looks like this: <META http-equiv="Refresh" content="0;main.html">. Let’s say main.html has 3 images on it. All of those images are served with a far future Expires header. This means repeat visitors should have all 3 images referenced by main.html cached. Here is what happens:

  • The visitor clicks a link to start.html.
  • start.html uses a <META> refresh to send the visitor to main.html.
  • Visitor’s IE browser fetches main.html.
  • Visitor’s IE browser does not use the cached images. Instead it sends 3 conditional GET requests to the web server for the 3 images with If-Modified-Since headers.
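To find pages like start.html before they hurt IE6 and IE7 visitors, you can scan HTML for the tag. Here is a minimal sketch using Python’s standard html.parser module. The names MetaRefreshFinder and find_meta_refresh are ours, made up for illustration, and this is not how Zoompf’s own check is implemented.

```python
from html.parser import HTMLParser


class MetaRefreshFinder(HTMLParser):
    """Collects the content attribute of every <meta http-equiv="refresh"> tag."""

    def __init__(self):
        super().__init__()
        self.refreshes = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "refresh":
            self.refreshes.append(attr.get("content") or "")


def find_meta_refresh(html):
    """Return the content values of all META refresh tags found in the HTML."""
    finder = MetaRefreshFinder()
    finder.feed(html)
    return finder.refreshes
```

Calling find_meta_refresh on the source of start.html returns a list with one entry per META refresh tag, and an empty list for a page that has none.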

There were already several reasons not to use a <META> tag to perform a refresh. Zoompf Check #99 (one of the first checks we wrote) flags web pages that use a <META> tag for redirects. Originally we flagged META refreshes because they are a bloated and oversized solution, and because of all the problems <META> refreshes cause with web crawlers and accessibility. Zoompf’s remediation advice was to use an HTTP redirect and we flagged this as a low severity issue. In light of these IE performance problems, we have changed the severity to high (which is the same severity as not using caching at all).
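As a rough illustration of the HTTP redirect remediation, here is a small WSGI application in Python that answers a request for start.html with a redirect instead of serving a page containing a META refresh. The REDIRECTS table and the choice of a 301 status are assumptions for this sketch. A 302 may be more appropriate if the move is temporary.

```python
# Hypothetical mapping of old paths to their destinations.
REDIRECTS = {"/start.html": "/main.html"}


def app(environ, start_response):
    """WSGI app that redirects known paths with an HTTP 301 response."""
    target = REDIRECTS.get(environ.get("PATH_INFO", ""))
    if target is None:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"Not Found"]
    # The browser follows the Location header like a normal navigation,
    # so cached resources on the destination page can still be used.
    start_response(
        "301 Moved Permanently",
        [("Location", target), ("Content-Length", "0")],
    )
    return [b""]
```

To try it locally you could serve app with wsgiref.simple_server.make_server from the standard library. In production the same redirect would normally be configured in the web server itself.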

Want to see what performance problems your website has? META Refresh Tag Used As Redirect is just one of the 300+ web performance issues Zoompf detects when scanning your web applications. Get your instant free web performance assessment at Zoompf.com today!

Useless Duplicate Cookies

In our last post, where we described the 300 issues Zoompf checks your website for during its web performance assessment, we said that the #1 way we discover new web performance issues is simply looking at web responses. This story is a perfect example of how that actually happens. Today (in fact, about 2 hours ago) we were helping a client optimize their site when we noticed a rather long HTTP Set-Cookie header. This is what we saw:

Now that is rather difficult to look at. So we cleaned up the code, trimmed out the expires and path information for each cookie declaration, and aligned each cookie name/value pair on its own line. This is the clean version:

Set-Cookie:
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],
cisession=a%3A4%3A%7Bs%3A10%3A%22session_id... [snip],

As you can see, the web application is setting the cisession cookie 9 separate times! And every time it gets set to the very same value. Now each distinct cookie name can only have one value. The web browser will use the last declaration. So this response needlessly sets the cookie 8 times. The original Set-Cookie header’s value was 3681 bytes long. But when you remove the first 8 cisession cookie declarations and instead only have 1 cisession cookie, that size is reduced to 409 bytes, a reduction of 89%.

Well that’s a nice find. But then things got worse. This site used rotating cookie values where the value of the cookie changes on each and every page (this is often done in banking and e-commerce applications to mitigate session hijacking). In this case that meant every page generated by PHP had these 9 cookie declarations. By identifying and resolving this problem we helped the client take 3 kilobytes off every HTML response! Now that’s a really nice performance optimization!

Cause of the Issue

This client had an online store. To uniquely identify each visitor and provide them with a shopping cart the application code had to set a session identifier for the visitor. They had a single function which would verify the client had a session identifier and set the new appropriate value. This function was called 9 separate times in different parts of the code during page generation. However the function did not check to see if the session identifier had already been set for this cycle. It just appended on a new cookie declaration. So every time a page was generated, 9 cookie declarations would be added on to the HTTP response.
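One way to avoid this class of bug is to make the cookie-setting call idempotent within a single response, so that calling it again replaces the earlier declaration instead of appending another one. The sketch below is in Python rather than the client’s PHP, and the Response class and ensure_session function are hypothetical stand-ins, not the client’s actual code.

```python
class Response:
    """Toy response object. Real frameworks expose their own cookie APIs."""

    def __init__(self):
        self._cookies = {}  # cookie name -> value, one entry per name

    def set_cookie(self, name, value):
        # Overwrite rather than append, so repeated calls can never
        # produce more than one declaration per cookie name.
        self._cookies[name] = value

    def header_lines(self):
        """Render one Set-Cookie header line per distinct cookie name."""
        return [f"Set-Cookie: {name}={value}" for name, value in self._cookies.items()]


def ensure_session(response, session_id):
    """Hypothetical stand-in for the function that was called 9 times per page."""
    response.set_cookie("cisession", session_id)
```

With this shape, calling ensure_session nine times during page generation still yields a single Set-Cookie line in the response.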

This issue was hard to detect. Since the browser only uses the last declaration, HTTP requests back to the server only contain 1 cookie, not 9. For the same reason if you use a browser add-on to examine the stored cookies you will only see 1 cookie and not 9. In fact, we had to modify Zoompf’s code to detect this. The System.Net classes in Microsoft .NET were automatically collapsing the 9 redundant cookies into a single cookie. This means our code only saw one cookie as well.
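The key to detecting the problem is to look at the raw cookie declarations before any library collapses them. A minimal Python sketch of the counting step follows. It assumes you already have one string per declaration. With Python’s http.client, for example, response.headers.get_all("Set-Cookie") returns each header line separately. The function name duplicate_cookies is ours, and this is not Zoompf’s actual .NET code.

```python
from collections import Counter


def duplicate_cookies(set_cookie_values):
    """Return {cookie_name: count} for names declared more than once.

    set_cookie_values is a list of raw cookie declarations, one string
    per declaration, such as "cisession=abc; path=/".
    """
    names = Counter()
    for declaration in set_cookie_values:
        # The name=value pair always comes before the first ';'.
        name_value = declaration.split(";", 1)[0]
        name = name_value.split("=", 1)[0].strip()
        if name:
            names[name] += 1
    return {name: count for name, count in names.items() if count > 1}
```

For the response described above, this would report the cisession cookie with a count of 9.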

One-off Issue or Plague?

We wanted to see how prevalent the issue of Duplicate Cookies is. So we wrote some quick code and then re-analyzed approximately 700 web performance scans we had already performed on other websites to see who else had the issue. We found that 16 other websites, or around 2.5% of the websites we had assessed, had this issue. While it is by no means as common an issue as, say, Images without any caching information (Check #172), we were surprised at how common the issue is. Spot checking those 16 websites shows the same fundamental issue: the same cookie getting set to the same value multiple times in a single HTTP response. Again, this is most likely caused by repeated execution of the same function or code path which sets the cookie value.

Since it is a fairly easy mistake to make and is not a one-off issue, we decided to promote this to a full-fledged performance check. So we wrote Zoompf Check #316: Duplicate Cookies to detect this issue.

Want to see what performance problems your website has? Duplicate Cookies is just one of the 300+ web performance issues Zoompf detects when scanning your web applications. Get your instant free web performance assessment at Zoompf.com today!

March 5, 2010

Details of Zoompf Performance Issues

When talking with web developers and front end designers we almost always get asked these two questions: “Do you really have 300 Checks?” and “What performance issues does Zoompf look for?” (The answer to the first question is: No, we actually have 315. However not all of them are specifically for front-end performance issues). In this post we will detail what issues Zoompf detects while it assesses your web applications and help illustrate how Zoompf’s deep and broad analysis compares with other front-end performance tools.

(Of course, you can see how awesome Zoompf’s performance analysis is right now! Anyone can receive a free performance scan of their website)

To understand and appreciate the scope of Zoompf’s analysis it is helpful to create categories of different performance issues. This way we can discuss the typical performance issues that people test websites for and showcase all the additional issues Zoompf detects. In our reports, Zoompf groups performance issues into 4 broad categories based on the desired goal for solving each problem. The four performance categories Zoompf uses are:

  • Reducing Response Size
  • Reducing Request Count
  • Maximizing Browser Performance
  • Server and Miscellaneous Issues

Let’s examine what these categories mean and list examples of the performance issues that fall into each category.

Reducing Response Size

Reducing response size is all about minimizing the number of bytes that have to be pushed down the network pipe to the client. Typical examples of issues in this category are things like using HTTP compression, minifying CSS and JavaScript files, and crunching images. Other tools tend to check only for these obvious issues.

Zoompf however goes further to find more bloated content and unnecessary data that can be removed to reduce the size of a web response. Zoompf not only has standard HTML minification checks (#38, #44) but also tells you about HTML content that should be completely removed such as: <TABLE> tags that are used for layout purposes (#111, #299, #301); unnecessary or redundant content such as <META> tags used for caching, or character set info, or other meta data and multiple page elements like DOCTYPES (#170, #302-#305, #97); common style attributes or onX event attributes that can be consolidated into single declarations (#283, #28-#30); excessive ASP.NET ViewState (#212); and more.

We find ways to reduce the size of content in other types of files by finding issues like: unused CSS rules (#33); Flash or Silverlight applications that have not been compressed, or were compiled with debugging symbols, or contain uncrunched images, or aren’t using assembly caching (#148, #149, #231, #232, #256, #257); content that can be rezipped (#230); already compressed content using HTTP compression (#58); PNG8 candidate images (#284, #285); and more.

As of today Zoompf detects 103 different issues so you can optimize your web content to be as small as possible without sacrificing features or compatibility. Reducing response size provides a good improvement to page load times and a larger impact on operational resources like bandwidth and server load.

Reducing Request Count

Reducing request count issues are all about how to reduce the number of HTTP requests needed to render the page. Typical examples for this category include things like combining CSS or JavaScript files, CSS sprites, and using HTTP caching.

Again Zoompf goes beyond the status quo and detects even more ways that you can reduce the request count for a web page. This includes things like: hyperlinks and images that can be converted to client-side image maps (#185); server-side image maps (#169); wasteful redirects due to no trailing slash, or to a default page, or to the WWW/non-WWW or SSL version of the site (#129, #204, #247, #248); resources that should be cached by caching proxies but aren’t due to query strings, URL contents, or conflicting and misconfigured cache headers (#68, #191-#197, #36); news feeds that aren’t using caching, or blackout periods, or Last-Modified support (#225-#229, #233, #235); style sheets that only import other style sheets (#264, #269); external JavaScript files with no executable content (#37); and many more.

As of today Zoompf checks for 70 different issues that will reduce the number of requests your web server must handle per page. These issues have an enormous effect on page load times and an equally massive effect on bandwidth consumption and network usage.

Maximizing Browser Performance

Maximizing browser performance is all about using correct features and organizing content to allow the browsers to render the page and execute the content as fast as possible. This includes obvious things like domain sharding and cookie-free domains, avoiding CSS expressions or AlphaImageLoader, and properly placing references to external JavaScript or CSS files.

Again Zoompf goes beyond typical front-end tools and detects issues such as: <SCRIPT> tags that block rendering (#286, #152); images or objects without dimensions (#237, #238, #262); <CANVAS> issues (#291); out-of-date and poorly performing JavaScript libraries like older jQuery or Google Analytics’ urchin.js (#272, #293); downgrades to HTTP/1.0 (#56); JavaScript code performance issues (#158, #160, #221, #222); premature persistent connection closure (#177, #178); and more.

Zoompf currently checks your web application for 41 different issues which decrease the performance of your visitors’ browsers and which directly lead to slower page load times and slower application functionality.

Server and Miscellaneous Issues

The “Server and Miscellaneous Issues” category contains issues which reduce the performance of the web server or which waste the server’s resources. While front-end optimization techniques can offer enormous performance gains there are many easy to fix server issues that can be detected by simply crawling the web server. Examples include: missing or misconfigured robots.txt files with suboptimal rules or no crawl-delay (#7, #289, #102, #91); broken or incorrect content given the way it was referenced (#40-#43, #253-#255); application specific issues like misconfigured server-side object caches like memcached or WP Super Cache (CMS or PHP op-code caching systems) (#239-#245); and a few more.

As of today Zoompf scans for 36 different issues that decrease the performance of the server or needlessly waste its resources.

Other Issues Zoompf Detects

There are 2 other categories of issues that Zoompf looks for. The first category consists of quality issues. Zoompf classifies quality issues as blatant errors or critical problems with your website’s functionality. We include quality issues because you really shouldn’t be trying to find and fix performance issues while your web application is fundamentally not working properly. We mentioned quality issues before in our post about supporting other languages. These quality checks detect problems such as web server errors (broken status codes, misconfigured modules, SSL certificate issues, etc); application tier issues (stack traces, framework exceptions, unexecuted server-side source code, etc); and database errors and exceptions. In all Zoompf detects 37 different quality issues. While Zoompf is not meant to replace QA testers we want to alert you to any critical or broken functionality on your website that we detect.

We call the final group “Prototype Checks.” These look for all different kinds of issues, such as obscure or developing performance issues, search engine optimization best practices, usability and accessibility issues, browser compatibility issues, and even web security issues. We have written these checks so we can gather different pieces of data or statistics about these issues in all the web assessments we do. We do not include any of these issues in our web performance reports or in any other public reports (though this might change). Some of these issues do get promoted to performance or quality checks, while others allow us to play with new ideas or technologies.

Follow the Leader

So now you know what performance issues Zoompf checks for and how we provide a richer and deeper analysis of your web application than other tools. In fact, other people are starting to take cues from Zoompf on new performance issues to include in their tools. When Google released version 1.6 of PageSpeed in February they added support to find a new performance issue: Specify a Character Set Early. This is something we researched and blogged about late last year and Google even references Zoompf’s work as source material for including the new check!

Front-end performance is a very young and exciting space. We will continue to discover and publicize new techniques to optimize your website’s performance. It’s going to be a fun ride!

Experience the Matrix

Much like The Matrix, reading about what performance issues Zoompf detects is nothing compared to experiencing for yourself what performance issues Zoompf can detect. Right now you can go to Zoompf.com and get a free performance assessment of your website. What things will we find that other tools have missed? Think you have a fully optimized website that cannot possibly be improved? Try Zoompf’s free performance assessment today and find out!

March 4, 2010

Zoompf Check #300! Or: Gateway’s got a problem…

People often ask how we discover new performance issues. Without a doubt the #1 way we discover new issues is simply by looking at websites and seeing how they work. Not a customer engagement goes by where we don’t find at least one new web performance issue that we add to our growing database of web performance issues. This is why Zoompf has added over 150 performance checks since we went public with our offerings. In this blog post we are going to give you the back story around a cool performance problem we found. In fact, it was so interesting and impactful that it became the 300th check in our database of web performance issues!

Picture of a cork popping out of a bottle

The Strange Case of CSS Resources

We recently made a few changes to our CSS parser and we were testing the new features against a few dozen pages to ensure we hadn’t broken anything. While testing, we noticed something odd on Gateway’s website. All of the background images in one of the style sheets for gateway.com were served over an encrypted SSL connection. In other words, the style sheet http://www.gateway.com/css/cms/styles.css is served over a plain unencrypted HTTP connection. However it includes links to background images like https://cdn.gateway.com/media/cphp/themes/default_bg.gif which are served over an encrypted SSL connection. What’s odd is the web server will serve those background images just fine if you request them using HTTP instead of HTTPS. It’s even weirder since very little of Gateway’s website even uses SSL. In fact, trying to access the root of Gateway’s website using HTTPS will redirect you to the HTTP version.

So, certainly something weird, but is this a performance issue? After all, the developers just added a few extra “s” characters to their CSS. So what if a little bit of SSL gets used? What are the performance implications of that?

Actually, they are huge!

SSL Primer

HTTPS is HTTP traffic sent over an encrypted SSL tunnel. SSL is expensive for 2 reasons (documented here and here). The first is the SSL handshake, where the client and the server negotiate what protocols will be used and exchange keys. This involves the use of asymmetric key encryption, which is extremely math heavy and quite slow. Then there is bulk data encryption, which is where the server uses symmetric key encryption. Symmetric key encryption is much faster than asymmetric key encryption, but can be much slower than sending unencrypted data. (How much slower is beyond the scope of this article. Suffice it to say that the performance impact of SSL is sufficiently large that there is an entire market for SSL acceleration products).

So what’s the impact in this situation? Well, the extremely expensive initial SSL negotiation must be done for the two initial connections to the web server. Each of these negotiations takes between 250 and 350 milliseconds. This is in addition to the overhead of making the TCP connection and the negative impact of TCP’s slow start and congestion control. Even when reusing pre-negotiated keys, re-establishing SSL after a connection close is expensive, typically costing between 60 and 100 milliseconds.

The Performance Impact

That doesn’t sound too bad, but see how it affects Gateway. Examining this waterfall graph for www.gateway.com shows that 1.5 seconds, or over 25% of the page load time for Gateway.com, is due to the overhead of SSL! It’s hard to believe but the simple addition of 14 “s” characters to the page caused 1.5 seconds of delay! Furthermore, Gateway’s web server is going to be working about twice as hard to encrypt and serve those resources! Server load, power, cooling, and availability are all adversely affected.

(Notice there are a lot of other problems with Gateway’s website that contributed to the damage the SSL issue caused. Specifically they were not using persistent connections, which caused a dozen of the smaller re-use negotiations to occur. Luckily Zoompf Checks #177 and #178 detected all these closed persistent connections!)
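A quick back-of-the-envelope check shows how those numbers add up. The Python sketch below uses only the ranges quoted in this post: 250 to 350 milliseconds per full negotiation and 60 to 100 milliseconds per re-use negotiation. Treating “a dozen” as exactly 12 is our assumption, and the function name ssl_overhead_ms is made up for this example.

```python
FULL_HANDSHAKE_MS = (250, 350)     # full SSL negotiation, range quoted above
RESUMED_HANDSHAKE_MS = (60, 100)   # re-use negotiation, range quoted above


def ssl_overhead_ms(full_handshakes, resumed_handshakes):
    """Return the (low, high) estimate of total SSL negotiation time in ms."""
    low = (full_handshakes * FULL_HANDSHAKE_MS[0]
           + resumed_handshakes * RESUMED_HANDSHAKE_MS[0])
    high = (full_handshakes * FULL_HANDSHAKE_MS[1]
            + resumed_handshakes * RESUMED_HANDSHAKE_MS[1])
    return low, high


# Two initial connections plus roughly a dozen re-use negotiations.
low, high = ssl_overhead_ms(full_handshakes=2, resumed_handshakes=12)
print(f"Estimated SSL overhead: {low} to {high} ms")
```

The estimate spans roughly 1.2 to 1.9 seconds, which brackets the 1.5 seconds seen in the waterfall graph. It ignores the cost of bulk encryption, so treat it as a sanity check rather than a measurement.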

Wow. So how did this happen? It could be for a lot of reasons, all of them fairly simple or innocent mistakes. It could have been a simple typo. It could have been left over from a redesign where that style sheet was used exclusively inside of the SSL portion of the site. It could be that this style sheet is used in both the SSL and non-SSL portions of the site, but to avoid a mixed content warning, the resources are referenced using SSL. The solution to this is simple. The style sheet should use protocol relative URLs to reference the images. That way if the style sheet is in the SSL portion of the website (and referenced using an https URL) the background images will be requested using HTTPS. And when the style sheet is in the non-SSL portion of the website (and referenced using an http URL) the background images will be requested using HTTP.
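To make the fix concrete, here is a rough Python sketch that finds hard-coded https references inside url(...) values in a style sheet and rewrites them to the protocol relative form (//host/path). The function names find_https_urls and to_protocol_relative are ours, and the regular expressions only handle the common url(...) syntax. It is an illustration, not a full CSS parser.

```python
import re

# Captures the absolute https URL inside a CSS url(...) reference.
_HTTPS_URL_VALUE = re.compile(r"""url\(\s*['"]?(https://[^'")\s]+)""", re.IGNORECASE)

# Matches the start of such a reference, remembering any opening quote.
_HTTPS_URL_PREFIX = re.compile(r"""url\(\s*(['"]?)https://""", re.IGNORECASE)


def find_https_urls(css):
    """List every resource the style sheet references with a hard-coded https URL."""
    return _HTTPS_URL_VALUE.findall(css)


def to_protocol_relative(css):
    """Rewrite url(https://host/...) references to url(//host/...)."""
    return _HTTPS_URL_PREFIX.sub(r"url(\1//", css)
```

After the rewrite, a rule such as background: url(https://cdn.gateway.com/media/cphp/themes/default_bg.gif) becomes background: url(//cdn.gateway.com/media/cphp/themes/default_bg.gif), and the browser picks the scheme of the style sheet that contains it.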

It is both cool and scary that such a simple issue can have huge performance implications. And unfortunately none of the free tools like YSlow or PageSpeed would have alerted you to the issue. Until now.

It is with great pleasure and pride that we announce Zoompf Check #300: SSL Resources on Non-SSL Page. That this is such a simple issue to have, and at the same time causes such a massively negative impact, makes it worthy to be our 300th check.

Want to see what performance problems your website has? Finding SSL Resources on Non-SSL Pages is just one of the 300 issues Zoompf detects when analyzing your web applications. Get your instant free mini web performance assessment at Zoompf.com today!

February 4, 2010

Choosing PNG8 Candidate Images

Have you heard about PNG8 yet? No? Well, PNG8 is a PNG image that uses an indexed palette of up to 256 colors, unlike a true color PNG which can support several million different colors. There have already been a number of excellent articles and blog posts written about PNG8. These articles have discussed things about PNG8 such as its benefits, how to create PNG8 images, and using them across different browsers.

But this article isn’t about any of that.

This article is about how to choose PNG images that are good candidates for converting to PNG8. Historically this has not been an easy or a straightforward decision.

We already know that PNG images can be optimized. Crunching PNGs (using pngcrush, or optipng, etc) to reduce their size is a lossless operation. This process removes unnecessary data chunks like comments and recompresses the DEFLATE streams using tweaked settings. When optimizing images by crunching them none of the image data changes. The number of colors in the image does not change and their color values and hues are exactly the same.
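To show why removing comment chunks is lossless, here is a small Python sketch that walks the chunks of a PNG file and drops the text and timestamp chunks while copying everything else unchanged. It covers only the chunk removal half of crunching. Real tools like pngcrush and optipng also recompress the image data. The function names and the exact list of dropped chunk types are our choices for illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary chunks that hold comments or timestamps, not pixel data.
DROPPABLE = {b"tEXt", b"zTXt", b"iTXt", b"tIME"}


def make_chunk(chunk_type, data):
    """Serialize one PNG chunk: length, type, data, then CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png):
    """Yield (type, data) for each chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        chunk_type = png[pos + 4:pos + 8]
        data = png[pos + 8:pos + 8 + length]
        yield chunk_type, data
        pos += 12 + length  # 4 bytes length + 4 type + data + 4 CRC


def strip_metadata(png):
    """Return the PNG without comment/timestamp chunks. Pixel data is untouched."""
    kept = [make_chunk(t, d) for t, d in iter_chunks(png) if t not in DROPPABLE]
    return PNG_SIGNATURE + b"".join(kept)
```

Because the IHDR, palette, and IDAT chunks are copied byte for byte, the decoded pixels are identical before and after. Only the file size changes.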

But converting a true color PNG to a PNG8 image is a lossy operation. You lose data. PNG8 images can have at most 256 distinct colors while true color PNGs can have several million. Because of this not all PNG images should be converted to PNG8. Some images look absolutely horrible when converted. Working with our clients we have created two guidelines to evaluate whether a PNG image can be converted to PNG8 without any noticeable loss in quality. This is an easier and more scalable solution than converting all of the PNG images on your website and then manually verifying the resulting PNG8 images are acceptable.

Guideline #1: Number of Colors

To understand the impact of limiting a PNG image to only 256 distinct colors we must understand how many colors PNG images typically have. Converting to PNG8 can quite significantly reduce the number of colors in the image. A PNG8 version of a true color PNG image with 1,000 distinct colors has 25% as many colors as the original image. A PNG8 version of a true color PNG image with 10,000 colors has only 2.5% as many colors as the original image! So the number of distinct colors in the image (and thus the number of distinct colors you are destroying when converting to PNG8) has a huge impact on how acceptable the resulting PNG8 will be. In plain terms the PNG8 version of a 10,000 color PNG image will look worse than the PNG8 version of a 1,000 color PNG image.
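The percentages above come from dividing the 256 color limit by the number of distinct colors in the original. A tiny Python sketch of that arithmetic, with a function name of our own choosing:

```python
PNG8_MAX_COLORS = 256


def fraction_of_colors_kept(distinct_colors):
    """Share of the original distinct colors that fit in a 256 entry palette."""
    if distinct_colors <= 0:
        raise ValueError("distinct_colors must be positive")
    return min(PNG8_MAX_COLORS, distinct_colors) / distinct_colors


for count in (1_000, 10_000):
    print(f"{count} colors: {fraction_of_colors_kept(count):.2%} kept")
```

An image that already has 256 or fewer distinct colors keeps all of them, which is why such images are the safest candidates.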

While true color PNGs can have several million distinct colors they rarely do. (If you have a PNG image that has even a few tens of thousands of colors it should probably be saved as a JPEG instead). Examining a sample set of PNG images we found that PNGs tend to have several thousand distinct colors. We also have found that images containing only a few thousand colors will easily convert to a PNG8 image without any noticeable loss in quality. Consider the Zoompf logo:

The Zoompf logo is a true color PNG image consisting of 1999 distinct colors. That sounds like a lot. Let’s convert this to a PNG8 image using pngquant. Here is the original logo and the PNG8 version of the logo side by side for comparison. The original logo is on the top and new PNG8 version is on the bottom.

Wow, they look virtually identical even though the PNG8 version’s file size is over 50% smaller and it uses 87% fewer colors than the original. Only if you zoom in very close do you start to see some differences. The greens are slightly lighter and the grays in the swoosh lines are a little different.

We have found that images with fewer than 2500-3000 distinct colors tend to provide the best trade off in terms of maximum reduction in file size without any noticeable difference in quality. This is purely subjective. There are some true color PNG images with 6,000 colors or more that look just fine when converted to PNG8. You should experiment and see what works best for you.

Guideline #2: Image Dimensions

Another factor is the dimensions of the image. Small images, even if they have more than 7,500 distinct colors, convert to PNG8 with no visible loss of quality. This is because your brain has trouble detecting so many similar colors in such a small area. Consider this true color PNG image of a cat.

This picture consists of 8,853 distinct colors. That’s an enormous amount when you realize this image has only 9900 pixels total! Almost every pixel is a completely unique color. That’s tons of distinct colors given the area they are displayed in. Again we use pngquant to convert this cat image into a PNG8 image and compare it to the original. The original image is on the top and PNG8 version is on the bottom.

Again, they look virtually identical even though the PNG8 version’s file size is over 50% smaller and it uses about 97% fewer colors than the original. As with the logo, only if you zoom in very close can you start to see the differences between the original image and the PNG8 version.

We have found that images smaller than 100 pixels by 100 pixels, or images whose area is less than 10,000 pixels, can easily be converted into PNG8 without any noticeable difference in quality. This is a purely subjective guideline. Some larger true color PNG images look just fine when converted to PNG8.

Using the Guidelines

These guidelines can be used separately. A PNG image does not have to be both small and low in color count to be a good candidate for converting to PNG8. As an example, Graphviz, a program that generates node-and-edge style graphs, regularly produces images that are thousands of pixels wide by thousands of pixels tall. This would violate our image area guideline. However these images usually contain only a few hundred colors. This satisfies our color count guideline. Sure enough, converting the output of Graphviz to PNG8 saves a lot of space with no perceivable loss of quality.
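Put together, the two guidelines amount to a simple either-or test. The Python sketch below encodes them with the thresholds from this post, using the conservative end of the 2500-3000 color range. The function name is ours and the image dimensions in the usage notes are hypothetical. Remember the guidelines are subjective, so treat a False result as “check by eye” rather than “never convert”.

```python
MAX_DISTINCT_COLORS = 2500   # conservative end of the 2500-3000 guideline
MAX_PIXEL_AREA = 10_000      # 100 pixels by 100 pixels


def is_png8_candidate(distinct_colors, width, height):
    """Apply the two guidelines. Satisfying either one is enough."""
    few_colors = distinct_colors < MAX_DISTINCT_COLORS
    small_image = width * height < MAX_PIXEL_AREA
    return few_colors or small_image
```

A large Graphviz graph with a few hundred colors passes on color count alone, while the 9,900 pixel cat image with 8,853 colors passes on area alone.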

PNG8 and CSS Sprites

A lot of times people want to combine PNG8’s advantage of a very small file size with CSS Sprites’ advantage of reducing the number of HTTP requests. At first glance this makes a lot of sense. Individual CSS background images inside of the sprite are often small images, fitting our image dimensions guideline. Also CSS background images are often icon-style images used on buttons, toolbars, etc. This means each individual image tends to have only several dozen distinct colors, fitting our image colors guideline.

Unfortunately this is looking at the trees instead of the forest. That’s because a CSS Sprite saved as a PNG8 image has to use only 256 distinct colors for all the sub-images inside the sprite. So while each sub-image that makes up the CSS Sprite might look fine as an individual PNG8 image (each with its own 256 color palette), all the sub-images together in a single PNG8 CSS Sprite using a single common 256 color palette might not. This is especially true with gradients and other graphics that use different shades for color transitions. In fact, Stoyan pointed out a recent article that talked about how converting a CSS Sprite to PNG8 caused a very noticeable loss in quality. The solution was to hand edit the PNG8’s 256 color palette to preserve as many shades of the gradient as possible to improve quality.
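The shared palette problem is easy to demonstrate with sets. In the Python sketch below each sub-image is represented as a plain list of RGB tuples (in practice you would get these from an image library of your choice), and the function name sprite_fits_png8 is made up for this example.

```python
def sprite_fits_png8(sub_images, max_colors=256):
    """True if all sub-images combined need at most max_colors distinct colors."""
    combined = set()
    for pixels in sub_images:
        combined.update(pixels)
    return len(combined) <= max_colors


# Two gradients with 200 shades each: fine on their own, too many together.
red_gradient = [(shade, 0, 0) for shade in range(200)]
blue_gradient = [(0, 0, shade) for shade in range(1, 201)]

print(sprite_fits_png8([red_gradient]))                 # each image alone
print(sprite_fits_png8([red_gradient, blue_gradient]))  # both in one sprite
```

Each gradient fits in its own 256 color palette, but the combined sprite needs 400 colors, so the converter has to throw shades away.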

Zoompf Color Counter

Since so much savings can occur from converting from PNG24 to PNG8 where appropriate, developers are left with the challenge of trying to quickly find candidate images. While Zoompf’s free web performance scanning service will detect candidate images developers will also want to test images that are not yet uploaded or test images on a website that is not yet in production. To help developers, Zoompf has released the Zoompf Color Counter.

Screen shot of Zoompf's Color Counter program

Zoompf Color Counter is a Windows program that will analyze an image and tell the user how many distinct colors it has. Simply open an image inside Zoompf Color Counter or drag and drop an image on top of Zoompf Color Counter to learn the number of distinct colors. Download Zoompf Color Counter.
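If you would rather script the count yourself, the core idea fits in a few lines of Python. This sketch is not the source of Zoompf Color Counter. It assumes you have already decoded the image into a sequence of pixel tuples with whatever image library you use, and the function names are ours.

```python
from collections import Counter


def count_distinct_colors(pixels):
    """Number of distinct colors in a sequence of pixel tuples."""
    return len(set(pixels))


def most_common_colors(pixels, limit=5):
    """The most frequent colors and how many pixels use each one."""
    return Counter(pixels).most_common(limit)
```

Note that pixels with the same RGB values but different alpha values count as different colors if you pass RGBA tuples.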

Summary

There are a lot of blog posts and articles on the Internet about how converting true color PNG images into PNG8 images is an excellent optimization technique. However knowing how to choose true color PNG images that are good candidates to convert to PNG8 can be difficult and time consuming. In this post we provide 2 guidelines to help:

  1. Images with fewer than 2500 to 3000 distinct colors can usually be converted to PNG8 without any noticeable differences.
  2. Images smaller than 100 pixels by 100 pixels (or images whose area is less than 10,000 pixels) can usually be converted to PNG8 without any noticeable differences.

In addition we have released Zoompf Color Counter to help developers find candidate images. Remember that converting to PNG8 is a lossy process. How much loss is tolerable will vary from person to person. You should use our guidelines but also experiment to see if you will tolerate more. Finally, be careful converting your CSS Sprite images into PNG8, especially if you use images with gradients.

Want to see what performance problems your website has? Finding Candidate PNG8 Images based on color count or image dimensions are just two of the 300+ performance issues Zoompf detects when checking your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!

February 2, 2010

Whoops! Error in the System!

We unfortunately had an email hiccup this morning and two of our free mini reports did not get to the right people. If you requested a free Zoompf mini performance assessment for either www.ipingtest.com or www.getesignature.co.uk, please resubmit your request so we can get the report to you. Thanks for the interest and sorry!

January 28, 2010

Cruft inside Microsoft Word HTML files

We were recently on-site with a client helping them fix some issues when we happened to see this directory containing some HTML files.

Well that’s odd. Why do some of those HTML files have one icon and different HTML files have another icon? We examined the source code for one of the HTML files with the odd icon and saw this:

Turns out these HTML files were created by Microsoft Word! Due to a series of different web designs and designers over a number of years, as well as a healthy bit of editing by the marketing department, 1 in 4 web pages of our client’s current website were created or modified using Microsoft Word!

As we scrolled through the HTML file we saw large amounts of extra data that no normal web browser would ever interpret. A little research explained it for us. Microsoft allows you to save a document as an HTML file. They also want you to be able to open an HTML file that was created using Microsoft Office and resume editing it just like a normal document. Since Microsoft Office has all sorts of features that HTML and CSS don’t, this extra data allows Office to preserve certain information inside the HTML file between edits.

Some of the data stored is obvious: when the document was created and by whom, who made what edits when, paragraph count, etc. Other less obvious data such as VML, DHTML behaviors, column and page spacing, Word styling information, embedded objects data, and more is also stored inside the file. All of this Office specific data is stored inside the HTML file and is wrapped inside of special conditional comments such as <!--[if gte mso 9]>. This hides the content from other programs that read the HTML. Furthermore, Word isn’t the only Office program that inserts this extra data into HTML files. Excel does too.

Keep in mind we are not talking about the general bloat that WYSIWYG HTML editors tend to add. Bloat such as empty <P> tags, large numbers of &nbsp; entries, table based layouts, overly long style attributes, are all hallmarks of WYSIWYG editors. However this is beyond that. This is extra data that is used exclusively by Office and is completely ignored by all web browsers that don’t support conditional comments (in other words any program besides Internet Explorer). In fact, the data is ignored by Internet Explorer as well since the conditional comments apply only to Microsoft Office and not for any version of IE.

So we have a bunch of useless cruft inside of these HTML files. Not a big deal right? Unfortunately all this useless data has a cost. Of the files we sampled we found that 20-35% of the HTML content was Microsoft Office specific data. That means 20-35% of the bytes going down the pipe to a user are completely wasted for these files.

Cleaning Up the Cruft

Luckily Word includes an option that allows you to save a filtered HTML file. A filtered HTML file will not contain any of this useless Microsoft Office specific data. Under “Save As” you want to select the “Web Page, Filtered” option as shown below.

If you don’t happen to have a copy of Office around (or you have a few hundred HTML files to clean) you can still remove this useless content. Since all of this extra data is stored in conditional comments that are looking for the “mso” user agent you can easily write a regular expression to remove it. In fact you should create a script that detects and removes this extra data and include it as part of your publishing process.
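As a rough illustration of the regular expression approach, here is a minimal sketch in Python. The pattern and the function name are our own assumptions, not a complete cleaner: it only handles the common forms such as <!--[if mso]> and <!--[if gte mso 9]>, it deliberately leaves negated blocks like <!--[if !mso]> alone, and it does not attempt to handle nested conditional comments.

```python
import re

# Matches an Office-only conditional comment block, for example:
#   <!--[if gte mso 9]> ...Office specific data... <![endif]-->
# The pattern is an illustrative assumption, not an exhaustive grammar.
MSO_BLOCK = re.compile(
    r"<!--\[if\s+(?:(?:gte|gt|lte|lt)\s+)?mso(?:\s+\d+)?\]>.*?<!\[endif\]-->",
    re.IGNORECASE | re.DOTALL,
)

def strip_office_cruft(html):
    """Return html with Office-only conditional comment blocks removed."""
    return MSO_BLOCK.sub("", html)
```

Run something like this over a copy of your files first and compare the results in a browser before wiring it into a publishing process.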

But I Would Never edit HTML with Word!

I’m sure that you wouldn’t. But do you create all the content for your organization’s websites? Do you hand vet every piece of content before it goes out the door? At Zoompf, we have clients with over a hundred web properties, produced by hundreds of individual content providers, both internal and external, who report into dozens of different departments. You better believe stuff like this slips through the cracks all the time.

There is also a huge install base and large user base of WYSIWYG HTML editors. Microsoft just sold $4.75 billion worth of Office in Q2 of Fiscal Year 2010 alone! Adobe’s Creative Suite with Dreamweaver is wildly popular, as are other WYSIWYG tools. And that is not to mention the 15 years or so of legacy content on the Internet already that was written using who knows what kind of tool or coding standard.

So yes, the ideal should be “we should never write bloated web pages.” However the reality is “this happens and we need tools and processes to ensure we do not publish bloated web pages.” Checking for bloated web pages produced by tools like Microsoft Word is part of what web performance optimization is about.

Want to see what performance problems your website has? Finding unfiltered Microsoft Office HTML documents is just one of the 200+ performance issues Zoompf detects when checking your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!

January 27, 2010

Apple’s iPad and Web Caching

Like most tech folks, I spent the afternoon watching and reading about Apple’s new iPad. To call it beautiful and innovative is an understatement. I want to purchase one. As in, right now. At a $500 price point I wouldn’t even consider buying a netbook. Since I already have a netbook, I’m seriously considering replacing it with an iPad because web browsing looks amazing on the iPad. After all, Steve Jobs himself promised me “It is the best browsing experience you’ve ever had.”

Only I’m not sure if that’s true.

Web Performance on the iPhone

We always talk about web performance as something that only the site owners care about. Few people talk about web performance when it comes to choosing a browser. Certainly no one talks about choosing a browser based on simple performance features like which one supports compression, or caching, or conditional requests, or resumable downloads. That’s because this isn’t 1997 and all the browsers do these basic features equally well.

Only I am sure that’s not true.

Stoyan Stefanov wrote an excellent and detailed article on how the cache for Safari on the iPhone works, or rather, doesn’t work. You should read the entire article. For this article we are most interested in two shortcomings of iPhone Safari’s disk cache that severely impact the browsing experience.

Resources > 15K aren’t cached.

No resource larger than 15 kilobytes will be cached. This is pretty horrible. 15K is not a lot of content. Worst of all, that’s the uncompressed size, so HTTP compression will not help you fit an otherwise oversized resource into the cache. So how big is 15K? No modern JavaScript library is less than 15K. A quick check of the Top 10 non-search engine websites shows that none of them have CSS files that will fit in the cache. Images tend to be small enough to fit, however CSS sprites can quickly get too large to fit.

Total Cache Size is only 1.5 Megs.

Safari on the iPhone will not cache more than a total of 1.5 megabytes of content. This is a ridiculously small cache. Your computer’s processor has an L2 data cache etched into the silicon of the chip that is 50% to 200% larger than the web cache iPhone Safari has. At first glance you might think this is completely horrible. The main page of CNN and all its JavaScript, CSS, and images weighs in at 752 kilobytes and would consume over half of Safari’s cache! And that’s just one website! However, as we just mentioned, any resource over 15K doesn’t get cached at all. So the first failing of iPhone Safari’s cache makes the 2nd failing of the iPhone cache a little less painful!

The moral here is that 1.5 megs of cache is just way too small to be helpful. Furthermore, the cache can get cleared inadvertently in several ways, such as closing Safari without certain tabs or some types of powering the iPhone up and down. This means the meager assistance the cache provides can be undercut.

These two limits mean the disk cache for Safari on the iPhone can reasonably store a few hundred objects. How quickly does that fill up? Of the 32 images on the main page of CNN right now, 29 of them are less than 15K and would get cached. (Ironically the photo of Steve Jobs holding an iPad is too large to be cached).
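Here is a small sketch that models the two limits together. It is only a toy model: the constant and function names are made up for illustration, and it assumes the cache simply stops accepting content once it is full, which is a simplification and not a description of how Safari decides what to keep or evict.

```python
MAX_OBJECT_BYTES = 15 * 1024               # per-resource limit (uncompressed size)
MAX_CACHE_BYTES = int(1.5 * 1024 * 1024)   # total cache limit

def cacheable(uncompressed_size):
    """True if a single resource is small enough to be cached at all."""
    return uncompressed_size <= MAX_OBJECT_BYTES

def fill_cache(sizes):
    """Return the resource sizes that end up cached, visiting them in order.

    Assumption: once the cache is full, new content is simply not stored.
    """
    cached, used = [], 0
    for size in sizes:
        if cacheable(size) and used + size <= MAX_CACHE_BYTES:
            cached.append(size)
            used += size
    return cached

# Resources right at the 15K limit fill the cache after about a hundred objects.
print(len(fill_cache([MAX_OBJECT_BYTES] * 200)))  # prints 102
```

Smaller resources stretch that count further, which is where the "few hundred objects" estimate comes from.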

“It is the best browsing experience you’ve ever had.”

The long and short of it is the version of Safari that runs on the iPhone is just awful when it comes to caching. And as we know, the fastest request a browser can make is none at all. As such, caching is an important aspect of web performance optimization, caching directly affects page load times, and caching is critical to the end user’s web browsing experience.

So far, it seems like much of the iPad is running the iPhone OS with the iPhone apps. If this is the case then I am not hopeful about the web browsing experience on the iPad. If Apple is really going to give “the best browsing experience you’ve ever had” they simply must improve the web caching for Safari on the iPad. Otherwise the iPad will be like a DeLorean when it comes to web browsing: beautiful, but underpowered.

Want to see what performance problems you have? Which web resources are cachable on the iPhone is just one of the 200+ performance issues Zoompf detects while assessing your web applications for performance. You can sign up for a free mini web performance assessment at Zoompf.com today!

January 19, 2010

Foreign Object Detected

We are getting very excited as Zoompf continues to expand. We are adding new clients, gaining more mentions and followers on Twitter, and every day more web developers and IT administrators receive a free miniature performance assessment. Since the start of the new year, we have been getting increased inquiries from people in Europe in particular. A few of these European performance junkies have asked whether Zoompf will work with non-English websites.

The answer is yes.

Zoompf crawls and analyzes your website for over 200 performance issues. The vast majority of those checks don’t examine the web content itself. Instead they are looking at HTTP headers, HTML tags, link relationships and structure, image metadata, Silverlight manifests, or Flash tags. All of these checks find performance issues regardless of whether the site’s content is written in Belgian Dutch or Polish or English.

At Zoompf, our goal is to help you make your websites blazingly fast. But, if in the process, we notice that the font file you are trying to dynamically load inside of the CSS for that cool theme is throwing a Java stack trace, shouldn’t we tell you that too?

We think so.

That’s why Zoompf includes a handful of additional quality issues that look for defects such as application, framework, or web server error messages. Since error messages tend to be in English these quality checks look for the English version of these error messages. So in that regard there are some English-only checks in Zoompf, but they are not looking for performance issues.

A good example of these English-only error message checks is the check for database error messages. Zoompf will flag web pages that contain database error messages such as a MySQL database connection error. You would be amazed at how often you will see these on the Internet! But if someone has a localized German database server running that returns database error messages in German, Zoompf would not be able to detect these errors. Keep in mind that all of the English-only checks are for general website quality issues. There are no performance checks in Zoompf that are English specific. Consider these extra quality checks a bonus that no other tools provide! They help flag other, serious issues with your web application that Zoompf noticed while looking for performance issues.
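To show why a check like this ends up being English-only, here is a minimal sketch of text-based error detection. This is not Zoompf's actual implementation. The two patterns are the English text of common MySQL errors, and the list and function name are assumptions chosen for illustration.

```python
import re

# Illustrative English-language patterns only; a real check would need many more.
ERROR_PATTERNS = [
    re.compile(r"You have an error in your SQL syntax", re.IGNORECASE),
    re.compile(r"Can't connect to MySQL server", re.IGNORECASE),
]

def find_error_messages(page_text):
    """Return the patterns (as strings) that appear somewhere in page_text."""
    return [p.pattern for p in ERROR_PATTERNS if p.search(page_text)]
```

A page containing the same failure reported in German would sail straight past these patterns, which is exactly the limitation described above.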

Today we are adding a new check, #280, which flags on web pages with non-English content. This is to help our customers understand which web pages contain content in other languages and could have extra non-performance issues that Zoompf could not detect.

Thanks for all the interest and excitement in Zoompf. It’s going to be a fun year!

January 15, 2010

Should You Use JavaScript Library CDNs?

The concept is simple. Hundreds of thousands of websites use JavaScript libraries like jQuery or Prototype. Different websites you visit each download another identical copy of these libraries. You probably have a few dozen copies of jQuery in your browser’s cache right now. That’s silly. We should fix that.

How? Well, if there was a 3rd party repository of common JavaScript libraries, websites could simply load their JavaScript files from them. Now imagine the repository implemented caching. SiteA, SiteB, and SiteC all have <SCRIPT SRC> tags that reference http://some-code-respo.com/javascript/jquery.js. When someone visits any one of these sites, the JavaScript library jQuery is downloaded and cached. If that same person visits one of the other sites, that person will not have to re-download jQuery again. The idea is that sites will load faster because these libraries should not have to be re-downloaded very often at all. Of course, this only works if a lot of people all use the common repository. If only a few people use the common repository, then virtually no one benefits because the library will not have been downloaded and cached by a previous website and has to be re-downloaded.

This is an example of the Network effect. The more people that use a system the more valuable the system becomes.

Implementations of this idea of a central shared repository of common JavaScript libraries are called several different things. Google calls their implementation Google AJAX Library API. Yahoo doesn’t have a clear name for their implementation. I’ve seen “Free YUI hosting” or “YUI Dependencies”, or even Yahoo YUI CDN. Microsoft calls their implementation the Microsoft AJAX CDN. To keep things simple, I will collectively refer to these repositories of common JavaScript libraries as JavaScript Library CDNs.

JavaScript Library CDNs seem like a performance no brainer. Use the service, your site loads faster and consumes less bandwidth. This post will explore if and under what conditions does a JavaScript Library CDN actually improve web performance.

The Choice

Consider this situation. You are a speed-conscious web developer. You have a website that uses jQuery 1.3.2 as well as some additional site specific JavaScript. Because you value web performance, you know you should concatenate all your JavaScript files into as few files as possible, minify them, and serve them using gzip compression. You have 2 choices:

  1. Serve all your JavaScript locally. You will have a single <SCRIPT SRC> tag that points to a JavaScript file containing jQuery 1.3.2 and your site specific JavaScript.
  2. Serve some of the JavaScript using a JavaScript Library CDN. You will have 2 <SCRIPT SRC> tags. The first tag will point to a single file on your website containing your site specific JavaScript files. The second tag will point to the copy of jQuery 1.3.2 on Google AJAX Library API.

What’s the difference? Well, a minified, gzipped copy of jQuery 1.3.2 is 19,763 bytes in length. If you choose option 1 all your users will have to download these 19,763 bytes regardless of what other sites they may have already visited. That’s the cost: downloading 19,763 bytes. Notice there is no cost of an additional HTTP request and response or other overhead because those bytes of jQuery content are included inside the response for the site specific JavaScript content, a request the visitor already has to make. This is important, so I will repeat: The cost of not using a JavaScript Library CDN is only the downloading of JavaScript content and not any additional HTTP requests or overhead.

In the second option, you are going to gamble with a JavaScript Library CDN. You are hoping a visitor has already browsed another website which also uses Google to serve jQuery 1.3.2. If you are right, then that visitor does not need to download 19,763 bytes. If you are wrong, the visitor needs to download 19,763 bytes from Google. That’s the prize in a nutshell. And downloading 19,763 bytes doesn’t sound bad! Who cares where it comes from?

The Price of Missing

Unfortunately an HTTP request to Google’s JavaScript Library CDN is more expensive than an HTTP request to your own website! This is because a visitor’s browser has to perform a DNS lookup for ajax.googleapis.com and establish a new TCP connection with Google’s systems. If the additional request was to your site instead the visitor’s browser would not need to make another DNS lookup and the HTTP request would be sent over an existing HTTP connection.

Unfortunately this is a stubborn process. DNS lookups and establishing TCP connections involve a small number of very small packets. Having a faster Internet connection will not significantly impact the speed of these operations. Two different runs on WebPageTest showed that it takes 1/3 of a second for a web browser to make a connection to Google’s JavaScript Library CDN and start downloading from it. (And remember, these are CDNs so where I make the request from should not matter as the CDN makes sure I’m downloading the content from a web server that is geographically near me.)

Let me repeat that: Using Google’s JavaScript Library CDN comes with a 1/3 of a second tax on missing. (Note that a tax like this applies to opening connections to any new host: JavaScript Library CDNs, advertisers, analytics and visitor tracking, etc. This is why you should try to reduce the number of different hostnames you serve content from.) Even if this number is smaller for other users, say, 100 milliseconds, it is still a tax that is paid for using a JavaScript Library CDN and missing.

It gets worse because downloading a file over a new TCP connection with Google is slower than downloading a file over an existing TCP connection with your website! This is due to TCP’s slow start and congestion control. Newly created connections transmit data slower than existing connections do. (This is why persistent connections are so important!)

The Odds of Winning

Since JavaScript Library CDNs utilize the network effect, they are only valuable if a large number of websites use them. After all, the only way your visitors can “win” in the JavaScript Library CDN gamble is if they have already been to a site that also uses the same CDN. So, how many people actually use Google?

Well, according to the great folks at BuiltWith, only 13% of all websites use some kind of 3rd party CDN. Of those websites using a CDN, 25.56% of them are using Google’s AJAX Library API. Multiplying those two figures (13% × 25.56%), only about 3.3% of all websites surveyed are using Google’s AJAX Library API.

I wanted to gather more data than BuiltWith. I also didn’t like the way they grouped traditional CDNs (like Akamai) with JavaScript Library CDNs (like Google) and with private site-specific CDNs (like Turner’s CDN). So I performed my own survey. I visited the top 2000 sites on Alexa and analyzed each one to see who is using Google’s JavaScript Library CDN. The result? Only 69 sites out of 2000, or 3.45%, are using Google’s JavaScript Library CDN. My data is on track with BuiltWith’s data, which is good.

Unfortunately you do not vaguely or abstractly “use a JavaScript Library CDN.” You reference a specific URL for the specific JavaScript library and version number. You only get a benefit from the CDN if you are referencing the same specific URL that other websites are referencing. So we have to dig deeper and see what versions of what JavaScript libraries are in use. Below is a table of the JavaScript libraries that Alexa Top 2000 sites serve from Google’s AJAX Library API.

JavaScript Library | Number of Alexa Top 2000 sites serving the library from Google’s CDN
jQuery | 48
Prototype | 6
SWFObject | 6
YUI | 6
jQuery UI | 4
Script.aculo.us | 3
MooTools | 3
Dojo | 1

We see that 48 sites are using Google’s JavaScript Library CDN to serve jQuery, and of those 36 sites are using jQuery 1.3.2. That means jQuery 1.3.2 is used by 1.8% of the Alexa 2000 websites. SWFObject, Prototype, and YUI came in next at 6 sites each, or 0.3% of the sites. When you factor in version numbers, their penetration drops to around 0.10%.

So what is the best case here? What are the odds that someone would have jQuery 1.3.2 served from Google’s JavaScript Library CDN sitting in their browser cache? If I have a clear browser cache, and I visit 35 randomly selected websites from the Alexa top 2000, and then I visit your site, there is only a 47% chance that I will have a cached copy of jQuery 1.3.2 ready for you to use. You calculate this by first determining the probability of randomly picking 35 websites that don’t have jQuery 1.3.2 and then subtracting that from 1. The formula is: 1 - ( (1 - .018) ^ 35 ).
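If you want to play with the numbers yourself, the calculation is easy to reproduce. The sketch below treats each site visit as an independent random pick, which is a simplifying assumption; the function and variable names are just for illustration.

```python
def chance_of_cached_copy(share_of_sites, sites_visited):
    """Probability that at least one of the visited sites served the library.

    Assumes each visit is an independent random pick from the surveyed sites.
    """
    return 1 - (1 - share_of_sites) ** sites_visited

share = 36 / 2000  # sites serving jQuery 1.3.2 from Google, out of the Alexa top 2000
print(round(chance_of_cached_copy(share, 35), 2))  # prints 0.47
```

Plug in the 0.3% share of a less popular library and the same 35 visits give you only about a 10% chance.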

Those are not very good odds. And they only are applicable if you are using jQuery 1.3.2. Anything else is not practical. You also should consider the makeup of the sites on the list. I have probably only visited 30 or so of the websites listed in the Alexa top 2000 list ever and I probably only visit 5-10 with any regularity. We have determined that the odds of “winning” in the CDN gamble are fairly small. How small the odds are will depend on your site content and your visitors. However I think it is safe to say, as of January 2010, the majority of your users will not have visited a site that uses a JavaScript Library CDN for the JavaScript library that you use.

Getting More Data

So maybe the odds aren’t good. But is it still worth it to potentially help some people?

Let’s go back to our hypothetical situation where we are deciding if we should use a JavaScript CDN or not. Consider someone with a 768 kilobit per second Internet connection, where 768 * 1024 = 786,432 bits are downloaded per second. Let’s say it is operating at only 80% efficiency to account for overhead like IP, TCP, congestion, packet loss, etc. That is 629,145 bits downloaded per second, which gives us 78,643 bytes downloaded per second or 26,214 bytes downloaded in 1/3 of a second. A minified and gzipped copy of jQuery 1.3.2 is 19,763 bytes long. This means anyone using a 768 kbps Internet connection can download the contents of jQuery 1.3.2 in under 1/3 of a second. In other words, downloading jQuery 1.3.2 on that connection takes roughly the same amount of time as simply connecting to Google’s JavaScript Library CDN.
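The arithmetic above is easy to rerun for other connection speeds. Here is a short sketch; the 80% efficiency figure is the same rough assumption used in the paragraph above, and the names are made up for illustration.

```python
LINK_BITS_PER_SECOND = 768 * 1024  # a 768 kbps connection
EFFICIENCY = 0.80                  # rough allowance for protocol overhead and loss
JQUERY_BYTES = 19_763              # minified, gzipped jQuery 1.3.2

def effective_bytes_per_second(link_bits_per_second, efficiency):
    """Usable download rate in bytes per second."""
    return link_bits_per_second * efficiency / 8

def download_seconds(size_bytes, bytes_per_second):
    """Time to transfer size_bytes at the given rate, ignoring connection setup."""
    return size_bytes / bytes_per_second

rate = effective_bytes_per_second(LINK_BITS_PER_SECOND, EFFICIENCY)
print(int(rate))                                       # bytes per second
print(int(rate / 3))                                   # bytes in 1/3 of a second
print(round(download_seconds(JQUERY_BYTES, rate), 2))  # seconds to download jQuery
```

On this connection the download itself comes out at about a quarter of a second, and it only gets shorter as the connection gets faster, while the connection setup time stays roughly the same.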

This simplifies the decision in our hypothetical situation on where to host jQuery. In the locally hosted option, we are asking our visitors to download some amount of content X. X is all our HTML, images, site specific JavaScript, and includes the 19,763 bytes of jQuery 1.3.2. In the “use a CDN” option, we still have X amount of content. The only difference is the CDN has the 19,763 bytes of jQuery and our site has X - 19,763 bytes of content. If a visitor does not have a cached copy of the JavaScript library they still download a total of X amount of content. It is just served from both our website and Google. Under these conditions we are led to the following points:

  1. If you are using a CDN and the visitor does not have a cached copy, they download the site 1/3 of a second slower than if they had downloaded all the content from your web server.
  2. If you are using a CDN and the visitor does have a cached copy, they download all of the content 1/3 of a second faster than if they had downloaded all the content from your web server.

Or, more simply: If we use Google’s JavaScript Library CDN, we are asking the majority of our website visitors (who don’t have jQuery already cached) to take a 1/3 of a second penalty (the time to connect to Google’s CDN) to potentially save a minority of our website visitors (those who do have a cached copy of jQuery) 1/3 of a second (the length of time to download jQuery 1.3.2 over a 768 kbps connection).
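One way to see the trade-off is as an expected value across all your visitors. The sketch below is a simplified model under the assumptions of this post: a miss costs the connection time, a hit saves the download time, and nothing else changes. The function name and parameters are ours for illustration.

```python
def expected_change_seconds(hit_probability, connect_penalty, download_saving):
    """Average change in load time per visitor from using the CDN.

    Positive means visitors are slower on average, negative means faster.
    """
    miss_cost = (1 - hit_probability) * connect_penalty
    hit_gain = hit_probability * download_saving
    return miss_cost - hit_gain

# Even with the optimistic 47% hit rate, the average visitor is slightly slower.
print(expected_change_seconds(0.47, 1 / 3, 1 / 3) > 0)  # prints True
```

With equal penalty and saving, the CDN only breaks even when at least half of your visitors arrive with the library already cached. On faster connections the saving shrinks while the penalty does not, so the break-even point moves even higher.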

That does not make sense. It makes even less sense as the download speed of your visitors increases. Trying to avoid serving 20 or 30 kilobytes of content at the cost of depending on a 3rd party just doesn’t make sense.

Conclusions

JavaScript Library CDNs use the network effect. Our survey of the Alexa 2000 shows that right now there are too few people in the network to get any value. Only Google’s AJAX Library API has anywhere near the penetration to provide any benefit and only if you are using a specific version of a single JavaScript library. Even in that remote case, serving jQuery 1.3.2 using Google will slow down the majority of your users for the benefit of a possibly nonexistent minority. Zoompf recommends the vast majority of websites avoid using JavaScript Library CDNs until they gain more market penetration.

I will discuss the very select group of sites that should use CDNs, as well as some other interesting data discovered while surveying the Alexa 2000 in posts early next week.

Want to see what performance problems you have? Using JavaScript Library CDNs appropriately is just one of the 200+ performance issues Zoompf detects while assessing your web applications for performance. You can sign up for a free mini web performance assessment at Zoompf.com today!