Cruft inside Microsoft Word HTML files

Posted: January 28, 2010 at 11:58 pm

We were recently on-site with a client helping them fix some issues when we happened to see this directory containing some HTML files.

Well that’s odd. Why do some of those HTML files have one icon and different HTML files have another icon? We examined the source code for one of the HTML files with the odd icon and saw this:

Turns out these HTML files were created by Microsoft Word! Because of a series of different web designs and designers over a number of years, as well as a healthy bit of editing by the marketing department, 1 in 4 web pages of our client’s current website were created or modified using Microsoft Word!

As we scrolled through the HTML file we saw large amounts of extra data that no normal web browser would ever interpret. A little research explained it for us. Microsoft allows you to save a document as an HTML file. They also want you to be able to open an HTML file that was created using Microsoft Office and resume editing it just like a normal document. Since Microsoft Office has all sorts of features that HTML and CSS don’t, Office preserves certain information inside the HTML file between edits.

Some of the data stored is obvious: when the document was created and by whom, who made what edits when, paragraph count, etc. Other less obvious data such as VML, DHTML behaviors, column and page spacing, Word styling information, embedded object data, and more is also stored inside the file. All of this Office specific data is stored inside the HTML file and is wrapped inside special conditional comments such as <!--[if gte mso 9]>. This hides the content from other programs that read the HTML. Furthermore, Word isn’t the only Office program that inserts this extra data into HTML files. Excel does too.

Keep in mind we are not talking about the general bloat that WYSIWYG HTML editors tend to add. Empty <P> tags, large numbers of &nbsp; entries, table based layouts, and overly long style attributes are all hallmarks of WYSIWYG editors. However this is beyond that. This is extra data that is used exclusively by Office and is completely ignored by all web browsers that don’t support conditional comments (in other words any program besides Internet Explorer). In fact, the data is ignored by Internet Explorer as well since the conditional comments apply only to Microsoft Office and not to any version of IE.

So we have a bunch of useless cruft inside of these HTML files. Not a big deal right? Unfortunately all this useless data has a cost. Of the files we sampled we found that 20-35% of the HTML content was Microsoft Office specific data. That means 20-35% of the bytes going down the pipe to a user are completely wasted for these files.

Cleaning Up the Cruft

Luckily Word includes an option that allows you to save a filtered HTML file. A filtered HTML file will not contain any of this useless Microsoft Office specific data. Under “Save As” you want to select the “Web Page, Filtered” option as shown below.

If you don’t happen to have a copy of Office around (or you have a few hundred HTML files to clean) you can still remove this useless content. Since all of this extra data is stored in conditional comments that are looking for the “mso” user agent you can easily write a regular expression to remove it. In fact you should create a script that detects and removes this extra data and include it as part of your publishing process.
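As a rough illustration of that regular expression approach, here is a minimal Python sketch. The pattern and the function names are our own assumptions rather than a tested production filter: it removes only conditional comments whose condition targets “mso” (for example <!--[if gte mso 9]> ... <![endif]-->) and deliberately leaves other conditional comments, such as [if IE] or [if !mso], untouched.

```python
import re

# Assumed pattern: downlevel-hidden conditional comments whose condition
# targets Office, e.g. <!--[if gte mso 9]> ... <![endif]-->.
# Conditions such as [if IE] or [if !mso] do not match and are kept.
MSO_COMMENT = re.compile(
    r"<!--\[if\s+(?:(?:gte|gt|lte|lt)\s+)?mso\b[^\]]*\]>.*?<!\[endif\]-->",
    re.IGNORECASE | re.DOTALL,
)


def strip_office_cruft(html: str) -> str:
    """Return the HTML with Office-only conditional comments removed."""
    return MSO_COMMENT.sub("", html)


def cruft_share(html: str) -> float:
    """Fraction of the document's characters that the filter removes."""
    if not html:
        return 0.0
    return 1 - len(strip_office_cruft(html)) / len(html)
```

Running cruft_share over a directory of pages before publishing is one way to spot unfiltered Office documents. Always compare a cleaned page against the original in a browser before trusting the pattern on a whole site.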

But I Would Never edit HTML with Word!

I’m sure that you wouldn’t. But do you create all the content for your organization’s websites? Do you hand vet every piece of content before it goes out the door? At Zoompf, we have clients with over a hundred web properties, produced by hundreds of individual content providers, both internal and external, who report into dozens of different departments. You better believe stuff like this slips through the cracks all the time.

There is also a huge install base and large user base of WYSIWYG HTML editors. Microsoft just sold $4.75 billion worth of Office in Q2 of Fiscal Year 2010 alone! Adobe’s Creative Suite with DreamWeaver is wildly popular, as are other WYSIWYG tools. And that is not to mention the 15 years or so of legacy content on the Internet already that was written using who knows what kind of tool or coding standard.

So yes, the ideal should be “we should never write bloated web pages.” However the reality is “this happens and we need tools and processes to ensure we do not publish bloated web pages.” Checking for bloated web pages produced by tools like Microsoft Word is part of what web performance optimization is about.

Want to see what performance problems your website has? Finding unfiltered Microsoft Office HTML documents is just one of the 200+ performance issues Zoompf detects when checking your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!

Apple’s iPad and Web Caching

Posted: January 27, 2010 at 5:10 pm

Like most tech folks, I spent the afternoon watching and reading about Apple’s new iPad. To call it beautiful and innovative is an understatement. I want to purchase one. As in, right now. At a $500 price point I wouldn’t even consider buying a netbook. Since I already have a netbook, I’m seriously considering replacing it with an iPad because web browsing looks amazing on the iPad. After all, Steve Jobs himself promised me “It is the best browsing experience you’ve ever had.”

Only I’m not sure if that’s true.

Web Performance on the iPhone

We always talk about web performance as something that only the site owners care about. Few people talk about web performance when it comes to choosing a browser. Certainly no one talks about choosing a browser based on simple performance features like which one supports compression, or caching, or conditional requests, or resumable downloads. That’s because this isn’t 1997 and all the browsers do these basic features equally well.

Only I am sure that’s not true.

Stoyan Stefanov wrote an excellent and detailed article on how the cache for Safari on the iPhone works, or rather, doesn’t work. You should read the entire article. For this article we are most interested in two shortcomings of iPhone Safari’s disk cache that severely impact the browsing experience.

Resources > 15K aren’t cached.

No resource larger than 15 kilobytes will be cached. This is pretty horrible. 15K is not a lot of content. Worst of all, that’s the uncompressed size, so HTTP compression will not help you fit an otherwise oversized resource into the cache. So how big is 15K? No modern JavaScript library is less than 15K. A quick check of the Top 10 non-search engine websites shows that none of them have CSS files that will fit in the cache. Images tend to be small enough to fit, however CSS sprites can quickly get too large to fit.

Total Cache Size is only 1.5 Megs.

Safari on the iPhone will not cache more than a total of 1.5 megabytes of content. This is a ridiculously small cache. Your computer’s processor has an L2 data cache etched into the silicon of the chip that is 50%-200% larger than iPhone Safari has for a web cache. At first glance you might think this is completely horrible. The main page of CNN and all its JavaScript, CSS, and images weighs in at 752 kilobytes and would consume about half of Safari’s cache! And that’s just one website! However, as we just mentioned, any resource over 15K doesn’t get cached at all. So the first failing of iPhone Safari’s cache makes the second failing a little less painful!

The moral here is that 1.5 megs of cache is just way too small to be helpful. Furthermore, the cache can get cleared inadvertently in several ways, such as closing Safari without certain tabs or some types of powering the iPhone up and down. This means the meager assistance the cache provides can be undercut.

These two limits mean the disk cache for Safari on the iPhone can reasonably store a few hundred objects. How quickly does that fill up? Of the 32 images on the main page of CNN right now, 29 of them are less than 15K and would get cached. (Ironically the photo of Steve Jobs holding an iPad is too large to be cached).
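The two limits are easy to model. The sketch below is only an approximation under stated assumptions: it treats 15K as 15 * 1024 bytes and 1.5 megs as 1.5 * 1024 * 1024 bytes, fills the cache in request order, and does not model eviction at all, since how Safari actually evicts entries is not covered here. The function names are ours.

```python
# Assumed byte values for the two limits described above.
MAX_RESOURCE_BYTES = 15 * 1024                # per-resource limit (uncompressed size)
MAX_CACHE_BYTES = int(1.5 * 1024 * 1024)      # total cache size


def cacheable(uncompressed_size: int) -> bool:
    """True if a single resource is small enough to be cached at all."""
    return uncompressed_size <= MAX_RESOURCE_BYTES


def cached_total(uncompressed_sizes) -> int:
    """Bytes that end up cached when resources arrive in the given order.

    Simplification: once the cache is full, later resources are skipped;
    no eviction policy is modeled.
    """
    total = 0
    for size in uncompressed_sizes:
        if cacheable(size) and total + size <= MAX_CACHE_BYTES:
            total += size
    return total
```

For example, cached_total([1000, 20000, 2000]) skips the 20,000 byte resource and reports 3,000 bytes cached. Under these assumptions the cache holds at most 102 resources that are right at the 15K limit, and more only if the resources are smaller.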

“It is the best browsing experience you’ve ever had.”

The long and short of it is the version of Safari that runs on the iPhone is just awful when it comes to caching. And as we know, the fastest request a browser can make is none at all. As such, caching is an important aspect of web performance optimization, caching directly affects page load times, and caching is critical to the end user’s web browsing experience.

So far, it seems like much of the iPad is running the iPhone OS with the iPhone apps. If this is the case then I am not hopeful about the web browsing experience on the iPad. If Apple is really going to give “the best browsing experience you’ve ever had” they simply must improve the web caching for Safari on the iPad. Otherwise the iPad will be like a DeLorean when it comes to web browsing: beautiful, but underpowered.

Want to see what performance problems you have? Which web resources are cachable on the iPhone is just one of the 200+ performance issues Zoompf detects while assessing your web applications for performance. You can sign up for a free mini web performance assessment at Zoompf.com today!

Foreign Object Detected

Posted: January 19, 2010 at 12:55 am

We are getting very excited as Zoompf continues to expand. We are adding new clients, gaining more mentions and followers on Twitter, and every day more web developers and IT administrators receive a free miniature performance assessment. Since the start of the new year, we have been getting increased inquiries from people in Europe in particular. A few of these European performance junkies have asked whether Zoompf will work with non-English websites.

The answer is yes.

Zoompf crawls and analyzes your website for over 200 performance issues. The vast majority of those checks don’t examine the web content itself. Instead they are looking at HTTP headers, HTML tags, link relationships and structure, image meta data, Silverlight manifests, or Flash tags. All of these checks find performance issues regardless of whether the site’s content is written in Belgian Dutch or Polish or English.

At Zoompf, our goal is to help you make your websites blazingly fast. But, if in the process, we notice that the font file you are trying to dynamically load inside of the CSS for that cool theme is throwing a Java stack trace, shouldn’t we tell you that too?

We think so.

That’s why Zoompf includes a handful of additional quality issues that look for defects such as application, framework, or web server error messages. Since error messages tend to be in English, these quality checks look for the English version of these error messages. So in that regard there are some English-only checks in Zoompf, but they are not looking for performance issues.

A good example of these English-only error message checks is database error messages. Zoompf will flag web pages that contain database error messages such as a MySQL database connection error. You would be amazed at how often you will see these on the Internet! But if someone has a localized German database server running that returns database error messages in German, Zoompf would not be able to detect these errors. Keep in mind that all of the English-only checks are for general website quality issues. There are no performance checks in Zoompf that are English specific. Consider these extra quality checks a bonus that no other tools provide! They help flag other, serious issues with your web application that Zoompf noticed while looking for performance issues.
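To show why a check like this is tied to English, here is a small hypothetical sketch. It is not how Zoompf implements its checks. The signature list and function name are assumptions for illustration, built from two well-known English MySQL error texts.

```python
import re

# Hypothetical, English-only signatures. A real checker would carry a much
# longer list covering many databases, frameworks, and web servers.
ERROR_SIGNATURES = [
    re.compile(r"You have an error in your SQL syntax", re.IGNORECASE),
    re.compile(r"Can't connect to MySQL server", re.IGNORECASE),
]


def find_error_messages(page_text: str) -> list:
    """Return the signature patterns that appear in the page text."""
    return [sig.pattern for sig in ERROR_SIGNATURES if sig.search(page_text)]
```

A page containing the English connection error is flagged, while the same failure reported in another language produces an empty list, which is exactly the blind spot described above.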

Today we are adding a new check, #280, which flags web pages with non-English content. This is to help our customers understand which web pages contain content in other languages and could have extra non-performance issues that Zoompf could not detect.

Thanks for all the interest and excitement in Zoompf. It’s going to be a fun year!

Should You Use JavaScript Library CDNs?

Posted: January 15, 2010 at 1:40 pm

The concept is simple. Hundreds of thousands of websites use JavaScript libraries like jQuery or Prototype. Different websites you visit each download another identical copy of these libraries. You probably have a few dozen copies of jQuery in your browser’s cache right now. That’s silly. We should fix that.

How? Well, if there was a 3rd party repository of common JavaScript libraries, websites could simply load their JavaScript files from them. Now imagine the repository implemented caching. SiteA, SiteB, and SiteC all have <SCRIPT SRC> tags that reference http://some-code-respo.com/javascript/jquery.js. When someone visits any one of these sites, the JavaScript library jQuery is downloaded and cached. If that same person visits one of the other sites, that person will not have to re-download jQuery again. The idea is that sites will load faster because these libraries should not have to be re-downloaded very often at all. Of course, this only works if a lot of people all use the common repository. If only a few people use the common repository, then virtually no one benefits because the library will not have been downloaded and cached by a previous website and has to be re-downloaded.

This is an example of the Network effect. The more people that use a system the more valuable the system becomes.

Implementations of this idea of a central shared repository of common JavaScript libraries are called several different things. Google calls their implementation Google AJAX Library API. Yahoo doesn’t have a clear name for their implementation. I’ve seen “Free YUI hosting” or “YUI Dependencies”, or even Yahoo YUI CDN. Microsoft calls their implementation the Microsoft AJAX CDN. To keep things simple, I will collectively refer to these repositories of common JavaScript libraries as JavaScript Library CDNs.

JavaScript Library CDNs seem like a performance no-brainer. Use the service, and your site loads faster and consumes less bandwidth. This post will explore whether, and under what conditions, a JavaScript Library CDN actually improves web performance.

The Choice

Consider this situation. You are a speed-conscious web developer. You have a website that uses jQuery 1.3.2 as well as some additional site specific JavaScript. Because you value web performance, you know you should concatenate all your JavaScript files into as few files as possible, minify them, and serve them using gzip compression. You have 2 choices:

  1. Serve all your JavaScript locally. You will have a single <SCRIPT SRC> tag that points to a JavaScript file containing jQuery 1.3.2 and your site specific JavaScript.
  2. Serve some of the JavaScript using a JavaScript Library CDN. You will have 2 <SCRIPT SRC> tags. The first tag will point to a single file on your website containing your site specific JavaScript files. The second tag will point to the copy of jQuery 1.3.2 on Google AJAX Library API.

What’s the difference? Well, a minified, gzipped copy of jQuery 1.3.2 is 19,763 bytes in length. If you choose option 1 all your users will have to download these 19,763 bytes regardless of what other sites they may have already visited. That’s the cost: downloading 19,763 bytes. Notice there is no cost of an additional HTTP request and response or other overhead because those bytes of jQuery content are included inside the response for the site specific JavaScript content, a request the visitor already has to make. This is important, so I will repeat: The cost of not using a JavaScript Library CDN is only the downloading of JavaScript content and not any additional HTTP requests or overhead.

In the second option, you are going to gamble with a JavaScript Library CDN. You are hoping a visitor has already browsed another website which also uses Google to serve jQuery 1.3.2. If you are right, then that visitor does not need to download 19,763 bytes. If you are wrong, the visitor needs to download 19,763 bytes from Google. That’s the prize in a nutshell. And downloading 19,763 bytes doesn’t sound bad! Who cares where it comes from?

The Price of Missing

Unfortunately an HTTP request to Google’s JavaScript Library CDN is more expensive than an HTTP request to your own website! This is because a visitor’s browser has to perform a DNS lookup for ajax.googleapis.com and establish a new TCP connection with Google’s systems. If the additional request was to your site instead the visitor’s browser would not need to make another DNS lookup and the HTTP request would be sent over an existing HTTP connection.

Unfortunately this is a stubborn process. DNS lookups and establishing TCP connections involve a small number of very small packets. Having a faster Internet connection will not significantly impact the speed of these operations. Two different runs on WebPageTest showed that it takes 1/3 of a second for a web browser to make a connection to Google’s JavaScript Library CDN and start downloading from it. (And remember, these are CDNs so where I make the request from should not matter as the CDN makes sure I’m downloading the content from a web server that is geographically near me.)

Let me repeat that: Using Google’s JavaScript Library CDN comes with a 1/3 of a second tax on missing. (Note that a tax like this applies to opening connections to any new host: JavaScript Library CDNs, advertisers, analytics and visitor tracking, etc. This is why you should try to reduce the number of different hostnames you serve content from.) Even if this number is smaller for other users, say, 100 milliseconds, it is still a tax that is paid for using a JavaScript Library CDN and missing.

It gets worse because downloading a file over a new TCP connection with Google is slower than downloading a file over an existing TCP connection with your website! This is due to TCP’s slow start and congestion control. Newly created connections transmit data slower than existing connections do. (This is why persistent connections are so important!)

The Odds of Winning

Since JavaScript Library CDNs utilize the Network Effect, they are only valuable if a large number of websites use them. After all, the only way your visitors can “win” in the JavaScript Library CDN gamble is if they have already been to a site that also uses the same CDN. So, how many people actually use Google?

Well, according to the great folks at BuiltWith, only 13% of all websites use some kind of 3rd party CDN. Of those websites using a CDN, 25.56% of them are using Google’s Ajax Library API. So only 3.89% of all websites surveyed are using Google’s AJAX Library API.

I wanted to gather more data than BuiltWith. I also didn’t like the way they grouped Traditional CDNs (like Akamai) with JavaScript Library CDNs (like Google) with private site-specific CDNs (like Turner’s CDN). So I performed my own survey. I visited the top 2000 sites on Alexa and analyzed each one to see who is using Google’s JavaScript Library CDN. The result? Only 69 sites out of 2000, or 3.45%, are using Google’s JavaScript Library CDN. My data is on track with BuiltWith’s data, which is good.

Unfortunately you do not vaguely or abstractly “use a JavaScript Library CDN.” You reference a specific URL for the specific JavaScript Library and version number. You only get a benefit from the CDN if you are referencing the specific URL that other websites are referencing. So we have to dig deeper and see what versions of what JavaScript libraries are in use. Below is a table of the JavaScript libraries that Alexa Top 2000 sites serve from Google’s AJAX Library API.

JavaScript Library    Number of Alexa Top 2000 sites serving the library from Google’s CDN
jQuery                48
Prototype             6
SWFObject             6
YUI                   6
jQuery UI             4
Script.aculo.us       3
MooTools              3
Dojo                  1

We see that 48 sites are using Google’s JavaScript Library CDN to serve jQuery, and of those 36 sites are using jQuery 1.3.2. That means jQuery 1.3.2 is used by 1.8% of the Alexa 2000 websites. Prototype, SWFObject, and YUI came in next at 6 sites each, or 0.3% of the sites. When you factor in version numbers, their penetration drops to around 0.10%.

So what is the best case here? What are the odds that someone would have jQuery 1.3.2 served from Google’s JavaScript Library CDN sitting in their browser cache? If I have a clear browser cache, and I visit 35 randomly selected websites from the Alexa top 2000, and then I visit your site, there is only a 47% chance that I will have a cached copy of jQuery 1.3.2 ready for you to use. You calculate this by first determining the probability of randomly picking 35 websites that don’t serve jQuery 1.3.2 from Google and then subtracting that from 1. The formula is: 1 – ( (1 – .018) ^ 35 ).
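The same calculation as a short Python sketch. The function name is ours, and it assumes each visited site is an independent random pick from the surveyed list, which is a simplification of real browsing habits.

```python
def chance_of_cached_copy(sites_using: int, sites_surveyed: int, sites_visited: int) -> float:
    """Probability that at least one of the visited sites served the library.

    Assumes every visit is an independent random pick from the surveyed sites.
    """
    share = sites_using / sites_surveyed           # e.g. 36 / 2000 = 0.018
    chance_all_miss = (1 - share) ** sites_visited
    return 1 - chance_all_miss


# 36 of the Alexa Top 2000 sites serve jQuery 1.3.2 from Google; 35 sites visited.
jquery_132_odds = chance_of_cached_copy(36, 2000, 35)
print(f"{jquery_132_odds:.0%}")
```

With the survey numbers this prints 47%. Plugging in a library used by only 6 of the 2000 sites shows how quickly the odds fall for anything other than jQuery 1.3.2.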

Those are not very good odds. And they only are applicable if you are using jQuery 1.3.2. Anything else is not practical. You also should consider the makeup of the sites on the list. I have probably only visited 30 or so of the websites listed in the Alexa top 2000 list ever and I probably only visit 5-10 with any regularity. We have determined that the odds of “winning” in the CDN gamble are fairly small. How small the odds are will depend on your site content and your visitors. However I think it is safe to say, as of January 2010, the majority of your users will not have visited a site that uses a JavaScript Library CDN for the JavaScript library that you use.

Getting More Data

So maybe the odds aren’t good. But is it still worth it to potentially help some people?

Let’s go back to our hypothetical situation where we are deciding if we should use a JavaScript CDN or not. Consider someone with a 768 kilobit per second Internet connection, where 768 * 1024 = 786,432 bits are downloaded per second. Let’s say it is operating at only 80% efficiency to account for overhead like IP, TCP, congestion, packet loss, etc. That is 629,145 bits downloaded per second, which gives us 78,643 bytes downloaded per second or 26,214 bytes downloaded in 1/3 of a second. A minified and gzipped copy of jQuery 1.3.2 is 19,763 bytes long. This means anyone using a 768 kbps Internet connection can download the contents of jQuery 1.3.2 within 1/3 of a second. In other words, downloading jQuery 1.3.2 on that connection takes roughly the same amount of time as simply connecting to Google’s JavaScript Library CDN.
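The arithmetic above, spelled out as a small Python sketch so you can substitute your own visitors’ connection speed. The constant names are ours; the 80% efficiency figure and the 1/3 of a second connection time are the assumptions already made in this post.

```python
LINK_BITS_PER_SECOND = 768 * 1024    # 768 kbps connection = 786,432 bits per second
EFFICIENCY = 0.80                    # assumed allowance for IP/TCP overhead, loss, etc.
JQUERY_BYTES = 19_763                # minified, gzipped jQuery 1.3.2
CONNECT_SECONDS = 1 / 3              # measured time to connect to the CDN

usable_bytes_per_second = LINK_BITS_PER_SECOND * EFFICIENCY / 8
bytes_per_connect_window = usable_bytes_per_second * CONNECT_SECONDS
jquery_download_seconds = JQUERY_BYTES / usable_bytes_per_second

print(int(usable_bytes_per_second))        # bytes per second after overhead
print(int(bytes_per_connect_window))       # bytes that fit in 1/3 of a second
print(round(jquery_download_seconds, 2))   # seconds to download jQuery 1.3.2
```

On these numbers the download takes about a quarter of a second, a little under the 1/3 of a second it takes just to connect to the CDN.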

This simplifies the decision in our hypothetical situation on where to host jQuery. In the locally hosted option, we are asking our visitors to download some amount of content X. X is all our HTML, images, site specific JavaScript, and includes the 19,763 bytes of jQuery 1.3.2. In the “use a CDN” option, we still have X amount of content. The only difference is the CDN has the 19,763 bytes of jQuery and our site has X – 19,763 bytes of content. If a visitor does not have a cached copy of the JavaScript library they still download a total of X amount of content. It is served from our website and from Google. Under these conditions we are led to the following points:

  1. If you are using a CDN and the visitor does not have a cached copy, they download the site 1/3 of a second slower than if they had downloaded all the content from your web server.
  2. If you are using a CDN and the visitor does have a cached copy, they download all of the content 1/3 of a second faster than if they had downloaded all the content from your web server.

Or, more simply: If we use Google’s JavaScript Library CDN, we are asking the majority of our website visitors (who don’t have jQuery already cached) to take a 1/3 of a second penalty (the time to connect to Google’s CDN) to potentially save a minority of our website visitors (those who do have a cached copy of jQuery) 1/3 of a second (the length of time to download jQuery 1.3.2 over a 768 kbps connection).
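One way to put a number on this trade-off is an expected-value calculation. The sketch below is our own framing, not a measurement: it assumes every cache miss costs the 1/3 of a second connection penalty and every cache hit saves the 1/3 of a second download, as in the two points above.

```python
def expected_change_seconds(hit_rate: float,
                            connect_penalty: float = 1 / 3,
                            download_saving: float = 1 / 3) -> float:
    """Average change in load time per visitor from using the CDN.

    Positive means visitors are slower on average; negative means faster.
    hit_rate is the fraction of visitors with the library already cached.
    """
    return (1 - hit_rate) * connect_penalty - hit_rate * download_saving
```

With equal penalty and saving, the CDN only breaks even once half of your visitors arrive with a cached copy. Even the optimistic 47% hit rate calculated earlier leaves the average visitor slightly worse off, and a faster connection shrinks download_saving, which pushes the break-even point higher still.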

That does not make sense. It makes even less sense as the download speed of your visitors increases. Trying to avoid serving 20 or 30 kilobytes of content at the cost of using a 3rd party just doesn’t make sense.

Conclusions

JavaScript Library CDNs use the network effect. Our survey of the Alexa 2000 shows that right now there are too few people in the network to get any value. Only Google’s AJAX Library API has anywhere near the penetration to provide any benefit and only if you are using a specific version of a single JavaScript library. Even in that remote case, serving jQuery 1.3.2 using Google will slow down the majority of your users for the sake of a possibly nonexistent minority. Zoompf recommends the vast majority of websites avoid using JavaScript Library CDNs until they gain more market penetration.

I will discuss the very select group of sites that should use CDNs, as well as some other interesting data discovered while surveying the Alexa 2000 in posts early next week.

Want to see what performance problems you have? Using JavaScript Library CDNs appropriately is just one of the 200+ performance issues Zoompf detects while assessing your web applications for performance. You can sign up for a free mini web performance assessment at Zoompf.com today!

Top PNG Optimizers Don't Use zlib

Posted: January 5, 2010 at 12:26 pm

Oleg Kikin has an interesting chart comparing the performance of multiple different PNG optimizing tools. The tools tested are:

Go take a look at the PNG comparison chart. I can wait.

So what do these results mean? Well, I believe it shows how far image optimization has come in the last 2 years. Tools that just manipulate the parameters for the stock DEFLATE compressor code that is included in the zlib compression library and remove extra PNG chunks no longer produce the smallest optimized image. PNGOut and AdvanceCOMP produce the smallest PNGs because they use custom DEFLATE compressors that achieve better compression than zlib’s implementation. PNGOut’s DEFLATE compressor was written from scratch and AdvanceCOMP uses the custom DEFLATE compressor written for 7Zip. We’ve talked about 7Zip and DEFLATE before in the Rezipping Web Resources for Fun and Profit post. I used 7Zip for my rezipping work because its optimized DEFLATE compressor compresses data better than the DEFLATE compressor in zlib. This in turn produces smaller ZIP files, but the same logic applies to image formats that use DEFLATE.
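To get a feel for what “manipulating the parameters” of the stock compressor means, here is a minimal Python sketch using the standard zlib module. It is not what any of the tools above actually do, and it works on raw bytes rather than PNG image data. It simply tries every zlib compression level and reports the resulting sizes; the function name is ours.

```python
import zlib


def deflate_sizes(data: bytes) -> dict:
    """Compressed size of `data` at each zlib compression level from 1 to 9."""
    return {level: len(zlib.compress(data, level)) for level in range(1, 10)}


def smallest_with_zlib(data: bytes) -> bytes:
    """Keep the smallest output that zlib produces across all levels."""
    candidates = [zlib.compress(data, level) for level in range(1, 10)]
    return min(candidates, key=len)
```

However you tune the level, the output comes from the same zlib encoder. A custom DEFLATE compressor like the ones in PNGOut or 7Zip can search for encodings that zlib never tries, which is where the additional savings come from.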

Unfortunately I cannot find any information about the command line options Oleg used with each tool.

It is interesting to note the difference between Smush.it and PNGCrush. According to Smush.it’s information page it is using PNGCrush under the covers. Any difference in the output of Smush.it and PNGCrush is entirely from the command line options that we know nothing about. It would be possible to reverse engineer what Smush.it is doing by using the service and comparing the output. I imagine they are using the -m option instead of the -brute option to reduce the number of rounds of PNGCrush and improve the response speed of the Smush.it web service.

What we really need is a web service that accepts images and tries several different optimization tools. Smush.it has hinted at this for a while now in their FAQ but improvements to the tool seem to have stalled since Yahoo took it over (to say nothing of the un-sexy-fying of the Smush.it UI). Hopefully something like this will appear.

Want to see what performance problems you have? Unoptimized PNG images are just one of the 200+ performance issues Zoompf detects while assessing your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!