Cruft inside Microsoft Word HTML files
We were recently on-site with a client helping them fix some issues when we happened to see this directory containing some HTML files.
Well that’s odd. Why do some of those HTML files have one icon and different HTML files have another icon? We examined the source code for one of the HTML files with the odd icon and saw this:
Turns out these HTML files were created by Microsoft Word! Due to a series of different web designs and designers over a number of years, as well as a healthy bit of editing by the marketing department, 1 in 4 web pages on our client’s current website had been created or modified using Microsoft Word!
As we scrolled through the HTML file we saw large amounts of extra data that no normal web browser would ever interpret. A little research explained it for us. Microsoft allows you to save a document as an HTML file. They also want you to be able to open an HTML file that was created using Microsoft Office and resume editing it just like a normal document. Since Microsoft Office has all sorts of features that HTML and CSS don’t, Office preserves the extra information it needs inside the HTML file between edits.
Some of the data stored is obvious: when the document was created and by whom, who made what edits when, paragraph count, etc. Other, less obvious data such as VML, DHTML behaviors, column and page spacing, Word styling information, embedded object data, and more is also stored inside the file. All of this Office-specific data is stored inside the HTML file and is wrapped inside special conditional comments such as <!--[if gte mso 9]>. This hides the content from other programs that read the HTML. Furthermore, Word isn’t the only Office program that inserts this extra data into HTML files. Excel does too.
Keep in mind we are not talking about the general bloat that WYSIWYG HTML editors tend to add. Bloat such as empty <P> tags, large numbers of &nbsp; entries, table-based layouts, and overly long style attributes are all hallmarks of WYSIWYG editors. However, this is beyond that. This is extra data that is used exclusively by Office and is completely ignored by all web browsers that don’t support conditional comments (in other words, any program besides Internet Explorer). In fact, the data is ignored by Internet Explorer as well, since the conditional comments apply only to Microsoft Office and not to any version of IE.
So we have a bunch of useless cruft inside of these HTML files. Not a big deal, right? Unfortunately, all this useless data has a cost. Of the files we sampled, we found that 20-35% of the HTML content was Microsoft Office-specific data. That means 20-35% of the bytes going down the pipe to a user are completely wasted for these files.
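If you want to check what share of your own pages is Office-specific data, a rough measurement can be scripted. The following is a minimal sketch, not the tooling we used: the function name office_fraction and the regular expression are our own illustrative choices, and it assumes the Office-only data sits in conditional comments of the form <!--[if gte mso 9]> ... <![endif]-->.

```python
import re

# Assumption: Office-only data lives in "downlevel-hidden" conditional
# comments whose condition tests mso, e.g. <!--[if gte mso 9]> ... <![endif]-->.
# Negated conditions such as [if !mso] are deliberately not matched.
MSO_BLOCK = re.compile(
    r"<!--\[if\s+(?:(?:gte|gt|lte|lt)\s+)?mso\b[^\]]*\]>.*?<!\[endif\]-->",
    re.DOTALL | re.IGNORECASE,
)


def office_fraction(html: str) -> float:
    """Return the share of UTF-8 bytes taken up by mso conditional comments."""
    total = len(html.encode("utf-8"))
    if total == 0:
        return 0.0
    cruft = sum(len(m.group(0).encode("utf-8")) for m in MSO_BLOCK.finditer(html))
    return cruft / total
```

The result is an estimate. It counts bytes after re-encoding the text as UTF-8, so a file saved by Word in another encoding may differ slightly on the wire, and any Office-specific markup outside these comments is not counted.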
Cleaning Up the Cruft
Luckily, Word includes an option that allows you to save a filtered HTML file. A filtered HTML file will not contain any of this useless Microsoft Office-specific data. Under “Save As” you want to select the “Web Page, Filtered” option as shown below.
If you don’t happen to have a copy of Office around (or you have a few hundred HTML files to clean) you can still remove this useless content. Since all of this extra data is stored in conditional comments that are looking for the “mso” user agent you can easily write a regular expression to remove it. In fact you should create a script that detects and removes this extra data and include it as part of your publishing process.
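As a starting point for such a script, here is a minimal sketch in Python. The names strip_office_data and contains_office_data are hypothetical, and the pattern is an assumption about what the comments look like rather than a complete description of what Word emits. It only targets comments that open with a positive mso test, such as <!--[if gte mso 9]>, and close with <![endif]-->.

```python
import re

# Assumption: the data to remove is wrapped in comments that open with a
# positive "mso" test and close with <![endif]-->. Blocks such as
# <!--[if !mso]> and the non-comment form <![if ...]> are left untouched,
# because their contents may be meant for browsers.
MSO_BLOCK = re.compile(
    r"<!--\[if\s+(?:(?:gte|gt|lte|lt)\s+)?mso\b[^\]]*\]>.*?<!\[endif\]-->",
    re.DOTALL | re.IGNORECASE,
)


def contains_office_data(html: str) -> bool:
    """Detect whether the page still carries mso conditional comments."""
    return MSO_BLOCK.search(html) is not None


def strip_office_data(html: str) -> str:
    """Return the page with every mso conditional comment removed."""
    return MSO_BLOCK.sub("", html)
```

In a publishing step you might call contains_office_data on each outgoing file to flag it, then write back the output of strip_office_data. This sketch does not try to reproduce everything “Web Page, Filtered” does; markup outside the conditional comments is left as-is, so compare a cleaned page against the original in a browser before trusting it across a whole site.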
But I Would Never Edit HTML with Word!
I’m sure that you wouldn’t. But do you create all the content for your organization’s websites? Do you hand vet every piece of content before it goes out the door? At Zoompf, we have clients with over a hundred web properties, produced by hundreds of individual content providers, both internal and external, who report into dozens of different departments. You better believe stuff like this slips through the cracks all the time.
There is also a huge install base and large user base of WYSIWYG HTML editors. Microsoft just sold $4.75 billion worth of Office in Q2 of Fiscal Year 2010 alone! Adobe’s Creative Suite with Dreamweaver is wildly popular, as are other WYSIWYG tools. And that is not to mention the 15 years or so of legacy content already on the Internet that was written using who knows what kind of tool or coding standard.
So yes, the ideal should be “we should never write bloated web pages.” However the reality is “this happens and we need tools and processes to ensure we do not publish bloated web pages.” Checking for bloated web pages produced by tools like Microsoft Word is part of what web performance optimization is about.
Want to see what performance problems your website has? Finding unfiltered Microsoft Office HTML documents is just one of the 200+ performance issues Zoompf detects when checking your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!