November 30, 2009

Rezipping Web Resources for Fun and Profit

One large area of web performance optimization is reducing the size of your content. Most people know about obvious techniques like HTTP compression, minifying, or removing extra data from images. However, there is one size-reduction technique that does not seem to be common knowledge among web performance junkies: rezipping.

Let us start with a little background. Zip archives consist of multiple compressed files that are packaged together into a single file. Zip archives are compressed using the DEFLATE compression algorithm. DEFLATE supports different compression levels from 1 to 9. These compression levels provide a trade-off between the CPU and memory resources used to create the Zip file and the size of the resulting Zip file. Using a higher compression level consumes more resources but you end up with a smaller file. Most Zip programs tend to create Zip archives using a compression level of 5 or 7. While this can be a good trade-off, as the file is created quickly and is reasonably compressed, it will not produce the smallest file possible.

Now all that is well and good. But why should frontend web developers care about Zip file optimization? Simple: Many of the most common files on the Internet are actually Zip files. By creating methods to make smaller Zip files we are actually optimizing multiple different types of web files. Optimizing these files will reduce bandwidth consumption and server load while improving page load times.

These “Files that don’t end in .zip but really are Zip Files” use the Zip file format as a kind of wrapper to collect all the bits and pieces that really make up the file and store them in a single compressed unit. For example, Silverlight applications have a XAP file extension. However, a Silverlight application is just a Zip file containing compiled byte code, resources like images and sounds, and other configuration. Java Applets contained in JAR files are Zip files. All of Microsoft Office’s OOXML documents (DOCX, XLSX, PPTX, etc) are Zip files. All of OpenOffice.org’s ODF documents (ODT, ODP, ODS, etc) are Zip files. You can rename any of these types of files to “.zip” and open them with any Zip program.

Since all of these common web files are simply Zip files, we can optimize them to improve web performance and reduce operational costs. This is where rezipping comes in. Rezipping is the process of recompressing a Zip file to create a smaller file. The process is simple: you take any Zip file, unzip the contents, and then rezip the contents at a higher compression level. To accomplish this, I am using the command line version of 7zip. 7zip’s implementation of the DEFLATE compressor is generally considered to compress files better than other Zip programs by 5% to 10%. The process looks like this:

rem unzip the contents of the original zip into a temporary directory
7za.exe x original.zip -o"c:\tmp"
rem rezip using maximum compression
7za.exe a -mx9 new.zip "c:\tmp\*"

To see how much this could help web performance, I downloaded several samples of different types of Zip files from the Internet.

Silverlight

Name | Original Size (kb) | ReZipped Size (kb) | % improvement
cached – SilverlightApplication1.xap | 3,972 | 3,899 | 1.84%
Everything-SilverlightApplication1.xap | 825,801 | 782,594 | 5.23%
Examples.CS.xap | 4,752,262 | 3,376,411 | 28.95%
GeoReference.xap | 388,898 | 288,977 | 25.69%
HoldemSimulatorUI.xap | 1,280,714 | 1,243,967 | 2.87%
ImageGallery_v25_9458063489vC.xap | 18,226 | 17,538 | 3.77%
SilverlightControl.xap | 678,995 | 557,791 | 17.85%

On average rezipping reduces a Silverlight application by 12.32%. This is quite good given that XAP files can contain many binary files like images or sounds that will not be recompressed. Some files created from Visual Studio saw an improvement of more than 25%! Also notice that “ImageGallery_v25” is the Silverlight application used by Bing to change Bing’s background image. This heavily served file could be slimmed by nearly 4% simply by rezipping the XAP file!
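
The percentages in the table follow directly from the two size columns, and the 12.32% figure appears to be a plain mean of the per-file percentages rather than a size-weighted total. A minimal Python sketch of that arithmetic, using the sizes from the table (the function and variable names are made up for this illustration):

```python
# (original size, rezipped size) pairs copied from the Silverlight table.
silverlight_sizes = [
    (3_972, 3_899),
    (825_801, 782_594),
    (4_752_262, 3_376_411),
    (388_898, 288_977),
    (1_280_714, 1_243_967),
    (18_226, 17_538),
    (678_995, 557_791),
]


def percent_improvement(original, rezipped):
    """Return the size reduction as a percentage of the original size."""
    return (original - rezipped) / original * 100


# Unweighted mean: every file counts equally, regardless of its size.
improvements = [percent_improvement(o, r) for o, r in silverlight_sizes]
mean_improvement = sum(improvements) / len(improvements)
print(f"mean improvement: {mean_improvement:.2f}%")
```

Because the mean is unweighted, a tiny file counts as much as a multi-megabyte one; summing the two size columns first would give a different figure.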

Microsoft Excel Documents

Name | Original Size (kb) | ReZipped Size (kb) | % improvement
Listedescourselearning.xlsx | 55,618 | 40,753 | 26.73%
ParticipatingMembers.xlsx | 170,382 | 123,275 | 27.65%
PartnerReadinessAndTrainingFY09.xlsx | 26,673 | 21,349 | 19.96%
PermissionTemplate.xlsx | 22,570 | 15,969 | 29.25%
Presentation_Skills_Providers.xlsx | 33,092 | 27,144 | 17.97%

On average rezipping Excel files saves about 25%. This makes sense as most Excel spreadsheets contain predominantly text and not incompressible binary data.

Microsoft PowerPoint Documents

Name | Original Size (kb) | ReZipped Size (kb) | % improvement
AMP 8.0 Project Kickoff Template v1.2 07102009.pptx | 112,637 | 96,753 | 14.10%
CL01.pptx | 1,918,440 | 1,692,785 | 11.76%
CL02.pptx | 5,872,228 | 5,448,818 | 7.21%
EC2.pptx | 123,137 | 100,013 | 18.78%
MSDN_Admin_08.pptx | 2,006,091 | 1,862,496 | 7.16%
SharePoint_Buzz.pptx | 2,123,778 | 2,040,234 | 3.93%
speedgeeks-20091026.pptx | 3,408,365 | 3,271,384 | 4.02%
SupportingDistributedTeamwork.pptx | 2,454,360 | 2,387,257 | 2.73%

On average rezipping PowerPoint files saves about 9%. This can vary widely depending on the number of images that are contained inside the PPTX file as images are not recompressed (more on that in another article).

Microsoft Word Documents

Name | Original Size (kb) | ReZipped Size (kb) | % improvement
ASC_3.0_Demo_Image_Release_Notes.docx | 431,220 | 412,034 | 4.45%
implementationchecklist.docx | 126,981 | 120,075 | 5.44%
MSCOM_Virtualizes_MSDN_TechNet_on_Hyper-V.docx | 115,230 | 89,572 | 22.27%
CompProposal.docx | 25,548 | 21,395 | 16.26%
Web content redline 2009-10-28.docx | 201,304 | 180,868 | 10.15%
WindowsSharePointServicesDatasheet.docx | 198,837 | 172,082 | 13.46%

On average rezipping Word documents saves about 12%.

Conclusions

Always Use Rezipping! Stop sending bytes down the pipe you don’t have to! The savings you receive from rezipping are driven by the contents of the Zip file. Files with a large number of binary objects that will not be compressed (like images) will have a lower improvement. Also note that higher compression levels increase the time and memory needed to compress data, but they do not increase the time it takes to decompress data. This is because all the work is in finding out what can be reduced during compression, not in recreating the original data during decompression. There is no reason not to use rezipping.
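
For readers who would rather script the unzip-and-rezip steps than run them by hand, here is a rough Python sketch. It is an assumption-laden stand-in, not the 7zip process used for the measurements above: it relies on the standard zipfile module (zlib’s DEFLATE), so the savings will differ from 7zip’s, it does not preserve member timestamps, and the function names and sample data are invented for the example.

```python
import io
import zipfile


def rezip(archive_bytes, level=9):
    """Return a new Zip archive with every member recompressed via DEFLATE."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as source:
        with zipfile.ZipFile(output, "w", compression=zipfile.ZIP_DEFLATED,
                             compresslevel=level) as target:
            for member in source.infolist():
                # Decompress the member, then let zipfile compress it again.
                target.writestr(member.filename, source.read(member))
    return output.getvalue()


def make_sample_archive(level):
    """Build a small in-memory archive of made-up text for the demonstration."""
    text = "".join(f"row {i},value {i * i}\n" for i in range(20000))
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=level) as archive:
        archive.writestr("data.csv", text)
    return buffer.getvalue()


# Simulate a file produced at a fast, low compression level, then rezip it.
original = make_sample_archive(level=1)
smaller = rezip(original)
print(len(original), "->", len(smaller), "bytes")
```

The same function should work on a XAP, DOCX, or JAR read from disk, since those are ordinary Zip archives; only the compression step costs extra time, while consumers decompress the result as usual.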

By rezipping your files you can reduce the size of your content. This reduces bandwidth consumption and server load while improving page load times! There is more work to be done. There are a number of web files that contain raw DEFLATE streams, like Flash files, WOFF font files, SVGZ, and more. All of these could be redeflated using a compression level of 9 to make smaller, faster files. Stay tuned as we investigate this more.

November 25, 2009

Performance Questions to Ask Hosting Providers: Log File Access

(This is the second article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article in the series is here)

Last time we covered the most important question you should ask a hosting provider: what control do I have over the web server? This time we will showcase another important question to ask a hosting provider:

“What Access Do You Provide to Web Server Logs?”

The main reason you want access to log files from the web server is to learn how visitors are accessing your content. This will reveal a wealth of knowledge about the raw traffic patterns of your web application and expose various performance issues and limitations. Often these performance issues will not be detected by page-based performance tools like Yahoo’s YSlow or Google’s Page Speed.

Web server logs come in many different formats. Usually they are large text files where every request is logged on its own line. Several pieces of data about each request are logged in different fields on the line, separated by a delimiter such as a space or a comma. Typically the information that is logged for each request is:

  • URL requested.
  • IP address of the visitor.
  • Date and time request was received.
  • The program or browser used to request the URL. This is called the User-Agent.
  • The Referring webpage (if any).
  • HTTP version used to request the page.
  • Status code of the response.
  • Size of the body of the response.

Log files are a very granular view of your web traffic. Sometimes it can be difficult to see the forest for the trees. For example, what pages did user XYZ visit, in what order, and how long did the user stay on each page? It is usually very difficult to get this information from logs alone because web server logs only track users by IP address. To provide a larger view and answer questions like those listed above, web developers use web analytics packages like Omniture, Hitbox, or Google Analytics. Web analytics packages use cookies and JavaScript to gather detailed information about your visitors, the capabilities of their browsers, and their actions through your web application. Web analytics packages are simple to add to a website. Typically all that is involved is inserting a block of JavaScript at the end of each HTML page. This is very easy to do on templated or dynamically generated websites. So if web analytics provides you with a “bigger picture” and richer data than web server logs, that begs the question:

Are Web Analytics Reports Good Enough?

Actually, no. Web analytics reports are not good enough. Web analytics data abstracts away the raw traffic of your web application, which can hide several important problems. Web analytics packages only track visitor requests and activity for HTML files that are served with a 200 status code. Out of the box, here are things that most web analytics packages do not track:

  • All requests to non-HTML resources.
    • Images
    • JavaScript
    • Style sheets
    • Feeds (RSS, Atom)
    • RIA files (HTC, Flash, Silverlight, Java, etc)
    • Access files (robots.txt, sitemaps.xml, crossdomain.xml, etc)
    • Documents (PDF, Office docs, Zip files)
    • Other resources (Fonts, Cursors)
  • Most error pages (404, 5xx, etc).
  • Conditional requests that return “304 Not Modified.”
  • Requests from non browser User-Agents (spiders, mash-ups, etc).
  • Users who have JavaScript disabled for accessibility or (more commonly) security reasons.

This valuable information is completely missed if you are only using web analytics data to understand your traffic. Consider the valuable questions you can answer with web server logs:

  • Where are your redirects? Which can be removed to decrease page load time?
  • What web resources are using the most bandwidth? This is calculated by simply adding up all the body sizes that are returned for a resource and sorting. Can you reduce the size of these files somehow using compression, minification, or by removing metadata?
  • What are the most requested resources on your website? Can you use caching or other methods to minimize the number of times the resource is requested? If you cannot cache those files because they are dynamically generated, can you add programming logic to use a Last-Modified header to reduce bandwidth? Can you remove resources like external JavaScript or CSS files that are referenced but not actually used by that web page?
  • How often does your static content actually change? This is calculated by counting the 304s for a resource. Perhaps you can use a longer Expires time on your content.
  • How often are search engine crawlers visiting your site? Are there crawlers that are missing? You should submit your website to their indexes.
  • Are crawlers finding your high value content? Perhaps you should be using a sitemap or modifying the one you have.
  • Do crawlers request a large amount of low value content? Do crawlers “get stuck” on part of your website? Perhaps you need to fix your robots.txt.
  • Which web resources are you still getting a lot of requests for that no longer exist? You should use a redirect that points to the correct content.
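
Several of these questions reduce to simple aggregation once each log line has been split into fields. The Python sketch below assumes the requests are already available as (URL, status code, body size) tuples; the sample records and function names are invented for illustration and do not come from any particular log tool.

```python
from collections import Counter, defaultdict

# Hypothetical, already-parsed log records: (url, status code, body bytes).
records = [
    ("/logo.png", 200, 48_000),
    ("/logo.png", 304, 0),
    ("/logo.png", 304, 0),
    ("/app.js", 200, 120_000),
    ("/app.js", 200, 120_000),
    ("/old-page.html", 404, 512),
]


def bandwidth_by_url(records):
    """Sum response body sizes per URL, largest consumer first."""
    totals = defaultdict(int)
    for url, _status, size in records:
        totals[url] += size
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def not_modified_share(records):
    """For each URL, the fraction of its requests answered with a 304."""
    hits = Counter(url for url, _status, _size in records)
    unchanged = Counter(url for url, status, _size in records if status == 304)
    return {url: unchanged[url] / count for url, count in hits.items()}


def missing_resources(records):
    """Count 404 responses per URL to find candidates for redirects."""
    return Counter(url for url, status, _size in records if status == 404)


print(bandwidth_by_url(records))
print(not_modified_share(records))
print(missing_resources(records))
```

A resource with a high share of 304 responses is being revalidated often without having changed, which is the hint that a longer Expires time may be worthwhile.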

Real Life Examples

At Zoompf we have detected and solved numerous performance issues just by examining a client’s web server log files. Here are some of our more interesting stories:

  • A social networking client used their robots.txt file to prevent crawlers from indexing content from new, untrusted members that could be content spammers. As a result their robots.txt was 250 kilobytes! Because the file was updated so often, all of the search crawlers would request it multiple times a day. These factors resulted in the client using 5 gigs of bandwidth a month just to serve its robots.txt file! Turning on HTTP compression for text files reduced this by 70%. The client is currently implementing the use of <META> tags and rel="nofollow" attributes to limit search engine indexing for the web pages of untrusted users. This will result in even higher performance savings.
  • A software client found that by far their most popular pages were for their product documentation. These pages were constructed using a PHP templating system. However these files never changed once that version of the product shipped. The client moved to pre-rendering the web pages to static HTML files and using a far future Expires header on the HTML files. This drastically improved performance and reduced bandwidth consumption and server load.
  • An ecommerce client discovered crawlers were requesting all the available colors for each item they sold. For example, the crawlers would visit /item1/, /item1/color/red/, /item1/color/blue/, and so on. By creating a robots.txt rule to prevent crawlers from requesting every color for every item, the client reduced their bandwidth by nearly 80% while still having their important content indexed.
  • A client discovered that their most important content was not getting indexed. A developer had copied a code snippet from the Internet into the top of their template to solve a CSS problem they were having. Unfortunately this code snippet also included a <META> tag telling search engines not to index the web page.
  • A client learned that its logo had not changed for over 4 years and that it was an uncompressed BMP file even though the logo had a JPEG extension. They changed the logo to a proper image format for the web and increased the logo’s Expires time.
  • A client discovered that no one had ever requested their iPhone application image. They removed the <LINK> tag and reduced the size of all of their HTML pages.
  • A client discovered their favicon.ico file was consuming huge amounts of bandwidth. This is because it contained multiple versions of the same icon at different dimensions. Removing all but the 16 by 16 pixel version from the ICO file reduced file size by 97%.

Typical Log Access

If you have your own server, access to the web server logs is usually unrestricted. However, in most shared hosting environments you will not have direct access. Typical web server log options you have are:

  • The raw Apache, IIS, or NCSA log files in a directory outside of your web root that you can access using FTP or sFTP. This is the ideal case.
  • The raw Apache, IIS, or NCSA log files placed directly in your web root. While this provides you with the raw logs anyone on the internet can also access your log files. This is a security risk as log files can often contain sensitive data like credentials or “hidden” areas of your web application. Talk with your hosting provider about moving the location of the web logs.
  • An option through a web-based website administration system like cPanel that lets you download the raw log file.
  • An option or interface in the web admin system that lets you view or download a specially formatted version of the logs.

If you cannot access the raw log files, don’t panic. As long as the log file contains the following information you will have all the data you need:

  • URL requested
  • Date and time request was received
  • The program or browser used to request the URL. This is called the User-Agent.
  • Status code of the response
  • Size of the body of the response

Another question to ask hosting providers is not only “what information is in the log file” but also “how much time does the log file cover?” You can imagine how, in large shared hosting environments, log files can quickly grow to hundreds of megabytes for potentially thousands of customers. Hosting providers often limit the log file in different ways, including:

  • Record only a week of traffic and replace the log with a new empty file every week.
  • Limit the total size of the log file. Each new entry removes an entry from the start of the log.
  • Provide a nightly copy of the log file for all the traffic the site received that day. These copies are usually removed after a certain amount of time.

If you do have a time window, make sure you grab a copy of the log file before it is replaced. Some interfaces like cPanel offer a scheduling service that can email you the log files or place them in a special location that you can then download from. You can schedule an FTP download or use wget or curl to download these log files.

Processing Log Files

Depending on how much log data you have, you might want to concatenate your log files together until you have a big enough sample. At Zoompf we suggest collecting a sample of 500,000 to 1,000,000 requests or a week’s worth of web traffic, whichever is larger. Programs like awstats are very helpful for processing logs and provide reports with your most popular and least popular files, largest files in terms of bandwidth, and other data already broken out. Directly processing the logs yourself allows you to discover more detailed data and is not as hard as you would think. Some basic regular expressions can make it very easy to gather metrics like “show all of the 304s, 404s, 500s, etc.”
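
As a rough illustration of the regular expression approach, the Python sketch below tallies status codes and lists the URLs behind a given status. It assumes lines in the Apache “combined” log format; the pattern, the sample lines, and the function names are assumptions for the example, and logs from IIS or a hosting control panel would need a different expression.

```python
import re
from collections import Counter

# Assumed layout: Apache "combined" format. Adjust for your server's format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

# Made-up sample lines using documentation-only IP addresses.
sample_lines = [
    '203.0.113.5 - - [30/Nov/2009:10:00:00 -0500] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '203.0.113.5 - - [30/Nov/2009:10:00:01 -0500] "GET /logo.png HTTP/1.1" 304 - "-" "Mozilla/5.0"',
    '198.51.100.7 - - [30/Nov/2009:10:00:02 -0500] "GET /missing.html HTTP/1.1" 404 512 "-" "ExampleBot/1.0"',
    '198.51.100.7 - - [30/Nov/2009:10:00:03 -0500] "GET /old/page.html HTTP/1.1" 404 512 "-" "ExampleBot/1.0"',
]


def status_counts(lines):
    """Count how many requests returned each HTTP status code."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:  # skip lines that do not fit the assumed format
            counts[match.group("status")] += 1
    return counts


def urls_with_status(lines, wanted):
    """List the requested URLs whose response had the given status code."""
    matches = (LOG_LINE.match(line) for line in lines)
    return [m.group("url") for m in matches if m and m.group("status") == wanted]


print(status_counts(sample_lines))
print(urls_with_status(sample_lines, "404"))
```

For a real log you would pass an open file object instead of the sample list, since both functions only iterate over lines.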

Remember, examining your web logs is a key technique for discovering and solving performance problems with your web applications. Those pretty graphs from Google Analytics or other web analytics packages are simply not good enough to detect performance issues and bottlenecks. You need access to the information about all the requests the web server is processing. Make sure you ask your hosting provider how you can access the raw web server log files. Find out how much web traffic data the logs contain and how you can easily collect this data so you can analyze it. If your hosting provider does not provide this you should consider that a deal breaker and find another provider.

November 17, 2009

Performance Questions to Ask Hosting Providers: Web Server Configuration

Hosting a web application can be annoying and time consuming. There is the cost of the hardware. There is the time spent configuring, administering, and patching the operating system, web server, and other software. There is the security risk of exposing a machine to the Internet. So it’s no surprise that many people and companies use a 3rd party hosting provider to host their web application and manage the infrastructure. The choice of a hosting provider should not be made lightly. You no longer have full control over the machine running your web application. If you are interested in creating high performance web applications you must ensure that you don’t give up control over the features that you need to make your web application run as fast as possible.

This is the first in a series of articles about performance questions you should ask a hosting provider. While hosting providers do offer dedicated hosting (where your application runs on a single machine all by itself), the vast majority of people choose shared hosting environments. While we will be referencing hosting services that use the Apache web server, all of the advice in this series is applicable to Windows hosting as well.

Without a doubt the first and most important question you should ask a hosting provider is:

“What Control Do I Have Over Web Server Configuration?”

This question is critical. Many of the easiest and most impactful performance improvements you can make to your web application, such as HTTP compression and caching, are configured at the web server level. You should start off by asking what modules are installed already. The Apache modules most relevant to performance are:

  • mod_deflate (for Apache 2) or mod_gzip (for Apache 1) – This module enables HTTP compression.
  • mod_expires – This module enables HTTP caching.
  • mod_rewrite – This module enables on-the-fly URL rewriting which is very helpful when maintaining and updating resources while using far future caching.

All of these modules are installed with the typical default installation of Apache. While this depends on the platform and the distribution they are almost always present by default. Sometimes web hosting companies will compile their own version of Apache from source to maximize performance for their particular server machines. Often they will remove modules to save space and time. If you find a hosting provider like this explain to them that you would like these modules installed. Tell them this is a reasonable request as these modules are part of the default installation of Apache. You should be able to convince them to turn these modules on for you. If not, this is a deal breaker and you should not use that hosting provider. The vast majority of hosting providers offer these modules even at the lowest pricing tiers.

Even if the hosting provider offers these modules, you should ask them for a list of all available modules as well as what their policy is for enabling new modules. While mod_deflate, mod_expires, and mod_rewrite are the most helpful modules from a performance point of view, there might be other modules, such as mod_cband or mod_bw, that you might want to use for performance reasons.

Once you know what you can configure on the web server your next question should be “how do I configure it?” In most shared hosting environments you will not have access to the main Apache configuration file httpd.conf but usually can control the web server through the use of .htaccess files. This is the best solution since it allows you to directly configure the web server. You simply edit the .htaccess file in the root directory for your web application and upload it to the hosting provider.
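
To give a feel for what such a file might contain, here is an illustrative .htaccess fragment that touches each of the three modules mentioned earlier. Treat it as a sketch under assumptions: the MIME types, the one-year lifetime, and the “.v42” versioned file naming scheme are examples chosen for this illustration, and whether these directives are honored depends on the modules your provider has installed and the overrides it allows.

```apache
# Compress common text responses (requires mod_deflate).
<IfModule mod_deflate.c>
    AddOutputFilterByType DEFLATE text/html text/css text/plain application/javascript
</IfModule>

# Let browsers cache images for a long time (requires mod_expires).
<IfModule mod_expires.c>
    ExpiresActive On
    ExpiresByType image/png "access plus 1 year"
    ExpiresByType image/jpeg "access plus 1 year"
</IfModule>

# Hypothetical naming scheme: serve site.v42.css from site.css so a new
# version number forces a fresh download (requires mod_rewrite).
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteRule ^(.+)\.v[0-9]+\.(css|js)$ $1.$2 [L]
</IfModule>
```

The IfModule wrappers keep the site working if a module is missing, at the cost of silently skipping that optimization, so it is still worth confirming with the provider which modules are actually loaded.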

Some hosting providers supply you with a web interface to control web server configuration, typically through a web administration system like cPanel. If this is the case, ask to see examples of the interface. It could be simply a web form that allows you to edit a raw .htaccess file. It could be a more structured web interface with check boxes to turn on modules or forms to add new rules. Be very wary of any type of web-based server configuration. The interface will limit what you are able to configure. If a web interface is available, ask if you can still manually upload your own .htaccess file to control the web server. If you cannot do this your ability to configure the web server will be severely limited. If the web interface does not provide the functionality you need you should not use that hosting provider. In general you should not use hosting providers that only offer web-based server configuration.

Bad Idea: Hacking Around Limits

Some developers like to point out that you can use server side application logic to compress content or implement caching for static resources like images or JavaScript files or CSS files. This means you don’t have to have access to the web server to configure things like HTTP compression or caching. Unfortunately this actually hurts performance more than it helps! With this method, PHP (or some other application logic layer) is invoked for all requests. Remember that the vast majority of requests are for static content and do not hit the application layer. The overhead of invoking PHP dozens if not hundreds of times for a page load removes any performance benefit of compressing or caching. We will explore this method more in a future post. For now you should completely avoid it. Never use application code to hack around the blatant shortcomings of a hosting provider.

Remember when choosing a hosting provider the single most important performance question you can ask is “how do I configure the web server?” In our next post we will explore more performance questions you should ask when choosing your hosting provider.

November 6, 2009

JSMin, Important Comments, and Copyright Violations

Since launching Zoompf last Friday I’ve performed dozens of free web performance scans. A few users reported to me what they thought was a bug. Zoompf was reporting that certain JavaScript files could be further minified. The issue was that these websites used already minified versions of well known JavaScript frameworks like jQuery. These frameworks had a JavaScript comment at the top of the file that included the copyright information and licensing information. Zoompf uses JSMin under the covers to determine which JavaScript files or inline JavaScript blocks can be minified. JSMin would see this comment and remove it. While this is certainly a performance improvement, it is also a copyright infringement violation! That’s not good. We need to solve this.

This is actually a broader problem than just Zoompf. It exists with all of the dozens of ports of JSMin out there as well as many other minifiers. It turns out YUI Compressor does not have this issue because it supports so-called “important comments.” Important comments are special code comments that will not be removed by the minifier. Ajax libraries like jQuery and SWFObject use them around their copyright comments so that minifiers aware of important comments will leave them intact. They look like this:

/*!
    Something very important
*/

Important comments are an excellent idea. They provide a way to minify files while retaining copyright and licensing information. All minifiers should support this! So how can we achieve that? We must solve 2 challenges:

  1. Add support for important comments to existing JavaScript minifiers
  2. Get JavaScript frameworks to start using important comments appropriately

Since JSMin is the canonical example, I went ahead and added support for important comments to the JavaScript port of JSMin written by Franck Marcia. You can download the updated version of JSMin with Important Comments support or download the JSMin Important Comments patch. (For tips on applying the patch read this article).

The change was made to the next() function, which is supposed to return the next character to be inserted into the minified output. This is the code that detects // and /* */ based comments and silently consumes them. I simply modified the code to check if the comment is important and if so return the entire comment to be written to the minified output stream instead of a single character. The main part of the patch is shown here:

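case '*':
//this is a comment. What kind?
get(); //consume the '*' that opens the comment
if(peek() == '!') {
	//important comment: collect it verbatim, starting with its opening '/*'
	//(the '!' has only been peeked, so the loop below appends it)
	var d = '/*';
	for (;;) {
		c = get();
		switch (c) {
			case '*':
				if (peek() == '/') {
					get(); //consume the closing '/'
					return d+'*/';
				}
				d+=c; //a lone '*' is part of the comment text
				break;
			case EOF:
				throw 'Error: Unterminated comment.';
			default:
				d+=c;
		}
	}
} else {
	//unimportant comment gobbling here ...
}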

This solution is easily added to any port of JSMin written in a weakly typed language. The next() function is supposed to return only a single character, but this patch makes it return a string containing the whole important comment block. In JSMin ports to JavaScript or PHP this is not a problem. In JSMin ports written in strongly typed languages like C, C#, or Java, the next() function returns an integer, so it cannot hand back a whole string and the patch does not apply as-is.

Now JSMin is like a Swiss-made watch. It is as elegant as it is compact. It is a good example of how state machines can be small and powerful. Unfortunately, like a Swiss watch, you cannot easily add a new feature without a significant amount of work. I toyed with a few very small patches. I tried modifying the next() function to simply no longer detect or silently consume comments that were important. This caused important comments to be processed as if they were JavaScript source code. Not bad, but it minifies the comment’s content and could puke on things in the comment like unmatched quotes, or by thinking a slash is an unclosed regex literal. I’m going to continue to try to patch JSMin in a way that works with both weakly typed and strongly typed languages and that retains the elegance of Douglas Crockford’s design. Stay tuned. For now the weakly typed solution is available.

I would say that minifiers that support important comments should not convert a /*! … */ comment to a /* … */ comment. This would allow minifiers to run on text that could have already been minified without removing anything important. The extra character per important comment cost is well worth the ability to chain together multiple tools without having to worry about losing anything.
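
The chaining argument can be illustrated with a deliberately naive Python sketch. It is neither JSMin nor a real minifier: it only drops /* … */ block comments using a regular expression, it ignores string and regex literals entirely, and the function name is made up for this example. What it does show is that when important comments are passed through unchanged, a second pass has nothing left to remove.

```python
import re

# Naive matcher for /* ... */ block comments (non-greedy, spans newlines).
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)


def strip_unimportant_comments(source):
    """Drop block comments unless they start with the /*! marker."""
    def keep_if_important(match):
        comment = match.group(0)
        # Important comments are returned untouched, "!" included.
        return comment if comment.startswith("/*!") else ""
    return BLOCK_COMMENT.sub(keep_if_important, source)


script = "/*! ExampleLib 1.0 | MIT License */\n/* internal notes */\nvar answer = 42;\n"
once = strip_unimportant_comments(script)
twice = strip_unimportant_comments(once)
print(once)
print("second pass changed nothing:", once == twice)
```

Had the first pass rewritten /*! to /*, the second pass would have treated the license block as an ordinary comment and removed it, which is exactly the failure the extra character avoids.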

The second challenge is getting JavaScript frameworks to start using important comments appropriately. Notice I said appropriately. We don’t want to just go adding a ! to the first comment or anything. Some frameworks, such as Mootools, include all sorts of content in their initial comment that should not be included in a minified version as an important comment. Ideally frameworks should use an important comment block the way SWFObject did: a simple, single, important comment that contains the name of the library, a URL to the framework’s web page, the license it uses, and a link to the license. Yahoo’s User Interface Library is another example of an appropriate comment block, though they are not using important comment syntax for their copyright and licensing declaration.

Know any other JavaScript minifiers that support important comments? Working on a JavaScript Framework and want to add important comments? Drop me a message and share all about it.

November 5, 2009

Web Optimization Presentation at Phreaknic

Phreaknic was a blast this past weekend! I’ve been attending and speaking there for 8 years now, and while Halloween certainly cut into attendance this year there were some awesome presentations. Famulus’s talk, complete with pictures and video of his quest to build a functional Bussard fusion reactor, was amazing. Adrian Crenshaw gave a great overview of the various anonymizing networks and darknets out there, including some of the work Matt Wood and I did on Veiled. Finally, Azureus’s Tyler Pitchford gave a highly entertaining talk on reverse engineering. His best line? “And now you are on the NOP sled! Wheeeee!”

I spoke last on Friday night and presented on web performance. As usual I ran long, but I was getting some excellent questions from the audience. You can download the slides from my presentation, Optimizing Web Performance, on Zoompf.com.