January 5, 2010

Top PNG Optimizers Don't Use zlib

Oleg Kikin has an interesting chart comparing the performance of multiple PNG optimizing tools. The tools tested are:

Go take a look at the PNG comparison chart. I can wait.

So what do these results mean? Well, I believe they show how far image optimization has come in the last 2 years. Tools that just manipulate the parameters of the stock DEFLATE compressor included in the zlib compression library and remove extra PNG chunks no longer produce the smallest optimized image. PNGOut and AdvanceCOMP produce the smallest PNGs because they use custom DEFLATE compressors that achieve better compression than zlib’s implementation. PNGOut’s DEFLATE compressor was written from scratch and AdvanceCOMP uses the custom DEFLATE compressor written for 7Zip. We’ve talked about 7Zip and DEFLATE before in the Rezipping Web Resources for Fun and Profit post. I used 7Zip for my rezipping work because its optimized DEFLATE compressor compresses data better than the DEFLATE compressor in zlib. This in turn produces smaller ZIP files, but the same logic applies to image formats that use DEFLATE.
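To make the "manipulate the parameters" approach concrete, here is a minimal sketch that tries several stock zlib settings and keeps the smallest output. It uses Node's built-in zlib module. The function name smallestDeflate, the choice of strategies, and the sample input are assumptions for illustration only. This is not how PNGOut or AdvanceCOMP work: they replace the compressor itself rather than tune it.

```javascript
// Sketch: brute-force the stock zlib settings and keep the smallest result.
const zlib = require("zlib");

function smallestDeflate(data) {
  // Strategies offered by stock zlib (a subset, chosen for illustration).
  const strategies = [
    zlib.constants.Z_DEFAULT_STRATEGY,
    zlib.constants.Z_FILTERED,
    zlib.constants.Z_RLE,
  ];
  let bestResult = null;
  for (const strategy of strategies) {
    for (let level = 1; level <= 9; level++) {
      const out = zlib.deflateSync(data, { level, strategy });
      if (bestResult === null || out.length < bestResult.bytes.length) {
        bestResult = { level, strategy, bytes: out };
      }
    }
  }
  return bestResult;
}

// Hypothetical input standing in for filtered PNG scanline data.
const sampleData = Buffer.from("abcabcabd".repeat(500));
const bestResult = smallestDeflate(sampleData);
console.log(
  `level=${bestResult.level} strategy=${bestResult.strategy} bytes=${bestResult.bytes.length}`
);
```

However many settings are tried, the result can never be smaller than what zlib's own match finder is able to produce, which is the ceiling a custom DEFLATE compressor gets past.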

Unfortunately I cannot find any information about the command line options Oleg used with each tool.

It is interesting to note the difference between Smush.it and PNGCrush. According to Smush.it’s information page it is using PNGCrush under the covers. Any difference in the output of Smush.it and PNGCrush comes entirely from the command line options, which we know nothing about. It would be possible to reverse engineer what Smush.it is doing by using the service and comparing the output. I imagine they are using the -m option instead of the -brute option to reduce the number of rounds of PNGCrush and improve the response speed of the Smush.it web service.

What we really need is a web service that accepts images and tries several different optimization tools. Smush.it has hinted at this for a while now in their FAQ but improvements to the tool seem to have stalled since Yahoo took it over (to say nothing of the un-sexy-fying of the Smush.it UI). Hopefully something like this will appear.

Want to see what performance problems you have? Unoptimized PNG images are just one of the 200+ performance issues Zoompf detects while assessing your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!

December 22, 2009

Hacking Stoyan and the Importance of Web Security

(I found a security vulnerability in some code that Stoyan recently released. I worked with Stoyan to resolve the issue. Nothing in this post still works. Usually stuff like this happens all the time and is never made public. Stoyan has very graciously allowed me to discuss the issue publicly so others can learn and avoid the same mistake. Thank you Stoyan. You Rock.)

Yesterday Stoyan released an alpha version of a cool tool he wrote, chunkview.php, which helps people visualize how web pages are delivered to the client. There is an excellent article to go with the tool and everyone should read it.

But this post isn’t about chunked encoding. This post is about how trivial mistakes in web applications can have serious consequences.

Under normal operation chunkview.php looks like this:

Now, I’m a very curious person. I want to know how things work. Nothing makes me happier than to poke and tinker with technology (not necessarily my own) and see what happens. So about 3 seconds after looking at this tool I thought “what happens if I don’t give it a URL?” So I typed "> into the textbox and pressed Enter. Here is what came back:

Now this is getting interesting! I get some PHP errors! This starts to answer some questions about how the application functions:

  • What: Stoyan is using PHP to fetch the URL the user supplies, parse the contents, and display the contents to the user.
  • How: Stoyan is using fopen() to fetch the contents of the URL. It also appears that PHP magic quotes is in use because while I input the string "> the application tried to open \">.
  • Where: I have the absolute file system path of the PHP file. This tells me where the files are, as well as the username of Stoyan’s account, w3clubs.

So what can we do? Well, Stoyan has essentially created a “view this resource” application. He intended for the application to be used to view the contents of HTTP responses. However he is using fopen() to read the contents of the URL. fopen() can also access files on Stoyan’s web server. It is possible that Stoyan actually created a “view any file on this computer or any remote URL” application!

Let’s test this theory. We are going to try to use chunkview.php to view the source code of chunkview.php! I typed chunkview.php into the textbox and hit enter. Here is what I got:

Well that did not work. PHP gives us a nice error message telling us it could not find the file chunkview.php. Hmmm. Perhaps there is a problem resolving the path. Maybe if I input the absolute path and filename for chunkview.php we will be able to read the file. I typed /home/w3clubs/public_html/tools.w3clubs.com/chunkview/chunkview.php into the textbox and hit enter. Here is what I got:

Awesome! I am now looking at the source code for chunkview.php! I can also immediately see Stoyan’s mistake. It’s the line fopen($_GET['q']). He is passing input from an untrusted source (i.e. me) directly to a command that opens a file! This is known as a Local File Inclusion Vulnerability. (If this sounds familiar it should. We talked about how PHP scripts that combine multiple CSS or JavaScript files at runtime often contain Local File Inclusion vulnerabilities in The Challenge of Dynamically Generating Static Content post.)

From Vulnerability to Compromise

So what can an attacker actually do with this? So far all we have found is a toehold. We have a way to read any file on the disk. But reading files != a hacked web server. At least not yet. I will now guide you through what an attacker would do to take this seemingly small bit of unintended program behavior and leverage it into a compromised system. You should never never never do what I am about to describe. It is a crime.

The first thing I do is get a list of users that have accounts on the computer. That gives us the option of brute forcing their passwords to gain access to their accounts. We can see a list of users on the system by reading the contents of /etc/passwd.

Ok, we see a lot of these accounts don’t have shell access (because their login shells are set to /sbin/nologin). But we do know the name of an important account that is probably used quite a bit: w3clubs! Let’s fetch the .bash_history file for this account, which should give us all the commands Stoyan has run on the system. We will view the source of the “web” page chunkview.php returns to see the contents of the .bash_history file more clearly.

That’s much better. Notice the scroll bar. We have captured a list of approximately 1000 commands that Stoyan has run on this machine. I gain lots of valuable information by scrolling through the command history list:

  • He accesses several different MySQL databases. I know the database names and the user accounts to access them.
  • I see he is using WordPress on the box. I know where his wp-admin directory is and the names of all configuration files.
  • He has SSHed or SCPed into different boxes. I know the hostnames and the usernames he has used. Any passwords I find on this box I will use to try and break into more computers.

(Everything from here on I did not actually do. As soon as I confirmed the security issue existed I immediately stopped probing and contacted Stoyan to help him fix the vulnerability. Again, don’t ever ever ever do any of this!)

The next thing I would do is start fetching important files from his system. I grab his Apache configuration file, his raw MySQL databases, his PHP configuration file, his WordPress configuration files and plug-ins, his SSH keys, etc. I download the source of every PHP file I know about on the system, looking for any file that allows me to either upload a file or execute arbitrary PHP code. My goal is to find a username and password, or some way of getting a file or content from me onto the system.

An easy target is WordPress. The web-based administration portal has exactly the functionality I need. Gaining access to it is not very difficult given that I can read all his configuration files and the WordPress database. Let’s assume I leverage all this information and can log in to Stoyan’s WordPress admin panel. Now what? I use the Editor feature in the WordPress admin panel and add a back door. This is far simpler than it sounds. By including a bit of PHP code like system($_GET["command"], $blah) I can run any command I want on Stoyan’s computer by passing it to his WordPress application through a query string. I can use this back door to upload files onto Stoyan’s web server (by running a command like wget to fetch them) and also to unzip, untar, or execute what I upload.

Pwn3d

At this point it’s game over.

I can upload and execute arbitrary programs on Stoyan’s computer. I can modify his web site. What I do now depends on my goals. I might modify his website to serve malware to 1 out of every 1000 visitors. I might create a hidden directory and use his computer to serve porn or stolen software. I might do nothing and simply use his computer as a platform to launch attacks against other systems.

Conclusions

You should never take user supplied data and just blindly use it. User supplied data means anything that comes from the user: query string values, POST data, Cookies, uploaded images or other files, and even HTTP headers like User-Agent or Referer (sic). Before you use any input you must validate it using whitelist input validation! In this case Stoyan and I fixed the issue by validating that URLs must start with http:// or https://. If you want to know more about Web Security I suggest these resources:

The point of this is not to publicly beat up on Stoyan. (Thanks again for being a good sport, Stoyan!) He’s a bright guy who has done amazing work in the web performance field. The point of this post is to show how a trivial mistake can lead to a complete failure and how easy it is for even smart people like Stoyan to make trivial mistakes. Please make sure you are validating all input in your web applications! It does not matter if your application is performant or not if the application is insecure.
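The whitelist check described in the conclusions (URLs must start with http:// or https://) can be sketched in a few lines. The real fix was made in Stoyan's PHP code, which is not shown in this post, so the JavaScript below is only an illustration of the idea. The function name isAllowedUrl is hypothetical.

```javascript
// Whitelist validation sketch: accept only absolute http:// or https:// URLs.
function isAllowedUrl(input) {
  if (typeof input !== "string") {
    return false;
  }
  let parsed;
  try {
    // Throws for relative paths such as "chunkview.php" or "/etc/passwd".
    parsed = new URL(input);
  } catch (err) {
    return false;
  }
  // Anything that is not explicitly allowed (file:, ftp:, php:, ...) is rejected.
  return parsed.protocol === "http:" || parsed.protocol === "https:";
}

console.log(isAllowedUrl("http://example.com/"));
console.log(isAllowedUrl("/home/w3clubs/.bash_history"));
```

The important design choice is that the check lists what is allowed instead of trying to list everything that is dangerous. Local paths and other schemes fail because they are not on the list.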

December 14, 2009

Performance Questions to Ask Hosting Providers: Secure Website Access

(This is the third article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article, “What control do I have over the web server?” and the second article “What access do you provide to web server logs?” are also available.)

So far in this series we have talked a lot about questions to ask hosting providers to make sure you can configure your website for performance and access the raw traffic logs of your website to spot performance problems. All of this is moot of course if you cannot get content onto your website. That’s why this post of “Questions to ask a hosting provider” is all about:

“Can I Securely Communicate With My Website?”

It has happened to everyone. You are out at a coffee shop, a client site, or at a conference and you need to make changes to your website. Perhaps you need to upload a few new PHP files or some images. Perhaps you need to update your web server configuration to set up a new email address for an event. Perhaps you simply saw something cool and want to write a WordPress post. But can you do any of these things securely using a public network? This question is best answered with an analogy.

Imagine you are at a formal cocktail party. You drift from room to room, through a sea of lavishly dressed party goers and dine on mouth-watering morsels served on silver trays by waiters in white gloves. As you approach a side table of crystal champagne glasses you overhear bits and pieces of the conversations around you.

  • “We cannot wait. It should be a lovely vacation and it’s the perfect time for us to get away for a week.”
  • “That’s right, with the nanny! Walked right in on them! And he tried to say that she was only choking!”
  • “Chris starts there next spring, just like his father.”

Well attended cocktail parties are loud and noisy. It’s almost impossible not to hear what everyone else is saying! Of course we are taught that to be polite we should ignore the conversations other people are having unless we are involved. You are on the honor system not to eavesdrop.

Public networks such as wireless networks are just like cocktail parties. Your wireless card is like a party guest. It broadcasts out to the room when it “speaks” and “listens” to everyone within range to hear a response. Like a real party guest, wireless cards are supposed to ignore any conversations they overhear that are not meant for them. They do this by dropping the data and not bubbling it up to the computer. However nothing forces network devices to ignore data they receive that is not meant for them. In fact, all networking devices (not just wireless devices) can be placed into “Promiscuous Mode” where any data they receive, even data that is not addressed to themselves, is received and bubbled up to the computer to process. This allows any networking device to become a giant listening device that hears and records all the information on the network! Promiscuous mode is not some evil hacker trick. It’s a fully intended feature of networking devices that has many legitimate uses.

Diagram showing how clients in a wireless network hear each others' traffic

But wait! I use Encryption!

“The conference wireless network or the coffee shop’s wireless network is encrypted. They tell me they use something called WPA2 with a key of a million bits! I’m secure, right?”

No, you are not secure.

Let’s go back to the cocktail party analogy. The hosts don’t want just anyone coming into their party and drinking all their fine wines. So they place a bouncer at the door of the party. Only people that know the password are allowed into the party. If you know the password you get into the party and can listen to all the other guests. If you do not know the password you remain outside the building and cannot hear anything that is going on inside.

Encrypted wireless networks are just cocktail parties with bouncers. You need the “password” to join the wireless network. Once you are connected you can listen to everyone else’s traffic just like before because on the network everyone is using the same password to transmit and receive their data. (This is the only scalable solution. Otherwise the wireless network administrator would have to create a new, unique password for each and every person that joins the network). In other words, an encrypted network uses the password solely to protect and restrict “access” to the network. It does nothing to protect the users of the network from themselves or from each other.

The Danger of Sniffing (packets)

So what? Who cares if someone can listen to my network traffic? It’s not a big deal. After all, they will just see the blog content I was about to post anyway. Unfortunately this is not true. Using any system that requires a username and a password on a wireless network? You may have shouted that username and password to the entire cocktail party. And chances are you use that same username and password somewhere else on the Internet. Like your bank. Or an online store. Are you already logged into a system like Gmail or your WordPress administration panel? You are shouting your HTTP cookies to the entire cocktail party. Someone can steal your HTTP session cookies and use session hijacking to access Gmail or WordPress as if they were you without needing your username and password. Next thing you know you are on The Wall Of Sheep!

Secure Communications With Your Website

Remember: network encryption protects networks and application encryption protects applications! You need to make sure you are using encrypted application protocols to properly protect yourself. What protocols you use and how you use them will vary with different use cases.

Uploading Content

How do you upload content to your website? If the answer is FTP you are in trouble. FTP sends usernames and passwords in the clear. You need an encrypted file transfer mechanism like SFTP or SCP. If you have shell access to your web server using SSH you also have the ability to use either SFTP or SCP as they are simply subsets of the functionality of SSH. By default most hosting companies provide an insecure file transfer system like FTP. Ask if they provide (for free) a secure file transfer system like SFTP or SCP. Make sure they understand you don’t need full SSH functionality and are only interested in secure file transfer. If this is not available you might need to upgrade your account or purchase an add-on to get SSH access for your website.

Writing Content

Do you use a web interface to write content for your blog platform or CMS? Does it use SSL? Check the address bar. Does it start with https? If not you are not using SSL. Do you write your content using other software? Does that software directly publish the content to your blog using a web API like RSD or XMLRPC? Does that use SSL? Check the settings and see if you are using “https” to access the API interface. If you are not using SSL to communicate with these web resources then anyone can capture your username and password or cookies (which are just as good as your username and password).
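If you have more than a couple of places to check, the "does it start with https?" test is easy to automate. The sketch below flags any configured endpoint that is not reached over https. The function name and the example.com addresses are hypothetical placeholders for your own admin page and publishing API URLs.

```javascript
// Return every endpoint that would send credentials or cookies in the clear.
function findInsecureEndpoints(urls) {
  return urls.filter((url) => !/^https:\/\//i.test(url.trim()));
}

// Hypothetical settings: an admin page and a publishing API endpoint.
const configuredEndpoints = [
  "https://blog.example.com/wp-admin/",
  "http://blog.example.com/xmlrpc.php",
];

console.log(findInsecureEndpoints(configuredEndpoints));
```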

Website Administration

How do you administer your website? Do you use a web interface like cPanel? These web administration interfaces are most common in shared hosting environments and typically run on a different hostname or an odd port number. Ask the hosting provider if they offer SSL access to the interface. Hosting providers often get confused and think you want to create an SSL certificate for your website. While this would secure a CMS you configure like WordPress (see previous use case) it does not help you secure the web administration interface because that is often running on a separate system. Make sure they understand you want secure access to their interface, not your website. This discussion may take several emails back and forth but most hosting providers are willing to supply SSL access to cPanel or other administration interfaces.

Summary

In conclusion, the questions about secure communications you should ask your hosting provider are:

  • “Do you provide a secure file transfer mechanism like SFTP or SCP? Is it provided for free or is it extra? If you don’t, do you offer SSH access to the web server? Is it free?”
  • “If you provide a web-based website administration interface like cPanel do you provide access to it using SSL?”
  • “Do you provide an SSL certificate for my CMS? What is the cost?”

How to judge their answers will vary from person to person based on need. Personally, a secure file transfer mechanism is a requirement. Too many times have I needed to upload a presentation, PDF, or file to my website from a public network at a conference or client site. If you are a heavy blogger, secure access to your content management system is going to be critical. After all, it is difficult to write a blog post about an event from the event if you cannot securely access your blog to write the post!

December 7, 2009

Browser Performance Problem with CSS "print" Media Type

I ran across an article today that shocked me. Geert De Deckere wrote how you can save an HTTP request by combining the CSS files for the print and screen media types.

Wait, I thought. What? Why do I need to do this? What behavior is this correcting? I was very confused. Maybe you are too.

CSS allows you to define styling information for different media types. CSS could tell a browser that is rendering to a TV screen to style the same content differently from a browser rendering on a mobile phone. The HTML content is all the same. Media types simply define which style rules apply for which devices. CSS also defines a print media type, which is the style to use when a page is being printed. Browsers should be smart about only downloading the style sheet with the media type for the device they are rendering. Firefox on my laptop should not fetch the mobile.css style sheet whose media type is handheld. And luckily browsers are smart and don’t download CSS files for media types that they don’t support.

Except for the CSS media type print.

Geert’s article and advice were predicated on the claim that web browsers will download external style sheets with the print media type even if you don’t print the page. Is this true? To find out I built a quick test page:

<html>
<head>
  <title>CSS Media Tests</title>
  <link rel="stylesheet" type="text/css" href="screen.css" media="screen" />
  <link rel="stylesheet" type="text/css" href="print.css" media="print" />
</head>
<p>
  Hello! <img src="new-logo.png">
</p>
</html>

Wow! Geert’s claim was true! All major desktop browsers that I tested (Firefox 3.5, IE 7, Chrome 3.0 and Safari 4.0) will download external style sheets whose media type is print even if you don’t print the page! This hurts performance for no good reason. Your browser must make one HTTP request to download screen.css. But then it has to make an additional HTTP request to download a file full of content that it does not need. Worst of all, the browser will not start rendering the page until it has grabbed the completely unused print.css file!

This is very silly behavior. Especially given that virtually none of your website visitors are going to print any of your web pages, unless you are a website like Google Maps. Try to remember: “when was the last time you printed a web page?” Unfortunately all 4 browsers I tested downloaded print.css even though I never printed the page. Firefox, IE, and Chrome all downloaded print.css in order, as if it were an external CSS file whose media type was screen. Looking through a proxy, the request order was:

  1. css-media-test.html
  2. screen.css
  3. print.css
  4. new-logo.png

Safari 4.0 however, downloaded the content in this order:

  1. css-media-test.html
  2. screen.css
  3. new-logo.png
  4. print.css

Safari was smart enough to defer downloading print.css but still downloaded it. I do not know if Safari delayed firing the window.onload event until after print.css downloaded or not. WebPageTest confirms that IE does not start rendering the page until print.css is downloaded. The fact that Firefox and Chrome both requested content in the same order as IE leads me to think they also delay rendering.

Possible Solution?

Geert proposed a solution to this problem. He recommends combining the two external CSS files into a single CSS file and using @media directives inside the CSS file to separate the style info for screen from the style info for print. You end up with a single CSS file that looks like this:

@media screen {
  /* contents of screen.css here */
}
@media print {
  /* contents of print.css here */
}

This solution does not sit well with me. Yes, by combining the two CSS files and using @media directives you can remove an HTTP request. You now only have to download a single CSS file whose size will be smaller than the sum of the two original file sizes because a single large file will compress better. However your visitors still have to download a large amount of CSS content. 30-40% of that content is printer-centric style information which no one will ever actually use anyway, and the browser will not start to render the page until all this useless data has been downloaded. (Interestingly enough, Zoompf’s free Web Performance Scan checks style blocks and CSS files for @media directives and recommends you break them into separate style sheets to prevent unnecessary rules from being downloaded. I had to modify the check to allow @media print directives when I found this solution.)
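If you do go with the single-file approach, the combined file can be produced by a small script instead of by hand. Here is a minimal sketch that wraps each style sheet in its @media block. The function name combineByMedia and the sample rules are hypothetical. A real build step would read screen.css and print.css from disk. The sketch also assumes neither file contains @import or @charset rules, which are not allowed inside an @media block.

```javascript
// Wrap each style sheet's rules in an @media block and join the results.
function combineByMedia(sheets) {
  return (
    Object.entries(sheets)
      .map(([media, css]) => `@media ${media} {\n${css.trim()}\n}`)
      .join("\n\n") + "\n"
  );
}

// Hypothetical file contents standing in for screen.css and print.css.
const combinedCss = combineByMedia({
  screen: "body { color: #333; }",
  print: "body { color: #000; } nav { display: none; }",
});

console.log(combinedCss);
```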

A Different Solution

I believe there is a different and perhaps better solution. You can defer downloading print.css by using JavaScript to dynamically add a <LINK> tag pointing to the external CSS file with the print media type after the page has loaded! This solution means the browser only needs to make 1 HTTP request, and download less CSS content, before it can start drawing the page. This will have a faster “Time to Render” than a single CSS file as less data is downloaded. The extremely small number of people who do print your web page will still get the style sheet necessary for them to print. You can also use a <NOSCRIPT> tag in the <HEAD> to link to print.css. This means anyone who has JavaScript turned off will take the performance hit all of your visitors are currently taking and request both external style sheets. The deferring print.css solution looks like this:

<html>
<head>
  <title>CSS Media Tests</title>
  <link rel="stylesheet" type="text/css" href="screen.css" media="screen" />
  <noscript>
    <link rel="stylesheet" type="text/css" href="print.css" media="print" />
  </noscript>
</head>
<p>
  Hello! <img src="new-logo.png">
</p>
<script>
window.onload = function() {
  var cssNode = document.createElement('link');
  cssNode.type = 'text/css';
  cssNode.rel = 'stylesheet';
  cssNode.href = 'print.css';
  cssNode.media = 'print';
  document.getElementsByTagName("head")[0].appendChild(cssNode);
};
</script>
</html>

You reduce initial request count and download size at the cost of greater complexity and more markup. This code could be improved. A more scalable solution would be for the JavaScript code to look in the <HEAD> and parse any <LINK> tags inside of a <NOSCRIPT> with a print media type and create new LINK elements dynamically.
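The more scalable variant might look something like the sketch below. It assumes the browser exposes the contents of a <NOSCRIPT> in the <HEAD> as plain text when scripting is enabled, which is how current HTML parsers behave but which I have not verified in the browsers tested above. The function names extractPrintHrefs and deferPrintStyles are hypothetical. The regular expressions only handle simple, quoted attributes.

```javascript
// Pull the href of every print style sheet out of a fragment of <link> markup.
function extractPrintHrefs(markup) {
  const hrefs = [];
  const linkTags = markup.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    const media = /\bmedia\s*=\s*["']([^"']*)["']/i.exec(tag);
    const href = /\bhref\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (media && href && media[1].trim().toLowerCase() === "print") {
      hrefs.push(href[1]);
    }
  }
  return hrefs;
}

// Browser-only part: re-create each print <link> found inside <noscript>.
function deferPrintStyles(doc) {
  const head = doc.getElementsByTagName("head")[0];
  const blocks = head.getElementsByTagName("noscript");
  for (let i = 0; i < blocks.length; i++) {
    for (const href of extractPrintHrefs(blocks[i].textContent)) {
      const link = doc.createElement("link");
      link.rel = "stylesheet";
      link.type = "text/css";
      link.media = "print";
      link.href = href;
      head.appendChild(link);
    }
  }
}

// Only wire up the load handler when running inside a browser.
if (typeof window !== "undefined") {
  window.addEventListener("load", function () {
    deferPrintStyles(document);
  });
}
```

With this in place the page author only maintains the <NOSCRIPT> block, and the script picks up whatever print style sheets are listed there.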

Solving the problem

A summary of the problems and the two solutions appears below. This table assumes two CSS files (screen.css and print.css), each 30 kilobytes in size, and a combined CSS file (all.css) whose size is 55 kilobytes.

Method | HTTP Requests before “Start Render” | CSS Downloaded before “Start Render” | HTTP Requests after “Onload” | Content Downloaded after “Onload”
No Optimization | 2 | 60 KB | 0 | 0 KB
Single CSS file all.css | 1 | 55 KB | 0 | 0 KB
Deferring print.css | 1 | 30 KB | 1 | 30 KB

Which solution works best will vary with your situation. The status quo is 2 HTTP requests to deliver 60 KB of content before the browser can start rendering. A single CSS file reduces that to 1 HTTP request and 55 KB of content before the browser can start rendering. Deferring print.css also requires only 1 HTTP request before page load but sends only 30 KB before the browser can start rendering. If you have a small print.css file it might be better to use a single CSS file with @media directives. The overhead of serving a single larger CSS file containing unused style data, and the delay that adds until the browser can start rendering, might be so small it does not matter. However, if you have a larger print.css file, deferring the print.css download until after page load would provide a great performance benefit.

The moral of the story here is that the browser creators need to remove this performance defect from their code. Ideally the print CSS media type data should not be downloaded until the print dialog box appears, either from user action or using window.print() in JavaScript. The next best solution would be for the browser to automatically defer the downloading of “print” CSS media type data until after the page has downloaded. In the meantime, you can use either the single CSS file solution or the deferring print.css solution to make your web pages load faster!

Want to see what performance problems you have? An appropriately placed <LINK> tag and proper use of CSS @media directives are just two of the 200+ performance issues Zoompf detects while assessing your web applications for performance. You can sign up for a free mini web performance assessment at Zoompf.com today!

The Challenge of Dynamically Generating Static Content

Time and time again I see people using PHP or some other application logic to try and hack around some issue they are facing. We saw this in our previous post Questions to Ask Hosting Providers: Web Server Configuration where people would use PHP to emulate mod_deflate or mod_expires. Andrew King, in his book Website Optimization talks about wrapping developer comments in CSS or JavaScript files in <?php ?> tags and using the PHP interpreter to remove them. People use PHP to combine CSS or JavaScript resources together. And today I read an article from the always awesome Chris Coyier over at css-tricks.com about using PHP to emulate CSS variables.

Don’t get me wrong. I was actually bemoaning the lack of variables in CSS two days before Chris wrote his article. (Actually, what we really want is more like C/C++ macros but that’s another story). Anyone who has tried to implement CSS sprites, change margins or element sizes, or modify color values knows what a pain it is to go through a CSS file and type the same thing over and over.

Using PHP to solve this problem, or any of the other problems listed above, makes perfect sense at first. Because it makes things easy. Because you are all being lazy. You are using a runtime mechanism to try and simplify your life.

Stop Being Lazy!

Now, under normal circumstances programmers should be lazy! After all your very job is to create something that does work for you! Unfortunately in this case your laziness is harming the performance of your application. Using application logic to dynamically generate static content at runtime is a massively bad idea. Consider these 4 consequences:

  • You take an order of magnitude performance hit for invoking the application tier instead of just serving a flat static file from the file system.
  • Since the web server is not serving a static file, there will be no Last-Modified header sent by default. That means no conditional GETs and no 304 responses which means lots of bandwidth consumption.
  • PHP, like virtually all application tiers, produces a chunked response. This is because the web server has no idea what the content length will be because it is dynamically generated. Dynamically generated chunked responses will not send the Accept-Ranges header. This means no pausing, resuming, or error recovery. The entire resource must be re-downloaded.
  • Chunked encoding is not supported with HTTP/1.0, so any HTTP/1.0 device (like every caching proxy ever made) has to flip into “store and forward” mode where it downloads the entire response before passing it along.
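To see what you give up in the Last-Modified case, here is a rough sketch of the conditional GET decision a web server makes for a static file, and which a script would have to reimplement by hand. The function name conditionalGet is hypothetical. The logic is simplified: it ignores ETag and If-None-Match entirely.

```javascript
// Decide between 200 and 304 for a resource with a known modification time.
function conditionalGet(mtimeMs, ifModifiedSince) {
  // HTTP dates have one-second resolution, so drop the milliseconds.
  const lastModified = Math.floor(mtimeMs / 1000) * 1000;
  const headers = { "Last-Modified": new Date(lastModified).toUTCString() };

  // A missing or unparseable If-Modified-Since header means a normal 200.
  const since = ifModifiedSince ? Date.parse(ifModifiedSince) : NaN;
  if (!Number.isNaN(since) && lastModified <= since) {
    return { status: 304, headers }; // client copy is current, send no body
  }
  return { status: 200, headers };
}

// First visit: no validator yet, so the full response is sent.
const fileTime = Date.UTC(2009, 11, 7, 12, 0, 0) + 250;
const firstVisit = conditionalGet(fileTime);

// Repeat visit: the browser echoes Last-Modified back as If-Modified-Since.
const repeatVisit = conditionalGet(fileTime, firstVisit.headers["Last-Modified"]);
console.log(firstVisit.status, repeatVisit.status);
```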

And as if all these downsides of invoking the application tier were not enough, we have my personal favorite: Web Security! As someone who professionally broke into computer systems for many years, when I see:

http://example.com/combine.php?files=a.js|b.js|c.js

I get very excited. Think about what a resource combiner script does. “Hey website, I’m going to give you a list of files on your hard drive, and I want you to read them off the disk, one at a time, and dump their raw contents into a response and send it to me!” Jackpot baby! This is what we call a Local File Inclusion vulnerability just waiting to happen. The developer has not so much created a resource combiner as they have provided me with a rudimentary remote file download service! I immediately do something like this:

http://example.com/combine.php?files=db.inc

In about 45 seconds I have downloaded the /etc/passwd file, your httpd.conf, your .htaccess, your raw MySQL database, your app config files filled with user names, passwords, and database connection strings, and each PHP file to retrieve all your source code. Or worse, I perform remote file inclusion, thereby injecting a PHP-Shell, which allows me to completely take over your website! (BTW: Roughly one in every 3 PHP resource combiner scripts I have seen contains these security vulnerabilities. Beware where you get your source code!)
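The missing piece in those vulnerable combiner scripts is whitelist validation of the file list. As a hedged illustration, not taken from any particular script, the check could look like this. The function name selectAllowedFiles and the whitelist contents are hypothetical.

```javascript
// Resolve a "files=a.js|b.js" parameter against a fixed whitelist.
function selectAllowedFiles(param, allowed) {
  const requested = String(param)
    .split("|")
    .filter((name) => name.length > 0);
  const rejected = requested.filter((name) => !allowed.has(name));
  if (rejected.length > 0) {
    // Refuse the whole request instead of serving a partial result.
    throw new Error("File not on whitelist: " + rejected.join(", "));
  }
  return requested;
}

// Hypothetical whitelist: the only files the combiner may ever read.
const ALLOWED_SCRIPTS = new Set(["a.js", "b.js", "c.js"]);

console.log(selectAllowedFiles("a.js|b.js|c.js", ALLOWED_SCRIPTS));
```

A request for db.inc or ../../etc/passwd fails because those names are not on the list. The script never tries to decide whether a path "looks dangerous".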

The Fundamental Problem

The fundamental problem in all of these examples is that developers are getting lazy and are using PHP code to do something at runtime that should have been done earlier.

Properly Generating Static Content

Great! So what is a web developer to do? Go back to the dark ages where you cannot leverage all that great application logic in the generation of your content? I want my CSS variables and I want them now! Notice I never said you cannot dynamically generate static content! I just said you should not dynamically generate static content at runtime! Want CSS variables? Want to use a PHP script to combine resources or minify or whatever? Go ahead and do it! Just do it ahead of time. You can run your PHP script from the command line, produce your CSS file, complete with all the correct CDN paths and color values, and upload that to your website. And this isn’t just for PHP. Use Perl, Python, Ruby, Java, or whatever. You can even do it in QBASIC!

'CSSGEN.BAS - kicking it old school
CDN$ = "http://zoompf.com/"
LOGO$ = "includes/logo.png"
PRINT ".logo {"
PRINT " background: url("; CDN$ + LOGO$ + ");"
PRINT "}"

And the output:

qbasic-css-gen

(That’s right. I totally just used QBasic 1.1 from DOS 5.0 to automate publishing a web application on 64-bit Vista. Oh yeah!)

The moral of the story is never make the user pay for your laziness. Do not use the application tier of a website to dynamically generate static content at runtime. Instead do it at publishing time or even do it in a daily or hourly cron job. This approach allows you all the advantages of using application logic without drastically reducing the very web performance you were trying to improve in the first place!
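
For a more contemporary take, here is a rough Python sketch of the same publish-time step. It reuses the CDN and logo values from the QBasic listing. The function names build_css and publish are made up for illustration, and the output path is whatever your own deployment uses:

```python
from pathlib import Path

CDN = "http://zoompf.com/"     # same values as the QBasic listing above
LOGO = "includes/logo.png"


def build_css(cdn, logo):
    """Return the stylesheet text with the CDN path baked in."""
    return ".logo {\n  background: url(" + cdn + logo + ");\n}\n"


def publish(out_path):
    """Write the generated CSS to a plain static file (run at publish time or from cron)."""
    Path(out_path).write_text(build_css(CDN, LOGO), encoding="utf-8")


print(build_css(CDN, LOGO), end="")
```

The web server then serves the resulting file like any other static resource, with Last-Modified, conditional GETs, and range requests all working as usual.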

December 3, 2009

Web Performance Book Recommendations

Stoyan has a good blog post today as part of his Performance Advent series about required reading for web developers. He covered some great books. All three of the books that have already been published are currently sitting on my bookshelf, and you should buy them immediately if you don’t already own them. I thought I’d share a few more books I have read that contain web performance tips and tricks I have not seen in the books he recommended. Some are helpful, some are not. Having written a book on Ajax Security myself, I know exactly how difficult it is to create a meaningful and lasting book of substance. All of these authors deserve respect, even if the book is no longer beneficial today. For each of these 4 books I have included my overview and opinion of the book, the key performance tips and ideas it contains, and my recommendation.

Web Caching. By Duane Wessels (O’Reilly, 2001)

Cover of book "Web Caching"

Duane Wessels is the perfect choice to write what is the definitive guide to web caching as he is the creator of the Squid Caching Proxy. While this book is targeted more at IT operational folks (specifically people who install, configure, monitor, and maintain web proxies) it provides excellent background into how caching proxies work and are deployed and what they will and will not cache. It also has, without a doubt, the best explanation about Cache-Control directives I have ever read. It explains what the directives mean, how they interact with each other, and how caching proxies and the browser cache act on those directives. Think you know what “no-cache” does? You are wrong.

Key Performance Information

This book has tidbits here and there that will help front-end performance such as: Using Cache-Control correctly. Adding support for stale resources. What proxies will not cache even if it’s allowed (URLs with query strings, CGI-bin directories, etc). When caching is used but pointless (varying on cookies, host, etc). How you can improve your hit/miss ratio.

Verdict

BUY! Good background, worth the cost of the book for the exhaustive explanation of caching directives alone. A dozen or so front-end performance tidbits scattered throughout. Find a cheap used copy.

JavaScript: The Good Parts. By Douglas Crockford (O’Reilly 2008)

Cover of book "JavaScript:The Good Parts"

Written by JSON creator Douglas Crockford, JavaScript: The Good Parts provides a detailed analysis of JavaScript as a programming language and explores which features of the language aid and which hinder the creation of beautiful code, and why. While targeted at JavaScript developers, Chapter 10 and Appendixes A, B, and C provide a wealth of performance advice.

Key Performance Information

Half a dozen JavaScript performance tips mixed in throughout such as: Controlling scope chains of variables, dynamic compilation of code at runtime, avoiding type coercion, loop construction, regular expression performance.

Verdict

BUY! Will open your eyes about the elegance of JavaScript. If you like computer science and algorithms you will love this book. If you are only interested in the performance tips you’ll be disappointed if you pay full price for such a small book. Buy it used in that case.

Web Performance Tuning. By Patrick Killelea (O’Reilly 2002)

Cover of book "Web Performance Tuning"

Originally written in 1998, the 2nd edition was released in 2002 with seemingly minimal updating. I really wanted to like this book. It is well written with tons of data tables, charts, and graphs. Unfortunately nearly the entirety of the book serves better as a reference manual and contains little actionable performance advice for web developers. For example, the chapter on “Security” is about SSL. (I will punch the next person in the face who equates web security with SSL and firewalls). This chapter contains some nice graphs of the performance of an obsolete Netscape web server. After all of that the “advice” is to “consider buying an SSL accelerator card.” What about performance of different algorithms? Or how to optimize SSL negotiation? Or the importance of keeping SSL connections open? Nothing (though I’ll be writing a blog post about optimizing SSL performance soon). The book also contains very outdated filler chapters such as choosing a modem, choosing a client and server OS, choosing client and server hardware, and an overview of non-HTTP network protocols.

That is not to say this is a bad book. There are some very enjoyable parts. I found the information about the chain of syscalls Apache makes to process an HTTP request and serve the response to be utterly fascinating. Chapter 19 is the only chapter truly applicable to front-end performance. You should know everything in the chapter already, but it is interesting largely because the advice it contains predates the current front-end performance movement by a good 7 years.

Key Performance Information

All but a very few bits of performance advice are obsolete and focus entirely on the back-end. The main nuggets were things like: Use short filenames to save space. Minimize the use of symbolic links on the server. Turn off reverse DNS lookup for log files. Turn off mod_status. Set height and width HTML attributes to avoid repainting/re-rendering.

Verdict

Do Not Buy. This is no longer a useful book about web performance, and based on the number of filler chapters I doubt its value even when it was published. It is an enjoyable book if you are interested in learning more about how back-end web hardware functions. If so I suggest you find a used copy, as the information this book contains is so out of date it’s not worth anywhere near its cover price. I purchased it for $2.77 from Amazon and was happy.

Building Scalable Web Sites. By Cal Henderson (O’Reilly 2006)

Cover of "Building Scalable Web Sites"

Cal is the lead developer of Flickr so he knows a thing or three about building complex web applications that have to perform for millions of users. Don’t pigeon-hole this book as a back-end hardware book. It is a holistic book that covers a lot of ground in just 320 pages. This book is a guide to the development processes and practices, as well as architectural and back-end design of web sites that can be maintained and scaled to immense levels of traffic. Yes there is information about load balancers and database clustering. But there is also information about coding practices: using source control, branching, supporting international characters, abstracting away translations, abstracting and modularizing your code for easy updating, A/B testing of new features, and failover. Think of it as a modern version of Web Performance Tuning with current and proper information and no filler.

Key Performance Information

No specific advice per se. Instead this book is about how to design and build web applications that are easy to maintain, expand, and extend, and that can be quickly replaced as your user base grows. It will change the way you build web applications.

Verdict

BUY! An excellent survey of the processes needed to build and grow truly scalable applications. Its information on building asynchronous remote systems is worth the price alone. I am using this as my bible as I design the web front-end to Zoompf’s scanning engine. I highly recommend this book to both web developers and IT operations.

Conclusions

There are some obvious must have web performance books available today. However there are additional books that provide insight into the tricks, tips, and processes needed to build high performance web applications that are not published elsewhere. Hopefully this post should help you build out your library of web performance books.

Did I miss one? Please comment below and tell me what other books you recommend that contain good advice to improve website performance.

Browser Performance Issues with Charsets

Failing to define a character set, or defining it in the wrong place, can cause poor performance for your website’s visitors. In this post we will discuss character sets and how best to define them to avoid web performance problems.

At their core, HTML documents are just a series of bytes. The character set (or charset) for an HTML document tells your web browser how it should process those bytes to construct characters. The browser then interprets those characters to render the web page. The 2 most common ways to tell the web browser what charset to use for an HTML page are by specifying it in the HTTP Content-Type header or by using a <META> tag to emulate an HTTP Content-Type header. When the web content author is the same person as the web server administrator it is possible to directly configure the web server to use the appropriate charset for the appropriate URLs. In this world of virtual hosts, Content Management Systems, and blogs this is rarely the case anymore. As such more and more web developers are using <META> tags to define the charset for HTML documents.

This leads to a Chicken-and-the-Egg problem. The HTML document contains text which tells the browser how to read the document. Hmmm. So how does the browser read the document without a charset? While it varies with browser and version, most assume a Latin alphabet charset like US-ASCII or ISO-8859-1 (Latin-1). The browser then reads the HTML document using this charset, scanning for charset information. At this point one of three things happens:

  1. There is no <META> tag with charset information.
  2. There is a <META> tag with a charset and it’s what the browser guessed.
  3. There is a <META> tag with a charset, but it’s a different charset than the browser guessed.

If there is no charset information the browser is in an odd position. At this point most browsers attempt some type of charset detection. As someone with several years of web security experience, believe me when I tell you that in theory this is an awesome idea but in practice it is a horrible one. Web browsers or servers trying to “fix” broken data is the root of a number of nasty web security vulnerabilities (such as UTF-7 XSS attacks and various other injection evasions). Regardless, no charset information of any kind forces the browser to do more processing, which can produce a very small performance hit at best and a hacked website at worst.

If there is a <META> tag whose charset is the same as what the browser guessed there is no issue. Nothing else needs to occur.

If there is a <META> tag and it specifies a charset different than the assumed charset the browser has a problem. It has already interpreted some amount of the HTML document but it was the wrong charset. That information is all bad. The document needs to be reprocessed using the correct charset. So right now at best you are talking about a small performance penalty as the browser has to reparse the beginning of the HTML document.
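
To make “it was the wrong charset” concrete, here is a tiny Python sketch that decodes the same single byte under two different charsets. The specific charsets are my assumption for illustration: Windows-1252 (which browsers commonly substitute for Latin-1) as the guess, and Windows-1251 as an example Cyrillic charset declared by the page.

```python
raw = b"\x80"  # one byte exactly as it sits in the HTML file

as_assumed = raw.decode("cp1252")   # assumed charset: Windows-1252
as_declared = raw.decode("cp1251")  # declared charset (assumption): Windows-1251

# Same byte, two different characters: a Euro sign versus a Cyrillic letter.
print(repr(as_assumed), repr(as_declared))
print(as_assumed == as_declared)
```

Everything the browser built from the first interpretation (text, script source, URLs) may differ under the second, which is why it has to throw that work away.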

But it can get worse! This is because browsers don’t scan the entire HTML document looking for a charset. They want to start rendering content! If they don’t see a charset defined “near the top” of the HTML document they start rendering content and executing JavaScript using the assumed charset. (“Near the top” varies from browser to browser which we will discuss in a minute). But once the browser gets going interpreting and executing content and then finds a <META> tag with charset information it’s in a real bind. Because now it has already been executing code, requesting other resources, and rendering content using the wrong charset! Those URLs could be wrong, that JavaScript could have syntax errors, or the CSS rules could be misspelled, all because the browser read them using the wrong charset information.

“Near the top” for Firefox 3.5 means within the first 2048 bytes. If Firefox does not detect charset information in the first 2048 bytes of an HTML document (and no charset was defined in the HTTP headers) it starts rendering the page and executing script using an assumed charset (I did not investigate other browsers). Consider this example web page adapted from a Simon Pieters test case. It contains some JavaScript, whitespace, and, starting just after 2048 bytes, a <META> tag defining the charset. In Firefox the JavaScript runs and pops an alert box showing a Euro sign. After 2048 bytes there is a <META> tag changing the charset from the assumed Latin-1. Firefox has to reprocess and re-render the page, which executes the JavaScript again, this time with a Cyrillic character appearing in the alert box.

It is also interesting what the browser does if it has already made a request. If Firefox has already requested a URL and then detects a new charset the URL must be re-requested. Consider this example page. Here JavaScript makes a request to a nonexistent image from www.google.com (we include the alert box to create a delay in this simple test case to ensure Firefox has already started fetching the resource). The URL contains a character that changes based on the charset, so it must be re-requested. Using an HTTP proxy we see the browser made 2 requests to 2 different URLs (with URL encoding to encode the characters being sent).

charset

Note: it appears that Firefox does not try to re-request a URL if the change in the charset did not change the meaning of the URL. If you modify the 2nd example to request “abc.gif” it does not appear that Firefox fetches this twice. More testing is needed here.
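
Here is a rough Python sketch of why the two requests end up with different URLs while a plain ASCII name like “abc.gif” does not. It assumes the browser percent-encodes the decoded character as UTF-8 and reuses the illustrative Windows-1252 and Windows-1251 charsets. The exact bytes Firefox sent in my test are not reproduced here, and url_for is a made-up helper:

```python
from urllib.parse import quote


def url_for(raw_name, charset):
    """Decode the raw bytes with the given charset, then percent-encode as UTF-8."""
    return "/" + quote(raw_name.decode(charset)) + ".gif"


for charset in ("cp1252", "cp1251"):
    # The byte 0x80 decodes differently, so the encoded URL differs too.
    print(charset, url_for(b"\x80", charset), url_for(b"abc", charset))
```

The ASCII name decodes to the same characters under both charsets, which is consistent with Firefox not bothering to fetch it a second time.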

So there you have it. Browsers take a performance hit of varying severity when you fail to specify the charset near the very top of your HTML document. Always make sure to include some type of character set information so the browser does not waste time auto-detecting one. This can slightly help performance and avoid security vulnerabilities. If you are using <META> tags to specify the character set information of your web pages, make sure to place them as high in the <HEAD> of your HTML document as possible. The W3C standard specifically mentions this problem and solution. For Firefox, you only need 2048 bytes before the <META> charset tag to cause this problem. A <SCRIPT> tag, a <STYLE> tag, an HTML comment, or even a <META> description tag or long <META> keywords tag can easily consume 2048 bytes. While other browsers may be more tolerant and allow a larger window, they would still take the performance hit of having to reparse the byte stream. For these reasons Zoompf recommends you place the <META> charset tag as the first element inside the <HEAD> of your HTML document to avoid any performance problems.
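
If you want to check your own pages, a crude test is easy to script. The Python sketch below looks for the first <META> tag that mentions a charset and reports whether it finishes inside the first 2048 bytes. The regular expression is deliberately simplistic and the function name charset_is_early is made up. Whether Firefox needs the entire tag inside the window is an assumption on my part.

```python
import re

# Crude pattern: a <meta ...> tag that mentions "charset" before its closing '>'.
META_CHARSET = re.compile(rb"<meta[^>]*charset", re.IGNORECASE)
WINDOW = 2048  # the Firefox 3.5 limit discussed above


def charset_is_early(html_bytes, window=WINDOW):
    """True if a <META> charset declaration is found within the first `window` bytes."""
    match = META_CHARSET.search(html_bytes)
    return match is not None and match.end() <= window


meta = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
good = b"<html><head>" + meta + b"</head></html>"
bad = b"<html><head><!--" + b" " * 3000 + b"-->" + meta + b"</head></html>"

print(charset_is_early(good), charset_is_early(bad))
```

This only inspects the HTML bytes. A charset sent in the HTTP Content-Type header would make the check unnecessary.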

Want to see what performance problems you have? An appropriately placed <META> charset tag is just one of the 200+ performance issues Zoompf detects while assessing your web applications. You can sign up for a free mini web performance assessment at Zoompf.com today!

December 1, 2009

Expanding Rezipping

This post is a follow-up to the previous post “Rezipping Web Resources for Fun and Profit.” In that article, we showed that many common web files, such as MS Office documents, Silverlight applications, Java Applets, and more are really just Zip files with a special structure of files inside. By rezipping a file (unzipping the contents and rezipping those contents using a higher compression level) web developers can reduce the size of those files by 5-30%!

An obvious, but less useful expansion of rezipping is to extend it to other compression types, namely GZip compressed files or BZip2 compressed files. We can use 7-zip’s command line version 7za to accomplish this. It looks something like this:

//gunzip the file into temporary directory
7za X -tgzip original.gz -o"c:\tmp\"
//regzip using maximum compression
7za A -tgzip -mx9 new.gz "c:\tmp\original"

This approach can be extended to BZip2 using the “-tbzip2” switch. I collected a few samples of GZip archives and, using rezipping, was able to reduce their size by an average of 5.03% as shown in the table below.

Archive | Original Size (bytes) | Rezipped Size (bytes) | % Savings
bochs-2.4.2.tar.gz | 4,035,010 | 3,879,123 | 3.863%
dojo-release-1.3.2.tar.gz | 2,618,493 | 2,471,078 | 5.630%
expsummarytalk.ps.gz | 130,247 | 121,528 | 6.694%
httpd-2.2.14.tar.gz | 6,684,081 | 6,420,948 | 3.937%

Using rezipping on GZip or BZip2 archives is unfortunately less useful than on Zip files. This is because so many files that are served or downloaded on the web use Zip files as a wrapper. Finding ways to optimize Zip files lets you optimize a dozen other file types on the web. These files are either directly loaded and executed by the browser (like Silverlight or Applets) or are very common downloadable content like documents or presentations. However I know of no web content that uses a GZip file or BZip2 file as a wrapper file. While downloadable programs, source code, or other archives might use GZip or BZip2, you will not find any widely deployed document or content format that uses these as the wrapper file. This limits the usefulness of rezipping GZip or BZip2 archives.

As mentioned in the last post, one positive note is that while no widely deployed web files use GZip as a wrapper, many files contain raw GZip or DEFLATE streams. Flash (SWF) files use zlib, a thin wrapper around DEFLATE, to compress their contents. PDFs use DEFLATE to compress text streams. This means that with a little parsing and some glue code, proven tools like 7-zip should be able to reduce the size of other files that are very common on the web today!
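
For anyone who would rather script the GZip round trip than shell out to 7za, here is a minimal Python sketch using the standard gzip module. Note the big assumption: Python uses zlib’s DEFLATE implementation, not 7-zip’s, so the savings will not match the table above. The function name regzip is made up.

```python
import gzip


def regzip(gz_bytes, level=9):
    """Decompress a GZip stream and recompress it at the given level."""
    return gzip.compress(gzip.decompress(gz_bytes), compresslevel=level)


payload = b"web performance " * 4096
original = gzip.compress(payload, compresslevel=1)  # stand-in for a quickly made .gz
smaller = regzip(original)

print(len(original), len(smaller))
```

The same decompress-then-recompress shape applies to the zlib streams inside other formats, once you have parsed out where those streams begin and end.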

November 30, 2009

Rezipping Web Resources for Fun and Profit

One large area of web performance optimization is reducing the size of your content. Most people know about obvious techniques like HTTP compression, minifying, or removing extra data from images. However there is one size-reduction technique that does not seem to be common knowledge for most web performance junkies: Rezipping.

Let us start with a little background. Zip archives consist of multiple compressed files that are packaged together into a single file. Zip archives are compressed using the DEFLATE compression algorithm. DEFLATE supports different compression levels from 1-9. These compression levels provide a trade-off between the CPU and memory resources used to create the Zip file and the size of the resulting Zip file. Using a higher compression level consumes more resources but you end up with a smaller file. Most Zip programs tend to create Zip archives using a compression level of 5 or 7. While this can be a good trade-off, as the file is created quickly and is reasonably compressed, it will not produce the smallest file possible.
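
You can see the level trade-off for yourself with a few lines of Python. This sketch uses the standard zlib module on some made-up, fairly repetitive text. The exact byte counts depend on the input and on the zlib build, so none are quoted here.

```python
import zlib

# Made-up sample data: repetitive text, similar in spirit to XML or CSV content.
data = "".join("row %d: value=%d\n" % (i, i * i % 97) for i in range(5000)).encode("ascii")

# Compress the same bytes at a low, a middle, and the maximum DEFLATE level.
sizes = {level: len(zlib.compress(data, level)) for level in (1, 5, 9)}

for level, size in sizes.items():
    print("level %d: %d bytes (original %d bytes)" % (level, size, len(data)))
```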

Now all that is well and good. But why should frontend web developers care about Zip file optimization? Simple: Many of the most common files on the Internet are actually Zip files. By creating methods to make smaller Zip files we are actually optimizing multiple different types of web files. Optimizing these files will reduce bandwidth consumption and server load while improving page load times.

These “Files that don’t end in .zip but really are Zip Files” use the Zip file format as a kind of wrapper to collect all the bits and pieces that really make up the file and store them in a single compressed unit. For example, Silverlight applications have a XAP file extension. However Silverlight applications are just a Zip file containing compiled byte code, resources like images and sounds, and other configuration. Java Applets contained in JAR files are Zip files. All of Microsoft Office’s OOXML documents (DOCX, XLSX, PPTX, etc) are Zip files. All of OpenOffice.org’s ODF documents (ODT, ODP, ODS, etc) are Zip files. You can rename any of these types of files to “.zip” and open them with any Zip program.
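
You do not even need to rename the file to confirm this. Python’s standard zipfile module only cares about the contents, not the extension. The sketch below builds a tiny stand-in for a DOCX in memory (the member name word/document.xml is just illustrative) and then inspects it the way you would inspect a real download:

```python
import io
import zipfile

# Build a small stand-in "document.docx" in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("word/document.xml", "<document>hello</document>")

# Check whether the bytes are a Zip archive, then list what is inside.
buf.seek(0)
looks_like_zip = zipfile.is_zipfile(buf)
with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()

print(looks_like_zip, names)
```

For a file on disk, pass the path (for example a .xap or .xlsx) to zipfile.is_zipfile and zipfile.ZipFile instead of the in-memory buffer.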

Since all of these common web files are simply Zip files we can optimize them to improve web performance and operational costs. This is where Rezipping comes in. Rezipping is the process of recompressing a Zip file to create a smaller file. The process is simple: you take any Zip file, unzip the contents, and then rezip the contents at a higher compression level. To accomplish this, I am using the command line version of 7zip. 7zip’s implementation of the DEFLATE compressor is generally considered to compress files better than other Zip programs by 5% to 10%. The process looks like this:

//unzip the contents of the original zip into a temporary directory
7za.exe X original.zip -o"c:\tmp\"
//rezip using maximum compression
7za.exe A -mx9 new.zip "c:\tmp\*"

To see how much this could help web performance, I downloaded several samples of different types of Zip files off of the Internet.

Silverlight

Name | Original Size (bytes) | ReZipped Size (bytes) | % improvement
cached – SilverlightApplication1.xap | 3,972 | 3,899 | 1.84%
Everything-SilverlightApplication1.xap | 825,801 | 782,594 | 5.23%
Examples.CS.xap | 4,752,262 | 3,376,411 | 28.95%
GeoReference.xap | 388,898 | 288,977 | 25.69%
HoldemSimulatorUI.xap | 1,280,714 | 1,243,967 | 2.87%
ImageGallery_v25_9458063489vC.xap | 18,226 | 17,538 | 3.77%
SilverlightControl.xap | 678,995 | 557,791 | 17.85%

On average rezipping reduces a Silverlight application by 12.32%. This is quite good given that XAP files can contain many binary files like images or sounds that will not be recompressed. Some files created from Visual Studio saw an improvement of more than 25%! Also notice that “ImageGallery_v25” is the Silverlight application used by Bing to change Bing’s background image. This heavily served file could be slimmed by nearly 4% simply by rezipping the XAP file!

Microsoft Excel Documents

Name | Original Size (bytes) | ReZipped Size (bytes) | % improvement
Listedescourselearning.xlsx | 55,618 | 40,753 | 26.73%
ParticipatingMembers.xlsx | 170,382 | 123,275 | 27.65%
PartnerReadinessAndTrainingFY09.xlsx | 26,673 | 21,349 | 19.96%
PermissionTemplate.xlsx | 22,570 | 15,969 | 29.25%
Presentation_Skills_Providers.xlsx | 33,092 | 27,144 | 17.97%

On average rezipping Excel files saves about 25%. This makes sense as most Excel spreadsheets contain predominantly text and not incompressible binary data.

Microsoft PowerPoint Documents

Name | Original Size (bytes) | ReZipped Size (bytes) | % improvement
AMP 8.0 Project Kickoff Template v1.2 07102009.pptx | 112,637 | 96,753 | 14.10%
CL01.pptx | 1,918,440 | 1,692,785 | 11.76%
CL02.pptx | 5,872,228 | 5,448,818 | 7.21%
EC2.pptx | 123,137 | 100,013 | 18.78%
MSDN_Admin_08.pptx | 2,006,091 | 1,862,496 | 7.16%
SharePoint_Buzz.pptx | 2,123,778 | 2,040,234 | 3.93%
speedgeeks-20091026.pptx | 3,408,365 | 3,271,384 | 4.02%
SupportingDistributedTeamwork.pptx | 2,454,360 | 2,387,257 | 2.73%

On average rezipping PowerPoint files saves about 9%. This can vary widely depending on the number of images that are contained inside the PPTX file as images are not recompressed (more on that in another article).

Microsoft Word Documents

Name | Original Size (bytes) | ReZipped Size (bytes) | % improvement
ASC_3.0_Demo_Image_Release_Notes.docx | 431,220 | 412,034 | 4.45%
implementationchecklist.docx | 126,981 | 120,075 | 5.44%
MSCOM_Virtualizes_MSDN_TechNet_on_Hyper-V.docx | 115,230 | 89,572 | 22.27%
CompProposal.docx | 25,548 | 21,395 | 16.26%
Web content redline 2009-10-28.docx | 201,304 | 180,868 | 10.15%
WindowsSharePointServicesDatasheet.docx | 198,837 | 172,082 | 13.46%

On average rezipping Word documents saves about 12%.

Conclusions

Always Use Rezipping! Stop sending bytes down the pipe you don’t have to! The savings you receive from rezipping are driven by the contents of the Zip file. Files with a large number of binary objects that will not be compressed (like images) will see a lower improvement. Also note that higher compression levels increase the time and memory needed to compress data, but they do not increase the time it takes to decompress data. This is because all the work is in finding out what can be reduced during compression, not in recreating the original data during decompression. There is no reason not to use rezipping.

By rezipping your files you can reduce the size of your content. This reduces bandwidth consumption and server load while improving page load times! There is more work to be done. There are a number of web files that contain raw Deflate streams like Flash files, WOFF font files, SVGZ, and more. All of these could be redeflated using a compression level of 9 to make smaller, faster files. Stay tuned as we investigate this more.
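
If you want to experiment without 7zip, the unzip-and-rezip loop can be sketched with Python’s standard zipfile module. Two caveats, both assumptions worth stating loudly: this uses zlib’s DEFLATE rather than 7zip’s, so it will not reproduce the numbers in the tables above, and it rewrites members by name only, so per-file metadata such as timestamps is not preserved. The names rezip and percent_saved are made up.

```python
import io
import zipfile


def rezip(src_bytes, level=9):
    """Return a new Zip archive with every member recompressed at `level`."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(src_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED, compresslevel=level) as dst:
        for info in src.infolist():
            dst.writestr(info.filename, src.read(info))
    return out.getvalue()


def percent_saved(original_size, new_size):
    """Savings as a percentage of the original size (the tables' last column)."""
    return 100.0 * (original_size - new_size) / original_size


# First row of the GZip table in the follow-up post: 4,035,010 -> 3,879,123 bytes.
print(round(percent_saved(4035010, 3879123), 3))
```

Call rezip with the bytes of any XAP, JAR, DOCX, XLSX, or PPTX file and compare the lengths before and after.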

November 25, 2009

Performance Questions to Ask Hosting Providers: Log File Access

(This is the second article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article in the series is here)

Last time we covered the most important question you should ask a hosting provider: What control do I have over the web server? This time we will be showcasing another important question to ask a hosting provider:

“What Access Do You Provide to Web Server Logs?”

The main reason you want access to log files from the web server is to learn how visitors are accessing your content. This will reveal a wealth of knowledge about the raw traffic patterns of your web application and expose various performance issues and limitations. Often these performance issues will not be detected by page-based performance tools like Yahoo’s YSlow or Google’s Page Speed.

Web server logs come in many different formats. Usually they are large text files where every request is logged on its own line. Several pieces of data about each request are logged in different fields on the line, separated by a delimiter such as a space or a comma. Typically the information logged for each request is:

  • URL requested.
  • IP address of the visitor.
  • Date and time request was received.
  • The program or browser used to request the URL. This is called the User-Agent.
  • The Referring webpage (if any).
  • HTTP version used to request the page.
  • Status code of the response.
  • Size of the body of the response.
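
To give a feel for what one of these lines looks like in practice, here is a small Python sketch that pulls those fields out of a single log line. It assumes the Apache “combined” log format, which is only one of the many formats mentioned above. The sample line, the IP address, and the field names are made up for illustration.

```python
import re

# Assumed layout (Apache "combined" format):
# ip ident user [time] "method url version" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<version>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the size field means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('203.0.113.7 - - [25/Nov/2009:10:15:32 -0500] "GET /logo.png HTTP/1.1" '
          '200 5120 "http://example.com/" "Mozilla/5.0"')
print(parse_line(sample))
```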

Log files are a very granular view of your web traffic. Sometimes it can be difficult to see the forest for the trees. For example, what pages did user XYZ visit, in what order, and how long did the user stay on each page? It is usually very difficult to get this information from logs alone because web server logs only track users by a specific IP address. To provide a larger view and answer questions like those listed above, web developers use web analytics packages like Omniture, Hitbox, or Google Analytics. Web analytics packages use cookies and JavaScript to gather detailed information about your visitors, the capabilities of their browsers, and their actions through your web application. Web analytics packages are simple to add to a website. Typically all that is involved is inserting a block of JavaScript at the end of each HTML page. This is very easy to do on templated or dynamically generated websites. So if web analytics provides you with a “bigger picture” and richer data than web server logs, that raises the question:

Are Web Analytics Reports Good Enough?

Actually, no. Web analytics reports are not good enough. Web analytics data abstracts away the raw traffic of your web application, which can hide several important problems. Web analytics packages only track visitor requests and activity for HTML files that are served with a 200 status code. Out of the box, here are things that most web analytics packages do not track:

  • All requests to non-HTML resources.
    • Images
    • JavaScript
    • Style sheets
    • Feeds (RSS, Atom)
    • RIA files (HTC, Flash, Silverlight, Java, etc)
    • Access files (robots.txt, sitemap.xml, crossdomain.xml, etc)
    • Documents (PDF, Office docs, Zip files)
    • Other resources (Fonts, Cursors)
  • Most error pages (404, 5xx, etc).
  • Conditional requests that return “304 Not Modified.”
  • Requests from non browser User-Agents (spiders, mash-ups, etc).
  • Users who have JavaScript disabled for accessibility or (more commonly) security reasons.

This valuable information is completely missed if you are only using web analytics data to understand your traffic. Consider the valuable questions you can answer with web server logs:

  • Where are your redirects? Which can be removed to decrease page load time?
  • What web resources are using the most bandwidth? This is calculated by simply adding up all the body sizes that are returned for a resource and sorting. Can you reduce the size of these files somehow using compression, minification, or by removing meta data?
  • What are the most requested resources on your website? Can you use caching or other methods to minimize the number of times the resource is requested? If you cannot cache those files because they are dynamically generated, can you add programming logic to use a Last-Modified header to reduce bandwidth? Can you remove resources like external JavaScript or CSS files that are referenced but not actually used by that web page?
  • How often does your static content actually change? This is calculated by counting the 304s for a resource. Perhaps you can use a longer Expires time on your content.
  • How often are search engine crawlers visiting your site? Are there crawlers that are missing? You should submit your website to their indexes.
  • Are crawlers finding your high value content? Perhaps you should be using or modifying your sitemap.
  • Do crawlers request a large amount of low value content? Do crawlers “get stuck” on part of your website? Perhaps you need to fix your robots.txt.
  • Which web resources are you still getting a lot of requests for that no longer exist? You should use a redirect that points to the correct content.
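
Several of these questions boil down to simple counting once the log has been parsed. Here is a rough Python sketch, assuming the log lines have already been reduced to (URL, status code, body size) tuples. The records below are invented sample data, not real traffic.

```python
from collections import Counter

# Hypothetical pre-parsed log records: (url, status code, response body size in bytes)
records = [
    ("/robots.txt", 200, 250000),
    ("/logo.png", 200, 5120),
    ("/logo.png", 304, 0),
    ("/logo.png", 304, 0),
    ("/old-page.html", 404, 512),
    ("/robots.txt", 200, 250000),
]

bandwidth = Counter()     # total body bytes served per resource
not_modified = Counter()  # 304 responses per resource
missing = Counter()       # 404 responses per resource

for url, status, size in records:
    bandwidth[url] += size
    if status == 304:
        not_modified[url] += 1
    elif status == 404:
        missing[url] += 1

print("Most bandwidth:", bandwidth.most_common(3))
print("Most 304s:", not_modified.most_common(3))
print("Most 404s:", missing.most_common(3))
```

Sorting by bandwidth surfaces the heavy resources, a pile of 304s points at content that rarely changes, and the 404 list shows which old URLs deserve a redirect.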

Real Life Examples

At Zoompf we have detected and solved numerous performance issues just by examining a client’s web server log files. Here are some of our more interesting stories:

  • A social networking client used their robots.txt file to prevent crawlers from indexing content from new, untrusted members that could be content spammers. As such their robots.txt was 250 kilobytes! Because the file was updated so often all of the search crawlers would request it multiple times a day. These factors resulted in the client using 5 gigs of bandwidth a month just to serve its robots.txt file! Turning on HTTP compression for text files reduced this by 70%. The client is currently implementing the use of <META> tags and rel="nofollow" attributes to limit search engine indexing for the web pages of untrusted users. This will result in even higher performance savings.
  • A software client found that by far their most popular pages were for their product documentation. These pages were constructed using a PHP templating system. However these files never changed once that version of the product shipped. The client moved to pre-rendering the web pages to static HTML files and using a far future Expires header on the HTML files. This drastically improved performance and reduced bandwidth consumption and server load.
  • An ecommerce client discovered crawlers were requesting all the available colors for each item they sold. For example, the crawlers would visit /item1/, /item1/color/red/, /item1/color/blue/, and so on. By creating a robots.txt rule to prevent crawlers from requesting every color for every item the client reduced their bandwidth by nearly 80% while still having their important content indexed.
  • A client discovered that their most important content was not getting indexed. A developer had copied a code snippet from the Internet into the top of their template to solve a CSS problem they were having. Unfortunately this code snippet also included a <META> tag telling search engines not to index the web page.
  • A client learned that its logo had not changed for over 4 years and that it was also an uncompressed BMP file even though the logo had a JPEG extension. They changed the logo to a proper image format for the web and increased the logo’s Expires time.
  • A client discovered that no one had ever requested their iPhone application image. They removed the <LINK> tag and reduced the size of all of their HTML pages.
  • A client discovered their favicon.ico file was consuming huge amounts of bandwidth. This is because it contained multiple versions of the same icon at different dimensions. Removing all but the 16 by 16 pixel version from the ICO file reduced file size by 97%.

Typical Log Access

If you have your own server, access to the web server logs is usually unrestricted. However, in most shared hosting environments you will not have direct access. The typical options for getting at your web server logs are:

  • The raw Apache, IIS, or NCSA log files in a directory outside of your web root that you can access using FTP or sFTP. This is the ideal case.
  • The raw Apache, IIS, or NCSA log files placed directly in your web root. While this provides you with the raw logs anyone on the internet can also access your log files. This is a security risk as log files can often contain sensitive data like credentials or “hidden” areas of your web application. Talk with your hosting provider about moving the location of the web logs.
  • An option through a web-based website administration system like cPANEL that lets you download the raw log file.
  • An option or interface in the web admin system that lets you view or download a specially formatted version of the logs.

If you cannot access the raw log files, don’t panic. As long as the log file contains the following information you will have all the data you need:

  • URL requested
  • Date and time request was received
  • The program or browser used to request the URL. This is called the User-Agent.
  • Status code of the response
  • Size of the body of the response
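
To make that list concrete, here is a rough sketch of pulling those five fields out of a single log entry. It assumes the log is in Apache's "combined" format. IIS and custom formats will need a different pattern, and the dictionary keys are just names we picked for illustration:

```python
import re
from datetime import datetime

# Apache "combined" log format (an assumption; adjust for IIS or custom formats):
# host ident user [time] "METHOD url PROTOCOL" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'\S+ \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"\S+ (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the five fields as a dict, or None if the line does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    return {
        "url": m.group("url"),
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "agent": m.group("agent"),
        "status": int(m.group("status")),
        # Apache writes "-" when no body was sent (for example on a 304).
        "size": 0 if m.group("size") == "-" else int(m.group("size")),
    }
```

If parse_line returns None for most of your entries, the log is probably in a different format and the pattern needs adjusting before you trust any numbers built on it.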

Another question to ask hosting providers is not only “what information is in the log file” but also “how much time does the log file cover?” You can imagine how, in large shared hosting environments, log files can quickly grow to hundreds of megabytes for potentially thousands of customers. Hosting providers often limit the log file in different ways, including:

  • Record only a week of traffic and replace the log with a new empty file every week.
  • Limit the total size of the log file. Each new entry removes an entry from the start of the log.
  • Provide a nightly copy of the log file for all the traffic the site received that day. These copies are usually removed after a certain amount of time.

If your logs only cover a limited time window, make sure you grab a copy of the log file before it is replaced. Some interfaces like cPANEL offer a scheduling service that can email you the log file or place it in a special location that you can then download. You can also schedule an FTP download or use wget or curl to download these log files.

Processing Log Files

Depending on how much log data you have, you might want to concatenate your log files together until you have a big enough sample. At Zoompf we suggest collecting a sample of between 500,000 and 1,000,000 requests, or a week’s worth of web traffic, whichever is larger. Programs like awstats are very helpful for processing logs and provide reports with your most popular and least popular files, largest files in terms of bandwidth, and other data already broken out. Directly processing the logs yourself allows you to discover more detailed data and is not as hard as you would think. Some basic regular expressions can make it very easy to gather metrics like “show all of the 304s, 404s, 500s, etc.”
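
As an example of how little code this takes, here is a small sketch that tallies status codes with one regular expression. It assumes Apache-style common or combined log lines, where the status code comes right after the quoted request line. The function name is ours:

```python
import re
from collections import Counter

# In common/combined logs the status code follows the quoted request line,
# e.g.  "GET /logo.jpg HTTP/1.1" 304 -
STATUS = re.compile(r'" (\d{3}) ')

def count_statuses(lines):
    """Tally HTTP status codes across an iterable of raw log lines."""
    counts = Counter()
    for line in lines:
        m = STATUS.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Because the function accepts any iterable of lines, you can pass it an open file object for your concatenated log and then look at the counts for "304", "404", and "500". Lines that do not match the pattern are skipped rather than counted.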

Remember, examining your web logs is a key technique for discovering and solving performance problems with your web applications. Those pretty graphs from Google Analytics or other web analytics tools are simply not good enough to detect performance issues and bottlenecks. You need access to the information about all the requests the web server is processing. Make sure you ask your hosting provider how you can access the raw web server log files. Find out how much web traffic data the logs contain and how you can easily collect this data so you can analyze it. If your hosting provider does not provide this you should consider that a deal breaker and find another provider.