Performance Questions to Ask Hosting Providers: Log File Access

Posted: November 25, 2009 at 2:39 pm

(This is the second article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article in the series is here)

Last time we covered the most important question you should ask a hosting provider: What control do I have over the web server. This time we will be showcasing another important question to ask a hosting provider:

“What Access Do You Provide to Web Server Logs?”

web server configuration

The main reason you want access to log files from the web server is to learn how visitors are accessing your content. This will reveal a wealth of knowledge about the raw traffic patterns of your web application and expose various performance issues and limitations. Often these performance issues will not be detected by page-based performance tools like Yahoo’s YSlow or Google’s Page Speed.

Web server logs come in many different formats. Usually they are large text files where every request is logged on its own line. Several pieces of data about each request are logged in different fields on the line separated by commas. Typically information that is logged for each request is:

  • URL requested.
  • IP address of the visitor.
  • Date and time request was received.
  • The program or browser used to request the URL. This is called the User-Agent.
  • The Referring webpage (if any).
  • HTTP version used to request the page.
  • Status code of the response.
  • Size of the body of the response.

Log files are a very granular view of your web traffic. Sometimes it can be difficult to see the forest through the trees. For example, what pages did user XYZ visit, in what order, and how long did the user stay on each page? It is usually very difficult to get this information from logs alone because web server logs only track users by a specific IP address. To provide a larger view and answer questions like those listed above web developers use web analytics packages like Omniture, Hitbox, or Google Analytics. Web analytics packages uses cookies and JavaScript to gather detailed information about your visitors, the capabilities of their browsers, and their actions through your web application. Web analytics packages are simple to add to a website. Typically all that is involved is inserting a block of JavaScript at the end of each HTML page. This is very easy to do on templated or dynamically generated websites. So if web analytics provides you with “bigger picture” and richer data than web server logs that begs the question:

Are Web Analytics Reports Good Enough?

Actually, no. Web analytics reports are not good enough. Web analytics data abstracts away the raw traffic of your web application and which can hide several important problems. Web analytics packages only track visitor requests and activity for HTML files that are served with a 200 status code. Out of the box, here are things that most web analytics packages do not track:

  • All requests to non-HTML resources.
    • Images
    • JavaScript
    • Style sheets
    • Feeds (RSS, Atom)
    • RIA files (HTC, Flash, Silverlight, Java, etc)
    • Access files (robots.txt, sitemaps.xml, crossdomain.xml, etc)
    • Documents (PDF, Office docs, Zip files)
    • Other resources (Fonts, Cursors)
  • Most error pages (404, 5xx, etc).
  • Conditional requests that return “304 Not Modified.”
  • Requests from non browser User-Agents (spiders, mash-ups, etc).
  • Users who have JavaScript disabled for accessibility or (more commonly) security reasons.

This valuable information is completely missed if you are only using web analytics data to understand your traffic. Consider the valuable questions you can answer with web server logs:

  • Where are your redirects? Which can be removed to decrease page load time?
  • What web resources are using the most bandwidth? This is calculated by simply adding up all the body sizes that are returned for a resource and sorting. Can you reduce the size of these files somehow using compression, minification, or by removing meta data?
  • What are the most requested resources on your website? Can you use caching or other methods to minimize the number of times the resource is requested? If you cannot cache those files because they are dynamically generated can you add programming logic to use a Last-Modified header to reduce bandwidth? Can you remove resources like external JavaScript or CSS files that are referenced by not actually used by that web page?
  • How often does your static content actually change? This is calculated by counting the 304s for a resource. Perhaps you can use a longer Expires time on your content.
  • How often are search engine crawlers visiting your site? Are there crawlers that are missing? You should submit your website to their indexes.
  • Are crawlers finding your high value content? Perhaps you should be using or modify your sitemap.
  • Do crawlers request a large amount of low value content? Do crawlers “get stuck” on part of your website? Perhaps you need to fix your robots.txt.
  • Which web resources are you still getting a lot of requests for that no longer exist? You should use a redirect that points to the correct content.

Real Life Examples

At Zoompf we have detected and solved numerous performance issues just by examining a client’s web server log files. Here are some of our more interesting stories:

  • A social networking client used their robots.txt file to prevent crawlers from indexing content from new, untrusted members that could be content spammers. As such their robots.txt was 250 kilobytes! Because the file was updated so often all of the search crawlers would request it multiple times a day. These factors resulted in the client using 5 gigs of bandwidth a month just to serve its robots.txt file! Turning on HTTP compression for text files reduced this by 70%. The client is currently implementing the use of <META> tags and rel=”nofollow” attributes to limit search engine indexing for the web pages of untrusted users. This will result in even higher performance savings.
  • A software client found that by far their most popular pages were for their product documentation. These pages were constructed using a PHP templating system. However these files never changed once that version of the product shipped. The client moved to pre-rendering the web pages to static HTML files and using a far future Expires header on the HTML files. This drastically improved performance and reduced bandwidth consumption and server load.
  • An ecommerce client discovered crawlers were requested all the available colors for each item they sold. For example, the crawlers would visit /item1/, /item1/color/red/, /item1/color/blue/, and so on. By creating a robots.txt rule to prevent crawlers from requesting every color for every item the client reduced their bandwidth by nearly 80% while still having their important content indexed.
  • A client discovered that their most important content was not getting indexed. A developer had copied a code snippet from the Internet into the top of their template to solve a CSS problem they were having. Unfortunately this code snippet also included a <META> tag to telling search engines not to index the web page.
  • A client learned that its logo had not changed for over 4 years and it was also an uncompressed BMP file even though the logo had a JPEG extension. They change the logo to be a proper image format for web and increased the logo’s Expires time.
  • A client discovered that no one had every requested their iPhone application image. They removed the <LINK> tag and reduced the size of all of their HTML pages.
  • A client discovered their favicon.ico file was consuming huge amounts of bandwidth. This is because it contained multiple versions of the same icon at different dimensions. Removing all but the 16 by 16 pixel version from the ICO file reduced file size by 97%.

Typically Log Access

If you have your own server access to the web server logs is usually unrestricted. However in most shared hosting environments you will not have direct access. Typical web server log options you have are:

  • The raw Apache, IIS, or NCSA log files in a directory outside of your web root that you can access using FTP or sFTP. This is the ideal case.
  • The raw Apache, IIS, or NCSA log files placed directly in your web root. While this provides you with the raw logs anyone on the internet can also access your log files. This is a security risk as log files can often contain sensitive data like credentials or “hidden” areas of your web application. Talk with your hosting provider about moving the location of the web logs.
  • An option through a web-based website administration system like cPANEL that lets you download the raw log file.
  • An option or interface in the web admin system that lets you view or download a specially formatted version of the logs.

If you cannot access the raw log files don’t panic. As long as the log file contains the follow information you will have all the data you need:

  • URL requested
  • Date and time request was received
  • The program or browser used to request the URL. This is called the User-Agent.
  • Status code of the response
  • Size of the body of the response

Another question to ask hosting providers is not only “what information is in the log file” but also “how much time does the log file cover?” You can imagine in large sharing hosting environments how log files can quickly go to hundreds of megabytes for potentially thousands of customers. Hosting providers often limit the log file in different ways including:

  • Record only a week of traffic and replace the log with a new empty file every week.
  • Limit the total size of the log file. Each new entry removes an entry from the start of the log
  • Provide a night copy of the log file for all the traffic of the site received that day. These copies are usually removed after a certain about of time.

If you do have a time window make sure grab a copy of the log file. Some interfaces like cPANEL offer a scheduling services that can email you the log file or place them in a special location that you can then download. You can schedule an FTP download or use wget or curl to download these log files.

Processing Log Files

Depending on how much log data you have, you might want to concatenate your log files together until you have a big enough sample. At Zoompf we suggest collecting a sample between 500,000-1,000,000 requests, or a week’s worth of web traffic, depending on which is larger. Programs like awstats are very helpful for processing and provide reports with your most popular and least popular files, largest files in terms of bandwidth, and other data already broken out. Directly processing the logs yourself always you to discovered more detailed data and not as hard as you would thing. Some basic regular expressions can make it very easy to gather metrics like “show all of the 304s, 404s, 500s, etc.”

Remember, examining your web logs is a key technique to discovering and solving performance problems with your web applications. Those pretty graphs from Google Analytics or other web analytics data is simply not good enough to detect performance issues and bottlenecks. You need access to the information about all the requests the web server is processing. Make sure you ask your hosting provider how you can access the raw web server log files. Find out how much web traffic data the logs contain and how you can easily collect this data so you can analysis. If your hosting provider does not provide this you should consider that a deal breaker and find another provider.

This entry was posted in Guide and tagged , . Bookmark the permalink.

Leave a Reply