December 14, 2009

Performance Questions to Ask Hosting Providers: Secure Website Access

(This is the third article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article, “What control do I have over the web server?” and the second article “What access do you provide to web server logs?” are also available.)

So far in this series we have talked a lot about questions to ask hosting providers to make sure you can configure your website for performance and access the raw traffic logs of your website to spot performance problems. All of this is moot of course if you cannot get content onto your website. That’s why this post of “Questions to ask a hosting provider” is all about:

“Can I Securely Communicate With My Website?”


It has happened to everyone. You are out at a coffee shop, a client site, or at a conference and you need to make changes to your website. Perhaps you need to upload a few new PHP files or some images. Perhaps you need to update your web server configuration to set up a new email address for an event. Perhaps you simply saw something cool and want to write a WordPress post. But can you do any of these things securely using a public network? This question is best answered with an analogy.

Imagine you are at a formal cocktail party. You drift from room to room, through a sea of lavishly dressed party goers and dine on mouth-watering morsels served on silver trays by waiters in white gloves. As you approach a side table of crystal champagne glasses you overhear bits and pieces of the conversations around you.

  • “We cannot wait. It should be a lovely vacation and it’s the perfect time for us to get away for a week.”
  • “That’s right, with the nanny! Walked right in on them! And he tried to say that she was only choking!”
  • “Chris starts there next spring, just like his father.”

Well-attended cocktail parties are loud and noisy. It’s almost impossible not to hear what everyone else is saying! Of course we are taught that to be polite we should ignore the conversations other people are having unless we are involved. You are on the honor system not to eavesdrop.

Public networks such as wireless networks are just like cocktail parties. Your wireless card is like a party guest. It broadcasts out to the room when it “speaks” and “listens” to everyone within range to hear a response. Like a real party guest, wireless cards are supposed to ignore any conversations they overhear that are not meant for them. They do this by dropping the data and not bubbling it up to the computer. However, nothing forces network devices to ignore data they receive that is not meant for them. In fact, all networking devices (not just wireless devices) can be placed into “Promiscuous Mode” where any data they receive, even data that is not addressed to them, is received and bubbled up to the computer to process. This allows any networking device to become a giant listening device that hears and records all the information on the network! Promiscuous mode is not some evil hacker trick. It’s a fully intended feature of networking devices that has many legitimate uses.

(Figure: Diagram showing how clients in a wireless network hear each other’s traffic)

But wait! I use Encryption!

“The conference wireless network or the coffee shop’s wireless network is encrypted. They tell me they use something called WPA2 with a key of a million bits! I’m secure, right?”

No, you are not secure.

Let’s go back to the cocktail party analogy. The hosts don’t want just anyone coming into their party and drinking all their fine wines. So they place a bouncer at the door of the party. Only people that know the password are allowed into the party. If you know the password you get into the party and can listen to all the other guests. If you do not know the password you remain outside the building and cannot hear anything that is going on inside.

Encrypted wireless networks are just cocktail parties with bouncers. You need the “password” to join the wireless network. Once you are connected you can listen to everyone else’s traffic just like before because on the network everyone is using the same password to transmit and receive their data. (This is the only scalable solution. Otherwise the wireless network administrator would have to create a new, unique password for each and every person that joins the network). In other words, an encrypted network uses the password solely to protect and restrict “access” to the network. It does nothing to protect the users of the network from themselves or from each other.

The Danger of Sniffing (packets)

So what! Who cares if someone can listen to my network traffic? It’s not a big deal. After all, they will just see the blog content I was about to post anyway. Unfortunately this is not true. Using any system that requires a username and a password on a wireless network? You may have shouted that username and password to the entire cocktail party. And chances are you use that same username and password somewhere else on the Internet. Like your bank. Or an online store. Are you already logged into a system like Gmail or your WordPress administration panel? You are shouting your HTTP cookies to the entire cocktail party. Someone can steal your HTTP session cookies and use session hijacking to access Gmail or WordPress as if they were you without needing your username and password. Next thing you know you are on The Wall Of Sheep!

Secure Communications With Your Website

Remember: network encryption protects networks and application encryption protects applications! You need to make sure you are using encrypted application protocols to properly protect yourself. What protocols you use and how you use them will vary with different use cases.

Uploading Content

How do you upload content to your website? If the answer is FTP you are in trouble. FTP sends usernames and passwords in the clear. You need an encrypted file transfer mechanism like SFTP or SCP. If you have shell access to your web server using SSH you also have the ability to use either SFTP or SCP, as both run on top of SSH. By default most hosting companies provide an insecure file transfer system like FTP. Ask if they provide (for free) a secure file transfer system like SFTP or SCP. Make sure they understand you don’t need full SSH functionality and are only interested in secure file transfer. If this is not available you might need to upgrade your account or purchase an add-on to get SSH access for your website.

Writing Content

Do you use a web interface to write content for your blog platform or CMS? Does it use SSL? Check the address bar. Does it start with https? If not, you are not using SSL. Do you write your content using other software? Does that software directly publish the content to your blog using a web API like RSD or XML-RPC? Does that use SSL? Check the settings and see if you are using “https” to access the API interface. If you are not using SSL to communicate with these web resources then anyone can capture your username and password or cookies (which are just as good as your username and password).
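The "does it start with https?" check is easy to automate for every endpoint you publish through. Here is a minimal Python sketch. The list of schemes treated as encrypted and the example URLs are my own assumptions for illustration, and the check only looks at the URL scheme; it does not verify certificates or what the server actually does.

```python
from urllib.parse import urlsplit

# Schemes that encrypt credentials in transit. This set is an assumption for
# illustration; extend it to match the tools you actually use.
ENCRYPTED_SCHEMES = {"https", "sftp", "scp", "ftps"}


def is_encrypted(url):
    """Return True if the URL's scheme is one we treat as encrypted."""
    return urlsplit(url).scheme.lower() in ENCRYPTED_SCHEMES


def insecure_endpoints(urls):
    """Return the URLs that would send credentials or cookies in the clear."""
    return [url for url in urls if not is_encrypted(url)]


if __name__ == "__main__":
    # Hypothetical endpoints: an admin page, a publishing API, an upload target.
    endpoints = [
        "https://example.com/wp-admin/",
        "http://example.com/xmlrpc.php",
        "ftp://example.com/public_html/",
    ]
    for url in insecure_endpoints(endpoints):
        print("NOT ENCRYPTED:", url)
```

Run it against the URLs configured in your blog editor, FTP client, and bookmarks before you head to the next conference.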

Website Administration

How do you administer your website? Do you use a web interface like cPanel? These web administration interfaces are most common in shared hosting environments and typically run on a different hostname or an odd port number. Ask the hosting provider if they offer SSL access to the interface. Hosting providers often get confused and think you want to create an SSL certificate for your website. While this would secure a CMS you configure like WordPress (see previous use case) it does not help you secure the web administration interface because that is often running on a separate system. Make sure they understand you want secure access to their interface, not your website. This discussion may take several emails back and forth but most hosting providers are willing to supply SSL access to cPanel or other administration interfaces.

Summary

In conclusion, the questions about secure communications you should ask your hosting provider are:

  • “Do you provide a secure file transfer mechanism like SFTP or SCP? Is it provided for free or is it extra? If you don’t, do you offer SSH access to the web server? Is it free?”
  • “If you provide a web-based website administration interface like cPanel do you provide access to it using SSL?”
  • “Do you provide an SSL certificate for my CMS? What is the cost?”

How to judge their answers will vary from person to person based on need. Personally, I consider a secure file transfer mechanism a requirement. Too many times have I needed to upload a presentation, PDF, or file to my website from a public network at a conference or client site. If you are a heavy blogger, secure access to your content management system is going to be critical. After all, it is difficult to write a blog post about an event from the event if you cannot securely access your blog to write the post!

December 7, 2009

The Challenge of Dynamically Generating Static Content


Time and time again I see people using PHP or some other application logic to try and hack around some issue they are facing. We saw this in our previous post, Questions to Ask Hosting Providers: Web Server Configuration, where people would use PHP to emulate mod_deflate or mod_expires. Andrew King, in his book Website Optimization, talks about wrapping developer comments in CSS or JavaScript files in <?php ?> tags and using the PHP interpreter to remove them. People use PHP to combine CSS or JavaScript resources together. And today I read an article from the always awesome Chris Coyier over at css-tricks.com about using PHP to emulate CSS variables.

Don’t get me wrong. I was actually bemoaning the lack of variables in CSS two days before Chris wrote his article. (Actually, what we really want is more like C/C++ macros but that’s another story). Anyone who has tried to implement CSS sprites, change margins or element sizes, or modify color values knows what a pain it is to go through a CSS file and type the same thing over and over.

Using PHP to solve this problem, or any of the other problems listed above, makes perfect sense at first. Because it makes things easy. Because you are all being lazy. You are using a runtime mechanism to try and simplify your life.

Stop Being Lazy!

Now, under normal circumstances programmers should be lazy! After all your very job is to create something that does work for you! Unfortunately in this case your laziness is harming the performance of your application. Using application logic to dynamically generate static content at runtime is a massively bad idea. Consider these 4 consequences:

  • You take an order of magnitude performance hit for invoking the application tier instead of just serving a flat static file from the file system.
  • Since the web server is not serving a static file, there will be no Last-Modified header sent by default. That means no conditional GETs and no 304 responses which means lots of bandwidth consumption.
  • PHP, like virtually all application tiers, produces a chunked response. This is because the web server has no idea what the content length will be because it is dynamically generated. Dynamically generated chunked responses will not send the Accept-Ranges header. This means no pausing, resuming, or error recovery. The entire resource must be re-downloaded.
  • Chunked encoding is not supported with HTTP/1.0, so any HTTP/1.0 device (like every caching proxy ever made) has to flip into “store and forward” mode where it downloads the entire response before passing it along.

And as if all these downsides of invoking the application tier were not enough, we have my personal favorite: Web Security! As someone who professionally broke into computer systems for many years, when I see:

http://example.com/combine.php?files=a.js|b.js|c.js

I get very excited. Think about what a resource combiner script does. “Hey website, I’m going to give you a list of files on your hard drive, and I want you to read them off the disk, one at a time, and dump their raw contents into a response and send it to me!” Jackpot baby! This is what we call a Local File Inclusion vulnerability just waiting to happen. The developer has not so much created a resource combiner as they have provided me with a rudimentary remote file download service! I immediately do something like this:

http://example.com/combine.php?files=db.inc

In about 45 seconds I have downloaded the /etc/passwd file, your httpd.conf, your .htaccess, your raw MySQL database, your app config files filled with user names, passwords, and database connection strings, and each PHP file to retrieve all your source code. Or worse, I perform remote file inclusion, thereby injecting a PHP shell, which allows me to completely take over your website! (BTW: Roughly one in every 3 PHP resource combiner scripts I have seen contains these security vulnerabilities. Beware where you get your source code!)
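To make the flaw concrete, here is a rough Python sketch of the check a combiner would need before it touches the disk. The function name, the base directory, and the list of allowed extensions are hypothetical choices of mine, not taken from any particular combiner script, and this is one possible defense rather than a complete one.

```python
from pathlib import Path

# Only these file types may ever be combined (an assumption for this sketch).
ALLOWED_SUFFIXES = {".js", ".css"}


def resolve_requested_file(base_dir, requested):
    """Map a user-supplied file name to a path inside base_dir, or raise ValueError."""
    base = Path(base_dir).resolve()
    # resolve() collapses "../" segments; an absolute "requested" replaces base entirely.
    candidate = (base / requested).resolve()
    if base not in candidate.parents:
        raise ValueError("path escapes the static directory: " + requested)
    if candidate.suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError("file type not allowed: " + requested)
    return candidate


if __name__ == "__main__":
    for name in ["a.js", "db.inc", "../../etc/passwd", "/etc/passwd"]:
        try:
            print("OK      ", resolve_requested_file("/var/www/static", name))
        except ValueError as err:
            print("REJECTED", err)
```

Even with a check like this in place, the combiner still runs on every request. Getting the validation right only fixes the security hole, not the performance problems listed above.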

The Fundamental Problem

The fundamental problem in all of these examples is that developers are getting lazy and are using PHP code to do something at runtime that should have been done earlier.

Properly Generating Static Content

Great! So what is a web developer to do? Go back to the dark ages where you cannot leverage all that great application logic in the generation of your content? I want my CSS variables and I want them now! Notice I never said you cannot dynamically generate static content! I just said you should not dynamically generate static content at runtime! Want CSS variables? Want to use a PHP script to combine resources or minify or whatever? Go ahead and do it! Just do it ahead of time. You can run your PHP script from the command line, produce your CSS file, complete with all the correct CDN paths and color values, and upload that to your website. And this isn’t just for PHP. Use Perl, Python, Ruby, Java, or whatever. You can even do it in QBASIC!

'CSSGEN.BAS - kicking it old school
CDN$ = "http://zoompf.com/"
LOGO$ = "includes/logo.png"
PRINT ".logo {"
PRINT " background: url("; CDN$ + LOGO$ + ");"
PRINT "}"

And the output:

(Screenshot: qbasic-css-gen)

(That’s right. I totally just used QBasic 1.1 from DOS 5.0 to automate publishing a web application on 64-bit Vista. Oh yeah!)

The moral of the story is never make the user pay for your laziness. Do not use the application tier of a website to dynamically generate static content at runtime. Instead do it at publishing time or even do it in a daily or hourly cron job. This approach allows you all the advantages of using application logic without drastically reducing the very web performance you were trying to improve in the first place!
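For those without a DOS prompt handy, the same publish-time idea might look like this in Python. Treat it as a sketch: the variable names, the extra color rule, and the template layout are made up for illustration. Only the CDN and logo paths come from the QBASIC example.

```python
from string import Template

# "CSS variables" resolved once, at publishing time. The brand_color value is
# a hypothetical addition; cdn and logo mirror the QBASIC example.
VARIABLES = {
    "cdn": "http://zoompf.com/",
    "logo": "includes/logo.png",
    "brand_color": "#336699",
}

# string.Template only treats "$" specially, so CSS braces need no escaping.
CSS_TEMPLATE = Template(
    ".logo {\n"
    "  background: url(${cdn}${logo});\n"
    "}\n"
    "a {\n"
    "  color: ${brand_color};\n"
    "}\n"
)


def render_css(variables):
    """Return the finished, fully static CSS text."""
    return CSS_TEMPLATE.substitute(variables)


if __name__ == "__main__":
    # Redirect stdout to a file when publishing, then upload that file.
    print(render_css(VARIABLES), end="")
```

Run it as part of your publishing step or from a cron job, save the output as a plain .css file, and let the web server serve it like any other static file, Last-Modified header and all.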

November 25, 2009

Performance Questions to Ask Hosting Providers: Log File Access

(This is the second article in a series of articles about performance questions you should ask when choosing a hosting provider. The first article in the series is here)

Last time we covered the most important question you should ask a hosting provider: What control do I have over the web server? This time we will be showcasing another important question to ask a hosting provider:

“What Access Do You Provide to Web Server Logs?”


The main reason you want access to log files from the web server is to learn how visitors are accessing your content. This will reveal a wealth of knowledge about the raw traffic patterns of your web application and expose various performance issues and limitations. Often these performance issues will not be detected by page-based performance tools like Yahoo’s YSlow or Google’s Page Speed.

Web server logs come in many different formats. Usually they are large text files where every request is logged on its own line. Several pieces of data about each request are logged in different fields on the line, separated by a delimiter such as a space or a comma. Typically the information logged for each request is:

  • URL requested.
  • IP address of the visitor.
  • Date and time request was received.
  • The program or browser used to request the URL. This is called the User-Agent.
  • The Referring webpage (if any).
  • HTTP version used to request the page.
  • Status code of the response.
  • Size of the body of the response.
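As a rough illustration, here is how one such line could be pulled apart in Python. The sketch assumes the widely used Apache "combined" log layout and a made-up sample request; your provider's format may differ, so adjust the pattern to match what you actually receive.

```python
import re

# Pattern for the Apache "combined" format (an assumption about your logs).
# The referrer and User-Agent fields are optional so "common" format lines match too.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A "-" size means no body was sent (for example on a 304 response).
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


if __name__ == "__main__":
    # Made-up sample line for illustration.
    sample = (
        '203.0.113.7 - - [25/Nov/2009:10:15:32 -0500] '
        '"GET /images/logo.png HTTP/1.1" 200 5120 '
        '"http://example.com/" "Mozilla/5.0"'
    )
    print(parse_line(sample))
```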

Log files are a very granular view of your web traffic. Sometimes it can be difficult to see the forest for the trees. For example, what pages did user XYZ visit, in what order, and how long did the user stay on each page? It is usually very difficult to get this information from logs alone because web server logs only track users by a specific IP address. To provide a larger view and answer questions like those listed above, web developers use web analytics packages like Omniture, Hitbox, or Google Analytics. Web analytics packages use cookies and JavaScript to gather detailed information about your visitors, the capabilities of their browsers, and their actions through your web application. Web analytics packages are simple to add to a website. Typically all that is involved is inserting a block of JavaScript at the end of each HTML page. This is very easy to do on templated or dynamically generated websites. So if web analytics provides you with a “bigger picture” and richer data than web server logs, that raises the question:

Are Web Analytics Reports Good Enough?

Actually, no. Web analytics reports are not good enough. Web analytics data abstracts away the raw traffic of your web application, which can hide several important problems. Web analytics packages only track visitor requests and activity for HTML files that are served with a 200 status code. Out of the box, here are things that most web analytics packages do not track:

  • All requests to non-HTML resources.
    • Images
    • JavaScript
    • Style sheets
    • Feeds (RSS, Atom)
    • RIA files (HTC, Flash, Silverlight, Java, etc)
    • Access files (robots.txt, sitemaps.xml, crossdomain.xml, etc)
    • Documents (PDF, Office docs, Zip files)
    • Other resources (Fonts, Cursors)
  • Most error pages (404, 5xx, etc).
  • Conditional requests that return “304 Not Modified.”
  • Requests from non browser User-Agents (spiders, mash-ups, etc).
  • Users who have JavaScript disabled for accessibility or (more commonly) security reasons.

This valuable information is completely missed if you are only using web analytics data to understand your traffic. Consider the valuable questions you can answer with web server logs:

  • Where are your redirects? Which can be removed to decrease page load time?
  • What web resources are using the most bandwidth? This is calculated by simply adding up all the body sizes that are returned for a resource and sorting. Can you reduce the size of these files somehow using compression, minification, or by removing meta data?
  • What are the most requested resources on your website? Can you use caching or other methods to minimize the number of times the resource is requested? If you cannot cache those files because they are dynamically generated, can you add programming logic to use a Last-Modified header to reduce bandwidth? Can you remove resources like external JavaScript or CSS files that are referenced but not actually used by that web page?
  • How often does your static content actually change? This is calculated by counting the 304s for a resource. Perhaps you can use a longer Expires time on your content.
  • How often are search engine crawlers visiting your site? Are there crawlers that are missing? You should submit your website to their indexes.
  • Are crawlers finding your high value content? Perhaps you should add a sitemap or modify the one you have.
  • Do crawlers request a large amount of low value content? Do crawlers “get stuck” on part of your website? Perhaps you need to fix your robots.txt.
  • Which web resources are you still getting a lot of requests for that no longer exist? You should use a redirect that points to the correct content.
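Two of the calculations above (bandwidth per resource and the number of 304s per resource) take only a few lines of Python. This is a minimal sketch that assumes Apache-style request lines; the sample entries and the function name are invented for illustration.

```python
import re
from collections import Counter

# Pulls the URL, status code, and body size out of an Apache-style log line.
REQUEST = re.compile(r'"\S+ (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+|-)')


def summarize(lines):
    """Return (bytes_by_url, hits_by_url, not_modified_by_url) as Counters."""
    bytes_by_url = Counter()
    hits_by_url = Counter()
    not_modified_by_url = Counter()
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        url = match["url"]
        hits_by_url[url] += 1
        if match["status"] == "304":
            not_modified_by_url[url] += 1
        if match["size"] != "-":
            bytes_by_url[url] += int(match["size"])
    return bytes_by_url, hits_by_url, not_modified_by_url


if __name__ == "__main__":
    # Invented sample entries; in practice, pass an open log file instead.
    sample = [
        '203.0.113.7 - - [25/Nov/2009:10:15:32 -0500] "GET /robots.txt HTTP/1.1" 200 256000',
        '203.0.113.8 - - [25/Nov/2009:10:16:01 -0500] "GET /logo.png HTTP/1.1" 304 -',
        '203.0.113.9 - - [25/Nov/2009:10:16:05 -0500] "GET /logo.png HTTP/1.1" 200 5120',
        '203.0.113.7 - - [25/Nov/2009:11:15:32 -0500] "GET /robots.txt HTTP/1.1" 200 256000',
    ]
    bandwidth, hits, not_modified = summarize(sample)
    for url, total in bandwidth.most_common():
        print(url, total, "bytes,", hits[url], "hits,", not_modified[url], "x 304")
```

Sorting by total bytes puts the bandwidth hogs at the top, and a resource with a high share of 304s is a good candidate for a longer Expires time.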

Real Life Examples

At Zoompf we have detected and solved numerous performance issues just by examining a client’s web server log files. Here are some of our more interesting stories:

  • A social networking client used their robots.txt file to prevent crawlers from indexing content from new, untrusted members that could be content spammers. As such their robots.txt was 250 kilobytes! Because the file was updated so often, all of the search crawlers would request it multiple times a day. These factors resulted in the client using 5 gigs of bandwidth a month just to serve its robots.txt file! Turning on HTTP compression for text files reduced this by 70%. The client is currently implementing the use of <META> tags and rel="nofollow" attributes to limit search engine indexing for the web pages of untrusted users. This will result in even higher performance savings.
  • A software client found that by far their most popular pages were for their product documentation. These pages were constructed using a PHP templating system. However these files never changed once that version of the product shipped. The client moved to pre-rendering the web pages to static HTML files and using a far future Expires header on the HTML files. This drastically improved performance and reduced bandwidth consumption and server load.
  • An ecommerce client discovered crawlers were requesting all the available colors for each item they sold. For example, the crawlers would visit /item1/, /item1/color/red/, /item1/color/blue/, and so on. By creating a robots.txt rule to prevent crawlers from requesting every color for every item the client reduced their bandwidth by nearly 80% while still having their important content indexed.
  • A client discovered that their most important content was not getting indexed. A developer had copied a code snippet from the Internet into the top of their template to solve a CSS problem they were having. Unfortunately this code snippet also included a <META> tag telling search engines not to index the web page.
  • A client learned that its logo had not changed for over 4 years and that it was an uncompressed BMP file even though it had a JPEG extension. They changed the logo to a proper image format for the web and increased the logo’s Expires time.
  • A client discovered that no one had ever requested their iPhone application image. They removed the <LINK> tag and reduced the size of all of their HTML pages.
  • A client discovered their favicon.ico file was consuming huge amounts of bandwidth. This is because it contained multiple versions of the same icon at different dimensions. Removing all but the 16 by 16 pixel version from the ICO file reduced file size by 97%.

Typical Log Access

If you have your own server, access to the web server logs is usually unrestricted. However, in most shared hosting environments you will not have direct access. The typical web server log options you have are:

  • The raw Apache, IIS, or NCSA log files in a directory outside of your web root that you can access using FTP or SFTP. This is the ideal case.
  • The raw Apache, IIS, or NCSA log files placed directly in your web root. While this provides you with the raw logs, anyone on the Internet can also access your log files. This is a security risk as log files can often contain sensitive data like credentials or “hidden” areas of your web application. Talk with your hosting provider about moving the location of the web logs.
  • An option through a web-based website administration system like cPanel that lets you download the raw log file.
  • An option or interface in the web admin system that lets you view or download a specially formatted version of the logs.

If you cannot access the raw log files don’t panic. As long as the log file contains the following information you will have all the data you need:

  • URL requested
  • Date and time request was received
  • The program or browser used to request the URL. This is called the User-Agent.
  • Status code of the response
  • Size of the body of the response

Another question to ask hosting providers is not only “what information is in the log file” but also “how much time does the log file cover?” You can imagine how, in large shared hosting environments, log files can quickly grow to hundreds of megabytes for potentially thousands of customers. Hosting providers often limit the log file in different ways including:

  • Record only a week of traffic and replace the log with a new empty file every week.
  • Limit the total size of the log file. Each new entry removes an entry from the start of the log.
  • Provide a nightly copy of the log file for all the traffic the site received that day. These copies are usually removed after a certain amount of time.

If you do have a time window, make sure you grab a copy of the log file before it is replaced. Some interfaces like cPanel offer a scheduling service that can email you the log files or place them in a special location that you can then download. You can schedule an FTP download or use wget or curl to download these log files.

Processing Log Files

Depending on how much log data you have, you might want to concatenate your log files together until you have a big enough sample. At Zoompf we suggest collecting a sample of 500,000 to 1,000,000 requests, or a week’s worth of web traffic, whichever is larger. Programs like awstats are very helpful for processing and provide reports with your most popular and least popular files, largest files in terms of bandwidth, and other data already broken out. Directly processing the logs yourself allows you to discover more detailed data and is not as hard as you would think. Some basic regular expressions can make it very easy to gather metrics like “show all of the 304s, 404s, 500s, etc.”

Remember, examining your web logs is a key technique for discovering and solving performance problems with your web applications. Those pretty graphs from Google Analytics or other web analytics packages are simply not good enough to detect performance issues and bottlenecks. You need access to the information about all the requests the web server is processing. Make sure you ask your hosting provider how you can access the raw web server log files. Find out how much web traffic data the logs contain and how you can easily collect this data so you can analyze it. If your hosting provider does not provide this you should consider that a deal breaker and find another provider.

November 17, 2009

Performance Questions to Ask Hosting Providers: Web Server Configuration

Hosting a web application can be annoying and time consuming. There is the cost of the hardware. There is the time configuring, administering, and patching the operating system, web server, and other software. There is the security risk of exposing a machine onto the Internet. So it’s no surprise that many people and companies use a 3rd party hosting provider to host their web application and manage the infrastructure. Choosing a hosting provider is not a decision to be made lightly. You no longer have full control over the machine running your web application. If you are interested in creating high performance web applications, you must ensure that you don’t give up control over the features that you need to make your web application run as fast as possible.

This is the first in a series of articles about performance questions you should ask a hosting provider. While hosting providers do offer dedicated hosting (where your application runs on a single machine all by itself), the vast majority of people choose shared hosting environments. While we will be referencing hosting services that use the Apache web server, all of the advice in this series is applicable to Windows hosting as well.

Without a doubt the first and most important question you should ask a hosting provider is:

“What Control Do I Have Over Web Server Configuration?”


This question is critical. Many of the easiest and most impactful performance improvements you can make to your web application, such as HTTP compression and caching, are configured at the web server level. You should start off by asking what modules are installed already. The Apache modules most relevant to performance are:

  • mod_deflate (for Apache 2) or mod_gzip (for Apache 1) – This module enables HTTP compression.
  • mod_expires – This module enables HTTP caching.
  • mod_rewrite – This module enables on-the-fly URL rewriting which is very helpful when maintaining and updating resources while using far future caching.

All of these modules are installed with the typical default installation of Apache. While this depends on the platform and the distribution they are almost always present by default. Sometimes web hosting companies will compile their own version of Apache from source to maximize performance for their particular server machines. Often they will remove modules to save space and time. If you find a hosting provider like this explain to them that you would like these modules installed. Tell them this is a reasonable request as these modules are part of the default installation of Apache. You should be able to convince them to turn these modules on for you. If not, this is a deal breaker and you should not use that hosting provider. The vast majority of hosting providers offer these modules even at the lowest pricing tiers.

Even if the hosting provider offers these modules, you should ask them for a list of all available modules as well as what their policy is for enabling new modules. While mod_deflate, mod_expires, and mod_rewrite are the most helpful modules from a performance point of view, there might be other modules, such as mod_cband or mod_bw, that you might want to use for performance reasons.

Once you know what you can configure on the web server your next question should be “how do I configure it?” In most shared hosting environments you will not have access to the main Apache configuration file httpd.conf but usually can control the web server through the use of .htaccess files. This is the best solution since it allows you to directly configure the web server. You simply edit the .htaccess file in the root directory for your web application and upload it to the hosting provider.
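Once you have uploaded an .htaccess file, it is worth confirming that the settings actually took effect. One way is to look at the response headers for a static resource. The Python sketch below is only an illustration: the function name and the wording of the findings are mine, it expects you to supply the headers from whatever tool you use to fetch the resource, and compression findings only make sense for text resources like CSS or JavaScript.

```python
def audit_static_response(headers):
    """Return a list of findings for one static resource's response headers."""
    # Header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value.lower() for name, value in headers.items()}
    findings = []

    encoding = normalized.get("content-encoding", "")
    if "gzip" not in encoding and "deflate" not in encoding:
        findings.append("no HTTP compression (is mod_deflate or mod_gzip enabled?)")

    cache_control = normalized.get("cache-control", "")
    if "expires" not in normalized and "max-age" not in cache_control:
        findings.append("no caching headers (is mod_expires enabled?)")

    return findings


if __name__ == "__main__":
    # Hypothetical headers returned for a style sheet requested with
    # "Accept-Encoding: gzip".
    response_headers = {
        "Content-Type": "text/css",
        "Last-Modified": "Tue, 17 Nov 2009 10:00:00 GMT",
    }
    for finding in audit_static_response(response_headers):
        print("WARNING:", finding)
```

If both warnings show up after you have added the rules to your .htaccess file, that is a good moment to go back to the hosting provider and ask which modules are really installed.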

Some hosting providers supply you with a web interface to control web server configuration, typically through a web administration system like cPanel. If this is the case ask to see examples of the interface. It could be simply a web form that allows you to edit a raw .htaccess file. It could be a more structured web interface with check boxes to turn on modules or forms to add new rules. Be very wary of any type of web-based server configuration. The interface will limit what you are able to configure. If a web interface is available ask if you can still manually upload your own .htaccess file to control the web server. If you cannot do this your ability to configure the web server will be severely limited. If the web interface does not provide the functionality you need you should not use that hosting provider. In general you should not use hosting providers that only offer web-based server configuration.

Bad Idea: Hacking Around Limits

Some developers like to point out that you can use server side application logic to compress content or implement caching for static resources like images or JavaScript files or CSS files. This means you don’t have to have access to the web server to configure things like HTTP compression or caching. Unfortunately this actually hurts performance more than it helps! With this method, PHP (or some other application logic layer) is invoked for all requests. Remember that the vast majority of requests are for static content and do not hit the application layer. The overhead of invoking PHP dozens if not hundreds of times for a page load removes any performance benefit of compressing or caching. We will explore this method more in a future post. For now you should completely avoid it. Never use application code to hack around the blatant shortcomings of a hosting provider.

Remember when choosing a hosting provider the single most important performance question you can ask is “how do I configure the web server?” In our next post we will explore more performance questions you should ask when choosing your hosting provider.