Sunday, December 7, 2014

Stupid Simple Website Metrics

I wrote a tool that I am open-sourcing to keep track of website server performance. It is very un-fancy. It tries to do one job and do it well: Log and display page load times. (Wait, is that two jobs?)


Over the past several months, I've spent way more time than I had anticipated stressing over, testing, and trying to fix the abysmal performance I had on Anibit. I was using a low-end hosting service intended for personal websites, which I had expected to be on the slower side of things, but it got so bad that page loads could take up to 45 seconds. That was even at times when I was the only user, other than the failed spam-bot login attempts once a minute or so. I would have been satisfied with 6-second loads, even though that is generally considered a poor user experience. Something had to give. Anibit was on a shared server with probably dozens of other websites (as far as I'm aware, my hosting service doesn't reveal how many websites share your host). Any of them could misbehave and bring the server to its knees until the hosting service brought the hammer down on them, which it frequently needed to do.

I don't want to bash the hosting provider, especially since they did make efforts to address my concerns. I often wondered if I cost them more in tech support than the meager fee I was paying for the service. I also don't have experience with different ISPs for comparison.

In the midst of my anguish over Anibit's performance, I researched a lot of metrics tools and ideas for being smarter about keeping track of performance history. Anibit's performance could vary by an order of magnitude over the course of a week, but I had no idea when the problems would first show up. I wanted a tool that could track that for me, and I wanted it for free. I could not find anything, so I decided to write something.

Enter "Stupid Simple Website Metrics"


The Stupid Simple Website Metrics tool (from now on shortened to SSWM) is made up of two parts: the metrics collector and the results viewer. Both are implemented in Python 3.4. They may work in other Python versions, but you probably need Python 3.x at the very least. The metrics collection application is meant to be run from a command line, and uses a JSON-based configuration file to control its operation. The results viewer is a Python script that implements a web server you can point your browser to in order to see the historic metrics data.
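To give a feel for what a JSON-based configuration could look like, here is a minimal sketch. The key names ("urls" and "database"), the file name, and the `load_config` helper are all made up for illustration and are not taken from the actual SSWM configuration format; check the project documentation for the real one.

```python
import json

# Hypothetical configuration: these key names are illustrative only and
# are not taken from the actual SSWM configuration format.
EXAMPLE_CONFIG = """
{
    "database": "metrics.sqlite",
    "urls": [
        "http://example.com/",
        "http://example.com/blog"
    ]
}
"""


def load_config(text):
    """Parse JSON text and check that it holds a usable list of URLs."""
    config = json.loads(text)
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    if not isinstance(config.get("urls"), list) or not config["urls"]:
        raise ValueError("config must contain a non-empty 'urls' list")
    # Fall back to a default database file name if none is given.
    config.setdefault("database", "metrics.sqlite")
    return config


if __name__ == "__main__":
    config = load_config(EXAMPLE_CONFIG)
    for url in config["urls"]:
        print("would test:", url)
```

Keeping the URL list in a plain JSON file means you can add or remove pages to watch without touching any Python code.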

The development was driven by the following goals: 1) Keep it simple (to develop and use). 2) Work without requiring any system-wide software or Python modules to be installed. It all works with standard Python 3.4 libraries. The only software you need to have installed is curl. On Windows, I just put a curl executable in the same location as the script. Curl comes standard with most Linux distributions.

The metrics collector loads a list of URLs from the configuration file, and then launches curl to fetch the initial HTML for each URL. It does not, at least for now, load any file other than the root document. This means its metrics do not test the timing of the complete browser experience, just the amount of time spent loading the initial HTML. This is still important information; in my case, that was where the bottleneck was most prominent. Curl is launched with a parameter to log page load timings. The metrics application then parses this log to determine page loading benchmarks, which are then saved to a SQLite database. It does this for each URL in the configuration.
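The collect-and-store loop can be sketched with nothing but the standard library and curl. Note that this sketch takes a shortcut: instead of parsing a timing log the way SSWM does, it asks curl to print its own `time_starttransfer` and `time_total` values through the `--write-out` option. The table layout and all function names here are assumptions for illustration, not the actual SSWM code.

```python
import os
import sqlite3
import subprocess
import time

# curl --write-out variables: seconds until the first byte of the
# response arrived, and seconds for the whole transfer.
WRITE_OUT = "%{time_starttransfer} %{time_total}"


def build_curl_command(url):
    """Return the curl command line; the page body itself is discarded."""
    return ["curl", "--silent", "--output", os.devnull,
            "--write-out", WRITE_OUT, url]


def parse_timings(output):
    """Turn curl's '<first byte> <total>' output into two floats."""
    first_byte, total = output.split()
    return float(first_byte), float(total)


def measure(url):
    """Run curl once and return (first_byte, total) in seconds."""
    raw = subprocess.check_output(build_curl_command(url))
    return parse_timings(raw.decode("ascii"))


def save_result(conn, url, first_byte, total, when=None):
    """Append one measurement to a hypothetical 'metrics' table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics ("
        "url TEXT, measured_at REAL, first_byte REAL, total REAL)")
    conn.execute(
        "INSERT INTO metrics VALUES (?, ?, ?, ?)",
        (url, time.time() if when is None else when, first_byte, total))
    conn.commit()


def collect(urls, db_path):
    """Measure every URL once and store the results."""
    conn = sqlite3.connect(db_path)
    try:
        for url in urls:
            first_byte, total = measure(url)
            save_result(conn, url, first_byte, total)
    finally:
        conn.close()
```

Calling `collect(["http://example.com/"], "metrics.sqlite")` from a scheduled job would build up a history over time. The parsing and storage steps are kept separate from the curl call so they can be exercised without a network connection.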

The metrics viewer is a simple Python-based web server built on the Bottle framework. (One day, especially if someone asks for it, and even more especially if someone hires me to do it, I'll add support for running under a traditional web server.) The viewer page is a simple filter, chart, and table of metrics results. The chart is drawn in JavaScript using the Flot jQuery charting library. You can launch the web server from the command line, from initialization scripts, or as a Windows system service.
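As a rough idea of the data-preparation step behind a viewer like this, the sketch below filters stored results by URL and time range and shapes them into the series format Flot accepts. It leaves out the web server part entirely. The `metrics` table layout and the function names are assumptions for illustration; the real SSWM viewer may be organized quite differently.

```python
import json
import sqlite3


def fetch_series(conn, url, start, end):
    """Return [[js_timestamp_ms, seconds], ...] for one URL and time range."""
    rows = conn.execute(
        "SELECT measured_at, total FROM metrics "
        "WHERE url = ? AND measured_at BETWEEN ? AND ? "
        "ORDER BY measured_at",
        (url, start, end))
    # Flot's time axis expects JavaScript timestamps (milliseconds).
    return [[int(measured_at * 1000), total] for measured_at, total in rows]


def to_flot_json(label, points):
    """Wrap the points in the list of {label, data} objects Flot plots."""
    return json.dumps([{"label": label, "data": points}])


if __name__ == "__main__":
    # Tiny in-memory demo with made-up numbers.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE metrics ("
                 "url TEXT, measured_at REAL, first_byte REAL, total REAL)")
    conn.executemany(
        "INSERT INTO metrics VALUES (?, ?, ?, ?)",
        [("http://example.com/", 1417910400.0, 0.21, 0.48),
         ("http://example.com/", 1417914000.0, 0.95, 1.62)])
    points = fetch_series(conn, "http://example.com/", 0, 2e9)
    print(to_flot_json("total load time", points))
```

The resulting JSON can be handed to the page and plotted with Flot's x-axis set to time mode.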

The metrics

There are currently two metric points captured for each URL tested. In a nutshell, I wanted to know how long the server spent in CPU-intensive work preparing my content, and how much time it took in total to download. For static HTML pages, the first metric should be almost negligible, but for many PHP, Python, Perl, Ruby, Node, etc. based websites, there is a significant amount of processing and database I/O that the server must do to determine what HTML needs to be sent to the browser.

Drupal is known for being a powerful PHP web platform, but it is also a heavyweight in the processing power it needs. A single page on Anibit can require dozens and even hundreds of database queries. Anibit makes heavy use of multiple caching strategies to alleviate this: generated HTML is cached so that the same page does not have to be regenerated every time, and database queries are cached to memory, making frequent queries sometimes 1000 times faster! When I was having severe performance problems, it was the generation of content and the queries that were responsible for a large part of the slow performance.

You can tell how long a server was "thinking" about your web page in most desktop browsers. Chrome has a superb set of tools for timing and analyzing any webpage built into it (Firefox is great too). Let's take a look at Chrome's analysis of the Anibit home page:

That green bar, labeled "Waiting TTFB", represents "Time to First Byte". That means the time until the web server started sending HTML. Before then, it is "thinking". The SSWM tool measures something similar, though not quite the same thing. The metric measured is "TTFDB", "Time to First Data Byte", since it is possible that some servers may send _some_ bytes that are HTTP headers, and not the actual web page contents. The actual amount of time spent sending HTTP header information is typically very small, but counting it could lead to overly optimistic metrics; see this article from a real expert. SSWM tries to mitigate that by looking for actual HTTP data traffic timings in the curl log.
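To make the difference between the first header byte and the first data byte concrete, here is a small sketch that reads a timestamped trace and reports both. The sample trace is hand-written, modeled on what curl prints with `--trace-ascii` and `--trace-time`, and the function names are made up. It is not SSWM's actual parser, and measuring from the first logged event is my own assumption about where to start the clock.

```python
import re
from datetime import datetime, timedelta

# Event lines start with a timestamp, e.g.
# "12:00:00.310000 <= Recv data, 1256 bytes (0x4e8)".
TIMESTAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6}) (.*)$")

# Hand-written sample for illustration, not captured from a real run.
SAMPLE_TRACE = """\
12:00:00.000000 == Info: Connected to example.com (192.0.2.1) port 80
12:00:00.001000 => Send header, 75 bytes (0x4b)
0000: GET / HTTP/1.1
12:00:00.240000 <= Recv header, 17 bytes (0x11)
0000: HTTP/1.1 200 OK
12:00:00.310000 <= Recv data, 1256 bytes (0x4e8)
0000: <!doctype html>
"""


def parse_events(trace_text):
    """Return (timestamp, description) pairs for every timestamped line."""
    events = []
    for line in trace_text.splitlines():
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((stamp, match.group(2)))
    return events


def seconds_until(events, prefix):
    """Seconds from the first event to the first one starting with prefix."""
    if not events:
        return None
    start = events[0][0]
    for stamp, text in events:
        if text.startswith(prefix):
            delta = stamp - start
            if delta < timedelta(0):  # the trace crossed midnight
                delta += timedelta(days=1)
            return delta.total_seconds()
    return None


if __name__ == "__main__":
    events = parse_events(SAMPLE_TRACE)
    print("first header byte:", seconds_until(events, "<= Recv header"))
    print("first data byte:  ", seconds_until(events, "<= Recv data"))
```

In the sample, the headers arrive a little before the page content does, which is exactly the gap that makes a plain "first byte" number look better than it should.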

The second metric captured is the total loading time, represented by adding the blue bar from the Chrome metrics above.

What this tool does not capture is the total time to load and render all parts of any given page. It only instruments the root HTML of a given URL. This means it's not great for comparison to timings from other tools, but it is good for comparing your site's performance today to your site's performance two weeks ago.

You can check out the code and documentation on GitHub here. And don't forget that if it doesn't do what you want, I can add features for a little bit of coin. :)

