
Thursday, February 24, 2011

What Value is in My Web Access Log Files?

Over the last couple of months, I have been working on a prototype to show the value of the content in our log files. Usually, log files are only consulted during dire times when all else fails. Throughout the years, I have found that log files contain all sorts of valuable information that should be looked at frequently. The problem is that it just takes too much effort and time to go through them. This has been a passion of mine for over ten years now, and I have developed several tools to extract the valuable data from many types of log files. This post attempts to explain the value of the data in these files. While I have decided to focus on web log files, the same ideas apply to most other kinds of logs, such as application debug logs and admin logs.

There are commercial, free-to-use, and open-source products that can report on some of this data. I hesitate to even mention these other tools because they offer a different value to a different audience, but hopefully the comparison helps to frame the problem. The value areas these tools focus on force them to lose a lot of the value that exists in the log files. For example, they aggregate data at an hourly or daily level, and some of them require modifications to the actual web pages, which means they only record responses, and only for the pages that the web designers or developers remember to put the tracking code in.

In order to understand the value lost by these tools, it is important to first understand the kind of data that is already being logged by the web server. After that it will be easier to see not just the value of the individual fields, but also the value of linking this data together in ways that weren't thought possible before.

A line in a web access log file contains a lot of information about each request: the date/time/millisecond of the request, the IP address of the requester, the time taken to respond, the number of bytes returned, the HTTP method used, the path requested along with all query parameters, the page or URI responsible for the request (the referrer), the HTTP response code, the content type of the response, the user agent string, the character encoding, and more.
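
To make this concrete, here is a minimal sketch of what one such line might look like and how it could be pulled apart in Python. The sample line is made up, and the layout assumes an Apache-style combined log extended with the response time at the end, so the regex would need adjusting to match a server's actual LogFormat.

```python
import re

# One line of an Apache-style "combined" log, extended here with the response
# time in microseconds at the end. The exact layout depends on the server's
# LogFormat configuration, so treat this sample as illustrative only.
SAMPLE = ('203.0.113.7 - - [24/Feb/2011:13:05:42 -0500] '
          '"GET /products/search?q=widgets&opt_out=1 HTTP/1.1" 200 5123 '
          '"http://example.com/products" '
          '"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.24" 10432')

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
    r'(?: (?P<micros>\d+))?')

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

print(parse_line(SAMPLE))
```

The sketches later in this post make the same assumptions about the format.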

Let's start with the date/time field. I worked at many restaurants in high school and college, and I learned over and over again that while most rush times are predictable, many aren't. The rushes normally fall between 7AM-9AM, 11:30AM-1:30PM, and 5PM-8PM, depending on factors like the location, the type of food, and the type of customers. During the less busy hours, managers would normally schedule fewer staff. Yet at least three times a week, we had a rush during those off hours. Sometimes it was bigger than our peak-hour rushes, but usually it would all come in a few minutes and then we wouldn't have any customers for twenty or thirty minutes. During these unexpected peaks, customers experienced longer than expected wait times, yet we would still average only one or two customers every couple of minutes over that hour. A manager looking at the hour level might think, "We didn't get that many more customers during this shift compared to other shifts. Why did we get so many bad reviews?" Without all the data, the manager might even fire the staff, thinking they were slacking off.

The internet is similar in that sometimes a bunch of people will decide to visit a website in the same second or minute. Often the servers are not spec'd to handle that much traffic, and these spikes cause all sorts of unexpected behavior that is hard to spot when the data is aggregated at the hour. Many times the servers won't even respond, which makes the spike invisible to tools that only record responses (Google Analytics, Omniture, and Open Web Analytics). I remember that at one particular company we would get these kinds of spikes and the software vendor was always blamed, because our web analytics tool showed that traffic wasn't any higher than normal for that day of the week. I went into the log files and was able to pinpoint the exact time the site went down. It turned out the site wasn't responding to two-thirds of our customers on those days; the analytics tool simply didn't know the site was down because no responses were ever recorded. In other words, the analytics tool was being used for something beyond its scope.

Knowing exactly when each request came in is crucial to making better decisions about hardware sizing and architecture.
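
As a quick illustration, counting requests per second takes only a few lines against the raw log. The file name here is hypothetical, and the timestamp regex assumes the Apache-style date format from the sample line above.

```python
import re
from collections import Counter

# Count requests per second straight from the raw log. The file name is
# hypothetical; the regex assumes the Apache-style [dd/Mon/yyyy:HH:MM:SS zone]
# timestamp shown in the sample line above.
per_second = Counter()
with open('access.log') as log:
    for line in log:
        stamp = re.search(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})', line)
        if stamp:
            per_second[stamp.group(1)] += 1

# The ten busiest seconds: exactly the spikes that hourly aggregation hides.
for second, hits in per_second.most_common(10):
    print(second, hits)
```

Seeing a burst of requests land in the same second naturally raises the next question: which IP addresses made them?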

An IP address can be used to figure out which country, region, city, postal code, and ISP a request originated from. It can also be used to detect denial-of-service attacks or to spot bots that aren't being polite.
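
For example, a simple tally of requests per IP address surfaces the heaviest clients, which is often enough to spot an impolite bot or the start of a denial-of-service attack. Again, the log file name is hypothetical.

```python
from collections import Counter

# Tally requests per client IP; the IP is the first field on each line.
# An address far above the rest deserves a closer look: an aggressive crawler,
# a scripted attack, or a proxy fronting many real users.
hits_by_ip = Counter()
with open('access.log') as log:
    for line in log:
        hits_by_ip[line.split(' ', 1)[0]] += 1

for ip, hits in hits_by_ip.most_common(20):
    print('%-15s %7d' % (ip, hits))
```

Combine the IP address with the user agent string and there is even more information to be gained.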

A lot of information can be extracted from the user agent string, including the OS, the type of OS (mobile, tablet, desktop), the version of the OS, the browser, the browser version, and even which toolbars and widgets are installed. Many mobile devices also include the manufacturer and name of the device. This information can be used to prioritize which mobile devices to test the site and/or apps against first. The user agent string can also be used to see which bots are indexing the site; I've used it to see when Google has indexed a specific page in order to test some SEO strategies. Many companies use this data to block bots that create unwanted traffic and consume extra bandwidth. When combined with the requester's IP address, the user agent string is fairly unique and can be used to approximate unique users.
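
One way to sketch that last idea: treat each distinct (IP address, user agent) pair as one visitor. This assumes the combined log format, where the user agent is the third quoted field, and the file name is again hypothetical.

```python
import re

# Approximate unique visitors by the (IP address, user agent) pair. In the
# combined format the user agent is the third quoted field; adjust the index
# if your LogFormat differs.
visitors = set()
with open('access.log') as log:
    for line in log:
        ip = line.split(' ', 1)[0]
        quoted = re.findall(r'"([^"]*)"', line)
        if len(quoted) >= 3:
            visitors.add((ip, quoted[2]))

print('approximate unique visitors: %d' % len(visitors))
```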

When hunting for bottlenecks, it's important to know not only the number of requests made per page, but also the size of the responses, in other words the bandwidth. The content length of the response can be used to calculate bandwidth utilization, discover pages or resources that are large and should be broken up, compare the bandwidth requirements of the site with the bandwidth subscription model (is the network the bottleneck?), and even help strategize about externally hosting some of the larger resources.
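
A rough sketch of that kind of report: sum the bytes sent per path (query string stripped) and list the heaviest resources. The regex leans on the combined format's quoted request field followed by the status code and bytes sent.

```python
import re
from collections import Counter

# Sum response bytes per requested path to see which resources dominate
# bandwidth. Lines whose byte count is "-" are simply skipped.
REQUEST = re.compile(r'"(?:GET|POST|HEAD) (?P<path>\S+) [^"]*" \d{3} (?P<bytes>\d+)')

bytes_by_path = Counter()
with open('access.log') as log:
    for line in log:
        match = REQUEST.search(line)
        if match:
            path = match.group('path').split('?', 1)[0]
            bytes_by_path[path] += int(match.group('bytes'))

for path, total in bytes_by_path.most_common(10):
    print('%10.1f KB  %s' % (total / 1024.0, path))
```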

Knowing the time taken to respond to a request is very helpful in understanding the actual user experience at peak and normal times. Having this data saves the effort of writing automated scripts that measure response times just to alert the sysadmins that something is going wrong. Now imagine combining it with the number of requests and the number of bytes returned during a time period of interest.
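
For instance, if the log format includes the time taken (as in the extended sample line earlier), a rough per-minute report of request counts and 95th-percentile response times falls out of a few lines. The percentile calculation here is deliberately crude.

```python
import re
from collections import defaultdict

# Bucket response times by minute and report a rough 95th percentile for each.
# Assumes the time taken (microseconds) is the last field on the line, as in
# the extended sample earlier; lines without it are simply skipped.
LINE = re.compile(r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2}[^\]]*\].* (\d+)$')

times_by_minute = defaultdict(list)
with open('access.log') as log:
    for line in log:
        match = LINE.search(line.rstrip())
        if match:
            times_by_minute[match.group(1)].append(int(match.group(2)))

for minute, times in sorted(times_by_minute.items()):
    times.sort()
    p95 = times[max(int(len(times) * 0.95) - 1, 0)]
    print('%s  requests=%5d  p95=%.1f ms' % (minute, len(times), p95 / 1000.0))
```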

Correlating the response time with the location of the originating request is also useful for judging whether slowness is a bandwidth problem on the requester's end or a problem in the web site itself. It also helps in seeing just how friendly the site is to dial-up and other high-latency internet connections.

When a link on a site is clicked, the entire path is recorded in the logs. For HTTP GET requests, all query string parameters are logged as well. This isn't just the case for links; oftentimes forms will use HTTP GET to send their information to the server, which means all of the form fields are logged. The potential in these log entries is endless. For example, let's say there is an opt-out checkbox included on various pages, or even on just one page. A quick report can be run to see what percentage of people are opting out.
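
A sketch of such a report, assuming the checkbox is submitted via GET as a hypothetical opt_out parameter:

```python
import re
from urllib.parse import urlsplit, parse_qs

# For GET requests carrying the (hypothetical) opt_out form field, work out
# what share of submissions opted out. The log path is hypothetical too.
REQUEST = re.compile(r'"GET (\S+) HTTP')

seen = opted_out = 0
with open('access.log') as log:
    for line in log:
        match = REQUEST.search(line)
        if not match:
            continue
        params = parse_qs(urlsplit(match.group(1)).query)
        if 'opt_out' in params:
            seen += 1
            if params['opt_out'] == ['1']:
                opted_out += 1

if seen:
    print('opt-out rate: %.1f%% of %d submissions' % (100.0 * opted_out / seen, seen))
```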

Another field is the referrer, which records the page or resource the user came from, including all of its query string parameters, and covers external sites referring to the site as well. Combining this with the requested path can be used, for example, to find patterns around which pages users who opt out are coming from; maybe a higher rate of users who opt out arrive from a certain page, which might be the page directly before or several pages back. The referrer's query string can also be used to report on the search terms responsible for visits, and that information can then feed SEO strategies. With some effort, the user's entire experience can be reported on, including the form fields sent, the response time, the size of each page returned, and even the response code for each page; that information can then be used to simplify the flow of the site. Another way to use this field is simply to report on how much traffic each referrer is responsible for. Combine this with the IP address and the user agent string and you can find out how many unique visitors are referred from another site or from a page on the same site.
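
Here is a rough sketch of the search-term report. It assumes the combined format, where the referrer is the second quoted field, and looks for a 'q' parameter, which is what the big search engines used at the time; other engines may name it differently.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Pull search terms out of referrer query strings. The referrer is the second
# quoted field in the combined format.
terms = Counter()
with open('access.log') as log:
    for line in log:
        quoted = re.findall(r'"([^"]*)"', line)
        if len(quoted) < 2:
            continue
        query = parse_qs(urlsplit(quoted[1]).query)
        for term in query.get('q', []):
            terms[term.lower()] += 1

for term, count in terms.most_common(20):
    print('%5d  %s' % (count, term))
```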

HTTP response codes explain whether a page was returned successfully, was redirected to another page, returned an error page, didn't respond, wasn't found, and so on. This information can be combined with the response time, the page requested, the referrer, or any of the other fields mentioned above. For example, the referrer will tell you which pages have bad links or which pages are responsible for server errors. Response codes can also be correlated with long response times, which is helpful in understanding how the server behaves when it's under heavy load. Some of these response codes might not be easy to reproduce; combining them with the user agent string to get the OS and browser can be very helpful in reproducing behavior caused by JavaScript problems in a certain version of a browser on a certain OS.
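
As one concrete example, grouping 404 responses by referrer points straight at the pages (or external sites) that still link to resources that no longer exist:

```python
import re
from collections import Counter

# Group 404 responses by referrer to find the pages (or external sites) that
# still link to resources that no longer exist. The log path is hypothetical.
STATUS = re.compile(r'" (\d{3}) ')

bad_links = Counter()
with open('access.log') as log:
    for line in log:
        match = STATUS.search(line)
        if match and match.group(1) == '404':
            quoted = re.findall(r'"([^"]*)"', line)
            bad_links[quoted[1] if len(quoted) >= 2 else '-'] += 1

for referrer, count in bad_links.most_common(10):
    print('%5d  %s' % (count, referrer))
```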

Sometimes it's easy enough to tell which resources are images, JavaScript, or CSS by their file extensions (map.jpg), but often those types of resources are returned by services that generate images or even JavaScript on the fly and no longer carry those extensions. Knowing the content type returned is useful for seeing which types of requests are making the servers work harder. For example, maybe the development team uses a service to generate a single PDF, and as the months go by they start using this service for more and more features. The number of users and their locations might be constant over those months, yet the servers become more taxed and the site considerably slower. Knowing that the number of dynamically generated PDF requests has increased is very useful in understanding the site's hardware requirements. Combining this with the time and content length also helps show how much bandwidth each type of content requires and at which times users ask for it.
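
A sketch of that kind of breakdown, assuming the LogFormat has been extended so each line ends with the response's Content-Type as a final quoted field (combined logs don't include it by default):

```python
import re
from collections import Counter

# Group request counts and bytes sent by content type. This assumes each line
# ends with the response Content-Type as a final quoted field, which requires
# extending the server's LogFormat; it is not logged by default.
LINE = re.compile(r'" \d{3} (\d+|-).*"([^"]*)"\s*$')

hits = Counter()
bytes_sent = Counter()
with open('access.log') as log:
    for line in log:
        match = LINE.search(line)
        if not match:
            continue
        size, content_type = match.groups()
        content_type = content_type.split(';', 1)[0] or 'unknown'
        hits[content_type] += 1
        if size.isdigit():
            bytes_sent[content_type] += int(size)

for content_type, count in hits.most_common():
    print('%7d  %8.1f MB  %s' % (count, bytes_sent[content_type] / 1048576.0, content_type))
```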

The character encoding can be useful for finding which languages are most popular. Combining this information with where the requests are coming from might be surprising. Also, combining it with specific pages can help prioritize which pages to translate into which languages.
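
If the charset is captured in a logged Content-Type field (as in the previous sketch), a per-page encoding count is a small variation on the same theme:

```python
import re
from collections import Counter

# Assuming the charset shows up in a logged Content-Type field, count
# encodings per page to help decide which pages are worth translating first.
PATH = re.compile(r'"(?:GET|POST) (\S+?)(?:\?\S*)? HTTP')
CHARSET = re.compile(r'charset=([\w-]+)', re.IGNORECASE)

by_page = Counter()
with open('access.log') as log:
    for line in log:
        path = PATH.search(line)
        charset = CHARSET.search(line)
        if path and charset:
            by_page[(path.group(1), charset.group(1).upper())] += 1

for (page, charset), count in by_page.most_common(20):
    print('%6d  %-12s %s' % (count, charset, page))
```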

The combinations are endless. Keeping all of these fields connected to each other at the request level makes it possible to report on many different dimensions. Aggregating at the hour level, however, means the fields get separated from one another. For example, a report that shows which browsers were used for a given page during a given minute, or even a given hour, becomes very hard to produce because that information was split apart to speed things up. The higher the level of aggregation, the harder it becomes to correlate across different dimensions. This is why most of the reports in web analytics tools are one, or maybe two, dimensional.
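
To show what keeping the fields connected buys, here is a sketch that answers a three- or four-dimensional question (hits per minute, browser family, page, and status code) directly from the raw log. The log path and the crude browser-family mapping are illustrative only.

```python
import re
from collections import Counter

# Because every field stays attached to its request, an ad-hoc
# multi-dimensional question reduces to picking a grouping key. Here:
# hits per (minute, browser family, page, status code).
def browser_family(user_agent):
    for name in ('Chrome', 'Firefox', 'Safari', 'MSIE', 'Opera'):
        if name in user_agent:
            return name
    return 'other'

LINE = re.compile(r'\[(?P<minute>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):[^\]]*\] '
                  r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3})')

report = Counter()
with open('access.log') as log:
    for line in log:
        match = LINE.search(line)
        quoted = re.findall(r'"([^"]*)"', line)
        if not match or len(quoted) < 3:
            continue
        page = match.group('path').split('?', 1)[0]
        key = (match.group('minute'), browser_family(quoted[2]), page, match.group('status'))
        report[key] += 1

for key, count in report.most_common(20):
    print(count, *key)
```

Of course, hand-writing a script like this for every new question doesn't scale, which is exactly the motivation for what comes next.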

What if it was possible to provide multi-dimensional reports in an ad-hoc manner while keeping all of the information connected and at the lowest possible level?

The next blog post will start getting more specific.