
Thursday, February 24, 2011

What Value is in My Web Access Log Files?

Over the last couple of months, I have been working on a prototype to show the value of the content in our log files. Usually, log files are only used during dire times when all else fails. Throughout the years, I have found that log files contain all sorts of valuable information that should be looked at frequently. The problem is that it just takes too much time and effort to go through them. This has been a passion of mine for over ten years now, and I have developed several tools to extract the valuable data from many types of log files. This post attempts to explain the value of the data in these log files. While I have decided to focus on web log files, the ideas apply to most types of log files, such as application debug logs, admin logs, and pretty much any other type of log file.

There are commercial, free-to-use and open-source products out there that can report on some of this data. I hate to even mention these other tools as they offer a different value to a different audience, but hopefully this helps to better frame the problem. The value areas these tools focus on require them to lose a lot of the value that exists in the log files. For example, they aggregate data at an hourly or daily level, and some of them require modifications to the actual web pages, and therefore only record the responses, and only for the pages that the web designers or developers remember to put the tracking code in.

In order to understand the value lost by these tools, it is important to first understand the type of data already being logged by the web server. After that, it will be easier to see not just the value of the individual fields, but of linking this data together in ways never thought possible before.

A line in a web access log file contains lots of information about each request. For example: the date/time/millisecond of the request, the IP address of the requester, the time taken to respond to the request, the number of bytes returned, the HTTP method used, the path requested along with all query parameters, the page or URI responsible for the request (the referrer), the HTTP response code, the content type of the response, the user agent string, the character encoding and more.
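
To make those fields concrete, here is a minimal sketch in Python of pulling them out of a log line. The exact layout depends on the server's configured log format; the pattern and sample line below assume an Apache/Nginx "combined"-style format and are only illustrative.

    import re

    # A minimal sketch: parse one line of an Apache/Nginx "combined"-style access log.
    # The exact layout depends on the server's LogFormat; this pattern is an assumption.
    LINE = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<path>\S+) \S+" '
        r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
        r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
    )

    def parse_line(line):
        """Return a dict of fields for one request, or None if the line does not match."""
        match = LINE.match(line)
        return match.groupdict() if match else None

    sample = ('203.0.113.7 - - [24/Feb/2011:13:05:42 -0700] '
              '"GET /products?optout=true HTTP/1.1" 200 5120 '
              '"http://example.com/signup" "Mozilla/5.0 (Windows NT 6.1)"')
    print(parse_line(sample))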

Let's start with the date/time field. I worked at many restaurants in high school and college. From this experience, I learned over and over again that while most rush times are predictable, many aren't. The rush times are normally between 7AM-9AM, 11:30AM-1:30PM and 5PM-8PM, depending on many factors like the location, the type of food and the type of customers. During the less busy hours, managers would normally have less staff on hand. I remember that at least three times a week, we had a rush during those off times. Sometimes the rush would be bigger than during our peak hours, but usually it would all come in a few minutes and then we wouldn't have any customers for twenty or thirty minutes. During these unexpected peak times, the customers would experience longer than expected wait times, yet we would still average only one or two customers every couple of minutes for that hour. A manager looking at the hour level might think, "We didn't get that many more customers during this shift compared to other shifts. Why did we get so many bad reviews?" Without all the data, the manager might even fire the staff, thinking they are slacking off.

The internet is similar in that sometimes a bunch of people will decide to visit a website in the same second or minute. Often, the servers are not spec'd to cover that much traffic, and these spikes cause all sorts of unexpected behavior that is hard to spot when aggregating the data at the hour. In fact, many times the servers won't even respond, making the event invisible to tools that only record responses (Google Analytics, Omniture, and Open Web Analytics). I remember at one particular company, we would get these kinds of spikes and the software vendor was always blamed. Our web analytics tool showed that traffic wasn't any higher than normal for that day of the week. I went into the log files and was able to pinpoint the exact time the site went down. It turns out the site wasn't responding to two-thirds of our customers on those days of the week; the analytics tool simply didn't know the site was down because no responses were recorded. In other words, the analytics tool was being used for something beyond its scope.

Knowing exactly when each request came is crucial to making better decisions about the hardware sizing or architecture. Knowing that a bunch of requests came in during the same second might provoke the need to know which IP addresses requested the resources.
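
As a rough illustration, here is a small sketch that buckets requests by second, so spikes that an hourly report would smooth away become visible. The records are hypothetical; in practice they would come from parsed log lines like the one above.

    from collections import Counter
    from datetime import datetime

    # Hypothetical parsed records; in practice these would come from parsed log lines.
    records = [
        {'time': '24/Feb/2011:13:05:42 -0700'},
        {'time': '24/Feb/2011:13:05:42 -0700'},
        {'time': '24/Feb/2011:13:05:43 -0700'},
    ]

    def second_bucket(record):
        # Parse the timestamp; this log format already has one-second resolution.
        return datetime.strptime(record['time'], '%d/%b/%Y:%H:%M:%S %z')

    per_second = Counter(second_bucket(r) for r in records)
    # Show the busiest seconds first -- the spikes an hourly report would hide.
    for second, count in per_second.most_common(5):
        print(second, count)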

An IP address can be used to figure out which country, region, city, postal code and ISP the request originated from. It can also be used to detect denial-of-service attacks or bots that aren't being polite. Combine this information with the user agent string and there is even more to be gained.

A lot of information can be extracted from the user agent string, including the OS, the type of OS (mobile, tablet, desktop), the version of the OS, the browser, the browser version, as well as which toolbars and widgets are installed. Many mobile devices even include the manufacturer and name of the device. This information can be used to prioritize which mobile devices to test the site and/or apps against first. The user agent string can also be used to see which bots are indexing the site. I've used it to see when Google has indexed a specific page in order to test some SEO strategies. Many companies use this data to block bots that create unwanted traffic and consume extra bandwidth. When combined with the requester's IP address, the user agent string is fairly unique and can be used to estimate unique users.
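
As a sketch of the idea, the snippet below does a very crude keyword classification of user agent strings into bot, mobile/tablet and desktop buckets. A real report would use a maintained user-agent parsing library; the strings and keywords here are only illustrative assumptions.

    from collections import Counter

    # Hypothetical user agent strings pulled from parsed log records.
    user_agents = [
        'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/534.24',
        'Mozilla/5.0 (iPhone; CPU iPhone OS 4_3 like Mac OS X)',
        'Googlebot/2.1 (+http://www.google.com/bot.html)',
    ]

    def classify(agent):
        # Very rough keyword matching; a real report would use a UA-parsing library.
        agent = agent.lower()
        if 'bot' in agent or 'spider' in agent or 'crawler' in agent:
            return 'bot'
        if 'iphone' in agent or 'ipad' in agent or 'android' in agent:
            return 'mobile/tablet'
        if 'windows' in agent or 'macintosh' in agent or 'linux' in agent:
            return 'desktop'
        return 'unknown'

    print(Counter(classify(ua) for ua in user_agents))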

When discovering bottlenecks, it's important to know not only the number of requests made per page, but also the size of the response, or the bandwidth. The content length of the response can be used to calculate bandwidth utilization, discover pages or resources that are large and should be broken up, compare the bandwidth requirements of the site with the bandwidth subscription model (is the network the bottleneck?), and even help strategize about externally hosting some of the larger resources.
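
Here is a small sketch of that kind of report, summing response bytes per requested path to surface the biggest bandwidth consumers. The records and sizes are hypothetical.

    from collections import defaultdict

    # Hypothetical parsed records with the requested path and the response size in bytes.
    records = [
        {'path': '/reports/annual.pdf', 'bytes': '2097152'},
        {'path': '/index.html',         'bytes': '15360'},
        {'path': '/reports/annual.pdf', 'bytes': '2097152'},
    ]

    bytes_by_path = defaultdict(int)
    for r in records:
        if r['bytes'].isdigit():   # some servers log '-' when no body was returned
            bytes_by_path[r['path']] += int(r['bytes'])

    # Largest bandwidth consumers first: candidates for splitting up or hosting elsewhere.
    for path, total in sorted(bytes_by_path.items(), key=lambda kv: kv[1], reverse=True):
        print(f'{total:>12} bytes  {path}')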

Knowing the time taken to respond to a request is very helpful in understanding the actual user experience at peak and normal times. Having this data saves the effort of writing automated scripts that measure the response time and alert the sysadmins that something is going wrong. Now imagine combining this with the number of requests and the number of bytes returned during the time period of interest.

Correlating the response time with the location of the originating request is also useful in judging whether slowness is due to the requester's bandwidth or to a problem in the web site itself. It also helps in seeing just how friendly the site is to dial-up or other high-latency internet connections.

When a link on a site is clicked, the entire path is recorded in the logs. For HTTP GET requests, all query string parameters are logged as well. This isn't just the case for links. Oftentimes, forms will use HTTP GET to send their information to the server, which means all the form fields are logged. The potential in these log entries is endless. For example, let's say there is an opt-out checkbox included on various pages, or even on just one page. A quick report can be run to see what percentage of people are opting out.
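
For instance, here is a minimal sketch of that opt-out report, counting GET requests whose query string contains an opt-out flag. The paths and the "optout" parameter name are assumptions for illustration.

    from urllib.parse import urlsplit, parse_qs

    # Hypothetical request paths taken from GET requests; 'optout' is an assumed field name.
    paths = [
        '/preferences?email=weekly&optout=true',
        '/preferences?email=weekly',
        '/preferences?email=daily&optout=true',
    ]

    def opted_out(path):
        query = parse_qs(urlsplit(path).query)
        return query.get('optout', ['false'])[0] == 'true'

    opt_outs = sum(opted_out(p) for p in paths)
    print(f'{opt_outs}/{len(paths)} requests opted out ({100 * opt_outs / len(paths):.0f}%)')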

Another field is the referrer. This identifies the page or resource the user came from, including all of its query string parameters, and it covers external sites referring to the site as well. Combining this with the requested path can be used, for example, to find patterns around which pages users who are opting out are coming from; maybe a higher rate of users who opt out come from a certain page, which might be directly before the opt-out page or several pages back. The referrer's query string can also be used to report on the search terms responsible for visits, which can then feed SEO strategies. With some effort, the user's entire experience can be reported on, including the form fields sent, the response time, the size of the page returned and even the response codes returned for each page. This information can then be used to simplify the flow of the site. Another way to use this field is simply to report on how much traffic each referrer is responsible for. Combine this with the IP address and the user agent string and you can find out how many unique visitors are referred from another site or from a page on your own site.
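
As one concrete example, the sketch below pulls search terms out of referrer query strings. The referrer URLs are made up, and the "q" and "p" parameter names are assumptions based on what the major search engines commonly used; the right parameter depends on the engine.

    from collections import Counter
    from urllib.parse import urlsplit, parse_qs

    # Hypothetical referrer URLs; 'q' and 'p' are assumed search-term parameters.
    referrers = [
        'http://www.google.com/search?q=web+log+analysis',
        'http://www.google.com/search?q=access+log+value',
        'http://search.yahoo.com/search?p=web+log+analysis',
    ]

    def search_terms(referrer):
        query = parse_qs(urlsplit(referrer).query)
        for param in ('q', 'p'):
            if param in query:
                return query[param][0]
        return None

    terms = Counter(t for t in (search_terms(r) for r in referrers) if t)
    print(terms.most_common())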

HTTP response codes explain whether a page was returned successfully, was redirected to another page, returned an error page, wasn't found, and so on. This information can then be combined with the response time, the page requested, the referrer, or any of the other fields mentioned above. For example, the referrer will tell you which pages have bad links or which pages are responsible for server errors. It can be combined with the response time to help see why certain responses are slow, which is helpful in understanding how the server behaves when it's under heavy load. Some of these response codes might not be easy to reproduce. Combining them with the user agent string to get the OS and browser can be very helpful in reproducing behavior that might be due to JavaScript problems in a certain version of a browser on a certain OS and version.
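
Below is a small sketch of one such combination: grouping 404 responses by referrer to find the pages that still link to resources that no longer exist. The records are hypothetical.

    from collections import Counter

    # Hypothetical parsed records; find which pages link to resources that return 404.
    records = [
        {'status': '404', 'path': '/old-pricing.html', 'referrer': 'http://example.com/products'},
        {'status': '200', 'path': '/pricing.html',     'referrer': 'http://example.com/products'},
        {'status': '404', 'path': '/old-pricing.html', 'referrer': 'http://example.com/faq'},
    ]

    broken_links = Counter(
        (r['referrer'], r['path']) for r in records if r['status'] == '404'
    )
    # Each entry is a page that links to something that no longer exists.
    for (referrer, path), count in broken_links.most_common():
        print(f'{count:>4}  {referrer} -> {path}')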

Sometimes it's easy enough to tell which resources are images, JavaScript or CSS by their file extensions (map.jpg), but oftentimes those types of resources are returned by services that generate images or even JavaScript, and those URLs no longer have file extensions. Knowing the content type returned is useful in seeing which types of requests might be working the server harder. For example, maybe the development team uses a service to generate one PDF. As the months go by, they start using this service for more and more features. The number of users and the location of those users might be constant over those months, but the servers might be more taxed and the site considerably slower. Knowing that the number of dynamic PDF requests has increased is very useful in understanding the site's hardware requirements. Combining this with the time and the content length is also helpful in seeing how much bandwidth is required for which types of content, and at which times users request that type of content.

The character encoding can be useful for finding which languages are most popular. Combining this information with the locations the requests come from might be surprising. Also, combining it with certain pages might help prioritize which specific pages to translate into which languages.

The combinations are endless. Keeping each of these fields connected to the others at the request level makes it possible to report on many different dimensions. Aggregating at the hour level, however, means this data has to be separated. For example, producing a report that displays which browsers were used for a given page during a given minute, or even hour, becomes very hard because the information was split apart to speed things up. The higher the level of aggregation, the harder it becomes to correlate across different dimensions. This is why most of the reports in web analytics tools are one or maybe two dimensional.

What if it was possible to provide multi-dimensional reports in an ad-hoc manner while keeping all of the information connected and at the lowest possible level?

The next blog post will start getting more specific.

Friday, October 16, 2009

Delivering vs Defining

After going to an internal conference this week, I was finally able to put my finger on something that I couldn't pin down since I started working for my current employer. That thing is the power of empowering your employees.

Overall, the internal conference was great. I applaud the people who put it on! The idea was to save money by having our own internal gurus from different departments present on various topics ranging from management and design to development, testing and systems. I enjoyed the topics very much. We even had James Gosling come and give the keynote.

I particularly remember a session by an old colleague that showed how he used Hadoop to write a quick tool that filled a need.

The conference got me thinking and comparing the different departments as they presented on very different topics and technologies.

The first organization has a list of blessed technologies, and any technology that doesn't fit that mold requires meetings and a formal approval process. There is a team that defines exactly what can and cannot be used and how those technologies should be used. This department is very worried about solving situations that might someday occur.

I remember one day I was tinkering with Groovy and a member of this team came by and asked why I was playing with a non-approved technology. That was a first in my career. I am used to something more like, "Hey, that's cool! Does it have closures?" Another time, I had an idea that would have helped 5-10 technical people (non-paying customers). It would have taken only a few days to complete and I had nothing else on my plate at the time. The problem was I didn't have anywhere to bill my time, so I was asked to document the idea and put it through a process where it gets prioritized, voted on by several other people and then finally, if found important enough, funded. The process takes at least two weeks, usually closer to 4-6 weeks. Instead of completing that idea, I spent my time doing other menial tasks that I could bill my time to. Basically, very few of these ideas get done.

I have less insight into the other organization(s); I have only had brief experiences with them. Basically, they seem to be about just getting stuff done, using whatever technology they can to get it done. They seem to be the ones pushing the bar on technologies, and I haven't heard of a central approval board for their technologies and processes.

The epiphany I had from the presentations given at this conference is that the department focused on delivering presented on more recent technologies and even trends, while the other one presented on stuff that has been tried and true for a while now.

The difference is that one organization has a whole department of engineers who are expected and empowered to solve problems, while the other has only a single team of engineers with that expectation and empowerment. The rest of the employees are neither expected nor empowered to solve complicated problems unless a problem fits in the 3x5 box they are expected to think in.

Sunday, October 11, 2009

Randal Schwartz interviews Kent Beck

Randal Schwartz interviews Kent Beck in this episode of FLOSS Weekly. They talk about TDD, patterns, Agile and where XP fits in, along with a few other topics. I worked with Randal while I was at Yahoo! and he is brilliant. For those wanting an introduction to the topics listed above, along with some other interesting facts, this is a very good listen.

Friday, October 9, 2009

Red-Green-Refactor

Red-Green-Refactor is a phrase created to help those practicing TDD (Test-Driven Development) remember the importance of the order of tasks. The premise comes from running the unit tests in some sort of GUI. When a test fails, the GUI displays a red bar. This red bar makes it easy to recognize that a unit test has failed. When a test passes, then the GUI displays a nice green bar.

The first thing to accomplish is to write a failing test (Red bar). This proves that the test is a valid test and will fail when the code is not implemented correctly. Once a red bar can be reproduced, just enough code to get the test to pass (Green bar) is written. The green bar is now a safety line. The new goal after the green bar is to refactor the code while keeping the green bar. Now that the safety line exists, as much time as needed can be spent on the refactor. After the code is refactored, then the process starts all over again.
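
As a minimal illustration of one cycle, here is a sketch using Python's unittest. The slugify function is hypothetical; the point is only the order of work: the tests are written first and fail (red), just enough code is written to make them pass (green), and then the code is refactored while the tests keep passing.

    import unittest

    # Step 2 (green): just enough code to make the tests pass, then refactored
    # while the bar stays green. slugify() is a hypothetical function used only
    # to illustrate the cycle.
    def slugify(title):
        return title.strip().lower().replace(' ', '-')

    class SlugifyTest(unittest.TestCase):
        # Step 1 (red): these tests are written first and fail until slugify() behaves.
        def test_spaces_become_dashes(self):
            self.assertEqual('red-green-refactor', slugify('Red Green Refactor'))

        def test_surrounding_whitespace_is_trimmed(self):
            self.assertEqual('tdd', slugify('  TDD  '))

    if __name__ == '__main__':
        unittest.main()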

I like to call it R~e~d/Green/R~e~f~a~c~t~o~r as the least amount of time is spent making the bar green while the most amount of time is spent refactoring and creating the test.

Saturday, August 22, 2009

The Cost of a Bug

The Agile Mentor just published an article I wrote called "The Cost of a Bug". I'm interested in what you think.

http://blog.agiledad.com/2009/08/cost-of-bug-its-published.html

Monday, June 1, 2009

Test Infecting a Development Team

Most developers and managers alike know there is great value in unit testing. To this day, I have never met a manager who does not want developers to do unit testing of some sort. Developers know that they should be writing tests for their code, but have various reasons why they don't. This article summarizes the top reasons developers don't write unit tests. It also recommends a strategy of “test infecting” an entire development team.

Software testing is not a new trend. Developers have been testing their code in one way or another since the inception of computers. For example, developers write the code, install it, start the application and begin testing the section of code they are working on. In many cases, they will even write hooks into the application to allow it to be manually tested more easily. They run the application in debug mode, making sure that certain lines of code are executed with the correct variable state. No matter how it is accomplished, developers inherently feel the urge to test their code. The goal of this document is to identify how to create a climate that will sway that urge to a more efficient and valuable means of testing code.

Key Reasons Developers Do Not Write Unit Tests:
  1. There are no obvious benefits from writing unit tests. Most developers will not make the investment until they have seen the proven returns. However, unless they try writing unit tests, they will never see the benefits that arise from good unit testing.
  2. Unit testing is intimidating. Learning how to unit test is similar to learning a new culture. Developers do not know where to begin, let alone how a given section of code can be unit tested. Oftentimes, developers have been introduced to unit testing before, but have reservations about writing unit tests against a complex scenario. The mentality can become that there is no sense in learning a technique that is only useful for a small number of cases.
  3. It takes too much time to write unit tests. It is often believed that unit testing slows down the development of the product too much. Somehow, the feeling is that constant manual testing, usually requiring deep navigation into the application, is acceptable, but writing code that tests a method isn't. In a recent study at Google, it was found that seventy percent of the developers who refuse to write unit tests do so because it is simply too much work. They claimed that on average every line of code required around four lines of unit testing code.
  4. Many developers do not know what is expected of a unit test. They do not know what to unit test, what not to unit test and what an acceptable unit test is. When learning a new technology, reluctance is expected and is usually resolved by giving the developer time to learn. However, this same allowance is not usually given to developers with regard to unit testing. Instead, it is simply an expectation assumed by other team members or mandated by management, making the pressure of not failing all the greater. Even a seasoned developer with good unit testing habits regularly finds new, innovative ways of unit testing.

Developers need to feel that unit testing is not something they are doing to accommodate a requirement, but is a tool that can become one of the most useful tools in their toolboxes. Unit testing should make the developer feel empowered, not restricted. For the majority of cases, unit testing is a more useful tool than the debugger is. The more unit tests there are, the less the debugger is used.

The goal is not to just have developers write unit tests. Developers need to write good unit tests. We need master unit testers that write unit tests just like they write code. They need to have the same passion about unit testing that they have about coding. They must become test infected. The word infect means “to affect in a contagious way”. Test infected developers have passion about unit testing, sharing the many benefits of unit testing with those around them. Test infected developers see unit tests as a necessary tool when writing code. Developing without writing unit tests becomes unimaginable. Test infected developers plan out a unit testing strategy just like they would a development strategy. They find patterns in unit tests and create efficient techniques that lessen the work required to unit test, such as helper libraries specific to the application being unit tested.

Test infecting a development team has proven to be among the most difficult goals to achieve in this industry. Most organizations simply mandate unit tests, and simply requiring a particular practice has not proven successful in the past. For example, the industry has been insisting on documentation for years, and to this day good documentation is something rarely seen. Documentation can be seen as a feature of a product, yet it still often gets skipped. Unit tests, on the other hand, are never seen by the customer, making them even harder to enforce. Developers must see the benefits before they will commit to writing unit tests. This requires a commitment to resolving their concerns and giving them the time and resources required to learn good unit testing practices. Test infecting a development team requires complete support from management.

Over the years, it has been shown that test infecting a development team requires commitment and a reproducible process. With this in mind, a good strategy is to pick a member of the team who is a good people person, shows interest in unit testing and sees the benefits. One way to find this person is by programming with each team member, introducing unit testing to each one and resolving concerns. The earliest adopter, who is hopefully also a good mentor, is always the best candidate. Once chosen, time is invested in showing that developer the return on investment as well as the ins and outs of unit testing. Programming with the developer is crucial at this point. Doing so answers questions immediately and helps demonstrate alternative ways of writing the tests. This gives the developer confidence and answers to many of the questions asked by those learning how to unit test effectively. The developer will soon become test infected.

Once the contagion is planted, it will spread across the team. The process simply begins again. Only this time, the test infected developer will become the subject matter expert on the team. He will program with each team member and help choose the next developer to be infected. He will want to make unit testing easier for his teammates. He will be responsible for resolving any concerns regarding unit testing on his team. The excitement and passion about unit testing will spread to other members of the team with time, leaving only a few who simply refuse to unit test. With the organization's mandate to unit test, the last few resisting will not only be required to unit test, but they will have been trained on how to do so. Those last few will have the complete support of their team and management to help them succeed.

There are several concerns that developers have about unit testing. If these concerns are ignored and unit testing is simply mandated, then getting developers to unit test is not very likely. However, if some time is invested and support is shown by not only resolving their concerns, but also showing them all of the benefits of unit testing, the possibility of developers realizing unit testing is simply another tool to put in their toolbox greatly increases. If developers acquire the desire to unit test, then the quality of the code, the unit tests, the application being released and the developers' skills will drastically improve. The management challenge is to find the early adopter with the courage and vision to take that first step.

If you are running into the chicken-and-egg problem, then training is a great option. There are also local user groups that the developer can attend. Of course, training and attending user group sessions will greatly help, but it may take some time. The important thing is to be patient, empathetic and supportive. Having a small pilot project for the developer to do in his spare time is a great way for him to practice what was learned from the training and user groups, and it will likely speed up the learning process. Another option is to bring in a consultant for a month or two for that developer to work with. In other words, be resourceful! In the end, put this person into a situation where he can be successful on the first effort. Don't forget this person is still new to it and might become discouraged unless you are there to constantly support him. The good news is, he is likely eager to spread the joy.