How to read the Apache access_log file (Part 1)


Right off the bat, let me make one thing perfectly clear.  I don't know that much about Apache's access_log file - but I am learning.  One thing I've learned - with all due respect to the tools available - is that every log analysis tool is different, and no single tool I've seen gives people all the information they want to see about their access_log.

Let me give you an example.  A friend of mine is what I call a "marketing type" who owns an e-commerce site.  A few weeks ago, he voiced his frustration with the existing tools, and wondered if I could provide a little help.  He wanted the usual details out of his log file:  Who's visiting his site?  Where are they coming from?  What products are they looking at?  But beyond that, he also wanted more details:  Are people coming back consistently, day-to-day?  What path did they take to make a buy?  As usual, he also wanted the ultimate "What if ...?" tool.

Could I provide a little help?  "Sure, for the right price," I said.

In Part 1 of this article, I'm going to show you a method I came up with to read the access_log file.  Is it the best way to read the log file?  I don't know for sure yet, but if you have a better way to do it, well, in the words of Ross Perot, I'm all ears.

Whose idea was this?

After poking around the access_log file format for a while, my first thought was "Whose idea was this?"  Once I recovered from my first look, I got another cup of coffee, closed the office door, got out some paper and started working on a method.

My friend has what is called an ECLF ("Extended Common Log Format") access_log file.  This is an extended file format that consists of nine fields of data for each record.  Listing 1 shows two sample records from a log file with this format.

- - [01/Sep/1998:20:18:06 -0400] "GET /images/briefcase001.gif HTTP/1.1" 200 86 "" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"
- - [01/Sep/1998:20:18:07 -0400] "GET /images/umbrella001.gif HTTP/1.1" 200 101 "" "Mozilla/4.0 (compatible; MSIE 4.01; Windows 95)"

Listing 1:  Two sample records from an ECLF access_log file.  This format provides nine fields of information for each record.

As you can see, at first glance it's hard to tell that there are actually nine fields of information per record.  But believe me, they're in there.  The first three fields are just text fields separated by whitespace - no big deal for reading.  The entire fourth field is enclosed between the square brackets "[]".  The fifth field is enclosed in double quotes, while fields six and seven go back to being text-only.  The last two fields are both enclosed in double quotes.  A final point I should make is that the contents of these fields - especially those contained in the double quotes - can change significantly from record-to-record.

Well, after a couple of cups of coffee I came up with a solution.  Listing 2 shows the code that I came up with during my closed-door session.

Listing 2:  The program demonstrates how to read each line of the Apache ECLF access_log file.  Note that nothing is done after the fields are put into their variables - you can do whatever you want to do with the data at this point. 
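The listing itself does not appear in this copy of the article, so what follows is only a sketch rebuilt from the walkthrough below, not the original code.  The subroutine names (parse_line, read_log) and the nine variable names are my own choices, and the exact pattern is an assumption based on the field descriptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One capture group per field, in the order the article describes them.
# This pattern is an assumption, not the author's original.
my $record = qr/^(\S+) (\S+) (\S+) \[(.+?)\] "(.+?)" (\S+) (\S+) "(.*?)" "(.*?)"/;

# Split one log line into its nine fields.
# Returns an empty list when the line does not match.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    $line =~ s/\s+/ /g;    # collapse each run of whitespace into one space
    return $line =~ $record;
}

# Read the named log file line by line.
sub read_log {
    my ($path) = @_;
    open(my $log, '<', $path) or die "Cannot open $path: $!\n";
    while (my $line = <$log>) {
        my ($host, $ident, $user, $time, $request,
            $status, $bytes, $referer, $agent) = parse_line($line)
            or next;    # skip lines that do not fit the format
        # Do whatever you want with the nine variables here.
    }
    close($log);
}

# Only touch the file system when a log file name is given.
read_log($ARGV[0]) if @ARGV;
```

Saved under a name of your choosing, it would be run with the log file as its only argument.  Lines that do not fit the nine-field pattern are silently skipped in this sketch; you may prefer to count or print them instead.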

This solution begins with the standard file-opening stuff.  If the open statement fails, the program dies, and that's that.

If the open statement works as expected, I start having my fun.  First I do the normal chomp.  Then, because I saw a few inconsistencies in whitespace in the access_log data file I was looking at, I threw in this statement:

This statement matches one or more consecutive whitespace characters (that's what the \s+ means - one or more whitespace characters), and converts those to just one single space.  This makes some of the things I do later a little easier.  The "g" character in "go" indicates that all occurrences of my search pattern on a single line should be replaced, not just the first one.  If you leave the "g" out, only the first set of whitespace characters on each line will be substituted.
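The statement is missing from this copy of the article.  Going by the description above, it was presumably a substitution along the following lines; the sample strings are made up for illustration, and only the "g" flag is used here.

```perl
use strict;
use warnings;

# Made-up sample text with irregular spacing (spaces and tabs).
my $line = "GET   /index.html\t\tHTTP/1.1";
$line =~ s/\s+/ /g;    # every run of whitespace becomes one space
print "$line\n";       # prints "GET /index.html HTTP/1.1"

# Without "g", only the first run of whitespace is replaced.
my $first_only = "a  b   c";
$first_only =~ s/\s+/ /;
print "$first_only\n"; # prints "a b   c"
```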

The "o" character in "go" isn't required.  It tells Perl to compile the search pattern only once, the first time any variables in the pattern are interpolated, and I'm told that this might make the code a little faster.  Since this particular pattern contains no variables, the "o" should make no difference here.  I haven't had a chance to test the performance of this feature yet - but I hope to soon.

One statement does it all

The next statement is where all of the dirty work is done.  I use Perl's pattern-matching syntax to match the patterns I expect to find in each field, then store the pattern between parentheses, enabling me to return the actual data contained in the field.

With the access_log file in standard ECLF format, you know that five of the data fields are going to be simple strings to read, with each string separated by a space.  For those five fields I use this syntax:
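The syntax itself is not shown in this copy.  A common way to capture a plain, space-separated field is (\S+), meaning "one or more non-whitespace characters"; treating that as the syntax meant here is an assumption.  The sample line is hypothetical and contains only simple fields.

```perl
use strict;
use warnings;

# Hypothetical line made up of plain, space-separated fields only.
my $line = 'host.example.com - - 200 86';

# Each (\S+) captures one run of non-whitespace characters.
my @fields = $line =~ /^(\S+) (\S+) (\S+) (\S+) (\S+)$/;
print join('|', @fields), "\n";   # prints "host.example.com|-|-|200|86"
```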

The thing that makes this file so exciting to read is that the other four fields aren't formatted consistently.  Three of the fields are enclosed in double quotes, while the remaining one is enclosed in square brackets.

Fields eight and nine are read with one pattern, while field five is read with a slightly different one.  All three fields are enclosed in double quotes, but from what I've read, fields eight and nine may or may not contain data, so you want to read "zero or more" characters, while field five (the HTTP request) will always contain data.
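The two patterns are missing from this copy.  One way to express the difference, offered as a sketch rather than as the original code, is "(.*?)" for a quoted field that may be empty and "(.+?)" for one that must contain at least one character.  The sample string and variable names are made up.

```perl
use strict;
use warnings;

# Made-up tail of a log record: request, status, bytes, referer, agent.
my $tail = '"GET /index.html HTTP/1.1" 200 86 "" "Mozilla/4.0"';

# "(.+?)" needs at least one character between the quotes (field five).
# "(.*?)" also accepts an empty pair of quotes (fields eight and nine).
my ($request, $referer, $agent) =
    $tail =~ /^"(.+?)" \S+ \S+ "(.*?)" "(.*?)"$/;

# prints "GET /index.html HTTP/1.1||Mozilla/4.0"
print join('|', $request, $referer, $agent), "\n";
```

The "?" after the quantifier makes the match stop at the nearest closing quote that lets the rest of the pattern succeed, instead of running on to the last quote on the line.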

The fourth field, enclosed in square brackets, is tackled with this syntax:
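This syntax is also missing from this copy.  A minimal sketch, assuming the brackets are matched literally by escaping them with backslashes, might look like this; the sample string is a fragment of a record from Listing 1.

```perl
use strict;
use warnings;

# Fragment of a record from Listing 1.
my $line = '- - [01/Sep/1998:20:18:06 -0400] "GET /images/briefcase001.gif';

# \[ and \] match literal brackets; (.+?) captures what lies between them.
my ($timestamp) = $line =~ /\[(.+?)\]/;
print "$timestamp\n";   # prints "01/Sep/1998:20:18:06 -0400"
```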


The rest, as they say, is up to you

The program shown in Listing 2 goes through each line of data in the access_log file and assigns each field of information to a scalar variable.  At this point, I'm going to leave the rest up to you.

You now have some code that lets you store each field of data in a variable.  What you do with those variables inside the reading loop depends on what type of information you want, doesn't it?

Where do we go from here?

As you can see, there's still much work to be done.  I'll tackle my end of the bargain in a follow-up article later this week, where I'll show you what I've done to get some of the information my marketing-type friend is interested in.

If you're interested in tackling your own access_log file, you're welcome to download the code shown in Listing 2.  I think it can help you get started in the right direction.

Also, if you're not comfortable with the pattern-matching syntax I've used in this article, please stand by.  I'll try to include a more detailed discussion in another article.  Because of its unusual formatting, the access_log file makes a good case study for pattern-matching.

If you have any thoughts, comments, or suggestions pertaining to this article, drop us a line and let us know if there's a better/faster way to read the access_log file.  We're always interested in creating better code!