How to read the Apache access_log file (Part 2)
Determining the number of hits per client address
 

Introduction

In our first installment of this series, we showed a method that can be used to read the Apache web server's access_log file, when the Apache ECLF ("Extended Common Log Format") access_log file format is used.

In this article, we'll pick up right where we left off, and demonstrate how you can determine the number of hits your site has received from each client that accesses your site.  When we're finished, you should be able to generate a report like this from the data in your access_log file:

   NUMBER OF HITS PER IP ADDRESS:
   ------------------------------

   89 		 elvis.com
   67 		 jimmy.hoffa.com
   ...

This report shows that there are 89 entries in your access_log file from the domain elvis.com, 67 hits from jimmy.hoffa.com, etc.  The list continues in descending order of hit count until all entries (or as many as you ask for) are printed.

A report like this can help you determine more information about who's hitting your web site.  For instance, you can see if all of the hits are coming from one domain, or if the hits are spread out evenly across many domains.
 

Assigning access_log data to the hash

The first part of accumulating the data is fairly easy if you understand Perl hashes.  Right after the complex statement we use to read the access_log records (discussed in our first article), you just need to add one statement that increments a hash entry for each "client address" that's discovered in the access_log.  Listing 1 shows the read statement followed immediately by the hash assignment statement.

 
#  read a record from the access_log
   ($clientAddress,    $rfc1413,      $username,
   $localTime,         $httpRequest,  $statusCode,
   $bytesSentToClient, $referer,      $clientSoftware) =
   /^(\S+) (\S+) (\S+) \[(.+)\] \"(.+)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/o;
#  add the $clientAddress to the hash that counts (hits per $clientAddress)
   $numHits{$clientAddress}++;
 
Listing 1:  These two statements are used to read a record from the Apache ECLF file, and create a hash that contains the number of hits for each unique $clientAddress
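
To see the two statements of Listing 1 in context, here is a minimal, self-contained sketch.  The sample log lines and the count_hits subroutine name are made up for illustration, and skipping lines that do not match the pattern is an assumption that goes beyond the original listing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original program): count the hits
# per client address in a list of ECLF-style log lines.
sub count_hits {
    my @lines = @_;
    my %numHits;
    foreach my $line (@lines) {
        # Same pattern as Listing 1; only the first captured field is kept.
        my ($clientAddress) =
            $line =~ /^(\S+) (\S+) (\S+) \[(.+)\] "(.+)" (\S+) (\S+) "(.*)" "(.*)"/;
        # Assumption: ignore lines that do not match instead of counting them.
        next unless defined $clientAddress;
        $numHits{$clientAddress}++;
    }
    return %numHits;
}

# Made-up sample records, plus one malformed line.
my @sample = (
    'elvis.com - - [10/Oct/1999:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"',
    'jimmy.hoffa.com - - [10/Oct/1999:13:56:01 -0700] "GET /faq.html HTTP/1.0" 200 512 "-" "Mozilla/4.0"',
    'elvis.com - - [10/Oct/1999:13:57:12 -0700] "GET /logo.gif HTTP/1.0" 200 1024 "-" "Mozilla/4.0"',
    'this line is not a valid log record',
);

my %hits = count_hits(@sample);
foreach my $address (sort keys %hits) {
    print $address, " => ", $hits{$address}, "\n";
}
```

With the sample data above, elvis.com is counted twice, jimmy.hoffa.com once, and the malformed line is ignored.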
 
 

How it works

The %numHits hash works like this:  When a record is read from the access_log file, the $clientAddress will either be (a) in the hash already or (b) not in the hash already.  If this $clientAddress is not in the hash already, it's added to the hash, and then the count is incremented.  If it is in the hash already, the count for that $clientAddress is simply incremented.  This is best shown by example.
 

When a new $clientAddress record is read

Suppose that a record was just read in, and $clientAddress was 'fred.flinstone.com'.  Assuming that 'fred.flinstone.com' is not already in the hash, the record

   $numHits{'fred.flinstone.com'}

is first created.  Once created, its count is incremented from zero (Perl treats the new, undefined value as zero) to one with the ++ operator.
 

When an existing $clientAddress record is read

Suppose that a record was just read in, and $clientAddress was 'spock.startrek.com'.  Assuming that 'spock.startrek.com' is already in the hash, the record

   $numHits{'spock.startrek.com'}

is just incremented with the ++ operator.  It does not need to be created, because it already exists.
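
The two cases can be tried in isolation with a short sketch like the one below.  The starting count of 4 for 'spock.startrek.com' is a made-up value used only for this illustration.

```perl
use strict;
use warnings;

# Made-up starting state: one address has already been seen four times.
my %numHits = ('spock.startrek.com' => 4);

# Case 1: a new address.  The key is not in the hash yet, so ++ creates
# the entry and leaves it holding 1.
my $wasNew = exists($numHits{'fred.flinstone.com'}) ? 0 : 1;
$numHits{'fred.flinstone.com'}++;

# Case 2: an existing address.  ++ only bumps the stored count, 4 to 5.
$numHits{'spock.startrek.com'}++;

print "fred was new: ", $wasNew, "\n";
print "fred:  ", $numHits{'fred.flinstone.com'}, "\n";
print "spock: ", $numHits{'spock.startrek.com'}, "\n";
```

Note that no explicit "create" step is needed in either case; the same ++ statement handles both.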
 

Printing the data, sorted by the number of hits

If the first part of this article was only mildly tricky, the second part may be more interesting.  As it turns out, printing a hash in sorted order by the hash value is not a well-known recipe.  (Well, it may be better known now that we've published the technique in our Perl Q&A Center.)

The code that properly prints the number of hits per $clientAddress is shown in Listing 2.
 
   #--------------------------------------------#
   #  Output the number of hits per IP address  #
   #--------------------------------------------#

   print "NUMBER OF HITS PER IP ADDRESS:\n";
   print "------------------------------\n\n";
   $count=0;
   foreach $key (sort {$numHits{$b} <=> $numHits{$a}} (keys(%numHits))) {
      last if ($count >= $NUM_RECS_TO_PRINT);
      print "$numHits{$key} \t\t $key\n";
      $count++;
   }
   print "\n\n";
 
Listing 2:  This snippet of code demonstrates the method I use to print the number of hits per $clientAddress in numerically sorted order. 
 

How the sorting and printing techniques work

This technique works by extracting the keys from the hash, and giving those keys to a small "sort helper routine".  The helper routine is the code that looks like this:

   {$numHits{$b} <=> $numHits{$a}}

This routine is used to tell the sort routine that I want to sort the %numHits hash numerically, according to the number of hits per $clientAddress, which is the value contained in the hash.  (Remember, the $clientAddress is the hash key, and the hit count is the hash value.)  Because $b appears before $a in the comparison, the largest hit counts come first.

For a more thorough discussion of this technique, I'll refer you to our article on sorting a hash by the hash value.  The only difference between that discussion and what I've done here is that in this case, I've included the sorting code in the loop, instead of creating a separate subroutine.
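
For comparison, here is a rough sketch of the "separate subroutine" form mentioned above.  The names addresses_by_hits and $byHitsDesc are hypothetical, the 67-hit entry for spock.startrek.com is made up to force a tie, and the alphabetical tie-break is an addition that Listing 2 does not have.

```perl
use strict;
use warnings;

# Hypothetical helper: return the hash keys ordered by descending hit count.
sub addresses_by_hits {
    my ($hits) = @_;               # reference to a hash of address => count
    my $byHitsDesc = sub {
        $hits->{$b} <=> $hits->{$a}    # higher counts first
            or $a cmp $b;              # ties: alphabetical by address
    };
    my @sorted = sort $byHitsDesc keys %$hits;
    return @sorted;
}

my %numHits = (
    'jimmy.hoffa.com'    => 67,
    'elvis.com'          => 89,
    'spock.startrek.com' => 67,    # made-up entry to show the tie-break
);

foreach my $address (addresses_by_hits(\%numHits)) {
    print $numHits{$address}, " \t\t ", $address, "\n";
}
```

Keeping the comparison in its own routine makes it easy to reuse for other reports, at the cost of a little more code than the inline block in Listing 2.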

The only other thing that's unique in this loop is the statement

   last if ($count >= $NUM_RECS_TO_PRINT);

With this statement, I'm just giving the user more control over the output.  At the beginning of the program, the user can set the variable $NUM_RECS_TO_PRINT to control how many records should be output.  If you want to see only the top fifty addresses in the list, set this to 50; if you only want the top ten, set it to 10.
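
The effect of the limit can be sketched as a small routine that returns only the top addresses.  The name top_addresses and the sample counts for the third address are assumptions made for this example; the loop uses the same "last if" test as Listing 2.

```perl
use strict;
use warnings;

# Hypothetical helper: return at most $limit addresses, busiest first.
sub top_addresses {
    my ($hits, $limit) = @_;       # hash reference and maximum record count
    my @top;
    foreach my $key (sort { $hits->{$b} <=> $hits->{$a} or $a cmp $b } keys %$hits) {
        last if (@top >= $limit);  # stop once enough records are collected
        push @top, $key;
    }
    return @top;
}

my %numHits = (
    'elvis.com'          => 89,
    'jimmy.hoffa.com'    => 67,
    'spock.startrek.com' => 12,    # made-up count for illustration
);

my $NUM_RECS_TO_PRINT = 2;
foreach my $address (top_addresses(\%numHits, $NUM_RECS_TO_PRINT)) {
    print $numHits{$address}, " \t\t ", $address, "\n";
}
```

If the limit is larger than the number of addresses in the hash, the loop simply ends when the keys run out, so every address is returned.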
 

The full source code

If you're interested in downloading the full source code for this example, click here. The full source code is a complete, operational program.

Due to privacy concerns, we will not supply a sample Apache file in ECLF format - you'll have to supply your own.
 

Running the program

On a Unix system, you can run our program like this to get a count of the number of hits per client address in your access_log file:

(Warning:  Don't run this program against the live access_log file that your web server is writing to!  We suggest making a copy of the access_log, and then running this program against the copy.)
 

Warnings, caveats, and the future

Warnings:

  1. As mentioned before, don't run this program against your live access_log file.  We suggest making a copy of the access_log file, and running this program against the copy.

  2. Be warned that the Perl keys function can create a very large list in memory, and therefore a very large process, if you have a very large access_log file.  Because we rolled our access_log files daily, and the web site we tested was fairly small, the access_log never exceeded 10,000 to 20,000 records.  We performed our tests on a Unix system with 48 MB of memory and never experienced any problems, but memory use can become a problem if you have a very large site or you never roll over your access_log file.

If you have any thoughts, comments, or suggestions pertaining to this article, drop us a line.  We're always interested in hearing from our readers, especially if you've found an error or know a better way to accomplish a task.

As for the future, it seems that a lot of people (especially marketing people) want to know a lot of information about what's in their access_log files.  In future articles, we'll explore many other advanced topics on how to read the access_log file for fun and profit, including how to determine your most popular files, and much more.

Until then, best wishes!