| Developer's Daily | Perl Education |
Introduction
In our first installment of this series, we showed a method that can be used to read the Apache web server's access_log file, when the Apache ECLF ("Extended Common Log Format") access_log file format is used.
In this article, we'll pick up right where we left off, and demonstrate how you can determine the number of hits your site has received from each client that accesses your site. When we're finished, you should be able to generate a report like this from the data in your access_log file:
A report like this can help you determine more information about who's
hitting your web site. For instance, you can see if all of the hits
are coming from one domain, or if the hits are spread out evenly across
many domains.
Assigning access_log data to the hash
The first part of accumulating the data is fairly easy if you understand Perl hashes. Right after the complex statement we use to read the access_log records (discussed in our first article), you just need to add a hash statement that accumulates the number of hits for each "client address" that's discovered in the access_log. Listing 1 shows the read statement followed immediately by the hash assignment statement.
# read a record from the access_log
($clientAddress, $rfc1413, $username, $localTime, $httpRequest,
 $statusCode, $bytesSentToClient, $referer, $clientSoftware) =
    /^(\S+) (\S+) (\S+) \[(.+)\] \"(.+)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/o;

# add the $clientAddress to the hash that counts (hits per $clientAddress)
$numHits{$clientAddress}++;
Listing 1: These two statements are used to read a record from the Apache ECLF file, and create a hash that contains the number of hits for each unique $clientAddress.
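To see the read statement in isolation, here is a minimal sketch that applies the same pattern to a single line. The log line is made up for illustration, and the parse_eclf_line subroutine name is our own invention, not part of the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up log line in the extended format (not real traffic).
my $line = 'host1.example.com - - [10/Oct/2000:13:55:36 -0700] '
         . '"GET /index.html HTTP/1.0" 200 2326 '
         . '"http://www.example.com/start.html" "Mozilla/4.0 (compatible)"';

# Hypothetical helper: split one access_log line into its nine fields.
# In list context it returns an empty list when the line does not match.
sub parse_eclf_line {
    my ($record) = @_;
    return $record =~ /^(\S+) (\S+) (\S+) \[(.+)\] "(.+)" (\S+) (\S+) "(.*)" "(.*)"/;
}

my ($clientAddress, $rfc1413, $username, $localTime, $httpRequest,
    $statusCode, $bytesSentToClient, $referer, $clientSoftware)
    = parse_eclf_line($line);

if (defined $clientAddress) {
    print "client:  $clientAddress\n";
    print "request: $httpRequest\n";
    print "status:  $statusCode\n";
} else {
    print "line did not match the expected format\n";
}
```

Checking that $clientAddress is defined before using it is worth the extra line: a record that does not match the pattern would otherwise be counted under an undefined key.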
The %numHits hash works like this: When a record is read
from the access_log file, the $clientAddress will either
be (a) in the hash already or (b) not in the hash already. If this
$clientAddress is not in the hash already, it's added to the hash,
and then the count is incremented. If it is in the hash already,
the count for that $clientAddress is simply incremented.
This is best shown by example.
When a new $clientAddress record is read
Suppose that a record was just read in, and $clientAddress was 'fred.flinstone.com'. Assuming that 'fred.flinstone.com' is not already in the hash, the increment statement creates a new entry, $numHits{'fred.flinstone.com'}, and sets its count to 1.
When an existing $clientAddress record is read
Suppose that a record was just read in, and $clientAddress was 'spock.startrek.com'. Assuming that 'spock.startrek.com' is already in the hash, no new entry is created; the existing count, $numHits{'spock.startrek.com'}, simply goes up by one.
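The two cases can be tried out with a short sketch like the one below. The list of addresses is hypothetical and stands in for the values that would come from successive access_log records; the $isNew flag exists only to report which case was hit.

```perl
use strict;
use warnings;

my %numHits;

# Hypothetical client addresses, in the order they might appear in a log.
my @clients = ('spock.startrek.com', 'fred.flinstone.com', 'spock.startrek.com');

foreach my $clientAddress (@clients) {
    # Note whether the key exists before the increment touches it.
    my $isNew = exists $numHits{$clientAddress} ? 0 : 1;

    # A missing key starts out undefined; ++ treats that as 0, so it becomes 1.
    $numHits{$clientAddress}++;

    printf "%-20s %s, count is now %d\n",
        $clientAddress, ($isNew ? 'new key' : 'existing key'),
        $numHits{$clientAddress};
}
```

No explicit "is it there yet?" test is needed in the real program, because the single increment statement handles both cases.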
Printing the data, sorted by the number of hits
If the first part of this article was only mildly tricky, the second part may be more interesting. As it turns out, printing a hash in sorted order by the hash value is not a well-known recipe. (Well, it may be better known now that we've published the technique in our Perl Q&A Center.)
The code that properly prints the number of hits per $clientAddress
is shown in Listing 2.
#--------------------------------------------#
# Output the number of hits per IP address #
#--------------------------------------------#
print "NUMBER OF HITS PER IP ADDRESS:\n";
print "------------------------------\n\n";
$count=0;
foreach $key (sort {$numHits{$b} <=> $numHits{$a}} (keys(%numHits))) {
last if ($count >= $NUM_RECS_TO_PRINT);
print "$numHits{$key} \t\t $key\n";
$count++;
}
print "\n\n";
Listing 2: This snippet of code demonstrates the method I use to print the number of hits per $clientAddress in numerically sorted order.
How the sorting and printing techniques work
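This technique works by extracting the keys from the hash, and giving those keys to a small "sort helper routine". The helper routine is the block of code passed to sort in Listing 2, which looks like this:
{$numHits{$b} <=> $numHits{$a}}
Because $b appears on the left side of the <=> operator, the keys come back in descending order of hit count.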
For a more thorough discussion of this technique, I'll refer you to our article on sorting a hash by the hash value. The only difference between that discussion and what I've done here is that in this case, I've included the sorting code in the loop, instead of creating a separate subroutine.
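For comparison, here is a minimal sketch of the separate-subroutine form. The counts are hypothetical stand-ins for a %numHits hash built from a real log, and the names by_hits_descending and top_clients are ours, chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical counts. Declared with "our" so it is a package variable,
# like the global %numHits used in the article's listings.
our %numHits = (
    'a.example.com' => 3,
    'b.example.com' => 12,
    'c.example.com' => 7,
);

# Named comparison routine: descending by hit count.
# sort sets the package variables $a and $b before each comparison.
sub by_hits_descending { $numHits{$b} <=> $numHits{$a} }

# Return up to $limit keys, ordered by descending hit count.
sub top_clients {
    my ($limit) = @_;
    my @sorted = sort by_hits_descending keys %numHits;
    $limit = @sorted if $limit > @sorted;    # never ask for more than exist
    return @sorted[0 .. $limit - 1];
}

foreach my $key (top_clients(2)) {
    print "$numHits{$key} \t\t $key\n";
}
```

With these sample counts the loop prints b.example.com first and c.example.com second. Keys with equal counts come back in no particular order; if a stable report matters, one option is to extend the comparison with "or $a cmp $b" to break ties alphabetically.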
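The only other thing that's unique in this loop is the statement last if ($count >= $NUM_RECS_TO_PRINT); which exits the loop as soon as $NUM_RECS_TO_PRINT records have been printed, so the report shows only the busiest clients.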
The full source code
If you're interested in downloading the full source code for this example, click here. The full source code is a complete, operational program.
Due to privacy concerns, we will not supply a sample Apache file in
ECLF format - you'll have to supply your own.
Running the program
On a Unix system, you can run our program like this to get a count of the number of hits per client address in your access_log file:
Warnings, caveats, and the future
Warnings:
As for the future, it seems that a lot of people (especially marketing people) want to know a lot of information about what's in their access_log files. In future articles, we'll explore many other advanced topics on how to read the access_log file for fun and profit, including how to determine your most popular files, and much more.
Until then, best wishes!