| Developer's Daily | Perl Education |
Introduction
People often want to know whether their web site is a success, and which portions of it are succeeding. They ask questions like "What's drawing the most attention?" and "What are visitors looking at?". Log analysis tools have in fact become big business.
In this article we'll give you a method for determining the number of hits your web site is getting, sorted from the most popular files to the least viewed. We'll do this with a Perl program we've created called logHBF - "log hits by file". The program prints each requested file along with its hit count, most popular first.
A few highlights
Because this program includes a wealth of comments, and most of the details have been covered in previous articles, I'll only cover some of the highlights in this discussion.
Before going any further, you may want to click here to look at the source code for logHBF.pl.
As you can see from the source code, we still read the access_log file just as we demonstrated in our first article.
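As a reminder of the overall shape, reading log lines and tallying hits per file can be sketched as follows. This is a minimal sketch, not the logHBF.pl source: the sample log lines are made up, and the names count_hits and sorted_by_hits are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample lines in an Apache-style log format (made-up data).
my @sample_lines = (
    '10.0.0.1 - - [01/Jan/2000:00:00:01 -0500] "GET / HTTP/1.0" 200 1024 "-" "ExampleBrowser/1.0"',
    '10.0.0.2 - - [01/Jan/2000:00:00:02 -0500] "GET /perl/ HTTP/1.0" 200 2048 "-" "ExampleBrowser/1.0"',
    '10.0.0.3 - - [01/Jan/2000:00:00:03 -0500] "GET / HTTP/1.0" 200 1024 "-" "ExampleBrowser/1.0"',
);

# Tally hits per requested file; returns a reference to a hash of counts.
sub count_hits {
    my @lines = @_;
    my %hits;
    for my $line (@lines) {
        # The first quoted field is the request, e.g. "GET /perl/ HTTP/1.0".
        next unless $line =~ /"([A-Z]+) ([^\s"]+)[^"]*"/;
        $hits{$2}++;
    }
    return \%hits;
}

# Return file names ordered from most hits to fewest (ties broken by name).
sub sorted_by_hits {
    my ($hits) = @_;
    return sort { $hits->{$b} <=> $hits->{$a} or $a cmp $b } keys %$hits;
}

my $hits = count_hits(@sample_lines);
printf "%6d  %s\n", $hits->{$_}, $_ for sorted_by_hits($hits);
```

To run this against a real log, you would read the lines from your access_log file instead of the sample array.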
Determining $fileRequested
Next, after reading each line of data, we do something new: we break the $httpRequest field down into multiple sub-fields, one of which is the name of the requested file, $fileRequested.
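One way to do this kind of split is sketched below, assuming the request field holds a string such as "GET /perl/index.shtml HTTP/1.0". The names $httpRequest and $fileRequested come from the article; split_request, $method and $protocol are hypothetical, and the original program may break the field down differently.

```perl
use strict;
use warnings;

# Split an HTTP request string into method, requested file and protocol.
sub split_request {
    my ($httpRequest) = @_;
    # Splitting on ' ' discards leading whitespace and splits on runs of spaces.
    my ($method, $fileRequested, $protocol) = split ' ', $httpRequest, 3;
    return ($method, $fileRequested, $protocol);
}

my ($method, $fileRequested, $protocol) =
    split_request('GET /perl/index.shtml HTTP/1.0');
print "method=$method file=$fileRequested protocol=$protocol\n";
# prints: method=GET file=/perl/index.shtml protocol=HTTP/1.0
```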
Ignoring hits to GIF and JPG files
In the code we also ignore hits to GIF and JPG files. These are generally not of interest, but they can be in certain circumstances. If you want to see these hits, set the variables IGNORE_GIF_FILES and IGNORE_JPG_FILES to '0'.
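A filter of this kind might look like the sketch below. The flag names come from the article, but declaring them as package scalars is an assumption, and should_skip is a hypothetical helper rather than a function from logHBF.pl.

```perl
use strict;
use warnings;

# Set either flag to 0 to include that file type in the counts.
our $IGNORE_GIF_FILES = 1;
our $IGNORE_JPG_FILES = 1;

# Return 1 if the requested file should be left out of the counts, else 0.
sub should_skip {
    my ($fileRequested) = @_;
    return 1 if $IGNORE_GIF_FILES && $fileRequested =~ /\.gif$/i;
    return 1 if $IGNORE_JPG_FILES && $fileRequested =~ /\.jpe?g$/i;
    return 0;
}

for my $file ('/images/logo.gif', '/photos/team.JPG', '/perl/index.shtml') {
    printf "%-20s %s\n", $file, should_skip($file) ? 'skipped' : 'counted';
}
```

The match is case-insensitive so that names like "team.JPG" are caught as well.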
Dealing with index files
In this sort of code you need to count a hit to an index file (e.g., index.html) the same as a hit to its directory. For example, on our web site, we need to count a hit to "/index.shtml" the same as a hit to "/".
To enable this, our program includes a section of code that treats the two names as one before the hit is counted.
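One way to get this behavior is to rewrite a request for an index file to the name of its directory before counting it. The sketch below assumes the index files are named index.html, index.htm, index.shtml or index.shtm; fold_index_file is a hypothetical name, and the original program may handle the mapping differently.

```perl
use strict;
use warnings;

# Map a request for an index file onto its directory, so that
# "/index.shtml" and "/" share one counter.
sub fold_index_file {
    my ($fileRequested) = @_;
    # Assumed index file names; adjust the pattern for your server.
    $fileRequested =~ s{/index\.s?html?$}{/};
    return $fileRequested;
}

for my $file ('/index.shtml', '/perl/index.html', '/perl/intro.html') {
    print "$file => ", fold_index_file($file), "\n";
}
```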
Same directory, different names
A similar problem exists with directory names. It's possible that you'll get hits like "/products/" and "/products", where the only difference between the names is the trailing slash character. In each case the directory is the same, but the lack of a trailing "/" character causes the two names to be counted separately. Our program includes a section of code that handles this, so both names are counted as one.
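A simple way to merge the two names is to append the missing slash. The sketch below rests on a loudly stated assumption: a name whose last path segment has no file extension is treated as a directory. That heuristic will misclassify extensionless files, and add_trailing_slash is a hypothetical name, not code from logHBF.pl.

```perl
use strict;
use warnings;

# Give directory-like names a trailing "/" so that "/products" and
# "/products/" are counted under the same key.
sub add_trailing_slash {
    my ($fileRequested) = @_;
    # Already ends in a slash: nothing to do.
    return $fileRequested if $fileRequested =~ m{/$};
    # Last segment has an extension (e.g. ".html"): assume it is a file.
    return $fileRequested if $fileRequested =~ m{\.[^/.]+$};
    return "$fileRequested/";
}

for my $file ('/products', '/products/', '/products/list.html') {
    print "$file => ", add_trailing_slash($file), "\n";
}
```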
Related articles
Everything else in the code has already been discussed in our previous articles on this topic, so we won't repeat those discussions here.
Download the source code
If you're interested in downloading the full logHBF.pl source code, click here. The full source code is a complete, operational program.
Due to privacy concerns, we do not supply a sample Apache log file in ECLF (Extended Common Log Format) - you'll have to supply your own.
Running the program
On a Unix system, you can run our program against your access_log file to get a count of the number of hits per file.
Conclusions
If you want to know what people like about your web site, one way to get the answer is to determine what people are looking at most frequently. The logHBF.pl program we've provided can certainly get you started in the right direction.
If you'd like to see other features of this topic covered in the future, send us a quick e-mail with your idea.