How to read the Apache access_log file (Part 3)
Determining the most popular files on your web site
 

Introduction

People who run a web site very often want to know whether it is a success and, more specifically, which portions of it are succeeding.  They ask questions like "What's drawing the most attention?" and "What are visitors looking at?"  Log analysis tools have, in fact, become big business.

In this article we'll show you a method for determining the number of hits your web site is getting, sorted from the most popular files to the least viewed.  We'll do this with a Perl program we've created called logHBF - "log hits by file".  The program generates a report that lists each requested file along with its hit count.

With this kind of report you can determine what resources people are interested in.
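
The core idea behind such a report can be sketched in a few lines of Perl: count each requested file in a hash, then sort the file names by their counts.  This is only a minimal sketch, not the logHBF.pl source; the helper name hits_by_file and the sample file names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not taken from logHBF.pl): count how many times
# each file name appears, then return [file, count] pairs sorted from
# most hits to fewest.  Ties are broken alphabetically by file name.
sub hits_by_file {
    my @files = @_;
    my %hits;
    $hits{$_}++ for @files;
    return map  { [ $_, $hits{$_} ] }
           sort { $hits{$b} <=> $hits{$a} || $a cmp $b }
           keys %hits;
}

# Made-up requests, for illustration only.
my @requested = ('/', '/products', '/', '/about.html', '/products', '/');

for my $row (hits_by_file(@requested)) {
    printf "%6d  %s\n", $row->[1], $row->[0];
}
```

A hash keyed by file name keeps the counting step simple, and sorting only happens once, after the whole log has been read.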
 

A few highlights

Because this program includes a wealth of comments, and most of the details have been covered in previous articles, we'll cover only some of the highlights in this discussion.

Before going any further, you may want to look at the source code for logHBF.pl.

As you can see from the source code, we still read the access_log file just as we demonstrated in our first article.
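
As a reminder of what that step involves, here is a rough sketch of pulling the quoted request string out of one log line.  It is not the code from the first article; the helper name http_request_of is hypothetical, the log line is made up, and the regular expression assumes the usual Common Log Format field order.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the quoted request string out of one
# access_log line.  Assumes Common Log Format style fields:
#   host ident user [date] "request" status bytes ...
# Returns undef if the line does not look like that.
sub http_request_of {
    my ($line) = @_;
    return $line =~ /^\S+ \S+ \S+ \[[^\]]+\] "([^"]*)"/ ? $1 : undef;
}

# Made-up log line, for illustration only.
my $line = '192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] '
         . '"GET /index.shtml HTTP/1.0" 200 2326';

my $httpRequest = http_request_of($line);
print "$httpRequest\n" if defined $httpRequest;
```

In a real program the same helper would be called once per line inside a loop that reads the log file.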
 

Determining $fileRequested

Next, after reading each line of data, we do something a little new by breaking the $httpRequest field down into multiple sub-fields.  This is performed with a single line of code in logHBF.pl.

The only thing we're really after in that statement is the $fileRequested value.  This is the actual name of the file the visitor has requested.
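
One way to do this kind of breakdown is with split.  The sketch below assumes $httpRequest holds a string such as "GET /products/index.html HTTP/1.0"; the names $method and $protocol are hypothetical, and the statement in logHBF.pl may differ.

```perl
use strict;
use warnings;

# Assumed input: the request string from the log line.
my $httpRequest = 'GET /products/index.html HTTP/1.0';

# Split on whitespace into method, requested file, and protocol.
# Only $fileRequested is needed for counting hits by file.
my ($method, $fileRequested, $protocol) = split ' ', $httpRequest;

print "$fileRequested\n";
```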
 

Ignoring hits to GIF and JPG files

In the code we also ignore hits to GIF and JPG files.  These are generally not of interest, but they can be in certain circumstances.  If you want to see these hits, set the variables IGNORE_GIF_FILES and IGNORE_JPG_FILES to '0'.
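
A filter of this kind might look like the sketch below.  It assumes the two settings are plain scalars holding 1 or 0, and the helper name is_ignored is hypothetical; the real program may test the file names differently.

```perl
use strict;
use warnings;

# Assumed to be plain scalars: 1 means "skip these hits", 0 means "count them".
my $IGNORE_GIF_FILES = 1;
my $IGNORE_JPG_FILES = 1;

# Hypothetical helper: true if this request should be left out of the counts.
# The match is case-insensitive, so "/photo.JPG" is treated like "/photo.jpg".
sub is_ignored {
    my ($file) = @_;
    return 1 if $IGNORE_GIF_FILES && $file =~ /\.gif$/i;
    return 1 if $IGNORE_JPG_FILES && $file =~ /\.jpe?g$/i;
    return 0;
}

# Made-up requests, for illustration only.
for my $file ('/logo.gif', '/photo.JPG', '/index.html') {
    print "$file\n" unless is_ignored($file);
}
```

Note that this sketch only looks at the end of the file name, so a request with a query string after the image name would still be counted.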
 

Dealing with index files

In this sort of code you need to count a hit to an index file (e.g., index.html) the same as a hit to its directory.  For example, on our web site, we need to count a hit to "/index.shtml" the same as a hit to "/".

To enable this functionality, we added a foreach loop over an array named @indexFilenames to our program.

The array @indexFilenames is defined by you, the user, and it should contain a list of all possible index file names used on your site, such as index.htm, index.html, index.shtml, etc.  If you define the list, the foreach loop will turn $fileRequested strings like "/index.shtml" into "/".
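
A loop along those lines could be written as follows.  This is a sketch under the assumptions stated in the comments, not the section of code from logHBF.pl; the helper name strip_index_file is hypothetical.

```perl
use strict;
use warnings;

# Index file names used on the site (adjust the list to match yours).
my @indexFilenames = ('index.htm', 'index.html', 'index.shtml');

# Hypothetical helper: replace a trailing "/<index file>" with "/", so that
# "/index.shtml" becomes "/" and "/products/index.html" becomes "/products/".
sub strip_index_file {
    my ($fileRequested) = @_;
    foreach my $indexFile (@indexFilenames) {
        # \Q...\E makes the "." in the file name match literally, and \z
        # anchors the match at the very end of the string.
        last if $fileRequested =~ s{/\Q$indexFile\E\z}{/};
    }
    return $fileRequested;
}

print strip_index_file('/index.shtml'), "\n";
print strip_index_file('/products/index.html'), "\n";
```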
 

Same directory, different names

A similar problem exists with directory names.  It's possible that you'll get hits like "/products/" and "/products", where the only difference between the names is the trailing slash character.  In each case the directory is the same, but the missing trailing "/" character causes the two names to be counted separately.  To avoid this, a section of code in the program chops the trailing "/" character off of directories, essentially turning "/products/" into "/products".
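
The trimming step can be sketched like this.  The helper name strip_trailing_slash is hypothetical, and leaving the root directory "/" untouched is an assumption on our part, made so that hits to "/" do not end up as an empty name.

```perl
use strict;
use warnings;

# Hypothetical helper: drop one trailing "/" so that "/products/" and
# "/products" are counted together.  Assumption: the root "/" is left alone.
sub strip_trailing_slash {
    my ($fileRequested) = @_;
    $fileRequested =~ s{/\z}{} if length($fileRequested) > 1;
    return $fileRequested;
}

print strip_trailing_slash('/products/'), "\n";   # /products
print strip_trailing_slash('/products'), "\n";    # /products
print strip_trailing_slash('/'), "\n";            # /
```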
 

Related articles

Everything else in the code has already been discussed in our previous articles on this topic, so we won't repeat those discussions here.  If you're interested in those details, see the earlier parts of this series.


Download the source code

The full logHBF.pl source code is available for download.  It is a complete, operational program.

Due to privacy concerns, we will not supply a sample Apache access_log file in ECLF (Extended Common Log Format) - you'll have to supply your own.
 

Running the program

On a Unix system, you can run our program from the command line to get a count of the number of hits per file in your access_log file.

As a safety precaution, we normally run this program on a copy of our access_log file.
 

Conclusions

If you want to know what people like about your web site, one way to get the answer is to determine what people are looking at most frequently.  The logHBF.pl program we've provided can certainly get you started in the right direction.

If you'd like to see other features of this topic covered in the future, send us a quick e-mail with your idea.