A Perl program to determine RSS readers from an Apache access log file

Perl/RSS FAQ: How many RSS subscribers do I have on my website?

Like many other people with a blog or website, I was curious yesterday about how many RSS readers/subscribers the devdaily website has. You can try to get this information in a variety of ways, but the real information is on your server, in your Apache log files.

To figure out how many RSS subscribers your website has, just go through your Apache log file, find all the records that look like this:

GET /rss.xml

and then weed out the duplicate IP addresses. Of course, you really don't want to do that manually, so I wrote a Perl script to do it for me. I won't walk through the script in detail unless anyone has questions, but here's the source code for my "How many RSS readers do I have?" application:


#!/usr/bin/perl
#
#   PROGRAM: rsscounter.pl
#   PURPOSE: Read an apache log file, and output the unique IP addresses that
#            have hit the "/rss.xml" URI, along with a total count of these
#            addresses.
#   USAGE:
#            rsscounter.pl access_log > results
#            perl rsscounter.pl access_log > results
#            Copyright 2010 by Alvin Alexander, http://devdaily.com
#            This program is released under the terms of the
#            Creative Commons Attribution-NonCommercial 3.0 Unported license.
#            See http://creativecommons.org/licenses/by-nc/3.0/ for more 
#            information.

use File::Basename;

sub usage
{
  print STDERR "\n\tUsage:  rsscounter.pl access_log_file > output_file\n";
}

#===( MAIN )===#

# make sure we got the right number of args
$numArgs = $#ARGV + 1;
if ($numArgs != 1)
{
  # wrong number of args; show the usage statement and quit
  usage();
  exit 1;
}

# open the apache logfile
$logFile = $ARGV[0];
open (LOGFILE,"$logFile") || die "  Error opening log file $logFile.\n";

# declare our array of unique ip addresses
my @ip_addresses;

# process the logfile records
while (<LOGFILE>)
{
  # process the line only if it contains 'GET /rss.xml'
  next unless ($_ =~ /GET \/rss\.xml/);

  # condense one or more whitespace characters to a single space
  s/\s+/ /go;

  # break each apache access_log record into nine variables
  ($clientAddress,    $rfc1413,      $username, 
  $localTime,         $httpRequest,  $statusCode, 
  $bytesSentToClient, $referer,      $clientSoftware) =
  /^(\S+) (\S+) (\S+) \[(.+)\] \"(.+)\" (\S+) (\S+) \"(.*)\" \"(.*)\"/o;

  # skip any record that does not match the expected log format
  next unless defined $clientAddress;

  # determine the value of $fileRequested
  ($getPost, $fileRequested, $junk) = split(' ', $httpRequest, 3);

  # add to the array only if the ip address is new
  if ( grep { $_ eq $clientAddress } @ip_addresses )
  {
    # the array already contains this ip address; skip it this time
    next;
  }
  else
  {
    # the array does not yet contain this ip address; add it
    push @ip_addresses, $clientAddress;
  }
}
close (LOGFILE);

# output: sort the ip addresses and print them, along with a total count
@sorted = sort { "\L$a" cmp "\L$b" } @ip_addresses;

$count = 0;
print "\n";
print "IP Addresses Hitting /rss.xml\n";
print "-----------------------------\n";
foreach (@sorted)
{
  print "$_\n";
  $count++;
}

print "TOTAL: $count unique IP addresses.\n";

# the end

The only thing I'll say about this script at this time is that it prints out a list of all the unique IP addresses that have hit the "/rss.xml" URI, along with the total count of unique addresses. I don't use these IP addresses for anything; I just print them to make sure my program is working okay.
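
As a side note, the script checks for duplicates by scanning the whole array with grep for every log record, which can get slow on a large log file. The following is a minimal sketch of the same "unique IP addresses" idea using a hash instead. It is not the script above: the unique_rss_clients name and the sample log records are made up for illustration, and it assumes the client address is the first field of each record.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the sorted, unique client addresses that requested /rss.xml.
# (unique_rss_clients is a hypothetical helper, not part of rsscounter.pl.)
sub unique_rss_clients {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        # process the line only if it contains 'GET /rss.xml'
        next unless $line =~ m{GET /rss\.xml};
        # the client address is the first whitespace-delimited field
        my ($client) = $line =~ /^(\S+)/;
        next unless defined $client;
        # hash keys are unique, so duplicates collapse automatically
        $seen{$client} = 1;
    }
    return sort keys %seen;
}

# made-up sample records in Apache combined log format
my @sample = (
    '192.0.2.10 - - [10/Oct/2010:13:55:36 -0700] "GET /rss.xml HTTP/1.1" 200 2326 "-" "FeedReader/1.0"',
    '192.0.2.10 - - [10/Oct/2010:14:55:36 -0700] "GET /rss.xml HTTP/1.1" 304 0 "-" "FeedReader/1.0"',
    '198.51.100.7 - - [10/Oct/2010:15:01:02 -0700] "GET /rss.xml HTTP/1.1" 200 2326 "-" "OtherReader/2.0"',
    '203.0.113.5 - - [10/Oct/2010:15:02:00 -0700] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
);

my @clients = unique_rss_clients(@sample);
print "$_\n" for @clients;
print "TOTAL: ", scalar(@clients), " unique IP addresses.\n";
```

With the sample records above, this prints the two addresses that requested /rss.xml and a total of 2. To run it against a real log file, you could read the lines from a filehandle and pass them to the same function.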

If anything, I think this script is conservative, and may understate your number of subscribers. For instance, if several people from the same office location have subscribed to your RSS feed, this script will count them as only one subscriber, assuming their office does the normal thing and uses NAT, so that all of their requests appear to come from the same address.
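
One possible way to soften the NAT problem is to count unique combinations of IP address and user agent rather than IP addresses alone. The sketch below is only an illustration under a few assumptions: the log is in Apache combined format (the user agent is the last quoted field), the count_rss_readers_by_agent name and the sample records are hypothetical, and user agents containing escaped quotes are simply skipped. It is still an estimate: two people behind the same address with identical reader software count as one, and one person using two different readers counts as two.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count distinct (client address, user agent) pairs that requested /rss.xml.
# (count_rss_readers_by_agent is a hypothetical helper for illustration.)
sub count_rss_readers_by_agent {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        next unless $line =~ m{GET /rss\.xml};
        # first field is the client address; last quoted field is the user agent
        my ($client, $agent) = $line =~ /^(\S+) .* "([^"]*)"\s*$/;
        next unless defined $client;
        # a tab cannot appear in the address, so it is a safe separator
        $seen{"$client\t$agent"} = 1;
    }
    return scalar keys %seen;
}

# made-up sample records: three requests share one address (as behind NAT)
my @nat_sample = (
    '192.0.2.10 - - [10/Oct/2010:13:55:36 -0700] "GET /rss.xml HTTP/1.1" 200 2326 "-" "ReaderA/1.0"',
    '192.0.2.10 - - [10/Oct/2010:13:58:00 -0700] "GET /rss.xml HTTP/1.1" 200 2326 "-" "ReaderB/2.0"',
    '192.0.2.10 - - [10/Oct/2010:14:55:36 -0700] "GET /rss.xml HTTP/1.1" 304 0 "-" "ReaderA/1.0"',
    '198.51.100.7 - - [10/Oct/2010:15:01:02 -0700] "GET /rss.xml HTTP/1.1" 200 2326 "-" "ReaderA/1.0"',
);

my $total = count_rss_readers_by_agent(@nat_sample);
print "TOTAL: $total unique address/user-agent pairs.\n";
```

For these sample records, counting by IP address alone would give 2, while counting address and user agent pairs gives 3.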

If you happen to be curious about how many RSS subscribers your website has, feel free to use this script. Or, if you have ideas on how to improve it, please let me know.