Analyzing Apache access logs with Spark and Scala (a tutorial)

I want to analyze some Apache access log files for this website, and since those log files contain hundreds of millions (billions?) of lines, I thought I’d roll up my sleeves and dig into Apache Spark to see how it works, and how well it works. I used Hadoop several years ago, and in short, I found the transition from Hadoop to Spark easy. Here are my notes.
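Before Spark even enters the picture, the heart of any log-analysis job is parsing each line of the access log. As a rough sketch (not the tutorial's actual code), here is one way to parse Apache's standard "combined" log format in plain Scala; the regex and field names are assumptions based on that standard format, not this site's configuration:

```scala
// Minimal sketch: parse one line of Apache "combined" log format.
// The regex and field names are assumptions based on the standard
// combined format, not taken from this site's actual configuration.
object LogParser {
  case class LogRecord(ip: String, dateTime: String, request: String,
                       status: Int, bytes: Long, referer: String, userAgent: String)

  // host ident user [date] "request" status bytes "referer" "user-agent"
  private val LogPattern =
    """(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"""".r

  def parse(line: String): Option[LogRecord] = line match {
    case LogPattern(ip, dt, req, status, bytes, ref, ua) =>
      Some(LogRecord(ip, dt, req, status.toInt,
                     if (bytes == "-") 0L else bytes.toLong,  // "-" means no bytes sent
                     ref, ua))
    case _ => None  // malformed or unexpected line
  }
}
```

With a parser like this in hand, the Spark side of the job largely reduces to reading the log file as a collection of lines and flat-mapping `parse` over it, then grouping and counting whatever fields you care about.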

The last one million visitors

I probably spend about 10 hours a year looking at data related to website visitors, and today was one of those days: I gave it about 15 minutes. Here’s a quick look at the data.

This first image shows what browsers the visitors are using:

I write mostly about open source and Macs, so if IE usage looks a little lower than usual here, that may be why.

This image shows the number of people using desktop, mobile, and tablet clients:
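The desktop/mobile/tablet split comes from classifying each request's User-Agent string. As a very rough sketch of how that classification can work, here is a keyword-based heuristic in Scala; the keywords are assumptions for illustration, and real analytics tools rely on much larger user-agent databases:

```scala
// Rough device-type classification from a User-Agent string.
// The keyword checks below are illustrative assumptions; real analytics
// tools use extensive user-agent databases. Order matters: tablets are
// checked first because some tablet UAs also contain mobile-ish tokens.
def deviceType(userAgent: String): String = {
  val ua = userAgent.toLowerCase
  if (ua.contains("ipad") || ua.contains("tablet")) "tablet"
  else if (ua.contains("mobile") || ua.contains("iphone") || ua.contains("android")) "mobile"
  else "desktop"
}
```

A heuristic this small will misclassify some clients (Android tablets in particular often look like phones), which is one reason the numbers from different analytics tools rarely agree exactly.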

A Perl program to determine RSS readers from an Apache access log file

Perl/RSS FAQ: How many RSS subscribers do I have on my website?

Like many other people with a blog or website, I was curious yesterday about how many RSS readers/subscribers the devdaily website has. You can try to get this number in a variety of ways, but the authoritative data is on your server, in your Apache log files.

To figure out how many RSS subscribers your website has, just go through your Apache log file and find all the records that look like this:
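The original article implements this in Perl; as a sketch of the same idea in Scala, here is one way to count unique subscriber IPs. The feed path `/rss.xml` is an assumption for illustration (substitute your site's actual feed URL), and each log line is assumed to start with the client IP, as in Apache's common and combined formats:

```scala
// Sketch: count distinct client IPs that requested the RSS feed.
// The feed path "/rss.xml" is an assumed placeholder; use your real
// feed URL. Each log line is assumed to begin with the client IP
// (standard Apache common/combined log format).
import scala.io.Source

def countRssSubscribers(logFile: String, feedPath: String = "/rss.xml"): Int = {
  val src = Source.fromFile(logFile)
  try {
    src.getLines()
      .filter(_.contains(s"GET $feedPath"))  // keep only feed requests
      .map(_.takeWhile(_ != ' '))            // first field = client IP
      .toSet                                 // unique IPs ~ unique readers
      .size
  } finally src.close()
}
```

Counting unique IPs is only an approximation: NAT makes it undercount, dynamic IPs make it overcount, and aggregators such as Feedly fetch once on behalf of many readers (though they typically report their subscriber count in the User-Agent string).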

A note on using Google Analytics

For a long time I resisted using a tool like Google Analytics on my websites. I don't know exactly why, other than to say I didn't think I needed the information. But these days it's one of the first things I recommend to website customers. Simply put, if you want to be successful -- and not waste your time -- you need to know what pages visitors look at when they come to your website, and how they got there.