Analyzing Apache access logs with Spark and Scala (a tutorial)

I want to analyze some Apache access log files for this website. Since those files contain hundreds of millions (billions?) of lines, I thought I’d roll up my sleeves and dig into Apache Spark to see how it works, and how well it works. I used Hadoop several years ago, and in short, I found the transition to Spark easy. Here are my notes.