Generating a list of URLs from Apache access log files, sorted by hit count, using Apache Spark (and Scala)
I don’t want to make my original Parsing Apache access log records with Spark and Scala article any longer, so I’m putting some new, better code here.
Assuming you've read that article, I'll jump right in and say that I use code like this to load my data into the Spark REPL:
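The original snippet isn't reproduced here, so what follows is only a sketch of the same pipeline in plain Scala collections; the log format, regex, and function names are my assumptions, not the article's code. In the Spark REPL the equivalent would run over an RDD, e.g. `sc.textFile("accesslog.small").flatMap(extractUri).map((_, 1L)).reduceByKey(_ + _).sortBy(-_._2)`.

```scala
object UrlHitCounts {

  // Matches the request field of a Common Log Format record,
  // e.g. "GET /blog/post-1 HTTP/1.1"
  private val RequestField = """"(GET|POST|HEAD) (\S+) HTTP/[\d.]+"""".r

  /** Extract the requested URI from one access-log line, if present. */
  def extractUri(line: String): Option[String] =
    RequestField.findFirstMatchIn(line).map(_.group(2))

  /** Count hits per URI, sorted descending by hit count. */
  def sortedUrlCounts(lines: Seq[String]): Seq[(String, Long)] =
    lines.flatMap(extractUri)
      .groupBy(identity)
      .map { case (uri, hits) => (uri, hits.size.toLong) }
      .toSeq
      .sortBy(-_._2)
}
```

The plain-Scala version is handy for unit-testing the parsing and counting logic before pointing the same transformations at a multi-gigabyte log file in Spark.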
I want to analyze some Apache access log files for this website, and since those log files contain hundreds of millions (billions?) of lines, I thought I'd roll up my sleeves and dig into Apache Spark to see how it works, and how well it works. I used Hadoop several years ago, and in short, I found the transition to Spark easy. Here are my notes.