My Scala Apache access log parser library

Note: I originally wrote this article many years ago using Apache Spark 0.9.x with Scala 2.x. Hopefully the content below is still useful, but I wanted to warn you up front that it is old.

Introduction

Last week I wrote an Apache access log parser library in Scala to help me analyze my Apache HTTP access log records using Apache Spark. The source code for that project is hosted here on GitHub. You can use this library to parse Apache access log “combined” records from Scala, Java, and other JVM-based programming languages.

This article provides some documentation on how to use my library. (I link to other tutorials at the end of this document that show how to use this library with Apache Spark.)

Basic use

The parseRecord method of the library is intended to work on one Apache access log record at a time. For instance, after you create an AccessLogParser instance like this:

val parser = new AccessLogParser

you can then parse an access log record into an AccessLogRecord instance like this:

val rawRecord = """80.166.165.200 - - [21/Jul/2009:02:48:12 -0700] "GET /foo/bar HTTP/1.1" 404 970 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.11) Firefox/3.0.11""""

// an AccessLogRecord instance
val accessLogRecord = parser.parseRecord(rawRecord)

AccessLogRecord

An AccessLogRecord is defined to look like this:

case class AccessLogRecord (
    clientIpAddress: String,         // should be an ip address, but may also be the hostname if hostname-lookups are enabled
    rfc1413ClientIdentity: String,   // typically `-`
    remoteUser: String,              // typically `-`
    dateTime: String,                // [day/month/year:hour:minute:second zone]
    request: String,                 // `GET /foo ...`
    httpStatusCode: String,          // 200, 404, etc.
    bytesSent: String,               // an int, but may be `-`
    referer: String,                 // where the visitor came from
    userAgent: String                // long string to represent the browser and OS
)

The nine fields defined in that case class correspond to the nine fields of an Apache access log (extended/combined) record.
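
To make the nine-field layout concrete, here is a minimal regex-based sketch of splitting a “combined” record into those fields. It is not the library's actual implementation: the names CombinedFields and CombinedLogSketch are made up for this illustration, and the pattern is deliberately simple (for example, it assumes the quoted fields contain no embedded double quotes).

```scala
// Hypothetical stand-in for illustration; not the library's AccessLogRecord.
final case class CombinedFields(
    clientIpAddress: String,
    rfc1413ClientIdentity: String,
    remoteUser: String,
    dateTime: String,
    request: String,
    httpStatusCode: String,
    bytesSent: String,
    referer: String,
    userAgent: String
)

object CombinedLogSketch {
  // One capture group per field, in the order the fields appear in a record.
  private val Combined =
    """^(\S+) (\S+) (\S+) (\[[^\]]+\]) "([^"]*)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r

  // Returns None when the line does not match the simplified pattern.
  def parse(line: String): Option[CombinedFields] = line match {
    case Combined(ip, ident, user, dt, req, status, bytes, ref, ua) =>
      Some(CombinedFields(ip, ident, user, dt, req, status, bytes, ref, ua))
    case _ => None
  }
}
```

Every captured value stays a String, which mirrors the case class shown above.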

Helper methods

I return the fields as String values so you can parse each record, and then convert the individual fields as desired. For instance, when using this library with Apache Spark to generate a list of URLs from Apache access log files, sorted by hit count, I only needed these fields to be strings.
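
If you do want typed values, one way to convert the string fields yourself is sketched below. The FieldConversions object and its method names are hypothetical helpers written for this example and are not part of the library; they simply wrap the standard toInt and toLong conversions in a Try.

```scala
import scala.util.Try

// Hypothetical helpers for illustration; not part of the library.
object FieldConversions {
  // "404" becomes Some(404); a non-numeric value becomes None.
  def statusCode(httpStatusCode: String): Option[Int] =
    Try(httpStatusCode.trim.toInt).toOption

  // "970" becomes Some(970L); "-" (no bytes sent) or junk becomes None.
  def bytes(bytesSent: String): Option[Long] =
    Try(bytesSent.trim.toLong).toOption
}
```

For example, FieldConversions.bytes(record.bytesSent).getOrElse(0L) would treat a `-` value as zero bytes, assuming that is the behavior you want.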

That being said, if you want to parse the fields, I created a couple of helper methods to get you started. As the tests below show, both return an Option: the static method AccessLogParser.parseRequestField returns an Option[Tuple3[String, String, String]] holding the request type, URI, and HTTP version, and the static method AccessLogParser.parseDateField converts the Apache access log date field into an Option[java.util.Date] (though it ignores the timezone offset that’s at the end of that string).
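
If you need the timezone offset, one option is to parse the date field yourself with java.time (Java 8 or newer). The sketch below is my own assumption of how that could look, not something the library provides; DateFieldSketch and parseWithOffset are hypothetical names.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

// Hypothetical helper for illustration; not part of the library.
object DateFieldSketch {
  // The square brackets are quoted because '[' and ']' are otherwise
  // special characters (optional sections) in DateTimeFormatter patterns.
  private val formatter =
    DateTimeFormatter.ofPattern("'['dd/MMM/yyyy:HH:mm:ss Z']'", Locale.ENGLISH)

  // Returns None when the field is not in the expected format.
  def parseWithOffset(dateField: String): Option[OffsetDateTime] =
    Try(OffsetDateTime.parse(dateField, formatter)).toOption
}
```

The resulting OffsetDateTime keeps the `-0700` offset, and calling toInstant on it gives the corresponding point in time in UTC.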

If you don’t like the fact that the parseRecord method returns an Option[AccessLogRecord], just use the parseRecordReturningNullObjectOnFailure method instead; as its name implies, it returns a null object version of an AccessLogRecord if it’s unable to parse the Apache access log record. I’ve recently improved the parser code and haven’t since run into any lines I couldn’t parse, but it’s still safer to use one of these two methods, because it’s always possible to run across a line I can’t parse. (I could add another method that assumes all lines are parsed successfully and returns an AccessLogRecord.)
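
To show the difference between the two styles, here is a small self-contained sketch. Rec, nullRec, and the toy parse function are hypothetical stand-ins so the example runs on its own; the real AccessLogRecord has nine fields, and I am assuming here that the null object is a record whose fields are empty strings.

```scala
// Hypothetical stand-ins for illustration; not the library's types.
final case class Rec(clientIpAddress: String, request: String)

object OptionHandlingSketch {
  // Assumed shape of a "null object": a record with blank fields.
  val nullRec: Rec = Rec("", "")

  // Toy parser that returns an Option, in the same style as parseRecord.
  def parse(line: String): Option[Rec] = {
    val parts = line.split(" ", 2)
    if (parts.length == 2 && parts(0).nonEmpty) Some(Rec(parts(0), parts(1)))
    else None
  }

  // Style 1: pattern match and handle both cases explicitly.
  def describe(line: String): String = parse(line) match {
    case Some(r) => s"parsed request from ${r.clientIpAddress}"
    case None    => "could not parse line"
  }

  // Style 2: fall back to the null object when parsing fails.
  def parseOrNullObject(line: String): Rec = parse(line).getOrElse(nullRec)

  // Style 3: drop unparsable lines from a collection with flatMap.
  def parseAll(lines: Seq[String]): Seq[Rec] = lines.flatMap(l => parse(l))
}
```

The Option style forces the caller to deal with failures, while the null object style keeps calling code shorter at the cost of having to filter out blank records later.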

Example: Tests of parsing Apache access log records

You can see how to use my library by looking at its test cases. The most up-to-date examples will always be in the test classes that come with the library, so I encourage you to start there.

That being said, here are some examples from March 11, 2014:

// how to use parseRecord and handle the Option it returns
describe("Testing a second access log record ...") {
    records = SampleCombinedAccessLogRecords.data
    val parser = new AccessLogParser
    val rec = parser.parseRecord(records(1))
    it("the result should not be None") {
        assert(rec != None)
    }
    it("the individual fields should be right") {
        rec.foreach { r =>
            assert(r.clientIpAddress == "89.166.165.223")
            assert(r.rfc1413ClientIdentity == "-")
            assert(r.remoteUser == "-")
            assert(r.dateTime == "[21/Jul/2009:02:48:12 -0700]")
            assert(r.request == "GET /favicon.ico HTTP/1.1")
            assert(r.httpStatusCode == "404")
            assert(r.bytesSent == "970")
            assert(r.referer == "-")
            assert(r.userAgent == "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11")
        }
    }
}

// test parseRecordReturningNullObjectOnFailure
describe("Testing the parseRecordReturningNullObjectOnFailure method with a valid record ...") {
    records = SampleCombinedAccessLogRecords.data
    val parser = new AccessLogParser
    val rec = parser.parseRecordReturningNullObjectOnFailure(records(1))
    it("the result should not be null") {
        assert(rec != null)
    }
    it("the individual fields should be right") {
        assert(rec.clientIpAddress == "89.166.165.223")
        assert(rec.rfc1413ClientIdentity == "-")
        assert(rec.remoteUser == "-")
        assert(rec.dateTime == "[21/Jul/2009:02:48:12 -0700]")
        assert(rec.request == "GET /favicon.ico HTTP/1.1")
        assert(rec.httpStatusCode == "404")
        assert(rec.bytesSent == "970")
        assert(rec.referer == "-")
        assert(rec.userAgent == "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11")
    }
}

These tests show how to parse the access log request field and the date field of an AccessLogRecord:

describe("Parsing the request field ...") {
    it("a simple request should work") {
        val req = "GET /the-uri-here HTTP/1.1"
        val result = AccessLogParser.parseRequestField(req)
        assert(result != None)
        result.foreach { res =>
            val (requestType, uri, httpVersion) = res 
            assert(requestType == "GET")
            assert(uri == "/the-uri-here")
            assert(httpVersion == "HTTP/1.1")
        }
    }
    it("an invalid request should return None") {
        val req = "foobar"
        val result = AccessLogParser.parseRequestField(req)
        assert(result == None)
    }
}

describe("Parsing the date field ...") {
    it("a valid date field should work") {
        val date = AccessLogParser.parseDateField("[21/Jul/2009:02:48:13 -0700]")
        assert(date != None)
        date.foreach { d =>
            val cal = Calendar.getInstance
            cal.setTimeInMillis(d.getTime)
            assert(cal.get(Calendar.YEAR) == 2009)
            assert(cal.get(Calendar.MONTH) == 6)  // 0-based
            assert(cal.get(Calendar.DAY_OF_MONTH) == 21)
            assert(cal.get(Calendar.HOUR) == 2)
            assert(cal.get(Calendar.MINUTE) == 48)
            assert(cal.get(Calendar.SECOND) == 13)
        }
    }
    it("an invalid date field should return None") {
        val date = AccessLogParser.parseDateField("[foo bar]")
        assert(date == None)
    }
}

Using the library with Apache Spark

As a real-world example, the following code shows how I used this library in a recent Apache Spark project. First, initialize what I need:

import com.alvinalexander.accesslogparser._
val p = new AccessLogParser
val log = sc.textFile("alvinalexander_com.accesslog")

Next, generate a list of URIs from my sample access log file:

val uris = log.map(p.parseRecordReturningNullObjectOnFailure(_).request)
              .filter(_ != "")
              .map(_.split(" ")(1))
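
Turning a list of URIs like this into hit counts can be sketched with plain Scala collections, so the example runs without Spark. HitCountSketch, uriOf, and hitCounts are hypothetical names for this illustration. The uriOf helper is also a slightly more defensive take on `split(" ")(1)`, because it returns None instead of throwing when a request field has no second token.

```scala
// Hypothetical helpers for illustration; not part of the library.
object HitCountSketch {
  // Extract the URI from a request field such as "GET /foo HTTP/1.1".
  // Returns None for blank or malformed requests instead of throwing.
  def uriOf(request: String): Option[String] =
    request.split(" ") match {
      case Array(_, uri, _*) => Some(uri)
      case _                 => None
    }

  // Count hits per URI and sort by count (highest first), then by URI.
  def hitCounts(requests: Seq[String]): Seq[(String, Int)] =
    requests
      .flatMap(r => uriOf(r))
      .groupBy(identity)
      .map { case (uri, hits) => (uri, hits.size) }
      .toSeq
      .sortBy { case (uri, count) => (-count, uri) }
}
```

On a Spark RDD the counting step is usually written differently, typically by mapping each URI to a `(uri, 1)` pair and combining the pairs with reduceByKey, but the idea is the same.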

I used this code to find the access log records I couldn’t parse properly:

for {
    line <- log
    if p.parseRecord(line) == None
} yield line

I could show more examples, but instead I’ll just refer you to my two current Apache Spark tutorials:

Summary

In summary, if you need a JVM library (Scala, Java, etc.) that you can use to parse Apache access log records, I hope this code and article have been helpful.