Note: I originally wrote this article many years ago using Apache Spark 0.9.x with Scala 2.x. Hopefully the content below is still useful, but I wanted to warn you up front that it is old.
Introduction
Last week I wrote an Apache access log parser library in Scala to help me analyze my Apache HTTP access log file records using Apache Spark. The source code for that project is hosted on GitHub. You can use this library to parse Apache access log “combined” records using Scala, Java, and other JVM-based programming languages.
This article provides some documentation on how to use my library. (I link to other tutorials at the end of this document that show how to use this library with Apache Spark.)
Basic use
The parseRecord method of the library is intended to work on one Apache access log record at a time. For instance, after you create an AccessLogParser instance like this:
val parser = new AccessLogParser
you can then parse an access log record like this (the result is an Option[AccessLogRecord]):
val rawRecord = """80.166.165.200 - - [21/Jul/2009:02:48:12 -0700] "GET /foo/bar HTTP/1.1" 404 970 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.11) Firefox/3.0.11""""

// an Option[AccessLogRecord]
val accessLogRecord = parser.parseRecord(rawRecord)
AccessLogRecord
An AccessLogRecord is defined to look like this:
case class AccessLogRecord (
  clientIpAddress: String,        // should be an ip address, but may also be the hostname if hostname-lookups are enabled
  rfc1413ClientIdentity: String,  // typically `-`
  remoteUser: String,             // typically `-`
  dateTime: String,               // [day/month/year:hour:minute:second zone]
  request: String,                // `GET /foo ...`
  httpStatusCode: String,         // 200, 404, etc.
  bytesSent: String,              // an int, but may be `-`
  referer: String,                // where the visitor came from
  userAgent: String               // long string to represent the browser and OS
)
The nine fields defined in that case class correspond to the nine fields of an Apache access log (extended/combined) record.
Helper methods
I return the fields as String values so you can parse each record, and then convert the individual fields as desired. For instance, when using this library with Apache Spark to generate a list of URLs from Apache access log files, sorted by hit count, I only needed these fields to be strings.
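As an illustration of converting fields after parsing, here is a minimal sketch that turns the httpStatusCode and bytesSent strings into numbers. The FieldConversions object and its method names are my own for this example and are not part of the library; the only thing taken from the library is the fact that both fields arrive as strings and that bytesSent may be `-`.

```scala
import scala.util.Try

// Hypothetical helpers, not part of the library.
object FieldConversions {

  // Convert the status code field (e.g. "404") to an Int.
  // Returns None if the field is not a valid integer.
  def statusCode(field: String): Option[Int] =
    Try(field.trim.toInt).toOption

  // Convert the bytes-sent field to a Long.
  // The field may be "-" instead of a number, which is treated as None here.
  def bytesSent(field: String): Option[Long] =
    field.trim match {
      case "-" => None
      case s   => Try(s.toLong).toOption
    }
}
```

With a parsed record `r`, you might call `FieldConversions.statusCode(r.httpStatusCode)` and `FieldConversions.bytesSent(r.bytesSent)`. Returning an Option keeps a malformed field from throwing an exception in the middle of a large log file.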
That being said, if you want to parse the fields, I created a couple of helper methods to get you started. The static method AccessLogParser.parseRequestField returns an Option[Tuple3[String, String, String]] holding the request type, the URI, and the HTTP version, and the static method AccessLogParser.parseDateField converts the Apache access log date field into an Option[java.util.Date] (though it ignores the timezone offset that’s at the end of that string).
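If you need the timezone offset that parseDateField ignores, one option is to parse the dateTime field yourself. The following is only a sketch using java.time (which requires Java 8 or later); the AccessLogDates object and parseWithOffset method are hypothetical names for this example, not part of the library.

```scala
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

// Hypothetical helper, not part of the library.
object AccessLogDates {

  // Matches text such as "21/Jul/2009:02:48:12 -0700".
  // Locale.ENGLISH is set so "Jul" parses regardless of the machine's default locale.
  private val formatter =
    DateTimeFormatter.ofPattern("dd/MMM/yyyy:HH:mm:ss Z", Locale.ENGLISH)

  // Accepts the field with or without its surrounding square brackets.
  // Returns None if the text does not match the expected format.
  def parseWithOffset(dateField: String): Option[OffsetDateTime] = {
    val text = dateField.trim.stripPrefix("[").stripSuffix("]")
    Try(OffsetDateTime.parse(text, formatter)).toOption
  }
}
```

The resulting OffsetDateTime keeps the `-0700` offset, so calling `toInstant` on it gives the moment in UTC rather than a time interpreted in the local zone.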
If you don’t like the fact that the parseRecord method returns an Option[AccessLogRecord], just use the parseRecordReturningNullObjectOnFailure method instead; as its name implies, it returns a null object version of an AccessLogRecord if it’s unable to parse the Apache access log record. Since I last improved the parser code I haven’t had any lines I couldn’t parse, but it may still be better to use code like this, assuming that it’s possible to run across lines I can’t parse. (I could add another method that assumes all lines are parsed successfully and returns an AccessLogRecord.)
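To make the difference between the two styles concrete, here is a small self-contained sketch. Everything in it is a stand-in: Rec is a cut-down record with two fields, parse is a toy function that only exists to return an Option, and nullRec assumes the null object holds empty strings. None of this is the library's code; it only shows the common ways of consuming an Option-returning parser.

```scala
// Stand-ins for illustration only; not the library's types or parser.
object OptionHandling {

  // A cut-down record with two of the nine fields.
  case class Rec(clientIpAddress: String, request: String)

  // Assumed shape of a null object: every field is an empty string.
  val nullRec: Rec = Rec("", "")

  // Toy parser: succeeds only when the line has a first token followed by more text.
  def parse(line: String): Option[Rec] =
    line.split(" ", 2) match {
      case Array(ip, rest) if ip.nonEmpty => Some(Rec(ip, rest))
      case _                              => None
    }

  // Style 1: pattern match on the Option.
  def describe(line: String): String = parse(line) match {
    case Some(r) => s"client ${r.clientIpAddress}"
    case None    => "unparseable"
  }

  // Style 2: flatMap silently drops lines that fail to parse.
  def parseAll(lines: Seq[String]): Seq[Rec] =
    lines.flatMap(parse(_))

  // Style 3: fall back to a null object, so callers never see an Option.
  def parseOrNullObject(line: String): Rec =
    parse(line).getOrElse(nullRec)
}
```

The Option styles make the failure case visible in the types, while the null object style is shorter at the call site but requires you to filter out the empty records yourself.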
Example: Tests of parsing Apache access log records
You can see how to use my library by looking at its test cases. The most up-to-date examples will always be in the test classes that come with the library, so I encourage you to look at those tests.
That being said, here are some examples from March 11, 2014:
// how to use parseRecord and handle the Option it returns
describe("Testing a second access log record ...") {
  records = SampleCombinedAccessLogRecords.data
  val parser = new AccessLogParser
  val rec = parser.parseRecord(records(1))
  it("the result should not be None") {
    assert(rec != None)
  }
  it("the individual fields should be right") {
    rec.foreach { r =>
      assert(r.clientIpAddress == "89.166.165.223")
      assert(r.rfc1413ClientIdentity == "-")
      assert(r.remoteUser == "-")
      assert(r.dateTime == "[21/Jul/2009:02:48:12 -0700]")
      assert(r.request == "GET /favicon.ico HTTP/1.1")
      assert(r.httpStatusCode == "404")
      assert(r.bytesSent == "970")
      assert(r.referer == "-")
      assert(r.userAgent == "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11")
    }
  }
}

// test parseRecordReturningNullObjectOnFailure
describe("Testing the parseRecordReturningNullObjectOnFailure method with a valid record ...") {
  records = SampleCombinedAccessLogRecords.data
  val parser = new AccessLogParser
  val rec = parser.parseRecordReturningNullObjectOnFailure(records(1))
  it("the result should not be null") {
    assert(rec != null)
  }
  it("the individual fields should be right") {
    assert(rec.clientIpAddress == "89.166.165.223")
    assert(rec.rfc1413ClientIdentity == "-")
    assert(rec.remoteUser == "-")
    assert(rec.dateTime == "[21/Jul/2009:02:48:12 -0700]")
    assert(rec.request == "GET /favicon.ico HTTP/1.1")
    assert(rec.httpStatusCode == "404")
    assert(rec.bytesSent == "970")
    assert(rec.referer == "-")
    assert(rec.userAgent == "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11")
  }
}
These tests show how to parse the access log request field and the date field of an AccessLogRecord:
import java.util.Calendar

describe("Parsing the request field ...") {
  it("a simple request should work") {
    val req = "GET /the-uri-here HTTP/1.1"
    val result = AccessLogParser.parseRequestField(req)
    assert(result != None)
    result.foreach { res =>
      val (requestType, uri, httpVersion) = res
      assert(requestType == "GET")
      assert(uri == "/the-uri-here")
      assert(httpVersion == "HTTP/1.1")
    }
  }
  it("an invalid request should return blanks") {
    val req = "foobar"
    val result = AccessLogParser.parseRequestField(req)
    assert(result == None)
  }
}

describe("Parsing the date field ...") {
  it("a valid date field should work") {
    val date = AccessLogParser.parseDateField("[21/Jul/2009:02:48:13 -0700]")
    assert(date != None)
    date.foreach { d =>
      val cal = Calendar.getInstance
      cal.setTimeInMillis(d.getTime)
      assert(cal.get(Calendar.YEAR) == 2009)
      assert(cal.get(Calendar.MONTH) == 6) // 0-based
      assert(cal.get(Calendar.DAY_OF_MONTH) == 21)
      assert(cal.get(Calendar.HOUR) == 2)
      assert(cal.get(Calendar.MINUTE) == 48)
      assert(cal.get(Calendar.SECOND) == 13)
    }
  }
  it("an invalid date field should return None") {
    val date = AccessLogParser.parseDateField("[foo bar]")
    assert(date == None)
  }
}
Using the library with Apache Spark
As a real-world example, the following code shows how I used this library in a recent Apache Spark project. First, I initialize what I need:
import com.alvinalexander.accesslogparser._
val p = new AccessLogParser
val log = sc.textFile("alvinalexander_com.accesslog")
Next, generate a list of URIs from my sample access log file:
val uris = log.map(p.parseRecordReturningNullObjectOnFailure(_).request)
  .filter(_ != "")
  .map(_.split(" ")(1))
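One thing to watch in the last step: `_.split(" ")(1)` throws an ArrayIndexOutOfBoundsException if a request field contains no space (for example, a request logged as `-`). The sketch below shows a more defensive way to pull out the URI and count hits. It uses plain Scala collections in place of a Spark RDD so it can run on its own, and the UriCounts object and its method names are hypothetical, not part of the library.

```scala
// Hypothetical helpers using plain collections; not part of the library or of Spark.
object UriCounts {

  // Extract the URI from a request field such as "GET /foo/bar HTTP/1.1".
  // Returns None when the field has fewer than two space-separated parts.
  def uriOf(request: String): Option[String] =
    request.split(" ") match {
      case Array(_, uri, _*) if uri.nonEmpty => Some(uri)
      case _                                 => None
    }

  // Count hits per URI, most-requested first; ties are ordered by URI.
  def hitCounts(requests: Seq[String]): Seq[(String, Int)] =
    requests
      .flatMap(uriOf(_))
      .groupBy(identity)
      .map { case (uri, hits) => (uri, hits.size) }
      .toSeq
      .sortBy { case (uri, count) => (-count, uri) }
}
```

For instance, `UriCounts.hitCounts(Seq("GET /a HTTP/1.1", "GET /b HTTP/1.1", "GET /a HTTP/1.1", "-"))` yields `/a` with 2 hits followed by `/b` with 1, and the `-` entry is skipped rather than causing an exception.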
I used this code to find the access log records I couldn’t parse properly:
for {
  line <- log
  if p.parseRecord(line) == None
} yield line
I could show more examples, but instead I’ll just refer you to my two current Apache Spark tutorials:
- Generating a list of URLs from Apache access log files, sorted by hit count
- Analyzing Apache access logs with Spark and Scala
Summary
In summary, if you need a JVM library (Scala, Java, etc.) that you can use to parse Apache access log records, I hope this code and article have been helpful.