Scala file reading performance: Line counting algorithms

Out of curiosity about Scala’s file-reading performance, I decided to write a “line count” program in Scala. One obvious approach was to count the newline characters in the file:

// took 101 secs (10M lines)
// work on one character at a time
def countLines1(source: Source): Long = {
    var newlineCount = 0L
    for {
        c <- source
        if c.toByte == NEWLINE
    } newlineCount += 1
    newlineCount
}

As the comment shows, this took 101 seconds to read a file that has 10M lines. (An Apache access log file for this website.)

I ran the Unix wc -l command on the same file, and it ran in about 22 seconds, so I thought I better see if I could do better.

Using getLines

Another obvious approach is to use the getLines method from the Source object my method receives (a BufferedSource). Would it run much faster? Oh yeah. It took just 22 seconds:

// took 22 secs (10M lines)
// use getLines to count the number of lines
def countLines2(source: Source): Long = {
    var newlineCount = 0L
    for (line <- source.getLines) {
        newlineCount += 1
    }
    newlineCount
}

Taking advantage of the BufferedSource by calling getLines makes a huge difference in performance. (Quick update: Here's a link to the current Scala BufferedSource code. It uses a Java PushbackReader and BufferedReader.)

One cool thing about this result: It's just as fast as the Unix word count program (wc -l).

(I originally thought this algorithm was a little faster than the wc program, but then I remembered that I start the timer inside the Scala code -- after the initial Java startup time -- while I timed the wc command with the time command.)

Counting all chars

In case you think the performance difference is the result of looking at each character in the file, this third approach combines the two techniques. I use getLines to get each line, and the loop over each character in each line. It only takes 23 seconds:

// 23 seconds
// use getLines, then count newline characters found (redundant, i know)
def countLines3(source: io.BufferedSource): Long = {
    var newlineCount = 0L
    for (line <- source.getLines) {
        for {
            c <- line
            if c.toByte == NEWLINE
        } newlineCount += 1
    }
    newlineCount
}

If you’re comfortable with Scala, you may know that I can rewrite that for loop. I share the rewritten loop in my upcoming book, the Scala Cookbook.