Scala file reading performance - Line counting algorithms

Out of curiosity about Scala’s file-reading performance, I decided to write a “line count” program in Scala. One obvious approach was to count the newline characters in the file:

// took 101 secs (10M lines)
// work on one character at a time
def countLines1(source: Source): Long = {
    var newlineCount = 0L
    for {
        c <- source
        if c.toByte == NEWLINE
    } newlineCount += 1
    newlineCount
}

As the comment shows, this took 101 seconds to read a file that has 10M lines. (An Apache access log file for this website.)

I ran the Unix wc -l command on the same file, and it ran in about 22 seconds, so I thought I better see if I could do better.

Using getLines

Another obvious approach is to use the getLines method from the Source object my method receives (a BufferedSource). Would it run much faster? Oh yeah. It took just 22 seconds:

// took 22 secs (10M lines)
// use getLines to count the number of lines
def countLines2(source: Source): Long = {
    var newlineCount = 0L
    for (line <- source.getLines) {
        newlineCount += 1
    }
    newlineCount
}

I haven’t taken the time to look into why there’s such a big difference between these two approaches, but I thought I’d share these results here before I have to move on to something else. I assume that this second approach takes advantage of buffering, and the first approach doesn’t, but again, I haven’t looked into it yet.

(Quick update: Here's a link to the current Scala BufferedSource code. It uses a Java PushbackReader and BufferedReader.)

One cool thing about this result: It's just as fast as the Unix word count program (wc -l).

(I originally thought this algorithm was a little faster than the wc program, but then I remembered that I start the timer inside the Scala code -- after the initial Java startup time -- while I timed the wc command with the time command.)

Counting all chars

In case you think the performance difference is the result of looking at each character in the file, this third approach combines the two techniques. I use getLines to get each line, and the loop over each character in each line. It only takes 23 seconds:

// 23 seconds
// use getLines, then count newline characters found (redundant, i know)
def countLines3(source: io.BufferedSource): Long = {
    var newlineCount = 0L
    for (line <- source.getLines) {
        for {
            c <- line
            if c.toByte == NEWLINE
        } newlineCount += 1
    }
    newlineCount
}

If you’re comfortable with Scala, you may know that I can rewrite that for loop. I share the rewritten loop in my upcoming book, the Scala Cookbook.

Add new comment

The content of this field is kept private and will not be shown publicly.

Anonymous format

  • Allowed HTML tags: <em> <strong> <cite> <code> <ul type> <ol start type> <li> <pre>
  • Lines and paragraphs break automatically.
By submitting this form, you accept the Mollom privacy policy.