Out of curiosity about Scala’s file-reading performance, I decided to write a “line count” program in Scala. One obvious approach was to count the newline characters in the file:
// took 101 secs (10M lines) // work on one character at a time def countLines1(source: Source): Long = { var newlineCount = 0L for { c <- source if c.toByte == NEWLINE } newlineCount += 1 newlineCount }
As the comment shows, this took 101 seconds to read a file that has 10M lines. (An Apache access log file for this website.)
I ran the Unix wc -l
command on the same file, and it ran in about 22 seconds, so I thought I better see if I could do better.
Using getLines
Another obvious approach is to use the getLines
method from the Source
object my method receives (a BufferedSource
). Would it run much faster? Oh yeah. It took just 22 seconds:
// took 22 secs (10M lines) // use getLines to count the number of lines def countLines2(source: Source): Long = { var newlineCount = 0L for (line <- source.getLines) { newlineCount += 1 } newlineCount }
Taking advantage of the BufferedSource
by calling getLines
makes a huge difference in performance. (Quick update: Here's a link to the current Scala BufferedSource code. It uses a Java PushbackReader and BufferedReader.)
One cool thing about this result: It's just as fast as the Unix word count program (wc -l
).
(I originally thought this algorithm was a little faster than the wc
program, but then I remembered that I start the timer inside the Scala code -- after the initial Java startup time -- while I timed the wc
command with the time
command.)
Counting all chars
In case you think the performance difference is the result of looking at each character in the file, this third approach combines the two techniques. I use getLines
to get each line, and the loop over each character in each line. It only takes 23 seconds:
// 23 seconds // use getLines, then count newline characters found (redundant, i know) def countLines3(source: io.BufferedSource): Long = { var newlineCount = 0L for (line <- source.getLines) { for { c <- line if c.toByte == NEWLINE } newlineCount += 1 } newlineCount }
If you’re comfortable with Scala, you may know that I can rewrite that for
loop. I share the rewritten loop in my upcoming book, the Scala Cookbook.