How to process every character in a text file in Scala

This is an excerpt from the Scala Cookbook (partially modified for the internet): Recipe 12.4, “How to process every character in a text file in Scala.”

Problem

You want to open a text file in Scala and process every character in the file.

Solution

If performance isn’t a concern, write your code in a straightforward, obvious way:

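val source = io.Source.fromFile("/Users/Al/.bash_profile")
for (char <- source) {
    // Source is an Iterator[Char], so this visits one character at a time
    println(char.toUpper)
}

// release the underlying file handle when done
source.close()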
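If you are on Scala 2.13 or later, one way to make sure the file is closed even when processing throws is scala.util.Using. The following is a minimal sketch under that assumption; the names upperCased and upperCasedFile are hypothetical helpers, not part of the recipe:

```scala
import scala.io.Source
import scala.util.Using

// Hypothetical helper: upper-cases every character read from `source`.
def upperCased(source: Source): String =
    source.map(_.toUpper).mkString

// Using.resource closes the Source when the body finishes,
// whether it returns normally or throws (Scala 2.13+).
def upperCasedFile(path: String): String =
    Using.resource(Source.fromFile(path))(upperCased)
```

Because upperCased accepts any Source, it can also be tried without a file by passing Source.fromString("some text").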

However, be aware that this code may be slow on large files. For instance, the following method, which counts the lines in a file by counting its newline characters, took 100 seconds on my computer to process an Apache access log file that is ten million lines long:

// run time: took 100 secs
def countLines1(source: io.Source): Long = {
    val NEWLINE = 10  // character code of '\n' (line feed)
    var newlineCount = 0L
    for {
        char <- source
        if char.toByte == NEWLINE
    } newlineCount += 1
    newlineCount
}
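As a variation on countLines1, the sketch below compares each Char with '\n' directly instead of going through toByte. Char.toByte keeps only the low 8 bits of a 16-bit Char, so a character such as U+010A would also compare equal to 10. The name countNewlines is hypothetical, and no timing is claimed for this version:

```scala
import scala.io.Source

// Hypothetical variant of countLines1: counts line-feed characters
// by comparing Chars directly, without narrowing them to Bytes.
def countNewlines(source: Source): Long =
    source.foldLeft(0L)((count, c) => if (c == '\n') count + 1 else count)
```

Like countLines1, this counts newline characters, so a final line with no trailing newline is not counted.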

The time can be significantly reduced by using the getLines method to retrieve one line at a time, and then working through the characters in each line. The following algorithm works through the same ten million lines in just 23 seconds on the same computer:

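// run time: 23 seconds
// use getLines, then work through the characters in each line
// (redundant for this purpose, I know)
// caution: getLines strips the line terminators, so on current Scala
// versions the newline test below never matches; to count lines,
// count the lines that getLines returns
def countLines2(source: io.Source): Long = {
    val NEWLINE = 10
    var newlineCount = 0L
    for {
        line <- source.getLines()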
        c <- line
        if c.toByte == NEWLINE
    } newlineCount += 1
    newlineCount
}

Both algorithms visit the characters in the file one at a time (getLines leaves out only the line terminators), but using getLines in the second algorithm cuts the run time from 100 seconds to 23.
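Because getLines returns each line without its terminator, a line count is more reliably taken from the lines themselves than from newline characters inside them. The following sketch keeps the per-character loop but counts lines in the outer loop; countLinesAndChars is a hypothetical name, and the sketch has not been timed against the figures above:

```scala
import scala.io.Source

// Hypothetical helper: returns (number of lines, number of characters
// visited). Line terminators are not counted, since getLines strips them.
def countLinesAndChars(source: Source): (Long, Long) = {
    var lineCount = 0L
    var charCount = 0L
    for (line <- source.getLines()) {
        lineCount += 1
        // per-character work would go here
        for (_ <- line) charCount += 1
    }
    (lineCount, charCount)
}
```

Unlike a newline count, this also counts a final line that has no trailing newline.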

Notice that the second example contains the equivalent of two nested for loops. If you haven’t seen this approach before, here’s what the code looks like with two explicit for loops:

for (line <- source.getLines()) {
    for {
        c <- line
        if c.toByte == NEWLINE
    } newlineCount += 1
}

The two approaches are equivalent, but the single for expression with two generators is more concise.
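To see the equivalence for yourself, you can run both forms on the same input. The sketch below counts occurrences of an arbitrary character rather than newlines so that the result is non-zero; countCharFlat and countCharNested are hypothetical names:

```scala
import scala.io.Source

// One for expression with two generators and a guard.
def countCharFlat(source: Source, target: Char): Long = {
    var count = 0L
    for {
        line <- source.getLines()
        c <- line
        if c == target
    } count += 1
    count
}

// The same logic written as two explicit, nested for loops.
def countCharNested(source: Source, target: Char): Long = {
    var count = 0L
    for (line <- source.getLines()) {
        for {
            c <- line
            if c == target
        } count += 1
    }
    count
}
```

A Source can be traversed only once, so each call needs its own instance, for example a fresh Source.fromString("banana\nbandana\n") per function.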