Five good ways (and two bad ways) to read large text files with Scala

Summary: I’m working on a little project to parse relatively large (but not huge) Apache access log files, with the file this week weighing in at 9.2 GB and 33,444,922 lines. So I gave myself 90 minutes to goof around and try a few different ways to write a simple “line count” program in Scala. This isn’t my final goal, but it seemed like something I could do to measure file-reading speed before applying my algorithm.

Performance summary

I know there are a lot of problems with testing the performance of JVM algorithms, but this is a simple application that needs to (a) start, (b) read each line in a file, then (c) stop, so I don’t think issues like “warming up the JVM” apply here. I was also looking forward to seeing how much faster a GraalVM native image would be than running the same code on the standard JVM.

I ran each algorithm three times, and the average runtimes to read the file came in like this:

1. Scala BufferedSource               26.6 seconds
2. Java BufferedReader                21.1
3. Java NIO Files.lines               19.9
4. Java NIO Files.newBufferedReader   20.3
5. Apache Commons-IO FileUtils        22.3
6. Scanner                            <did not complete>
7. GraalVM native-image               56.0
8. wc -l                              15.8

From a performance standpoint, the 19.9-second result equates to reading the text file at a clip of 0.46 GB/second and 1,680,649.3 lines/second.
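
As a quick sanity check, those throughput figures fall straight out of the file size, line count, and runtime given above. Here is a minimal sketch of the arithmetic (the 9.2 GB, 33,444,922-line, and 19.9-second figures are the ones reported earlier; the object name is only for illustration):

object ThroughputCheck extends App {

    // figures reported earlier in this article
    val fileSizeGb = 9.2
    val lineCount  = 33444922L
    val seconds    = 19.9   // the fastest JVM result (Files.lines)

    // prints 0.46 GB/second
    println(f"${fileSizeGb / seconds}%.2f GB/second")

    // prints 1680649.3 lines/second
    println(f"${lineCount / seconds}%.1f lines/second")

}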

Some of those numbers probably aren’t statistically significant because there was some variation in each run, but I’ll be giving the “Java NIO Files.lines” approach a try. As you can see below, its file-reading algorithm looks like this:

val stream: Stream[String] = Files.lines(Paths.get(filename))
val numLines = stream.count

The biggest surprise was that the native image I created with GraalVM ran so slowly. Last week I saw how much faster GraalVM made a file-finding algorithm, but this week I found that it made a file-reading algorithm much slower. (I tried converting the first four approaches into a native executable before I gave up on it.)

Also, I only tried the Scanner approach because I saw that someone mentioned it as a way to read files. I’ve only used it to read from the command line, and reading from a file in the manner shown below was so slow I eventually killed it; I have no idea how long it would have taken to complete.
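
For reference, the way I’ve used Scanner before is to read from standard input, along the lines of this little sketch (not something I benchmarked):

import java.util.Scanner

// read one line from STDIN (the command line), the way I usually use Scanner
val scanner = new Scanner(System.in)
val line = scanner.nextLine
println(s"you typed: $line")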

As a final note, I reported the GraalVM performance as a bug, so there may be some way to make it run faster that I’m not currently aware of.

If you’re interested, the rest of this page shows the source code for the different approaches I used.

1. Scala BufferedSource

This line-counting approach uses scala.io.BufferedSource:

import scala.io.{Source, BufferedSource}

object BufferedSource1 extends App {

    val filename = "access_log.big"
    val source: BufferedSource = io.Source.fromFile(filename)

    // count the lines without storing them; getLines returns an Iterator[String]
    var newlineCount = 0L
    for (line <- source.getLines) {
        newlineCount += 1
    }
    source.close
    
    println(newlineCount)

}

In all of these examples I created a JAR file with the SBT package command and then ran that JAR file with a simple java command, with no special command-line options. (The java command I used is shown in the “General testing procedure” section near the end of this article.)

2. Java BufferedReader

This line-counting approach uses java.io.BufferedReader:

import java.io._

object BufferedReader2 extends App {

    val filename = "access_log.big"

    val br = new BufferedReader(new FileReader(filename))
    var newlineCount = 0L
    var line = ""
    // readLine returns null when the end of the file is reached
    while ({line = br.readLine; line != null}) {
        newlineCount += 1
    }
    br.close
    println(newlineCount)

}

3. Java NIO Files.lines

This line-counting approach uses the lines method of java.nio.file.Files:

import java.nio.file.{Files, Paths}
import java.util.stream.Stream

object JavaNioFiles extends App {

    val filename = "access_log.big"

    // Files.lines returns a lazily-populated java.util.stream.Stream[String]
    val stream: Stream[String] = Files.lines(Paths.get(filename))
    val numLines = stream.count
    stream.close
    println(numLines)

}

4. Java NIO Files.newBufferedReader

This line-counting approach uses java.nio.file.Files.newBufferedReader and the lines method on the BufferedReader it returns:

import java.io.BufferedReader
import java.nio.file.{Files,Paths}

object Java8Files2 extends App {

    val filename = "access_log.big"

    val br = Files.newBufferedReader(Paths.get(filename))
    val numLines = br.lines.count   // lines() streams the file contents lazily
    br.close
    println(numLines)

}

5. Apache Commons-IO FileUtils

This line-counting approach uses FileUtils.lineIterator from the Apache Commons-IO project:

import org.apache.commons.io._
import java.io.File

object CommonsIo extends App {

    val filename = "access_log.big"

    var numLines = 0L
    val it = FileUtils.lineIterator(new File(filename), "UTF-8")
    while (it.hasNext()) {
        numLines += 1
        val junk = it.nextLine
    }
    it.close()   // LineIterator keeps the underlying Reader open until it is closed
    println(numLines)

}

6. Scanner

This line-counting approach uses java.util.Scanner:

import java.io.File
import java.util.Scanner

object ScannerTest extends App {

    val filename = "access_log.big"

    var numLines = 0L
    val scanner = new Scanner(new File(filename))
    while (scanner.hasNext()) {
        numLines += 1
        val junk = scanner.nextLine
    }
    scanner.close()
    println(numLines)

}

7. GraalVM

To test GraalVM, I first created a JAR file named FIND.jar from the first four algorithms, then used commands like these to create a native image:

native-image --no-server -cp $SCALA_HOME/lib/scala-library.jar -jar FIND.jar

native-image -cp $SCALA_HOME/lib/scala-library.jar:FIND.jar JavaNioFiles$delayedInit$body

The resulting native images consistently ran in about 56 seconds.

Note to self: I first set up my Graal environment like this:

export JAVA_HOME=/Users/al/bin/graalvm-ce-19.1.1/Contents/Home
export PATH=/Users/al/bin/graalvm-ce-19.1.1/Contents/Home/bin:$PATH

8. wc -l

The macOS wc -l command ran in about 15.8 seconds.

It’s worth mentioning again that my final purpose wasn’t to create a line-count algorithm. I just wanted to see which file-reading approach might be fastest before I started applying my Apache access log parsing code to this project’s needs.

General testing procedure

As noted above, in all of the Java examples I created a JAR file with the SBT package command and then ran that JAR file with a simple java command, with no special command-line options. I wrapped that command in a time command, like this:

time java -cp ".:FIND.jar:$SCALA_HOME/lib/scala-library.jar" BufferedReader2$delayedInit$body

I then created a Graal native image from each JAR file and then ran it with the time command as well:

time ./find

I only gave myself 90 minutes for these tests, but I tried to do different things on the filesystem in between each test run. I didn’t know whether any form of caching could come into play here, so I tried to keep the SSD busy between tests.

That’s all

I shared the summary of my findings at the beginning of this article, so there isn’t much else to say, other than that I’ll try to work with the GraalVM people to see why Graal is so slow in this use case.

One more thing that seems worth mentioning is that I didn’t try to use Apache Spark on this project, mostly because it seemed like overkill for my needs.