Summary: I’m working on a little project to parse relatively large (but not huge) Apache access log files, with the file this week weighing in at 9.2 GB and 33,444,922 lines. So I gave myself 90 minutes to goof around and try a few different ways to write a simple “line count” program in Scala. This isn’t my final goal, but it seemed like something I could do to measure file-reading speed before applying my algorithm.
Performance summary
I know there are a lot of problems with testing the performance of JVM algorithms, but this is a simple application that needs to (a) start, (b) read each line in a file, then (c) stop. So I don’t think issues like “warming up the JVM” apply here. I was also looking forward to seeing how much faster GraalVM was than the java
JVM approach.
I ran each algorithm three times, and the average runtimes of the algorithms to parse the file came in like this:
1. Scala BufferedSource 26.6 seconds
2. Java BufferedReader 21.1
3. Java NIO Files.lines 19.9
4. Java NIO Files.newBufferedReader 20.3
5. Apache CommonsIo FileUtils 22.3
6. Scanner <did not complete>
7. GraalVM native-image 56.0
8. wc -l 15.8
From a performance standpoint, the 19.9 seconds number equates to reading a text file at a clip of 0.46 GB/second and 1,680,649.3 lines/second.
Some of those numbers probably aren’t statistically significant because there was some variation in each run, but I’ll be giving the “Java NIO Files.lines” approach a try. As you can see below, it’s file-reading algorithm looks like this:
val stream: Stream[String] = Files.lines(Paths.get(filename))
val numLines = stream.count
The biggest surprise was that creating a native image with GraalVM was so slow. Last week I saw how much faster GraalVM made a file-finding algorithm, but this week I found that it made a file-reading algorithm much slower. (I tried converting the first four approaches into a native executable before I gave up on it.)
Also, I only tried the Scanner approach because I saw that someone mentioned it as a way to read files. I’ve only used it to read from the command line, and reading from a file in the manner shown below was so slow I eventually killed it; I have no idea how long it would have taken to complete.
As a final note, I reported the GraalVM performance as a bug, so there may be some way to make it run faster that I don’t currently know.
If you’re interested, the rest of this page shows the source code for the different approaches I used.
1. Scala BufferedSource
This line-counting approach uses scala.io.BufferedSource:
import java.io._
import scala.io.{Source, BufferedSource}
object BufferedSource1 extends App {
val filename = "access_log.big"
val source: BufferedSource = io.Source.fromFile(filename)
var newlineCount = 0L
for (line <- source.getLines) {
newlineCount += 1
}
source.close
println(newlineCount)
}
In all of these examples I created a JAR file with the SBT package
command and then ran that JAR file with a simple java
command, with no special command-line options.
2. Java BufferedReader
This line-counting approach uses java.io.BufferedReader:
import java.io._
object BufferedReader2 extends App {
val filename = "access_log.big"
val br = new BufferedReader(new FileReader(filename))
var newlineCount = 0L
var line = ""
while ({line = br.readLine; line != null}) {
newlineCount += 1
}
br.close
println(newlineCount)
}
3. Java NIO Files.lines
This line-counting approach uses the lines method of java.nio.file.Files:
import java.nio.file.{Files, Paths}
import java.util.stream.Stream
object JavaNioFiles extends App {
val filename = "access_log.big"
val stream: Stream[String] = Files.lines(Paths.get(filename))
val numLines = stream.count
stream.close
}
4. Java NIO Files.newBufferedReader
This line-counting approach uses the lines method of java.nio.file.Files.newBufferedReader:
import java.io.BufferedReader
import java.nio.file.{Files,Paths}
object Java8Files2 extends App {
val filename = "access_log.big"
val br = Files.newBufferedReader(Paths.get(filename))
val numLines = br.lines.count
br.close
}
5. Apache Commons-IO FileUtils
This line-counting approach uses FileUtils.lineIterator from the Apache Commons-IO project:
import org.apache.commons.io._
import java.io.File
object CommonsIo extends App {
val filename = "access_log.big"
var numLines = 0L
val it = FileUtils.lineIterator(new File(filename), "UTF-8")
while (it.hasNext()) {
numLines += 1
val junk = it.nextLine
}
}
6. Scanner
This line-counting approach uses java.util.Scanner:
import java.io.File
import java.util.Scanner
object ScannerTest extends App {
val filename = "access_log.big"
var numLines = 0L
val scanner = new Scanner(new File(filename))
while (scanner.hasNext()) {
numLines += 1
val junk = scanner.nextLine
}
}
7. GraalVM
To test GraalVM, I first created a JAR file named FIND.jar from the first four algorithms, then used commands like these to create a native image:
native-image --no-server -cp $SCALA_HOME/lib/scala-library.jar -jar FIND.jar
native-image -cp $SCALA_HOME/lib/scala-library.jar:FIND.jar JavaNioFiles$delayedInit$body
They consistently ran at about 56 seconds.
Note to self: I first set up my Graal environment like this:
export JAVA_HOME=/Users/al/bin/graalvm-ce-19.1.1/Contents/Home
export PATH=/Users/al/bin/graalvm-ce-19.1.1/Contents/Home/bin:$PATH
8. wc -l
The MacOS wc -l
command ran in about 15.8 seconds.
It’s worth mentioning again that my final purpose wasn’t to create a line-count algorithm. I just wanted to see which file-reading approach might be fastest before I started applying my Apache access log parsing for this project’s needs.
this post is sponsored by my books: | |||
#1 New Release |
FP Best Seller |
Learn Scala 3 |
Learn FP Fast |
General testing procedure
As noted above, in all of the Java examples I created a JAR file with the SBT package
command and then ran that JAR file with a simple java
command, with no special command-line options. I wrapped that command in a time
command, like this:
time java -cp ".:FIND.jar:$SCALA_HOME/lib/scala-library.jar" BufferedReader1$delayedInit$body
I then created a Graal native image from each JAR file and then ran it with the time
command as well:
time ./find
I only gave myself 90 minutes for these tests, but I tried to do different things on the filesystem in between each test run. I didn’t know if any form of caching could come in to play here, so I tried to keep the SSD busy in between tests.
That’s all
I shared the summary of my findings at the beginning of this article, so there isn’t much else to say, other than I’ll try to work with the GraalVM people to see why Graal is so slow in this use-case.
One more thing that seems worth mentioning is that I didn’t try to use Apache Spark on this project, mostly because it seemed like overkill for my needs.