How to process a CSV file in Scala

This is an excerpt from the Scala Cookbook (Recipe 12.5, “How to process a CSV file in Scala”).

Problem

You want to process the lines in a CSV file in Scala, either handling one line at a time or storing them in a two-dimensional array.

Solution

Combine Recipe 12.1, “How to Open and Read a Text File in Scala” with Recipe 1.3, “How to Split Strings in Scala”. Given a simple CSV file like this named finance.csv:

January, 10000.00, 9000.00, 1000.00
February, 11000.00, 9500.00, 1500.00
March, 12000.00, 10000.00, 2000.00

you can process the lines in the file with the following code:

object CSVDemo extends App {
    println("Month, Income, Expenses, Profit")
    val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
    for (line <- bufferedSource.getLines) {
        val cols = line.split(",").map(_.trim)
        // do whatever you want with the columns here
        println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
    }
    bufferedSource.close
}

The magic in that code is this line:

val cols = line.split(",").map(_.trim)

It splits each line on the comma field separator, then uses the map method to call trim on each field, removing any leading and trailing whitespace. The resulting output looks like this:

January|10000.00|9000.00|1000.00
February|11000.00|9500.00|1500.00
March|12000.00|10000.00|2000.00
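To see what that split-and-trim line produces on its own, you can run it against a single string (a small standalone sketch, not part of the recipe):

```scala
object SplitDemo extends App {
    val line = "January, 10000.00, 9000.00, 1000.00"
    // split on commas, then trim the whitespace around each field
    val cols = line.split(",").map(_.trim)
    println(cols.mkString("|"))   // prints January|10000.00|9000.00|1000.00
}
```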

If you prefer named variables instead of accessing array elements, you can change the for loop to look like this:

for (line <- bufferedSource.getLines) {
    val Array(month, revenue, expenses, profit) = line.split(",").map(_.trim)
    println(s"$month $revenue $expenses $profit")
}
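One caveat: the Array(month, revenue, expenses, profit) extractor throws a scala.MatchError if a line doesn’t contain exactly four fields. If your data may be irregular, a match expression with a catch-all case keeps the loop from blowing up (a sketch using hard-coded sample lines rather than the file):

```scala
object SafeSplitDemo extends App {
    val lines = List(
        "January, 10000.00, 9000.00, 1000.00",
        "this line is malformed"
    )
    for (line <- lines) {
        line.split(",").map(_.trim) match {
            case Array(month, revenue, expenses, profit) =>
                println(s"$month $revenue $expenses $profit")
            case _ =>
                // the line didn't have exactly four fields
                Console.err.println(s"skipping malformed line: $line")
        }
    }
}
```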

If the first line of the file is a header line and you want to skip it, just add drop(1) after getLines:

for (line <- bufferedSource.getLines.drop(1)) { // ...

If you prefer, you can also write the loop as a foreach loop:

bufferedSource.getLines.foreach { line =>
    val cols = line.split(",").map(_.trim)
    println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
}

If you’d like to assign the results to a two-dimensional array, there are a variety of ways to do this. One approach is to create a 2D array, and then use a counter while assigning each line to a row. To do this, you need to know the number of rows in the file before creating the array:

object CSVDemo2 extends App {
    val nrows = 3
    val ncols = 4
    val rows = Array.ofDim[String](nrows, ncols)
    val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
    var count = 0
    for (line <- bufferedSource.getLines) {
        rows(count) = line.split(",").map(_.trim)
        count += 1
    }
    bufferedSource.close

    // print the rows
    for (i <- 0 until nrows) {
        println(s"${rows(i)(0)} ${rows(i)(1)} ${rows(i)(2)} ${rows(i)(3)}")
    }
}
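If you’d rather not hard-code nrows, one option (a sketch; it costs an extra pass over the file) is to count the lines first:

```scala
// read the file once just to count its lines
val countSource = io.Source.fromFile("/tmp/finance.csv")
val nrows = countSource.getLines.size   // size consumes the iterator
countSource.close
```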

Rather than use a counter, you can do the same thing with the zipWithIndex method. This changes the loop to:

val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
for ((line, count) <- bufferedSource.getLines.zipWithIndex) {
    rows(count) = line.split(",").map(_.trim)
}

bufferedSource.close

If you don’t know the number of rows ahead of time, read each row as an Array[String], adding each row to an ArrayBuffer as the file is read. That approach is shown in this example, which wraps the file handling in a small using helper method (defined at the bottom of the example) so the file is closed automatically:

import scala.collection.mutable.ArrayBuffer

object CSVDemo3 extends App {

    // each row is an array of strings (the columns in the csv file)
    val rows = ArrayBuffer[Array[String]]()

    // (1) read the csv data
    using(io.Source.fromFile("/tmp/finance.csv")) { source =>
        for (line <- source.getLines) {
            rows += line.split(",").map(_.trim)
        }
    }

    // (2) print the results
    for (row <- rows) {
        println(s"${row(0)}|${row(1)}|${row(2)}|${row(3)}")
    }

    def using[A <: { def close(): Unit }, B](resource: A)(f: A => B): B =
        try {
            f(resource)
        } finally {
            resource.close()
        }
}

An Array[String] is used for each row because that’s what the split method returns. You can convert this to a different collection type, if desired.
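As an aside, if you’re on Scala 2.13 or newer, the standard library’s scala.util.Using gives you the same automatic-close behavior as the using helper above, so you don’t have to write it yourself (a sketch; it assumes the same /tmp/finance.csv file):

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Using

object CSVDemo4 extends App {
    val rows = ArrayBuffer[Array[String]]()
    // Using.resource closes the source even if the body throws
    Using.resource(io.Source.fromFile("/tmp/finance.csv")) { source =>
        for (line <- source.getLines) {
            rows += line.split(",").map(_.trim)
        }
    }
    for (row <- rows) println(row.mkString("|"))
}
```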

Discussion

As you can see, there are a number of ways to tackle this problem. Of all the examples shown, the zipWithIndex method probably requires some explanation. The Iterator Scaladoc states that it “creates an iterator that pairs each element produced by this iterator with its index, counting from 0.”

So the first time through the loop, line is assigned the first line from the file, and count is 0. The next time through the loop, the second line of the file is assigned to line, and count is 1, and so on. The zipWithIndex method offers a nice solution for when you need a line counter.
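You can see the same pairing on any iterator, without a file involved (a small standalone sketch):

```scala
object ZipWithIndexDemo extends App {
    val lines = List("January ...", "February ...", "March ...")
    // zipWithIndex pairs each element with its position, starting at 0
    for ((line, count) <- lines.iterator.zipWithIndex) {
        println(s"$count: $line")
    }
}
```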

In addition to these approaches, a quick search for “scala csv parser” will turn up a number of competing open source projects that you can use.

See Also

The Scala Cookbook

This tutorial is sponsored by the Scala Cookbook, which I wrote for O’Reilly.
