This is an excerpt from the 1st Edition of the Scala Cookbook. This is Recipe 12.5, “How to process a CSV file in Scala.”
Problem
You want to process the lines in a CSV file in Scala, either handling one line at a time or storing them in a two-dimensional array.
Solution
Combine Recipe 12.1, “How to Open and Read a Text File in Scala” with Recipe 1.3, “How to Split Strings in Scala”. Given a simple CSV file named finance.csv with these contents:
January, 10000.00, 9000.00, 1000.00
February, 11000.00, 9500.00, 1500.00
March, 12000.00, 10000.00, 2000.00
you can process the lines in the file with the following code:
object CSVDemo extends App {
    println("Month, Income, Expenses, Profit")
    val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
    for (line <- bufferedSource.getLines) {
        val cols = line.split(",").map(_.trim)
        // do whatever you want with the columns here
        println(s"${cols(0)}|${cols(1)}|${cols(2)}|${cols(3)}")
    }
    bufferedSource.close
}
The magic in that code is this line:
val cols = line.split(",").map(_.trim)
It splits each line using the comma as a field separator character, and then uses the map method to trim each field to remove leading and trailing blank spaces. The resulting output looks like this:
January|10000.00|9000.00|1000.00
February|11000.00|9500.00|1500.00
March|12000.00|10000.00|2000.00
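One detail of split is worth a quick illustration: called with only a separator, it silently drops trailing empty fields, so a line ending in commas yields fewer columns than expected. The sketch below is a minimal example, not part of the original recipe. The name parseLine is hypothetical, and the -1 limit argument is one way to keep trailing empty fields. It still assumes fields never contain quoted commas.

```scala
object SplitSketch {
  // Hypothetical helper: split one CSV line on commas and trim each field.
  // The -1 limit keeps trailing empty fields, which a plain split(",") drops.
  // Assumption: no field contains a quoted comma.
  def parseLine(line: String): Array[String] =
    line.split(",", -1).map(_.trim)
}
```

For example, SplitSketch.parseLine("March, 12000.00,,") returns four fields (the last two empty), while "March, 12000.00,,".split(",") returns only two.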
If you prefer named variables instead of accessing array elements, you can change the for loop to look like this:
for (line <- bufferedSource.getLines) {
    val Array(month, revenue, expenses, profit) = line.split(",").map(_.trim)
    println(s"$month $revenue $expenses $profit")
}
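Note that the Array(month, revenue, expenses, profit) pattern only matches lines with exactly four columns. Any other line (a blank line, for instance) throws a scala.MatchError at runtime. If your data may be irregular, a match expression with a fallback case is one way to handle it. This is a minimal sketch, and the names NamedColumnsSketch and parseRow are hypothetical:

```scala
object NamedColumnsSketch {
  // Hypothetical helper: returns the four named fields, or None when the
  // line does not split into exactly four columns.
  def parseRow(line: String): Option[(String, String, String, String)] =
    line.split(",").map(_.trim) match {
      case Array(month, revenue, expenses, profit) =>
        Some((month, revenue, expenses, profit))
      case _ =>
        None // wrong number of columns; the caller decides how to report it
    }
}
```

Inside the loop you could then write NamedColumnsSketch.parseRow(line).foreach { case (month, revenue, expenses, profit) => ... } to skip malformed lines.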
If the first line of the file is a header line and you want to skip it, just add drop(1) after getLines:
for (line <- bufferedSource.getLines.drop(1)) {
    // ...
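If you would rather use the header line than discard it, one option is to read it first and pair its column names with each data row. The following is a rough sketch under that assumption. The names HeaderSketch and rowsAsMaps are hypothetical, and the function takes any Iterator[String], such as the one returned by getLines:

```scala
object HeaderSketch {
  // Hypothetical helper: treats the first line as column names and returns
  // each remaining line as a Map from column name to field value.
  def rowsAsMaps(lines: Iterator[String]): List[Map[String, String]] =
    if (!lines.hasNext) Nil
    else {
      val header = lines.next().split(",").map(_.trim)
      lines.map { line =>
        // zip stops at the shorter side, so short rows simply omit keys
        header.zip(line.split(",").map(_.trim)).toMap
      }.toList
    }
}
```

With a header of Month, Income, a row can then be read as row("Income") rather than by position.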
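If you prefer, you can also write the loop with the foreach method. This version fills the rows array using the count counter, both of which are defined in the CSVDemo2 example that follows:
bufferedSource.getLines.foreach { line =>
    rows(count) = line.split(",").map(_.trim)
    count += 1
}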
If you’d like to assign the results to a two-dimensional array, there are a variety of ways to do this. One approach is to create a 2D array, and then use a counter while assigning each line to a row. To do this, you need to know the number of rows in the file before creating the array:
object CSVDemo2 extends App {
    val nrows = 3
    val ncols = 4
    val rows = Array.ofDim[String](nrows, ncols)
    val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
    var count = 0
    for (line <- bufferedSource.getLines) {
        rows(count) = line.split(",").map(_.trim)
        count += 1
    }
    bufferedSource.close

    // print the rows
    for (i <- 0 until nrows) {
        println(s"${rows(i)(0)} ${rows(i)(1)} ${rows(i)(2)} ${rows(i)(3)}")
    }
}
Rather than use a counter, you can do the same thing with the zipWithIndex method. This changes the loop to:
val bufferedSource = io.Source.fromFile("/tmp/finance.csv")
for ((line, count) <- bufferedSource.getLines.zipWithIndex) {
    rows(count) = line.split(",").map(_.trim)
}
bufferedSource.close
If you don’t know the number of rows ahead of time, read each row as an Array[String], adding each row to an ArrayBuffer as the file is read. That approach is shown in this example, which uses a using method (defined at the end of the example) to close the file automatically:
import scala.collection.mutable.ArrayBuffer

object CSVDemo3 extends App {

    // each row is an array of strings (the columns in the csv file)
    val rows = ArrayBuffer[Array[String]]()

    // (1) read the csv data
    using(io.Source.fromFile("/tmp/finance.csv")) { source =>
        for (line <- source.getLines) {
            rows += line.split(",").map(_.trim)
        }
    }

    // (2) print the results
    for (row <- rows) {
        println(s"${row(0)}|${row(1)}|${row(2)}|${row(3)}")
    }

    // runs f on the resource and always closes the resource afterward
    def using[A <: { def close(): Unit }, B](resource: A)(f: A => B): B =
        try {
            f(resource)
        } finally {
            resource.close()
        }
}
An Array[String] is used for each row because that’s what the split method returns. You can convert this to a different collection type, if desired.
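As one illustration of converting to a different collection type, the sketch below reads every row into an immutable Vector. It is not from the original recipe. It assumes Scala 2.13 or later, where the standard library provides scala.util.Using as an alternative to a hand-written using method. The names ReadAllSketch and readRows are hypothetical.

```scala
import scala.io.Source
import scala.util.{Try, Using}

object ReadAllSketch {
  // Hypothetical helper: reads every line of the file at `path` into an
  // immutable Vector of rows. Using closes the source even if reading fails,
  // and a failure (such as a missing file) is returned as a Failure value.
  def readRows(path: String): Try[Vector[Vector[String]]] =
    Using(Source.fromFile(path)) { source =>
      // toVector forces the whole file to be read before the source is closed
      source.getLines().map(_.split(",").map(_.trim).toVector).toVector
    }
}
```

Because the result is a Try, the caller can pattern match on Success and Failure instead of wrapping the call in try/catch.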
Discussion
As you can see, there are a number of ways to tackle this problem. Of all the examples shown, the zipWithIndex method probably requires some explanation. The Iterator Scaladoc states that it creates an iterator that pairs each element produced by this iterator with its index, counting from 0.
So the first time through the loop, line is assigned the first line from the file, and count is 0. The next time through the loop, the second line of the file is assigned to line, and count is 1, and so on. The zipWithIndex method offers a nice solution for when you need a line counter.
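To see the pairing in isolation, here is a small sketch that works on any Iterator[String] rather than a file. The names ZipWithIndexSketch and numbered are hypothetical, and the index is shifted by one to produce 1-based line numbers:

```scala
object ZipWithIndexSketch {
  // Hypothetical helper: prefixes each line with its 1-based line number.
  // zipWithIndex itself counts from 0, hence the i + 1.
  def numbered(lines: Iterator[String]): List[String] =
    lines.zipWithIndex.map { case (line, i) => s"${i + 1}: $line" }.toList
}
```

A line number like this is handy in error messages, for example when reporting which row of a CSV file failed to parse.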
In addition to these approaches, a quick search for “scala csv parser” will turn up a number of competing open source projects that you can use.
2022 Scala 3 Update
As an update in November, 2022, this is a Scala 3 “main method” solution to reading a CSV file:
@main def readCsvFile =
    val bufferedSource = io.Source.fromFile("/Users/al/Desktop/Customers.csv")
    for line <- bufferedSource.getLines do
        val cols = line.split(",").map(_.trim)
        print(s"${cols(1)}, ")
    bufferedSource.close
In this example, I’m extracting the 2nd column from a CSV file using Scala 3.
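Be aware that cols(1) throws an ArrayIndexOutOfBoundsException on any line with fewer than two columns, such as a blank line at the end of the file. If that can happen in your data, a bounds-safe lookup is one option. The following is a minimal sketch written with braces so it compiles under both Scala 2.13 and Scala 3. The names ColumnSketch and column are hypothetical:

```scala
object ColumnSketch {
  // Hypothetical helper: collects the field at `index` from every line,
  // skipping lines that have too few columns instead of throwing.
  def column(lines: Iterator[String], index: Int): List[String] =
    lines.flatMap { line =>
      // lift returns None when the index is out of bounds
      line.split(",").map(_.trim).lift(index)
    }.toList
}
```

For example, ColumnSketch.column(bufferedSource.getLines, 1) would return the second column of every well-formed line.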
See Also
- Recipe 12.1, “How to Open and Read a Text File” shows both manual and automated ways of closing file resources
- Recipe 10.11, “Using zipWithIndex or zip in Scala to Create Loop Counters” provides more examples of the zipWithIndex method
- The Scala Iterator trait