Examples of how to use parallel collections in Scala

This is an excerpt from the Scala Cookbook. This is Recipe 13.12, “Examples of how to use parallel collections in Scala.”

Problem

You want to improve the performance of an algorithm by using Scala’s parallel collections.

Solution

When creating a collection, use one of the Scala’s parallel collection classes, or convert an existing collection to a parallel collection. In either case, test your algorithm to make sure you see the benefit you’re expecting.

You can convert an existing collection to a parallel collection. To demonstrate this, first create a sequential collection, such as a Vector:

scala> val v = Vector.range(0, 10)
v: scala.collection.immutable.Vector[Int] = Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

Next, print the sequence, and you’ll see that it prints as usual:

scala> v.foreach(print)
0123456789

As expected, that example prints the string 0123456789. No matter how many times you print it, you’ll always see that same result; that’s the linear world you’re used to.

Next, call the par method on your collection to turn it into a parallel collection, and repeat the experiment:

scala> v.par.foreach(print)
5678901234

scala> v.par.foreach(print)
0123456789

scala> v.par.foreach{ e => print(e); Thread.sleep(50) }
0516273894

Whoa. Sometimes the collection prints in order, other times it prints in a seemingly random order. That’s because it’s now using an algorithm that runs concurrently. Welcome to the brave, new, parallel world.

That example showed how to convert a “normal” collection to a parallel collection. You can also create a parallel collection directly:

scala> import scala.collection.parallel.immutable.ParVector
import scala.collection.parallel.immutable._

scala> val v = ParVector.range(0, 10)
v: scala.collection.parallel.immutable.ParVector[Int] =
   ParVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

scala> v.foreach{ e => Thread.sleep(100); print(e) }
0516273849

Discussion

As shown, you can create parallel collections in two ways:

  • Convert a “normal” collection to its parallel counterpart
  • Instantiate them directly, just like their nonparallel counterparts

You can create a new instance of a parallel collection directly. As with the “normal” collection classes that are discussed in Chapter 10 and Chapter 11, there are both immutable and mutable parallel collections. Here’s a list of some of the immutable parallel collection classes:

ParHashMap    ParHashSet    ParIterable    ParMap
ParRange      ParSeq        ParSet         ParVector

In addition to these, the mutable collection has other classes and traits, including ParArray.

For a full list of Scala’s parallel collections, see the Scala website.

Where are parallel collections useful?

To understand where a parallel collection can be useful, it helps to think about how they work. Conceptually, you can imagine a collection being split into different chunks; your algorithm is then applied to the chunks, and at the end of the operation, the different chunks are recombined.

For instance, in the Solution, a ParVector was created like this:

scala> val v = ParVector.range(0, 10)
v: scala.collection.parallel.immutable.ParVector[Int] =
   ParVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

The elements in the ParVector were then printed like this:

scala> v.foreach{ e => Thread.sleep(100); print(e) }
0516273849

This makes sense if you imagine that the original ParVector is split into two sequences before the printing operation begins:

(0,1,2,3,4)
(5,6,7,8,9)

In this case you can imagine the foreach method taking (or receiving) the 0 from the first sequence, printing it; getting the 5 from the second sequence, printing it; then getting the 1 from the first sequence, etc.

To summarize the basic concept:

  • Collection elements are split into different groups
  • The operation is performed
  • The elements are recombined

The impact of this approach is that it must be okay that your algorithm receives elements in an arbitrary order. This means that algorithms like sum, max, min, mean, and filter will all work fine.

Conversely, any algorithm that depends on the collection elements being received in a predictable order should not be used with a parallel collection. A simple demonstration of this is the foreach examples that have been shown: if it’s important that the collection elements are printed in a particular order, such as the order in which they were placed in the collection, using a parallel collection isn’t appropriate.

The official Scala documentation refers to this as “side-effecting operations.” The Parallel Collections Overview URL in the See Also section discusses this in detail.

Performance

Using parallel collections won’t always make your code faster. It’s important to test your algorithm with and without a parallel collection to make sure your algorithm is faster with a parallel collection. The “Measuring Performance” URL in the See Also section has a terrific discussion about how to properly benchmark JVM performance.

For a parallel algorithm to provide a benefit, a collection usually needs to be fairly large. The documentation states:

“As a general heuristic, speed-ups tend to be noticeable when the size of the collection is large, typically several thousand elements.”

Finally, if using a parallel collection won’t solve your problem, using Akka actors and futures can give you complete control over your algorithms.

See Also