Table of Contents
This is an excerpt from the Scala Cookbook. This is Recipe 13.12, “Examples of how to use parallel collections in Scala.”
Problem
You want to improve the performance of an algorithm by using Scala’s parallel collections.
Solution
When creating a collection, use one of the Scala’s parallel collection classes, or convert an existing collection to a parallel collection. In either case, test your algorithm to make sure you see the benefit you’re expecting.
You can convert an existing collection to a parallel collection. To demonstrate this, first create a sequential collection, such as a Vector
:
scala> val v = Vector.range(0, 10) v: scala.collection.immutable.Vector[Int] = Vector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
Next, print the sequence, and you’ll see that it prints as usual:
scala> v.foreach(print) 0123456789
As expected, that example prints the string 0123456789
. No matter how many times you print it, you’ll always see that same result; that’s the linear world you’re used to.
Next, call the par
method on your collection to turn it into a parallel collection, and repeat the experiment:
scala> v.par.foreach(print) 5678901234 scala> v.par.foreach(print) 0123456789 scala> v.par.foreach{ e => print(e); Thread.sleep(50) } 0516273894
Whoa. Sometimes the collection prints in order, other times it prints in a seemingly random order. That’s because it’s now using an algorithm that runs concurrently. Welcome to the brave, new, parallel world.
That example showed how to convert a “normal” collection to a parallel collection. You can also create a parallel collection directly:
scala> import scala.collection.parallel.immutable.ParVector import scala.collection.parallel.immutable._ scala> val v = ParVector.range(0, 10) v: scala.collection.parallel.immutable.ParVector[Int] = ParVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9) scala> v.foreach{ e => Thread.sleep(100); print(e) } 0516273849
Discussion
As shown, you can create parallel collections in two ways:
- Convert a “normal” collection to its parallel counterpart
- Instantiate them directly, just like their nonparallel counterparts
You can create a new instance of a parallel collection directly. As with the “normal” collection classes that are discussed in Chapter 10 and Chapter 11, there are both immutable and mutable parallel collections. Here’s a list of some of the immutable parallel collection classes:
ParHashMap ParHashSet ParIterable ParMap ParRange ParSeq ParSet ParVector
In addition to these, the mutable collection has other classes and traits, including ParArray
.
For a full list of Scala’s parallel collections, see the Scala website.
Where are parallel collections useful?
To understand where a parallel collection can be useful, it helps to think about how they work. Conceptually, you can imagine a collection being split into different chunks; your algorithm is then applied to the chunks, and at the end of the operation, the different chunks are recombined.
For instance, in the Solution, a ParVector
was created like this:
scala> val v = ParVector.range(0, 10) v: scala.collection.parallel.immutable.ParVector[Int] = ParVector(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
The elements in the ParVector
were then printed like this:
scala> v.foreach{ e => Thread.sleep(100); print(e) } 0516273849
This makes sense if you imagine that the original ParVector
is split into two sequences before the printing operation begins:
(0,1,2,3,4) (5,6,7,8,9)
In this case you can imagine the foreach
method taking (or receiving) the 0
from the first sequence, printing it; getting the 5
from the second sequence, printing it; then getting the 1
from the first sequence, etc.
To summarize the basic concept:
- Collection elements are split into different groups
- The operation is performed
- The elements are recombined
The impact of this approach is that it must be okay that your algorithm receives elements in an arbitrary order. This means that algorithms like sum
, max
, min
, mean
, and filter
will all work fine.
Conversely, any algorithm that depends on the collection elements being received in a predictable order should not be used with a parallel collection. A simple demonstration of this is the foreach
examples that have been shown: if it’s important that the collection elements are printed in a particular order, such as the order in which they were placed in the collection, using a parallel collection isn’t appropriate.
The official Scala documentation refers to this as “side-effecting operations.” The Parallel Collections Overview URL in the See Also section discusses this in detail.
Performance
Using parallel collections won’t always make your code faster. It’s important to test your algorithm with and without a parallel collection to make sure your algorithm is faster with a parallel collection. The “Measuring Performance” URL in the See Also section has a terrific discussion about how to properly benchmark JVM performance.
For a parallel algorithm to provide a benefit, a collection usually needs to be fairly large. The documentation states:
“As a general heuristic, speed-ups tend to be noticeable when the size of the collection is large, typically several thousand elements.”
Finally, if using a parallel collection won’t solve your problem, using Akka actors and futures can give you complete control over your algorithms.
this post is sponsored by my books: | |||
#1 New Release |
FP Best Seller |
Learn Scala 3 |
Learn FP Fast |