Scala: How to find the keys that are common to two maps

As a quick tip, this method shows one way you can find the keys that are common to two Scala Maps:

/**
 * Find the list of keys that are common to two Maps.
 */
def commonMapKeys[A, B](a: Map[A, B], b: Map[A, B]): Set[A] = a.keySet.intersect(b.keySet)

I just had two Maps named p1Ratings and p2Ratings, which are Maps of movies to their movie review ratings, i.e., maps of type Map[String, Double]. You can think of p1Ratings as being all of my movie reviews, and p2Ratings as being all of your movie reviews. For instance, my ratings might look like this:

val p1Ratings = Map(
    "Snakes on a Plane" -> 2.0,
    "You Kill Me"       -> 4.5,
    "The Lake House"    -> 3.5,
    "The Way"           -> 5.0
)

Using commonMapKeys, I can find the movies that both of us have reviewed, like this:

val commonMovies = commonMapKeys(p1Ratings, p2Ratings)

The result of this is that commonMovies is a List[String], where again, that list contains the names of all movies that are common to the two movie reviewers.

Discussion/warning

As a word of warning, I don’t know how efficient this solution is for large maps. For instance, for a different problem, let’s say that the first Map has 100 million elements, and the second Map has another 500 million elements. This code:

a.keySet.intersect(b.keySet)

could require an awful lot of memory, as it requires the memory for the original two maps, as well as two Sets that contain all of their keys (and I also don’t know how much RAM the intersect method requires).

If there really is a memory issue here you’ll have to use a more manual approach, maybe something like this:

// a manual "object oriented" approach
val similarItems = scala.collection.mutable.Set[String]()

p1Ratings.keys.foreach{ movie =>
    if (p2Ratings.contains(movie)) similarItems += movie
}

In this case the keys method is an Iterable[String], so it won’t immediately require a huge amount of RAM.

As another idea, you may also be able to create a lazy “view” somewhere in the code, but I haven’t given that any thought yet.

Personally I wouldn’t jump into any “manual” solutions until it’s proven that the first method is a real memory hog, but if you have a big data set you need to work on right now, experiment away.