As a quick tip, this method shows one way you can find the keys that are common to two Scala Maps:
/** * Find the list of keys that are common to two Maps. */ def commonMapKeys[A, B](a: Map[A, B], b: Map[A, B]): Set[A] = a.keySet.intersect(b.keySet)
I just had two Maps named p1Ratings
and p2Ratings
, which are Map
s of movies to their movie review ratings, i.e., maps of type Map[String, Double]
. You can think of p1Ratings
as being all of my movie reviews, and p2Ratings
as being all of your movie reviews. For instance, my ratings might look like this:
val p1Ratings = Map( "Snakes on a Plane" -> 2.0, "You Kill Me" -> 4.5, "The Lake House" -> 3.5, "The Way" -> 5.0 )
Using commonMapKeys
, I can find the movies that both of us have reviewed, like this:
val commonMovies = commonMapKeys(p1Ratings, p2Ratings)
The result of this is that commonMovies
is a List[String]
, where again, that list contains the names of all movies that are common to the two movie reviewers.
Discussion/warning
As a word of warning, I don’t know how efficient this solution is for large maps. For instance, for a different problem, let’s say that the first Map
has 100 million elements, and the second Map
has another 500 million elements. This code:
a.keySet.intersect(b.keySet)
could require an awful lot of memory, as it requires the memory for the original two maps, as well as two Sets that contain all of their keys (and I also don’t know how much RAM the intersect
method requires).
If there really is a memory issue here you’ll have to use a more manual approach, maybe something like this:
// a manual "object oriented" approach val similarItems = scala.collection.mutable.Set[String]() p1Ratings.keys.foreach{ movie => if (p2Ratings.contains(movie)) similarItems += movie }
In this case the keys
method is an Iterable[String]
, so it won’t immediately require a huge amount of RAM.
As another idea, you may also be able to create a lazy “view” somewhere in the code, but I haven’t given that any thought yet.
Personally I wouldn’t jump into any “manual” solutions until it’s proven that the first method is a real memory hog, but if you have a big data set you need to work on right now, experiment away.