While reading the excellent book Programming Collective Intelligence recently, I decided to code up the first algorithm in the book using Scala instead of Python (which the book uses). It is a Euclidean distance algorithm: it provides one way to compare two sets of data and produces a score for how similar they are.
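In formula form, the score computed by the Scala code in this post can be written roughly as follows. The symbols are my own notation, not the book's: $M$ is the set of movies both reviewers have rated, and $r_1(m)$ and $r_2(m)$ are the two reviewers' ratings for movie $m$.

```latex
\[
\text{score}(p_1, p_2) =
\begin{cases}
0 & \text{if } M = \emptyset, \\[4pt]
\dfrac{1}{1 + \sum_{m \in M} \bigl(r_1(m) - r_2(m)\bigr)^2} & \text{otherwise.}
\end{cases}
\]
```

A score of 1 means the two reviewers gave identical ratings to every shared movie, and the score falls toward 0 as their ratings diverge. Note that the code as written uses the sum of squared differences directly, without taking a square root.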
Assuming you have the Programming Collective Intelligence book, here's the Scala source code for the Euclidean distance algorithm as described in the book:
package collectiveintelligence

object Recommendations {

  val lisaRose = Map(
    "Lady in the Water" -> 2.5,
    "Snakes on a Plane" -> 3.5,
    "Just My Luck" -> 3.0,
    "Superman Returns" -> 3.5,
    "You, Me and Dupree" -> 2.5,
    "The Night Listener" -> 3.0)

  val geneSeymour = Map(
    "Lady in the Water" -> 3.0,
    "Snakes on a Plane" -> 3.5,
    "Just My Luck" -> 1.5,
    "Superman Returns" -> 5.0,
    "The Night Listener" -> 3.0,
    "You, Me and Dupree" -> 3.5)

  val michaelPhillips = Map(
    "Lady in the Water" -> 2.5,
    "Snakes on a Plane" -> 3.0,
    "Superman Returns" -> 3.5,
    "The Night Listener" -> 4.0)

  val claudiaPuig = Map(
    "Snakes on a Plane" -> 3.5,
    "Just My Luck" -> 3.0,
    "The Night Listener" -> 4.5,
    "Superman Returns" -> 4.0,
    "You, Me and Dupree" -> 2.5)

  val mickLaSalle = Map(
    "Lady in the Water" -> 3.0,
    "Snakes on a Plane" -> 4.0,
    "Just My Luck" -> 2.0,
    "Superman Returns" -> 3.0,
    "The Night Listener" -> 3.0,
    "You, Me and Dupree" -> 2.0)

  val jackMatthews = Map(
    "Lady in the Water" -> 3.0,
    "Snakes on a Plane" -> 4.0,
    "The Night Listener" -> 3.0,
    "Superman Returns" -> 5.0,
    "You, Me and Dupree" -> 3.5)

  val toby = Map(
    "Snakes on a Plane" -> 4.5,
    "You, Me and Dupree" -> 1.0,
    "Superman Returns" -> 4.0)

  // reviewer name -> (movie -> rating)
  val critics = Map(
    "Lisa Rose" -> lisaRose,
    "Gene Seymour" -> geneSeymour,
    "Michael Phillips" -> michaelPhillips,
    "Claudia Puig" -> claudiaPuig,
    "Mick LaSalle" -> mickLaSalle,
    "Jack Matthews" -> jackMatthews,
    "Toby" -> toby
  )

  def square(a: Double) = a * a

  /**
   * Compute a distance-based similarity score for two movie reviewers.
   * Sums the squared rating differences over all movies both reviewers
   * have rated, then returns 1 / (1 + sum), so identical ratings score 1.0.
   * Returns 0 if the reviewers have no movies in common.
   */
  def euclidianDistance(critics: Map[String, Map[String, Double]], person1: String, person2: String): Double = {
    val p1Ratings = critics(person1) // Map(movie -> rating)
    val p2Ratings = critics(person2)

    // create a list of movies that both people have seen
    val similarItems = scala.collection.mutable.Set[String]()
    p1Ratings.keys.foreach( (movie) => if (p2Ratings.contains(movie)) similarItems += movie )
    if (similarItems.size == 0) return 0.0

    // add up the squared differences for the shared movies
    var sumOfSquares = 0.0
    for (movie <- similarItems) {
      val diffSquared = square(p1Ratings(movie) - p2Ratings(movie))
      sumOfSquares += diffSquared
    }
    return 1/(1 + sumOfSquares)
  }

  // MAIN
  def main(args: Array[String]): Unit = {
    val dist = euclidianDistance(critics, "Lisa Rose", "Gene Seymour")
    println("VALUE: " + dist)
  }
}
That code could use some cleanup/refactoring, but for the time being it looks pretty much like what the author of Programming Collective Intelligence has written in Python. I haven't run it against all the different movie reviewers or added tests, but for the two reviewers used in main (Lisa Rose and Gene Seymour), it seems to work as advertised.
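For what it's worth, here is a minimal sketch of what one such cleanup might look like. It is only one possible refactoring, not the book's approach: the object name `EuclideanSketch`, the `Ratings` type alias, and the `similarity` function are hypothetical names introduced for illustration, and only two of the reviewers are repeated to keep the block short.

```scala
object EuclideanSketch {

  // Hypothetical alias: movie title -> rating
  type Ratings = Map[String, Double]

  val lisaRose: Ratings = Map(
    "Lady in the Water" -> 2.5, "Snakes on a Plane" -> 3.5,
    "Just My Luck" -> 3.0, "Superman Returns" -> 3.5,
    "You, Me and Dupree" -> 2.5, "The Night Listener" -> 3.0)

  val geneSeymour: Ratings = Map(
    "Lady in the Water" -> 3.0, "Snakes on a Plane" -> 3.5,
    "Just My Luck" -> 1.5, "Superman Returns" -> 5.0,
    "The Night Listener" -> 3.0, "You, Me and Dupree" -> 3.5)

  /** Returns 1 / (1 + sum of squared rating differences) over the movies
    * both reviewers rated, or 0.0 if they share no movies.
    */
  def similarity(a: Ratings, b: Ratings): Double = {
    val shared = a.keySet intersect b.keySet
    if (shared.isEmpty) 0.0
    else {
      // Map over an iterator rather than the Set itself: mapping a Set
      // would collapse equal squared differences into a single element.
      val sumOfSquares = shared.iterator.map { movie =>
        val diff = a(movie) - b(movie)
        diff * diff
      }.sum
      1.0 / (1.0 + sumOfSquares)
    }
  }

  def main(args: Array[String]): Unit = {
    println("VALUE: " + similarity(lisaRose, geneSeymour))
  }
}
```

For Lisa Rose and Gene Seymour the squared differences add up to 5.75, so the score comes out to 1 / 6.75, or roughly 0.148. Taking the two rating maps directly, rather than the whole critics map plus two names, also makes the function easier to test in isolation.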
If anyone has any questions about this algorithm or code I'll be glad to answer them, but until then, I'll just leave this code as is.