As a brief note to self, this is a SQL query I use in the Scala Cookbook (#ad) to get a distinct list of all movie titles in the MovieLens dataset that are five-star rated in the Comedy/Romance genre:
val q = """
select distinct(m.title), r.rating, m.genres
from movies m, ratings r
where m.movieId == r.movieId
and m.genres like '%Comedy%Romance%'
and r.rating == 5.0
order by m.title"""
spark.sql(q).show
Using Apache Spark — which I demonstrate in the Scala Cookbook — the output of that SQL query looks like this:
+--------------------+------+--------------------+
| title|rating| genres|
+--------------------+------+--------------------+
|(500) Days of Sum...| 5.0|Comedy|Drama|Romance|
| 10 (1979)| 5.0| Comedy|Romance|
|10 Items or Less ...| 5.0|Comedy|Drama|Romance|
|10 Things I Hate ...| 5.0| Comedy|Romance|
Similarly, here’s another SQL query where I use count
, group by
, and order by
to get a sorted count of the number of people who gave each Comedy/Romance movie a five-star rating:
val q = """
select m.title, count(1) as the_count
from movies m, ratings r
where m.movieId == r.movieId
and m.genres like '%Comedy%Romance%'
and r.rating == 5.0
group by m.title
order by the_count desc"""
// using 'false' in 'show' tells spark to print the full column widths
spark.sql(q).show(100, false)
That query’s output looks like this:
+-------------------------------------------------------------+---------+
|title |the_count|
+-------------------------------------------------------------+---------+
|Forrest Gump (1994) |115 |
|Princess Bride, The (1987) |61 |
|Life Is Beautiful (La Vita è bella) (1997) |43 |
|Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001) |42 |
|Groundhog Day (1993) |39 |
A key here is that the way this SQL query works is:
- It selects the movie title
- By joining the movies and ratings tables on the movie id
- It filters the movies by genre and rating
- It groups those resulting movies into different groups according to the title
- Then it performs the count within each group
- Finally, it prints the movie title from each group, along with the number of times that movie was in that group
A key to understanding this SQL query is to know when the GROUP BY
portion of the query is run. In this case you can think of it as being run before the count
function and before the ORDER BY
clause.
this post is sponsored by my books: | |||
![]() #1 New Release |
![]() FP Best Seller |
![]() Learn Scala 3 |
![]() Learn FP Fast |
As a summary, if you wanted to see a couple of SQL queries that demonstrate group by
, order by
, distinct
, joins, count
, and sorting — and also use Apache Spark — I hope these examples are helpful. And for more details on these and other queries using Spark, see the Scala Cookbook (#ad).