Example SQL queries: distinct, joins, group by, order by, count, and sorting

As a brief note to self, this is a SQL query I use in the Scala Cookbook to get a distinct list of all movie titles in the MovieLens dataset that are five-star rated in the Comedy/Romance genre:

val q = """
    select distinct(m.title), r.rating, m.genres
    from movies m, ratings r
    where m.movieId == r.movieId 
    and m.genres like '%Comedy%Romance%'
    and r.rating == 5.0
    order by m.title"""

spark.sql(q).show

Using Apache Spark — which I demonstrate in the Scala Cookbook — the output of that SQL query looks like this:

+--------------------+------+--------------------+
|               title|rating|              genres|
+--------------------+------+--------------------+
|(500) Days of Sum...|   5.0|Comedy|Drama|Romance|
|           10 (1979)|   5.0|      Comedy|Romance|
|10 Items or Less ...|   5.0|Comedy|Drama|Romance|
|10 Things I Hate ...|   5.0|      Comedy|Romance|

Similarly, here’s another SQL query where I use count, group by, and order by to get a sorted count of the number of people who gave each Comedy/Romance movie a five-star rating:

val q = """
    select m.title, count(1) as the_count
    from movies m, ratings r
    where m.movieId == r.movieId 
    and m.genres like '%Comedy%Romance%'
    and r.rating == 5.0
    group by m.title
    order by the_count desc"""

// using 'false' in 'show' tells spark to print the full column widths
spark.sql(q).show(100, false)

That query’s output looks like this:

+-------------------------------------------------------------+---------+
|title                                                        |the_count|
+-------------------------------------------------------------+---------+
|Forrest Gump (1994)                                          |115      |
|Princess Bride, The (1987)                                   |61       |
|Life Is Beautiful (La Vita è bella) (1997)                   |43       |
|Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)         |42       |
|Groundhog Day (1993)                                         |39       |

A key here is that the way this SQL query works is:

  • It selects the movie title
  • By joining the movies and ratings tables on the movie id
  • It filters the movies by genre and rating
  • It groups those resulting movies into different groups according to the title
  • Then it performs the count within each group
  • Finally, it prints the movie title from each group, along with the number of times that movie was in that group

A key to understanding this SQL query is to know when the GROUP BY portion of the query is run. In this case you can think of it as being run before the count function and before the ORDER BY clause.

As a summary, if you wanted to see a couple of SQL queries that demonstrate group by, order by, distinct, joins, count, and sorting — and also use Apache Spark — I hope these examples are helpful. And for more details on these and other queries using Spark, see the Scala Cookbook.