Example SQL queries: distinct, joins, group by, order by, count, and sorting

By Alvin Alexander. Last updated: January 29, 2022

As a brief note to self, this is a SQL query I use in the Scala Cookbook^(#ad) to get a distinct list of all movie titles in the MovieLens dataset that are five-star rated in the Comedy/Romance genre:

val q = """
    select distinct(m.title), r.rating, m.genres
    from movies m, ratings r
    where m.movieId == r.movieId 
    and m.genres like '%Comedy%Romance%'
    and r.rating == 5.0
    order by m.title"""

spark.sql(q).show

Using Apache Spark — which I demonstrate in the Scala Cookbook — the output of that SQL query looks like this:

+--------------------+------+--------------------+
|               title|rating|              genres|
+--------------------+------+--------------------+
|(500) Days of Sum...|   5.0|Comedy|Drama|Romance|
|           10 (1979)|   5.0|      Comedy|Romance|
|10 Items or Less ...|   5.0|Comedy|Drama|Romance|
|10 Things I Hate ...|   5.0|      Comedy|Romance|

Similarly, here’s another SQL query where I use count, group by, and order by to get a sorted count of the number of people who gave each Comedy/Romance movie a five-star rating:

val q = """
    select m.title, count(1) as the_count
    from movies m, ratings r
    where m.movieId == r.movieId 
    and m.genres like '%Comedy%Romance%'
    and r.rating == 5.0
    group by m.title
    order by the_count desc"""

// using 'false' in 'show' tells spark to print the full column widths
spark.sql(q).show(100, false)

That query’s output looks like this:

+-------------------------------------------------------------+---------+
|title                                                        |the_count|
+-------------------------------------------------------------+---------+
|Forrest Gump (1994)                                          |115      |
|Princess Bride, The (1987)                                   |61       |
|Life Is Beautiful (La Vita è bella) (1997)                   |43       |
|Amelie (Fabuleux destin d'Amélie Poulain, Le) (2001)         |42       |
|Groundhog Day (1993)                                         |39       |

A key here is that the way this SQL query works is:

It selects the movie title
By joining the movies and ratings tables on the movie id
It filters the movies by genre and rating
It groups those resulting movies into different groups according to the title
Then it performs the count within each group
Finally, it prints the movie title from each group, along with the number of times that movie was in that group

A key to understanding this SQL query is to know when the GROUP BY portion of the query is run. In this case you can think of it as being run before the count function and before the ORDER BY clause.

this post is sponsored by my books:
#1 New Release	FP Best Seller	Learn Scala 3	Learn FP Fast

As a summary, if you wanted to see a couple of SQL queries that demonstrate group by, order by, distinct, joins, count, and sorting — and also use Apache Spark — I hope these examples are helpful. And for more details on these and other queries using Spark, see the Scala Cookbook^(#ad).

scala

count

join

sort

sorting

Example SQL queries: distinct, joins, group by, order by, count, and sorting

books by alvin