Thinking With Types: A “Word Occurrence” example

Writing Scala code like this gets to be so much like algebra that you can imagine walking in on Einstein writing at a chalkboard, and following his equations one at a time until he joins them all together at the end.

For example, let’s walk through the process of writing a “word occurrence” algorithm using only pure functions. I think you’ll see that we can simply talk about this as we sketch out a series of function signatures, just like writing algebra on a chalkboard.

This example was inspired by a Haskell book. I’d like to give it credit, but at the moment I can’t remember which Haskell book I saw it in.

Problem statement

The first thing to do is to write a problem statement:

Given a document that’s represented as a String, count the number of occurrences of each word in that document.

The output of our algorithm should look like this:

the    543
an     210
or     99
...
...

So, how do we start?

Getting started

Well, the first thing we can say is that it looks like that output should be stored in a map, specifically a map where the key is a string, and the value is an integer:

Map[String, Int]

Note that the output is printed in sorted order, so depending on the details of the requirements, you may also want to return a LinkedHashMap or VectorMap indicate to others that there’s something unique about this map:

LinkedHashMap[String, Int]

In fact, because of Scala’s type system, it’s easy to create your own type to express exactly what’s going on:

DudeThisIsAMapSortedByValue[String, Int]

For the purposes of this example I’ll just leave the type as a Map, but you get the idea.

Getting back to our problem, we also know that the input our word occurrence algorithm is given is a string, so we can further state the type signature for our word count function should look like this:

def wordOccurrences(s: String): Map[String, Int]

So there we have it, that’s the signature for our main function. Now all we have to do is sketch out some more pure functions that this function will use.

From a document to words

The next thing I would think about is that I have this document represented by a string, and I need to convert it into a series of words. So now I’m thinking about a document-to-words function:

def convertDocumentToWords ...

Well, “a series of words” is just a sequence of strings, so I can sketch my function like this:

def convertDocumentToWords(): Seq[String] = ???

And what I’m calling a “document” as input is really just a string, so now I have this:

def convertDocumentToWords(s: String): Seq[String] = ???

A lot of programmers like to use aliases to make their code more readable, so using Scala 3 opaque types we can write this:

opaque type Document = String
opaque type Word = String

and now this:

def convertDocumentToWords(doc: Document): Seq[Word] = ???

Cool. It looks like we’re done with that function.

What do we need next? Well, now we have a list of all of the words in the document, so now we need to convert that list into a Map[Word, Int], i.e., a map of (a) each word that was found in the document to (b) the number of occurrences of that word.

Counting word occurrences

Following our previous thought process, let’s sketch another pure function. Right now we have a series of words from this function as our input:

def documentToWords(doc: Document): Seq[Word] =

So now we need to convert that Seq[Word] into something else:

def countWordOccurrences(words: Seq[Word]) ...

I stated above that what we want is a Map[Word, Int], so now we can completely sketch this function:

def countWordOccurrences(words: Seq[Word]): Map[Word, Int]

Given a series of words, this function returns a map of each word that was found, and how many times it was found.

Hmmm, I think we’re done ... all we have to do is glue our pure functions together.

Putting it all together

At this point all we have to do is glue our pure functions together. Assuming that I use our opaque types, the main function signature looks like this:

def wc(d: Document): Map[Word, Int] = ???

Then the first then we do is convert the document into a list of words:

def wc(d: Document): Map[Word, Int] = 
    val listOfWords = documentToWords(d)
    ...

After that we convert the list of words into a map of word occurrences:

def wc(d: Document): Map[Word, Int] = 
    val listOfWords = documentToWords(d)
    val wcMap = countWordOccurrences(listOfWords)

Then, ignoring sorting, we return that map:

def wc(d: Document): Map[Word, Int] = 
    val listOfWords = documentToWords(d)
    val wcMap = countWordOccurrences(listOfWords)
    wcMap

If you prefer, you can shorten that code to this:

def wc(d: Document): Map[Word, Int] = 
    val listOfWords = documentToWords(d)
    countWordOccurrences(listOfWords)

and then this:

def wc(d: Document): Map[Word, Int] = 
    countWordOccurrences(documentToWords(d))

I often find that when I write my code with pure functions it ends up looking a little like Lisp, and that’s true again here.

That’s pretty cool. We almost completely solved this problem by (a) sketching out a series of pure function signatures, and then (b) combining those signatures like a series of algebraic equations. All that’s left to do is write some code inside those function bodies.

For the detail-oriented

You may have noticed that I skipped a few functions to get to that solution. For instance, documentToWords is going to need a few helper functions:

  • Convert a document into paragraphs
  • Convert a paragraph into words

If you follow the same thought process that I used above, you’ll end up sketching a couple of pure function signatures like this:

splitDocumentIntoParagraphs(d: Document): Seq[Paragraph]
splitParagraphIntoWords(p: Paragraph): Seq[Word]

You’re a translator

One thing to note in all of this is that you, the programmer, are a translator. When someone says “a series of words”, you think:

Seq[String]

or

Seq[Word]

And when they say they need “a list of words and the number of times they occur,” you think:

Map[String, Int]

or

Map[Word, Count]

That may seem obvious, but sometimes the obvious bears pointing out. And interestingly, we’re already working with more complicated types like Map[Word, Count], and we’re creating aliases to make our types more meaningful.

... this post is sponsored by my books ...
 

Let’s read that document from a file

So far our solution looks like this:

def wordOccurrences(s: String): Map[Word, Count]

or this, if you prefer:

def wordOccurrences(s: String): DudeThisIsAMapSortedByValue[Word, Count]

What happens now if our requirement changes and we’re told:

You don’t get a string for input, you have to read the document from a file.

Well, right away that changes the input to our pure function:

def wordOccurrences(f: File) ...

What about our result?

Well, we know that we still want a Map[Word, Count] as a result, but when we’re dealing with a file, things can go wrong. For instance, the file might not exist. That might make you want to write this:

@throws FileNotFoundException
def wordOccurrences(f: File) ...

Don’t do that!

Oops, sorry, I’ll calm down. :)

When you’re using the standard tools that are built into Scala and you’re dealing with external resources like files, databases, and network I/O, you want to return a Try type:

def wordOccurrences(f: File): Try[?]

You might prefer Option or Either, but you get the idea, you want to return an error-handling type to account for the possibility that the file doesn’t exist, you don’t have permission to read it, or other errors that might occur.

So this leads to the question, what goes inside that Try declaration?

The answer is that the function’s answer goes inside Try. Because the function returns a Map[Word, Count] when it succeeds, the function’s return type now looks like this:

def wordOccurrences(f: File): Try[Map[Word, Count]]

So, when you’re told that you have to read a the document from a file, that’s the new solution. If you prefer Either, the solution looks like this:

def wordOccurrences(f: File): Either[Throwable, Map[Word, Count]]

This means that if everything goes fine, you’ll get a Map[Word, Count] in the Either, but if something fails, you’ll get a Throwable.

books i’ve written