Scala: How to extract parts of a string that match a regex

Scala String FAQ: How can I extract one or more parts of a string that match the regular-expression patterns I specify?

Solution

Define the regular-expression patterns you want to extract from your String, placing parentheses around them so you can extract them as “regular-expression groups.” First, define the desired pattern:

val pattern = "([0-9]+) ([A-Za-z]+)".r

Next, extract the regex groups from the target string:

val pattern(count, fruit) = "100 Bananas"

This code extracts the numeric field and the alphabetic field from the given string as two separate variables, count and fruit, as shown in the Scala REPL:

scala> val pattern = "([0-9]+) ([A-Za-z]+)".r
pattern: scala.util.matching.Regex = ([0-9]+) ([A-Za-z]+)

scala> val pattern(count, fruit) = "100 Bananas"
count: String = 100
fruit: String = Bananas

Discussion

The syntax shown here may feel a little unusual because it seems like you’re defining pattern as a val field twice, but this syntax is more convenient and readable in a real-world example.

Imagine you’re writing the code for a search engine like Google, and you want to let people search for movies using a wide variety of phrases. To be really convenient, you’ll let them type any of these phrases to get a listing of movies near Boulder, Colorado:

"movies near 80301"
"movies 80301"
"80301 movies"
"movie: 80301"
"movies: 80301"
"movies near boulder, co"
"movies near boulder, colorado"

One way you can allow all these phrases to be used is to define a series of regular-expression patterns to match against them. Just define your expressions, and then attempt to match whatever the user types against all the possible expressions you’re willing to allow.

For example purposes, you’ll just allow these two simplified patterns:

// match "movies 80301"
val MoviesZipRE = "movies (\\d{5})".r

// match "movies near boulder, co"
val MoviesNearCityStateRE = "movies near ([a-z]+), ([a-z]{2})".r

Once you’ve defined the patterns you want to allow, you can match them against whatever text the user enters, using a match expression. In this example, you’ll call a fictional method named getSearchResults when a match occurs:

textUserTyped match {
    case MoviesZipRE(zip) => getSearchResults(zip)
    case MoviesNearCityStateRE(city, state) => getSearchResults(city, state)
    case _ => println("did not match a regex")
}

As you can see, this syntax makes your match expressions very readable. For both patterns you’re matching, you call an overloaded version of the getSearchResults method, passing it the zip field in the first case, and the city and state fields in the second case. The two regular expressions shown in this example will match strings like this:

"movies 80301"
"movies 99676"
"movies near boulder, co"
"movies near talkeetna, ak"

It’s important to note that with this technique, the regular expressions must match the entire user input. With the regex patterns shown, the following strings will fail because they have a blank space at the end of the line:

"movies 80301 "
"movies near boulder, co "

You can solve this particular problem by trimming the input string or using a more complicated regular expression, which you’ll want to do anyway in the “real world.”

As you can imagine, you can use this same pattern-matching technique in many different circumstances, including matching date and time formats, street addresses, people’s names, and many other situations.