Parsing 'real world' HTML with Scala and HTMLCleaner

While XML parsers work great for well-formed XML, out in the 'real world' internet you can't count on HTML being XHTML, or even being well-formed. As a result, various 'HTML cleaner' libraries for Java have appeared that attempt to clean up the HTML so you can parse it.

While working on my SARAH project recently (see my Mac Siri-like speech interaction project), I added a "Chicago Sports" plugin that SARAH uses to read the Chicago sports news to me, and used an HTML cleaner project to read the news headlines from the contents of this URL:

http://mobile.chicagotribune.com/s.p?sId=54&p=XxSYQXu6aC12&catId=5555&_title=Sports

If you look at the source code for that URL, you'll see that each headline story is an "A HREF" link with a class attribute of "articleTitle", like this:

<div class="articleTitle">
<a class="articleTitle" href="(very long url here)">Sox&#039;s Youkilis &#034;over&#034; Valentine&#039;s comments</a>
</div>

I used that knowledge to develop the following Scala function, which extracts each of those A/HREF links, grabs the text portion of the link, and returns a Scala List of String elements:

import java.net.URL
import scala.collection.mutable.ListBuffer
import org.apache.commons.lang3.StringEscapeUtils
import org.htmlcleaner.HtmlCleaner

def getHeadlinesFromUrl(url: String): List[String] = {
  val stories = new ListBuffer[String]
  val cleaner = new HtmlCleaner
  val rootNode = cleaner.clean(new URL(url))
  // get every <a> element in the document (true = search recursively)
  val elements = rootNode.getElementsByName("a", true)
  for (elem <- elements) {
    val classType = elem.getAttributeByName("class")
    if (classType != null && classType.equalsIgnoreCase("articleTitle")) {
      // headlines may contain HTML entities like "&#039;", so decode them
      val text = StringEscapeUtils.unescapeHtml4(elem.getText.toString)
      stories += text
    }
  }
  // keep only the headlines I'm interested in
  stories.filter(storyContainsDesiredPhrase(_)).toList
}

Scala and HTMLCleaner - How it works

This function uses the Java HTMLCleaner library, so you'll need to download its jar file, add it to your classpath, and include this import statement in your code:

import org.htmlcleaner.HtmlCleaner

Actually, a nice thing about Scala is that you can put that import statement inside your function, but I haven't started following that practice yet.
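
As a minimal sketch of that idea, the function below places its import statement inside the function body, so the imported name is visible only there. The function name collectNonEmpty and its input are made up for illustration, and it imports a standard library class rather than HTMLCleaner so it runs without any extra jars:

```scala
// Hypothetical helper, used only to illustrate a function-scoped import.
def collectNonEmpty(lines: Seq[String]): List[String] = {
  // This import is visible only inside collectNonEmpty.
  import scala.collection.mutable.ListBuffer

  val kept = new ListBuffer[String]
  for (line <- lines) {
    val trimmed = line.trim
    if (trimmed.nonEmpty) kept += trimmed
  }
  kept.toList
}
```

The same placement should work for the org.htmlcleaner.HtmlCleaner import inside getHeadlinesFromUrl.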

Real world HTML can be badly malformed, and a good library like HTMLCleaner attempts to take care of those problems for you. The code in this example does the following:

  1. Creates an HtmlCleaner instance.
  2. Passes the URL to that instance, and gets a reference to the root node of the HTML document.
  3. Gets all the anchor tags in the document.
  4. Loops over each anchor tag; if the class of the tag is "articleTitle", extracts the text portion of the anchor tag (the text between the opening <a> tag and closing </a> tag), and decodes its HTML entities with the Apache Commons StringEscapeUtils unescapeHtml4 method.
  5. Adds each decoded string to the list of strings named stories. After the loop, filters this list using a function named storyContainsDesiredPhrase (not shown here), and converts it to a List[String] before returning it.

You can also parse the HTML using XPath expressions, though I didn't use that approach here.
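
For comparison, here is a rough sketch of what the XPath approach might look like. It assumes the evaluateXPath method on HTMLCleaner's TagNode class, which accepts a limited XPath subset and returns an array of objects. The function name getHeadlinesWithXPath and the sample HTML string are my own inventions for illustration, and the sketch reads HTML from a string rather than a URL so it can be tried without a network connection:

```scala
import org.htmlcleaner.{HtmlCleaner, TagNode}

// Hypothetical variant: select the headline anchors with an XPath expression.
def getHeadlinesWithXPath(html: String): List[String] = {
  val cleaner = new HtmlCleaner
  val rootNode = cleaner.clean(html) // clean() also accepts a java.net.URL
  // evaluateXPath returns an array of objects; keep only the element nodes
  val matches = rootNode.evaluateXPath("//a[@class='articleTitle']")
  matches.collect { case node: TagNode => node.getText.toString.trim }.toList
}

// Made-up sample input: one matching anchor and one that should be skipped.
val sample =
  """<div class="articleTitle"><a class="articleTitle" href="#">Cubs win</a></div>
    |<a class="other" href="#">Skip</a>""".stripMargin
val headlines = getHeadlinesWithXPath(sample)
```

This sketch skips the entity decoding and the headline filtering done in getHeadlinesFromUrl; both could be applied to the returned list in the same way.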

This function also uses the StringEscapeUtils class from the Apache Commons Lang project, so you'll need that library, and this import statement as well:

import org.apache.commons.lang3.StringEscapeUtils

For the purposes of this article, the last line of code doesn't matter, so I'll just briefly say that there's another Scala function named "storyContainsDesiredPhrase", which filters the list of stories down to just the ones I'm interested in. In short, I'm interested in stories about the Cubs, Bulls, and Bears, so I only return strings (story headlines) that contain those words.
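
The real storyContainsDesiredPhrase isn't shown in this article, so the following is only a guess at what such a filter could look like: a case-insensitive check against a fixed list of team names. The desiredPhrases value is a name I made up for this sketch.

```scala
// Hypothetical list of phrases; the original function's list isn't shown.
val desiredPhrases = List("cubs", "bulls", "bears")

// Returns true if the headline mentions any desired phrase, ignoring case.
def storyContainsDesiredPhrase(story: String): Boolean = {
  val lower = story.toLowerCase
  desiredPhrases.exists(phrase => lower.contains(phrase))
}
```

A substring check like this also matches longer words that happen to contain a team name, so a stricter version might split the headline into words first.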

The HTMLCleaner library website doesn't include many source code examples, so if you're interested in parsing "real world" HTML with Scala or Java, I hope this example is helpful.
