How to create a Scala Twitter client using ScalaJ-HTTP and HTMLCleaner

Here’s some Scala source code that shows how to scrape the tweets off of a Twitter page. I was thinking about rewriting a Twitter module I use so that it takes a “pure HTML” approach, and the test/demo code I came up with looks like this:

import scalaj.http.{Http, HttpResponse}
import org.htmlcleaner.{HtmlCleaner, TagNode}
import org.apache.commons.lang3.StringEscapeUtils

object JamieAllenTweets extends App
{
    // get the contents of the twitter page (ScalaJ)
    val response: HttpResponse[String] = Http("https://twitter.com/jamie_allen")
                                        .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
                                        .asString
    val body = response.body

    // extract the tweets and print them to stdout
    val tweetsSeq = extractTweets(body)
    val tweetsString = tweetsSeq.mkString("\n\n")
    println(tweetsString)

    // extract each individual tweet from a twitter page
    def extractTweets(html: String): Seq[String] = {
        val cleaner = new HtmlCleaner
        val rootNode = cleaner.clean(html)
        val elements = rootNode.getElementsByName("div", true)
        val tweetsSeq: Seq[String] = for {
            e <- elements
            currentClass = e.getAttributeByName("class")
            if currentClass != null
            if currentClass.contains("tweet-text") // a css class twitter uses on each tweet
            tweetText = StringEscapeUtils.unescapeHtml4(e.getText.toString.trim)
        } yield tweetText
        tweetsSeq
    }

}

Here’s a quick description of the code:

  • The first few lines use ScalaJ-HTTP to download the HTML content from the URL.
  • Next, I extract the tweets from the HTML using the extractTweets method.
  • The extractTweets method uses the HTMLCleaner library and Apache Commons Lang.
  • Finally, I print the tweets as a series of strings separated by blank lines.

The extractTweets method does these things:

  • Gets a list of all <div> tags.
  • Keeps only the <div> tags whose class attribute contains tweet-text.
  • Extracts and trims the text of each matching tag, converting HTML entities like &nbsp; into the characters they represent.

The for comprehension in extractTweets also shows how to combine a generator, value definitions, and guards, and then yield a result.
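
To see the shape of that for comprehension without any HTML libraries involved, here’s a minimal sketch that runs on plain data. The Element case class, the ForComprehensionSketch object, and the textsWithClass method are hypothetical stand-ins made up for illustration; they are not part of HTMLCleaner.

```scala
// Hypothetical stand-in for an HTML element. cssClass may be null,
// mirroring the null check that extractTweets performs on the class attribute.
final case class Element(cssClass: String, text: String)

object ForComprehensionSketch {

  // Return the trimmed text of every element whose class contains `marker`.
  def textsWithClass(elements: Seq[Element], marker: String): Seq[String] =
    for {
      e <- elements                      // generator: visit each element
      currentClass = e.cssClass          // definition: name an intermediate value
      if currentClass != null            // guard: skip elements with no class
      if currentClass.contains(marker)   // guard: keep only matching classes
      cleaned = e.text.trim              // definition: tidy up the text
    } yield cleaned

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Element("js-tweet-text tweet-text", "  first tweet  "),
      Element(null, "no class attribute"),
      Element("footer", "not a tweet")
    )
    textsWithClass(sample, "tweet-text").foreach(println)
  }
}
```

The guards are checked in order, so the contains call is only reached for elements whose class is not null. In your own code you could wrap the nullable value in an Option instead, but the version here stays close to the structure of extractTweets.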

That’s a quick summary of how this code works. Again, I wrote it as a test to see if I could extract tweets from a Twitter page. This approach works pretty well. Its biggest drawback is that it only works on public Twitter pages, so if you want to extract information from your own private lists and saved searches, you’ll have to use an approach like the one I showed in my “How to create a Twitter client in Scala” tutorial.

build.sbt file, test code

FWIW, here’s my build.sbt file for this project:

name := "ScalaJ"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
    "org.scalaj" %% "scalaj-http" % "2.3.0",
    "org.scala-lang.modules" %% "scala-xml" % "1.0.3"
)
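
The build file above doesn’t list HTMLCleaner or Apache Commons Lang, both of which the tweet-scraping code imports. If you aren’t supplying those jars some other way (for example, in the project’s lib folder), a sketch of the extra dependencies might look like the following. The version numbers are assumptions for illustration rather than what the original project used, so check for the releases you actually want:

```scala
// Assumed additions to build.sbt; the version numbers are illustrative
libraryDependencies ++= Seq(
    "net.sourceforge.htmlcleaner" % "htmlcleaner" % "2.16",
    "org.apache.commons" % "commons-lang3" % "3.4"
)
```

Both are plain Java libraries, so they use a single % between the organization and the artifact name rather than the %% that appends the Scala version.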

And here’s a little test class I used when I first started working with ScalaJ:

import scalaj.http._

object TestHead extends App
{
    val response: HttpResponse[String] = Http("http://www.google.com")
        .method("HEAD")
        .timeout(connTimeoutMs = 1000, readTimeoutMs = 5000)
        .asString
    for ((k,v) <- response.headers) println(s"key:   $k\nvalue: $v\n")
    //response.body
    //println(response.code)
    //response.cookies
    //println(response)
}

Summary

In summary, if you wanted to see how to use ScalaJ-HTTP and HTMLCleaner to create a Twitter client in Scala, I hope this is helpful.