Here’s some Scala source code that shows how to scrape the tweets off of a Twitter page. I was thinking about rewriting a Twitter module of mine to use a “pure HTML” approach, and the test/demo code I came up with looks like this:
import scalaj.http.{Http, HttpResponse}
import org.htmlcleaner.{HtmlCleaner, TagNode}
import org.apache.commons.lang3.StringEscapeUtils

object JamieAllenTweets extends App {

    // get the contents of the twitter page (ScalaJ)
    val response: HttpResponse[String] = Http("https://twitter.com/jamie_allen")
        .timeout(connTimeoutMs = 2000, readTimeoutMs = 5000)
        .asString
    val body = response.body

    // extract the tweets and print them to stdout
    val tweetsSeq = extractTweets(body)
    val tweetsString = tweetsSeq.mkString("\n\n")
    println(tweetsString)

    // extract each individual tweet from a twitter page
    def extractTweets(html: String): Seq[String] = {
        val cleaner = new HtmlCleaner
        val rootNode = cleaner.clean(html)
        val elements = rootNode.getElementsByName("div", true)
        val tweetsSeq: Seq[String] = for {
            e <- elements
            currentClass = e.getAttributeByName("class")
            if currentClass != null
            if currentClass.contains("tweet-text")  // a css class twitter uses on each tweet
            tweetText = StringEscapeUtils.unescapeHtml4(e.getText.toString.trim)
        } yield tweetText
        tweetsSeq
    }

}
Here’s a quick description of the code:
- The first few lines use ScalaJ-HTTP to download the HTML content from the URL.
- After that I extract the tweets from the HTML using the extractTweets method.
- The extractTweets method uses the HTMLCleaner library and Apache Commons Lang.
- After that I print the tweets as a series of strings.
The extractTweets method does these things:
- Gets a list of all <div> tags.
- Searches for all <div> tags that contain a class named tweet-text.
- Cleans up the text within each tweet, converting HTML-escaped character sequences (entities) back into the characters they represent.
- The for comprehension in extractTweets shows how to combine a generator, guards, and intermediate value definitions, and yield a result.
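The for comprehension pattern described above can be tried without any HTML libraries. The following is a minimal sketch, not the code from this project: Element is a hypothetical stand-in for HTMLCleaner’s TagNode, and ForComprehensionSketch and tweetTexts are names made up for illustration. It assumes, as the null check in the original code suggests, that a missing class attribute shows up as null.

```scala
// Hypothetical stand-in for an HTML element: a class attribute that may be
// null, plus the element's text. Not part of HTMLCleaner.
final case class Element(cssClass: String, text: String)

object ForComprehensionSketch {

  // Keep only the text of elements whose class attribute contains "tweet-text".
  def tweetTexts(elements: Seq[Element]): Seq[String] =
    for {
      e <- elements                          // generator: one element at a time
      currentClass = e.cssClass              // definition: bind an intermediate value
      if currentClass != null                // guard: skip elements with no class
      if currentClass.contains("tweet-text") // guard: keep only tweet elements
      tweetText = e.text.trim                // definition: tidy up the text
    } yield tweetText

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      Element("js-tweet-text tweet-text", "  Hello, world  "),
      Element(null, "no class attribute"),
      Element("sidebar", "not a tweet"),
      Element("tweet-text", "Second tweet")
    )
    tweetTexts(sample).foreach(println)
  }
}
```

Running main on this sample data prints only the two entries whose class contains tweet-text, with the surrounding whitespace trimmed. The guards run in order, so the null check protects the contains call that follows it.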
That’s a quick summary of how this code works. Again, I wrote it as a test to see if I could extract tweets from a Twitter page. This approach works pretty well. Its biggest drawback is that it only works on public Twitter pages, so if you want to extract information from your own private lists and saved searches, you’ll have to use an approach like what I showed in my “How to create a Twitter client in Scala” tutorial.
build.sbt file, test code
FWIW, here’s my build.sbt file for this project:
name := "ScalaJ"

version := "1.0"

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
    "org.scalaj" %% "scalaj-http" % "2.3.0",
    "org.scala-lang.modules" %% "scala-xml" % "1.0.3"
)
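The tweet-scraping code also imports HTMLCleaner and Apache Commons Lang, which this build file does not list, so those jars have to reach the classpath some other way (for example, from the project’s lib folder). If you would rather let sbt manage them, a sketch of the extra setting might look like the following. The versions shown are assumptions, not the ones used for this project, so check them against Maven Central first.

```scala
// Assumed versions, chosen only for illustration; verify before use.
libraryDependencies ++= Seq(
    "net.sourceforge.htmlcleaner" % "htmlcleaner"   % "2.16",  // provides org.htmlcleaner.HtmlCleaner
    "org.apache.commons"          % "commons-lang3" % "3.4"    // provides StringEscapeUtils
)
```

Both are plain Java libraries, so they use a single % rather than %%, which would append the Scala version to the artifact name.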
And here’s a little test class I used when I first started working with ScalaJ:
import scalaj.http._

object TestHead extends App {

    val response: HttpResponse[String] = Http("http://www.google.com")
        .method("HEAD")
        .timeout(connTimeoutMs = 1000, readTimeoutMs = 5000)
        .asString

    for ((k,v) <- response.headers) println(s"key: $k\nvalue: $v\n")

    //response.body
    //println(response.code)
    //response.cookies
    //println(response)

}
Summary
In summary, if you wanted to see how to use ScalaJ-HTTP and HTMLCleaner to create a Twitter client in Scala, I hope this is helpful.