Scala: Extracting data from an array of XML elements

Problem: Your XML data has an array of elements, and you need to extract the first element, second element, or more generally, the Nth element, using Scala.

Solution

The following simplified version of the XML from the Yahoo Weather API has three <forecast> elements:

val weather = <rss>
<channel>
<title>Yahoo! Weather - Boulder, CO</title>
<item>
<!-- multiple yweather:forecast elements -->
<forecast day="Thu" date="10 Nov 2011" low="37" high="58" 
          text="Partly Cloudy" code="29" />
<forecast day="Fri" date="11 Nov 2011" low="39" high="58" 
          text="Mostly Cloudy" code="28" />
<forecast day="Sat" date="12 Nov 2011" low="32" high="49" text="Cloudy" 
          code="27" />
</item>
</channel>
</rss>

To access the data in the first <forecast> element, wrap the XPath-like expression in parentheses and append (0) to it. You can reach the first element with a series of \ method calls:

val day = (weather \ "channel" \ "item" \ "forecast")(0) \ "@day"
val date = (weather \ "channel" \ "item" \ "forecast")(0) \ "@date"

Or you can access it with a single \\ method call, if you prefer:

val low = (weather \\ "forecast")(0) \ "@low"
val high = (weather \\ "forecast")(0) \ "@high"

Either approach yields this result:

scala> val date = (weather \\ "forecast")(0) \ "@date"
date: scala.xml.NodeSeq = 10 Nov 2011
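
The two operators are not interchangeable in general: \ looks only at direct children, while \\ searches all descendants. The following sketch illustrates the difference. It assumes the scala-xml module is on the classpath; the object name SelectorDemo and the trimmed copy of the weather XML are for illustration only:

```scala
import scala.xml.{Elem, NodeSeq}

object SelectorDemo {
  // Trimmed copy of the weather XML, for illustration only
  val weather: Elem =
    <rss>
      <channel>
        <item>
          <forecast day="Thu" date="10 Nov 2011" low="37" high="58" />
          <forecast day="Fri" date="11 Nov 2011" low="39" high="58" />
          <forecast day="Sat" date="12 Nov 2011" low="32" high="49" />
        </item>
      </channel>
    </rss>

  // \ matches direct children only, so every level must be spelled out
  val byPath: NodeSeq = weather \ "channel" \ "item" \ "forecast"

  // \\ searches all descendants, so a single call is enough
  val byDeepSearch: NodeSeq = weather \\ "forecast"

  // <forecast> is not a direct child of <rss>, so this NodeSeq is empty
  val directChildren: NodeSeq = weather \ "forecast"
}
```

Here byPath and byDeepSearch each hold the three <forecast> nodes, while directChildren is empty.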

Better yet, create a forecasts value first, and then extract the attributes from it:

// 1) create a NodeSeq with the three <forecast> elements
val forecasts = weather \ "channel" \ "item" \ "forecast"

// 2) extract the attributes
val day  = forecasts(0) \ "@day"    // Thu (as a NodeSeq)
val date = forecasts(0) \ "@date"   // 10 Nov 2011
val low  = forecasts(0) \ "@low"    // 37
val high = forecasts(0) \ "@high"   // 58
val text = forecasts(0) \ "@text"   // Partly Cloudy

This approach returns each attribute as a NodeSeq:

scala> val day = forecasts(0) \ "@day"
day: scala.xml.NodeSeq = Thu

To extract an attribute as a String instead, call the text method on the result:

scala> val day = (forecasts(0) \ "@day").text
day: String = Thu

If the attribute doesn’t exist, this returns an empty string:

scala> val foo = ((weather \\ "forecast")(0) \ "@FOO").text
foo: String = ""
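
If you would rather get an Option than an empty string, one possibility is a small wrapper around the lookup. This is a minimal sketch, assuming the scala-xml module is on the classpath. The helper name attrOpt is hypothetical and not part of the library:

```scala
import scala.util.Try
import scala.xml.Node

object AttrDemo {
  // Hypothetical helper: None when the lookup yields nothing, Some(value) otherwise
  def attrOpt(node: Node, name: String): Option[String] = {
    val result = node \ s"@$name"
    if (result.isEmpty) None else Some(result.text)
  }

  val forecast: Node =
    <forecast day="Thu" date="10 Nov 2011" low="37" high="58" />

  val day: Option[String] = attrOpt(forecast, "day")   // Some("Thu")
  val foo: Option[String] = attrOpt(forecast, "FOO")   // None

  // Numeric attributes can be parsed without risking an exception
  val low: Option[Int] =
    attrOpt(forecast, "low").flatMap(s => Try(s.toInt).toOption)   // Some(37)
}
```

The caller can then pattern match on the result or supply a default with getOrElse, instead of checking for an empty string.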

You can access data from the other <forecast> elements in the same way. Here’s the date from the second element in the sequence:

scala> val date = ((weather \\ "forecast")(1) \ "@date").text
date: String = 11 Nov 2011

As with any indexed sequence, you need to be careful: if you try to access an element that doesn’t exist, you’ll get an IndexOutOfBoundsException:

scala> val date = ((weather \\ "forecast")(49) \ "@date").text
java.lang.IndexOutOfBoundsException: 49
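
Because a NodeSeq is a Seq[Node], one way to avoid the exception is the standard lift method, which returns an Option instead of throwing. This is a minimal sketch, assuming the scala-xml module is on the classpath. The object and value names are illustrative:

```scala
import scala.xml.{Elem, NodeSeq}

object SafeIndexDemo {
  // Trimmed copy of the weather XML, for illustration only
  val weather: Elem =
    <rss>
      <channel>
        <item>
          <forecast day="Thu" date="10 Nov 2011" low="37" high="58" />
          <forecast day="Fri" date="11 Nov 2011" low="39" high="58" />
          <forecast day="Sat" date="12 Nov 2011" low="32" high="49" />
        </item>
      </channel>
    </rss>

  val forecasts: NodeSeq = weather \\ "forecast"

  // lift(i) yields Some(node) for a valid index and None otherwise
  val secondDate: Option[String] =
    forecasts.lift(1).map(n => (n \ "@date").text)    // Some("11 Nov 2011")

  val missingDate: Option[String] =
    forecasts.lift(49).map(n => (n \ "@date").text)   // None
}
```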

Iterating over the elements

If you want to handle the <forecast> nodes in a loop instead of accessing them by index, first grab all of them with an XPath-like expression, and then iterate over the result:

val forecastNodes = (weather \\ "forecast")

forecastNodes.foreach{ n => 
  val day  = (n \ "@day").text
  val date = (n \ "@date").text
  val low  = (n \ "@low").text
  println(s"$day, $date, Low: $low")
}

This results in the following output:

Thu, 10 Nov 2011, Low: 37
Fri, 11 Nov 2011, Low: 39
Sat, 12 Nov 2011, Low: 32
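
If you need the values later rather than just printing them, one option is to map each node to a case class. This is a sketch, assuming the scala-xml module is on the classpath. The Forecast case class and the toForecast helper are hypothetical names introduced for illustration:

```scala
import scala.xml.{Elem, Node}

object ForecastDemo {
  // Hypothetical model class for one <forecast> element
  final case class Forecast(day: String, date: String, low: Int, high: Int)

  // Trimmed copy of the weather XML, for illustration only
  val weather: Elem =
    <rss>
      <channel>
        <item>
          <forecast day="Thu" date="10 Nov 2011" low="37" high="58" />
          <forecast day="Fri" date="11 Nov 2011" low="39" high="58" />
          <forecast day="Sat" date="12 Nov 2011" low="32" high="49" />
        </item>
      </channel>
    </rss>

  // Note: toInt throws NumberFormatException if an attribute is missing or not numeric
  def toForecast(n: Node): Forecast =
    Forecast(
      day  = (n \ "@day").text,
      date = (n \ "@date").text,
      low  = (n \ "@low").text.toInt,
      high = (n \ "@high").text.toInt
    )

  // Convert to a List first so the result is a plain Scala collection
  val forecasts: List[Forecast] = (weather \\ "forecast").toList.map(toForecast)
}
```

With the data in a List[Forecast], ordinary collection methods apply. For example, forecasts.map(_.low).min gives the lowest forecast temperature.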

Discussion

To understand this approach, it helps to see that when you access elements by index, the first part of the expression finds the <forecast> elements and returns them as a NodeSeq:

scala> weather \\ "forecast"
res0: scala.xml.NodeSeq = NodeSeq(
<forecast high="58" low="37" day="Thu" code="29" date="10 Nov 2011" 
          text="Partly Cloudy"></forecast>, 
<forecast high="58" low="39" day="Fri" code="28" date="11 Nov 2011" 
          text="Mostly Cloudy"></forecast>, 
<forecast high="49" low="32" day="Sat" code="27" date="12 Nov 2011" 
          text="Cloudy"></forecast>)

Enclosing the expression in parentheses and adding (0) after it returns the first element (index 0) of that sequence:

scala> (weather \\ "forecast")(0)
res1: scala.xml.Node = <forecast high="58" low="37" day="Thu" code="29" 
date="10 Nov 2011" text="Partly Cloudy"></forecast>

Each element in the NodeSeq is an Elem instance:

scala> (weather \\ "forecast")(0).getClass
res0: Class[_ <: scala.xml.Node] = class scala.xml.Elem

Therefore, once you’re working with a single <forecast> element, you can access its attributes, such as the day attribute:

scala> (weather \\ "forecast")(0) \ "@day"
res1: scala.xml.NodeSeq = Thu

As with any Scala sequence, add (1), (2), etc. to access the other <forecast> elements.

See Also