A Scala shell script to read HTML H1 tag attributes

I’m putting this Scala shell script out here as a “source code snippet” so I can find it again if I need it. The script reads an input file that contains a series of HTML <h1> tags. I use it when publishing an Amazon Kindle ebook from an HTML file: in one step of that process, it helps create the Table of Contents (TOC) for the book.

Here’s the source code:

#!/bin/sh
exec scala -classpath ".:lib/htmlcleaner-2.2.jar:lib/commons-lang3-3.1.jar" -savecompiled "$0" "$@"
!#

import org.htmlcleaner.HtmlCleaner
import org.apache.commons.lang3.StringEscapeUtils
import scala.io.StdIn

val INPUT_FILE = "h1tags.html"

def readFile(filename: String): Seq[String] = {
    val bufferedSource = io.Source.fromFile(filename)
    // read every line into a List, then close the file even if reading fails
    try bufferedSource.getLines().toList finally bufferedSource.close()
}

val lines = readFile(INPUT_FILE)
val html = lines.mkString("\n")

val cleaner = new HtmlCleaner
val rootNode = cleaner.clean(html)
val h1tags = rootNode.getElementsByName("h1", true)
val h1Seq = for {
    e <- h1tags
    idAttr = e.getAttributeByName("id")
    if idAttr != null
    idText = StringEscapeUtils.unescapeHtml4(e.getText.toString.trim)
} yield (idAttr, idText)

// print one TOC list item per <h1> tag: e._1 is the id, e._2 is the heading text
h1Seq.foreach { e =>
    println(s"""<li><a href="#${e._1}">${e._2}</a><br></li>""")
}

FWIW, the file that contains the <h1> tags has entries that look like this:

<h1 id="copyright" class="unnumbered">Copyright</h1>
<h1 id="introduction-or-why-i-wrote-this-book">Introduction(or, Why I Wrote This Book)</h1>
<h1 id="who-this-book-is-for">Who This Book is For</h1>

Knowing that the file h1tags.html contains only <h1> tags, here’s how the Scala script works: it reads the input file into a single string, parses that string with HtmlCleaner, and collects every <h1> element. For each <h1> tag that has an id attribute, it extracts the id and the heading text, unescaping any HTML entities in the text. It then prints an HTML <li> tag for each of those <h1> tags, and that output is used by another shell script that wraps this one to create a TOC.
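If you just want to see the extract-and-format step in isolation, here is a rough, dependency-free sketch. It is not the script above: a regular expression stands in for HtmlCleaner, which is only reasonable under the assumption that every <h1> tag sits on one line with a double-quoted id attribute, and it does not unescape HTML entities. The names TocSketch, extract, and toListItem are made up for this example.

```scala
// A dependency-free sketch of the extract-and-format step.
// Assumption: a regex stands in for HtmlCleaner, which only works when
// each <h1> tag is on a single line and has a double-quoted id attribute.
object TocSketch {
  // group 1 = id attribute value, group 2 = heading text
  private val H1 = """<h1\b[^>]*\bid="([^"]*)"[^>]*>(.*?)</h1>""".r

  // Returns (id, text) pairs; <h1> tags without an id are skipped.
  def extract(html: String): Seq[(String, String)] =
    H1.findAllMatchIn(html).map(m => (m.group(1), m.group(2).trim)).toList

  // Formats one pair the same way the script's println does.
  def toListItem(id: String, text: String): String =
    s"""<li><a href="#$id">$text</a><br></li>"""

  def main(args: Array[String]): Unit = {
    val sample =
      """<h1 id="copyright" class="unnumbered">Copyright</h1>
        |<h1 id="who-this-book-is-for">Who This Book is For</h1>
        |<h1>No id here</h1>""".stripMargin
    extract(sample).foreach { case (id, text) => println(toListItem(id, text)) }
  }
}
```

Running main prints one <li> line for each of the first two sample tags and nothing for the third, because that tag has no id. For anything beyond tidy one-tag-per-line input, a real HTML parser such as HtmlCleaner is the safer choice.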

As a final note, that code uses the HtmlCleaner and Apache Commons-Lang libraries to do what it does, as implied by the classpath entry at the top of the script.