A Scala shell script to read HTML H1 tag attributes

I’m putting this Scala shell script out here as a “source code snippet” so I can find it again if I need it. The script reads an input file that contains a series of HTML <h1> tags. I use it when publishing an Amazon Kindle ebook from an HTML file: in one step of that process, it helps create the Table of Contents (TOC) for the book.

Here’s the source code:

#!/bin/sh
exec scala -classpath ".:lib/htmlcleaner-2.2.jar:lib/commons-lang3-3.1.jar" -savecompiled "$0" "$@"
!#

import org.htmlcleaner.HtmlCleaner
import org.apache.commons.lang3.StringEscapeUtils

val INPUT_FILE = "h1tags.html"

// Read the whole file into memory, closing the source even if reading fails.
def readFile(filename: String): Seq[String] = {
    val bufferedSource = io.Source.fromFile(filename)
    try bufferedSource.getLines().toList
    finally bufferedSource.close()
}

val lines = readFile(INPUT_FILE)
val html = lines.mkString("\n")

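// Parse the HTML and collect every <h1> element (true = search recursively).
val cleaner = new HtmlCleaner
val rootNode = cleaner.clean(html)
val h1tags = rootNode.getElementsByName("h1", true)

// Keep only headings that have an id, and unescape HTML entities in the text.
val h1Seq = for {
    e <- h1tags
    idAttr = e.getAttributeByName("id")
    if idAttr != null
    idText = StringEscapeUtils.unescapeHtml4(e.getText.toString.trim)
} yield (idAttr, idText)

// Print one TOC list item per heading.
h1Seq.foreach { case (id, text) =>
    println(s"""<li><a href="#$id">$text</a><br></li>""")
}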

FWIW, the file that contains the <h1> tags has entries that look like this:

<h1 id="copyright" class="unnumbered">Copyright</h1>
<h1 id="introduction-or-why-i-wrote-this-book">Introduction(or, Why I Wrote This Book)</h1>
<h1 id="who-this-book-is-for">Who This Book is For</h1>

Knowing that the file h1tags.html contains only <h1> tags, here’s how the Scala script works: it reads the file into a single string, parses that string with HtmlCleaner, and then loops over each <h1> element, extracting the id attribute and the heading text (with HTML entities unescaped). Any <h1> tag without an id is skipped. The script then prints an HTML <li> tag for each remaining <h1> tag, and that output is used by another shell script that wraps this one to create a TOC.
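To make the input-to-output mapping concrete, here’s a minimal sketch of the same transformation that needs no extra jars. Note that it swaps HtmlCleaner for a plain regular expression, so it is not what the script above does internally. The object name H1TocSketch and the method tocLines are names I made up for this illustration. It also assumes double-quoted id attributes and headings with no nested markup, and it doesn’t unescape HTML entities.

```scala
object H1TocSketch {
  // Hypothetical stand-in for the HtmlCleaner parsing step.
  // Group 1 is the id attribute, group 2 is the heading text.
  // Assumes double-quoted ids and no nested tags inside the <h1>.
  private val H1 = """<h1\b[^>]*\bid="([^"]*)"[^>]*>(.*?)</h1>""".r

  // Build one TOC <li> line per <h1> that has an id; others never match.
  def tocLines(html: String): Seq[String] =
    H1.findAllMatchIn(html).map { m =>
      val id = m.group(1)
      val text = m.group(2).trim
      s"""<li><a href="#$id">$text</a><br></li>"""
    }.toList

  def main(args: Array[String]): Unit = {
    val sample =
      """<h1 id="copyright" class="unnumbered">Copyright</h1>
        |<h1 id="who-this-book-is-for">Who This Book is For</h1>
        |<h1>No id here, so this one is skipped</h1>""".stripMargin

    // Prints:
    // <li><a href="#copyright">Copyright</a><br></li>
    // <li><a href="#who-this-book-is-for">Who This Book is For</a><br></li>
    tocLines(sample).foreach(println)
  }
}
```

A regex like this is fine for a file you control that contains nothing but simple <h1> lines, but a real HTML parser such as HtmlCleaner is the safer choice once the markup gets messier.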

As a final note, that code uses the HtmlCleaner and Apache Commons-Lang libraries to do what it does, as implied by the classpath entry at the top of the script.
