htmlcleaner

A Scala shell script to read HTML H1 tag attributes

I’m putting this Scala shell script out here as a “source code snippet” so I can find it again if I need it. The script reads an input file that contains a series of HTML <h1> tags. I use it when publishing an Amazon Kindle ebook from an HTML file: in one step of that process, the script helps create the Table of Contents (TOC) for the book.

Here’s the source code:

SBT entry for HTMLCleaner (SBT libraryDependencies syntax)

Q: What should an SBT entry for the HTMLCleaner library look like?

A: As of August 12, 2012, my entry looks like this:

libraryDependencies += "net.sourceforge.htmlcleaner" % "htmlcleaner" % "2.2"

Parsing “real world” HTML with Scala, HTMLCleaner, and StringEscapeUtils

While XML parsers work great for well-formed XML, out on the “real world” internet you can’t count on HTML being XHTML, or even being well-formed. As a result, various “HTML cleaner” libraries for Java have appeared. They attempt to repair the HTML (closing unclosed tags, fixing bad nesting, and so on) so you can parse it.
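As a rough illustration of that idea, here is a minimal sketch of how messy HTML might be cleaned and queried from Scala. It assumes the HTMLCleaner jar is on the classpath (for example via an SBT dependency on "net.sourceforge.htmlcleaner"). The object name `H1Extractor`, the method `h1Texts`, and the sample HTML string are made up for this example; they are not taken from the original script.

```scala
import org.htmlcleaner.HtmlCleaner

// Hypothetical helper, for illustration only
object H1Extractor {

  // Clean possibly malformed HTML and return the text of every <h1> element
  def h1Texts(html: String): Seq[String] = {
    val cleaner = new HtmlCleaner
    val root = cleaner.clean(html)          // root TagNode of the repaired document
    root.getElementsByName("h1", true)      // true = search the whole tree recursively
      .toSeq
      .map(_.getText.toString.trim)         // getText returns the element's text content
  }

  def main(args: Array[String]): Unit = {
    // Made-up input: the <p> is never closed, and <html>/<body> are missing
    val html = "<h1>Introduction</h1><p>Some text<h1>Chapter 1</h1>"
    h1Texts(html).foreach(println)
  }
}
```

The point of the sketch is that the cleaner builds a usable tree first, so the lookup by tag name does not have to care how broken the original markup was.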