I’ve used HtmlCleaner many times before to read/parse HTML content, but jsoup worked well today as a way to modify HTML content using Scala.
I’m putting this Scala shell script out here as a “source code snippet” so I can find it again if I need it. The script reads an input file that contains a series of HTML <h1> tags. I use it as part of a process of publishing an Amazon Kindle ebook from an HTML file: in one step of that process, the script helps create the Table of Contents (TOC) for the book.
Here’s the source code:
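As a rough illustration of the approach (a sketch, not necessarily the original script), the following Scala code uses jsoup to find every <h1> element, give each one an `id`, and build a list of TOC links. The names `TocSketch` and `buildToc`, and the `chapter-N` id scheme, are hypothetical and chosen only for this example. It assumes jsoup is on the classpath.

```scala
import org.jsoup.Jsoup

object TocSketch {
  // Returns (modified HTML body, TOC list items) for the given HTML source.
  def buildToc(html: String): (String, Seq[String]) = {
    val doc = Jsoup.parse(html)
    val headings = doc.select("h1")
    val toc = for (i <- 0 until headings.size()) yield {
      val h = headings.get(i)
      val id = s"chapter-${i + 1}"   // hypothetical id scheme
      h.attr("id", id)               // modify the heading in place
      // Note: the heading text is inserted as-is, without re-escaping.
      s"""<li><a href="#$id">${h.text()}</a></li>"""
    }
    (doc.body().html(), toc)
  }
}
```

Calling `TocSketch.buildToc(html)` with the contents of the book's HTML file should give back the body with anchored headings plus one `<li>` line per chapter, which can then be pasted into the TOC page.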
If you ever need to get the “cleaned” HTML as a String from the Java HtmlCleaner project, I hope this example will help:
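A minimal sketch of one way to do this from Scala, assuming HtmlCleaner 2.2 or later is on the classpath: clean the input into a `TagNode`, then serialize that node back to a `String` with one of the library's HTML serializers. The object and method names (`CleanedHtmlSketch`, `cleanToString`) are made up for this example.

```scala
import org.htmlcleaner.{HtmlCleaner, SimpleHtmlSerializer}

object CleanedHtmlSketch {
  // Cleans possibly malformed HTML and returns the result as a String.
  def cleanToString(dirtyHtml: String): String = {
    val cleaner = new HtmlCleaner
    val props = cleaner.getProperties        // default cleaner settings
    val rootNode = cleaner.clean(dirtyHtml)  // parse and repair into a TagNode tree
    new SimpleHtmlSerializer(props).getAsString(rootNode)
  }
}
```

`SimpleHtmlSerializer` is used here only as one choice; the library also ships `PrettyHtmlSerializer` and `CompactHtmlSerializer`, which take the same `CleanerProperties` argument and differ in how they lay out whitespace.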
Q: What should an SBT entry for the HtmlCleaner library look like?
A: As of August 12, 2012, my entry looks like this:
libraryDependencies += "net.sourceforge.htmlcleaner" % "htmlcleaner" % "2.2"
While XML parsers work great for well-formed XML, out on the 'real world' internet you can't count on HTML being XHTML, or even being well-formed. As a result, various 'HTML cleaner' libraries for Java have appeared. They attempt to repair the HTML into a well-formed document tree so you can parse it.
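To make the "clean first, then parse" idea concrete, here is a small sketch that uses HtmlCleaner (one such library) to pull link targets out of sloppy markup. It assumes HtmlCleaner is on the classpath; `LinkExtractSketch` and `hrefs` are hypothetical names used only for illustration.

```scala
import org.htmlcleaner.HtmlCleaner

object LinkExtractSketch {
  // Cleans the input, then collects the href of every <a> element found.
  def hrefs(dirtyHtml: String): Seq[String] = {
    val root = new HtmlCleaner().clean(dirtyHtml)
    root.getElementsByName("a", true).toSeq   // true = search recursively
      .map(_.getAttributeByName("href"))      // null when the attribute is missing
      .filter(_ != null)
  }
}
```

Input such as `<a href="/one">one</a><p>unclosed <a href=/two>two`, with its unquoted attribute and missing closing tags, would be rejected by a strict XML parser, but after cleaning it can be walked like any other element tree.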