How to get cleaned HTML as a String from HTMLCleaner

If you need to get the “cleaned” HTML as a String from the Java HTMLCleaner library, I hope this Scala example will help:

import org.htmlcleaner.{HtmlCleaner, PrettyXmlSerializer}

object RmHtmlComments extends App {

    val html = """
<html>
    <head>
        <title>Hello</title>
    </head>
    <body>
        <!-- TODO: yada yada yada -->
        <p>Hello, world</p>
    </body>
</html>
"""

    val cleaner = new HtmlCleaner
    val props = cleaner.getProperties
    props.setOmitComments(true)         // omit HTML comments from the output
    val rootTagNode = cleaner.clean(html)

    // use the getAsString method on an XmlSerializer class
    val xmlSerializer = new PrettyXmlSerializer(props)
    val htmlOut = xmlSerializer.getAsString(rootTagNode)

    println(htmlOut)

}

Although that code is written in Scala, you can easily convert it to Java, since it only calls HTMLCleaner’s Java classes. I wrote it to remove the HTML comments from the String it’s given, and I can confirm that it works for that use case. The output from that code looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<html>
    
<head>
        <title>Hello</title>
    </head>
    
<body>
        <p>Hello, world</p>
    </body>
</html>

Note that the Scala code uses the HTMLCleaner PrettyXmlSerializer class. You can use the SimpleXmlSerializer class in exactly the same way, but its output is a little uglier.
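
As a rough sketch of that swap, here is the same approach wrapped in a small helper that uses SimpleXmlSerializer. The names HtmlToString and cleanToString are my own, not part of HTMLCleaner, and the code assumes the same HTMLCleaner jar is on the classpath. I haven’t shown the output here because the exact whitespace depends on the serializer:

```scala
import org.htmlcleaner.{HtmlCleaner, SimpleXmlSerializer}

// Hypothetical helper object; these names are illustrative, not part of HTMLCleaner.
object HtmlToString {

    // Clean the given HTML, omit its comments, and return the result as a String.
    def cleanToString(html: String): String = {
        val cleaner = new HtmlCleaner
        val props = cleaner.getProperties
        props.setOmitComments(true)    // omit HTML comments, as in the example above
        val rootTagNode = cleaner.clean(html)

        // same getAsString call as before, but on a SimpleXmlSerializer
        new SimpleXmlSerializer(props).getAsString(rootTagNode)
    }

    def main(args: Array[String]): Unit = {
        val html = "<html><body><!-- TODO --><p>Hello, world</p></body></html>"
        println(cleanToString(html))
    }
}
```

Because the helper returns a plain String, you can pass the result straight to whatever needs it next, such as a file writer or a template.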

In summary, if you need to get the output from HTMLCleaner as a String, I hope this example is helpful.