By Alvin Alexander. Last updated: May 18, 2020
If you ever need to get the “cleaned” HTML as a String from the Java HTMLCleaner project, I hope this example will help:
import org.htmlcleaner.{HtmlCleaner, PrettyXmlSerializer}

object RmHtmlComments extends App {

    val html = """
    <html>
    <head>
        <title>Hello</title>
    </head>
    <body>
        <!-- TODO: yada yada yada -->
        <p>Hello, world</p>
    </body>
    </html>
    """

    val cleaner = new HtmlCleaner
    val props = cleaner.getProperties
    props.setOmitComments(true)  // rm html comments
    val rootTagNode = cleaner.clean(html)

    // use the getAsString method on an XmlSerializer class
    val xmlSerializer = new PrettyXmlSerializer(props)
    val htmlOut = xmlSerializer.getAsString(rootTagNode)

    println(htmlOut)

}
While that code is written in Scala, you can easily convert it to Java. My purpose for this code is to have it remove the HTML comments from the String it’s given, and I can verify that it works for this use case. The output from that code looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<html>
    <head>
        <title>Hello</title>
    </head>
    <body>
        <p>Hello, world</p>
    </body>
</html>
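If you want to reuse this approach, the same steps can be wrapped in a small helper method. This is only a minimal sketch: the `HtmlCommentRemover` object and its `removeComments` method are names I made up for illustration, and it assumes the HTMLCleaner jar is on your classpath.

```scala
import org.htmlcleaner.{HtmlCleaner, PrettyXmlSerializer}

// hypothetical helper object; not part of HTMLCleaner
object HtmlCommentRemover {

    // cleans the given HTML and returns it as a String with the comments removed
    def removeComments(html: String): String = {
        val cleaner = new HtmlCleaner
        val props = cleaner.getProperties
        props.setOmitComments(true)  // rm html comments
        val rootTagNode = cleaner.clean(html)
        new PrettyXmlSerializer(props).getAsString(rootTagNode)
    }

    def main(args: Array[String]): Unit = {
        val html = "<html><body><!-- TODO: yada --><p>Hello, world</p></body></html>"
        println(removeComments(html))
    }

}
```

The returned String should contain the `<p>` element but not the `TODO` comment.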
Note that the Scala code uses the HTMLCleaner PrettyXmlSerializer class. You can also use the SimpleXmlSerializer class in the same way as the PrettyXmlSerializer class, but the output is a little bit uglier.
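As a rough sketch of that swap, only the serializer class changes. The `SimpleSerializerExample` object and the `toSimpleXml` method are hypothetical names used for illustration, and I'm assuming the same HTMLCleaner version as in the example above.

```scala
import org.htmlcleaner.{HtmlCleaner, SimpleXmlSerializer}

// hypothetical example object; not part of HTMLCleaner
object SimpleSerializerExample {

    // same steps as before, but serialized with SimpleXmlSerializer
    def toSimpleXml(html: String): String = {
        val cleaner = new HtmlCleaner
        val props = cleaner.getProperties
        props.setOmitComments(true)  // rm html comments
        val rootTagNode = cleaner.clean(html)
        new SimpleXmlSerializer(props).getAsString(rootTagNode)
    }

    def main(args: Array[String]): Unit = {
        val html = "<html><body><!-- TODO: yada --><p>Hello, world</p></body></html>"
        println(toSimpleXml(html))
    }

}
```

The comment is still removed, but the output isn't indented the way the PrettyXmlSerializer output is.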
In summary, if you need to see how to get the output from HTMLCleaner as a String, I hope this is helpful.