Examples of converting HTML to plain text with Scala and Jsoup

If you ever need to convert HTML to plain text using Scala or Java, I hope these Jsoup examples are helpful:

import org.jsoup.Jsoup
import org.jsoup.nodes.{Document, Element}

object JsoupHtmlToPlainTextTest extends App {

    val html =
        """
          |<html>
          |  <head><title>Hello, world</title></head>
          |  <body>
          |    <h1>Hello, world</h1>
          |    <p>Hello, world.</p>
          |    <p>This is a test.</p>
          |  </body>
          |</html>
        """.stripMargin

    // Example 1: this works, but all output is on one line
    val doc: Document = Jsoup.parse(html)
    //val s: String = doc.text()     //include <head> and <body> text
    val s: String = doc.body.text()  //<body> text only
    //println(s)

    // Example 2: this works, output is on multiple lines
    val formatter = new JsoupFormatter
    val plainText = formatter.getPlainText(doc)
    //println(plainText)

    // Example 3: this works as a way to select the <body> only
    val body: String = doc.select("body").first.text()
    //println(body)

    // Example 4: works: gets text from paragraphs only
    // https://jsoup.org/cookbook/input/parse-body-fragment
    val doc4 = Jsoup.parseBodyFragment(html)
    val body4 = doc4.body()
    val paragraphs = body4.getElementsByTag("p")
    import scala.collection.JavaConverters._
    val scalaParagraphs = asScalaBuffer(paragraphs)
    for (paragraph <- scalaParagraphs) {
        println(paragraph.text)
    }

}

While this is just some test code that I’m currently working on to understand Jsoup, the code shows four different ways to convert the given HTML into plain text. Hopefully the comments explain how the HTML to plain text conversion processes work, so I won’t write more about them. I just wanted to share this code snippet here today a) so I can find it again, and b) in hopes it might help others that need to convert HTML to text using Jsoup.

Add new comment

The content of this field is kept private and will not be shown publicly.

Anonymous format

  • Allowed HTML tags: <em> <strong> <cite> <code> <ul type> <ol start type> <li> <pre>
  • Lines and paragraphs break automatically.