Creating a CliffsNotes version of Daniel Ingram's "Mastering the Core Teachings" book

Summary: In this article I share an approach I used to make a condensed version of a PDF book by printing only the first two sentences of each paragraph in the book. In the long term I hope this will help me create a “CliffsNotes” version of the book.

Ever since I first read Daniel Ingram’s Mastering the Core Teachings of the Buddha: An Unusually Hardcore Dharma Book, I’ve wanted to create a significantly smaller version of it, something like a “CliffsNotes” edition.

Last night I got started on that endeavor. I researched how to read PDF documents from a Scala (or Java) program, then wrote a small Scala program that reads his PDF, grabs the first two sentences of each paragraph, and prints those sentences to a new document. I also tried to programmatically determine which paragraphs are actually chapter headings/titles, and I wrote those out a little differently.
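The core step, keeping the first two sentences of a paragraph, can be sketched on plain strings without any PDF handling. This is a simplified illustration rather than the program listed later in this article: the names FirstTwoSentences and firstSentences and the boundary regex are assumptions made for the example, and the regex will split incorrectly on abbreviations such as "Dr.".

```scala
object FirstTwoSentences {

  // Assumed sentence boundary: whitespace that follows ., ! or ?
  // (optionally with a closing quote). The lookbehinds keep the
  // punctuation attached to the sentence instead of consuming it.
  private val sentenceBreak = "(?<=[.!?])\\s+|(?<=[.!?][\"”])\\s+"

  // Returns the first n sentences of a paragraph, joined by single spaces.
  def firstSentences(paragraph: String, n: Int = 2): String =
    paragraph.trim.split(sentenceBreak).take(n).mkString(" ")
}
```

Calling FirstTwoSentences.firstSentences on a paragraph with fewer than two sentences simply returns what is there.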

Skipping over all of the technical problems ... the beginning of the document looks like this:

Mastering the Core Teachings of the Buddha (abridged)

The result may be a little choppy, but having read the book, it makes sense to me.

So far the resulting document is an 89-page PDF, but by fixing the formatting I can get that down to ~75 pages. I hope to use this document to (a) help me better understand the organization of the book, and (b) help me in my effort to create a much smaller version of the original book.

The Scala source code that reads the PDF

The Scala source code I wrote to read the PDF and output the results to a file is shown below. As a warning, the code is pretty shabby, for the reasons I describe in the source code comments.

That being said, here’s the code:

package tests

import org.apache.pdfbox.util.PDFTextStripper
import org.apache.pdfbox.pdmodel.PDDocument
import java.io.File
import java.io.ByteArrayOutputStream
import java.io.StringWriter
import java.io.PrintStream
import java.util.regex.Pattern
import util.control.Breaks._

/**
 * This code was written to read Daniel Ingram's PDF, "Mastering the Core Teachings of the Buddha."
 *
 * The purpose of this code is to read the PDF and write the first two sentences of each paragraph
 * to an output file.
 *
 * What this code does is (a) get the text from the PDF, (b) attempt to figure out certain things
 * about the text, and then (c) write the text out as HTML. In addition to trying to figure out what
 * a paragraph is (and what the first two sentences in each paragraph are), the code also attempts to
 * determine which parts of the text are chapter headers (or titles).
 *
 * The code is pretty shabby. This is the first time I've ever tried to read a PDF using Scala
 * or Java, and I wrote it all in one night. The text you get from the PDF is also surprisingly
 * shabby. I thought it would be easy to determine what a paragraph was, what a sentence was, etc.,
 * but all you get is a big chunk of text that you have to work through. (As just one example of
 * the agony, the text you get includes the chapter header that is printed at the top of every
 * page, and there's no way to distinguish that from the other text.)
 */
object ProcessMasteringHardcodeTeachings extends App {

    // the header and footer html
    import Global._

    val pdf = PDDocument.load(new File("/Users/Al/Projects/Scala/DanielIngram/book/Daniel-Ingram-Master-Hardcore-Teachings.pdf"))
   
    // comment-out to print to stdout as usual
    System.setOut(new PrintStream("/Users/Al/Projects/Scala/DanielIngram/book/book.html"))

    // pp. 10-374 is the complete book. chapter 9 is pp. 85-101 (i used it for testing).
    val stripper = new PDFTextStripper
    stripper.setStartPage(10)
    stripper.setEndPage(374)
   
    // the last regex lets me get each header and also keep the delimiter that it matches;
    // found the regex idea here: stackoverflow.com/questions/2206378/how-to-split-a-string-but-also-keep-the-delimiters
    //val paraBreakRegex = "\\.\\n|\\.”\\n|\\?\n|[A-Z]\\n"
    val paraBreakRegex = "\\.\\n|\\.”\\n|\\?\n|(?<=[A-Z]\\n)"
    val lineBreakRegex = "\\. |\\.” |!” |\\?” |\\? |! "

    printBeginningOfHtml

    // this block reads each paragraph, and then each sentence in each paragraph
    val pdfText = stripper.getText(pdf)
    val paragraphs = pdfText.split(paraBreakRegex)
    val numberMatchPattern = Pattern.compile("[1-9]")
    var lastHeader = ""

    // main loop
    for (p <- paragraphs) { //for each paragraph ...
        var lineCount = 0
        for (line <- p.split(lineBreakRegex)) {  //for each line in each paragraph ...
            breakable {
                if (lineCount > 1) break
                val cleanLine = repairTheLine(line)
                if (cleanLine == lastHeader) break  // this is an attempt to get rid of the titles that appear at the top of pages
                if (cleanLine == cleanLine.toUpperCase) {  //header lines are all-caps
                    lastHeader = cleanLine.replaceAll("[0-9]\\.", "").trim
                    printHeader(determineIfNumberPatternMatchesLine(cleanLine), lastHeader)
                } else {
                    // TODO this algorithm will mess up the html if a paragraph has only one line
                    printNonHeader(lineCount, cleanLine)
                    lineCount += 1
                }
            } //breakable
        }
    }

    printEndOfHtml
    pdf.close
   
    def determineIfNumberPatternMatchesLine(line: String) = {
        val matcher = numberMatchPattern.matcher(line)
        matcher.find
    }
   
    def printEndOfHtml {
        println("</ul>")
        println(footer)
    }
   
    def printBeginningOfHtml {
        println(header)
        println("<ul>")
    }
   
    // headers containing a digit are printed as h1; all others as h2
    def printHeader(foundNumberInLine: Boolean, line: String) {
        if (foundNumberInLine) {
            printH1(line)
        } else {
            printH2(line)
        }
    }
   
    def printNonHeader(lineCount: Int, cleanLine: String) {
        if (lineCount == 0) print(s"<li>$cleanLine. ")
        if (lineCount == 1) println(s"$cleanLine.</li>")
    }
   
    def repairTheLine(s: String) = replaceQuotes(replaceNewlinesWithSpaces(s))
   
    def printH1(s: String) {
        println("</ul>")
        println(s"<h1>$s</h1>")
        println("<ul>")
    }
   
    def printH2(s: String) {
        println("</ul>")
        println(s"<h2>$s</h2>")
        println("<ul>")
    }
   
    def replaceNewlinesWithSpaces(s: String) = s.replace('\n', ' ').replaceAll("  ", " ")

    def replaceQuotes(s: String) = {
        var a = s.replaceAll("‘", "'")
        a = a.replaceAll("’", "'")
        a = a.replaceAll("“", "\"")
        a = a.replaceAll("”", "\"")
        a
    }
       
}
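Two rough edges in this code are the all-caps header test, which also matches empty or digit-only lines, and the TODO about one-sentence paragraphs leaving an unclosed <li>. The following is a minimal sketch of one way to handle both, not a drop-in replacement for the code above. The names ListItemSketch, isHeader, and listItem are hypothetical. It assumes, as the original does, that the sentence delimiter was removed by the split and a period is re-added.

```scala
object ListItemSketch {

  // A line counts as a header only if it has at least one letter
  // and upper-casing it changes nothing (i.e. no lowercase letters).
  def isHeader(line: String): Boolean =
    line.exists(_.isLetter) && line == line.toUpperCase

  // Builds one complete <li> element from up to two sentences, so the
  // closing tag is emitted even when a paragraph has a single sentence.
  // Returns None when there is nothing to print.
  def listItem(sentences: Seq[String]): Option[String] = {
    val kept = sentences.map(_.trim).filter(_.nonEmpty).take(2)
    if (kept.isEmpty) None
    else Some(kept.mkString("<li>", ". ", ".</li>"))
  }
}
```

Building the whole list item in one place, instead of printing the opening and closing tags in separate loop iterations, is what keeps the HTML balanced.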

The code requires a file named Global.scala, which contains the header and footer for the resulting document, most of which is some CSS that I cobbled together:

package tests

object Global {

    val header = """
<html>

<head>

<style>
ul {
    list-style-type: none;
}
li {
    padding: 5px 0;
    margin-left: 50px;
    list-style-position: inside;
    text-indent: 2em;
}

body {
    background-color: #f0f0f0;
    color: #333;
    font-family: Palatino,Georgia,Times;
    font-size: 18px;
    width: 900px;
    margin-left: 10%;
    margin-right: 10%;
    line-height: 1.5em;
}
h1 {
    color: #333399;
    font-size: 24px;
    margin-left: 40px;
    padding-top: 24px;
}
h2 {
    color: #444;
    margin-left: 40px;
}
</style>
        
</head>

<body>""".stripMargin
        
    val footer = """
</body>
</html>""".stripMargin

}

The code relies on the Apache PDFBox library to read the PDF. (As a word of warning, documentation for that project is virtually non-existent.)
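The import org.apache.pdfbox.util.PDFTextStripper used above matches the PDFBox 1.8.x line; in PDFBox 2.x that class lives in org.apache.pdfbox.text instead. If you build with sbt, a dependency line along these lines should pull in a compatible release. The exact version number here is an assumption, so adjust it to whatever 1.8.x release you want to use.

```scala
// build.sbt fragment (the version is an assumption; any 1.8.x release
// should match the org.apache.pdfbox.util import used in this article)
libraryDependencies += "org.apache.pdfbox" % "pdfbox" % "1.8.16"
```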

Closing thoughts

I don’t think I can legally share the PDF I created from the original book, certainly not without contacting Mr. Ingram. Also, I don’t think the PDF is very helpful yet, so there’s really no reason to even think about sharing it. If/when I ever get the PDF to a point where I think it could be useful for other people I’ll contact Mr. Ingram, but until then, I hope this approach and code have been interesting.