Summary: In this article I share an approach I used to make a condensed version of a PDF book by printing only the first two sentences of each paragraph in the book. In the long term I hope this will help me create a “CliffsNotes” version of the book.
Ever since I first read Daniel Ingram’s Mastering the Core Teachings of the Buddha: An Unusually Hardcore Dharma Book, I’ve wanted to create a significantly smaller version of it, something like a “CliffsNotes” edition.
Last night I took some time to get started on that endeavor. I researched how to read PDF documents from a Scala (or Java) program, and then wrote a small Scala program to read his PDF, grab the first two sentences of each paragraph, and print those sentences to a new document. I also made an attempt to programmatically determine which paragraphs are actually chapter headings/titles, and I wrote those out a little differently.
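To make the “first two sentences of each paragraph” idea concrete, here is a minimal sketch that works on a plain string instead of a PDF. The names `FirstTwoSentences` and `condense`, and the two simplified regexes, are assumptions made for this illustration; the full program later in this article uses somewhat different patterns and writes HTML.

```scala
object FirstTwoSentences {
  // Assumption: a paragraph ends where sentence-ending punctuation is followed by a newline.
  val paragraphBreak = "(?<=[.?!])\\n"
  // Assumption: a sentence ends where sentence-ending punctuation is followed by a space.
  val sentenceBreak = "(?<=[.?!]) "

  // Returns the first two sentences of each paragraph, one string per paragraph.
  def condense(text: String): Seq[String] =
    text.split(paragraphBreak).toSeq
      .map(_.replace('\n', ' ').trim) // undo hard line wraps inside a paragraph
      .filter(_.nonEmpty)
      .map(_.split(sentenceBreak).take(2).mkString(" "))
}
```

Because both patterns use a lookbehind for the punctuation, the punctuation stays attached to each sentence, so nothing has to be re-added when the sentences are joined again. Abbreviations such as “e.g.” will still be mistaken for sentence endings, which is one reason the output can read a little choppy.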
Skipping over all of the technical problems ... the beginning of the document looks like this:

The result may be a little choppy, but having read the book, it makes sense to me.
So far the resulting document is an 89-page PDF, but by fixing the formatting I can get that down to ~75 pages. I hope to use this document to (a) help me better understand the organization of the book, and (b) help me in my effort to create a much smaller version of the original book.
The Scala source code that reads the PDF
The Scala source code I wrote to read the PDF and output the results to a file is shown below. As a warning, the code is pretty shabby, for the reasons I describe in the source code comments.
That being said, here’s the code:
package tests
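// note: this is the PDFBox 1.x package; in PDFBox 2.x the class is org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.util.PDFTextStripper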
import org.apache.pdfbox.pdmodel.PDDocument
import java.io.File
import java.io.ByteArrayOutputStream
import java.io.StringWriter
import java.io.PrintStream
import java.util.regex.Pattern
import util.control.Breaks._
/**
* This code was written to read Daniel Ingram's PDF, "Mastering the Core Teachings of the Buddha."
*
* The purpose of this code is to read the PDF and write the first two sentences of each paragraph
* to an output file.
*
* What this code does is (a) gets the text from the PDF, (b) attempts to figure out certain things
* about the text, and then (c) writes the text out as HTML. In addition to trying to figure out what
* a paragraph is (and what the first two sentences in each paragraph are), the code also attempts to
* determine which parts of the text are chapter headers (or titles).
*
* The code is pretty shabby. This is the first time I've ever tried to read a PDF using Scala
* or Java, and I wrote it all in one night. The text you get from the PDF is also surprisingly
* shabby. I thought it would be easy to determine what a paragraph was, what a sentence was, etc.,
* but all you get is a big chunk of text that you have to work through. (As just one example of
* the agony, the text you get includes the chapter header that is printed at the top of every
* page, and there's no way to distinguish that from the other text.)
*/
object ProcessMasteringHardcodeTeachings extends App {
// the header and footer html
import Global._
val pdf = PDDocument.load(new File("/Users/Al/Projects/Scala/DanielIngram/book/Daniel-Ingram-Master-Hardcore-Teachings.pdf"))
// comment-out to print to stdout as usual
System.setOut(new PrintStream("/Users/Al/Projects/Scala/DanielIngram/book/book.html"))
// pp. 10-374 is the complete book. chapter 9 is pp. 85-101 (i used it for testing).
val stripper = new PDFTextStripper
stripper.setStartPage(10)
stripper.setEndPage(374)
// the last regex lets me get each header and also keep the delimiter that it matches;
// found the regex idea here: stackoverflow.com/questions/2206378/how-to-split-a-string-but-also-keep-the-delimiters
//val paraBreakRegex = "\\.\\n|\\.”\\n|\\?\n|[A-Z]\\n"
val paraBreakRegex = "\\.\\n|\\.”\\n|\\?\n|(?<=[A-Z]\\n)"
val lineBreakRegex = "\\. |\\.” |!” |\\?” |\\? |! "
printBeginningOfHtml
// this block reads each paragraph, and then each sentence in each paragraph
val pdfText = stripper.getText(pdf)
val paragraphs = pdfText.split(paraBreakRegex)
val numberMatchPattern = Pattern.compile("[1-9]")
var lastHeader = ""
// main loop
for (p <- paragraphs) { //for each paragraph ...
var lineCount = 0
for (line <- p.split(lineBreakRegex)) { //for each line in each paragraph ...
breakable {
if (lineCount > 1) break
val cleanLine = repairTheLine(line)
if (cleanLine == lastHeader) break // this is an attempt to get rid of the titles that appear at the top of pages
if (cleanLine == cleanLine.toUpperCase) { //header lines are all-caps
lastHeader = cleanLine.replaceAll("[0-9]\\.", "").trim
printHeader(determineIfNumberPatternMatchesLine(cleanLine), lastHeader)
} else {
// TODO this algorithm will mess up the html if a paragraph has only one line
printNonHeader(lineCount, cleanLine)
lineCount += 1
}
} //breakable
}
}
printEndOfHtml
pdf.close
def determineIfNumberPatternMatchesLine(line: String) = {
val matcher = numberMatchPattern.matcher(line)
matcher.find
}
def printEndOfHtml {
println("</ul>")
println(footer)
}
def printBeginningOfHtml {
println(header)
println("<ul>")
}
def printHeader(foundNumberInLine: Boolean, line: String) {
if (foundNumberInLine) {
printH1(lastHeader)
} else {
printH2(lastHeader)
}
}
def printNonHeader(lineCount: Int, cleanLine: String) {
if (lineCount == 0) print(s"<li>$cleanLine. ")
if (lineCount == 1) println(s"$cleanLine.</li>")
}
def repairTheLine(s: String) = replaceQuotes(replaceNewlinesWithSpaces(s))
def printH1(s: String) {
println("</ul>")
println(s"<h1>$s</h1>")
println("<ul>")
}
def printH2(s: String) {
println("</ul>")
println(s"<h2>$s</h2>")
println("<ul>")
}
// turn newlines into spaces, then collapse any runs of spaces into a single space
def replaceNewlinesWithSpaces(s: String) = s.replace('\n', ' ').replaceAll(" +", " ")
def replaceQuotes(s: String) = {
var a = s.replaceAll("‘", "'")
a = a.replaceAll("’", "'")
a = a.replaceAll("“", "\"")
a = a.replaceAll("”", "\"")
a
}
}
The code requires a file named Global.scala, which contains the header and footer for the resulting document, most of which is some CSS that I cobbled together:
package tests
object Global {
val header = """
<html>
<head>
<style>
ul {
list-style-type: none;
}
li {
padding: 5px 0;
margin-left: 50px;
list-style-position: inside;
text-indent: 2em;
}
body {
background-color: #f0f0f0;
color: #333;
font-family: Palatino,Georgia,Times;
font-size: 18px;
width: 900px;
margin-left: 10%;
margin-right: 10%;
line-height: 1.5em;
}
h1 {
color: #333399;
font-size: 24px;
margin-left: 40px;
padding-top: 24px;
}
h2 {
color: #444;
margin-left: 40px;
}
</style>
</head>
<body>""".stripMargin
val footer = """
</body>
</html>""".stripMargin
}
The code relies on the Apache PDFBox library to read the PDF. (As a word of warning, documentation for that project is virtually non-existent.)
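The header detection in the program can be tried on its own, without PDFBox. The following is a small sketch of that logic, not a drop-in replacement: the object name `HeaderCheck`, its method names, and the sample titles in the usage note are hypothetical, and the non-empty guard in `isHeader` is an addition that the original program does not have.

```scala
object HeaderCheck {
  // Same test the program uses: a line counts as a header when upper-casing
  // it changes nothing. The non-empty guard is an addition for this sketch.
  def isHeader(line: String): Boolean =
    line.trim.nonEmpty && line == line.toUpperCase

  // Same idea as the [1-9] pattern in the program: a header that contains a
  // digit is treated as a numbered chapter (<h1>), anything else as <h2>.
  def headerTag(line: String): String =
    if ("[1-9]".r.findFirstIn(line).isDefined) "h1" else "h2"

  // Strips a leading "N." style chapter number and wraps the title in the tag.
  def toHtml(line: String): String = {
    val tag = headerTag(line)
    val title = line.replaceAll("[0-9]+\\.", "").trim
    s"<$tag>$title</$tag>"
  }
}
```

For example, `HeaderCheck.toHtml("1. SAMPLE CHAPTER")` yields `<h1>SAMPLE CHAPTER</h1>`, while `HeaderCheck.toHtml("A SAMPLE SECTION")` yields `<h2>A SAMPLE SECTION</h2>`. The all-caps test is a heuristic: a short sentence made up only of digits, punctuation, or acronyms would also be classified as a header.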
Closing thoughts
I don’t think I can legally share the PDF I created from the original book, certainly not without contacting Mr. Ingram. Also, I don’t think the PDF is very helpful yet, so there’s really no reason to even think about sharing it. If I ever get the PDF to a point where I think it could be useful for other people I’ll contact Mr. Ingram, but until then, I hope this approach and code have been interesting.

