Summary: In this article I share an approach I used to make a condensed version of a PDF book by printing only the first two sentences of each paragraph in the book. In the long term I hope this will help me create a “CliffsNotes” version of the book.
Ever since I first read Daniel Ingram’s Mastering the Core Teachings of the Buddha: An Unusually Hardcore Dharma Book, I’ve wanted to create a significantly smaller version of it, something like a “CliffsNotes” version.
Last night I took some time to get started on that endeavor. I researched how to read PDF documents from a Scala (or Java) program, and then wrote a small Scala program to read his PDF, grab the first two sentences of each paragraph, and print those sentences to a new document. I also made an attempt to programmatically determine which paragraphs are actually chapter headings/titles, and I wrote those out a little differently.
Skipping over all of the technical problems ... the beginning of the document looks like this:
The result may be a little choppy, but having read the book, it makes sense to me.
So far the resulting document is an 89-page PDF, but by fixing the formatting I can get that down to ~75 pages. I hope to use this document to (a) help me better understand the organization of the book, and (b) help me in my effort to create a much smaller version of the original book.
The Scala source code that reads the PDF
The Scala source code I wrote to read the PDF and output the results to a file is shown below. As a warning, the code is pretty shabby, for the reasons I describe in the source code comments.
That being said, here’s the code:
package tests

import org.apache.pdfbox.util.PDFTextStripper
import org.apache.pdfbox.pdmodel.PDDocument
import java.io.File
import java.io.ByteArrayOutputStream
import java.io.StringWriter
import java.io.PrintStream
import java.util.regex.Pattern
import util.control.Breaks._

/**
 * This code was written to read Daniel Ingram's PDF, "Mastering the Core Teachings of the Buddha."
 *
 * The purpose of this code is to read the PDF and write the first two sentences of each paragraph
 * to an output file.
 *
 * What this code does is (a) gets the text from the PDF, (b) attempts to figure out certain things
 * about the text, and then (c) writes the text out as HTML. In addition to trying to figure out what
 * a paragraph is (and what the first two lines in each paragraph are), the code also attempts to
 * determine which parts of the text are chapter headers (or titles).
 *
 * The code is pretty shabby. This is the first time I've ever tried to read a PDF using Scala
 * or Java, and I wrote it all in one night. The text you get from the PDF is also surprisingly
 * shabby. I thought it would be easy to determine what a paragraph was, what a sentence was, etc.,
 * but all you get is a big chunk of text that you have to work through. (As just one example of
 * the agony, the text you get includes the chapter header that is printed at the top of every
 * page, and there's no way to distinguish that from the other text.)
 */
object ProcessMasteringHardcodeTeachings extends App {

    // the header and footer html
    import Global._

    val pdf = PDDocument.load(new File("/Users/Al/Projects/Scala/DanielIngram/book/Daniel-Ingram-Master-Hardcore-Teachings.pdf"))

    // comment-out to print to stdout as usual
    System.setOut(new PrintStream("/Users/Al/Projects/Scala/DanielIngram/book/book.html"))

    // pp. 10-374 is the complete book. chapter 9 is pp. 85-101 (i used it for testing).
    val stripper = new PDFTextStripper
    stripper.setStartPage(10)
    stripper.setEndPage(374)

    // the last regex lets me get each header and also keep the delimiter that it matches;
    // found the regex idea here: stackoverflow.com/questions/2206378/how-to-split-a-string-but-also-keep-the-delimiters
    //val paraBreakRegex = "\\.\\n|\\.”\\n|\\?\n|[A-Z]\\n"
    val paraBreakRegex = "\\.\\n|\\.”\\n|\\?\n|(?<=[A-Z]\\n)"
    val lineBreakRegex = "\\. |\\.” |!” |\\?” |\\? |! "

    printBeginningOfHtml

    // this block reads each paragraph, and then each sentence in each paragraph
    val pdfText = stripper.getText(pdf)
    val paragraphs = pdfText.split(paraBreakRegex)
    val numberMatchPattern = Pattern.compile("[1-9]")
    var lastHeader = ""

    // main loop
    for (p <- paragraphs) {   //for each paragraph ...
        var lineCount = 0
        for (line <- p.split(lineBreakRegex)) {   //for each line in each paragraph ...
            breakable {
                if (lineCount > 1) break
                val cleanLine = repairTheLine(line)
                if (cleanLine == lastHeader) break   // this is an attempt to get rid of the titles that appear at the top of pages
                if (cleanLine == cleanLine.toUpperCase) {   //header lines are all-caps
                    lastHeader = cleanLine.replaceAll("[0-9]\\.", "").trim
                    printHeader(determineIfNumberPatternMatchesLine(cleanLine), lastHeader)
                } else {
                    // TODO this algorithm will mess up the html if a paragraph has only one line
                    printNonHeader(lineCount, cleanLine)
                    lineCount += 1
                }
            } //breakable
        }
    }

    printEndOfHtml
    pdf.close

    def determineIfNumberPatternMatchesLine(line: String) = {
        val matcher = numberMatchPattern.matcher(line)
        matcher.find
    }

    def printEndOfHtml {
        println("</ul>")
        println(footer)
    }

    def printBeginningOfHtml {
        println(header)
        println("<ul>")
    }

    def printHeader(foundNumberInLine: Boolean, line: String) {
        if (foundNumberInLine) {
            printH1(lastHeader)
        } else {
            printH2(lastHeader)
        }
    }

    def printNonHeader(lineCount: Int, cleanLine: String) {
        if (lineCount == 0) print(s"<li>$cleanLine. ")
        if (lineCount == 1) println(s"$cleanLine.</li>")
    }

    def repairTheLine(s: String) = replaceQuotes(replaceNewlinesWithSpaces(s))

    def printH1(s: String) {
        println("</ul>")
        println(s"<h1>$s</h1>")
        println("<ul>")
    }

    def printH2(s: String) {
        println("</ul>")
        println(s"<h2>$s</h2>")
        println("<ul>")
    }

    // replace newlines with spaces, then collapse the double spaces that can result
    def replaceNewlinesWithSpaces(s: String) = s.replace('\n', ' ').replaceAll("  ", " ")

    def replaceQuotes(s: String) = {
        var a = s.replaceAll("‘", "'")
        a = a.replaceAll("’", "'")
        a = a.replaceAll("“", "\"")
        a = a.replaceAll("”", "\"")
        a
    }

}
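To make the splitting step easier to follow, here is a minimal sketch that applies the same two regular expressions to a small in-memory string instead of a PDF. The object name and the helpers clean, render, and condense are hypothetical names made up for this illustration, and the sample text is invented. It also shows one possible way to handle the TODO in the code above: taking at most two sentences with take(2) closes the list item even when a paragraph has only one sentence. Like the original, it ends every sentence with a period, even if the sentence originally ended with "?" or "!".

```scala
object FirstTwoSentencesSketch {

  // The two patterns from the program above, unchanged. The lookbehind alternative
  // splits right after "upper-case letter + newline" without consuming any text,
  // so an all-caps header survives as its own chunk.
  val paraBreakRegex = "\\.\\n|\\.”\\n|\\?\n|(?<=[A-Z]\\n)"
  val lineBreakRegex = "\\. |\\.” |!” |\\?” |\\? |! "

  // Hypothetical helper: join wrapped lines and trim the result.
  def clean(s: String): String = s.replace('\n', ' ').trim

  // Hypothetical helper: turn one paragraph chunk into one HTML fragment.
  def render(chunk: String): String = {
    val sentences = chunk.split(lineBreakRegex).map(clean).filter(_.nonEmpty)
    if (sentences.isEmpty) {
      ""
    } else if (sentences.forall(s => s == s.toUpperCase)) {
      // same heuristic as the original: all-caps text is treated as a header
      val title = sentences.mkString(" ")
      s"<h2>$title</h2>"
    } else {
      // keep at most two sentences; a one-sentence paragraph still gets </li>
      sentences.take(2).map(_ + ".").mkString("<li>", " ", "</li>")
    }
  }

  // Split the text into paragraph chunks and render each non-empty one.
  def condense(text: String): Seq[String] =
    text.split(paraBreakRegex).toSeq.map(render).filter(_.nonEmpty)

  def main(args: Array[String]): Unit = {
    val sample =
      "THE BASICS\n" +
      "First sentence here. Second one\ncontinues. Third is dropped.\n" +
      "Only one sentence.\n"
    condense(sample).foreach(println)
  }
}
```

With this sample the header comes out as an h2 element, the three-sentence paragraph is cut to its first two sentences, and the one-sentence paragraph still produces a complete list item. This simplified version skips the surrounding ul handling and the repeated page-title check that the full program performs.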
The code requires a file named Global.scala, which contains the header and footer for the resulting document, most of which is some CSS that I cobbled together:
// same package as the main program, so that its `import Global._` resolves
package tests

object Global {

    val header = """
<html>
<head>
<style>
ul {
    list-style-type: none;
}
li {
    padding: 5px 0;
    margin-left: 50px;
    list-style-position: inside;
    text-indent: 2em;
}
body {
    background-color: #f0f0f0;
    color: #333;
    font-family: Palatino,Georgia,Times;
    font-size: 18px;
    width: 900px;
    margin-left: 10%;
    margin-right: 10%;
    line-height: 1.5em;
}
h1 {
    color: #333399;
    font-size: 24px;
    margin-left: 40px;
    padding-top: 24px;
}
h2 {
    color: #444;
    margin-left: 40px;
}
</style>
</head>
<body>""".stripMargin

    val footer = """
</body>
</html>""".stripMargin

}
The code relies on the Apache PDFBox library to read the PDF. (As a word of warning, documentation for that project is virtually non-existent.)
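If you want to try the code, note that the import org.apache.pdfbox.util.PDFTextStripper matches the PDFBox 1.8.x line; in PDFBox 2.x the class lives in org.apache.pdfbox.text instead, and PDFBox 3.x also replaces PDDocument.load with Loader.loadPDF. The fragment below is only a sketch of how the dependency could be declared. It assumes an sbt build, which this article does not specify, and the version number is just an example of a 1.8.x release.

```scala
// build.sbt fragment (assumption: the project is built with sbt)
// 1.8.16 is an example version; any 1.8.x release should provide
// org.apache.pdfbox.util.PDFTextStripper as imported in the code above.
libraryDependencies += "org.apache.pdfbox" % "pdfbox" % "1.8.16"
```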
Closing thoughts
I don’t think I can legally share the PDF I created from the original book, certainly not without contacting Mr. Ingram. Also, I don’t think the PDF is very helpful yet, so there’s really no reason to even think about sharing it. If I ever get the PDF to a point where I think it could be useful for other people I’ll contact Mr. Ingram, but until then, I hope this approach and code have been interesting.