My Scala Sed project: More features, returning strings

My Scala Sed project is still a work in progress, but I got a new version working this week. My initial need was to have Sed return a String rather than printing directly to STDOUT, which makes it easier to post-process a file. After that I realized it would be really useful if the custom function I pass to Sed had two more pieces of information available to it:

  • The line number of the string Sed passed to it
  • A Map of key/value pairs the helper function could use while processing the file

Note: In this article “Sed” refers to my project, and “sed” refers to the Unix command-line utility.

Basic use

In a “basic use” scenario, this is how I use the new version of Sed in a Scala shell script to change the “layout:” lines in 55 Markdown files whose names are in the files-to-process.txt file:

#!/bin/sh
exec scala (more here) ...
!#

import (more here) ...

val filenames = readFileAsList("files-to-process.txt")

for (filename <- filenames) {
    println(s"processing $filename ...")
    val source = Source.fromFile(filename)
    val sedResult: String = SedFactory.getSed(source, updateLayout _).run
    writeFile(filename, sedResult)
}

def updateLayout(currentLine: String): SedAction = {
    if (currentLine.startsWith("layout:")) {
        UpdateLine("layout: book")
    } else {
        UpdateLine(currentLine)
    }
}

As you can see in the updateLayout function, all I do is a simple search-and-replace on the layout: line. The important things are (a) my custom function named updateLayout, and (b) this line of code where I pass that function to create an appropriate Sed interpreter:

val sedResult: String = SedFactory.getSed(source, updateLayout _).run
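
To make that flow concrete, here’s a minimal, self-contained sketch of the general idea: apply a function to each line, then join the surviving lines into one String. The `SedAction`, `UpdateLine`, and `DeleteLine` definitions and the `runSed` helper are simplified stand-ins assumed for illustration; they are not the project’s actual code.

```scala
// Simplified stand-ins for the project's types (assumed shapes, illustration only).
sealed trait SedAction
case class UpdateLine(s: String) extends SedAction
case object DeleteLine extends SedAction

// Hypothetical helper: applies `f` to each line and joins the lines that remain.
def runSed(lines: Iterator[String], f: String => SedAction): String =
    lines
        .map(f)
        .collect { case UpdateLine(s) => s }   // DeleteLine results are dropped
        .mkString("\n")

val updateLayout: String => SedAction = line => {
    if (line.startsWith("layout:")) {
        UpdateLine("layout: book")
    } else {
        UpdateLine(line)
    }
}

val input = "---\nlayout: page\ntitle: Intro\n---"
val result = runSed(input.split("\n").iterator, updateLayout)
// result == "---\nlayout: book\ntitle: Intro\n---"
```

Because the helper returns a String instead of printing, the caller decides what happens next: write it back to the file, feed it to another function, or inspect it in a test.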

Using a Map

In a more complicated scenario, my custom function may need to update every line with some data kept in a key/value Map. In that scenario my function might look like this:

def updateHeader(
    currentLine: String, 
    currentLineNum: Int, 
    kvMap: Map[String, String]
): SedAction = {
    if (currentLine.startsWith("num:")) {
        // add the `previous-page` and `next-page` fields after the `num` field
        val nextPage = kvMap("next-page")
        val prevPage = kvMap("prev-page")
        val rez = s"${currentLine}\nprevious-page: ${prevPage}\nnext-page: ${nextPage}"
        UpdateLine(rez)
    } else {
        UpdateLine(currentLine)
    }
}

In this situation I need to create a unique map for each of the 55 files I’m processing, so some pseudocode for my program’s main loop looks like this:

for (filename <- filenames) {
    val source = Source.fromFile(filename)

    // do some work to derive the map variables, then this:
    val kvMap = Map(
        "num"       -> s"$counter",
        "next-page" -> nextPage,
        "prev-page" -> prevPage
    )
    
    val sedResult = new Sed(source, updateHeader _, kvMap).run
    writeFile(filename, sedResult)
}
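
As a rough sketch of how the line number and the Map might reach the custom function, the following self-contained example threads both through a small runner. The `runSedWithMap` helper, the 1-based line numbering, and the `"none"` placeholder are assumptions made for illustration, not details taken from the project.

```scala
// Simplified stand-ins for the project's types (assumed shapes, illustration only).
sealed trait SedAction
case class UpdateLine(s: String) extends SedAction
case object DeleteLine extends SedAction

// Hypothetical helper; line numbers are assumed to start at 1.
def runSedWithMap(
    lines: Iterator[String],
    f: (String, Int, Map[String, String]) => SedAction,
    kvMap: Map[String, String]
): String =
    lines.zipWithIndex
        .map { case (line, idx) => f(line, idx + 1, kvMap) }
        .collect { case UpdateLine(s) => s }
        .mkString("\n")

// getOrElse yields a placeholder for a missing key instead of throwing.
val updateHeader: (String, Int, Map[String, String]) => SedAction =
    (line, lineNum, kv) => {
        if (line.startsWith("num:")) {
            val prev = kv.getOrElse("prev-page", "none")
            val next = kv.getOrElse("next-page", "none")
            UpdateLine(s"$line\nprevious-page: $prev\nnext-page: $next")
        } else if (lineNum == 1 && line.isEmpty) {
            DeleteLine   // example use of the line number: drop a blank first line
        } else {
            UpdateLine(line)
        }
    }

val headerLines = List("", "title: Intro", "num: 2")
val pageMap = Map("prev-page" -> "page1.md", "next-page" -> "page3.md")
val headerResult = runSedWithMap(headerLines.iterator, updateHeader, pageMap)
// headerResult == "title: Intro\nnum: 2\nprevious-page: page1.md\nnext-page: page3.md"
```

One design note: `kvMap("next-page")` throws a `NoSuchElementException` when the key is absent, so `getOrElse` (or `get`, which returns an `Option`) is the safer choice when a map is built separately for each file.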

Match expressions

Despite showing if/else expressions in those examples, what I usually do is write match expressions inside my custom Sed functions. Here’s an example that demonstrates a typical match expression:

def rmNextPrevPageLines(currentLine: String): SedAction = currentLine match {
    case r"^next-page:.*"     => DeleteLine
    case r"^previous-page:.*" => DeleteLine
    case _ => UpdateLine(currentLine)
}

This is much more sed-like. Please note that the code in this example is made possible by Jon Pretty’s Kaleidoscope library, which allows the use of regular expressions in match expressions. I write a little more about Kaleidoscope in “How to use regex pattern matching in a Scala match expression,” so see that article and the Kaleidoscope page for more details.
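
If you’d rather not add a dependency, a similar effect is possible with only the standard library, because `scala.util.matching.Regex` values can be used as extractors in a match expression. The sketch below is an alternative to the Kaleidoscope version, not a description of it; the `SedAction` types are simplified stand-ins and the function name is hypothetical.

```scala
import scala.util.matching.Regex

// Simplified stand-ins for the project's types (assumed shapes, illustration only).
sealed trait SedAction
case class UpdateLine(s: String) extends SedAction
case object DeleteLine extends SedAction

// A Regex used as a pattern must match the whole line, so the trailing `.*` matters.
val NextPage: Regex     = "next-page:.*".r
val PreviousPage: Regex = "previous-page:.*".r

def rmNextPrevPageLinesStd(currentLine: String): SedAction = currentLine match {
    case NextPage()     => DeleteLine
    case PreviousPage() => DeleteLine
    case _              => UpdateLine(currentLine)
}

val cleaned: List[SedAction] =
    List("num: 2", "previous-page: a.md", "next-page: c.md").map(rmNextPrevPageLinesStd)
// cleaned == List(UpdateLine("num: 2"), DeleteLine, DeleteLine)
```

The trade-off is that each pattern needs its own named `val`, which is wordier than writing the regex inline in the `case`.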

Sed limitations

Please note that because this version of Sed returns the entire result as a String, its main limitation is memory: you probably won’t want to process very large files with it.
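
For comparison, here’s a sketch of what a streaming variant could look like, where each line is written out as soon as it’s processed and the full result is never held in memory. This is not something the project provides as far as this article shows; `runSedTo` and the stand-in `SedAction` types are hypothetical.

```scala
import java.io.{StringWriter, Writer}

// Simplified stand-ins for the project's types (assumed shapes, illustration only).
sealed trait SedAction
case class UpdateLine(s: String) extends SedAction
case object DeleteLine extends SedAction

// Hypothetical streaming variant: writes each surviving line immediately.
def runSedTo(lines: Iterator[String], f: String => SedAction, out: Writer): Unit =
    lines.foreach { line =>
        f(line) match {
            case UpdateLine(s) => out.write(s); out.write("\n")
            case DeleteLine    => ()   // nothing is written for deleted lines
        }
    }

val dropComments: String => SedAction = line => {
    if (line.startsWith("#")) DeleteLine else UpdateLine(line)
}

// A StringWriter keeps the demo self-contained; a buffered file writer would be the real target.
val streamed: String = {
    val out = new StringWriter()
    runSedTo(Iterator("# note", "a", "b"), dropComments, out)
    out.toString
}
// streamed == "a\nb\n"
```

One caveat with this design: when the output replaces the input file, as in the loops shown earlier, a streaming version has to write to a temporary file and rename it afterward, since it can’t safely overwrite a file it’s still reading.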

My Sed project

If you’re interested in more details, here’s a link to my Scala Sed project:

That project has a couple of README files that explain some things. The new code I just demonstrated is in the Sed subproject, specifically under the com.alvinalexander.sed_tostring package. See the Sed class in that package and its associated tests for more details.

As a final warning, because this is very much a work in progress, the code may change dramatically in the future. But if you’re interested in doing Sed-like processing on many files using Scala rather than sed, I hope this is a helpful start.

Bonus: Factories and HOFs

If you’re interested in some gory details, I created a SedFactory because there are three different Sed classes, which give you flexibility in writing your custom Sed functions. You saw some of this above, where the custom functions had different signatures based on each function’s needs.

Therefore, SedFactory has three overloaded getSed methods:

object SedFactory {

    // currentLine, currentLineNum, map
    def getSed(
        source: Source,
        f:(String, Int, Map[String, String]) => SedAction,
        keyValueMap: Map[String, String] = Map("" -> "")
    ): SedTrait = {
        new Sed3Params(source, f, keyValueMap)
    }

    // currentLine, currentLineNum
    def getSed(
        source: Source, 
        f:(String, Int) => SedAction
    ): SedTrait = {
        new SedCurrentLineAndNum(source, f)
    }

    // currentLine
    def getSed(
        source: Source, 
        f:(String) => SedAction
    ): SedTrait = {
        new SedCurrentLine(source, f)
    }

    // more code ...

This approach lets you write custom functions to work with Sed that match these function signatures:

f:(String, Int, Map[String, String]) => SedAction  //Function3
f:(String, Int) => SedAction                       //Function2
f:(String) => SedAction                            //Function1

If you’re used to writing functions that take functions as parameters — i.e., higher-order functions, or HOFs — this approach will look familiar. And if you’re not, I’ll take this moment to plug my “Functional Programming, Simplified” book.

As I note in the README-DEV.md file under the Sed subproject, this approach — along with JVM type erasure — has the potential to cause problems in the future, but for Version 0.3, this works okay.
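
To illustrate one way type erasure can interact with overloads like these, here’s a small, self-contained sketch. It uses a hypothetical `MiniFactory` with stand-in types rather than the real SedFactory, and it shows only the general JVM behavior, not a specific problem documented in the project.

```scala
// Stand-in result type so the sketch is self-contained (assumed shape).
sealed trait SedAction
case class UpdateLine(s: String) extends SedAction

object MiniFactory {
    // These overloads can coexist: Function1 and Function2 are distinct classes,
    // so they remain distinguishable after type parameters are erased.
    def describe(f: String => SedAction): String = "Function1"
    def describe(f: (String, Int) => SedAction): String = "Function2"

    // Adding the overload below next to the first one would NOT compile:
    // both parameters erase to Function1, so the compiler reports a double definition.
    // def describe(f: Int => SedAction): String = "Function1 of Int"
}

val oneArg: String => SedAction = line => UpdateLine(line)
val twoArgs: (String, Int) => SedAction = (line, n) => UpdateLine(s"$n: $line")

val picked = List(MiniFactory.describe(oneArg), MiniFactory.describe(twoArgs))
// picked == List("Function1", "Function2")
```

In other words, overloading on the number of function parameters is fine, but two overloads that differ only in the type parameters of a same-arity function would clash.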