Use the Linux sed command to modify HTML content

A long time ago I created something I called a "Source code warehouse" that would help developers learn various programming languages by letting them easily find examples from open source programming projects from around the world. I initially did this for Java programs, and later expanded it to include source code files from other languages.

I included the source code files in between HTML <pre> and </pre> tags, and wrapped some simple content around that, but one thing I forgot to do was escape characters like <, >, and & that were included in the source code files. Unescaped characters like these get interpreted as markup and have a way of wreaking havoc in HTML documents, and the PHP section of the source code warehouse was by far the worst offender.

Today I fixed the PHP section of the warehouse by writing a Linux sed script that would:

  1. open a file
  2. get all the content between the <pre> and </pre> tags, and
  3. convert those offending characters to something that wouldn't mess up my HTML pages.

As a programming matter, this involves starting the changes at the opening <pre> tag and stopping them at the closing </pre> tag.

It turns out that working with a range of lines with the Unix/Linux sed command (while excluding the starting and stopping tags) was harder than I expected, but I came up with a kludge that got the job done.

A sed script to modify a range of lines in an HTML file

The source code for the sed script I created is shown here:

/<pre>/,/<\/pre>/ {

  # first convert <pre> to OPEN_PRE and </pre> to CLOSE_PRE
  s/<pre>/OPEN_PRE/
  s/<\/pre>/CLOSE_PRE/

  # now convert all html as desired
  s/\&/\&amp;/g
  s/</\&lt;/g
  s/>/\&gt;/g

  # at the end convert my labels back to html <pre> and </pre> tags
  s/OPEN_PRE/<pre>/
  s/CLOSE_PRE/<\/pre>/

}
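
To see the script in action, here is a minimal sketch of one way to run it. The file names escape_pre.sed and sample.html are made up for this example (the original setup is not shown here), and the script's comments are left out to keep the listing short:

```shell
# Hypothetical file names, created in a scratch directory for the demo.
workdir=$(mktemp -d)

# Save the sed script shown above (comments omitted).
cat > "$workdir/escape_pre.sed" <<'EOF'
/<pre>/,/<\/pre>/ {
  s/<pre>/OPEN_PRE/
  s/<\/pre>/CLOSE_PRE/
  s/\&/\&amp;/g
  s/</\&lt;/g
  s/>/\&gt;/g
  s/OPEN_PRE/<pre>/
  s/CLOSE_PRE/<\/pre>/
}
EOF

# A tiny sample page with unescaped PHP-like code inside <pre>.
cat > "$workdir/sample.html" <<'EOF'
<h1>Example</h1>
<pre>
if ($a < $b && $c > 0) {
</pre>
<p>done</p>
EOF

# -f tells sed to read its commands from the script file.
# The converted page goes to stdout; the input file is not changed.
result=$(sed -f "$workdir/escape_pre.sed" "$workdir/sample.html")
printf '%s\n' "$result"

rm -rf "$workdir"
```

With this sample, the line inside the <pre> block should come out as `if ($a &lt; $b &amp;&amp; $c &gt; 0) {`, while the <h1>, <pre>, </pre>, and <p> lines pass through untouched.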

My solution was to grab the range of lines beginning with the <pre> tag and ending with the </pre> tag, and then modify those. My problem was that a sed range is inclusive: I couldn't figure out how to grab that range without also including the lines that contain the <pre> and </pre> tags themselves. So I used a "temporary-swap" kludge. I turned the HTML tags that were stuck in my pattern space (and that I didn't want to convert) into non-HTML labels that I was pretty sure would be unique, then converted them back when I was done.
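
For comparison, here is a sketch of a different approach that avoids the swap. This is not the script I used. It relies on the sed `b` command, which with no label jumps to the end of the script, to skip the two tag lines. It assumes the <pre> and </pre> tags each sit on a line of their own, and it will also skip any source line that happens to contain one of those tags. The function name escape_between_pre is made up for this example:

```shell
# Sketch only: assumes <pre> and </pre> are each on their own line.
escape_between_pre() {
  sed '/<pre>/,/<\/pre>/ {
    # leave the tag lines alone: "b" with no label ends this cycle
    /<pre>/b
    /<\/pre>/b
    # escape everything else in the range, ampersands first
    s/&/\&amp;/g
    s/</\&lt;/g
    s/>/\&gt;/g
  }' "$@"
}

sample='<pre>
echo "<b>$x</b>" & wait
</pre>'

escaped=$(printf '%s\n' "$sample" | escape_between_pre)
printf '%s\n' "$escaped"
```

The trade-off is that the swap kludge still works when a tag shares a line with source code, because it protects only the tag text and escapes the rest of the line, while this version skips such a line entirely.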

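Specifically, I convert <pre> and </pre> to the non-HTML strings OPEN_PRE and CLOSE_PRE. Then I convert all &, <, and > characters in the pattern space to their HTML character entities (&amp;, &lt;, and &gt;). And then at the end I change the OPEN_PRE and CLOSE_PRE labels back to <pre> and </pre>, respectively. Note that the order of these operations is very important: the tags have to be hidden before anything is escaped, and & has to be converted before < and >, because otherwise the & in each new &lt; and &gt; would itself be turned into &amp;.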
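
The ordering is easy to check on a single line. The following sketch runs the same three substitutions in both orders on a made-up sample line (the variable names are just for illustration):

```shell
line='if ($a < $b && $c > 0)'

# Ampersands first, as in the script: each character is escaped once.
right=$(printf '%s\n' "$line" |
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')

# Ampersands last: the & in the freshly written &lt; and &gt;
# is escaped a second time.
wrong=$(printf '%s\n' "$line" |
  sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/&/\&amp;/g')

printf 'right: %s\nwrong: %s\n' "$right" "$wrong"
```

The first order gives `if ($a &lt; $b &amp;&amp; $c &gt; 0)`, while the second gives `if ($a &amp;lt; $b &amp;&amp; $c &amp;gt; 0)`, which a browser would display as the literal text &lt; and &gt; instead of < and >.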

Call it a hack, but it got the job done. In the end I wish I'd written a small program in Ruby, but sed has usually treated me pretty well, and this is a hack I can live with.

Now, diving in and out of hundreds of directories to run this sed script is another matter, and I'll try to cover that in another blog post.