A long time ago I created something I called a "Source code warehouse" that would help developers learn various programming languages by letting them easily find examples from open source programming projects from around the world. I initially did this for Java programs, and later expanded it to include source code files from other languages.
I included the source code files in between HTML <pre> and </pre> tags, and wrapped some simple content around that, but one thing I forgot to do was replace characters like <, >, and & that were included in the source code files. Unintended tags like this have a way of wreaking havoc in HTML documents, and the PHP section of the source code warehouse was by far the worst offender.
Today I fixed the PHP section of the warehouse by writing a Linux sed script that would:
- open a file
- get all the content between the <pre> and </pre> tags, and
- convert those offending characters to something that wouldn't mess up my HTML pages.
As a programming matter, this involves starting the changes at the opening <pre> tag and stopping them at the closing </pre> tag.
It turns out that working with a range of lines with the Unix/Linux sed command (while excluding the starting and stopping tags) was harder than I expected, but I came up with a kludge that got the job done.
A sed script to modify a range of lines in an HTML file
The source code for the sed script I created is shown here:
/<pre>/,/<\/pre>/ {
# first convert <pre> to OPEN_PRE and </pre> to CLOSE_PRE
s/<pre>/OPEN_PRE/
s/<\/pre>/CLOSE_PRE/
# now convert all html as desired
s/&/\&amp;/g
s/</\&lt;/g
s/>/\&gt;/g
# at the end convert my labels back to html <pre> and </pre> tags
s/OPEN_PRE/<pre>/
s/CLOSE_PRE/<\/pre>/
}
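As a quick sanity check, here is a minimal sketch of running a script like this against a small test file. The file names fix-pre.sed and sample.html are made up for this example, and the three entity substitutions are written out as &amp;, &lt;, and &gt;.

```shell
# Hypothetical file names, created in a throwaway directory.
workdir=$(mktemp -d)

# The sed script, saved to a file so it can be run with "sed -f".
cat > "$workdir/fix-pre.sed" <<'EOF'
/<pre>/,/<\/pre>/ {
s/<pre>/OPEN_PRE/
s/<\/pre>/CLOSE_PRE/
s/&/\&amp;/g
s/</\&lt;/g
s/>/\&gt;/g
s/OPEN_PRE/<pre>/
s/CLOSE_PRE/<\/pre>/
}
EOF

# A tiny sample page with PHP-like code between the <pre> tags.
cat > "$workdir/sample.html" <<'EOF'
<p>Intro & more</p>
<pre>
if ($a < $b && $c > 0) {
</pre>
<p>Outro</p>
EOF

# Print the converted page to stdout; the input file is left unchanged.
result=$(sed -f "$workdir/fix-pre.sed" "$workdir/sample.html")
printf '%s\n' "$result"

rm -r "$workdir"
```

Only the line between the tags should change, becoming `if ($a &lt; $b &amp;&amp; $c &gt; 0) {`. The `&` in the first paragraph is outside the range, so it is left alone.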
My solution was to grab the range of lines beginning with the <pre> tag and ending with the </pre> tag, and then modify those. But my problem was that I couldn't figure out how to grab that range without also including the lines that contain the <pre> and </pre> tags themselves. So I used the "temporary-swap" kludge: I turned the HTML tags that were stuck in my pattern space (and that I didn't want to convert) into non-HTML labels that I was pretty sure would be unique, then converted them back when I was done.
Specifically, I convert <pre> and </pre> to the non-HTML strings OPEN_PRE and CLOSE_PRE. Then I convert all &, <, and > characters in the pattern space to their named HTML entity equivalents (&amp;, &lt;, and &gt;). And then at the end I change the OPEN_PRE and CLOSE_PRE labels back to <pre> and </pre>, respectively. Note that the order of these operations is very important.
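To see why the order matters, here is a small sketch that runs the three entity substitutions on one made-up line of code, first with the ampersand handled first and then with it handled last. The sample line and the variable names are only for illustration.

```shell
line='if (a < b && c > 0)'

# Ampersands first: every special character is escaped exactly once.
right=$(printf '%s\n' "$line" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g')

# Ampersands last: the & inside the freshly written &lt; and &gt; is escaped again.
wrong=$(printf '%s\n' "$line" | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/&/\&amp;/g')

printf 'right: %s\n' "$right"
printf 'wrong: %s\n' "$wrong"
```

The first ordering gives `if (a &lt; b &amp;&amp; c &gt; 0)`, while the second gives `if (a &amp;lt; b &amp;&amp; c &amp;gt; 0)`, which a browser would display as literal `&lt;` and `&gt;` text. The same reasoning applies to the tag swap: the labels have to go in before the entity conversion and come back out after it, or the <pre> tags themselves get escaped.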
Call it a hack, but it got the job done. In the end I wish I'd written a small program in Ruby, but sed has usually treated me pretty well, and this is a hack I can live with.
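For comparison, one possible way to avoid the swap is to have sed skip the tag lines inside the range with the `b` (branch) command, which jumps to the end of the script when no label is given. This is only a sketch, not the script that was run on the warehouse, and it assumes each tag sits on a line of its own; the file name skip-tags.sed is made up.

```shell
workdir=$(mktemp -d)

cat > "$workdir/skip-tags.sed" <<'EOF'
/<pre>/,/<\/pre>/ {
# Leave the lines holding the tags untouched: "b" with no label
# skips the remaining commands for the current line.
/<pre>/b
/<\/pre>/b
s/&/\&amp;/g
s/</\&lt;/g
s/>/\&gt;/g
}
EOF

# Feed three lines through the script: opening tag, code, closing tag.
skipped=$(printf '%s\n' '<pre>' 'x < y && y > z' '</pre>' | sed -f "$workdir/skip-tags.sed")
printf '%s\n' "$skipped"

rm -r "$workdir"
```

The trade-off is that any source code sharing a line with <pre> or </pre> is skipped along with the tag, whereas the label swap still converts the rest of that line.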
Now, diving in and out of hundreds of directories to run this sed script is another matter, and I'll try to cover that in another blog post.

