A long time ago I created something I called a "Source code warehouse" that would help developers learn various programming languages by letting them easily find examples from open source programming projects from around the world. I initially did this for Java programs, and later expanded it to include source code files from other languages.
I included the source code files in between HTML <pre> and </pre> tags, and wrapped some simple content around that, but one thing I forgot to do was replace characters like <, >, and & that were included in the source code files. Unintended tags like these have a way of wreaking havoc in HTML documents, and the PHP section of the source code warehouse was by far the worst offender.
Today I fixed the PHP section of the warehouse by writing a Linux sed script that would:
- open a file
- get all the content between the <pre> and </pre> tags, and
- convert those offending characters to something that wouldn't mess up my HTML pages.
As a programming matter, this involves starting the changes at the opening <pre> tag and stopping them at the closing </pre> tag.
It turns out that working with a range of lines with the Unix/Linux sed command (while excluding the starting and stopping tags) was harder than I expected, but I came up with a kludge that got the job done.
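To see why excluding the tags is awkward, here is a minimal sketch of what a sed address range selects. The sample text and the show_range function name are made up for illustration; the only real piece is the /<pre>/,/<\/pre>/ range itself.

```shell
# Print only the lines that a /<pre>/,/<\/pre>/ range selects.
# The sample input below is hypothetical, just enough to show the behavior.
show_range() {
  printf '%s\n' 'intro text' '<pre>' 'if (a < b) {' '</pre>' 'closing text' |
    sed -n '/<pre>/,/<\/pre>/p'
}

show_range
```

The output is three lines: the <pre> line, the code line, and the </pre> line. A sed range always includes the lines that match its start and end patterns, so any command applied to the range also runs against the tag lines.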
A sed script to modify a range of lines in an HTML file
The source code for the sed script I created is shown here:
/<pre>/,/<\/pre>/ {

  # first convert <pre> to OPEN_PRE and </pre> to CLOSE_PRE
  s/<pre>/OPEN_PRE/
  s/<\/pre>/CLOSE_PRE/

  # now convert all html as desired
  s/\&/\&amp;/g
  s/</\&lt;/g
  s/>/\&gt;/g

  # at the end convert my labels back to html <pre> and </pre> tags
  s/OPEN_PRE/<pre>/
  s/CLOSE_PRE/<\/pre>/

}
My solution was to grab the range of lines beginning with the <pre> tag and ending with the </pre> tag, and then modify those. My problem was that I couldn't figure out how to grab that range without also including the lines that contain the <pre> and </pre> tags themselves. So I used a "temporary swap" kludge: I turned the HTML tags that were stuck in my pattern space (and that I didn't want to convert) into non-HTML labels that I was pretty sure would be unique, then converted them back when I was done.
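Specifically, I convert <pre> and </pre> to the non-HTML strings OPEN_PRE and CLOSE_PRE. Then I convert all &, <, and > characters in the pattern space to their named HTML character entities (&amp;, &lt;, and &gt;). And then at the end I change the OPEN_PRE and CLOSE_PRE labels back to <pre> and </pre>, respectively. Note that the order of these operations is very important: the tags have to be swapped out before the < and > conversions run, and the & conversion has to come before the other two, because otherwise it would also rewrite the & in the &lt; and &gt; entities that were just created.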
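A quick way to convince yourself about the ordering is to run the three character conversions both ways on one line of code. This is only a sketch: the sample string and the two function names are mine, and I write the pattern side as a plain & here, which should match a literal ampersand just like the escaped form does in GNU sed.

```shell
# Hypothetical sample line of source code.
sample='if (a < b && c > d)'

# Ampersand first: each special character is escaped exactly once.
escape_amp_first() {
  printf '%s\n' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Ampersand last: the & inside the freshly written &lt; and &gt; is escaped again.
escape_amp_last() {
  printf '%s\n' "$1" | sed -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/&/\&amp;/g'
}

escape_amp_first "$sample"   # if (a &lt; b &amp;&amp; c &gt; d)
escape_amp_last "$sample"    # if (a &amp;lt; b &amp;&amp; c &amp;gt; d)
```

The second result would show up in a browser as the literal text &lt; and &gt; instead of the < and > characters.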
Call it a hack, but it got the job done. In the end I wish I'd written a small program in Ruby, but sed has usually treated me pretty well, and this is a hack I can live with.
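For comparison, here is a sketch of one way to get a similar result without the temporary labels: stay inside the same range, but branch past the two tag lines before any conversion runs. It is not the script I actually used, and it rests on an assumption the swap approach doesn't need, namely that each <pre> and </pre> tag sits on a line of its own. The escape_pre_body name and the sample input are made up for illustration.

```shell
# Read HTML on stdin and escape only the lines strictly between a <pre> line
# and a </pre> line.
# Assumption: each tag is on a line by itself, with no code sharing that line.
escape_pre_body() {
  sed '
/<pre>/,/<\/pre>/ {
  # skip the two tag lines: b with no label jumps to the end of the script
  /<pre>/b
  /<\/pre>/b
  s/&/\&amp;/g
  s/</\&lt;/g
  s/>/\&gt;/g
}
'
}

printf '%s\n' '<p>intro</p>' '<pre>' 'if (a < b && c > d)' '</pre>' | escape_pre_body
```

With that input, only the line of code between the tags is changed. If source code shares a line with one of the tags, that whole line is skipped and stays unescaped, which is a case the label swap handles.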
Now, diving in and out of hundreds of directories to run this sed script is another matter, and I'll try to cover that in another blog post.