A crazy sed script to convert HTML code to LaTeX

The crazy Unix/Linux sed script below is my first attempt at a script that will convert as much HTML as possible to LaTeX. For my purposes I'm mostly interested in tables, lists, buttons, and comboboxes, but I included a few other things as well. This is in an extremely experimental state, and is included here as much for backup purposes and sharing as anything else.

Update: I originally wrote this script in 2004. Unless you just want to fool around with sed, a much better approach these days is to use Pandoc to convert HTML to LaTeX.
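
For comparison, here is a rough sketch of the Pandoc route. It assumes Pandoc is installed and on your PATH, and the file name page.html and its contents are made up for illustration:

```shell
# Create a tiny sample page (hypothetical content, for illustration only).
printf '%s\n' '<html><body><p>Hello <b>world</b></p></body></html>' > page.html

# -f and -t name the input and output formats;
# -s (standalone) adds a LaTeX preamble so the result can compile on its own.
if command -v pandoc >/dev/null 2>&1; then
    pandoc -f html -t latex -s page.html -o page.tex
else
    echo "pandoc is not installed; skipping" >&2
fi
```

If Pandoc is available, page.tex should contain a complete LaTeX document with the bold text wrapped in \textbf.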

Here's how you run the sed script on an HTML file named test.html:

sed -f html2latex.sed test.html > test.tex
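
If you want to see how sed -f behaves before trusting it with a real page, a small smoke test like the one below may help. The names sample.html and mini.sed are placeholders I made up, and the two rules are simplified stand-ins written in the same style as the full script:

```shell
# Tiny sample page (hypothetical content, for illustration only).
printf '%s\n' '<p>Hello <b>world</b></p>' > sample.html

# mini.sed holds two rules in the same style as the full script:
# "?" is the s-command delimiter, so "/" in closing tags needs no escaping.
cat > mini.sed <<'EOF'
s?<b>\([^<]*\)</b>?\\textbf{\1}?g
s?</*p>??g
EOF

# Apply every rule in mini.sed to each input line, in order.
sed -f mini.sed sample.html > sample.tex
cat sample.tex
# prints: Hello \textbf{world}
```

The rules run top to bottom on every line, so the order of the substitutions in the script file matters.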

Here's the current source code for the html2latex.sed file:

#
# goal: to convert as much of an html document as possible to an
# equivalent latex document.
# i understand that, because of this approach, this can never be 100%
# accurate, but really what I'm after is the conversion of things
# like tables and lists.
# note that my html tags are pretty accurate here, but my latex tags
# leave some things to be desired.
#
s?&gt;?>?g
s?&lt;?<?g
s?&nbsp;? ?g

s?<html>??ig
s?</html>??ig

s?<head>??ig
s?</head>??ig

s?<title>\([^<]*\)</title>?\\section*{\1}?ig

s?<body>?\\begin{document}?ig
s?</body>?\\end{document}?ig

# latex has no paragraph tag; a paragraph ends at a blank line or \par.
s?<p>\([^<]*\)</p>?\1\\par?ig

s?<center>??ig
s?</center>??ig

#-------#
# TABLE #
#-------#
# the column spec (e.g. {lll}) can't be worked out line by line; fill it in by hand
s?<table[^>]*>?\\begin{tabular}{}?ig
s?</table[^>]*>?\\end{tabular}?ig

#-----------#
# TABLE ROW #
#-----------#
# nothing at the beginning of a table row
s?<tr>??ig
# two backslashes at the end of a table row
s?</tr>?\\\\?ig

#--------------#
# TABLE COLUMN #
#--------------#
s?<td[^>]*>?\&?ig
s?</td>??ig

#-------#
# FONTS #
#-------#
s?<b>\([^<]*\)</b>?\\textbf{{\1}}?ig
s?<em>\([^<]*\)</em>?\\textit{{\1}}?ig
s?<font [^>]*>??ig
s?</font>??ig
s?<br>?\\\\?ig

#--------#
# BUTTON #
#--------#
# guessing on button syntax here; \fbox is a command, not an environment
s?<input type="button"[^>]*value="\([^"]*\)"[^>]*>?\\fbox{\1}?ig

# need to do something here to handle multiline mode
s?<select.*<option.*selected>\([^<]*\)</option>?\\fbox{\1}?g

# delete all other option tags
\?<option.*</option>?d

# handle preformatted things
s?<pre>?\\begin{verbatim}?ig
s?</pre>?\\end{verbatim}?ig

s?<code>?\\begin{verbatim}?ig
s?</code>?\\end{verbatim}?ig
 
# handle ordered (ol) and bulleted (ul) lists.
s?<ol[^>]*>?\\begin{enumerate}?ig
s?</ol[^>]*>?\\end{enumerate}?ig
s?<ul[^>]*>?\\begin{itemize}?ig
s?</ul[^>]*>?\\end{itemize}?ig
s?<li>\([^<]*\)</li>?\\item {\1}?g
s?<li>\([^<]*\).*$?\\item {\1}?g
s?</li>??g

s?<!--\([^<]*\)-->?\\begin{comment}{\1}\\end{comment}?ig
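
The script works on one line at a time, so an element that is split across lines (the multiline case mentioned in the select rule) never matches. One possible workaround, sketched here and not part of the original script, is to pull the whole file into the pattern space first and join the lines. The file names multi.html and join.sed are hypothetical:

```shell
# Sample input with a <b> element split across two lines (hypothetical content).
printf '%s\n' '<b>first half' 'second half</b>' > multi.html

# join.sed: append every remaining line to the pattern space, then
# turn the embedded newlines into spaces so later rules see one long line.
cat > join.sed <<'EOF'
:a
$!{
N
ba
}
s/\n/ /g
s?<b>\([^<]*\)</b>?\\textbf{{\1}}?g
EOF

sed -f join.sed multi.html
# prints: \textbf{{first half second half}}
```

The trade-off is that the entire document becomes a single line, so any rule that uses a greedy .* can swallow far more than intended, and the verbatim blocks lose their line breaks. Treat this as an experiment, not a drop-in fix.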

Be warned, and good luck! ;)