By Alvin Alexander. Last updated: February 1, 2024
The crazy Unix/Linux sed script below is my first attempt at a script that will convert as much HTML as possible to LaTeX. For my purposes I'm mostly interested in tables, lists, buttons, and comboboxes, but I included a few other things as well. This is in an extremely experimental state, and is included here as much for backup purposes and sharing as anything else.
Update: I originally wrote this script in 2004. Unless you just want to fool around with
sed, a much better approach these days is to use Pandoc to convert HTML to LaTeX.
Here's how you run the sed script on an HTML file named test.html:
sed -f html2latex.sed test.html > test.tex
That being said, here's the current source code for the html2latex.sed file:
#
# goal: to convert as much of an html document as possible to an
# equivalent sed script.
# i understand that, because of this approach, this can never be 100%
# accurate, but really what I'm after is the conversion of things
# like tables and lists.
# note that my html tags are pretty accurate here, but my latex tags
# leave some things to be desired.
#
s?>?>?g
s?<?<?g
s? ? ?g
s?<html>??ig
s?</html>??ig
s?<head>??ig
s?</head>??ig
s?<title>\([^<]*\)</title>?\\section*{\1}?ig
s?<body>?\\begin{document}?ig
s?</body>?\\end{document}?ig
# i don't know what the latex tag should be here for a paragraph.
s?<p>\([^<]*\)</p>?{\1}?ig
s?<center>??ig
s?</center>??ig
#-------#
# TABLE #
#-------#
s?<table.*>?\\begin{tabular}{}{}?ig
s?<\/table.*>?\\end{tabular}{}{}?ig
#-----------#
# TABLE ROW #
#-----------#
# nothing at the beginning of a table row
s?<tr>??ig
# two backslashes at the end of a table row
s?</tr>?\\\\?ig
#--------------#
# TABLE COLUMN #
#--------------#
s?<td.*>?\&?ig
#-------#
# FONTS #
#-------#
s?<b>\([^<]*\)</b>?\\textbf{{\1}}?ig
s?<em>\([^<]*\)</em>?\\textit{{\1}}?ig
s?<font .*>??ig
s?</font>??ig
s?<br>?////?g
#--------#
# BUTTON #
#--------#
# guessing on button syntax here
s?<input
type="button".*value="\([^<]*\)">?\\begin{fbox}\1\\end{fbox}?ig
# need to do something here to handle multiline mode
s?<select.*<option.*selected>\([^<]*\)</option>?\\begin{fbox}{\1}\\end{fbox}?g
# delete all other option tags
#?<option.*</option>?d
# handle preformatted things
s?<pre>?\\begin{verbatim}?ig
s?</pre>?\\end{verbatim}?ig
s?<code>?\\begin{verbatim}?ig
s?</code>?\\end{verbatim}?ig
# handle bulleted lists.
# todo: fix "itemize" vs. "enumerate"
s?<ol.*>?//begin{enumerate}?ig
s?</ol.*>?//end{enumerate}?ig
s?<ul.*>?//begin{enumerate}?ig
s?</ul.*>?//end{enumerate}?ig
s?<li>\([^<]*\)</li>?\\item {\1}?g
s?<li>\([^<]*\).*$?\\item {\1}?g
s?</li>??g
s?<!--\([^<]*\)-->?\\begin{comment}{\1}\\end{comment}?ig
Be warned, and good luck! ;)

