By Alvin Alexander. Last updated: February 1, 2024
The crazy Unix/Linux sed script below is my first attempt at a script that will convert as much HTML as possible to LaTeX. For my purposes I'm mostly interested in tables, lists, buttons, and comboboxes, but I included a few other things as well. This is in an extremely experimental state, and is included here as much for backup purposes and sharing as anything else.
Update: I originally wrote this script in 2004. Unless you just want to fool around with sed, a much better approach these days is to use Pandoc to convert HTML to LaTeX.
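For comparison, here is a minimal sketch of the Pandoc route. It assumes pandoc is installed and on your PATH; the sample file contents and the pandoc_dir variable are made up for illustration.

```shell
# Create a small sample HTML file in a scratch directory (hypothetical content).
pandoc_dir=$(mktemp -d)
printf '%s\n' '<ul><li>first</li><li>second</li></ul>' > "$pandoc_dir/test.html"

# -f names the input format, -t the output format, -o the output file.
if command -v pandoc >/dev/null 2>&1; then
  pandoc -f html -t latex "$pandoc_dir/test.html" -o "$pandoc_dir/test.tex"
  cat "$pandoc_dir/test.tex"
else
  echo "pandoc is not installed; skipping the conversion" >&2
fi
```

Without further options Pandoc writes a LaTeX fragment; adding -s (standalone) should produce a complete document with a preamble.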
Here's how you run the sed script on an HTML file named test.html:
sed -f html2latex.sed test.html > test.tex
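If you want to see how sed -f behaves before trying the full script, the following sketch runs a tiny two-rule stand-in script against a one-line HTML file. The file names demo.sed, demo.html, and demo.tex are made up for this example, and the two rules are simplified versions of the bold and emphasis rules, not the complete script.

```shell
# Work in a scratch directory so nothing in the current directory is touched.
demo_dir=$(mktemp -d)

# A two-rule stand-in for html2latex.sed. The quoted 'EOF' keeps the shell
# from interpreting the backslashes before sed sees them.
cat > "$demo_dir/demo.sed" <<'EOF'
s?<b>\([^<]*\)</b>?\\textbf{\1}?g
s?<em>\([^<]*\)</em>?\\textit{\1}?g
EOF

# A one-line sample input (hypothetical content).
printf '%s\n' 'This is <b>bold</b> and <em>emphasized</em>.' > "$demo_dir/demo.html"

# Same invocation shape as the command above: script file, input file, redirect.
sed -f "$demo_dir/demo.sed" "$demo_dir/demo.html" > "$demo_dir/demo.tex"
cat "$demo_dir/demo.tex"
```

This prints: This is \textbf{bold} and \textit{emphasized}. The rules use ? as the delimiter instead of the usual / so that the slashes in closing tags don't need escaping.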
That being said, here's the current source code for the html2latex.sed file:
#
# goal:  to convert as much of an html document as possible to an
#        equivalent latex document.
#        i understand that, because of this approach, this can never be 100%
#        accurate, but really what I'm after is the conversion of things
#        like tables and lists.
# note that my html tags are pretty accurate here, but my latex tags
# leave some things to be desired.
# the "i" (ignore case) flag on the s commands requires GNU sed.
# tags with attributes are matched with [^>]* so a match stops at the
# end of the tag instead of swallowing the rest of the line.
#

# convert html entities back to plain characters
s?&gt;?>?g
s?&lt;?<?g
s?&nbsp;? ?g

s?<html>??ig
s?</html>??ig
s?<head>??ig
s?</head>??ig
s?<title>\([^<]*\)</title>?\\section*{\1}?ig
s?<body>?\\begin{document}?ig
s?</body>?\\end{document}?ig

# i don't know what the latex tag should be here for a paragraph.
s?<p>\([^<]*\)</p>?{\1}?ig

s?<center>??ig
s?</center>??ig

#-------#
# TABLE #
#-------#
# the column spec (the empty {}) has to be filled in by hand
s?<table[^>]*>?\\begin{tabular}{}?ig
s?</table[^>]*>?\\end{tabular}?ig

#-----------#
# TABLE ROW #
#-----------#
# nothing at the beginning of a table row
s?<tr[^>]*>??ig
# two backslashes at the end of a table row
s?</tr>?\\\\?ig

#--------------#
# TABLE COLUMN #
#--------------#
s?<td[^>]*>?\&?ig
s?</td>??ig

#-------#
# FONTS #
#-------#
s?<b>\([^<]*\)</b>?\\textbf{{\1}}?ig
s?<em>\([^<]*\)</em>?\\textit{{\1}}?ig
s?<font [^>]*>??ig
s?</font>??ig
# a line break is two backslashes in latex
s?<br>?\\\\?g

#--------#
# BUTTON #
#--------#
# guessing on button syntax here; \fbox is a command, not an environment
s?<input type="button".*value="\([^<]*\)">?\\fbox{\1}?ig

# need to do something here to handle multiline mode
s?<select.*<option.*selected>\([^<]*\)</option>?\\fbox{\1}?g

# delete all other option tags
#?<option.*</option>?d

# handle preformatted things
s?<pre>?\\begin{verbatim}?ig
s?</pre>?\\end{verbatim}?ig
s?<code>?\\begin{verbatim}?ig
s?</code>?\\end{verbatim}?ig

# handle numbered (enumerate) and bulleted (itemize) lists.
s?<ol[^>]*>?\\begin{enumerate}?ig
s?</ol[^>]*>?\\end{enumerate}?ig
s?<ul[^>]*>?\\begin{itemize}?ig
s?</ul[^>]*>?\\end{itemize}?ig
s?<li>\([^<]*\)</li>?\\item {\1}?g
s?<li>\([^<]*\).*$?\\item {\1}?g
s?</li>??g

s?<!--\([^<]*\)-->?\\begin{comment}{\1}\\end{comment}?ig
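The comment in the script about multiline mode points at a real limitation: sed applies each rule to one line at a time, so a select element spread over several lines never matches. One common workaround, sketched here as an assumption rather than as part of the original script, is to pull the whole file into the pattern space with the N command before substituting. The file names, the sample HTML, and the use of \fbox for the result are illustrative choices.

```shell
# Sample combobox spread over several lines (hypothetical content).
select_dir=$(mktemp -d)
printf '%s\n' \
  '<select name="color">' \
  '<option>red</option>' \
  '<option selected>green</option>' \
  '</select>' > "$select_dir/select.html"

# :a / N / $!ba  loop: keep appending the next line until the last line is
#                reached, so the pattern space holds the whole file.
# s?\n? ?g       join the lines with spaces.
# final s        keep only the selected option's text, boxed with \fbox.
sed -e ':a' -e 'N' -e '$!ba' \
    -e 's?\n? ?g' \
    -e 's?<select.*<option[^>]*selected>\([^<]*\)</option>.*</select>?\\fbox{\1}?' \
    "$select_dir/select.html" > "$select_dir/select.tex"
cat "$select_dir/select.tex"
```

This prints \fbox{green}. The sketch assumes the input has more than one line and contains a single select element; because .* is greedy, a file with several comboboxes would collapse into one match and would need a more careful pattern.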
Be warned, and good luck! ;)