<html xmlns="http://www.w3.org/1999/xhtml"> <!--*- nxml -*-->
<head>
  <title>Transforming XHTML to LaTeX and BibTeX</title>
  <link rel="stylesheet" href="article.css"/>

  <link rel="documentclass" title="llncs"/><!-- href? where does that come from? -->
  <link rel="bibliographystyle" title="splncs" /> <!-- href? -->
  <link rel="usepackage" title="graphicx" /><!-- href? -->
  <link rel="usepackage" title="url" href="ftp://cam.ctan.org/tex-archive/macros/latex/contrib/misc/url.sty" />
</head>
<body>
<div class="online"><a href="/">W3C</a></div>

<div class="maketitle">
<h1>Transforming XHTML to LaTeX and BibTeX</h1>


<address><a rel="author" href="http://www.w3.org/People/Connolly/">Dan Connolly</a><br />
<small class="online">$Revision: 1.23 $ of $Date: 2008/04/24 21:28:36 $</small>
</address>

</div>
<div class="abstract"><h4>Abstract</h4>

<p>We transform XHTML to LaTeX and BibTeX to allow technical articles
to be developed using familiar XHTML authoring tools and
techniques.</p>
</div>


<div>
<h2>Introduction</h2>

<p>Occasionally a web page turns the corner from a casually drafted
idea to an article worthy of publication. Computer science conferences
often require submissions using specific LaTeX styles; for example,
the <a
href="http://iswc2004.semanticweb.org/submission/authors_instruction.php">ISWC2004
submission instructions</a> require that submitted papers be formatted
in the style of the Springer publications format for <a
href="http://www.springeronline.com/sgw/cda/frontpage/0,10735,5-164-2-72376-0,00.html">Lecture
Notes in Computer Science (LNCS)</a>.
<a href="http://www.w3.org/Style/XSL/">XSLT</a> is
a convenient notation to express a transformation from
XHTML to LaTeX.</p>

<p>Tools to transform from LaTeX to HTML are commonplace, but there
are far fewer to go the other way.  A little bit of searching yielded
some work<a href="#Gur00">[Gur00]</a> that was designed to undo a
transformation to XHTML. It used an odd XHTML namespace and exhibited
various other quirks specific to reversing that transformation, but it
provided quite a boost up the LaTeX learning curve<a
href="#Mann94">[Mann94]</a>.</p>

<p>That code did not integrate with BibTeX. To take
advantage of the automatic bibliography formatting traditionally provided
by LaTeX styles, we studied the <a
href="http://www.cc.gatech.edu/classes/RWL/Projects/citation/Docs/UserManuals/Reference_Pages/bibtex_doc.html">BibTeX
format</a><a href="#Spen98">[Spen98]</a> for a bit and wrote <tt><a
href="xh2bib.xsl">xh2bib.xsl</a></tt>.</p>

<p>Together with the traditional <tt>pdflatex</tt> and <tt>bibtex</tt>
tools<a href="#tetex">[tetex]</a> and an XSLT processor such as
xsltproc<a href="#XSLTPROC">[XSLTPROC]</a>, this transformation can
turn ordinary web pages with just a bit of special markup into
camera-ready PDF in specialized LaTeX styles.</p>
</div>

<div><h3>A Quick Example</h3>

<p>This article demonstrates the basic features. See:</p>

<ul>
  <li><tt><a href="Overview.pdf">Overview.pdf</a></tt></li>
  <li><tt><a href="Overview.tex">Overview.tex</a></tt></li>
  <li><tt><a href="Overview.bib">Overview.bib</a></tt></li>
</ul>

<p>They are produced a la:</p>

<pre>
$ make Overview.pdf
xsltproc  --novalid --stringparam DocClass llncs \
  --stringparam Bib Overview --stringparam BibStyle splncs \
  --stringparam Status prepub  \
        -o Overview.tex xh2latex.xsl Overview.html
TEXINPUTS=.:../../../2004/LLCS: pdflatex  Overview.tex
This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
<em>...</em>
Output written on Overview.pdf (3 pages, 62474 bytes).
Transcript written on Overview.log.
xsltproc  --novalid -o Overview.bib xh2bib.xsl Overview.html
BSTINPUTS=.:../../../2004/LLCS: bibtex  Overview
This is BibTeX, Version 0.99c (Web2C 7.4.5)
The top-level auxiliary file: Overview.aux
The style file: splncs.bst
Database file #1: Overview.bib
TEXINPUTS=.:../../../2004/LLCS: pdflatex  Overview
This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
<em>...</em>
Output written on Overview.pdf (3 pages, 67583 bytes).
Transcript written on Overview.log.
TEXINPUTS=.:../../../2004/LLCS: pdflatex  Overview
This is pdfTeX, Version 3.14159-1.10b (Web2C 7.4.5)
<em>...</em>
Output written on Overview.pdf (3 pages, 67167 bytes).
Transcript written on Overview.log.
</pre>

</div>


<div>
<h2>Features</h2>

<p>The transformation <tt><a href="xh2latex.xsl">xh2latex.xsl</a></tt>
works in the obvious way for many idioms:</p>

<ul>
    <li>section headings: <tt>h2</tt>, <tt>h3</tt>, <tt>h4</tt></li>
    <li>paragraphs: <tt>p</tt></li>
    <li>itemized lists: <tt>ul</tt>, <tt>dl</tt></li>
    <li>enumerated (numbered) lists: <tt>ol</tt></li>
    <li>tables: <tt>table border="1"</tt>, <tt>tr</tt>, <tt>td</tt></li>
    <li>verbatim: <tt>pre</tt></li>
    <li>phrase markup: <tt>em</tt>, <tt>code</tt>, <tt>tt</tt>,
    <tt>i</tt>, <tt>b</tt></li>
</ul>

<p>Table support is limited to tables that have <tt>border="1"</tt>
and in which all rows have the same number of cells. For example:</p>
<table border="1">
<tr><th>Name</th><th>Address</th><th>Phone</th></tr>
<tr><td>John Doe</td><td>123 High St.</td><td>555-1212</td></tr>
<tr><td>Jane Smith</td><td>456 Low St.</td><td>555-1234</td></tr>
</table>

<p>Specialized markup is required for other idioms. An <a
href="article.css">article.css</a> stylesheet provides
visual feedback for this special markup.</p>

<p>To use a LaTeX package, add a link to the head of your document a la:</p>
<pre>
  &lt;link rel="usepackage" title="url"
    href="ftp://cam.ctan.org/tex-archive/macros/latex/contrib/misc/url.sty" />
</pre>

<p>The package name is taken from the <tt>title</tt> attribute. The <tt>href</tt> attribute is not used in the LaTeX conversion.</p>

<p>We recommend the <a
href="ftp://cam.ctan.org/tex-archive/macros/latex/contrib/misc/url.sty">url.sty</a>
package, per <a
href="http://www.tex.ac.uk/cgi-bin/texfaq2html?label=setURL">a TeX
FAQ</a>. For example: <tt
class="url">http://www.w3.org/People/Connolly/</tt>.</p>

<div><h3>Front Matter</h3>

<p>The following patterns are used to extract the
title page material:</p>

<ul>
  <li><tt>div/@class="maketitle"</tt>
  <ul>
    <li>title: <tt>h1</tt></li>
      <li>abstract: <tt>div/@class="abstract"</tt></li>
      <li>author: <tt>address/a[@rel="author"]</tt></li>
  </ul>
  </li>
  <li>keywords: <tt>div[@class="keywords"]</tt></li>
  <li>terms: <tt>div[@class="terms"]</tt></li>
</ul>
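
<p>As a rough sketch of how these patterns fit together, the front
matter of a page might be marked up as follows. The title, author
name, URL and keyword text are placeholders. Placing the abstract and
keywords next to the <tt>maketitle</tt> block mirrors this article's
own markup and is an assumption, not something the stylesheet is known
to require:</p>

<pre>
&lt;div class="maketitle">
  &lt;h1>Title of the Article&lt;/h1>
  &lt;address>&lt;a rel="author"
    href="http://example.org/~author/">A. N. Author&lt;/a>&lt;/address>
&lt;/div>
&lt;div class="abstract">&lt;h4>Abstract&lt;/h4>
  &lt;p>One-paragraph summary.&lt;/p>
&lt;/div>
&lt;div class="keywords">XHTML, LaTeX, BibTeX&lt;/div>
</pre>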

<p><em>Support for WWW2006 style authors, following
<a href="http://www.acm.org/sigs/pubs/proceed/sigfaq.htm">ACM style</a>,
is in progress.</em></p>

</div>

<div><h3>Cross references and footnotes</h3>

<p>The <tt>a[@rel="ref"]</tt> pattern is transformed to the LaTeX
<tt>\ref{<var>label</var>}</tt> idiom, assuming the reference takes
the form <tt>href="#<var>label</var>"</tt>. <em>@@needs testing</em></p>

<p>The footnote pattern is <tt>*[@class="footnote"]</tt>.</p>
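
<p>For instance, assuming some element in the page carries
<tt>id="fig-flow"</tt> (a made-up label), a cross reference and a
footnote might be written like this. The reference should come out as
<tt>\ref{fig-flow}</tt>; using <tt>span</tt> for the footnote is just
one choice, since the pattern matches any element with that class:</p>

<pre>
&lt;p>The pipeline is shown in Figure
&lt;a rel="ref" href="#fig-flow">1&lt;/a>.&lt;span
class="footnote">Footnote text goes here.&lt;/span>&lt;/p>
</pre>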
</div>

<div><h3>Figures</h3>

<p>The <tt>div[@class="figure"]</tt> pattern is transformed to a
figure environment; any <tt>div/@id</tt> is used as a figure
label. The file pattern is <tt>object/@data</tt>.  <em>Figures are
currently assumed to be PDF; the <tt>object/@height</tt> attribute is
copied over.</em> The caption pattern is <tt>p[@class="caption"]</tt>.
<em>@@need to test this.</em>
Be sure to include the <tt>epsfig</tt> package a la:
</p>
<pre>
  &lt;link rel="usepackage" title="epsfig" />
</pre>
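
<p>Putting the figure patterns together, a figure might be marked up
roughly as follows. The <tt>id</tt>, file name and caption are
placeholders. Because the height is copied over as-is, the value shown
is only a guess at something LaTeX would accept:</p>

<pre>
&lt;div class="figure" id="fig-flow">
  &lt;object data="flow.pdf" height="2in">&lt;/object>
  &lt;p class="caption">Processing flow from XHTML to PDF&lt;/p>
&lt;/div>
</pre>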
</div>

<div><h3>Citations and Bibliography</h3>

<p>An <tt>a</tt> element whose content starts with an open square bracket
<tt>[</tt> is interpreted as a citation reference. The <tt>href</tt>
is assumed to be a local link a la <tt>#<var>tag</var></tt>.</p>
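
<p>This article's own introduction cites xsltproc in exactly this
form. The markup below is taken from it; the resulting LaTeX is
presumably a <tt>\cite</tt> of the tag, though that is an assumption
here and not checked against the stylesheet:</p>

<pre>
... an XSLT processor such as
xsltproc&lt;a href="#XSLTPROC">[XSLTPROC]&lt;/a>, ...
</pre>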

<p>The pattern <tt>dl/@class="bib"</tt> is used to find the
bibliography.
Each item is marked up a la:</p>
<pre>
&lt;dt class="misc">[&lt;a name="tetex">tetex&lt;/a>]&lt;/dt>
&lt;dd>
&lt;span class="author">Thomas Esser&lt;/span>
&lt;cite>&lt;a
href="http://www.tug.org/tex-archive/help/Catalogue/entries/tetex.html"
>The TeX distribution for Unix/Linux&lt;/a>&lt;/cite>
February &lt;span class="year">2003&lt;/span>
&lt;/dd>
</pre>

<p>or</p>

<pre>
&lt;dt class="misc" id="tetex">[tetex]&lt;/dt>
...
</pre>

<p>Note the placement of the bibtex item type <tt>misc</tt> and the
tag <tt>tetex</tt> and keep in mind that <tt>bibtex</tt> ignores
works in the bibliography that are not cited from the body.</p>

<p>The <tt><a href="xh2bib.xsl">xh2bib.xsl</a></tt> transformation
turns this markup into BibTeX format. <tt>xh2latex.xsl</tt> transforms
the entire bibliography <tt>dl</tt> to a <tt>\bibliography{...}</tt>
reference.</p>

<p><em>Capitalization of titles seems to get mangled. I'm not sure if
that's a feature of certain bibliography styles or what.</em></p>

</div>

<div><h3>Bugs/Caveats/Misfeatures</h3>

<ul>
<li>Composed characters and such in the bibliography are
handled with a sort of kludge, e.g.
<tt>K&lt;span title='\"o'>&#246;&lt;/span>bler</tt>
</li>
<li>The <tt>samp</tt> element is used to pass LaTeX
math markup through, e.g.
<tt>&lt;samp>\Delta&lt;/samp></tt>
</li>
</ul>
</div>

</div>

<div><h2>Makefile support</h2>

<p>Formatting a LaTeX document is done in several passes.  One <a
href=
"http://amath.colorado.edu/documentation/LaTeX/basics/steps/help_latex.html"
>typical manual</a> shows:</p>

<pre>
ucsub>  latex MyDoc.tex
ucsub>  bibtex MyDoc
ucsub>  latex MyDoc.tex
ucsub>  latex MyDoc.tex
</pre>

<p>The following excerpt from <tt><a
href="html2latex.mak">html2latex.mak</a></tt> shows
some rules to accomplish this using make:</p>

<pre>
.html.tex:
	$(XSLTPROC) --novalid $(HLPARAMS) \
		-o $@ xh2latex.xsl $&lt; 

.html.bib:
	$(XSLTPROC) --novalid -o $@ xh2bib.xsl $&lt;

.tex.aux:
	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $&lt;

.tex.bbl:
	BSTINPUTS=$(BSTINPUTS) $(BIBTEX) $*


.aux.pdf:
	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $*
	TEXINPUTS=$(TEXINPUTS) $(PDFLATEX) $*
</pre>

<p>Sources:</p>
<ul>
  <li><tt><a href="xh2latex.xsl">xh2latex.xsl</a></tt></li>
  <li><tt><a href="xh2bib.xsl">xh2bib.xsl</a></tt></li>
  <li><tt><a href="article.css">article.css</a></tt></li>
</ul>

</div>

<div>
<h2>References</h2>
<dl class="bib">

<dt class="misc">[<a name="tetex">tetex</a>]</dt>
<dd>
<span class="author">Thomas Esser</span>
<cite><a href="http://www.tug.org/tex-archive/help/Catalogue/entries/tetex.html">The TeX distribution for Unix/Linux</a></cite>
February <span class="year">2003</span>
</dd>

<dt class="misc">[<a name="Mann94">Mann94</a>]</dt>
<dd><span class="author">Shannon Mann</span>
<cite><a href="http://www.csclub.uwaterloo.ca/u/sjbmann/tutorial.html">Beginner's LaTeX Tutorial</a></cite>
<span class="year">1994</span>-06-16T15:32:27
</dd>

<dt class="misc">[<a name="Spen98">Spen98</a>]</dt>

<dd><span class="author">Spencer Rugaber</span>
<cite>
<a href="http://www.cc.gatech.edu/classes/RWL/Projects/citation/">The Citation project</a>
</cite>
Summer <span class="year">1998</span>.
</dd>

<dt class="misc">[<a name="Gur00" id="Gur00">Gur00</a>]</dt>
<dd><span class="author">Eitan M. Gurari</span>
<cite><a href="http://www.cse.ohio-state.edu/~gurari/docs/mml-00/xhm2latex.html">XSLT from XHTML+MathML to LATEX</a></cite>
<span class="month">July</span> 19, <span class="year">2000</span>
</dd>

<dt class="misc">[<a name="XSLTPROC" id="XSLTPROC">XSLTPROC</a>]</dt>
<dd><span class="author">Daniel Veillard</span>
<cite><a href="http://xmlsoft.org/XSLT/xsltproc2.html">The xsltproc tool</a></cite>
in <a href="http://xmlsoft.org/XSLT/">libxslt: The XSLT C library for Gnome</a>
1.1.2 <span class="month">Dec</span> 24 <span class="year">2003</span>
</dd>
</dl>
</div>

</body>
</html>