<!-- $Id: html-essay.html,v 1.2 1994/02/15 20:07:12 connolly Exp $ -->
<html>
<head>
<title>Toward a Formalism for Communication On the Web</title>
</head>
<body>

<ADDRESS>Daniel W. Connolly &lt;connolly@hal.com&gt; <P>
$Id: html-essay.html,v 1.2 1994/02/15 20:07:12 connolly Exp $
</ADDRESS>

<H2>Status</H2>

 <P>I had hoped to polish this more before publishing it, but I can't seem
to get caught up... there's so much new stuff all the time!

<H1>Some Background on SGML for the World-Wide Web
</H1>

 <p>In late 1992 and early 1993, I did quite a bit of work on the HTML DTD
while I was working at Convex in the online documentation group.

 <p>When I began, there was the LineMode browser and the NeXT
implementation, and a few nodes in The Web describing HTML with some
oblique references to SGML. I was not intimately familiar with SGML, but
I was quite familiar with the problems of document interchange, and I
was eager to apply some of my formal systems background to the problem.

<H2>On Formally Unconvertible Document Formats
</H2>

 <P>My experience with document interchange led me to classify document
formats using the essential distinction that some are "programmable" and
some are not. Most widely used source forms are programmable: TeX,
troff, PostScript, and the like. On the other hand, there are several "static"
formats: plain text, Microsoft RTF, FrameMaker MIF, and GNU's TeXinfo.

 <P>The reason that this distinction is essential with respect to document
interchange is that extracting information from documents in
"programmable" document formats is equivalent to the halting problem.
That is, it is arbitrarily difficult and cannot be automated in a
general fashion.

 <P>For example, I conjecture that it is impossible to write a program that
will extract the third word from a TeX document. It would be an easy
task for 80% of the TeX documents out there -- just skip over some
formatting stuff and grab the third bunch of characters surrounded by
whitespace. But that "formatting stuff" might be a program that
generates 100 words from the hyphenation dictionary. So the simple
lexical scan of the TeX source would find a word that is <em>not</em> the third
word of the document when printed.
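
 <P>A minimal sketch of that "simple lexical scan", assuming a
hypothetical helper named naive_third_word and a made-up \greet macro
(neither comes from any real tool). It only illustrates how the scan and
the printed page can disagree; it is not a TeX parser.

<pre>
```python
import re

def naive_third_word(tex_source):
    """Return the third whitespace-delimited word after dropping markup.

    Control sequences (a backslash followed by letters) and braces are
    treated as "formatting stuff" and replaced by spaces.
    """
    stripped = re.sub(r"\\[A-Za-z]+|[{}]", " ", tex_source)
    words = stripped.split()
    return words[2] if len(words) >= 3 else None

# The easy case: formatting is skipped and the scan finds the right word.
plain = r"\noindent The quick brown fox"
print(naive_third_word(plain))

# A hypothetical macro used twice. TeX would print
# "Hello there Hello there friend", whose third word is "Hello",
# but the scan sees the macro body only once, at its definition.
tricky = r"\def\greet{Hello there} \greet{} \greet{} friend"
print(naive_third_word(tricky))
```
</pre>

 <P>The second call returns "friend", which is the fifth word of the
printed document. Getting the right answer in general would mean running
the macros, which is exactly the point of the conjecture.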

 <P>This may seem like an obscure and unimportant problem, but I assure you
that the problem of converting TeX tables to FrameMaker MIF is just as
unsolvable.

 <P>So while "programmable" document formats have the advantage that
features can be added on a per-document basis, they suffer the
disadvantage that these features cannot be recovered by the machine and
translated in an automated fashion.


<H2>Document Formats as Communications Media
</H2>

 <P>If we look at document formats in light of the conventional
sender/message/medium/receiver communications model, we see that
document formats capture the message at various levels of
"concreteness".

 <P>The message begins as a collection of concepts and ideas in the mind of
the sender. In order to communicate, the sender and receiver must share
some language. That is, they must both understand some common set of
symbols and the way those symbols combine to represent ideas. The
sender's job is to express the message in terms of the common symbols and
express them on the medium -- that is, "render" or "present" them. The
medium then stimulates the receiver to reconstruct the symbols in his/her
brain -- that is, the receiver "interprets" or "recognizes" the symbols
from the medium. Those symbols interact with other symbols in the
receiver's brain, and the receiver "gets the message."

 <P>The communications medium is often a layered combination of more and
less concrete media. For example, folks first render their ideas in the
symbology of the English language, and then render those symbols as
sequences of spoken phonemes or written characters. Those written
characters are in turn combinations of lines, curves, strokes, and
points. The receiving folks then assemble the strokes into characters,
the characters into words, the words into phrases, sentences, thoughts,
ideas, and so on.

 <P>The most common and ubiquitous document format, plain ASCII text,
captures or digitizes messages at the level of written characters.
PostScript captures the characters as lines, curves, and paths. The GIF
format captures a document as an array of pixels. GIF is in many ways
infinitely more expressive than plain text, which is limited to
arrangements of the 96 ASCII characters.

 <P>The RTF, TeX, nroff, etc. document formats provide very sophisticated
automated techniques for authors of documents to express their ideas. It
seems strange at first to see that plain text is still so widely used.
It would seem that PostScript is the ultimate document format, in that
its expressive capabilities include essentially anything that the human
eye is capable of perceiving, and yet it is device-independent.

 <P>And yet if we take a look at the task of interpreting data back into
the ideas that they represent, we find that plain text is much to be
preferred, since reading plain text is so much easier to automate than
reading GIF files (optical character recognition) or PostScript
documents (halting problem). In the end, while the source to various
TeX or troff documents may correspond closely to the structure of the
ideas of the author, and while PostScript allows the author very precise
control and tremendous expressive capability, all these documents
ultimately capture an image of a document for presentation to the human
eye. They don't capture the original information as symbols that can be
processed by machine.

 <P>To put it another way, rendering ideas in PostScript is not going to
help solve the problem of information overload -- it will only compound
the situation.

 <P>As a real world example, suppose you had a 5000 page document in
PostScript, and you wanted to find a particular piece of information
inside it. The author may have organized the document very well, but
you'd have to print it to use those clues. If the characters aren't
kerned much, you might be able to use grep or sic a WAIS indexing
engine on it. Then, once you've found what looks like PostScript code
for some relevant information, you'd pray that the document adheres to
the Adobe Document Structuring Conventions so that you could pick out
the page containing the information you need and view that page.

 <P>If that's too perverse, look at the problem of navigating a large
collection of technical papers coded in TeX. Many of the authors use
LaTeX, and you may be able to convince the indexing engine to filter
out common LaTeX formatting idioms -- or better yet, weight headings,
abstracts, etc. more heavily than other sections based on the
formatting idioms. While there are heuristic solutions to this problem
that will work in the typical 80%/20% fashion, the general solution is
once again equivalent to the halting problem; for example, individual
documents might have bits of TeX programming that change the
significance of words in a way that the indexing engine won't be able
to understand.


<H2>SGML as a Layered Communications Medium
</H2>

 <P>So where does SGML fit into the sender/message/medium/receiver game?

 <P>I'll use PostScript as a basis of comparison. The PostScript model
consists of a fairly powerful and general purpose two dimensional
imaging model, that is, a set of primitive symbols for specifying sets
of points in two dimensions using handy computational techniques, and a
general purpose programming model for building complex symbols out of
those primitives. That model is applied extensively to the problem of
typography, and there is an architecture (that is, a set of well known
symbols derived from the primitives) for using and building fonts.

 <P>So to communicate a message consisting of symbols from human
communications in PostScript, one may choose from a well known set of
typefaces, or create a new typeface using the well known font
architecture, or free-hand draw some characters using PostScript
primitives, or draw lines, boxes, circles and such using PostScript
primitives, or scribble on a piece of paper, scan it, and convert the
bits to use the PostScript image operator. The space of symbols is
nearly limitless, as long as those symbols can be expressed ultimately
as pixels on a page.

 <P>The distinctive feature of PostScript (an advantage at times, and a
disadvantage at others) is that whether you print it and deliver the
paper or you deliver the PostScript and the receiver prints it out, the
result is the same bunch of images.

 <P>The SGML model, on the other hand, specifies no general purpose
programming model where complex symbols can be defined in terms of
primitive symbols. The meaning of a symbol is either found in the SGML
standard itself, or in some PUBLIC document (which may or may not be
machine readable), or in some SYSTEM specific manner, or defined by an
SGML application. The only real primitives are the character and the
"non-SGML data entity".

 <P>The model prescribes that a document consist of a declaration, a
prologue, and an instance. The declaration is expressed in ASCII and
specifies the character sets and syntactic symbols used by the prologue
and instance. The prologue is expressed in a standard language using the
syntactic symbols from the declaration, and specifies a set of entities
and a grammar of element types available to the instance.

 <P>The instance is a sequence of elements, character data, and entities
constrained by the grammar set forth in the prologue, and the SGML
standard does not specify any semantics or meaning for the instance.

 <P>So to communicate using SGML, the sender first chooses a character set
and certain processing quantities and capacities. For example "I'm
writing in ASCII, and I'll never use an element name more than 40
characters long" is some information that can be expressed in the SGML
declaration. [The standard allows the SGML declaration to be implicitly
agreed upon by sender and receiver, and this is generally the case].

 <P>The tricky part is the prologue, where the sender gives a grammar that
constrains the structure of the document. Along with the information
actually expressed in SGML in the prologue, there is usually some amount
of application defined semantics attached to the element types. For
example, the prologue may express in SGML that an H2 element must occur
within the content of an H1 element. But the convention that text in an
H1 is usually displayed larger and considered more important is
application defined.

 <P>Once the prologue is determined (this usually involves considerable
discussion between a collection of authors and consumers in some
domain -- in the end, there may be some "parameter entities" in the
prologue which allow some variation on a per-document basis), the sender
is constrained to a rigorous structure for the organization of the
symbols and character data of the document. On the other hand, s/he has
an automated technique for verifying that s/he has not violated the
structure, and hence there is some confidence that the document can be
consumed and processed by machine.


<H1>The HTML DTD: Conforming, though Expedient
</H1>

<H2>Design Constraints of the HTML DTD
</H2>

 <P>Tim's original conception of HTML was that it should be about as
expressive as RTF. In contrast to traditional SGML applications where
documents might be batch processed and complex structure is the norm,
HTML documents are intended to be processed interactively. And the
widespread success of WYSIWYG word processors based on fairly flat
paragraph structure was proof that something like RTF was suitable for a
fairly wide variety of tasks.

 <P>As I learned a little about SGML, it was clear that the WWW browser
implementation of HTML sorely lacked anything resembling an SGML entity
manager. And there were some syntactic inconsistencies with the SGML
standard. And it didn't use the ID/IDREF feature where it should have...

 <P>Then, as I began to comprehend SGML with all its warts, (whose idea was
it to attach the significance of a newline character to the phase of the
moon anyway?) I was less gung-ho about declaring all the HTML out there
to be blasphemy to the One True SGML Way.

 <P>Thus I chose for my battle to find some formal relationship between the
SGML standard and the  HTML that was "out there." The quest was:

<H3>Find some DTD such that the vast majority of HTML documents are
instances of that DTD and, conversely, such that all its instances make
sense to the existing WWW clients.
</H3>

 <P>I struggled mightily with such issues as:

<UL>
<LI>Should we be sticking &lt;! DOCTYPE HTML SYSTEM> in .html files? What
if somebody puts an entity declaration in there? (And does that mean
that WWW clients have to be able to parse SGML prologues in general?)

<LI>What's the syntax of an attribute value? If we allow SHORTTAG YES,
does that mean we have to parse <CODE>&lt;em/this/</CODE> style of
markup too?

<LI>Can we put some short reference maps in the DTD that will cause real
SGML parsers and current WWW browsers to do the same thing w.r.t.
newlines? (i.e., can we make all that phase-of-the-moon processing with
newlines a moot issue?)

<LI>What about marked sections? Short reference maps?

<LI>What character set should we be using? How do I express ISO-Latin-1
in the SGML declaration? How should authors express the '&lt;' character?
How should this be expressed in the DTD?

<LI>How do you put quotes in an attribute value literal?

<LI>How can I deal with the current paragraph element idioms without
using minimization?

<LI>Can I stick base64 encoded stuff in a CDATA element? Do I have to
watch out for &lt;'s and such?

<LI>How do we combine SGML and multimedia data in the same data stream?

</UL>


 <P>I found solutions to some problems, and punted on others. I probably
should have put more comments in the DTD regarding the compromises. But
I wanted to keep the DTD stripped down to the normative information and
keep the informative information in other documents.

 <P>I did, by the way, draft a series of 4 or 5 documents demonstrating
various structural and syntactic features of SGML -- a sort of
validation suite. I'm not sure where it went.

 <P>I'd like to respond to Eliot Kimber's critique of the HTML DTD that I
posted.

<pre>
>At the bottom of this posting is a slightly modified copy of the
>HTML DTD that conforms to the HyTime standard.  I have not modified
>the elements or content models in any way.  I have not added any
>new elements.  I have only added to the attribute lists of a few
>elements.
>
>The biggest change I made was to the way URL addresses are handled.
>In order to use HyTime (as opposed to application-specific)
>methods for doing addressing, I had to change the URL address
>from a direct reference into an entity reference where the
>entity's system identifier is its URL address.
</pre>

 <P>I suggested this long ago, but Tim shot the idea down. As I recall, he
said that all that extra markup was a waste. On the one hand, I agree
with him -- the purpose of a language is to be able to express common
idioms succinctly, and SGML/HyTime are poor in that respect. On the
other hand, once you've chosen SGML, you might as well do as the Romans do.

<pre>
>  This makes
>the link elements conform to the architectural forms and puts
>in enough indirection to allow other addressing methods to
>be used to locate the objects without having to modify the
>links, only the entity declarations.
</pre>

 <P>Why is it easier to modify entity declarations than links? Six of one,
half-dozen of the other if you ask me.

<pre>
>  I use SUBDOC entities
>for refering to other complete documents, although I'm not
>sure this the best thing, but there's no other construct in
>SGML that works as well.  Note that nowwhere in 8879 does it
>define what must happen as the result of a SUBDOC reference,
>except that a new parsing context is established.  The actual
>result of a SUBDOC reference is a matter of style and presumably
>in a WWW context it would result in the retrieval of the document
>and its presentation in a seperate window.  The key is that
>the subdoc reference establishes a specific relationship between
>the source of the link and the target, namely one document
>refering to another.  The target document could also be defined
>as a data entity with whatever notation is appropriate (possibly
>even SGML if it's another SGML document).  This may be the better
>approach, I don't know.
</pre>

 <P>I don't expect that the data entity/subdocument entity distinction
matters one hill of beans to contemporary WWW clients. I'm interested to
know if it means anything to HyTime engines.

<pre>
>If I were re-designing the HTML, I would add direct support
>for HyTime location ladders using at a minimum the nameloc,
>notloc, and dataloc addressing elements.  However, if these
>elements are needed for interchange they could be generated
>from the information contained in WWW documents using the
>DTD below, so it's not critical.
>
</pre>

 <P>Could you expand on that? If we'll be "generating" compliant SGML for
interchange, we might as well use TeXinfo or something practical like
that for application-specific purposes.

<pre>
>This is just one attempt at applying HyTime to the HTML.
>I'm sure there are other equally-valid (or more valid)
>ways it could be done.  Given the current functionality
>of the WWW, I'm sure there are ways to express that functionality
>using HyTime constructs.  HyTime constructs may also suggest
>useful ways to extend the WWW functionality, who knows.
</pre>

 <P>I finally got to actually read the HyTime standard the other day, and
the clink and notloc forms looked most useful. I'm also interested in
expressing some of the "relative link" idioms used in HTML
(e.g., how would we express HREF="../foo/bar.html#zabc" using HyTime? The
object of the game is to do it in such a way that the markup can be
copied verbatim from one system to another (say unix to VMS) and have
the right meaning).
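
 <P>For comparison, here is a sketch of how that relative link resolves
under ordinary URL rules, using Python's standard urllib.parse module.
The base address is a made-up assumption; only the relative reference
comes from the example above. This is not a HyTime answer, just the
behavior any HyTime encoding would need to reproduce.

<pre>
```python
from urllib.parse import urljoin, urldefrag

# Hypothetical address of the document that contains the link.
base = "http://host/docs/guide/intro.html"

# The relative reference from the text, copied verbatim.
relative = "../foo/bar.html#zabc"

# Resolution depends only on the base URL, not on the local file system.
absolute = urljoin(base, relative)
url, fragment = urldefrag(absolute)

print(absolute)  # http://host/docs/foo/bar.html#zabc
print(url)       # http://host/docs/foo/bar.html
print(fragment)  # zabc
```
</pre>

 <P>Because the "../" is interpreted against the URL path rather than a
unix or VMS directory, the same markup keeps its meaning when the files
move between systems.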

<pre>
>&lt;!ENTITY % URL "CDATA"
>        -- The term URL means a CDATA attribute
>           whose value is a Universal Resource Locator,
>           as defined in ftp://info.cern.ch/pub/www/doc/url3.txt
>        -->
>&lt;!--=====================================================================
>    WEK:  I have defined URL addresses as a notation so that they can
>          be then used in a notloc element.
>    =====================================================================-->
>&lt;!NOTATION url PUBLIC "-//WWW//NOTATION URL/Universal Resource Locator
>                             /'ftp: info.cern.ch/pub/www/doc/url3.txt'
>                             //EN"
>>
</pre>

 <P>Cool, good idea.

<pre>
>
>&lt;!ENTITY % linkattributes
>        "NAME NMTOKEN #IMPLIED
>        HREF ENTITY #IMPLIED
>
> --=== WEK =======================================================
>
>      HREF is now an entity attribute rather than containing a
>      URL address directly.  To create a link using a URL address,
>      declare a SUBDOC or data entity and make the system
>      identifier the URL address of the object:
>
>      &lt;!ENTITY  mydoc SYSTEM "URL address of document " SUBDOC >
>
>      This indirection gives to things:
>
>      1. A way to protect links in the source from changes in the
>         location of a document since the physical address is only
>         specified once.
</pre>

 <P>Ah... now I get it... in case you have lots of links to mydoc or parts
of mydoc, you only have one place that defines where mydoc is. Nifty.
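
 <P>A minimal sketch of that indirection, with the entity declarations
modeled as a plain lookup table. The names entities, links, and resolve
are hypothetical, as are the addresses and anchor names; a real SGML
entity manager does considerably more.

<pre>
```python
# Hypothetical entity table: one declaration per target document.
entities = {"mydoc": "http://host/papers/mydoc.html"}

# Links name the entity, plus an optional part of the document,
# instead of repeating the physical address.
links = [("mydoc", "intro"), ("mydoc", "results"), ("mydoc", None)]

def resolve(entity_name, anchor, table):
    """Turn an entity reference into an address by looking it up."""
    address = table[entity_name]
    return address if anchor is None else address + "#" + anchor

before = [resolve(name, anchor, entities) for name, anchor in links]
print(before)

# The document moves: one entry changes, and the links are untouched.
entities["mydoc"] = "http://mirror/archive/mydoc.html"
after = [resolve(name, anchor, entities) for name, anchor in links]
print(after)
```
</pre>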

<pre>
>
>      2. An opportunity to use other addressing methods, including
>         possibly replacing the URL with an ISO formal public
>         identifier.
>    =================================================================-->
>
>        TYPE NAME #IMPLIED -- type of relashionship to referent data:
>                                PARENT CHILD, SIBLING, NEXT, TOP,
>                                 DEFINITION, UPDATE, ORIGINAL etc. --
>        URN CDATA #IMPLIED -- universal resource number. unique doc id --
>        TITLE CDATA #IMPLIED -- advisory only --
>        METHODS NAMES #IMPLIED -- supported methods of the object:
>                                        TEXTSEARCH, GET, HEAD, ... --
>        -- WEK: --
>        LINKENDS  NAMES #IMPLIED
>          -- Linkends takes one or more NAME= values for local links--
>        HyNames  CDATA #FIXED 'TYPE ANCHROLE URN DOCORSUB'
>        ">
</pre>

 <P>I thought the ANCHROLEs of a clink were defined by HyTime to be
REFsomething and REFSUB. Or are those just defaults? Also... does the
HyNames thing work locally like this? What a HACK!

<pre>
>
>&lt;!--=== WEK ==========================
>
>    The HyNames= attribute maps the local attribute names to their
>    cooresponding HyTime forms.
>
>    The Methods= attribute is bit of a puzzle since it is really
>    a part of the hyperlink presentation/processing style, not
>    a property of the anchors, but there's nothing wrong with
>    having application-specific stuff in your HyTime application.
</pre>

The Methods= attribute has been stricken :-(. It was motivated by the
observation that textsearch interactions in WWW go like this:

<OL>
<LI>Doc A says "click here[23] to see the index"
<LI>user clicks
<LI>client fetches link 23, "http://host/index"
<LI>displays "cover page" document
<LI>user enters FIND abc
<LI>client fetches "http://host/index?abc"
<LI>search results are displayed
</OL>

Whereas in gopher, you get to save a step if you like:

<OL>
<LI>Doc A says "click here[23] to search the index"
<LI>user clicks
<LI>client displays "enter search words here: " dialog
<LI>user enters FIND abc
<LI>client fetches "http://host/index?abc"
<LI>search results are displayed
</OL>

So to specify the latter, you would create a link with Methods=textsearch.
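
 <P>The difference between the two interactions can be sketched as the
list of addresses a client fetches. The function urls_fetched and its
parameters are hypothetical and only model the steps listed above; they
are not taken from any actual client.

<pre>
```python
def urls_fetched(href, query, methods=()):
    """List the addresses a hypothetical client fetches to run one search."""
    fetched = []
    if "textsearch" not in methods:
        # Plain link: the cover page is fetched before the user can search.
        fetched.append(href)
    # Either way, the search itself is a fetch of href?query.
    fetched.append(href + "?" + query)
    return fetched

# WWW-style interaction: two fetches.
print(urls_fetched("http://host/index", "abc"))

# Link marked Methods=textsearch: the client prompts first, one fetch.
print(urls_fetched("http://host/index", "abc", methods=("textsearch",)))
```
</pre>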

<pre>
>    I added LinkEnds= so that the various linking elements will
>    completely conform to the clink and ilink forms.  The presence
>    of the LinkEnds= attribute does not imply required support
>    for this type of linking, but it does make HTML more consistent
>    with other DTDs that do use the LinkEnds= attribute form.
>
>    Note that 10744 shows the attribute name for the ILINK form
>    to be 'linkend', not 'linkends'.  I consider this to be a
>    typo, as there's no logical reason to disallow multiple anchors
>    from a clink and lack of it puts an undue requirement of
>    specifying otherwise unneeded nameloc elements.  In any case,
>    an application can transform linkends= to linkend= plus a
>    nameloc, so it doesn't matter in practice.
</pre>

Are there <EM>any</EM> HyTime implementations out there? Do they use
'linkend' or 'linkends'? It's hard to believe that HyTime became a
standard without a proof-of-concept implementation.

<pre>
>
>&lt;!ELEMENT P     - O EMPTY -- separates paragraphs -->
>&lt;!--=== WEK ==========================================================
>
>    Design note:  This seems like a clumsy way to structure information.
>                  One would expect paragraphs to be containing.
>
>    ==================================================================-->
</pre>

Yeah, well, try implementing end tag inference in &lt;1000 or so lines of code.
Maybe we'll get it right next time...

<pre>
>&lt;!ELEMENT DL    - -  (DT | DD | P | %hypertext;)*>
>&lt;!--    Content should match ((DT,(%hypertext;)+)+,(DD,(%hypertext;)+))
>        But mixed content is messy.
>  -->
>&lt;!--=== WEK ============================================================
>
>    Design note:  This content should be:
>
>    &lt;!ELEMENT DL  - - (DT+, DD)+ >
>    &lt;!ELEMENT (DT | DD) - O (%hypertext;)* >
>
>    There's no reason for DT and DD to be empty.  Perhaps there was
>    some confusion about the problems with mixed content?  There are
>    none here.
>
>    These comments apply to the other list elements as well.
>
>    ====================================================================-->
</pre>

The problem is that DL, DT, DD, UL, OL, and LI were marked up in extant
HTML documents as if minimization were supported. But I didn't want to
introduce minimization into the implementation, so I made the DT, DD,
and LI elements empty.
<p>

It's possible I'm confused about mixed content, but the way I understand
it, you don't want to use mixed content except in repeatable "or" groups,
because authors will stick whitespace in where it is meant to be ignored
but it won't be.

<pre>
>
>&lt;!-- Character entities omitted.  These should be separate from
>     the main DTD so specific applications can define their values.
>     ISO entity sets could be used for this.
>  --&gt;
</pre>

Another point I should have explained in the DTD: the WWW application
specifies that HTML uses the Latin-1 character set, and that the Ouml
entity represents exactly that character from the Latin-1 character set and
not some system specific thingy. Translation to system character sets is
done <em>outside</em> of the SGML parser.

</body>
</html>