<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
 "http://www.w3.org/TR/REC-html40/sgml/HTML4-loose.dtd">
<HTML>
<HEAD>
  <TITLE>XML in HTML Meeting Report</TITLE>
</HEAD>
<BODY>
<DIV class="header">
  <H2 align="right">
    <A href="http://www.w3.org/">
    <IMG border="none" align="left" alt="W3C" src="http://www.w3.org/Icons/WWW/w3c_home"></A>NOTE-xh-19980511
  </H2>
  <H1 align="center">
    XML in HTML Meeting Report
  </H1>
  <H3 align="center">
    W3C Note 11 May 1998
  </H3>
  <DL>
    <DT>
      This Version:
    <DD>
      <A HREF="http://www.w3.org/TR/1998/NOTE-xh-19980511">http://www.w3.org/TR/1998/NOTE-xh-19980511</A><BR>
      $Date: 2007/01/26 10:12:49 $
    <DT>
      Latest Version:
    <DD>
      <A HREF="http://www.w3.org/TR/NOTE-xh">http://www.w3.org/TR/NOTE-xh</A>
    <DT>
      Editors:
    <DD>
      <A href="http://www.w3.org/People/Connolly/">Dan Connolly</A>
      <A href="mailto:connolly@w3.org"><TT>&lt;connolly@w3.org&gt;</TT></A> W3C<BR>
      Lauren Wood
      <A href="mailto:lwood@sq.com"><TT>&lt;lauren@softquad.com&gt;</TT></A> SoftQuad
  </DL>
  <H2>
    Status of This Document
  </H2>
  <P>
  This document summarizes the discussion and conclusions of a meeting held
  to coordinate across several W3C Working Groups. While the decisions of this
  forum are not binding on any of the working groups, they represent substantial
  experience and analysis and should guide future work.
  <P>
  Please direct comments to
  <A HREF="../../MarkUp/Forums#www-html">www-html</A>, a public discussion
  forum.
  <P>
  This document is a NOTE made available by the W3 Consortium for discussion
  only. This indicates no endorsement of its content, nor that the Consortium
  has, is, or will be allocating any resources to the issues addressed by the
  NOTE.
  <P>
    <HR>
  <H2>
    Contents
  </H2>
  <OL>
    <LI>
      <A HREF="#About">About the Meeting</A>
      <OL>
	<LI>
	  <A HREF="#Background">Background</A>
	<LI>
	  <A HREF="#who">Participants</A>
      </OL>
    <LI>
      <A HREF="#Summary">Summary of Discussion</A>
      <OL>
	<LI>
	  <A HREF="#RDF">RDF Requirements</A>
	<LI>
	  <A HREF="#MathML">MathML Requirements</A>
	<LI>
	  <A HREF="#Types">Types of HTML</A>
	<LI>
	  <A HREF="#General">General Requirements</A>
	<LI>
	  <A HREF="#sol">Possible Solutions</A>
      </OL>
    <LI>
      <A HREF="#Conclusions">Conclusions and Future Work</A>
      <OL>
	<LI>
	  <A HREF="#link">Including XML by Reference</A>
	<LI>
	  <A HREF="#attrs">Using Attributes to Hide New Idioms</A>
	<LI>
	  <A HREF="#xml-block">Using <TT>&lt;XML&gt;</TT>, an HTML Enhancement</A>
	<LI>
	  <A HREF="#script-hack">Using Script to Hide Content in Older Browsers</A>
	<LI>
	  <A HREF="#sprinkles">Using Namespaces, Stylesheets, and the DOM</A>
      </OL>
    <LI>
      <A HREF="#References">References</A>
  </OL>
</DIV>
<P>
  <HR>
<H2>
  <A NAME="About">About the Meeting</A>
</H2>
<P>
A number of issues regarding the use of XML<A HREF="#XML">[XML]</A> in HTML
documents were brought to the attention of the W3C Hypertext Coordination
Group. In particular, MathML<A HREF="#mathml">[MathML]</A> and
RDF<A href="#rdf-syntax">[RDF]</A> are written in XML and intended to be
used in HTML documents.
<P>
In response, the coordination group held a meeting 11-12 Feb 1998 in San
Jose, CA. We would like to thank the host, Sun Microsystems.
<H3>
  <A NAME="Background">Background</A>
</H3>
<P>
As discussed in <A HREF="#Dialects">[Dialects]</A>, evolution of the HTML
specification proceeds by introduction of new idioms which interact with
deployed software in one of the following ways:
<DL>
  <DT>
    The idiom is ignored altogether.
  <DD>
    for example, <TT>&lt;img src="..."&gt;</TT> was ignored by the deployed software
    base when it was introduced. New empty elements and new attributes generally
    behave this way.
  <DT>
    The enhanced functionality of the new idiom is ignored, but the content is
    otherwise handled sensibly.
  <DD>
    for example, <TT>&lt;em&gt;abc&lt;/em&gt;</TT> displays without emphasis
    on some very old user agents. New "inline" elements often behave this way.
  <DT>
    The idiom is disruptive in deployed software
  <DD>
    for example, forms and tables display as a jumble of noise in software deployed
    before they were introduced. New block elements are particularly difficult
    to deploy gracefully.
</DL>
<P>
For the past few years, the HTML Working Group has vetted new proposals on
behalf of the web community, considering the value of each versus the cost
of deployment. But with the introduction of XML into the web, markup design
is decentralized. Each community or even each user can use whatever elements
and attributes they choose and give them whatever meaning and significance
they choose. As MathML and RDF show, at least some of this XML markup is
intended for use inside HTML documents.
<P>
This meeting explored mechanisms to use XML markup in HTML documents: existing
mechanisms and possible enhancements. In particular:
<UL>
  <LI>
    how do we "hide" new idioms from deployed software?
  <LI>
    how do we introduce new idioms with distinctive display characteristics,
    such as MathML?
</UL>
<H3>
  <A NAME="who">Participants</A>
</H3>
<P>
Participants from all W3C working groups, especially RDF, MathML, CSS&amp;FP,
XML, and DOM, were invited. A wide variety of experience and requirements
was represented by the meeting participants:
<UL>
  <LI>
    Vidur Apparao, Netscape
  <LI>
    Jon Bosak, Sun (host)
  <LI>
    Dean Burson, Lotus
  <LI>
    Dan Connolly, W3C (co-chair)
  <LI>
    Ramanathan Guha, Netscape
  <LI>
    Bruce Hunt, Adobe
  <LI>
    Jacob Levy, Sun
  <LI>
    Eve Maler, ArborText
  <LI>
    Murray Maloney, CN Group
  <LI>
    John McCarthy, Berkeley Labs
  <LI>
    Robert Miner, University of Minnesota
  <LI>
    Scott Isaacs, Microsoft
  <LI>
    Jean Paoli, Microsoft
  <LI>
    T.V. Raman, Adobe
  <LI>
    Nisheeth Ranjan, Netscape
  <LI>
    David Singer, IBM
  <LI>
    Bob Sutor, IBM
  <LI>
    Ralph Swick, W3C
  <LI>
    Paul Topping, Design Science
  <LI>
    Chris Wilson, Microsoft
  <LI>
    Lauren Wood, SoftQuad (co-chair)
</UL>
<H3>
  Miscellaneous
</H3>
<P>
The participants request that W3C make the W3C site searchable.
<H2>
  <A NAME="Summary">Summary of Discussion</A>
</H2>
<H3>
  <A NAME="RDF">RDF Requirements</A>
</H3>
<P>
The
<A HREF="http://www.w3.org/RDF/Group/1998/02/WD-rdf-syntax-19980216/#usage">Appendix
B</A> of <A href="#rdf-syntax">[RDF]</A> says:
<BLOCKQUOTE>
  The recommended technique for embedding RDF statements in an HTML document
  is simply to insert the RDF in-line. This will make the resulting document
  non-conformant to HTML specifications up to and including HTML 4.0 but the
  RDF Working Group hopes that the HTML specification will evolve to support
  this.
</BLOCKQUOTE>
<P>
The discussion around the RDF requirements showed that possible solutions
for RDF included putting all the information into attributes; putting it
in an external file; and putting it at the end of the document. In general
the participants thought that putting information into attributes was safer
than putting it in an external file because of worries about security and
forcing tools to be able to cope with multiple files. Since many tools already
have to cope with multiple files, other participants thought this was not
a drawback where security was not an issue. Some participants thought that
putting the information in an external file would sometimes be a necessity,
so tools would have to learn to cope.
<H3>
  <A NAME="MathML">MathML Requirements</A>
</H3>
<P>
MathML has many requirements. One of these is a system that can cope with
several small chunks of XML in one document, since a document may have many
small equations. It has extreme formatting requirements, only some of which
are shared by other objects. There was some discussion of MathML needs in
terms of the DOM and formatting properties. The MathML has to be able to
be passed as a chunk to an external renderer, and the XML has to be able
to be formatted in a reasonable way. The MathML does not include HTML elements
within it. That was discussed within the MathML WG, but rejected. The requirement
that the content of MathML should not show up in down-level browsers was
not as strong for MathML as for RDF, although some of the participants thought
it would be best.
<H3>
  <A NAME="Types">Types of HTML</A>
</H3>
<P>
The participants came to the conclusion that there was definite agreement
on doing an XML block, where the contents of the block are well-formed XML,
without any HTML semantics. There was much discussion about whether a
reasonable method to include significant non-standard non-empty elements
could be found, and whether there was a possibility of defining some sort
of "good" HTML that people would use. Reasons for not allowing HTML semantics
in the XML block, even on elements with the same element types as exist in
HTML, included
<OL>
  <LI>
    Browsers would need to expose their rendering model to other processors too soon.
  <LI>
    Different error-handling mechanisms
  <LI>
    All XML processors would need to process HTML, and users might expect that
    processing to match current HTML browsers
</OL>
<P>
There was also some support for doing an XML version of HTML, where all the
XML rules would apply.
<P>
The discussion about whether it was possible to require that the contents
of any non-standard elements be well-formed XML mostly came to the conclusion
that it wasn't; or that it would be extremely expensive for those users simply
wanting to add, e.g., a CHAPTER element to their pages. There was support
for the notion that there is a difference between adding XML to pages (where
the contents of the XML would be well-formed XML) and adding unknown elements
in a standard way to HTML (where the contents of the unknown element would
not follow XML well-formed rules.) Whether the HTML in an unknown HTML element
needed to be "good" HTML wasn't fully clarified at the meeting.
<P>
Another problem is that old browsers render PIs.
<H3>
  <A NAME="General">General Requirements</A>
</H3>
<P>
During the discussion the following requirements were generally agreed upon.
<UL>
  <LI>
    A method in HTML to declare that a tag begins a block of XML
  <LI>
    A method in HTML to declare that an unknown tag is significant (versus the
    default "ignored" case), and whether the tag is empty or not.
</UL>
<P>
Agreement on terminology: XML blocks, significant non-standard HTML elements
(sometimes also called sprinkles), and crud (or real-world HTML). But how
do we distinguish between XML blocks and significant elements? An XML block
contains XML -- not HTML. A significant element contains HTML -- not XML
(unless it's empty, of course; we have to be able to distinguish between
empty and non-empty).
<H3>
  <A NAME="sol">Possible Solutions</A>
</H3>
<P>
The question of how to "sprinkle" non-standard elements in an HTML document
while retaining HTML semantics of all elements with HTML element and attribute
types devoured most of the meeting. We did not come to a final conclusion
on this subject. One proposed solution was to use new elements called CONTAINER
and LEAF, with the CLASS attribute used to show the type. The drawback is
that users can't define non-standard attributes. There was also much discussion
as to whether users would accept this sort of solution, or whether they would
want to invent their own element types. It was felt that this solution would
allow users to keep on using "real" HTML (a.k.a. crud) inside the wrapper
elements.
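<P>
As a rough illustration only, the CONTAINER and LEAF proposal might have looked
something like the following. The element names come from the discussion above;
the class values (<TT>chapter</TT>, <TT>pagebreak</TT>) and the exact syntax are
assumptions made for this sketch, since the meeting did not settle the details.
<PRE>
&lt;CONTAINER CLASS="chapter"&gt;
  &lt;H2&gt;Getting Started&lt;/H2&gt;
  &lt;P&gt;Ordinary HTML continues to work inside the wrapper.
  &lt;LEAF CLASS="pagebreak"&gt;
&lt;/CONTAINER&gt;
</PRE>
<P>
In this sketch the type is carried entirely by CLASS; there is no place for a
user-defined attribute, which is the drawback noted above.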
<P>
Another proposal was to allow users to define their own wrapper elements.
If all elements within the block have end tags, even if they are EMPTY elements,
then this could be the way to extensible HTML (not XML). There were several
points against this, including the large number of non-standard EMPTY elements
that already exist. Many participants thought that defining browser behaviour
for this would be almost impossible, and that migrating HTML users to XML
with the HTML tagset was a better solution.
<P>
How to clean up HTML came up again and again in the discussions. The participants
agreed that it is impossible in the general case to create valid HTML from
an arbitrary page on the Web without human intervention. Users will not want
to risk breaking documents which function. Current HTML has three components:
the element type names, default rendering, and semantics (e.g. forms).
<P>
There was a strong contingent that said users should wait for XML tools to
become generally available and use those, rather than trying to add XML to
HTML.
<P>
The MathML group would like a mechanism to tell browsers a plain-text string
to render, if the equation can't be rendered. This sort of mechanism would
potentially be useful for other XML content with high rendering requirements
as well.
<P>
The biggest reason to come up with a standard method for adding XML (or unknown
HTML) to HTML is to allow people to use styles and the DOM with these elements.
Currently they can't. Browsers do not apply CSS styles to unknown elements,
and unknown container elements are not exposed as containers in the MSIE
object model. (The DOM WG decided not to tackle the problem, and only talks
about valid HTML 4.0 documents, and XML as a separate entity.)
<P>
A potential solution was to write HTML as XML, i.e. with MIME-type text/xml.
Then all the XML rules would apply. One problem with this is that some browsers
sniff the document irrespective of MIME-type and display the content if it
looks like HTML according to some heuristic<A HREF="#InetSDK">[InetSDK]</A>,
<A HREF="http://www.microsoft.com/msdn/sdk/inetsdk/help/itt/monikers/appendix_a.htm">Appendix
A</A>. This may include, for example, having a TITLE element anywhere within
the first 200 bytes of the document. Thus document providers may have to
add a comment long enough to get rid of the heuristics.
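<P>
A minimal sketch of such a document, assuming it is served with the MIME type
text/xml. The wording and length of the padding comment are illustrative; the
actual sniffing heuristics are browser-specific and are not specified in this
report.
<PRE>
&lt;?xml version="1.0"?&gt;
&lt;!-- This comment is padding. It exists only so that the TITLE element
     below falls outside the leading part of the document that some
     browsers inspect when guessing the content type. Extend it as
     needed for the browsers of interest. --&gt;
&lt;HTML&gt;
  &lt;HEAD&gt;&lt;TITLE&gt;XML in HTML Meeting Report&lt;/TITLE&gt;&lt;/HEAD&gt;
  &lt;BODY&gt;&lt;P&gt;All XML rules apply here, so every element is closed.&lt;/P&gt;&lt;/BODY&gt;
&lt;/HTML&gt;
</PRE>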
<H2>
  <A NAME="Conclusions">Conclusions and Future Work</A>
</H2>
<H3>
  <A NAME="link">Including XML by Reference</A>
</H3>
<P>
The first option for using XML in HTML documents is to include it by reference,
using <TT>&lt;LINK&gt;</TT>, <TT>&lt;A&gt;</TT>,
<TT>&lt;OBJECT&gt;</TT> or perhaps even <TT>&lt;IMG&gt;</TT>. This markup
conforms to existing W3C Recommendations. This gives predictable behaviour
across the whole spectrum of HTML user agents, at the cost of managing and
accessing the compound document.
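<P>
The following sketch shows two of these forms. The file names
(<TT>metadata.rdf</TT>, <TT>equation.xml</TT>), the <TT>rel</TT> value, and the
fallback text are assumptions for illustration, not conventions agreed at the
meeting.
<PRE>
&lt;HEAD&gt;
  &lt;TITLE&gt;Quadratic Equations&lt;/TITLE&gt;
  &lt;LINK rel="meta" type="text/xml" href="metadata.rdf"&gt;
&lt;/HEAD&gt;
&lt;BODY&gt;
  &lt;P&gt;The discriminant is
  &lt;OBJECT data="equation.xml" type="text/xml"&gt;b^2 - 4ac&lt;/OBJECT&gt;.
&lt;/BODY&gt;
</PRE>
<P>
A user agent conforming to HTML 4.0 that cannot render the OBJECT data falls
back to the element's content, here a plain-text form of the equation.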
<H3>
  <A NAME="attrs">Using Attributes to Hide New Idioms</A>
</H3>
<P>
Another option with predictable behaviour is to use tags and attributes only,
and avoid character data which will be displayed by deployed software. Strictly
speaking, documents enhanced this way do not conform to the HTML 2, 3.2,
or 4.0 specification, but each of those specifications included a note to
implementors to ignore unknown attributes.
<P>
The XML namespace facility<A HREF="#XML-Names">[XML-Names]</A> should be
used to manage the risk of name collisions for new attributes and elements.
Note that unfortunately, much of the deployed base of user agents will display
XML namespace declarations as text.
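<P>
For illustration, a sketch of this style, in which all of the added information
lives in attribute values so that nothing extra is displayed. The <TT>dc:</TT>
prefix and the attribute and element names are hypothetical, and the namespace
declaration that would bind the prefix is omitted because its syntax is defined
by the namespace draft rather than by this report.
<PRE>
&lt;P dc:creator="Dan Connolly" dc:date="1998-05-11"&gt;
  This paragraph displays normally; deployed browsers are expected to
  skip the attributes they do not recognise.
&lt;/P&gt;
&lt;dc:subject dc:value="XML in HTML"&gt;
</PRE>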
<H3>
  <A NAME="xml-block">Using <TT>&lt;XML&gt;</TT>, an HTML Enhancement</A>
</H3>
<P>
The linking and attributes mechanisms do not satisfy all of the requirements
presented at the meeting. It was agreed that an enhancement to HTML to accommodate
XML blocks is necessary.
<P>
The definition of an XML block is a chunk of well-formed XML that is inside
an HTML document. Any elements within the chunk that happen to have the same
element types as HTML elements are <EM>not</EM> considered to be HTML elements.
The error-handling as defined in the XML specification applies, i.e. the
processor <EM>must</EM> halt on well-formedness errors.
<P>
There were two proposals for this. (Other proposals that were discussed were
discovered to be variations of these).
<OL>
  <LI>
    using namespaces, which means the presence of a colon in an element type
    implies that the contents are well-formed XML
  <LI>
    using a specific element type (the discussion centered around XML and XML-BLOCK
    and eventually we settled for XML)
</OL>
<P>
Using a specific element type has the advantage that the meaning is clear,
and that attributes can be added to the element for such things as MIME-type
and a link to an external file containing the XML content.
<P>
For the XML block case, the group decided by a vote of 10 for and 1 abstention
(none against) to use an element called XML. This must be added to a future
version of HTML. The attributes are TYPE for the MIME-type and SRC for the
URL of the content if it is in an external file. The contents of the XML
element are XML. There is an xml PI at the beginning of the XML block that
contains all other information that the XML block needs.
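<P>
A sketch of what the two forms might look like under the decisions recorded
above. The element name and the TYPE and SRC attributes are from the vote; the
MIME type value, the file name, and the MathML fragment are illustrative
assumptions, and the details are left to the Working Draft.
<PRE>
&lt;!-- XML content inline, introduced by an xml PI --&gt;
&lt;XML TYPE="text/xml"&gt;
&lt;?xml version="1.0"?&gt;
&lt;math&gt;
  &lt;msup&gt;&lt;mi&gt;x&lt;/mi&gt;&lt;mn&gt;2&lt;/mn&gt;&lt;/msup&gt;
&lt;/math&gt;
&lt;/XML&gt;

&lt;!-- XML content in an external file --&gt;
&lt;XML TYPE="text/xml" SRC="equation.xml"&gt;&lt;/XML&gt;
</PRE>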
<H3>
  <A NAME="script-hack">Using Script to Hide Content in Older Browsers</A>
</H3>
<P>
Interoperability with the 3.0 generation of browsers is required for successful
deployment of RDF, among other applications. This means that the XML block
is not a complete solution either.
<P>
There are a number of ways in which content can be made to not show up in
browsers that don't understand the element.
<OL>
  <LI>
    the XML could be in a separate file, linked to from the HTML document in
    some way.
  <LI>
    the XML could be in the HEAD of the HTML document
  <LI>
    the DTD for the XML fragment could be written in such a way that all content
    appears as attribute values
  <LI>
    the XML content could be put at the end of the document, which doesn't really
    hide it, but this method does get the content out of the way of the main
    document content.
</OL>
<P>
Of these, putting the content in the HEAD is the most problematic because
of the difficulties for HTML browsers of defining where the HEAD ends.
<P>
Any of these methods would be considered to not break HTML or XML, and the
participants decided that these should be written up (with the exception
of putting content in the HEAD) as the recommended methods for coping with
XML where the content should not show up in older browsers.
<P>
There are, of course, times when none of these methods are suitable for some
reason. The group therefore decided to also figure out which of the many
disliked methods was the least undesirable. The choices were
<UL>
  <LI>
    putting the XML content inside a comment
  <LI>
    putting the XML content inside a SCRIPT element with the value of the LANGUAGE
    attribute being "XML"
  <LI>
    putting the XML content inside an APPLET element
</UL>
<P>
The proposal to put the XML content inside an OBJECT element was quickly
rejected, as it would not work in Netscape Navigator 3.0.
<P>
The problem with APPLET is that if the user has applet loading turned off,
the content will show. The problem with SCRIPT is that it breaks the currently
defined content model of SCRIPT. There were also worries about whether future
XML users will use the SCRIPT element themselves, which would not be possible
if it were a reserved element. This concern wasn't shared by the entire group.
The problem with using comments is that comments are meant to not contain
parsed data, and users couldn't put another comment inside the XML content.
<P>
The vote (1 per company) was 1 for comments, 1 for APPLET, and 8 for SCRIPT.
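<P>
As a sketch of the SCRIPT approach that won the vote: the LANGUAGE value is the
one discussed above, while the <TT>metadata</TT> vocabulary inside is
hypothetical and stands in for RDF or other XML content.
<PRE>
&lt;SCRIPT LANGUAGE="XML"&gt;
&lt;?xml version="1.0"?&gt;
&lt;metadata about="http://www.w3.org/TR/NOTE-xh"&gt;
  &lt;editor&gt;Dan Connolly&lt;/editor&gt;
  &lt;editor&gt;Lauren Wood&lt;/editor&gt;
&lt;/metadata&gt;
&lt;/SCRIPT&gt;
</PRE>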
<P>
Details of the XML block and SCRIPT mechanisms are the subject of a Working
Draft in progress.
<H3>
  <A NAME="sprinkles">Using Namespaces, Stylesheets, and the DOM</A>
</H3>
<P>
The discussion of using XML markup in HTML documents such that it would be
"significant" to stylesheet and DOM implementations did not reach a clear
consensus.
<P>
We observed that XML can be modelled using the HTML 4.0 DIV, SPAN, and CLASS
markup, which are significant to stylesheet and DOM implementations. Some
experience with this style suggested the community would not embrace it,
but the discussion was not conclusive.
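<P>
The DIV, SPAN, and CLASS style mentioned above can be sketched as follows. The
class names (<TT>chapter</TT>, <TT>term</TT>) and the style rule are assumptions
for this illustration.
<PRE>
&lt;STYLE type="text/css"&gt;
  DIV.chapter SPAN.term { font-style: italic }
&lt;/STYLE&gt;

&lt;DIV CLASS="chapter"&gt;
  &lt;P&gt;A &lt;SPAN CLASS="term"&gt;sprinkle&lt;/SPAN&gt; is a significant
  non-standard element.
&lt;/DIV&gt;
</PRE>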
<P>
A proposal for a "sprinkles" mechanism is the subject of a Working Draft
in progress.
<H2>
  <A NAME="References">References</A>
</H2>
<DL class="bib">
  <DT>
    <A NAME="rdf-syntax">[RDF]</A>
  <DD>
    <A HREF="http://www.w3.org/RDF/Group/1998/02/WD-rdf-syntax-19980216"><CITE>Resource
    Description Framework (RDF) Model and Syntax</CITE></A><BR>
    W3C Working Draft 16 Feb 1998<BR>
    Ora Lassila, Ralph R. Swick, eds.
  <DT>
    <A NAME="XML">[XML]</A>
  <DD>
    <A HREF="http://www.w3.org/TR/1998/REC-xml-19980210"><CITE>Extensible Markup
    Language (XML) 1.0</CITE></A><BR>
    W3C Recommendation 10-February-1998<BR>
    Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, eds.
  <DT>
    <A NAME="HTML4">[HTML4]</A>
  <DD>
    <CITE><A HREF="http://www.w3.org/TR/REC-html40-971218">HTML 4.0
    Specification</A></CITE><BR>
    W3C Recommendation 18-Dec-1997<BR>
    Dave Raggett, Arnaud Le Hors, Ian Jacobs, eds.
  <DT>
    <A NAME="mathml">[MathML]</A>
  <DD>
    <A HREF="http://www.w3.org/TR/1998/REC-MathML-19980407"><CITE>Mathematical
    Markup Language (MathML) 1.0 Specification</CITE></A><BR>
    W3C Recommendation 07-April-1998<BR>
    Patrick Ion, Robert Miner
  <DT>
    <A NAME="Dialects">[Dialects]</A>
  <DD>
    <CITE><A HREF="http://www.w3.org/pub/WWW/TR/WD-doctypes-960302">HTML Dialects:
    Internet Media and SGML Document Types</A></CITE><BR>
    W3C Working Draft 06-Mar-96<BR>
    Daniel W. Connolly
  <DT>
    <A NAME="InetSDK">[InetSDK]</A>
  <DD>
    <A HREF="http://www.microsoft.com/msdn/sdk/inetsdk/help/"><CITE>Internet
    Client SDK</CITE></A>, December 19, 1997, Microsoft Corporation
  <DT>
    <A NAME="XML-Names">[XML-Names]</A>
  <DD>
    <CITE><A HREF="http://www.w3.org/TR/1998/WD-xml-names-19980327">Namespaces
    in XML</A></CITE>, W3C Working Draft 27-March-1998<BR>
    Tim Bray, Dave Hollander, Andrew Layman
</DL>
</BODY></HTML>