<!doctype html public "-//W3C//DTD HTML 1997-05-18//EN"
"html.dtd">
<HTML>
<HEAD>
  <TITLE>XML Hacking is Fun!</TITLE>
</HEAD>
<BODY>
<P>
<A href="../../"><IMG src="../../Icons/WWW/w3c_home" ALT="W3C"></A> |
<A HREF="../../Architecture/">Architecture</A> |
<A HREF="../../MarkUp/SGML/">XML</A>
<H1>
  XML Hacking is Fun!
</H1>
<ADDRESS>
  <A HREF="../../People/Connolly/">Dan Connolly</A><BR>
  Created: Mon May 12 16:06:27 CDT 1997<BR>
  $Id: hacking.html,v 1.5 1998/04/29 03:20:20 connolly Exp $
</ADDRESS>
<P>
For me, XML puts the fun back into web hacking. I wrote three XML parsers
last weekend. Great stress relief!
<P>
See also: <A href="../notes.html">some more notes on XML implementation
experience</A>, mostly by Bert Bos.
  <HR>
<DL>
  <DT>
    <A href="xml.py">xml.py</A>
  <DD>
    <A href="http://www.python.org">Python</A> module for XML.</></>
  <DT>
    <A href="xml-check.pl">xml-check.pl</A>
  <DD>
    Quick and dirty XML well-formedness checker in Perl. I got bored with this
    and moved on to Python after a bit.</></>
</DL>
<H2>
  Converting XML to Lout
</H2>
<DL>
  <DT>
    <A href="loutwr.py">loutwr</A>
  <DD>
    lexical details of writing lout format
  <DT>
    <A href="xml2lout.py">xml2lout</A>
  <DD>
    rules/stack-based conversion to lout
  <DT>
    <A href="html2lout.py">html2lout</A>
  <DD>
    add some rules for HTML
  <DT>
    <A href="report2lout.py">report2lout</A>
  <DD>
    add some rules for a latex/lout-like
    <A href="../../MarkUp/9705/report.dtd">report DTD</A> on top of html
</DL>
<H2>
  XML Typing Notes
</H2>
<P>
XML document types should evolve gracefully. Technically, format negotiation
is a solution to deployment of revised data formats, but it did not meet
the market constraints (i.e. it wasn't cost-effective for the involved parties)
in the case of HTML forms, tables, and foreign payloads (scripts and stylesheets).
<P>
I'm investigating ways to express the MIME multipart alternative concept
at the element level in XML. This allows new features in XML documents to
be deployed like color over the b/w TV signal. It allows the new and the
old semantics to be expressed in the same file, which cuts down the cost
of managing the data (copy, rename, verify, datestamp, inodes, ...) and caching
it.
<P>
My intuition says that we can borrow the inheritance and subtyping ideas
from <A href="../../OOP/">OOP</A> to model a form of type negotiation for
XML.
<P>
<DL>
  <DT>
    Akpotsui, Extase K. A; Quint, Vincent; Roisin, C&eacute;cile.
    <A HREF="ftp://ftp.inrialpes.fr/pub/opera/publications/MCM97.ps.gz"><CITE>Type
    Modelling for Document Transformation in Structured Editing
    Systems</CITE></A>. Mathematical and Computer Modelling 25/4 (February 1997)
    1-19 (with 26 references). Authors' affiliation: INRIA/Project Op&eacute;ra.
  <DD>
    Abstract:
    <BLOCKQUOTE>
      This paper addresses the problem of type transformation in structured editing
      systems and proposes a type description model convenient for type comparison
      and document conversion. Two kinds of transformations are considered: dynamic
      transformations allow a structured editor to change the structure of a part
      of a document when the part is copied or moved, and static transformations
      allow specific tools to restructure documents when their generic structure
      is modified. We present in this paper the current state of our research on
      formal analysis for these transformations.
    </BLOCKQUOTE>
</DL>
<P>
The paper covers cut/paste issues. It shows that DTDs are not just regexps:
&amp; and ? are novel.
<P>
It also shows that separating element names from element types is essential
for some kinds of modelling. I suspect DTDs should be extended to allow
this (well... replaced with something that expresses this). For example,
allow XPTR-style selectors rather than just namegroups in element declarations:
<PRE>
&lt;!element (parent1 child) ANY&gt;
&lt;!element (parent2 child) (x|y|z)&gt;
</PRE>
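<P>
As a rough illustration of what such context-dependent declarations would
mean for a parser, here is a minimal Python sketch. The table layout and the
name allowed_in are made up for this example, and the check is deliberately
weaker than a real content model: it only tests whether a name is mentioned
in the model, not order or occurrence counts.

```python
# Hypothetical declaration table keyed by (parent, element) context.
# None stands for ANY; a set stands for a simple alternation such as (x|y|z).
DECLARATIONS = {
    ("parent1", "child"): None,
    ("parent2", "child"): {"x", "y", "z"},
}


def allowed_in(parent, element, candidate):
    """Return True if `candidate` may appear inside `element`
    when `element` itself occurs inside `parent`."""
    key = (parent, element)
    if key not in DECLARATIONS:
        raise KeyError("no declaration for %s inside %s" % (element, parent))
    model = DECLARATIONS[key]
    # ANY accepts every element name; otherwise the name must be listed.
    return model is None or candidate in model
```

<P>
The point of the sketch is only that the element name alone no longer picks
the type: the lookup key has to include the context.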
<P>
@@don't use class, just make up new elements and use containment!
<H2>
  XML Modules
</H2>
<P>
About namespaces in DTDs... how about:
<PRE>
&lt;![ module-name [
&lt;!entity module-name "IGNORE"&gt;
... module contents ...
]]&gt;
</PRE>
<P>
which is just like:
<PRE>
#ifndef _module_h
#define _module_h
... module contents ...
#endif /* _module_h */
</PRE>
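<P>
The include-once behaviour this idiom is after can be sketched in a few
lines of Python. This is only a model of the effect, not of psgml or of any
real DTD processor; the function name include_once and the list-of-pairs
input are assumptions made for the example.

```python
def include_once(sections):
    """Yield the contents of each module the first time its name is seen.

    `sections` is an iterable of (module_name, contents) pairs, in the
    order the guarded sections appear in the DTD.
    """
    ignored = set()  # names whose guard entity has been set to "IGNORE"
    for name, contents in sections:
        if name in ignored:
            continue        # a later copy of the module is skipped
        ignored.add(name)   # the module's first act: define its own guard
        yield contents
```

<P>
For instance, feeding it modules a, b, a in that order yields the contents
of a and b once each. All names share a single set, which mirrors the
"one big namespace" limitation.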
<P>
I made a <A href="fix-sgml.el">patch to psgml mode</A> to allow me to use
this syntax.
<P>
You still have to have a partial order on your modules. And it's still just
one big namespace. So it's just like C -- which is good enough for lots of
things, but not for truly independent development.
<H2>
  Marked Sections, and Here Documents, and Archives
</H2>
<P>
Is an unescaped &gt; allowed in XML content? (9711 spec says yes.)
<P>
HTML 2.0 spec discouraged it in order to avoid ]]&gt; showing up in documents,
which is an error in SGML'86.
<P>
XML of 9711 has the same misfeature, but it's marked "for compatibility".
<P>
Marked sections can't contain ]]&gt;
<P>
What's the purpose of a marked section, anyway? If it's just to be able to
put XML inside XML without lots of tedious escaping, then the above limitation
isn't a showstopper.
<P>
But it seems to me that the purpose is to be able to include foreign data
like SCRIPT and STYLE, in which case this limitation is really painful.
<P>
Based on shell/perl HERE documents and MIME multipart syntax, I suggest the
following:
<PRE>
&lt;![myStringHere[ ... ]myStringHere]&gt;
</PRE>
<P>
which allows ... to contain ANY sequence of characters. Any sequence of bytes,
actually! This solves the script/style problem, plus gives XML the potential
to replace tar, zip, etc. in the same way that HERE documents facilitate
shar archives. (But Just Say No to Turing-complete archive formats.)
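<P>
A minimal Python sketch of a scanner for the suggested delimiters follows.
It assumes the label is a simple name and that the payload ends at the first
occurrence of the closing delimiter; the function name split_here_sections
and the tuple output are invented for the example.

```python
import re

# Opening delimiter of the proposed syntax: "<![" + label + "[".
# The label grammar (a simple name) is an assumption of this sketch.
_OPEN = re.compile(r"<!\[([A-Za-z_][A-Za-z0-9_.-]*)\[")


def split_here_sections(text):
    """Split text into ("text", s) and ("section", label, payload) items.

    The payload is taken verbatim up to the first "]" + label + "]>",
    so it may contain "]]>" or any other character sequence.
    """
    items = []
    pos = 0
    while True:
        m = _OPEN.search(text, pos)
        if m is None:
            break
        label = m.group(1)
        close = "]" + label + "]>"
        end = text.find(close, m.end())
        if end < 0:
            raise ValueError("unterminated section %r" % label)
        if m.start() > pos:
            items.append(("text", text[pos:m.start()]))
        items.append(("section", label, text[m.end():end]))
        pos = end + len(close)
    if pos < len(text):
        items.append(("text", text[pos:]))
    return items
```

<P>
As with shell HERE documents, the author of the document picks a label that
does not occur in the payload, so the scanner never has to unescape anything.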
<H2>
  Empty End Tags
</H2>
<P>
I've implemented support for:
<PRE>
&lt;foo&gt; ... &lt;/&gt;
</PRE>
<P>
The implementation cost is trivial. The deployment cost is the risk that
folks will expect legacy HTML elements to work this way:
<PRE>
&lt;blockquote&gt; ... &lt;/&gt;
</PRE>
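<P>
To show why the implementation cost is small, here is a hedged Python sketch
(not the code in xml.py) that rewrites each empty end tag to the name of the
innermost open element, using a stack. The regular expression is a
simplification that assumes well-formed input and ignores comments and
quoted attribute values containing angle brackets.

```python
import re

# A start tag, a named end tag, or the empty end tag "</>".
_TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)?([^<>]*)>")


def expand_empty_end_tags(text):
    """Replace each "</>" with an end tag naming the innermost open element."""
    stack = []

    def fix(m):
        closing, name, rest = m.group(1), m.group(2), m.group(3)
        if not closing:
            # Start tag: remember it unless it is an empty-element tag.
            if name and not rest.endswith("/"):
                stack.append(name)
            return m.group(0)
        if not stack:
            raise ValueError("end tag with no open element")
        top = stack.pop()
        if name and name != top:
            raise ValueError("expected </%s>, got </%s>" % (top, name))
        # Named end tags pass through; empty ones get the popped name.
        return m.group(0) if name else "</%s>" % top

    return _TAG.sub(fix, text)
```

<P>
The stack only works because every element is closed explicitly. Legacy HTML
elements whose end tags may be omitted would leave the wrong name on top of
the stack, which is the deployment risk.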
<H2>
  Attribute Value Syntax
</H2>
<P>
???
<H2>
  Character Entities
</H2>
<P>
Bad idea. General entities are very powerful, and all we need is a way to
escape three characters (maybe two).
<P>
Other characters should be done with "replaced elements" with fallback inside,
e.g.:
<PRE>
&lt;emdash&gt;---&lt;/&gt;
</PRE>
<P>
Going to Unicode is probably cost-effective in the long term, but the documents
don't degrade gracefully.
<H2>
  Convenience Entities: macros and includes
</H2>
<P>
These are obviated by linking. The idiom:
<PRE>
&lt;!doctype html public "-//IETF//DTD HTML//EN" [
&lt;!entity product-name "Gee Whiz&amp;tm;"&gt;
&lt;!entity legal system "legal.html"&gt;
]&gt;
... &amp;product-name;
...
&amp;legal;
</PRE>
<P>
can be done ala:
<PRE>
&lt;!doctype html system "http://www.w3.org/9705/html.dtd"&gt;
&lt;div style="display: none"&gt;
&lt;span id=product-name&gt;Gee Whiz&amp;tm;&lt;/span&gt;
&lt;/div&gt;

... &lt;a href="#product-name" xml-link=replace&gt;Gee Whiz&amp;tm;&lt;/&gt;
&lt;a href="legal.html" xml-link=replace&gt;Copyright (c) 1997 by US&lt;/a&gt;
</PRE>
<P>
The a's could be left empty. But for the benefit of downlevel clients, you
can (by machine) propagate the destination of the link (or a part of it)
to the source.
<H2>
  Parameter Entities
</H2>
<DL>
  <DT>
    .cm
  <DD>
    content model. Fully parenthesized. Can be used anywhere a gi can be used.
  <DT>
    .orList
  <DD>
    union expression. orLists can be concatenated. @#hmmm.. namegroup?
  <DT>
    .valType
  <DD>
    attribute value type, e.g. CDATA with overloaded semantics
  <DT>
    .tagType
  <DD>
    list of attribute declarations, ala a list of methods, i.e. an object type
  <DT>
    .dtd
  <DD>
    link to another entity in DTD syntax
</DL>
<H2>
  DT and DD
</H2>
<P>
I want DT/DD to be able to format ala:
<PRE>
    term    definition
      definition def d
      efiintion
</PRE>
<P>
so I changed the content models of dt and dd so that dd is contained within
dt.
<H2>
  Testing Notes
</H2>
<P>
@@link to MIX.
<PRE>
ok3: uses internal declaration subset. Boo.
	note that this is a perfect example of how
	entities are redundant with respect to linking

ok3a: @@ WF client should check for data outside root element

torture:
whacked internal declaration out
removed references to other entities

#@@ is an unescaped &gt; allowed in xml? what about ]]&gt;?
 is ]]&gt; a reportable error? well-formedness error? validity error?

This doesn't match:
&lt;p&gt;PI with markup: &lt;?Myparser &amp;lt;p&gt; or &lt;p&gt; --

which?&gt;&lt;/p&gt;
</PRE>
</BODY></HTML>