<title>Processing XML 1.1 documents with XML Schema 1.0 processors</title>
Processing XML 1.1 documents with XML Schema 1.0 processors
W3C Working Group Note 11 May 2005
This version:
   <a href=""></a>
Latest version:
<dd><a href=""></a></dd>
Henry S. Thompson, University of Edinburgh/W3C
<p>This document is also available in these non-normative formats: <a href=""></a>.</p>
<p class="copyright"><a href="">Copyright</a>&#xa0;&#xa9;&#xa0;2005&#xa0;<a href=""><acronym title="World Wide Web Consortium">W3C</acronym></a><sup>&#xae;</sup> (<a href=""><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>, <a href=""><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>, <a href="">Keio</a>), All Rights Reserved. W3C <a href="">liability</a>, <a href="">trademark</a>, and <a href="">document use</a> rules apply.</p>
<h2><a id="abstract" name="abstract"/>Abstract</h2><p>XML Schema 1.0 did not anticipate new versions of XML, and mandated
  XML 1.0 documents as the starting point for schema-validity
  assessment.  Some users and specifications would like to use XML
  Schema processors which process XML 1.1 documents, and some
  implementors of XML Schema processors would like to provide XML 1.1
  support.</p><p>This Note suggests an implementation strategy for implementors to
  adopt to enable users and specifications to get such support in a
  consistent way. All aspects of XML Schema which are liable to
  re-interpretation as a result of changes in XML 1.1 are discussed.</p><p>An implementation of schema-validity assessment employing such a
  strategy is, strictly speaking, non-conformant to the current version
  of the XML Schema specification. The XML Schema WG nonetheless
  believes that interoperability will best be served by such
  non-conformant processors being made available to users, until such
  time as a subsequent version of XML Schema addressing this issue
  normatively is approved.</p></div><div>
<h2><a id="intro" name="intro"/>1 Introduction</h2><p>As published, the XML Schema specification references XML 1.0<span> and XML Namespaces 1.0</span> explicitly,
and incorporates by reference certain key definitions, in particular those of
the <code>Char</code>, <code>Name</code><span>, QName</span> and <code>S</code> character classes.
The contents of these classes have changed in XML 1.1<span> and XML Namespaces 1.1</span>, so although nothing in
the existing XML Schema specification specifically bars the processing of
infosets produced by XML 1.1 conformant parsers, such infosets, if they exploit
any of the relevant changes in XML 1.1, will not be accepted as valid by
conformant XML Schema 1.0 processors.</p><p>The XML Schema WG has judged that any changes to the existing
specification to support XML 1.1 go beyond what could be considered as errata,
and so will have to wait for a new version of the specification.  As this may
take some time, this Note addresses the question of what should be done in the
interim to best serve the XML community.</p><p>In the sections which follow, a non-normative strategy is set out
suggesting a number of changes which processors implementing the XML Schema
specification can make to enable sensible and interoperable support for XML
1.1.  Any implementation of XML Schema employing such a strategy is, strictly
speaking, non-conformant to the current version of the XML Schema specification.
The XML Schema WG nonetheless believes that interoperability will best be
served by the availability of such non-conformant processors until such time as a subsequent
version of XML Schema addressing this issue normatively is approved. </p></div><div class="div1">
<h2><a id="d0e138" name="d0e138"/>2 Survey of XML 1.1 challenges for XML Schema 1.0</h2><p>Consider the following four cases:</p><ol class="enumar"><li><p>C1 vs. C0 in content, e.g. #x83 vs. #x03</p></li><li><p>Old vs. new name chars in element names, e.g. <code>y</code> (25th letter in English alphabet) vs.
<code>&#x133;</code> (25th letter in Dutch alphabet)</p></li><li><p>Old vs. new name chars in ID-typed content, e.g. <code>y</code> vs. <code>&#x133;</code></p></li><li><p>LF vs NEL in length-specified list-typed content</p></li></ol><p>(&#x133; == U+0133 (#x133) is common in Dutch, e.g. in the word
<em>&#x133;s</em> == English <em>ice-cream</em>.  It's a good example of something arbitrarily and
irritatingly not allowed as a name character in XML 1.0 which is
allowed as a name character in 1.1).</p><p>In each of the above cases, the first alternative is OK and has the same
behaviour with respect to Schema validation in both XML 1.0 and XML 1.1,
whereas the second alternative either
is not Schema-valid under the strict XML 1.0 interpretation (1-3) or might be
expected to have different behaviour between XML 1.0 and
XML 1.1 (4).</p><p>In other words, if you used a conformant XML Schema validator on the
following four instances (Figure 1), using the same schema document (Figure
2) each time, all four
would have validity problems.</p><div class="exampleOuter"><div class="exampleInner"><pre>&lt;?xml version='1.0'?&gt;
&lt;root&gt;There's an &amp;amp;#3; here: &amp;#3;&lt;/root&gt;</pre></div><div class="exampleInner"><pre>&lt;?xml version='1.0'?&gt;
&lt;&#x133;s/&gt;</pre></div><div class="exampleInner"><pre>&lt;?xml version='1.0'?&gt;
&lt;root id=&quot;&#x133;&quot;/&gt;</pre></div><div class="exampleInner"><pre>&lt;?xml version='1.0'?&gt;
&lt;!-- There's a NEL character (U+0085) between the 'a' and the 'b' below --&gt;
&lt;root list=&quot;a&#x85;b&quot;/&gt;</pre></div></div><div class="note"><p class="prefix"><b>Note:</b></p><div class="exampleInner"><pre>&lt;?xml version='1.0'?&gt;
&lt;xs:schema xmlns:xs=&quot;;&gt;
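 &lt;xs:element name=&quot;root&quot;&gt;
  &lt;xs:annotation&gt;
   &lt;xs:documentation&gt;String content, id attr of type ID,
                     list attr of type [list of token], length 2&lt;/xs:documentation&gt;
  &lt;/xs:annotation&gt;
  &lt;xs:complexType&gt;
   &lt;xs:simpleContent&gt;
    &lt;xs:extension base=&quot;xs:string&quot;&gt;
     &lt;xs:attribute name=&quot;id&quot; type=&quot;xs:ID&quot;/&gt;
     &lt;xs:attribute name=&quot;list&quot;&gt;
      &lt;xs:simpleType&gt;
       &lt;xs:restriction&gt;
        &lt;xs:simpleType&gt;
         &lt;xs:list itemType=&quot;xs:token&quot;/&gt;
        &lt;/xs:simpleType&gt;
        &lt;xs:length value=&quot;2&quot;/&gt;
       &lt;/xs:restriction&gt;
      &lt;/xs:simpleType&gt;
     &lt;/xs:attribute&gt;
    &lt;/xs:extension&gt;
   &lt;/xs:simpleContent&gt;
  &lt;/xs:complexType&gt;
 &lt;/xs:element&gt;
 &lt;xs:element name=&quot;&#x133;s&quot;/&gt;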
&lt;/xs:schema&gt;</pre></div><p>Figure 2: Schema for use with the XML documents in Figure 1</p></div></div><div class="div1">
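<p>Case 1 turns on the <code>Char</code> production alone. The following Python sketch is illustrative only and is not part of the strategy proposed in this Note. It transcribes the <code>Char</code> ranges from the XML 1.0 and XML 1.1 Recommendations and tests the two example code points. The names <code>XML10_CHAR</code>, <code>XML11_CHAR</code> and <code>in_class</code> are assumptions made for this example.</p>

```python
# Char productions transcribed from the XML 1.0 and XML 1.1 Recommendations.
# Each Python range excludes its upper bound, hence the +1 on each end point.
XML10_CHAR = [range(0x9, 0xB), range(0xD, 0xE), range(0x20, 0xD800),
              range(0xE000, 0xFFFE), range(0x10000, 0x110000)]
XML11_CHAR = [range(0x1, 0xD800), range(0xE000, 0xFFFE),
              range(0x10000, 0x110000)]


def in_class(code_point, ranges):
    """Return True if code_point falls in any of the given ranges."""
    return any(code_point in r for r in ranges)


# The two code points from case 1: a C1 control and a C0 control.
for cp in (0x83, 0x03):
    print(hex(cp),
          "XML 1.0 Char:", in_class(cp, XML10_CHAR),
          "XML 1.1 Char:", in_class(cp, XML11_CHAR))
```

<p>Note that #x03 is a <code>RestrictedChar</code> in XML 1.1, so even there it may appear only as a character reference.</p>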
<h2><a id="d0e193" name="d0e193"/>3 First step towards XML 1.1: the parser</h2><p>The first obvious step for anyone considering modifying an existing XML
Schema processor of any kind to allow XML 1.1 documents is replacing its front
end, presumably currently an XML 1.0 parser, i.e. a parser which converts
<em>only</em> documents with a <code>version='1.0'</code> XML declaration
(or none), and enforces XML 1.0 well-formedness, with an XML 1.1 parser, i.e.
one which enforces <em>either</em> XML 1.0 <em>or</em> XML 1.1
well-formedness, depending on the <code>version</code> stated in the XML declaration.</p><p>The resulting behaviour will be as follows:</p><table border="1"><colgroup span="1"><col span="1"/><col span="1" align="center"/><col span="1" align="center"/></colgroup><thead><tr><td/><td>XML 1.0 Declaration</td><td>XML 1.1 Declaration</td></tr></thead><tbody><tr><td>XML 1.0 Content</td><td>OK</td><td>OK</td></tr>
<tr><td>XML 1.1 Content</td><td>A, B: X1<br/>C: X2<br/>D: X3</td><td>A: OK/**<br/>D: OK<br/>B, C: **</td></tr></tbody></table><p>Note that by &quot;XML 1.0 Content&quot; is meant documents exemplifying the <em>first</em> member of each of the
four pairs of differences introduced above, and by &quot;XML 1.1 Content&quot; is meant
documents exemplifying the <em>second</em> member thereof.  The top two
cells then require no explanation -- these are just the existing XML Schema
processor, using an XML 1.1 parser front end, behaving correctly on data it
already should be processing correctly.</p><p>The bottom two cells are the interesting ones.  The bottom-left cell is
characterised by what I'll call <em>misaligned</em> XML versions.  Let's
consider the outcomes here one at a time, writing A, B, C and D for the four
cases of section 2 in the order listed.  Note that these cases cover not
only what our putative XML Schema 1.0 processor with an XML 1.1 parser would
do, but also what an unmodified 1.0/1.0 processor should do today.</p><dl><dt class="label">A, B (<em>misaligned</em> versions): X1</dt><dd><p>These cases are (correctly) rejected as ill-formed by the front-end XML parser,
because they break the 1.0 rules for CDATA content (A) and element names (B).</p></dd><dt class="label">C (<em>misaligned</em> versions): X2</dt><dd><p>This case is (correctly) rejected as schema-invalid by the XML Schema processor -- a string with an
&#x133; in it is not an NCName per XML 1.0.</p></dd><dt class="label">D (<em>misaligned</em> versions): X3</dt><dd><p>This case is (correctly) rejected as schema-invalid by the XML Schema
processor -- a 'list' with only NEL
separators is a single token when considered as XML 1.0 content.</p></dd></dl><p>Moving on to the final, lower-right, cell, this is of course where things
get interesting:</p><dl><dt class="label">A (<em>aligned</em> versions): OK/**</dt><dd><p>The behaviour of this case depends on an implementation choice. Some
  processors, which take their input only in the form of encoded
  character streams and always use an XML parser as a front end,
  depend on that front end to enforce the basic constraint that all
  <code>xs:string</code>s consist of XML 1.0 Chars. Other XML Schema processors,
  particularly those which also accept synthetic infosets as input,
  enforce that constraint explicitly. It follows that a processor of
  the first kind, simply by changing to use an XML 1.1 front-end, will
  thereby accept case A documents, but processors of the second kind
  will not, because they will still be explicitly checking instances
  of <code>xs:string</code> using its XML Schema 1.0 definition.</p></dd><dt class="label">D (<em>aligned</em> versions): OK</dt><dd><p>This case is (correctly) accepted -- a 'list' with a NEL
separator will have been normalized to have a space (#x20) separator by
the XML 1.1 front-end parser, and so the XML Schema processor will find two tokens.</p></dd><dt class="label">C (<em>aligned</em> versions): **</dt><dd><p>This case is (incorrectly) rejected as schema-invalid by the XML
Schema processor -- because the <code>ID</code> type is derived from the
<code>Name</code> type, which in turn has a <code>pattern</code> facet based on
the XML 1.0 definition for Names, which does not allow the &#x133;.</p></dd><dt class="label">B (<em>aligned</em> versions): **</dt><dd><p>This case is actually very similar to the previous one, but with
respect to a different document, that is, the <em>schema</em> document. 
<em>That</em> document is (incorrectly) rejected as schema-invalid by the XML
Schema processor -- because the relevant element name turns up as the value of
the <code>name</code> attribute on the <code>xs:element</code> element, and
that <em>attribute's</em> type in the schema for schema documents is
<code>NCName</code>, which is derived from the
<code>Name</code> type, which in turn has a <code>pattern</code> facet based on
the XML 1.0 definition for Names, which does not allow the &#x133;.</p></dd></dl></div><div class="div1">
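<p>The difference between the two outcomes for case D comes entirely from the parser's end-of-line handling. The following Python sketch models that handling, followed by attribute-value normalization and whitespace collapsing, for a list-typed attribute. It is a simplified illustration under the assumption that the raw attribute text is available as a string; the function names <code>normalize_line_ends</code> and <code>list_items</code> are hypothetical.</p>

```python
import re

NEL = "\u0085"   # NEXT LINE
LSEP = "\u2028"  # LINE SEPARATOR


def normalize_line_ends(text, version):
    """Apply the parser's end-of-line handling for the declared version."""
    if version == "1.1":
        # XML 1.1: CR LF, CR NEL, NEL, LSEP and any remaining CR become LF.
        return re.sub("\r\n|\r" + NEL + "|" + NEL + "|" + LSEP + "|\r", "\n", text)
    # XML 1.0: only CR LF and lone CR become LF; NEL is ordinary data.
    return re.sub("\r\n|\r", "\n", text)


def list_items(raw_attribute, version):
    """Return the list items as the schema processor would see them."""
    text = normalize_line_ends(raw_attribute, version)
    # Attribute-value normalization: each literal TAB, LF or CR becomes a space.
    text = re.sub("[\t\n\r]", " ", text)
    # Collapse and split on spaces only. str.split() with no argument is
    # avoided because it would also treat U+0085 as whitespace.
    return [item for item in text.split(" ") if item]


raw = "a" + NEL + "b"
print(len(list_items(raw, "1.0")))  # 1, so the length facet of 2 fails
print(len(list_items(raw, "1.1")))  # 2, so the length facet of 2 holds
```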
<h2><a id="d0e478" name="d0e478"/>4 <span>Recommended strategy</span>: Move to 1.1-compatible type definitions</h2><p>What does it mean to say the last two results are <em>incorrect</em>?
It means that type definitions which enforce XML-1.0-appropriate constraints 
are being applied to self-identified XML 1.1 data.</p><p>The simplest resolution is to change the XML Schema processor
itself so that the
relevant built-in type definitions enforce the XML 1.1 constraints.  This
will make all the entries in the lower-right quadrant 'OK'.</p></div><div class="div1">
<h2><a id="d0e491" name="d0e491"/>5 The details</h2><p>The XML Schema 1.0 type definitions which depend directly
on XML 1.0 productions (that is, xsd:Name, which depends on XML 1.0
Name; xsd:NMTOKEN, which depends on XML 1.0 Nmtoken; xsd:QName, which depends on XML 1.0 Letter, Digit, CombiningChar and Extender via XML Namespaces QName; and xsd:string, which depends on XML 1.0 Char), as well as those type definitions which inherit from them (that is, xsd:NCName, xsd:ID, xsd:IDREF, xsd:IDREFS, xsd:ENTITY, xsd:ENTITIES, xsd:NMTOKENS, xsd:normalizedString, xsd:token and xsd:language), must use the
corresponding XML 1.1 productions.</p><p>This change will fix the <code>B</code> and <code>C</code> results by using the XML 1.1
definition of Name.  For processors which don't depend on their XML front-end
parser to check CDATA, it will also fix the incorrect result they get for the
<code>A</code> example by using the XML 1.1 definition of Char.</p></div><div class="div1">
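<p>As an illustration of what switching to the XML 1.1 productions involves, the following Python sketch expresses the XML 1.1 <code>NameStartChar</code> and <code>NameChar</code> classes as regular expressions and derives <code>Name</code>, <code>Nmtoken</code> and colon-free <code>NCName</code> checks from them. It is a sketch only: a real processor would apply the equivalent change to whatever representation it uses for the built-in types, and the identifiers <code>NAME_11</code>, <code>NMTOKEN_11</code>, <code>is_name</code> and <code>is_ncname</code> are assumptions made for this example.</p>

```python
import re

# NameStartChar and NameChar from XML 1.1, as regex character-class bodies.
NAME_START = (
    ":A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
NAME_CHAR = NAME_START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"

NAME_11 = re.compile("[" + NAME_START + "][" + NAME_CHAR + "]*")
NMTOKEN_11 = re.compile("[" + NAME_CHAR + "]+")


def is_name(value):
    """True if value matches the XML 1.1 Name production."""
    return NAME_11.fullmatch(value) is not None


def is_ncname(value):
    """True if value is an XML 1.1 Name that contains no colon."""
    return is_name(value) and ":" not in value


# U+0133, the character used in examples B and C, is a 1.1 name character.
print(is_ncname("\u0133"), is_name("\u0133s"), is_name("1abc"))
```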
<h2><a id="d0e507" name="d0e507"/>6 Backward incompatibilities</h2><p>The approach selected here isn't perfect.  The unconditional switch to
1.1-appropriate type definitions means that version 1.0 XML documents with
1.1-only Name characters in e.g. ID-typed attributes will be valid, where an
unmodified Schema 1.0 processor would find them invalid.</p><p>The immediate negative consequences of this are presumably small, since
anyone already schema-validating their XML 1.0 documents will presumably have
<em>corrected</em> any examples of this.  But as and when processors
implementing this Note are widespread, it may be that documents with such
attribute type definitions and values will be
created, identified as version 1.0 and validated by modified processors, only
to be (correctly) rejected by unmodified processors.  We judge the risk of this
having serious negative consequences to be small enough to be discounted, but it
is of course open to implementors to detect this case and issue a warning.</p><p>The other weakness is with respect to cases where no front-end XML
  parser is involved, that is where schema validity assessment is
  carried out on what are sometimes called &quot;synthetic infosets&quot;.</p><p>Since, under this proposal, enforcement of XML 1.0 conformance for
  element names and character content is the responsibility of the
  front-end parser, it follows that a synthetic infoset containing,
  for example, an element with an XML-1.1-only name will never
  be treated as a problem solely because of that name, even if its document
  information item has a <strong>[version]</strong> property with value <code>1.0</code>.</p><p>Again, we judge the likelihood of this causing a problem to be
  vanishingly small, particularly as any attempt to <em>serialize</em> such a
  synthetic infoset should raise an error.</p></div><div class="div1">
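<p>The optional warning mentioned in this section, for version 1.0 documents whose ID-typed values are valid only under the XML 1.1 definitions, might be sketched in Python as follows. This is one possible shape and not a prescribed design: <code>check_id_value</code> is a hypothetical function, and the two predicates it takes stand for implementations of the XML 1.0 and XML 1.1 name productions that the caller is assumed to supply. The stand-in predicates in the usage lines are deliberately crude and exist only to make the example run.</p>

```python
import warnings


def check_id_value(value, declared_version, is_name_10, is_name_11):
    """Validate an ID-typed value the way a modified processor might.

    is_name_10 and is_name_11 are caller-supplied predicates for the
    XML 1.0 and XML 1.1 name productions (hypothetical helpers).
    """
    if not is_name_11(value):
        return False  # invalid even under the 1.1 definition
    if declared_version == "1.0" and not is_name_10(value):
        warnings.warn(
            "value %r is accepted here, but an unmodified XML Schema 1.0 "
            "processor would reject it" % value
        )
    return True


# Stand-in predicates for demonstration only; real ones would implement
# the productions from the XML 1.0 and XML 1.1 Recommendations.
def ascii_only(value):
    return value.isascii() and value.isalpha()


def any_alpha(value):
    return value.isalpha()


print(check_id_value("\u0133", "1.0", ascii_only, any_alpha))  # warns
print(check_id_value("\u0133", "1.1", ascii_only, any_alpha))  # no warning
```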
<h2><a id="d0e532" name="d0e532"/>7 Summary of Recommendations for Interoperability</h2><p>To produce an XML-1.1-friendly version of an XML Schema 1.0 processor:</p><ol class="enumar"><li><p><em>Replace</em> <span>its</span> XML 1.0 front-end parser with an XML 1.1
front-end parser;</p></li><li><p><em>Change</em> <span>its</span> implementations of the XML Schema types <code>Name</code>,
<code>NMTOKEN</code>, <code>QName</code> and <code>string</code> to use the relevant XML (Namespaces) 1.1 productions.</p></li></ol></div></div></body></html>