<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html lang="EN">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <title>Mappings and identity in URIs and IRIs</title>
  <style type="text/css">
code           { font-family: monospace; }

div.constraint,
div.issue,
div.note,
div.notice     { margin-left: 2em; }

li p           { margin-top: 0.3em;
                 margin-bottom: 0.3em; }

div.exampleInner pre { margin-left: 1em;
                       margin-top: 0em; margin-bottom: 0em}
div.exampleOuter {border: 4px double gray;
                  margin: 0em; padding: 0em}
div.exampleInner { background-color: #d5dee3;
                   border-top-width: 4px;
                   border-top-style: double;
                   border-top-color: #d3d3d3;
                   border-bottom-width: 4px;
                   border-bottom-style: double;
                   border-bottom-color: #d3d3d3;
                   padding: 4px; margin: 0em }
div.exampleWrapper { margin: 4px }
div.exampleHeader { font-weight: bold;
                    margin: 4px}</style>
  <link type="text/css" rel="stylesheet"
  href="http://www.w3.org/StyleSheets/TR/base.css">
</head>

<body>

<div class="head">
<p><a href="http://www.w3.org/"><img width="72" height="48" alt="W3C"
src="http://www.w3.org/Icons/w3c_home"></a></p>

<h1><a id="title" name="title"></a>Mappings and identity in URIs and IRIs</h1>

<p>Preface: This document was originally written in 2003, before the IRI spec
was an RFC. Some of this has since been addressed in the RFC.</p>

<p></p>

<div class="div1">

<div class="div2">
<p>Summary: There is a discrepancy between the namespaces and URI specs about
which identifiers are equivalent. The only reason this has not caused a problem
is that the test case (two equivalent but not equal Unicode character
sequences being used) has not occurred in practice. Using IRIs maliciously
could, however, deliberately introduce a bug which could cause a security
problem.</p>

<p>This note uses relationship notation (why not use N3?) to discuss the
inconsistencies between some current thinking about IRIs, URIs, and, for
example, namespace names.</p>

<h2><a name="Requiremen">Requirements</a>:</h2>

<p></p>

<p>1. URI identity is shared by all parties. Within a given context (<a
href="#context">*</a>), there is a single (inverse functional) relationship
uri(x, a) between an ASCII string a and the thing x which that string
identifies when taken as a URI.</p>
</div>

<div class="div2">
<p>2. For users of any specification which mentions URIs: when one can prove
that two URIs are equivalent by reading [scheme-independent] specs, then one
can use one in place of the other. That is, when URI (or IRI) strings are
deemed "equivalent" then they must refer to the same object.</p>
</div>

<p>3. We should be able to use the same software to parse and compare URIs
wherever they are used, e.g. in namespace names or in hypertext links.</p>
</div>

<h2>What do we get from the  specs?</h2>

<p>Let us formalize the concepts in the documents we are talking about.</p>

<h3>The URI spec</h3>

<p>uri(x,a) =&gt; A(a)</p>

<p>where A(a) means that a is a sequence of ASCII characters (grounded in
ANSI X3.4-1986).</p>

<p>The ANSI spec gives a 1:1 mapping ascii(a, s) from the set A of ASCII
characters to the set S of septets (integers between 0 and 127 inclusive).</p>

<p>Let sames(s1, s2) be the "strcmp" relation between two strings which are
septet for septet identical.</p>

<p>Consider the equivalence relation ea(a1, a2) which we use here to indicate
that two URIs identify the same thing. It is symmetric and transitive, and
has the properties</p>

<p>ea(a1, a2)   &amp;  uri(t1, a1)   =&gt;   uri(t1, a2)</p>

<p>for all a1, a2   (for some t uri(t, a1)  &amp;  uri(t, a2))   &lt;=&gt;
ea(a1, a2)</p>

<p>A(a) &lt;=&gt;  ea(a,a)</p>

<p>Now in fact we are going to deal with the ASCII encoded septets, for which
a similar equivalence holds:</p>

<p>es(s1, s2) &lt;=&gt; Exists a1, a2 such that ascii(a1, s1) &amp;
ascii(a2, s2) &amp; ea(a1, a2)</p>

<p>The URI spec mentions two uses of hexadecimal encoding. Hex encoding
relates octet strings to septet strings. When the URI spec was written, the
significance of the octets greater than 127 was not defined.</p>

<p>It implies that if you see %HH in a URI you should consider it as an
encoding of an octet. There is (at the level of this spec) the notion that
the URI is an encoding of a string of octets. Those from 0-127 are
considered as representing ASCII characters. There is no assumption about
what the others represent. The IRI spec will later take advantage of
this.</p>

<p>hexify(s1, s2) is true if the only difference, if any, between s1 and s2
is that one or more characters in s1 are replaced in s2 by their %HH or %hh
encoding, and ascii(s2).</p>

<p>ascii(s)   =&gt; hexify(s, s)</p>

<p>hexify(s, s)</p>

<p>There are another 128 characters in this notional "extended" set, each of
which has a hex encoding.</p>
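<p>A minimal sketch of the hexify relation in Python. Several things here are
assumptions for illustration and are not in the text above: the name
<code>unhexify</code>, the <code>must_escape</code> and <code>upper</code>
parameters, and the choice to always escape the "%" octet so that decoding is
unambiguous. The function returns one particular s2 for a given octet string;
many other strings satisfy the relation.</p>

```python
def hexify(octets: bytes, must_escape: frozenset = frozenset(), upper: bool = True) -> str:
    """Return one ASCII string s such that hexify(octets, s) holds.

    Octets above 127 and the "%" octet (0x25) are always escaped; octets
    listed in must_escape are escaped too. Everything else is copied as-is.
    """
    fmt = "%{:02X}" if upper else "%{:02x}"
    out = []
    for b in octets:
        if b > 127 or b == 0x25 or b in must_escape:
            out.append(fmt.format(b))
        else:
            out.append(chr(b))
    return "".join(out)


def unhexify(s: str) -> bytes:
    """Replace every %HH or %hh escape in s by the octet it encodes.

    Assumes s is ASCII and that every "%" starts a well-formed escape.
    """
    out = bytearray()
    i = 0
    while i < len(s):
        if s[i] == "%":
            out.append(int(s[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(s[i]))
            i += 1
    return bytes(out)


t = b"ros\xc3\xa9"
print(hexify(t))               # ros%C3%A9
print(hexify(t, upper=False))  # ros%c3%a9
print(unhexify("ros%C3%a9") == t)
```

<p>The two printed strings differ character for character, yet both decode to
the same octet string, which is the sense in which hexification is meant to
preserve identity.</p>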
</div>

<div class="head">
<p>(DanC: hexify(s+ c, s+hexify(c))</p>

<p>hexify('A') = '%41'</p>

<p>corollary: hexify(s1, s2) =&gt; ascii(s2))</p>

<p>I take hexify to be a subrelation of equality. That is, the URI spec
authorizes one to use s2 where you would have used s1. In some cases, such as
a 7-bit transport like HTTP, you have to. It is important that hexification
preserves the identity of the resource.</p>

<p>hexify(t, s1) &amp;  hexify(t, s2) =&gt; es(s1, s2)</p>

<p>{ for some s, hexify(t1, s) &amp;  hexify(t2, s) } &lt;=&gt; et(t1, t2)</p>
</div>

<p>Note that equivalence is preserved by the  interchange of  "%20"  with "
", but not by interchange of  "%2F" with "/".</p>

<p></p>
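<p>The difference between "%20" and "%2F" can be sketched in Python. The
helper name <code>segments</code> is hypothetical; it models a consumer that
splits a path on literal "/" first and only then decodes each segment, which
is an assumption about how the path is interpreted rather than something the
text above prescribes.</p>

```python
from urllib.parse import unquote


def segments(path: str) -> list:
    """Split a URI path on literal "/" and then decode each segment."""
    return [unquote(seg) for seg in path.split("/")]


# Swapping "%20" and " " leaves the decoded segments unchanged.
print(segments("/a%20b") == segments("/a b"))      # True

# Swapping "%2F" and "/" changes the number of segments.
print(segments("/a%2Fb/c"))                        # ['', 'a/b', 'c']
print(segments("/a/b/c"))                          # ['', 'a', 'b', 'c']
```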

<p>URI encoding maps octets into URIs</p>

<p>@@ relative</p>

<p>rel(s, b, r) is a many-to-many relation between ASCII strings: r is a
relative URI reference for s relative to b. An implication of the spec is</p>

<p>rel(s1, b, r) . rel(s2, b, r)  =&gt; e(s1, s2)</p>

<p>abs(s)   &lt;=&gt;   forAll b:    rel(s, b, s)</p>
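<p>A rough Python model of rel and abs, assuming that reference resolution as
implemented by <code>urllib.parse.urljoin</code> stands in for the spec. The
name <code>is_abs</code> and the sample base URIs are illustrative, and
checking a finite list of bases only approximates "for all b".</p>

```python
from urllib.parse import urljoin


def rel(s: str, b: str, r: str) -> bool:
    """True if resolving reference r against base b yields s."""
    return urljoin(b, r) == s


def is_abs(s: str, bases: list) -> bool:
    """Approximate abs(s): s resolves to itself against every sample base."""
    return all(rel(s, b, s) for b in bases)


base = "http://example.org/a/b"
target = "http://example.org/a/c"

# Many references for one target: the relation is many-to-many.
print(rel(target, base, "c"), rel(target, base, "./c"), rel(target, base, "../a/c"))

sample_bases = [base, "ftp://other.example/x/y"]
print(is_abs(target, sample_bases))   # True
print(is_abs("c", sample_bases))      # False
```

<p>Because resolution is a function of b and r, two strings s1 and s2 that
both satisfy rel for the same b and r are identical in this model, which is
stronger than the equivalence the spec implies.</p>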

<div class="body">
<h3>The Unicode spec</h3>

<p>UTF-8 <a href="#Unicode32">[Unicode 3.2]</a> gives us a relation utf8(i,
s)</p>

<p>Note by the way that</p>

<p>ascii(s) =&gt; utf8(s,s)</p>

<p>utf8(i, s) is true if i is a string of Unicode characters, and s is an
extended ASCII string of octets, and the relationship is as specified in the
UTF-8 specification.</p>

<p>sameu(i1, i2)</p>

<p>is true whenever the two Unicode strings convey exactly the same series of
glyphs and/or control characters. There are strings which are not
identical character for character but for which sameu nevertheless holds.</p>
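<p>One way to make sameu concrete is to model it as equality after Unicode
normalization. That modelling, and the choice of NFC as the default form, are
assumptions for this sketch; the text above only says that the strings convey
the same glyphs.</p>

```python
import unicodedata


def sameu(i1: str, i2: str, form: str = "NFC") -> bool:
    """Model sameu as equality after Unicode normalization (an assumption)."""
    return unicodedata.normalize(form, i1) == unicodedata.normalize(form, i2)


composed = "ros\u00e9"      # e-acute as the single code point U+00E9
decomposed = "rose\u0301"   # "e" followed by combining acute accent U+0301

print(composed == decomposed)        # False
print(sameu(composed, decomposed))   # True
print(composed.encode("utf-8"))      # b'ros\xc3\xa9'
print(decomposed.encode("utf-8"))    # b'rose\xcc\x81'
```

<p>The two strings display alike but have different UTF-8 encodings, so any
comparison made on the encoded octets will treat them as different.</p>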

<h3>The IRI spec</h3>

<p>This says that (basically, with some work on corner cases etc.) there
should be a convention that any 8-bit string which is not ASCII and which can
be interpreted as a UTF-8 encoding should be interpreted as a UTF-8
encoding.</p>
</div>

<p>What does that mean? I take it to mean that you can encode it and
decode it.</p>

<div class="body">
<p>There is a canonicalization function which the IRI spec uses, defined in
@@, which allows a particular</p>

<p>ucan(i,i)</p>

<p>Axioms are that it is a function:</p>

<p>ucan(i, j1) .  ucan(i, j2)  =&gt;  strcmp(j1, j2)</p>

<p>for all i: can(i,i)</p>

<p>e(s1, s2).</p>

<p>There is a function (not 1:1) which we define as</p>

<p>iri_uri(i, s)  &lt;=&gt;  for some j, t:   ucan(i, j).  utf8(j, t).
hexify(t, s)</p>

<p>IRIs are defined as the domain of that function, where the range is URIs.
An IRI is any Unicode string which when canonicalized and UTF-8 encoded and
hexified is a URI.</p>

<p>There is a URI equivalent to every IRI. There is NOT an IRI for every
8-bit string t. There is at least one IRI for every URI: itself.</p>
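<p>The composition ucan, then utf8, then hexify can be sketched in Python.
The canonicalization function is left unspecified (@@) in the text, so using
NFC for it here is purely an assumption, as is the function name
<code>iri_to_uri</code>. Only octets above 127 are escaped, and upper-case
hex digits are an arbitrary choice.</p>

```python
import unicodedata


def iri_to_uri(iri: str, form: str = "NFC") -> str:
    """Map an IRI to a URI: canonicalize, UTF-8 encode, then hexify."""
    j = unicodedata.normalize(form, iri)   # ucan(i, j), assumed to be NFC
    t = j.encode("utf-8")                  # utf8(j, t)
    # hexify(t, s): escape every octet that is not ASCII
    return "".join(chr(b) if b < 0x80 else "%{:02X}".format(b) for b in t)


print(iri_to_uri("http://www.example.org/ros\u00e9"))    # http://www.example.org/ros%C3%A9
print(iri_to_uri("http://www.example.org/rose\u0301"))   # http://www.example.org/ros%C3%A9
print(iri_to_uri("http://www.example.org/rose"))         # http://www.example.org/rose
```

<p>Under this sketch two IRIs that differ only in composed versus decomposed
characters map to the same URI, and a plain ASCII URI maps to itself.</p>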

<p>For requirement 2, equivalent IRIs must identify the same thing:</p>

<p>iri_uri(i1, s1).  iri_uri(i2, s2).  sameu(i1, i2)  =&gt;  e(s1, s2)</p>

<p></p>

<h3>The namespace spec</h3>

<p></p>

<div class="div2">
<p>The namespaces specification 2.3 talks about identifiers being different.
Specifically, "http://www.example.org/ros%c3%a9" and
"http://www.example.org/ros%C3%a9" are different.  Let's call these constant
strings D1 and D2 for short.</p>

<p>ne(D1, D2)</p>

<p>Now "difference" is something which allows them for example to occur as
different attributes in an XML element. It seems to me that this ne is
the negation of e. It is the common understanding of differentness such that
two things can't be both different and the same. To make it otherwise would
be very confusing and would prevent (3).</p>

<p>ne(s1, s2) =&gt; ~e(s1, s2)</p>
</div>

<p>Ouch.  We have one spec saying that these are different, and another
saying that they are the same.</p>
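<p>The clash can be reproduced in a few lines of Python. Here <code>ne</code>
is the character-for-character comparison of the namespaces spec, while
<code>e</code> is modelled, as an assumption, by comparing the strings after
upper-casing the hex digits of every %HH escape. The helper name
<code>upper_escapes</code> is illustrative.</p>

```python
import re

D1 = "http://www.example.org/ros%c3%a9"
D2 = "http://www.example.org/ros%C3%a9"


def upper_escapes(s: str) -> str:
    """Upper-case the hex digits of every %HH escape, leaving the rest alone."""
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), s)


def ne(s1: str, s2: str) -> bool:
    """Namespace-spec difference: the strings differ character for character."""
    return s1 != s2


def e(s1: str, s2: str) -> bool:
    """Assumed URI-level equivalence: equal once escape case is ignored."""
    return upper_escapes(s1) == upper_escapes(s2)


print(ne(D1, D2))   # True
print(e(D1, D2))    # True, so ne(s1, s2) => ~e(s1, s2) cannot hold
```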

<p>That isn't logically compatible.   The whole layering of the different
forms of equality described in Tim Bray's draft finding is of the form</p>

<p>e_uri(s,t) =&gt; e(s,t)</p>

<p>e_http(s,t) =&gt; e_uri(s,t)</p>
</div>

<p>and so on.  None of the specs until namespaces say "these are
different".</p>

<p>So if you accept the requirements above, and you accept any of the
equivalences, we have to throw out that part of XML namespaces.</p>

<h2>Choices</h2>

<p>In general there are three ways of operating:</p>

<p>1. Ignore the equivalences, like the namespace spec. This causes a bug if
anyone uses two identifiers which are different strings but equivalent. The
only practical way of operating like this is to make any non-canonical IRIs
or URIs illegal. This means IRIs cannot be used except in their trivial URI
form.</p>

<p>2. Transmit in any form, receiver makes right. The receiver must make
equivalence-aware comparisons or must canonicalize before internal use (which
has the same effect).</p>

<p>3. Make IRIs be just Unicode strings. Scratch the axiom that hexifying
leaves a valid and equivalent IRI. Allow the hexified forms to be used to
identify quite different things, in IRIs. Allow IRIs to be converted into
URIs, but do NOT allow any place where URIs and IRIs can be used
interchangeably. This works toward a DanC-proposed world of Unicode character
string comparison. It does not allow a smooth transition for existing
browsers etc. which mix URIs and IRIs.</p>

<h2>Reality factors</h2>

<p>There are NOT very many actual uses of  D1 and D2, because there aren't
really any motivations for making them.</p>

<p>- This is why we haven't had a big problem recently.</p>

<p>There ARE motivations for using (non-URI) IRIs. People are in fact using
them, though maybe not for namespaces yet.</p>

<p>- This is why endorsing IRIs forces us to fix this.</p>

<p>There ARE lots of applications which canonicalize URIs in various ways.</p>

<p>There IS software which compares namespaces character-for-character.</p>

<p>There are NOT many if any uses of different IRIs or different URIs for the
same namespace.</p>

<h2>Conclusion</h2>

<p>We should continue the recommendation <strong>not</strong> to use URIs or
IRIs which are equivalent but arbitrarily different strings. The easiest way
of ensuring this is to use a canonical form. We can therefore deprecate the
transmission or use of non-canonical forms.</p>

<p>We should switch as soon as possible to canonicalizing IRIs in all
applications before comparison (or using equivalence-aware comparisons). The
Namespaces spec should change to say when things are the same. The
constraint in XML that attributes cannot occur twice should be made more
complicated. It should say that you can't have two occurrences
which are the same attribute name, or two attributes which are equivalent in
any way, leaving, I regret, some fuzziness. For example, you can't use the
xhtml1.0 and xml1.1 namespaces in the same document to put two src attributes
on an image! They are not even the same namespace, but clearly they are
equivalent at the application level. It should be clear that the fact that
strings are different is not a guarantee that the namespaces are different.
The parser just isn't expected to spot this. But I think the parser ought to
be allowed to consistently canonicalize. That makes life much easier for
the application. DanC wanted to be able to do strcmp, and he can if the
parser canonicalizes.</p>
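<p>As a sketch of what "the parser canonicalizes" could look like, the Python
below reduces every namespace name to one form on input, after which a plain
string comparison is enough to spot a repeated attribute. The canonical form
used (NFC, UTF-8, %HH for non-ASCII octets, upper-case hex) is an assumption,
and the names <code>canonical</code> and <code>duplicate_attributes</code>
are hypothetical. It does not attempt the fuzzier application-level
equivalences.</p>

```python
import re
import unicodedata


def canonical(name: str) -> str:
    """Hypothetical canonical form of a namespace name."""
    octets = unicodedata.normalize("NFC", name).encode("utf-8")
    s = "".join(chr(b) if b < 0x80 else "%{:02X}".format(b) for b in octets)
    # Upper-case the hex digits of escapes that were already present.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), s)


def duplicate_attributes(attrs):
    """Return (namespace, local name) keys that occur more than once.

    attrs is an iterable of (namespace_name, local_name) pairs; keys are
    compared with ordinary string equality after canonicalization.
    """
    seen = set()
    dups = []
    for ns, local in attrs:
        key = (canonical(ns), local)
        if key in seen:
            dups.append(key)
        seen.add(key)
    return dups


attrs = [
    ("http://www.example.org/ros%c3%a9", "src"),
    ("http://www.example.org/ros%C3%a9", "src"),
    ("http://www.example.org/ros\u00e9", "alt"),
]
print(duplicate_attributes(attrs))
```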

<p>We should then in a few years be able to relax the constraint on not
transmitting multiple different forms.</p>

<p>We need a very good IRI canonicalization test suite.</p>

<p>We should formalize with names the various functions above, and make sure
there are good working coded implementations of them in the major languages. A
standard API will help. URI working group stuff.</p>

<p>timbl</p>

<p>2003/04</p>
<hr>

<h2 id="References">References</h2>
<dl>
  <dt>IRI</dt>
    <dd>foobar<cite></cite></dd>
</dl>

<h3>Footnotes:</h3>

<p><a name="context">context</a></p>

<p>The foundational architecture of the web is that there is a global context
common to all publicly published documents, in which each URI is agreed by
everyone to identify the same thing. In practice of course, things break and
people are confused and misled. Those making formal systems often restrict
the scope of data to that in which this ideal approximation can be taken to
hold in practice as well as in theory.</p>

<p>The fact that the use of URIs varies with time (sad but true) (we are NOT
talking about living documents or concepts whose representations change,
here, but really reuse of the same URI for a totally different concept) means
that to model things over a relatively long time one might want to model the
time varying nature:</p>

<p>u(x, s, t)</p>

<p>This time modelling can be done and has been done in many ways, but is not
addressed here.</p>

<p></p>
<hr>
</body>
</html>