<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
    "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
    <title>
      Universal Resource Identifiers -- Axioms of Web architecture
    </title>
    <link href="di.css" rel="stylesheet" type="text/css" />
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii" />
  </head>
  <body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
    <address>
      Tim Berners-Lee
      <p>
        Date: January 1998
      </p>
      <p>
        Status: personal view. Editing status: Spellchecked.
      </p>
    </address>
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
    <h3>
      Axioms of Web Architecture: 0
    </h3>
    <ul>
      <li>
        <a href="Model.html#Model">The Web model</a>
      </li>
      <li>
        <a href="Model.html#Resource">Resources</a>
      </li>
      <li>
        <a href="Model.html#Fragement">Fragment IDs</a>
      </li>
      <li>
        <a href="Model.html">Document sets and relative
        addressing</a>
      </li>
      <li>...
      </li>
    </ul>
    <hr />
    <h1>
      <a name="Model" id="Model">The Web Model</a>
    </h1>
    <p>
      The web is a very general concept -- one universal space of
      information. The concepts it requires such as identifiers and
      information resources (documents) are as general and abstract
      as possible. However, there have been some design decisions
      made which define some interfaces, and effectively define
      modules or agents which are independent. These agents are
      independent in many ways:
    </p>
    <ul>
      <li>There is knowledge they have individually but do not
      share
      </li>
      <li>There is knowledge their designers had individually but
      did not share
      </li>
    </ul>
    <p>
      This is basic modularity. The interfaces are defined by the
      data formats and protocols; the important features to
      understand about the design are the ones I have ranted about
      in the linked articles in this series. This modularity, the
      ability of different parts of the system to change
      independently, shows up when different specs are independent,
      such that you could change one without having to change the
      other.
    </p>
    <h2>
      <a name="Resource" id="Resource">The Information Resource</a>
    </h2>
    <p>
      (Formerly, <a href="#Resource1">Resource</a>)
    </p>
    <p>
      This is the current term for a certain unit of information in
      the Web. In many cases on the current Web, thinking
      "document" will do. It is something which conveys
      information. The Web model is that information in the
      information space is in the abstract chunked into addressable
      things known as resources.
    </p>
    <p>
      In the technical architecture, resources have identifiers,
      Universal Resource Identifiers, and the properties of these
      identifiers are elaborated later. In fact the concept of a
      unit of information is central, not only in the technical
      architecture, but in society's concepts of information, as a
      document is not only the unit for reference, retrieval and
      presentation (typically), but also the unit of ownership,
      license to use, payment, confidentiality, endorsement, etc.
      So though technically we can derive such things as compound
      documents, generic documents, and resources which look
      nothing like the typical notion of a "document", we have to
      be able to support these social aspects of information at the
      same time, so we can't mess with it too much.
    </p>
    <h2>
      <a name="Fragement" id="Fragement">Fragment Id and "#"</a>
    </h2>
    <p>
      In the hypertext architecture, when making a reference, such
      as a hypertext link, we don't just refer to an information
      resource. Well, we can, but we can also refer to a particular
      part of or view of a resource. The string which, within the
      document, defines the other end of the link has two parts. It
      has the identifier of the document as a whole, and then
      optionally it has a hash sign "#" and a string representing
      the view of the object required. This suffix is called a
      fragment identifier, even though it doesn't necessarily
      represent a fragment of the document: it could represent how
      the document should be viewed. The fragment identifier only
      has relevance in the context of the web page in question.
      This has an implication for how the software is built. For
      example, an "access" module can be given just the bit of the
      URI without the fragment identifier. It gets the
      information, and creates a software object for the hypertext
      page. That object is passed the fragment identifier.
    </p>
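    <p>
      The division of labour described above can be sketched in a
      few lines of Python. This is only an illustration, not how any
      particular browser is built: the <code>Page</code> class, the
      <code>access</code> and <code>follow</code> functions, the
      anchor table and the example.org URI are all assumptions made
      up for the sketch. Only <code>urldefrag</code> is a real
      standard-library call.
    </p>
<pre>
```python
from urllib.parse import urldefrag


class Page:
    """Hypothetical presentation object: it alone interprets fragment IDs."""

    def __init__(self, anchors):
        # Maps each fragment ID to some position or view within the page.
        self.anchors = anchors

    def show(self, fragment_id):
        # An empty or unknown fragment ID falls back to the whole page.
        return self.anchors.get(fragment_id, "top of page")


def access(resource_uri):
    """Hypothetical access module; a real one would fetch the data."""
    if "#" in resource_uri:
        raise ValueError("the access module is never given a fragment ID")
    return Page({"Model": "section 1", "Resource": "section 2"})


def follow(reference):
    # Split the reference at the hash sign.
    resource_uri, fragment_id = urldefrag(reference)
    page = access(resource_uri)    # access sees only the part before "#"
    return page.show(fragment_id)  # the object sees only the part after "#"
```
</pre>
    <p>
      Under these assumptions, following
      <code>http://example.org/Model.html#Resource</code> hands
      <code>http://example.org/Model.html</code> to the access code
      and <code>Resource</code> to the page object.
    </p>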
    <p>
      <img src="ParseHash.png" width="100%" alt=
      "The URI is split off at the hash into a fragment ID and the rest"
      border="0" />
    </p>
    <p>
      In fact, analyzing the system a little more, the access
      function can be broken down further: the underlying access
      obtains two things, a data stream and a MIME type, and passes
      them to some kind of object creator ("factory"), which
      creates the object.
    </p>
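    <p>
      A factory of this kind might look roughly like the following
      Python sketch. The class names, the <code>FACTORY</code> table
      and the <code>create_object</code> function are hypothetical,
      and the data stream is faked with an in-memory buffer rather
      than fetched from anywhere.
    </p>
<pre>
```python
import io


class HtmlDocument:
    """Hypothetical object for text/html data."""

    def __init__(self, stream):
        self.text = stream.read().decode("ascii")


class PlainDocument:
    """Hypothetical object for text/plain data."""

    def __init__(self, stream):
        self.text = stream.read().decode("ascii")


# The factory is the only place that maps MIME types to object classes.
FACTORY = {
    "text/html": HtmlDocument,
    "text/plain": PlainDocument,
}


def create_object(stream, mime_type):
    """Build a software object from a data stream and its MIME type."""
    try:
        object_class = FACTORY[mime_type]
    except KeyError:
        raise ValueError("no object class registered for " + mime_type)
    return object_class(stream)


# The underlying access would supply these two things; here they are faked.
document = create_object(io.BytesIO(b"Hello, Web"), "text/plain")
```
</pre>
    <p>
      Note that nothing in the factory mentions a URI or an access
      protocol: it sees only the stream and the type.
    </p>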
    <h3>
      Generally
    </h3>
    <p>
      Hypertext is a specific application, but this principle works
      for other applications on the Web. In fact, when we discuss
      <a href="Webize">webizing</a> an application, we take some
      computer language, and we take what were document-global
      things, say global variables in a programming language, and
      make them truly global by appending the URI of the document
      and "#".
    </p>
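    <p>
      As a toy illustration of that step, the following Python
      sketch turns document-local names into global identifiers. The
      <code>webize</code> function, the document URI and the names
      are invented for the example.
    </p>
<pre>
```python
def webize(document_uri, local_names):
    """Turn document-global names into truly global identifiers."""
    return {name: document_uri + "#" + name for name in local_names}


# Hypothetical document URI and variable names, for illustration only.
global_names = webize("http://example.org/units", ["metre", "second"])
```
</pre>
    <p>
      Here <code>metre</code> becomes
      <code>http://example.org/units#metre</code>, which can be
      referred to from any other document.
    </p>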
    <p>
      Clearly, in different applications the fragment identifier
      will have a completely different function. The independence
      here means that new applications (such as the Semantic Web)
      can be built, just like the hypertext web, just by
      introducing new types of document.
    </p>
    <h2>
      Independence
    </h2>
    <p>
      The model of how the web works is that there are two separate
      functions. &nbsp;The part (blue in the picture) which
      accesses the document deals with its identifier, but does not
      know what view will be required. &nbsp;It creates some
      software object which represents and presents the resource.
      That object does not need to know how it was created
      (necessarily), and so does not need to know the URI it was
      identified by. However, it does know how to interpret the
      Fragment ID.
    </p>
    <p>
      So we have two axioms:
    </p>
    <table border="1" cellpadding="2">
      <tbody>
        <tr>
          <td>
            The access machinery does not need to look at the
            fragment ID.
          </td>
          <td></td>
        </tr>
      </tbody>
    </table>
    <table border="1" cellpadding="2">
      <tbody>
        <tr>
          <td>
            The presentation object does not need to know the URI
            of the resource.
          </td>
        </tr>
      </tbody>
    </table>
    <p>
      The equivalent axioms&nbsp;when we are talking about
      specifications amount to:
    </p>
    <table border="1" cellspacing="5" cellpadding="5">
      <tbody>
        <tr>
          <td>
            The specifications for access protocols are independent
            of the specifications for fragment identifiers.
          </td>
        </tr>
      </tbody>
    </table>
    <h3>
      Why?
    </h3>
    <p>
      For one thing, consider the special case of a link within a
      document. &nbsp;In this case, the link <b>only</b> specifies
      a fragment identifier. &nbsp;The object can follow the link
      itself. It doesn't have to consult the access code in order
      to figure out where the link goes to. Because the "#" syntax
      is universal to all access methods, the object can process
      the link internally. For a static HTML file, for example,
      this means that you can write an HTML file with internal
      links without worrying or knowing about exactly what URIs the
      file will get. It means you don't have to alter the file if
      you choose to serve it in some new name or address space. If
      the "#" syntax were not a universal specification for the
      web, this would break: you couldn't do it. As Jim Gettys points
      out, as the era of digitally signed documents comes upon us,
      changing a signed document will break the signature on it. So
      allowing one to make a self-consistent document with internal
      links in a way independent of the namespace is even more
      essential.
    </p>
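    <p>
      The special case of the internal link can be sketched as
      follows. This is a minimal illustration in Python, assuming a
      hypothetical <code>follow_link</code> function living inside
      the document object and a made-up table of anchors; the point
      is only that the first branch never needs the document's own
      URI.
    </p>
<pre>
```python
def follow_link(href, anchors):
    """Decide, inside the document object, what to do with a link."""
    if href.startswith("#"):
        # Internal link: resolved without knowing the document's own URI.
        return ("internal", anchors[href[1:]])
    # Anything else is handed to the application to dereference.
    return ("external", href)


# Hypothetical table of anchors found in one HTML file.
anchors = {"Model": 1, "Resource": 2}
```
</pre>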
    <h3>
      Why else?
    </h3>
    <p>
      This independence is very important for the evolution of the
      Web. &nbsp;It means that people can go off and design all
      kinds of new systems for naming, addressing and accessing
      documents, without having to worry about what sort of
      documents will be moved. &nbsp;It means that people can go
      off and make new media types (MIME types), each of which can
      have different concepts for views and fragments, without
      having to talk to the people developing the access
      technology. This has already (1998) proved incredibly
      enabling to the community: HTTP has advanced in parallel
      with many other ways of accessing data, and the number of
      exciting media types has grown very rapidly. This
      independence will be the key to many new revolutions built on
      top of the basic Web idea.
    </p>
    <p>
      If you look at the diagram you will notice how the fragment
      IDs are generated by and understood by just the one module.
      You see how, when designing a new MIME type, one is
      quite free to be creative in making new and powerful forms of
      fragment ID, knowing that no other specifications will refer
      to them, and nothing else will break.
    </p>
    <h2>
      Document sets and relative addressing
    </h2>
    <p>
      Now let us look at what happens when we follow a link. For
      example, say a link in a hypertext page is clicked on. The
      page has a representation of the end point of the link. It
      hands it to the application. In fact, often, there are links
      between pages whose URIs are very similar and only differ in
      the right hand part. This isn't true of all name spaces: for
      example, when making links between news articles identified
      by their unique news ID (news:foo), you have to specify the
      whole thing. However, if
      you restrict publication of a set of documents to a
      hierarchical name or address space, then you can arrange for
      documents which are very related and have many links to be in
      the same part of the tree.
    </p>
    <p>
      In this case, the links between these documents are "relative
      URIs".
    </p>
    <p>
      What happens then is that the relative URI, which only has
      the locally different part of the URI in it, is handed back
      to what in the diagram I have called the "application", to be
      turned into an absolute URI by being combined with the
      absolute URI of the resource, which the application has
      remembered.
    </p>
    <p>
      Note that the application is aware of the absolute URI, but
      the resource still does not need to be.
    </p>
    <p>
      Note that the fragment id is still circulated around a loop
      between the object (green) which understands it and the
      application (yellow) which handles it transparently but does
      not understand or change it.
    </p>
    <p>
      Now there was a design decision to make here. The application
      could have passed to the access module both the relative URI
      and the absolute URI. Then, different namespaces would have been
      able to have different algorithms for resolving a base URI
      and a relative URI into a new absolute URI. But the decision
      was made that the relative address format should be common
      across all name spaces.
    </p>
    <p>
      <img src="Parse2.png" width="100%" alt=
      "The URI is split off at the hash into a fragment ID and the rest"
      border="0" />
    </p>
    <h3>
      Why?
    </h3>
    <p>
      Just as we considered internal links above, now consider
      relative links between a bunch of documents, like the
      sections of a book, which are close in the tree. &nbsp;In
      practice, such document sets are moved from place to place,
      from file systems into HTTP space or FTP space, and because
      the relative address rules are universal, the documents do
      not have to be modified every time they are moved. (Yes, if
      you move half the set to one place and half to another, you
      have to fix links). &nbsp;This is happening all the time.
      &nbsp;People are creating and programs are generating
      hypertext with relative links without knowing or caring what
      absolute URI will be used to refer to the material.
    </p>
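    <p>
      The effect of having one universal rule can be seen with
      Python's standard <code>urljoin</code>, which resolves a
      relative URI against a base without any scheme-specific logic
      for these three schemes. The base URIs and the
      <code>resolve_everywhere</code> helper are assumptions chosen
      for the illustration; they stand for the same set of documents
      published in three different places.
    </p>
<pre>
```python
from urllib.parse import urljoin

# Hypothetical places where the same set of documents might be published.
BASES = [
    "file:///home/tim/book/chapter1.html",
    "http://example.org/book/chapter1.html",
    "ftp://ftp.example.org/pub/book/chapter1.html",
]


def resolve_everywhere(relative_uri):
    """Resolve one relative URI against each base with the same rules."""
    return [urljoin(base, relative_uri) for base in BASES]


resolved = resolve_everywhere("chapter2.html#summary")
```
</pre>
    <p>
      The link text <code>chapter2.html#summary</code> inside the
      document is identical in all three cases; only the remembered
      base differs.
    </p>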
    <h2>
      The access scheme
    </h2>
    <p>
      <img src="Parse3.png" width="100%" alt=
      "The URI is split off at the hash into a fragment ID and the rest"
      border="0" />
    </p>
    <p>
      The so-called "access scheme" is the first part of the URI.
      As we have seen above, you don't have to know anything about
      it to parse relative URIs or to process the fragment
      identifier of a URI. The knowledge of particular schemes is
      limited to the "access" function (blue in the above diagram).
    </p>
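    <p>
      One way to picture this confinement is a single dispatch table
      inside the access function. The Python sketch below is an
      assumption-laden illustration: the handler functions are
      placeholders that do no network access, and the names
      <code>ACCESS_CODE</code>, <code>scheme_of</code> and
      <code>access</code> are invented here.
    </p>
<pre>
```python
def scheme_of(absolute_uri):
    """Return the scheme: everything before the first colon, lowercased."""
    return absolute_uri.split(":", 1)[0].lower()


def fetch_http(uri):
    return "would use HTTP for " + uri   # placeholder, no network access


def fetch_ftp(uri):
    return "would use FTP for " + uri    # placeholder, no network access


# Knowledge of particular schemes lives in this one table.
ACCESS_CODE = {"http": fetch_http, "ftp": fetch_ftp}


def access(absolute_uri):
    scheme = scheme_of(absolute_uri)
    if scheme not in ACCESS_CODE:
        raise LookupError("no access code for scheme " + scheme)
    return ACCESS_CODE[scheme](absolute_uri)
```
</pre>
    <p>
      Supporting a new scheme in this sketch means adding one entry
      to the table; nothing that parses relative URIs or fragment
      identifiers has to change.
    </p>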
    <p>
      The scheme is a very important flexibility point, and should
      not be abused. Anyone dereferencing a URI must have a
      knowledge of the scheme it uses.
    </p>
    <p>
      The access scheme defines a huge part of URI space. The
      scheme defines a subspace with particular properties.
    <p>
      The access scheme is <i>by definition</i> the highest point
      of flexibility. What does that mean? It means that if the
      whole Web develops problems which we cannot solve within the
      existing protocols, or if new spaces are designed which
      really can't be accessed through or mapped into existing
      spaces, then we can create a new space. We have faith that we
      will be able to use this flexibility point in the future,
      because it worked successfully for integrating the older
      spaces such as Gopher and FTP spaces into the Web.
    </p>
    <table border="1" cellpadding="2">
      <tbody>
        <tr>
          <td>
            If you have ported a concept between environments in
            the past, then there is a better hope that you can in
            the future.
          </td>
        </tr>
      </tbody>
    </table>
    <h3>
      The danger of too many access schemes
    </h3>
    <p>
      However, we do not do this lightly. When we introduce a new
      space, it may have very different properties and we expect
      that the deployment of new software will be needed to allow
      access to it. Some spaces may be gatewayable into HTTP space,
      and this will often provide a transition path. This is why
      early browsers allowed one to declare in a configuration file
      what gateways to use for what new spaces.
    </p>
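    <p>
      The idea of such a gateway table can be sketched as below.
      This does not reproduce the configuration format of any real
      browser: the <code>GATEWAYS</code> mapping, the gateway address
      and the way the original URI is encoded into the gateway URI
      are all hypothetical choices made for the example.
    </p>
<pre>
```python
from urllib.parse import quote

# Hypothetical configuration: which HTTP gateway serves which scheme.
GATEWAYS = {"wais": "http://gateway.example.org/wais/"}


def route(uri):
    """Rewrite a URI to go through a gateway if one is configured."""
    scheme = uri.split(":", 1)[0].lower()
    gateway = GATEWAYS.get(scheme)
    if gateway is None:
        return uri  # no gateway configured: use the URI directly
    # Percent-encode the whole original URI and append it to the gateway.
    return gateway + quote(uri, safe="")


routed = route("wais://wais.example.org/db")
```
</pre>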
    <p>
      If we use this extension point frivolously, ironically, it
      will cease to work. Suppose very many schemes are introduced.
      The access scheme space itself becomes a namespace with all
      the problems which current namespaces such as DNS are trying
      to solve, but which are very hard problems:
    </p>
    <ul>
      <li>Clashes in the namespace would destroy interoperability;
      </li>
      <li>Ownership of the space becomes commercially valuable;
      </li>
      <li>Democratic and fair management becomes essential and
      difficult.
      </li>
    </ul>
    <p>
      Worse, though, technology will be needed to automatically
      dereference the schemes themselves and download code to
      handle them. Something like DNS will be needed. The top level
      namespace then becomes in fact DNS, or something like it.
      This, however, begs the question. What happens if later DNS
      needs to be replaced? There is no top-level extension switch
      left. The world is stuck with whatever form of access-scheme
      name service exists.
    </p>
    <p>
      Therefore, I conclude that access schemes should not be open
      to trivial extension, and that the access scheme should only
      be extended by the introduction of new standards with full
      open review by the entire community.
    </p>
    <h3>
      Alternatives to new schemes
    </h3>
    <p>
      Whereas some schemes (like "data:") are clearly neat and new
      and orthogonal to HTTP, many schemes could in fact be
      integrated into HTTP, using HTTP extension mechanisms.
    </p>
    <p>
      In fact, if HTTP is to be taken as a general computing
      protocol, then use of an <a href="Extensible.html">extensible
      language system</a> for the HTTP request message would allow
      a huge amount of extension, covering protocols with different
      functionality (exporting different interfaces).
    </p>
    <h3>
      Evolving scheme spaces
    </h3>
    <p>
      When considering the evolution of a space, it is important to
      remember that primarily the access scheme refers to a part of
      the URI space, and secondarily it refers to a protocol.
      Therefore, one can in fact change the protocols used to
      access resources within a scheme's namespace, without
      changing the space. For example, a new DNS protocol could be
      introduced which over time would replace the current one,
      without changing the DNS space. This would effectively
      redefine the HTTP and FTP protocols, but would not harm the
      namespaces. When touch-tone dialing was introduced, the
      telephone numbering system remained the same. So an indexing
      system could be introduced which, when deployed, would allow
      http:// space objects to be found with greater reliability or
      speed than the current protocols, while maintaining the HTTP
      space as being the concatenation of a DNS name and an opaque
      string.
    </p>
    <hr />
    <h2>
      Footnote
    </h2>
    <h4>
      <a name="Resource1" id="Resource1">Resource</a>
    </h4>
    <p>
      The word "document" in the original "Universal Document
      Identifier" in the first web spec was changed to "Resource"
      in the IETF discussions, because (a) the word "document"
      didn't seem to cover all kinds of information resources such
      as movies and sounds, and (b) actually URIs exist for
      communication endpoints such as mailboxes (mailto:) and login
      ports (telnet:). "Resource" was, though, later used by RDF as
      a term for anything - the top class which is the superclass
      of all classes. This stemmed from RDF's initial use as a
      language for describing information resources on the Web,
      although RDF was designed to be used to describe anything as
      a general knowledge representation system. The term
      "Information Resource" was adopted by the TAG for the Web
      Architecture document. When people, including the author in
      the article above, refer to an information resource, they
      often
    </p>
    <h2>
      Related material elsewhere in these notes
    </h2>
    <p>
      <i>Content/Version negotiation and Fragment ID persistence:
      warnings and awareness.</i> See <a href=
      "Fragment.html">Fragment Identifiers</a>
    </p>
    <p>
      <i>&nbsp;If you negotiate between MIME types which have
      different fragment ID representations, you run a risk &amp;
      should warn the client.</i>
    </p>
    <p>
      To be added:
    </p>
    <p>
      <i>Level breaking with care: optimizing in HTTPNG etc</i>
    </p>
    <hr />
    <p>
      <a href="Overview.html">Up to Design Issues</a>, On to URIs
    </p>
  </body>
</html>