<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
    "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
    <title>
      Web Architecture from 50,000 feet
    </title>
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii" />
    <link href="di.css" rel="stylesheet" type="text/css" />
  </head>
  <body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
    <address>
      <p>
        <br />
        Status: An attempt to give a high-level overview of the
        architecture of the WWW. This has been presented to and
        discussed at the WWW conferences, the W3C chairs forum and
        the W3C Advisory Committee. Editing status: Being updated
        for October 1999. More verbose in new areas. Comments
        welcome.
      </p>
    </address>
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
    <hr />
    <h1>
      <a name="Architectu" id="Architectu">Web Architecture from
      50,000 feet</a>
    </h1>
    <p>
      This document attempts to be a high-level view of the
      architecture of the World Wide Web. It is not a definitive
      complete explanation, but it tries to enumerate the
      architectural decisions which have been made, show how they
      are related, and give references to more detailed material
      for those interested. Necessarily, from 50,000 feet, large
      things seem to get a small mention. It is architecture, then,
      in the sense of how things hopefully will fit together. I
      have resisted the urge, and requests, to try to write an
      architecture document for a long time: this was from a
      feeling that any attempt to select which, of all the living
      ideas, seem most stable, logically connected and essential,
      must produce a dead and therefore less valuable document. So
      we should recognize that while it might be slowly changing,
      this is also a living document.
    </p>
    <p>
      The document is written for those who are technically aware
      or intend soon to be, so it is sparse on explanation and heavy
      in terms of terms.
    </p>
    <h3>
      <a name="Goal" id="Goal">Goal</a>
    </h3>
    <p>
      The W3C's broadly stated mission is to lead the
      Web to its "full potential", whatever that means. My
      definition of the Web is a universe of network-accessible
      information, and I break the "full potential" into two by
      looking at it first as a means of human-to-human
      communication, and then as a space in which software agents
      can, through access to a vast amount of everything which is
      society, science and its problems, become tools to work with
      us.
    </p>
    <p>
      <i>(See keynote speeches such as "<a href=
      "../Talks/1998/0227-WebFuture/slide1-1.htm">Hopes for the
      future</a>" at the Quebec Internet Forum, and have written up
      in outline for example in short essay "<a href=
      "../1998/02/Potential.html">Realizing the full potential of
      the Web</a>")</i>
    </p>
    <p>
      In this overview I will deal first with the properties of the
      space itself, then look at its use as a human medium, and
      then at its use as a medium for machine reasoning.
    </p>
    <p>
      This article takes the goals of interoperability and of
      creating an evolvable technology for granted throughout.
      The principles of universality of access
      irrespective of hardware or software platform, network
      infrastructure, language, culture, geographical location, or
      physical or mental impairment are core values in Web design:
      they so permeate the work described that they cannot be
      mentioned in any one place but will likewise be assumed
      throughout. <i>(See <a href=
      "../International">Internationalization Activity</a> and the
      <a href="../WAI/Overview.html">Web Accessibility
      Initiative</a>)</i>
    </p>
    <h3>
      <a name="Principles" id="Principles">Principles of Design</a>
    </h3>
    <p>
      Similarly, we assume throughout the design process certain
      general notions of what makes good design. Principles such as
      <b><a href="Principles.html#KISS">simplicity</a></b> and
      <b><a href="Principles.html#Modular">modularity</a></b> are
      the stuff of software engineering; <b><a href=
      "Principles.html#Decentrali">decentralization</a></b> and
      <b><a href="Principles.html#Tolerance">tolerance</a></b> are
      the life and breath of the Internet. To these we might add the
      principle of the <b><a href="Evolution.html#Least">least
      powerful</a></b> language, and the <a href=
      "Evolution.html#ToII"><b>test of independent
      invention</b></a> when considering evolvable Web technology.
      I do not elaborate on these here (but see <a href=
      "Principles.html">Principles</a>).
    </p>
    <h3>
      <a name="fundamenta" id="fundamenta">The fundamentals: The
      Universal Web</a>
    </h3>
    <p>
      The most fundamental specification of Web architecture, while
      one of the simpler ones, is that of the Universal Resource
      Identifier, or URI. The principle that anything, absolutely
      anything, "on the Web" should be identified distinctly by an
      otherwise opaque string of characters (a URI and possibly a
      fragment identifier) is core to the universality.
    </p>
    <p>
      Great multiplicative power of reuse derives from the fact
      that all languages use URIs as identifiers: this allows
      things written in one language to refer to things defined in
      another language. The use of URIs allows a language to
      leverage the many forms of persistence, identity and various
      forms of equivalence. Each language simply refers to the URI
      spec - this is a flexibility point allowing the properties of
      naming and addressing schemes to be defined separately.
    </p>
    <p>
      <i>(See the <a href="http://www.ietf.org/rfc/rfc2396.txt">URI
      specification</a>; <a href="#Footnote:">Footnote</a>;
      <a href="NameMyth.html">Myths of Naming and
      addressing</a>)</i>
    </p>
    <p>
      There are many design decisions about the properties of URIs
      which are fundamental in that they determine the properties
      of the Web, but which I will not go into here. They include
      the rules for the parsing and use of relative URI syntax, and
      the relationship of view identifiers (fragment ids) to URIs. It
      is important that these are respected in the design of new
      URI schemes.
    </p>
    <p>
      <i>(See the first few <a href="Overview.html">Design
      Issues</a> articles for detailed discussions of these)</i>
    </p>
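    <p>
      <i>(Illustration only: a minimal sketch, in Python using only
      the standard library, of two of these properties - relative
      reference resolution and the separation of the fragment
      identifier from the URI proper.)</i>
    </p>
    <pre>
from urllib.parse import urljoin, urldefrag

base = "http://www.w3.org/DesignIssues/Architecture.html"

# Relative references are resolved against the base URI of the
# document in which they appear.
print(urljoin(base, "Principles.html#KISS"))
# prints http://www.w3.org/DesignIssues/Principles.html#KISS

# The fragment identifier (view identifier) is not sent to the
# server; the client interprets it against the representation
# it retrieves.
uri, fragment = urldefrag("http://www.w3.org/DesignIssues/Principles.html#KISS")
print(uri)        # http://www.w3.org/DesignIssues/Principles.html
print(fragment)   # KISS
</pre>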
    <h3>
      <a name="schemes" id="schemes">URI schemes</a>
    </h3>
    <p>
      The Web is by design and philosophy a decentralized system,
      and its vulnerabilities lie wherever a central facility
      exists. The URI specification raises one such general
      vulnerability, in that the introduction of new URI scheme is
      a potential disaster, immediately breaking interoperability.
    </p>
    <p>
      Guidelines for new Web developments are that they should
      respect the generic definition and syntax of URIs, not
      introduce new URI schemes without due cause, and not introduce
      any different scheme which puts itself forward as universal,
      or as a superset of URIs, since this would effectively
      require information apart from a URI to be used as a
      reference. Also, in new developments, all significant objects
      with any form of persistent identity should be "first class
      objects" for which a URI exists. New systems should use URIs
      wherever a reference exists, without placing constraints on
      the scheme (old or new) which is chosen.
    </p>
    <p>
      The principle of minimalist design requires that the URI
      super-space itself makes the minimum constraint upon any
      particular URI scheme space in terms of properties such as
      identity, persistence and dereferencability. In fact, the
      distinction between names and addresses blurs and becomes
      dangerously confusing in this context. (See Name myths). To
      discuss the architecture of that part of the Web which is
      served using HTTP we have to become more specific.
    </p>
    <ul>
      <li>
        <em>A URI activity is proposed [Oct 99, member only]</em>
      </li>
    </ul>
    <h3>
      <a name="Specific" id="Specific">Specific schemes</a>
    </h3>
    <p>
      A few spaces are worthy of note in which identity is
      fairly well defined, but which have no defined dereferencing
      protocol: the message identifier (mid) and content identifier
      (cid) spaces adopted from the MIME world, the md5: hash code
      with verifiable pure identity, and the pseudo-random
      Universally Unique Identifier (uuid) from the Apollo domain
      system and its followers. These may be underused as URIs.
    </p>
    <p>
      It is also worth pointing out the usefulness of URIs which
      define communication endpoints which do have a persistent
      identity even for connection-oriented technologies for which
      there is no other addressable content. An example is the
      "mailto" scheme which should perhaps have been called
      "mailbox". This object is the most fundamental and very
      widely used object in the email world. It represents
      conceptually a mailbox - something you can mail to. It is a
      mistake to take the URI as a verb: a URI is a noun. Typical
      browsers represent a "mailto:" URI as a window for sending a
      new message to the address, but opening an address book entry
      and a list of messages previously received from or sent to that
      mailbox would also be a useful representation.
    </p>
    <p>
      There is an open question as to what the process should be
      for formulating new URI schemes, but it is clear that to
      allow unfettered proliferation would be a serious mistake. In
      almost all other areas, proliferation of new designs is
      welcomed and the Web can be used as a distributed registry of
      them, but not for the case of URI schemes.
    </p>
    <p>
      It is reasonable to consider URI spaces which are designed to
      have greater persistence than most URIs have today, but not
      technical solutions with no social foundation.
    </p>
    <h2>
      <a name="HTTP" id="HTTP">The HTTP space</a>
    </h2>
    <p>
      The most well-known URI space is the HTTP space,
      characterized by a flexible notion of identity <i>(See
      Generic URIs)</i>, and a richness of information about and
      relating resources, and a dereferencing algorithm which
      currently is defined for reference by the HTTP 1.1 wire
      protocol. In practice, caching, proxying and mirroring
      schemes augment HTTP and so dereferencing may take place even
      without HTTP being invoked directly at all.
    </p>
    <p>
      <i>(See the HTTP 1.1 protocol specification.)</i>
    </p>
    <p>
      The HTTP space consists of two parts, one hierarchically
      delegated, for which the Domain Name System is used, and the
      second an opaque string whose significance is locally defined
      by the authority owning the domain name.
    </p>
    <p>
      <i>(See the DNS specification)</i>
    </p>
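    <p>
      <i>(Illustration only: a small Python sketch, using the
      standard library, of this two-part structure.)</i>
    </p>
    <pre>
from urllib.parse import urlsplit

parts = urlsplit("http://www.w3.org/DesignIssues/Architecture.html")

# The authority is delegated hierarchically through the DNS.
print(parts.netloc)   # www.w3.org

# The rest is an opaque string whose significance is defined
# locally by the authority owning that domain name.
print(parts.path)     # /DesignIssues/Architecture.html
</pre>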
    <p>
      The Achilles' heel of the HTTP space is the only centralized
      part, the ownership and government of the root of the DNS
      tree. As a feature common and mandatory to the entire HTTP
      Web, the DNS root is a critical resource whose governance by
      and for the world as a whole in a fair way is essential. This
      concern is not currently addressed by the W3C, except
      indirectly through involvement with ICANN.
    </p>
    <p>
      The question of improving the persistence of URIs in the HTTP
      space involves issues of tool maturity, user education, and
      maturity of the Web society. The changing of URIs ("moving"
      of resources) is strongly discouraged.
    </p>
    <ul>
      <li>
        <a href="/Provider/Style/URI">See: Cool URIs don't
        change</a>
      </li>
    </ul>
    <p>
      Research work elsewhere has included many "naming" schemes
      variously similar or dissimilar to HTTP, the phrase "URN"
      being used either for any such scheme or for one particular
      one. The existence of such projects should not be taken to
      indicate that persistence of HTTP URIs should not also be
      pursued, or that URIs in general should be partitioned into
      "names" and "addresses". It is extremely important that if a
      new space is created, it be available as a sub-space of the
      universal URI space, so that the universality of the Web is
      preserved, and so that the power of the new space be usable
      for all resources.
    </p>
    <p>
      One can expect HTTP to mature to provide alternative, more
      modern standard ways of dereferencing HTTP addresses, whilst
      keeping the same (hierarchy plus opaque string) address
      space.
    </p>
    <h3>
      <a name="State" id="State">State distribution protocols</a>
    </h3>
    <p>
      Currently on the Internet, HTTP is used for Web pages, SMTP
      for email messages, and NNTP for network news. The curious
      thing about this is that the objects transferred are
      basically all MIME objects, and that the choice of protocol
      is an optimization made by the user, often erroneously. An
      ideal situation is one in which the "system" (machines,
      networks and software) decides adaptively which sorts of
      protocols to use to efficiently distribute information,
      dynamically as a function of readership. This question of an
      efficient flexible protocol blending fetching on demand with
      preemptive transmission is currently seen as too much of a
      research area for W3C involvement.
    </p>
    <h2>
      <a name="Content" id="Content">Content and Remote
      Operations</a>
    </h2>
    <p>
      The URI specification effectively defines a space, that is a
      mapping between identifiers (URIs) and resources. This is, in
      theory, all that is needed to define the space, but in order
      to make the content of the space available, the operation of
      dereferencing an identifier is a fundamental necessity. In
      HTTP this is the "GET" operation. In the Web architecture,
      GET therefore has a special status. It is not allowed to have
      side effects (and it is idempotent) and HTTP has many
      mechanisms for refining concepts of idempotency and identity.
      While other remote operations on resources (objects) in the
      Web are quite valid, and some are indeed included in HTTP,
      the properties of GET are an important principle. The use of
      GET for any operation which has side-effects (such as
      unsubscribing from a mailing list, filling a shopping cart,
      etc) is incorrect.
    </p>
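    <p>
      <i>(Illustration only: a minimal Python sketch, using the
      standard library, of this rule; the /cart/items resource and
      the stored items are assumptions for illustration, not part
      of any specification.)</i>
    </p>
    <pre>
from http.server import BaseHTTPRequestHandler, HTTPServer

CART = []  # server-side state

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Safe and idempotent: GET only returns a representation
        # of the resource identified by the URI.
        body = repr(CART).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # The side effect (filling the cart) is kept behind a
        # non-GET method, never behind a plain dereference.
        if self.path == "/cart/items":
            CART.append("item")
            self.send_response(201)
        else:
            self.send_response(404)
        self.end_headers()

HTTPServer(("localhost", 8000), Handler).serve_forever()
</pre>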
    <p>
      The introduction of any other method apart from GET which has
      no side-effects and is simply a function of the URI is also
      incorrect, because the results of such an operation
      effectively form a separate address space, which violates the
      universality. A pragmatic symptom would be that hypertext
      links would have to contain the method as well as the URI in
      order to be able to address the new space, which people would
      soon want to do.
    </p>
    <p>
      <em>(Example: Instead of defining a new method CVSSTAT to
      retrieve the code management status of a document, that
      status should be given a URI in the server's space, and
      headers used to point the aware client to it. Otherwise, we
      end up with a class of document which contains interesting
      information but cannot be linked to.)</em>
    </p>
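    <p>
      <i>(Illustration only: a hedged Python sketch of this
      pattern. The host, the path and the exact response header
      used to point at the status resource are assumptions for
      illustration, not a defined protocol.)</i>
    </p>
    <pre>
import http.client

conn = http.client.HTTPConnection("example.org")
conn.request("GET", "/src/module.c")
resp = conn.getresponse()

# Rather than a new CVSSTAT method, the server points the aware
# client at a separate resource holding the code management
# status via a response header. That status resource has its own
# URI and can be linked to like anything else.
status_link = resp.getheader("Link")
print(status_link)
</pre>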
    <p>
      The extension of HTTP to include an adaptive system for the
      proactive distribution of information as a function of real
      or anticipated need, and for the location of close copies, is
      a natural optimization of the current muddle of push and pull
      protocols (SMTP, NNTP, HTTP, and HTTP augmented by "channel"
      techniques). This is an area in which the answers are not
      trivial and research is quite appropriate. However, it is in
      the interests of everything which will be built on the Web to
      make the form of distribution protocols invisible wherever
      possible.
    </p>
    <p>
      HTTP in fact combines a basic transport protocol with formats
      for a limited variety of "metadata", information about the
      payload of information. This is a historical inheritance from
      the SMTP world and is an architectural feature which should
      be replaced by a <a href=
      "Metadata.html#MetadataHeaders">clearer distinction</a>
      between the basic HTTP functionality and a dramatically
      richer world of <a href="Metadata.html">metadata</a>.
    </p>
    <p>
      <i>(See <a href="../Propagation/Overview.html">old
      propagation activity statement</a>)</i>
    </p>
    <h3>
      <a name="Remote" id="Remote">Remote Operations: Web
      Services</a>
    </h3>
    <p>
      HTTP was originally designed as a protocol for remote
      operations on objects, with a flexible set of methods. The
      situation in which distributed object-oriented systems such
      as CORBA, DCOM and RMI exist with distinct functionality, and
      distinct from the Web address space, causes a certain tension,
      counter to the concept of a single space. The HTTP-NG
      activity investigated many aspects of the future development
      of HTTP, including a possible unification of the world of
      Remote Procedure Call (RPC) with existing Web protocols. The
      study ended without generating the momentum for further
      work, but the use of XML for inter-company remote operations
      became prevalent (2001) and became known as Web Services.
      See the W3C Web Services activity.
    </p>
    <p>
      Both HTTP and XML have come upon the problem of
      extensibility. The XML/RDF model for extensibility is general
      enough for what RPC needs, in my opinion, and I note that an
      RPC message is a special case of a structured document. To
      take the whole RPC system and represent it in the RDF model
      would be quite reasonable. Of course, a binary format (even
      if just compression) for XML would be required for efficient
      transmission. But the evolvability characteristics of RDF are
      just what RPC needs.
    </p>
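    <p>
      <i>(Illustration only: a sketch of treating an RPC call as a
      structured XML document, built here with Python's standard
      library. The namespace, method name and parameter are
      assumptions for illustration, not a defined protocol.)</i>
    </p>
    <pre>
import xml.etree.ElementTree as ET

RPC = "http://example.org/ns/rpc"   # illustrative vocabulary

# The call is just a structured document...
call = ET.Element("{%s}call" % RPC, {"method": "getQuote"})
arg = ET.SubElement(call, "{%s}argument" % RPC, {"name": "symbol"})
arg.text = "W3C"

# ...so the same machinery used for any other document (parsing,
# signing, compressing) applies to the message on the wire.
wire_form = ET.tostring(call)
print(wire_form)
</pre>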
    <p>
      Web Services differ from previous remote operation work in
      that the transactions are less frequent, and slower, and
      between non-trusted parties. Such things as proof of delivery
      become important, while techniques such as storing messages
      for years can become part of a protocol. The Web Services
      Architecture Group was chartered to define the
      interrelationships between the required functionality such as
      Packaging, Security, Reliability, QoS and so on, discussed as
      Web Services requirements at the W3C Web Services workshop.
    </p>
    <h3>
      <a name="Level" id="Level">Level breaking: Messages and
      Documents.</a>
    </h3>
    <p>
      There has been to date an artificial distinction between the
      transmission standards for "protocols" and "content". In the
      ever-continuing quest for generalization and simplification,
      this is a distinction which cannot last. Therefore, new
      protocols should be defined in terms of the exchange of
      messages, where messages are XML, and indeed, RDF documents.
    </p>
    <p>
      The distinction has been partly historical, and partly
      useful, in that, with protocols defined on top of "messages",
      and defined in order to transport "documents" (or whatever
      vocabulary), one avoids the confusing but illuminating
      recursion of protocols being defined in terms of messages
      exchanged by protocols defined in terms of other messages,
      and so on.
    </p>
    <p>
      In fact this recursion happens all the time and is important.
      Email messages contain email messages. Business protocols are
      enacted using documents which are put on the web or sent by
      SMTP or HTTP using internet messages. The observation that
      these are in fact the same (historically this almost led to
      HTTP messages being defined in SGML) leads to a need for
      generalization and a gain from the multiplicative power of
      combining the ideas. For example, regarding documents and
      messages as identical gives you the ability to sign messages,
      where before you could only sign documents, and to
      binary-encode documents, where before you could only
      binary-encode messages, and so on. What was level breaking
      becomes an architectural
      reorganization and generalization.
    </p>
    <p>
      The ideal goals, then, for an evolution of HTTP would
      include:
    </p>
    <ul>
      <li>A protocol for allowing many concurrent message
      exchanges;
      </li>
      <li>A data typing and marshalling standard for objects as
      general and as extensible as XML documents with namespaces;
      </li>
      <li>A schema system which allows any (Corba, DCom, RMI, etc)
      RPC interface to be defined, with an implied standard
      efficient format for RPC transmission;
      </li>
      <li>Extensions to the RPC state transition protocols to allow
      asynchrony needed for web applications (bidirectionality,
      streaming, asynchronous calls...);
      </li>
      <li>An implementation of a sophisticated socially aware state
      propagation (Web) protocol on top of the new RPC
      functionality, but in a modular way making use of the
      extensibility to allow a much simpler basic design than HTTP
      1.1.
      </li>
    </ul>
    <p>
      <i>(See the old HTTP-NG activity statement, the <a href=
      "../TR/WD-HTTP-NG-architecture/Overview.html">HTTP-NG
      architecture note</a>)</i>
    </p>
    <p>
      Where new protocols address ground which is covered by
      HTTP-NG, awareness and lack of duplication is obviously
      desirable.
    </p>
    <h3>
      <a name="Extension" id="Extension">Extension of access
      protocols</a>
    </h3>
    <p>
      The ubiquity of HTTP, while not a design feature of the Web,
      which could exist with several schemes in concurrence, has
      proved a great boon. This sunny situation is clouded a little
      by the existence of the "https" space, which implies the use
      of HTTP through a Secure Socket Layer (SSL) tunnel. By making
      this distinction evident in the URI, users have to be aware
      of the secure and insecure forms of a document as separate
      things, rather than this being a case of negotiation in the
      process of dereferencing the same document. Whilst the
      community can suffer the occasional surfacing of that which
      should be hidden, it is not desirable as a precedent, as there
      are many other dimensions of negotiation (including language,
      privacy level, etc) for which a proliferation of access
      schemes would be inappropriate.
    </p>
    <p>
      Work at W3C on extension schemes for protocols has been
      undertaken for a long time and while not adopted in a
      wide-scale way in HTTP 1.1, currently takes the form of the
      Mandatory specification. Many features such as PICS or RTSP
      could have benefitted from this had it been defined in early
      HTTP versions.
    </p>
    <p>
      <i>(See the Mandatory Specification)</i>
    </p>
    <p>
      Extension of future protocols such as HTTP-NG is clearly an
      important issue, but hopefully the experience from the
      extensibility of data formats will provide tools powerful
      enough to be picked up directly and used by the HTTP-NG
      community in due course.
    </p>
    <p>
      Specifications for protocols or data formats must allow for
      and distinguish mandatory and optional extensions. A generic
      facility for doing this in XML is clearly called for.
    </p>
    <h2>
      <a name="Data" id="Data">Data Formats</a>
    </h2>
    <h3>
      <a name="Format" id="Format">Format Negotiation</a>
    </h3>
    <p>
      When the URI architecture is defined, and when one has the
      use of at least one dereferencable protocol, then all one
      needs for an interoperable global hypertext system is at
      least one common format for the content of a resource, or Web
      object.
    </p>
    <p>
      The initial design of the Web assumed that there would
      continue to be a wild proliferation of proprietary data
      formats, and so HTTP was designed to have a feature of
      negotiation of common formats between client and server.
      Historically this was not used due to, on the one hand, the
      proliferation of HTML as a common format, and, on the other
      hand, the size of the known formats list which a typical
      client had to send with each transaction.
    </p>
    <p>
      As an architectural feature, this is still desirable. The Web
      is currently full of user awareness of data formats, and
      explicit user selection of data formats, which complicates it
      and hides the essential nature of the information.
    </p>
    <p>
      The discussion of data formats should be seen in this light.
    </p>
    <ul>
      <li>See: The CC/PP protocol in development
      </li>
    </ul>
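    <p>
      <i>(Illustration only: a minimal Python sketch of format
      negotiation; example.org and the formats listed are
      illustrative.)</i>
    </p>
    <pre>
import http.client

conn = http.client.HTTPConnection("example.org")
# The client states which formats it can handle, with preferences;
# the server picks one, and the user never chooses a format by hand.
conn.request("GET", "/report", headers={
    "Accept": "application/xhtml+xml, text/html;q=0.8, text/plain;q=0.1"
})
resp = conn.getresponse()
print(resp.getheader("Content-Type"))   # the format the server selected
</pre>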
    <h3>
      <a name="MIME" id="MIME">MIME types</a>
    </h3>
    <p>
      In HTTP, the format of data is defined by a "MIME type". This
      formally refers to a central registry kept by IANA. However,
      architecturally this is an unnecessary central point of
      control, and there is no reason why the Web itself should not
      be used as a repository for new types. Indeed, a transition
      plan, in which unqualified MIME types are taken as relative
      URIs within a standard reference URI in an online MIME
      registry, would allow migration of MIME types to become first
      class objects.
    </p>
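    <p>
      <i>(Illustration only: a Python sketch of this transition
      idea, treating an unqualified MIME type as a relative URI
      within a standard registry URI; the registry base shown is an
      assumption, not an existing registry.)</i>
    </p>
    <pre>
from urllib.parse import urljoin

# Hypothetical base URI of an online MIME registry.
REGISTRY_BASE = "http://www.example.org/mime-registry/"

def mime_type_uri(mime_type):
    """Map a bare MIME type such as 'image/png' onto a full URI,
    making the type a first-class, linkable object."""
    return urljoin(REGISTRY_BASE, mime_type)

print(mime_type_uri("image/png"))
# prints http://www.example.org/mime-registry/image/png
</pre>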
    <p>
      The adoption by the community of a tested common recommended
      data format would then be a question not of (central)
      registry but of (possibly subjective) endorsement.
    </p>
    <p>
      Currently the Web architecture requires the syntax and
      semantics of the URI fragment identifier (the bit after the
      "#") to be a function of MIME type. This requires it to be
      defined with every MIME registration. This poses an unsolved
      problem when combined with format negotiation.
    </p>
    <h3>
      <a name="XML" id="XML">Common Syntax for Structured
      documents: XML</a>
    </h3>
    <p>
      While HTML was, partly for political reasons, based upon the
      generic SGML language, the community has been quite aware
      that while sharing a common syntax for structured documents
      was a good idea, something simpler was required. XML was the
      result.
    </p>
    <p>
      <i>(See the <a href="/XML/Activity.html">XML Activity
      Statement</a>)</i>
    </p>
    <p>
      While in principle anyone is free to use any syntax in a new
      language, the evident advantages from sharing the syntax are
      so great that new languages should, where it is not overly
      damaging in other ways, be written in XML. Apart from the
      efficiency of sharing tools, parsers, and understanding, this
      also leverages the work which has been put into XML in the
      way of internationalization and extensibility.
    </p>
    <h3>
      <a name="Namespaces" id="Namespaces">Namespaces</a>
    </h3>
    <p>
      The extensibility in XML is essential in order to crack a
      well-known tension in the software world between free but
      undefined extension and well-defined but awkward extension in
      the RPC world. An examination of the needs for evolution of
      technology in a distributed community of developers shoes
      that the language must have certain features:
    </p>
    <ul>
      <li>It must be possible to precisely define a language (the
      set of tokens, grammar, and semantics) as a first class
      object;
      </li>
      <li>It must be possible to make documents in a mixture of
      languages (language mixing)
      </li>
      <li>Every document should be self-defining by carrying the
      URI(s) of the language(s) in which it is written;
      </li>
      <li>It must be possible to process a document understanding a
      subset of the languages (partial understanding).
      </li>
    </ul>
    <p>
      <i>(See <a href=
      "../Talks/1998/0415-Evolvability/slide1-1.htm">Evolvability
      Talk</a> at WWW7, and design issues:</i> <a href=
      "Evolution.html">Evolvability</a>)<br />
      (See Note "<a href="../TR/NOTE-webarch-extlang.html">Web
      architecture: extensible languages</a>")
    </p>
    <p>
      These needs lie behind the evolution of data formats whether
      of essentially human-targeted or of machine-understandable
      (semantic) data.
    </p>
    <p>
      When a new language is defined, XML should in general be
      used. When it is, the new language, or the new features
      extending an existing language, must be defined as a new
      namespace. (That is, new non-XML syntaxes, processing
      instructions, or tunnelling of functionality within other XML
      entities, etc., are inappropriate.) A namespace URI must be used
      to identify the language. XML should be considered to include
      XML 1.0 and Namespaces.
    </p>
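    <p>
      <i>(Illustration only: a Python sketch of language mixing and
      partial understanding; the second vocabulary and its
      namespace URI are assumptions for illustration.)</i>
    </p>
    <pre>
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
INV = "http://example.org/ns/invoice"    # a hypothetical second language

# Build a document mixing elements from the two namespaces.
html = ET.Element("{%s}html" % XHTML)
body = ET.SubElement(html, "{%s}body" % XHTML)
para = ET.SubElement(body, "{%s}p" % XHTML)
para.text = "Total due: "
total = ET.SubElement(para, "{%s}total" % INV)
total.text = "42.00"

# A processor which only understands XHTML can still handle the
# document, skipping elements from namespaces it does not know
# (partial understanding) rather than failing.
for elem in html.iter():
    if not elem.tag.startswith("{" + XHTML + "}"):
        print("skipping element from an unknown namespace:", elem.tag)
</pre>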
    <p>
      The XML and RDF schema languages are mature now (2002). New
      namespaces must be designed assuming the use of schemas, and
      not relying on DTD functionality. Where the functionality
      being introduced maps onto a logical assertion model, then
      the mapping onto the RDF model below should be defined, and,
      normally, RDF used. An alternative is to define an XML schema
      and a mapping algorithm from an XML document using the
      namespace to RDF.
    </p>
    <p>
      Language specifications should define the ways in which they
      can be extended. This typically involves defining types of
      element of which subtypes can be created in future languages.
      The structural constraints of the original language will then
      define how the new language may syntactically be mixed with
      the old, and the semantics of the old specification will
      define how the new elements should be interpreted at the
      semantic level of that specification. (Note that typically,
      in an object-oriented support class, this will require classes
      supporting the new elements to support the same API as the
      superclass in the original language.) Future work in this
      area is required to clarify this and how it is expressed in
      the schema language.
    </p>
    <p>
      New languages (namespaces) may, in summary, be introduced in
      two ways. Firstly, as a completely new application (such as a
      downloaded bank transfer), allowing interoperability where
      previously formats were proprietary and indecipherable.
      Secondly, as an extension to an existing application such as
      HTML or RDF. In the latter case languages such as style
      sheets for human readable documents and inference rules for
      logical documents will define the interpretation of the new
      language at a given semantic level.
    </p>
    <p>
      The namespace document (with the namespace URI) is a place
      for the language publisher to keep definitive material about
      a namespace. Schema languages were the first languages
      available for this, but could only give syntactic
      constraints. More generally, one would expect a more powerful
      language to allow all kinds of information to be provided,
      either inline (like RDF) or by reference (like RDDL or RDF).
      There is a huge amount of value to be gained from having a
      document be self-describing in the Web. (This does not
      preclude the operation of checking a document against a
      different schema if one wants to as a local operation). The
      first stage in self-describing documents is to do it at the
      XML schema (structure) level. Successive stages are to give
      semantic information. [See grounded documents]
    </p>
    <p>
      Languages, like resources, may be living or frozen. Making the
      language a living language is in my opinion dangerous and
      asking for HTML-like divergence. Even when a language is
      frozen, the namespace document may change as new languages
      become available to express different forms of semantics
      about the language. The namespace document may for example
      include or link to:
    </p>
    <ul>
      <li>Syntactic constraints (e.g. in xml-schema)
      </li>
      <li>Range and domain of properties (e.g. in rdf-schema)
      </li>
      <li>Default or mandatory style sheet for display of the
      language to a person (e.g. in CSS or XSL)
      </li>
      <li>and so on...
      </li>
    </ul>
    <p>
      A namespace document clearly may have a mixture of languages.
    </p>
    <h2>
      <a name="Human" id="Human">Human Readable Information</a>
    </h2>
    <p>
      By <b>human readable</b> information I mean documents in the
      traditional sense which are intended for human consumption.
      While these may be transformed, rendered, analyzed and
      indexed by machine, the idea of them being <i>understood</i>
      is an artificial-intelligence complete problem which I do not
      address as part of the Web architecture. When I talk about
      <b>machine-understandable</b> documents, therefore, I mean
      data which has explicitly been prepared for machine
      reasoning: part of a semantic web. (The use of the term
      "semantics" by the SGML community to distinguish content from
      form is unfortunately confusing and is not used here.)
    </p>
    <h3>
      <a name="Separation" id="Separation">Separation of Form and
      Content</a>
    </h3>
    <p>
      An architectural rule which the SGML community embraced is
      the separation of form and content. It is an essential part
      of Web architecture, making possible the independence of
      device mentioned above, and greatly aiding the processing and
      analysis. The addition of presentation information to HTML
      when it could be put into a style sheet breaks this rule. The
      rule applies to many specifications apart from HTML: in the
      Math Markup Language (MathML) two levels of the language
      exist, one having some connection with mathematical meaning,
      and the other simply indicating physical layout.
    </p>
    <h3>
      <a name="Graphics" id="Graphics">Graphics</a>
    </h3>
    <p>
      The development of different languages for human readable
      documents can be relatively independent. So 2D graphic
      languages such as PNG and SVG are developed essentially
      independently of 3D languages such as VRML (handled not by
      W3C but by the VRMLC, now Web3D) and text languages such as
      HTML and MathML. Coordination is needed when aspects of
      style, fonts, color and internationalization are considered,
      where there should be a single common model for all
      languages.
    </p>
    <p>
      PNG was introduced as a compact encoding which improved on
      GIF both technically (color, flexibility and transparency)
      and politically (lack of encumbrance). <a href=
      "/Graphics/SVG/">SVG</a> is required as a common format in
      response to the large number of suggestions for an object
      oriented drawing XML language.
    </p>
    <h3>
      <a name="HTML" id="HTML">HTML</a>
    </h3>
    <p>
      The value of a common document language has been so enormous
      that HTML has gained a dominance on the Web, but it does not
      play a fundamental key role. Web applications are required to
      be able to process HTML, as it is the connective tissue of
      the Web, but it has no special place architecturally.
    </p>
    <p>
      HTML has benefitted and suffered from the "ignore what you
      don't understand" rule of free extension. In future, the plan
      is to migrate HTML from being an SGML application to being
      defined as an XML namespace, making future extension a
      controlled matter of namespace definition. The first step is
      a specification for documents with precisely HTML 4.0
      features but which are XML documents.
    </p>
    <p>
      <i>(See <a href="../TR/NOTE-rdfarch.html">W3C Data
      Formats</a> note)</i>
    </p>
    <h3>
      <a name="XHTML" id="XHTML">XHTML transition</a>
    </h3>
    <p>
      The transition strategy from HTML as it is practiced today to
      HTML based on XML in the future is difficult. It is driven by
      many constraints:
    </p>
    <ol>
      <li>Currently many web pages are badly formed and do not
      adhere to the HTML 4.0 standard, nor to SGML;
      </li>
      <li>Browsers must for a long time be able to read these
      legacy web pages.
      </li>
      <li>Many browsers exist which cannot parse XML;
      </li>
      <li>There is a way, "XHTML", to write a well-formed XML
      document so that it appears to a typical legacy browser to be
      HTML and it is parsed correctly;
      </li>
      <li>One can tell the difference between an old HTML document
      and an XML document by the namespace declaration;
      </li>
    </ol>
    <p>
      The transition strategy is to start using XML internally
      within a site, and for internal documents, while formatting
      web pages as XHTML. This will allow web sites to use many
      XML tools, encouraging the market for XML tools.
    </p>
    <p>
      The second phase is for web sites to convert HTML pages to
      XHTML pages. An incentive to do this will be to be able to
      use XML tools directly on the site (for reading: a special
      converter will be needed to write XHTML pages). This will
      create a base of well-formed XML pages, which hopefully will
      encourage the inclusion of XML parsing in browsers and
      search engines. During the transition phase, any XML-capable
      browser finding an XML document must assume that it is
      well-formed XML with namespaces. The community must be
      careful to condemn any lax interpretation of the XML
      specification, such as nominally XHTML pages which are not
      well-formed XML in exactly the XHTML namespace. Anything
      which is not XML may be fed by a browser to a legacy HTML
      engine. Legacy HTML pages will of course <strong>not</strong>
      be extendable using other namespaces. XHTML pages will be
      extensible bearing in mind that legacy browsers will ignore
      any tag they don't recognize. Hopefully the transition will
      be eased by the availability of open source code which will
      take a typical old HTML page and convert it (with zero loss
      in most cases) into a completely valid XHTML page. This will
      allow all new tools to be built simply to accept XML, and
      therefore be ready as the use of XML spreads.
    </p>
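    <p>
      <i>(Illustration only: a hedged Python sketch of the
      behaviour described above - try the page as well-formed XML
      first, and hand anything else to a forgiving legacy HTML
      engine. The file name is illustrative.)</i>
    </p>
    <pre>
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class LegacyHTMLEngine(HTMLParser):
    """Stand-in for a tolerant, tag-soup legacy HTML engine."""
    def handle_starttag(self, tag, attrs):
        pass  # a real engine would build a document tree here

def load_page(text):
    try:
        # Well-formed XHTML (with namespaces) takes the XML path.
        return ET.fromstring(text)
    except ET.ParseError:
        # Anything which is not XML is fed to the legacy engine.
        engine = LegacyHTMLEngine()
        engine.feed(text)
        return engine

with open("legacy-page.html", encoding="utf-8") as f:
    document = load_page(f.read())
</pre>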
    <p>
      Eventually, the weight of sites which need to use other
      languages, or other XML features such as Unicode, will
      hopefully cause a general upgrade until the vast majority
      of Web clients are capable of handling XML with namespaces,
      and sites will be able to insist on it from their readers. At
      this point, a web service of translation of legacy pages
      would be one solution for general access to the archive of
      historical badly-formed documents.
    </p>
    <p>
      <em>Note: This transition strategy has been the cause of much
      debate, as some favor a complete switch from HTML to XML
      without compatibility.</em>
    </p>
    <h3>
      <a name="Topology" id="Topology">Hypertext Link topology</a>
    </h3>
    <p>
      A fundamental compromise which allows the Web to scale (but
      created the dangling link problem) was the architectural
      decision that links should be fundamentally mono-directional.
      Links initially had three parameters: the source (implicit
      when in the source document), destination and type. The third
      parameter, intended to add semantics, has not been heavily
      used, but XLINK activity has as one goal to reintroduce this
      to add richness especially to large sets of Web pages. Note
      however that the Resource Description Framework, introduced
      below, is a model (based on an equivalent 3-component
      assertion onto which a link maps directly), and so link
      relationships, like any other relation in future Web
      architecture, must be expressible in RDF. In this way, link
      relationships in HTML, and in future XML hypertext languages,
      should migrate to becoming first class objects.
    </p>
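    <p>
      <i>(Illustration only: a Python sketch of the mapping from a
      typed link - source, destination and type - onto a three-part
      assertion; the URIs and the relationship are assumptions for
      illustration.)</i>
    </p>
    <pre>
# A typed hypertext link as a (subject, predicate, object) assertion:
# the source maps to the subject, the link type to the predicate,
# and the destination to the object.
link = (
    "http://example.org/overview.html",       # source (subject)
    "http://example.org/terms#references",    # type (predicate)
    "http://example.org/details.html",        # destination (object)
)

assertions = [link]

def links_of_type(statements, predicate):
    """All (source, destination) pairs joined by a given relationship."""
    return [(s, o) for s, p, o in statements if p == predicate]

print(links_of_type(assertions, "http://example.org/terms#references"))
</pre>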
    <p>
      XLINK will also define more sophisticated link topologies,
      and address the human interface questions related to them,
      still using the same URI space and using RDF as the defining
      language for relationship specification. (It may be
      appropriate for information based on the RDF model to be
      actually transferred in a different syntax for some reason,
      but the specification must define the mapping, so that common
      processing engines can be used to process combinations of
      such information with other information in the RDF model.)
    </p>
    <h3>
      <a name="Style" id="Style">Style Sheets</a>
    </h3>
    <p>
      The principle of modular design implies that when form and
      content are separated, the choice of language for each
      should, if possible, be made independently. HTML has dominated
      as the text markup (content) language, but the introduction of
      XML opened the door for the use of new XML markup languages
      between parties which share them. (See the <a href=
      "../Style/Overview.html">Style</a> activity at W3C).
    </p>
    <p>
      Style essentially is the mapping between the abstract content
      of a document and the physical form in which it is displayed,
      spoken, performed or in general presented, to its recipient.
    </p>
    <p>
      For graphic style, Cascading Style Sheets (<a href=
      "/Style/CSS/">CSS</a>) provide a way of declaring the form in
      which elements of a document should be presented. It has the
      advantage of being declarative and so reversible: one can
      make an editor which edits a document with the style sheet
      applied to it. <a href="/Style/XSL/">XSL</a>T, by contrast,
      is a transformation language which can make an arbitrarily
      complex mapping from input to output structure. This allows
      more powerful processing, but is not in general reversible.
      For graphic presentation, XSLT can be used to map to a set of
      Formatting Objects (XSL-FO) whose formatting properties are to
      be a superset of those of CSS.
    </p>
    <p>
      The fact that CSS is not an XML language is largely
      historical, as it preceded XML: a namespace of CSS
      formatting properties to allow CSS to be added intuitively to
      any XML document would be a natural development, but may be
      made unnecessary by XSL-FO.
    </p>
    <h3>
      <a name="Collaboration" id="Collaboration">Collaboration</a>
    </h3>
    <p>
      The original idea of the Web being a creative space for
      people to work together in ("intercreative") seems to be
      making very slow progress.
    </p>
    <p>
      <i><a href="../Collaboration/Workshop/Overview.html">See W3C
      Collaboration Workshop</a></i>
    </p>
    <p>
      This field is very broad and can be divided into areas:
    </p>
    <ol>
      <li>Asynchronous collaboration tools
        <ul>
          <li>Discussion forums
          </li>
          <li>Workflow automata
          </li>
          <li>Annotation systems (see Annotea)
          </li>
          <li>Endorsement (see PICS)
          </li>
          <li>Collaborative filtering
          </li>
        </ul>
      </li>
      <li>Integration of real-time audio/video collaboration and
      the Web
        <ul>
          <li>addressing for video - callto: etc
          </li>
          <li>integration of video in HTML (SMIL etc)
          </li>
          <li>Co-presence systems
          </li>
        </ul>
      </li>
      <li>Group editors (synchronous hypertext editors, whiteboards
      etc)
      </li>
      <li>Asynchronous distributed editing. (Amaya, Jigsaw,
      Jigedit, WebDAV)
      </li>
    </ol>
    <p>
      A precursor to much collaborative work is the establishment
      of an environment with sufficient confidentiality to allow
      trust among its members. Therefore the Consortium's work on a
      semantic web of trust addressed below may be a gating factor
      for much of the above.
    </p>
    <p>
      Many of the above areas are research areas, and some are
      areas in which products exist. It is not clear that there is
      a demand among W3C members to address common specifications
      in this area right now, but suggestions are welcome. The
      Consortium tries to use whatever web-based collaborative
      techniques are available, including distributed editing of
      documents in the web, and automatic change tracking. The Live
      Early Adoption and Demonstration (LEAD) philosophy of W3C was
      introduced specifically for areas like this where many small
      pieces need to be put together to make it happen, but one
      will never know how large any remaining problems are until
      one tries. Still, largely, this section in the architecture
      is left as a place-holder for later expansion. It may not be
      the time yet, but collaborative tools are a requirement for
      the Web and the work is not done until a framework for them
      exists.
    </p>
    <h2>
      <a name="SemanticWeb" id="SemanticWeb">Machine-Understandable
      information: Semantic Web</a>
    </h2>
    <p>
      The Semantic Web is a web of data, in some ways like a global
      database. The rationale for creating an infrastructure is
      given elsewhere [Web future talks etc]; here I only outline
      the architecture as I see it.
    </p>
    <p>
      See:
    </p>
    <ul>
      <li>
        <a href="Semantic.html">The Semantic Web Roadmap</a> in
        Design Issues
      </li>
      <li>
        <a href="../RDF/Overview.html">The RDF home page</a>
      </li>
      <li>
        <a href="../TR/WD-rdf-syntax/">RDF Model and Syntax
        Specification</a>
      </li>
    </ul>
    <p>
      When looking at a possible formulation of a universal Web of
      semantic assertions, the principle of minimalist design
      requires that it be based on a common model of great
      generality. Only when the common model is general can any
      prospective application be mapped onto the model. The general
      model is the <a href="../RDF/Overview.html">Resource
      Description Framework</a>.
    </p>
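    <p>
      <i>(Illustration only: a minimal Python sketch of the "web of
      data" idea - assertions from different documents share one
      model, so they can be merged and queried together. The URIs
      are illustrative; the creator and date properties are taken
      from the Dublin Core vocabulary.)</i>
    </p>
    <pre>
# Assertions harvested from two different documents, in one model.
from_doc_a = [
    ("http://example.org/report",
     "http://purl.org/dc/elements/1.1/creator", "Alice"),
    ("http://example.org/report",
     "http://purl.org/dc/elements/1.1/date", "1999-10-01"),
]
from_doc_b = [
    ("http://example.org/report",
     "http://example.org/terms#supersedes",
     "http://example.org/old-report"),
]

graph = from_doc_a + from_doc_b   # merging is just pooling assertions

def match(graph, s=None, p=None, o=None):
    """A query is a pattern over assertions; None is a wildcard."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Everything the merged graph knows about the report.
print(match(graph, s="http://example.org/report"))
</pre>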
    <h3>
      <a name="Semantic" id="Semantic">Semantic Web: the
      pieces.</a>
    </h3>
    <p>
      The architecture of RDF and the semantic web built on it is a
      plan but not yet all a reality. There are various pieces of
      the puzzle which seem to fall into a chronological order,
      although the turn of events may change that. (Links below are
      into the <a href="Semantic.html#Signature">Semantic Web
      roadmap</a>)
    </p>
    <ol>
      <li>XML provides a basic format for structured documents,
      with no particular semantics.
      </li>
      <li>The <a href="Semantic.html#Assertion">basic assertion
      model</a> provides the concepts of assertion (property) and
      quotation. (This is provided by the <a href=
      "../TR/WD-rdf-syntax/">RDF Model and Syntax
      Specification</a>). This allows an entity-relationship-like
      model to be made for the data, giving it the semantics of
      assertions of propositional logic. (See the <a href=
      "/TR/1999/NOTE-schema-arch-19991007">Cambridge
      Communiqu&eacute;</a> about the XML-RDF relationship.) The RDF
      syntax was considered in need of a change.
      </li>
      <li>The <a href="Semantic.html#Schema">schema language</a>
      provides data typing and allows document structure to be
      constrained to allow predictable computable processing. XML
      schema's datatypes are used.
      </li>
      <li>The Ontology layer (WebOnt working group) provides more
      powerful schema concepts, such as inverse, transitivity, and
      so on. Uniqueness and/or unambiguousness of properties, when
      known, allow a system to spot different identifiers which in
      fact are talking about the same thing.
      </li>
      <li>A <a href="Semantic.html#Conversion">conversion
      language</a> allows the expression of inference rules
      allowing information in one schema to be inferred from a
      document in another. This is part of the rules layer (a small
      sketch of such a rule follows this list).
      </li>
      <li>An <a href="Semantic.html#Inference">evolution rules
      language</a> allows inference rules to be given which allow a
      machine with a certain algorithm to convert documents from
      one RDF schema into another. This is a fundamental key to
      <a href="Evolution.html">evolution</a> of the technology.
      There may be more than one rules standard, as different
      classes of rule-based system have different capabilities.
        <p>
          <a href="Semantic.html#Query">Query languages</a> assume
          different forms of query engine, but are basically the
          same problem space as rule systems. (The antecedent of a
          rule is a query.) One can imagine standardizing both
          certain query engines and a language for defining query
          engines. See the RDF Interest Group for discussion of
          querying logically.
        </p>
      </li>
      <li>The <a href="Semantic.html#Logical">logical layer</a>
      turns a limited declarative language into a Turing-complete
      logical language, with inference and functions. This is
      powerful enough to be able to define all the rest, and allow
      any two RDF applications to be connected together. However,
      without being profiled for use, it does not address specific
      applications. One can see this language as being a universal
      language to unify all data systems just as HTML was a
      language to unify all human documentation systems.
      </li>
      <li>A proof language is a form of RDF which allows one agent
      to send to another an assertion, together with the inference
      path to that assertion from assumptions acceptable to the
      receiver. This allows applications such as access control to
      use a generic validation engine as the kernel, with very
      case-specific tools for producing proofs of access according
      to whatever social rules have been devised for the case. A
      W3C Recommendation for the language and capabilities of a
      standard proof engine would be very appropriate. One could
      see this engine as being based on the logic layer, or on a
      less expressive rules layer - especially if the logic layer
      remains a research issue at a time when proofs in terms of
      rules are a practical need for interchange. (A sketch of
      checking such a proof step by step follows this list.)
      </li>
    </ol>
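    <p>
      As an illustration of the assertion model in step 2, the
      following sketch (in no sense W3C-specified code; the URIs
      are there only for illustration) holds RDF-style assertions
      as simple subject/property/value triples in Python and reads
      back the properties of one resource.
    </p>
    <pre>
# A minimal sketch: RDF-style assertions as (subject, property, value)
# triples.  The example URIs are for illustration only.
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

graph: Set[Triple] = {
    ("http://example.org/doc1",
     "http://purl.org/dc/elements/1.1/title",
     "Design Issues"),
    ("http://example.org/doc1",
     "http://purl.org/dc/elements/1.1/creator",
     "http://example.org/people#tbl"),
}

def properties_of(subject: str, graph: Set[Triple]) -> List[Tuple[str, str]]:
    """Return every (property, value) pair asserted about a subject."""
    return [(p, o) for (s, p, o) in graph if s == subject]

print(properties_of("http://example.org/doc1", graph))
</pre>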
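    <p>
      The rule and query layers of steps 4 to 6 can be caricatured
      in a few lines: a rule's antecedent is a query pattern over
      triples, and applying the rule adds the triples of its
      consequent. The toy engine below (an illustration only, with
      invented vocabularies such as ex:partOf and other:within, and
      in no sense a proposed standard) shows a schema-conversion
      rule and a transitivity rule of the kind an ontology layer
      would declare.
    </p>
    <pre>
# Toy forward chaining: terms beginning with "?" are variables.
def match(pattern, triple, bindings):
    """Try to unify one pattern with one triple, extending bindings."""
    b = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):
            if p in b and b[p] != t:
                return None
            b[p] = t
        elif p != t:
            return None
    return b

def apply_rule(antecedents, consequent, graph):
    """Triples implied by one pass of the rule over the graph."""
    solutions = [{}]
    for pattern in antecedents:
        solutions = [b2 for b in solutions for t in graph
                     if (b2 := match(pattern, t, b)) is not None]
    return {tuple(b.get(term, term) for term in consequent)
            for b in solutions}

graph = {
    ("ex:page", "ex:partOf", "ex:site"),
    ("ex:site", "ex:partOf", "ex:organization"),
}

# Conversion rule: anything stated with ex:partOf may be restated
# in another schema's vocabulary, other:within.
graph |= apply_rule([("?x", "ex:partOf", "?y")],
                    ("?x", "other:within", "?y"), graph)

# Ontology-style rule: ex:partOf is transitive.
graph |= apply_rule([("?x", "ex:partOf", "?y"),
                     ("?y", "ex:partOf", "?z")],
                    ("?x", "ex:partOf", "?z"), graph)

print(sorted(graph))
</pre>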
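    <p>
      For the proof language of the last step, a receiver needs
      only to replay the claimed inference chain against
      assumptions it already accepts. The sketch below is again
      purely illustrative: the rule name, the access-control
      vocabulary, and the convention that premises are listed in
      the order of the rule's antecedents are all invented for the
      example.
    </p>
    <pre>
# A proof is a list of (rule name, premises, conclusion) steps.
accepted = {
    ("ex:alice", "ex:memberOf", "ex:team"),
    ("ex:team", "ex:hasAccess", "ex:report"),
}

rules = {
    "member-access": (
        [("?p", "ex:memberOf", "?g"), ("?g", "ex:hasAccess", "?doc")],
        ("?p", "ex:hasAccess", "?doc"),
    ),
}

proof = [
    ("member-access",
     [("ex:alice", "ex:memberOf", "ex:team"),
      ("ex:team", "ex:hasAccess", "ex:report")],
     ("ex:alice", "ex:hasAccess", "ex:report")),
]

def step_follows(antecedents, consequent, premises, conclusion):
    """Is this step an instance of the rule, premises given in order?"""
    if len(antecedents) != len(premises):
        return False
    bindings = {}
    for pattern, fact in zip(antecedents, premises):
        for term, value in zip(pattern, fact):
            if term.startswith("?"):
                if bindings.setdefault(term, value) != value:
                    return False
            elif term != value:
                return False
    return tuple(bindings.get(t, t) for t in consequent) == conclusion

def check(proof, accepted, rules):
    known = set(accepted)
    for rule_name, premises, conclusion in proof:
        antecedents, consequent = rules[rule_name]
        if not all(p in known for p in premises):
            return False   # a premise was never established
        if not step_follows(antecedents, consequent, premises, conclusion):
            return False   # the step does not follow from the named rule
        known.add(conclusion)
    return True

print(check(proof, accepted, rules))   # True for this well-formed proof
</pre>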
    <p>
      Once one has a proof language, then the introduction of
      <a href="Semantic.html#Signature">digital signature</a> turns
      what was a web of reason into a web of trust. The development
      of digital signature functionality in the RDF world can in
      principle happen in parallel with the stages above, as more
      expressive logical languages become available; but it
      requires that the logical layer be defined as a basis for
      defining the new primitives which describe signature and
      inference in a world which includes digital signature.
    </p>
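    <p>
      As a rough illustration of what signing an assertion might
      look like in practice (this uses the third-party Python
      "cryptography" package and a naive serialization of a single
      invented triple; it is a sketch of the idea, not a proposed
      XML or RDF signature format):
    </p>
    <pre>
from cryptography.hazmat.primitives.asymmetric import ed25519

triple = ("ex:alice", "ex:hasAccess", "ex:report")
message = " ".join(triple).encode("utf-8")   # naive canonical form

private_key = ed25519.Ed25519PrivateKey.generate()
public_key = private_key.public_key()
signature = private_key.sign(message)

# The receiver, holding the public key, the triple and the signature,
# verifies the assertion; verify() raises an exception on tampering.
public_key.verify(signature, message)
print("assertion verified")
</pre>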
    <p>
      A single digital signature format for XML documents is
      important. The power of the RDF logical layers will allow
      existing certificate schemes to be converted into RDF, and a
      trust path to be verified by a generic RDF engine.
    </p>
    <h2>
      <a name="Metadata" id="Metadata">Metadata applications</a>
    </h2>
    <p>
      The driver for the semantic web at level 1 above is
      information about information, normally known as metadata.
      The following areas are examples of technologies which should
      use RDF, and which are being, or which we expect to be,
      developed within the W3C.
    </p>
    <ul>
      <li>Information practice labels (<a href=
      "../P3P/Overview.html">P3 Project</a>)
      </li>
      <li>
        <a href="../AudioVideo/Overview.html">Synchronized
        MultiMedia</a> (SMIL)
      </li>
      <li>Intellectual Property Rights - Distribution Rights
      languages
      </li>
      <li>
        <a href=
        "../ECommerce/Micropayments/Overview.html">Micropayment</a>s
        (link labeled as "must pay to follow")
      </li>
      <li>Digital Libraries: Catalog: (e.g. <a href=
      "http://purl.oclc.org/metadata/dublin_core/">Dublin Core</a>)
      </li>
    </ul>
    <p>
      This is by no means an exhaustive list. Any technology which
      involves information about web resources should express it
      according to the RDF model. The plan is that HTML LINK
      relationships be transitioned into RDF properties (a small
      illustration follows the list below). We can continue the
      list of examples for which RDF is clearly appropriate:
    </p>
    <ul>
      <li>Version control information
      </li>
      <li>Relationships between <a href="Generic.html">generic</a>
      and specific URIs
      </li>
      <li>Access control information
      </li>
      <li>Structural information in complex works of many component
      resources
      </li>
      <li>Relationships between a document and its style sheet
      </li>
    </ul>
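    <p>
      As a small illustration of the HTML LINK transition mentioned
      above (all URIs invented), a link such as rel="stylesheet" on
      a page becomes an ordinary RDF-style property of that page:
    </p>
    <pre>
# The relationship expressed by a LINK element with rel="stylesheet"
# on http://example.org/page.html, restated as a triple.
triple = (
    "http://example.org/page.html",          # subject: the document
    "http://example.org/vocab#stylesheet",   # property: the LINK rel
    "http://example.org/style.css",          # object: the style sheet
)
print(triple)
</pre>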
    <h3>
      <a name="Indexes" id="Indexes">Indexes of terms</a>
    </h3>
    <p>
      Given a worldwide semantic web of assertions, the search
      engine technology currently (1998) applied to HTML pages will
      presumably translate directly into indexes not of words, but
      of RDF objects. This itself will allow much more efficient
      searching of the Web as though it were one giant database,
      rather than one giant book.
    </p>
    <p>
      The Version A to Version B translation requirement has now
      been met. So when two databases exist as, for example, large
      arrays of (probably virtual) RDF files, then even though
      their initial schemas may not have been the same, a
      retrospective documentation of their equivalence would allow
      a search engine to satisfy queries by searching across both
      databases.
    </p>
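    <p>
      A caricature of such an index (vocabularies invented): every
      term is mapped to the triples that mention it, and a recorded
      equivalence between two schemas lets a query phrased in one
      vocabulary also reach data recorded in the other.
    </p>
    <pre>
from collections import defaultdict

graph = {
    ("ex:doc1", "libraryA:author", "ex:tbl"),
    ("ex:doc2", "libraryB:writer", "ex:tbl"),
}

# Retrospective documentation that the two properties are equivalent.
equivalent = {"libraryA:author": {"libraryA:author", "libraryB:writer"}}

index = defaultdict(set)
for triple in graph:
    for term in triple:
        index[term].add(triple)

def occurrences(term):
    """Triples using this term or any property documented as equivalent."""
    results = set()
    for t in equivalent.get(term, {term}):
        results |= index[t]
    return results

print(occurrences("libraryA:author"))   # finds both documents
</pre>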
    <h3>
      <a name="Engines" id="Engines">Engines of the Future</a>
    </h3>
    <p>
      While search engines which index HTML pages find many answers
      to searches and cover a huge part of the Web, they return
      many inappropriate answers. There is no notion of
      "correctness" to such searches. By contrast, logical engines
      have typically been able to restrict their output to answers
      which are provably correct, but have suffered from the
      inability to rummage through the mass of intertwined data to
      construct valid answers. The combinatorial explosion of
      possibilities to be traced has been quite intractable.
      However, the scale upon which search engines have been
      successful may force us to reexamine our assumptions here. If
      an engine of the future combines a reasoning engine with a
      search engine, it may be able to get the best of both worlds,
      and actually be able to construct proofs in a certain number
      of cases of very real impact. It will be able to reach out to
      indexes which contain very complete lists of all occurrences
      of a given term, and then use logic to weed out all but those
      which can be of use in solving the given problem. So while
      nothing will make the combinatorial explosion go away, many
      real-life problems can be solved using just a few (say two)
      steps of inference out on the wild web, the rest of the
      reasoning being in a realm in which proofs are given, or in
      which there are constraints and well-understood computable
      algorithms. I also expect a strong commercial incentive to
      develop engines and algorithms which will efficiently tackle
      specific types of problem, perhaps involving caches of
      intermediate results analogous to the search engines' indexes
      of today.
    </p>
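    <p>
      The division of labour might look like the following sketch
      (names invented): a wide but shallow index lookup supplies
      candidate facts, and a small, bounded amount of reasoning
      weeds out all but those which actually answer the question.
    </p>
    <pre>
graph = {
    ("ex:flight1", "ex:departs", "ex:BOS"),
    ("ex:flight1", "ex:arrives", "ex:LHR"),
    ("ex:flight2", "ex:departs", "ex:BOS"),
    ("ex:flight2", "ex:arrives", "ex:CDG"),
}

# Phase 1 (search): the index hands back everything mentioning a term.
def lookup(term):
    return {t for t in graph if term in t}

# Phase 2 (reasoning, a single bounded step): keep only subjects for
# which both required facts hold among the retrieved triples.
candidates = {s for (s, p, o) in lookup("ex:BOS") if p == "ex:departs"}
answers = {s for s in candidates
           if (s, "ex:arrives", "ex:LHR") in lookup("ex:LHR")}

print(answers)   # {'ex:flight1'}
</pre>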
    <p>
      Though there will still not be a machine which can guarantee
      to answer arbitrary questions, the power to answer real
      questions which are the stuff of our daily lives and
      especially of commerce may be quite remarkable.
    </p>
    <hr />
    <p>
      <a href="Overview.html">Up to Design Issues</a>;
    </p>
    <h4 id="Footnote:">
      Footnote: Universal or Uniform?
    </h4>
    <p>
      Historically, the original term the author used was
      <em>Universal Document Identifier</em> in the WWW
      documentation. In discussions in the IETF, there was a view
      expressed by several people that <em>Universal</em> was too
      strong, in that it could or should not be a goal to make an
      identifier which could be applied to all things. The author
      disagreed and disagrees with this position. However, in the
      interest of expediency at the time he bowed to peer pressure
      and allowed <em>Uniform</em> to be substituted for
      <em>Universal</em> in <a href=
      "http://www.ietf.org/rfc/rfc2396.txt">RFC 2396</a>. He has
      since decided that that did more harm than good, and he now
      uses <em>Universal</em> to indicate the importance to the Web
      architecture of the single universal information space.
    </p>
    <hr />
    <p>
      Tim Berners-Lee<br />
      Created: September 1998.
    </p>
    <p>
      <br />
      $Id: Architecture.html,v 1.66 2009/08/27 21:38:06 timbl Exp $
    </p>
  </body>
</html>