<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
    "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
    <title>
      The Evolution of a specification -- Commentary on Web
      architecture
    </title>
    <link rel="stylesheet" href="di.css" type="text/css" />
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii" />
  </head>
  <body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
    <address>
      Tim Berners-Lee
      <p>
        Date: March 1998. Last edited: $Date: 2009/08/27 21:38:07 $
      </p>
      <p>
        Status: . Editing status: incomplete first draft. This
        explains the rationale for XML namespaces and RDF schemas,
        and derives requirement on them from a discussion of the
        process by which we arrive at standards.
      </p>
    </address>
    <p>
      <a href="./">Up to Design Issues</a>
    </p>
    <h3>
      Commentary
    </h3>
    <p>
      <i>(These ideas were mentioned in a <a href=
      "../Talks/1998/0415-Evolvability/slide1-1.htm">keynote on
      "Evolvability"</a> at WWW7 and this text follows closely
      enough for you to give yourself the talk below using those
      slides. More or less. If and when we get a video from WWW7 of
      the talk, maybe we'll be able to serve that up in
      parallel.)</i>
    </p>
    <hr />
    <h1>
      Evolvability
    </h1>
    <h3>
      <a name="Introduction" id="Introduction">Introduction</a>
    </h3>
    <p>
      The World Wide Web Consortium was founded in 1994 on the
      mandate to lead the <b>Evolution</b> of the Web while
      maintaining its <b>Interoperability</b> as a universal space.
      "Interoperability" and "Evolvability" were two goals for all
      W3C technology, and whilst there was a good understanding of
      what the first meant, it was difficult to define the second
      in terms of technology.
    </p>
    <p>
      Since then W3C has had first hand experience of the tension
      between these two goals, and has seen the process by which
      specifications have been advanced, fragmented and later
      reconverged. This has led to a desire for a technological
      solution which will allow specifications to evolve with the
      speed and freedom of many parallel developments, but also
      such that any message, whether "standard" or not, at least
      has a well defined meaning.
    </p>
    <p>
      There have been technologies dubbed "futureproof" for years
      and years, whether they are languages or backplane busses.
      I expect you the reader to share my cynicism when
      encountering any such claim. We must work through
      exactly what we mean: what we expect to be able to do which
      we could not do before, and how that will make evolution more
      possible and less painful.
    </p>
    <h2>
      <a name="Free" id="Free">Free extension</a>
    </h2>
    <p>
      A rule explicit or implicit in all the email-like Internet
      protocols has always been that if you found a mail header (or
      something) which you did not understand, you should ignore
      it. This obviously allows people to add all sorts of records
      to things in a very free way, and so we can call it the rule
      of free extension. It has the advantage of rapid prototyping
      and incremental deployment, and the disadvantage of
      ambiguity, confusion, and an inability to add a mandatory
      feature to an existing protocol. I adopted the rule for HTML
      when initially designing it - and used it myself all the
      time, adding elements one by one. This is one way in which
      HTML was unlike a conventional SGML application, but it
      allowed the dramatic development of HTML.
    </p>
    <h3>
      <a name="cycle" id="cycle">The HTML cycle</a>
    </h3>
    <p>
      The development of HTML between 1994 and 1998 took place in a
      cycle, fuelled by the tension between the competitive urge of
      companies to outdo each other and the common need for
      standards for moving forward. The cycle starts simply
      because the HTML standard is open and usable by anyone: this
      means that any engineer, in any company or waiting for a bus
      can think of new ways to extend HTML, and try them out.
    </p>
    <p>
      The next phase is that some of these many ideas are tried out
      in prototypes or products, using the free extension rule
      that any unrecognized extensions will be ignored by
      everything which does not understand them. The result is a
      dramatic growth in features. Some of these become product
      differentiators, during which time their originators are loath
      to discuss the technology with the competition. Some features
      die in the market and disappear from the products. Those
      successful features have a fairly short lifetime as product
      differentiators, as they are soon emulated in some equivalent
      (though different) feature in competing products.
    </p>
    <p>
      After this phase of the cycle, there are three or four ways
      of doing the same thing, and engineers in each company are
      forced to spend their time writing three or four different
      versions of the same thing, and coping with the software
      architectural problems which arise from the mix of different
      models. This wastes program size, and confuses users. In the
      case, for example, of the TABLE tag, a browser meeting one in
      a document had no idea which table extension it was, so the
      situation could become ambiguous. If the interpretation of
      the table was important for the safe interpretation of the
      document, the server would never know whether it had been
      done, as an unaware client would blithely ignore it in any
      case. This internal software mess resulting from having to
      implement multiple models also threatens future development.
      It turns the stable consistent base for future development
      into something fragmented and inconsistent: it is difficult
      to design new features in such an environment.
    </p>
    <p>
      Now the marketing pressure which prevented discussions is
      off, and there is a strong call for the engineers to
      get around the W3C table, and iron out a common way of doing
      things. As this happens, a system is designed which puts
      together the best aspects of each system, plus a few
      weeks' experience, so everyone is in the end happier with the
      result. The companies all go away making public promises to
      implement it, even though the engineering staff will be under
      pressure to add the next feature and start the next cycle. The
      result is published as a common specification open to anyone
      to implement. And so the cycle starts again.
    </p>
    <p>
      This is not the way all W3C activities have worked, but it
      was the particular case with HTML, and it illustrates some of
      the advantages and disadvantages of the free extension
      rule.
    </p>
    <h3>
      <a name="Breaking" id="Breaking">Breaking the cycle</a>
    </h3>
    <p>
      The HTML cycle as a method of arriving at consensus on a
      document has its drawbacks. By 1998, there were reasons to
      change the cycle. The work in the W3C, which had started off
      in 1994 with several years' backlog of work, had more or less
      caught up, and was beginning to lead, rather than trail,
      developments. The work was seen less as fire fighting and
      more as consolidation. By this time the spec was growing to a
      size where the principle of modularity was seriously
      flouted. Any new developments clearly had to be separate
      modules. Already style information had been moved out into
      the Cascading Style Sheets language, the programming
      interface work was a separate Document Object Model activity,
      and guidelines for accessibility were tackled by a separate
      group.
    </p>
    <p>
      In the future it was clear that we needed somehow to set up a
      modular system which would allow one to add to HTML new
      standard modules. At the same time, it was clear that with
      XML available as a manageable version of SGML as a base for
      anyone to define their own tag sets, there was likely to be a
      deluge of application-specific and industry-specific XML
      based languages. The idea of all this happening under the free
      extension rule was frightening. Most applications would
      simply add new tags to HTML. If we continued the process of
      retrospectively roping extensions into a new bigger standard,
      the document would grow without limit and become totally
      unmanageable. The rule of free extension was no longer
      appropriate.
    </p>
    <h1>
      <a name="wdi" id="wdi">Well defined interfaces</a>
    </h1>
    <p>
      Now let us compare this situation with the way development
      occurs in the world of distributed computing, specifically
      remote procedure call (RPC) and distributed object oriented
      systems. In these systems, the distributed system (equivalent
      to the server plus the client for the web) is viewed as a
      single software system which happens to be spread over
      several physical machines. [nelson - courier, etc]
    </p>
    <p>
      The network protocols are defined automatically as a function
      of the software interfaces which happen to end up being
      between modules on different machines. Each interface, local
      or remote, has a well documented structure, and the list of
      functions (procedures, methods or whatever) and parameters
      are defined in machine-processable form. As the system is
      built, the compiler checks that the interfaces required by
      one module are exactly provided by another module. The
      interface, in each version of its development, typically has
      an identifying (typically very long) unique number.
    </p>
    <p>
      The interface defines the parameters of a remote call, and
      therefore defines exactly what can occur in a message from
      one module to another. There is no free extension. If the
      interface is changed, and a new module made, any module on
      the other side of the interface will have to be changed too,
      or you can't build the system.
    </p>
    <p>
      The great advantage of this is that when the system has been
      built, you expect it to work. There is no wondering whether a
      table is being displayed - if you have called the table
      module, you know exactly what the module is supposed to do,
      and there is no way the system could be without that module.
      Given the chaos of the HTML development world, you can
      imagine that many people were hankering after the well
      defined interfaces of the distributed computing technology.
    </p>
    <p>
      With well-defined interfaces, either everything works, or
      nothing. This was in fact at least formally the case with
      SGML documents. Each had a document type definition (DTD)
      referred to at the top, which defined in principle exactly
      what could and could not be in the document. PICS labels were
      similar in that they are self-describing: they actually have
      a URI at the top which points to a machine-readable
      description of what can and can't be in that PICS label.
      When you see one of these documents, as when you get an RPC
      message with an interface number on it, you can check whether
      you understand the interface or not. Another interesting thing
      you can do, if you don't have a way of processing it, is to
      look it up in some index and dynamically download the code to
      process it.
    </p>
    <p>
      The existence of the Web makes all this much smoother:
      instead of inventing arbitrary names for interfaces, you can
      use a real URI which can be dereferenced and return the
      master definition of the interface in real time. The Web can
      become a decentralised registry of interfaces (languages)
      and code modules.
    </p>
    <p>
      The need was clearly for the best of both worlds. One must be
      able to freely extend a language, but do so with an extension
      language which is itself well defined. If, for example,
      documents which were HTML 2.0 plus Netscape's version of
      tables version 2.01 were identified as such, much of the
      problem of ambiguity would have been resolved, but the rest
      of the world left free to make their own table extensions.
      This was the goal of the namespaces work in XML.
    </p>
    <h3>
      <a name="ModularityInHTML" id="ModularityInHTML">Modularity
      in HTML</a>
    </h3>
    <p>
      To be able to use the namespaces work in the extension of
      HTML, HTML has to transition from being an SGML application
      (with certain constraints) to being an XML based language.
      This will not only give it a certain ease of parsing, but
      allow it to build on the modularity introduced by namespaces.
    </p>
    <p>
      In fact, already in April of 1998 there was a W3C
      Recommendation for "MathML", defined as as XML langauge and
      obviously aimed at being usable in the context of an HTML
      document, but for which there was no defined way to write a
      combined HTML+MathML document. MathML was already waiting for
      XML namespaces.
    </p>
    <p>
      XML namespaces will allow an author (or authoring tool,
      hopefully) to declare exactly what set of tags he or she is
      using in a document. Later, schemas should allow a browser to
      decide what to do as a fall back when finding vocabulary
      which it does not understand.
    </p>
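    <p>
      For concreteness, here is a minimal sketch of the kind of
      declaration involved, using the xmlns attribute syntax on
      which the namespaces work converged; the namespace URIs and
      the element content are illustrative only.
    </p>
    <pre>
&lt;html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:m="http://www.w3.org/1998/Math/MathML"&gt;
  &lt;p&gt;The area of the circle is
    &lt;m:math&gt;&lt;m:msup&gt;&lt;m:mi&gt;r&lt;/m:mi&gt;&lt;m:mn&gt;2&lt;/m:mn&gt;&lt;/m:msup&gt;&lt;/m:math&gt;
    times pi.&lt;/p&gt;
&lt;/html&gt;
</pre>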
    <p>
      It is expected that new extensions to HTML be introduced as
      namespaces, possibly languages in their own right. The intent
      is that the new languages, where appropriate, will be able to
      use the existing work on style sheets, such as CSS, and the
      existing DOM work which defines a programming interface.
    </p>
    <h2>
      <a name="Mixing" id="Mixing">Language mixing</a>
    </h2>
    <p>
      Language mixing is an important facility, for HTML and for the
      evolution of all other Web and application technology. It
      must allow, in a mixed language document, for both languages
      to be well defined. A mixed language document is quite
      analogous to a program which makes calls to two runtime
      libraries, so it is not rocket science. It is not like an RPC
      message, which in most systems is very strongly typed from a
      single rigid definition. (An RPC message can be represented
      as a structured document but not, in general, vice-versa)
    </p>
    <p>
      Language mixing is a reality. Real HTML pages are often HTML
      with Javascript, or HTML plus CSS, or both. They just aren't
      declared as such. In real life, many documents are made from
      multiple vocabularies, only some of which one understands. I
      don't understand half the information in the tax form - but I
      know enough to know what applies to me. The invoice is a good
      example. Many different coloured copies of the same document
      used to serve as a packing list, restocking sheet, invoice,
      and delivery note. Different parts of a company would
      understand different bits: the financial division would check
      amounts and signatures, the store would understand the part
      numbers, and sales and marketing would define the
      relationship between the part numbers and prices.
    </p>
    <p>
      No longer can the Web tolerate the laxness with which HTML and
      HTTP have been extended. However, it cannot constrain itself
      to a system as rigid as a classical distributed object
      oriented system.
    </p>
    <p>
      The <a href="Extensible.html">note on namespaces</a> defines
      some requirements of a language framework which allows new
      schemata to be developed quite independently, and mixed within
      one document. This note elaborates on the sorts of things
      which have to be possible when the evolution occurs.
    </p>
    <h3>
      <a name="power" id="power">The Power of schema languages</a>
    </h3>
    <p>
      You may notice that nowhere in the architecture do XML or RDF
      specify what language the schema should be written in. This
      is because much of the future power of the system will lie in
      the power of the schema and related documents, so it
      is important to leave that open as a path for the future. In
      the short term, you can think of a schema being written in
      HTML and English. Indeed, this is enough to tie the
      significance of documents written in the schema to the law of
      the land and make the document an effective part of serious
      commercial or other social interaction. You can imagine a
      schema being in a sort of SGML DTD language which tells a
      computer program what constraints there are on the structure
      of documents, but nothing about their meaning. This allows a
      certain crude validity check to be made on a document but
      little else.
    </p>
    <p>
      Now let us imagine further power which we could put into a
      schema language.
    </p>
    <h2>
      <a name="PartialUnderstanding" id=
      "PartialUnderstanding">Partial Understanding</a>
    </h2>
    <p>
      A crucial first milestone for the system is partial
      understanding. Let's use the scenario of an invoice, like the
      <a href="Extensible.html#Scenario">scenario in the
      "Extensible languages" note</a>. An invoice refers to two
      schemata: one is a well-known invoice schema and the other a
      proprietary part number schema. The requirement is that an
      invoice processing program can process the invoice without
      needing to understand the part description.
    </p>
    <p>
      Somehow the program must find out that the invoice is from
      its point of view just as valid as an invoice with the
      details of the part description stripped out.
    </p>
    <h3>
      <a name="Optional" id="Optional">Optional parts</a>
    </h3>
    <p>
      One possibility is to mark the part description as "optional"
      in the text. We could imagine a well-known way of doing this.
      It could be done in the document itself [as usual, using an
      arbitrary syntax:]
    </p>
    <pre>
&lt;item&gt;
&lt;partnumber&gt;8137498237&lt;/partnumber&gt;
&lt;optional&gt;
 &lt;xml:using href="http://aeroco.com/1998/03/sy4" as="a"&gt;
   &lt;a:partdesc&gt;
        ...
   &lt;/a:partdesc&gt;
 &lt;/xml:using&gt;
&lt;/optional&gt;
&lt;/item&gt;
</pre>
    <p>
      There are problems with this. One is that we are relying on
      the invoice schema to define what an invoice is and isn't and
      what it means. It would be nice if the designer of the
      invoice could say whether the item should contain a part
      description or not, or whether it is possible to add things
      into the item description or not. But in general if there is
      something to be said we like to allow it to be said anywhere
      (like metadata). But for the optionalness to be expressed
      elsewhere would save the writer of every invoice the bother
      of having to say so explicitly.
    </p>
    <h3>
      <a name="Partial" id="Partial">Partial Understanding</a>
    </h3>
    <p>
      The other more fundamental problem is that the notion of
      "optional" is subjective. We can be more precise about
      "partial understanding" by saying that the invoice processing
      system needs to convert the document which contains things it
      doesn't understand into a document which it does completely
      understand: a valid invoice. However, another agent may wish
      to convert the same detailed invoice into, say, a delivery
      note: in this case, quite different information would be
      "optional".
    </p>
    <p>
      To be more specific, then, we need to be able to describe a
      transformation from one document to another which preserves
      "valididy" in some sense. A simple form of transformation is
      the removal of sections, but obviously there can be all kinds
      of level of transformation language ranging from the cudest
      to theturing complete. Whatever the language, statement that
      given a document x, that some f(x) can be deduced.
    </p>
    <h3>
      <a name="Least" id="Least">Principle of Least Power</a>
    </h3>
    <p>
      In practice, this suggests that one should leave the actual
      choice of the transformation language as a flexibility point.
      However, as with most choices of computer language, the
      general "principle of least power" applies:
    </p>
    <table border="1" cellpadding="2">
      <tbody>
        <tr>
          <td>
            When expressing something, use the least powerful
            language you can.
          </td>
        </tr>
      </tbody>
    </table>
    <p>
      <i>(@@justify in greater depth in footnote)</i>
    </p>
    <p>
      While being able to express a very complex function may feel
      good, the result will in general be less useful. As Lao-Tse
      puts it, "<a href="Evolution.html#within">Usefulness from
      what is not there</a>". From the point of view of translation
      algorithms, one usefulness is for them to be reversible. In
      the case in which you are trying to prove something (such as
      access to a web site or financial credibility) you need to be
      able to derive a document of a given form. The rules you use
      are the pieces of the web of trust and you are looking for a
      path through the web of trust. Clearly, one approach is to
      enumerate all the things which can be deduced from a given
      document, but it is faster to have an idea of which
      algorithms to apply. Simple ones have input and output
      patterns. A deletion rule is a very simple case:
    </p>
    <p align="center">
      s/\(.*\)foo\(.*\)/\1\2/
    </p>
    <p>
      This is stream editor language for "remove 'foo' from any
      string, leaving what was on either side". If this rule is
      allowed, it means that "foo" is optional. @@@ to be continued
    </p>
    <p>
      Optional features and Partial Understanding
    </p>
    <ul>
      <li>Goal: V1 software partially understands V2 document
      </li>
      <li>Optional features visible as such
      </li>
      <li>Example: "Mandatory" Internet Draft
      </li>
      <li>Example: SMIL (P.Rec. 1998/4/9)
      </li>
      <li>Conversion from unknown language to known language.
      </li>
    </ul>
    <h1>
      <a name="ToII" id="ToII">Test of Independent Invention</a>
    </h1>
    <p>
      The test of independent invention is a thought experiment
      which tests one aspect of the quality of a design. When you
      design something, you make a number of important
      architectural decisions, such as how many wheels a car has,
      and that an arch will be used between the pillars of the
      vault. You make other arbitrary decisions such as the color
      of the car, the side of the road everyone will drive on, and
      whether to open the egg at the big end or the little end.
    </p>
    <p>
      Suppose it just happens that another group is designing the
      same sort of thing, tackling the same problem, somewhere
      else. They are quite unknown to you and you to them, but just
      suppose that being just as smart as you, they make all the
      same important architectural decisions. This you can expect
      if you believe that these decisions make logical sense.
      Imagine that they have the same philosophy: it is largely the
      philosophy which we are testing. However, imagine that they
      make all the arbitrary decisions differently. They complement
      bit 7. They drive on the other side of the road. They
      use red buoys on the starboard side, and use 575 lines per
      screen on their televisions.
    </p>
    <p>
      Now imagine that the two systems both work (locally), and
      being successful, grow and grow. After a while, they meet.
      Suddenly you discover each other. Suddenly, people want to
      work across both systems. They want to connect two road
      systems, two telephone systems, two networks, two webs. What
      happens?
    </p>
    <p>
      I tried originally to make WWW pass the test. Suppose someone
      had (and it was quite likely) invented a World Wide Web
      system somewhere else with the same principles. Suppose they
      called it the Multi Media Mesh <sup>(tm)</sup> and based it
      on Media Resource Identifiers<sup>(tm)</sup>, the MultiMedia
      Transport Protocol<sup>(tm)</sup>, and a Multi Media Markup
      Language<sup>(tm)</sup>. After a few years, the Web and the
      Mesh meet. What is the damage?
    </p>
    <ul>
      <li>A huge battle, involving the abandonment of projects,
      conversion or loss of data?
      </li>
      <li>Division of the world by a border commission into two
      separate communities?
      </li>
      <li>Smooth integration with only incremental effort?
      </li>
    </ul>
    <p>
      (see also <a href="../People/Berners-Lee/UU.html">WWW and
      Unitarian Universalism</a>)
    </p>
    <p>
      Obviously we are looking for the latter option. Fortunately,
      we could immediately extend URIs to include "mmtp://" and
      extend MRIs to include "http://". We could make gateways, and
      on the better browsers immediately configure them to go
      through a gateway when finding a URI of the new type. The URI
      space is universal: it covers all addresses of all accessible
      objects. But it does not have to be the only universal space.
      Universal, but not unique. We could add MMML as a MIME type.
      And so on. However, if we required all Web servers to
      synchronise through one and only one master lock server in
      Waltdorf, we would have found the Mesh required
      synchronisation through a master server in Melbourne. It would
      have failed.
    </p>
    <p>
      No system completely passes the ToII - it is always some
      trouble to convert.
    </p>
    <h3>
      <a name="real" id="real">Not just a thought experiment</a>
    </h3>
    <p>
      As the Web becomes the basis for many many applications to be
      built on top of it, the phenomenon of independent invention
      will recur again and again. We have to build technology so as
      to make it easy for systems to pass the test, and so survive
      real life in an evolving world.
    </p>
    <p>
      If systems cannot pass the TOII, then we can only achieve
      worldwide interoperability when one original design has
      beaten the others. This can happen if we all sit
      down together as a worldwide committee and do a "top
      down" design of the whole thing before we start. This works
      for a new idea but not for the automation of something which,
      like pharmacy or trade, has been going on for centuries and
      is just being represented in the Semantic Web. For example,
      the library community has had endless trouble trying to agree
      on a single library card format (MARC record) worldwide.
    </p>
    <p>
      Another way it can happen is if one system is dropped
      completely, leading to a complete loss of the effort put
      into it. When in the late 1980s Europe eventually abandoned
      its suite of ISO protocols for networking because they just
      could not interwork with the Internet, a huge amount of work
      was lost. Many problems, solved in Europe but not in the US
      (including network addresses of more than 32 bits) had to be
      solved again on the Internet at great cost. Sweden actually
      changed from driving on the left to driving on the right. All
      over the world, people have changed word processor formats
      again and again but only at the cost of losing access to huge
      amounts of legacy information. The test of independent
      invention is not just a thought experiment, it is happening
      all the time.
    </p>
    <h1>
      <a name="requirements" id="requirements">From philosophy to
      requirement</a>
    </h1>
    <p>
      So now let us get more specific about what we really need in
      the underlying technology of the Semantic Web to allow
      systems in the future to pass the test of independent
      invention.
    </p>
    <h3>
      <a name="smarter" id="smarter">We will be smarter</a>
    </h3>
    <p>
      Our first assumption is that we will be smarter in the
      future. This means that we will produce better systems. We
      will want to move on from version 1 to version 2, from
      version n to version n+1.
    </p>
    <p>
      What happens now? A group of people use version 4 of a word
      processor and share some documents. One touches a document
      using a new version 5 of the same program. One of the other
      people tries to load it using version 4 of the software. The
      version 4 program reads the file, and finds it is a version 5
      file. It declares that there is no way it can read the
      file, as it was produced in the future, and there is no way it
      can predict the future to know how to read a version 5 file.
      A flag day occurs: everyone in the group has to upgrade
      immediately - and often they had never even planned to.
    </p>
    <p>
      So the first requirement is for a version 4 program to be
      able to read a version 5 file. Of course there will be some
      features in version 5 that the version 4 program will not be
      able to understand. But most of the time, we actually find
      that what we want to achieve can be done by partial
      understanding - understanding those parts of the document
      which correspond to functions which exist in version 4. But
      even though we know partial understanding would be
      acceptable, with most systems we don't know how to do even
      that.
    </p>
    <h3>
      <a name="others" id="others">We are not the smartest</a>
    </h3>
    <p>
      The philosophical assumption that we may not be smarter than
      everyone else (a huge step for some!) leads us to realise
      that others will have great ideas too, and will independently
      invent the same things. It forces us to consider the test of
      independent invention.
    </p>
    <p>
      The requirement for the system to pass the ToII is for one
      program which we write to be able to read somehow (partially
      if not totally) data written by the program written by the
      other folks. This simple operation is the key to
      decentralised evolution of our technology, and to the whole
      future of the Web.
    </p>
    <p>
      So we have deduced two requirements for the system from our
      simple philosophical assumptions:
    </p>
    <ul>
      <li>We will be smarter in the future
        <ul>
          <li>Technology: Moving Version 1 to Version 2
          </li>
        </ul>
      </li>
      <li>We are not smarter than everyone else
        <ul>
          <li>Decentralized evolution
          </li>
          <li>Technology: Moving between parallel Version A and
          Version B
          </li>
        </ul>
      </li>
    </ul>
    <h3>
      <a name="sofar" id="sofar">The story so far</a>
    </h3>
    <p>
      Where are we with the requirements for evolvability so far? We
      are looking for a technology which has free but well defined
      extension. We want to do it by allowing documents to use
      mixed vocabularies. We have already found out (from PICS work
      for example) that we need to be able to know whether
      extension vocabulary is mandatory or can be ignored. We want
      to use the Web for any registry, rather than any central
      point. The technology has to allow an application to be
      able to convert the output of a future version of itself, or
      the output of an equivalent program written independently,
      into something it can process, just by looking up schema
      information.
    </p>
    <h2>
      <a name="data" id="data">Evolution of data</a>
    </h2>
    <p>
      Now let us look at the world of data on the Web, the <a href=
      "Semantic.html">Semantic Web</a>, which we expect to
      become a new force in the next few years. By "data" as
      opposed to "documents", I am talking about information on the
      Web in a form specifically to aid automated processing rather
      than human browsing. "Data" is characterised by information
      with a well defined structure, where the atomic parts have
      well defined types, such as numbers and choices from finite
      sets. "Data", as in a relational database, normally has well
      defined meaning which has rarely been written down. When
      someone creates a new database, they have to give the data
      type of each column, but don't have to explain what the field
      name actually means in any way. So there is a well defined
      semantics but not one which can be accessed. In fact, the
      only time you tell the machine anything about the semantics
      is when you define which two columns of different tables are
      equivalent in some way, so that they can be used for example
      as the basis for joining the two databases. (That the meaning
      of data is only defined relative to the meaning of other data
      is of course quite normal - we don't expect machines to have
      any built in understanding of what "zip code" might mean
      apart from where you can read it and write it and what you
      can compare it with). Notice that what happens with real
      databases is that they are defined by users one day, and they
      evolve. They are rarely the result of a committee sitting
      down and deciding on a set of concepts to use across a
      company or an industry, and then designing the data schema.
      The schema is created on the fly by the user.
    </p>
    <p>
      We can distinguish two ways in which the word "schema" has
      been used:
    </p>
    <table border="1" cellpadding="2">
      <tbody>
        <tr>
          <td>
            Syntactic Schema: A document, real or imagined, which
            constrains the structure and/or type of data. <i>(pl.:
            Schemata)</i>.
          </td>
        </tr>
      </tbody>
    </table>
    <table border="1" cellpadding="2">
      <tbody>
        <tr>
          <td>
            Semantic schema: A document, real or imagined, which
            defines the inferences from one schema to another,
            thus defining the semantics of one syntactic schema in
            terms of another.
          </td>
        </tr>
      </tbody>
    </table>
    <p>
      I will use the word in the first sense only. In fact, a
      syntactic schema defines a class of document, and often is
      accompanied by
      human documentation which provides some rough semantics.
    </p>
    <p>
      There is a huge amount ("legacy" would unfairly suggest
      obsolescence) of data in relational databases. A certain
      amount of it is being exported onto the web as virtual
      hypertext. There are many applications which allow one to
      make hypertext views of different aspects of a database, so
      that each server request is met by performing a database
      query,
      and then formatting the result as a report in HTML, with
      appropriate style and decoration.
    </p>
    <h2>
      Data about data: Metadata
    </h2>
    <p>
      Information about information is interesting in two ways.
      Firstly, it is interesting because the Web society
      desperately needs it to be able to manage social aspects of
      information such as endorsement (PICS labels, etc), ownership
      and access rights to information, privacy policies (P3P,
      etc), structuring and cataloguing information and a hundred
      other uses which I will not try to enumerate. This first
      aspect is discussed elsewhere. (See <a href=
      "http://www.w3.org/DesignIssues/Metadata.html">Metadata
      architecture</a> about general treatment of metadata and
      labels, and the <a href="../TandS/Overview.html">Technology
      and Society domain</a> for an overview of many of the social
      drivers and related projects and technology)
    </p>
    <p>
      The second interest in metadata is that it is data. If we are
      looking for a language for putting data onto the Web, in a
      machine understandable way, then metadata happens to be a
      first application area. Also, because metadata is fundamental
      to most data on the web, it is the focus of W3C effort, while
      many other forms of data are regarded as applications rather
      than core Web architecture, and so are not.
    </p>
    <h3>
      Publishing data on the web
    </h3>
    <p>
      Suppose for example that you run a server which provides
      online stock prices. Your application which today provides
      fancy web pages with a company's data in text and graphs (as
      GIFs) could tomorrow produce the same page as XML data, in
      tabular form, for machine access. The same page could even be
      produced at the same URL in two formats using content
      negotiation, or you could have a typed link between the
      machine-understandable and person-understandable versions.
    </p>
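    <p>
      As a sketch, such a machine-readable page might look like the
      fragment below; the vocabulary and namespace URI are entirely
      hypothetical.
    </p>
    <pre>
&lt;prices xmlns="http://stockserver.example.com/1998/quotes"&gt;
  &lt;quote&gt;
    &lt;symbol&gt;XYZ&lt;/symbol&gt;
    &lt;price currency="USD"&gt;42.25&lt;/price&gt;
    &lt;asof&gt;1998-03-31T16:00:00Z&lt;/asof&gt;
  &lt;/quote&gt;
&lt;/prices&gt;
</pre>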
    <p>
      The XML version contains at the top (or somewhere) a pointer
      to a schema document. This pointer makes the document
      "self-describing". It is this pointer which is the key to any
      machine "understanding" of the page. By making the schema a
      first class object, in other words by giving its URL and
      nothing else, we are leaving the door open to many
      possibilities. Now it is time to look at the various sorts of
      schema document which it could point to.
    </p>
    <h2>
      Levels of schema language
    </h2>
    <p>
      Computer languages can be classified into various types, with
      various capabilities, and the sort we choose for the schema
      document, and the information we allow in the schema, fundamentally
      affects not just what the semantic web can be but, more
      importantly, how it can grow.
    </p>
    <p>
      The schema document can, broadly, be one of the following:
    </p>
    <ol>
      <li>Notional only: imaginary, non-existent but named.
      </li>
      <li>Human readable
      </li>
      <li>Machine-understandable and defining structure
      </li>
      <li>Machine-understandable and also saying which parts are optional
      </li>
      <li>A Turing-complete recipe for conversion into other
      languages
      </li>
      <li>A logical model of the document
      </li>
    </ol>
    <p>
      We'll go over the pros and cons of each, because none of
      these should be overlooked, but some are often way better
      than others.
    </p>
    <h3>
      Schema option 1: URI only
    </h3>
    <ul>
      <li>No supporting documentation
      </li>
      <li>Allows compatibility yes/no test
      </li>
    </ul>
    <p>
      This may sound like a silly trivial example, but like many
      trivial examples, it is not silly. If you just name your
      schema somewhere in URI space, then you have identified it.
      This doesn't offer a lot of help to anyone to find any
      documentation online, but one fundamental function is
      possible. Anyone can check compatibility: They can compare
      the schema against a list of schemata they do understand, and
      return yes or no.
    </p>
    <p>
      In fact, they can also use an index to look up information
      about the schema, including information about suitable
      software to download to add understanding of the document. In
      fact this level is the level which many RPC systems use: the
      interface is given a unique but otherwise random number which
      cannot be dereferenced directly.
    </p>
    <p>
      So this is the level of machine-understanding typical of
      distributed computing systems and should not be
      underestimated. There are lots of parts of URI space you can
      use for this: you might own some http: space (but never
      actually serve the document at that point), but if you
      don't, you can always generate a URI in a mid: or cid: space
      or if desperate in one of the hash spaces.
    </p>
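    <p>
      A sketch of this trivial but useful case: the document below
      names its schema with a mid: URI which is never served
      anywhere, yet a processor can still test the name against its
      list of known schemata. The URI and vocabulary are made up.
    </p>
    <pre>
&lt;report xmlns="mid:schema-19980331@example.com"&gt;
  ...
&lt;/report&gt;
</pre>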
    <h3>
      Schema option 2: Human readable
    </h3>
    <p>
      The next step up from just using the Schema identifier as a
      document type identifier is to make that URI one which will
      dereference to a human-readable document. If you're a
      computer, big deal. But as well as allowing a strict
      compatibility test (test for equality of the schema URI),
      this also allows human beings to get involved if there is any
      argument as to what a document means. This can be significant!
      For example, the schema could point to a complete technical
      spec which is crammed with legalese about what the document
      does and does not imply and commit to. At the end of the day,
      all machine-understandable descriptions of documents are all
      very well, but until the day that they bootstrap themselves
      into legality, they must all in the end be defined in terms
      of human-readable legalese to have social effect. Human
      legalese is the schema language of our society. This is level
      2.
    </p>
    <h3>
      Schema option 3: Define structure
    </h3>
    <p>
      Now we move into the meat of the schema system when we start
      to discuss schema documents which are machine readable. Now
      we are starting to enable some machine understanding and
      automatic processing of document types which have not been
      pre-programmed by people. &Ccedil;a commence.
    </p>
    <p>
      The next level we consider is that when your browser (agent,
      whatever) dereferences the namespace URI, it finds a schema
      which defines the structure of the document. This is a bit
      like an SGML Document Type Definition (DTD). It allows you
      to do everything which the levels 1 and 2 allowed, if it has
      sufficient comments in it to allow human arguments to be
      settled.
    </p>
    <p>
      In addition, a system which has a way of defining structure
      allows everyone to have one and only one parser to handle all
      manner of documents. Any document coming across the threshold
      can be parsed into a tree.
    </p>
    <p>
      More than that, it allows a document to be validated against
      allowed structures. If a memo contains two subject fields, it
      is not valid. This is one of the principal uses of DTDs in
      SGML.
    </p>
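    <p>
      A sketch of such a structural constraint, in the style of an
      SGML/XML DTD (the memo vocabulary here is made up): a document
      with two subject elements simply fails to validate.
    </p>
    <pre>
&lt;!ELEMENT memo    (to+, from, subject, body)&gt;
&lt;!ELEMENT to      (#PCDATA)&gt;
&lt;!ELEMENT from    (#PCDATA)&gt;
&lt;!ELEMENT subject (#PCDATA)&gt;
&lt;!ELEMENT body    (#PCDATA)&gt;
</pre>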
    <p>
      In some cases, there may be another spin-off. You can imagine
      that if the schema document lists the allowed structure of
      the document, and the types (and maybe names) of each
      element, then this would allow an agent to construct on the
      fly a graphic user interface for editing such a document.
      This was the intent with PICS rating systems: at least, a
      parent coming across a new rating system would be given a
      human-readable description of the various parameters and
      would be able to select among them.
    </p>
    <h3>
      Schema option 4: Structure + Optional flags
    </h3>
    <p>
      The "optional" flag is a term I use here for a common crucial
      step which can make the difference between chaos and smooth
      evolution. All you need to do is to mark in the schema of a
      new version of the language which elements of the langauge
      can be ignored if you don't understand them. This simple step
      allows a processor which handled the old language, giventhe
      schema of the new langauge, to filter it so as to produce a
      document it can legitimately understand.
    </p>
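    <p>
      A sketch of what such a schema fragment might say, in a purely
      hypothetical syntax: the one new element of version 2 is
      flagged as ignorable, so a version 1 processor knows it may
      strip it out and still have a document it fully understands.
    </p>
    <pre>
&lt;schema xmlns="http://example.org/1998/hypothetical-schema"&gt;
  &lt;element name="amount"   ignorable="no"/&gt;
  &lt;element name="partdesc" ignorable="yes"/&gt;
&lt;/schema&gt;
</pre>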
    <p>
      Now we have a technology which has all the benefits to date,
      plus it can handle that elusive <strong>version 2 to version
      1 conversion</strong> problem!
    </p>
    <h3>
      Schema option 5: Turing-complete language
    </h3>
    <p>
      Always in languages there is the balance between the
      declarative limited language, whose formulae can be easily
      manipulated, and the powerful programming language whose
      programs cannot be analyzed in general, but which have to be
      left to run to see what they do. Each end of the spectrum has
      its benefits. In describing one language in terms of another,
      one way is to provide a black box program, say in Java or
      Javascript, which will convert from one to the other.
    </p>
    <p>
      Filters written in Turing-complete languages generally have
      to be trusted, as you can't see what rules they are based on
      by looking at them. But they can do weird and wonderful
      things. (They can also crash and loop forever of course!).
    </p>
    <p>
      A good language for conversion from one XML-based language to
      another is XSL. It started off as a template-like system for
      building one document from another (and can be very simple)
      but is in fact Turing-complete.
    </p>
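    <p>
      A sketch of such a filter as an XSL transformation, using the
      syntax the XSLT work settled on; the "language A" namespace is
      made up. It copies the document through unchanged, except that
      it silently drops the one element the target language lacks.
    </p>
    <pre>
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:a="http://example.com/1998/languageA"&gt;

  &lt;!-- Identity rule: copy every element, attribute and text node --&gt;
  &lt;xsl:template match="@*|node()"&gt;
    &lt;xsl:copy&gt;&lt;xsl:apply-templates select="@*|node()"/&gt;&lt;/xsl:copy&gt;
  &lt;/xsl:template&gt;

  &lt;!-- Conversion rule: drop the element the target language lacks --&gt;
  &lt;xsl:template match="a:partdesc"/&gt;

&lt;/xsl:stylesheet&gt;
</pre>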
    <p>
      When you do publish a program to convert language A to
      language B, then anyone who trusts it has that capability. A
      disadvantage is that they never know how it works. You can't
      deduce things about the individual components of the
      languages. You can't therefore infer much indirectly about
      relationships to other languages. The only way such a filter
      can be used is to get whatever you have into language A and
      then put it through the filter. This might be useful. But it
      isn't as fascinating as the option of blowing language A
      open.
    </p>
    <h3>
      Schema option 6: Expose logic of document
    </h3>
    <p>
      What is fundamentally more exciting is to write down as
      explicitly as possible what the new language means. Sorry, let
      me take that back, in case you think that I am talking about
      some absolute meaning of meaning. If you know me, I am not.
      All I mean is that we write in a machine-processable logical
      way the equivalences and conversions which are possible in
      and out of language A from other languages. And other
      languages.
    </p>
    <p>
      A specific case of course, is when we document the
      relationship between version 2 and version 1. The schema
      document for version 2 could explain that all the terms are
      synonyms, except for some new terms which can be converted to
      nothing (ie are optional) and some which affect the meaning
      of the document completely and so if you don't understand
      them you are stuck.
    </p>
    <p>
      In a more general case, take a language like iCalendar in RDF
      (were it in RDF), which is for describing events as would be
      in a personal organizer. A schema for the language might
      declare equivalences between a calendar's concept of group
      MEMBERship and an access control system's concept of group
      membership; it might declare the equivalence of the concept
      of LOCATION to be the text description of a Geographical
      Information Systems standard's location, and it may declare
      an INDIVIDUAL to be a superset of the HR department's concept
      of employee. These bits of information are the stuff of the
      semantic web, as they allow inference to stretch across the
      globe and conclude things which we knew as a whole but no one
      person knew. This is what RDF and the Semantic Web logic
      built on top of it is all about.
    </p>
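    <p>
      A sketch of what one such schema-level statement might look
      like in RDF; the equivalence vocabulary here is invented for
      illustration, and only the RDF syntax namespace is real.
    </p>
    <pre>
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:eq="http://example.org/1998/equivalence#"&gt;
  &lt;rdf:Description rdf:about="http://example.com/calendar#LOCATION"&gt;
    &lt;eq:equivalentTo rdf:resource="http://example.org/gis#placeName"/&gt;
  &lt;/rdf:Description&gt;
&lt;/rdf:RDF&gt;
</pre>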
    <hr />
    <p>
      So, what will semantic web engines be able to do? They will
      not all have the same inference abilities or algorithms. They
      will share a core concept of an RDF statement - an assertion
      that a given <em>resource</em> has a <em>property</em> with a
      given <em>value</em>. They will use this as a common way of
      exchanging data even when their inference rules are not
      compatible. An agent will be able to read a document in a new
      version of a language, by looking up on the web the
      relationship with the old version that it can natively read.
      It will be able to combine many documents into a single graph
      of knowledge, and draw deductions from the combination. And
      even though it might not be able to find a proof of a given
      hypothesis, when faced with an elaborated proof it will be
      able to check its veracity.
    </p>
    <p>
      At this stage (1998) we need relational database experts in
      the XML and RDF groups, [2000 -- include ontology and
      conceptual graph and knowledge representation experts].
    </p>
    <h2 id="maps">
      Evolvability in the real world
    </h2>
    <p>
      Examples abound of language mixing and evolution in the real
      world which make the need for these capabilities clear. There
      is a great and unused overlap in the concepts used by, for
      example, personal information managers, email systems, and so
      on. These capabilities would allow information to flow
      between these applications.
    </p>
    <p>
      You just have to look at the history of a standard such as
      MARC record for library information to see that the tension
      between agreeing on a standard (difficult and only possible
      for a common subset) and allowing variations (quick but not
      interoperable) would be eased by allowing language mixing. A
      card could be written out in a mixture of standard and local
      terms.
    </p>
    <p>
      The real world is full of times when conventions have been
      developed separately and the relationships have been deduced
      afterward: hence the market for third party converters of
      disk formats, scheduler files, and so on.
    </p>
    <h1>
      <a name="Engines" id="Engines">Engines of the future</a>
    </h1>
    <p>
      I have left open the discussion as to what inference power
      and algorithms will be useful on the semantic web precisely
      because it will always be an open question. When a language
      is sufficiently expressive to be able to express the state of
      the real world and real problems then there will be no one
      query engine which will be able to solve real problems.
    </p>
    <p>
      We can, however, guess at how systems might evolve. No one at
      the beginning of the Web foresaw the search engines which
      could index almost all the web, so these guesses may be very
      inaccurate!
    </p>
    <p>
      We note that logical systems provide provably good answers,
      but don't scale to large problems. We see that search
      engines, remarkably, do scale - but at the moment produce
      very unreliable answers. Now, on a semantic web we can
      imagine a combination of the two. For example, a search
      engine could retrieve all the documents which reference the
      terms used in the query, and then a logical system could act
      on that closed finite world of information to determine a
      reliable solution if one exists.
    </p>
    <p>
      In fact I think we will see a huge market for interesting new
      algorithms, each taking advantage of particular
      characteristics of particular parts of the Web. New
      algorithms around electronic commerce may have directly
      beneficial business models, so there will be incentive for
      their development.
    </p>
    <p>
      Imagine some questions we might want to ask an engine of the
      future:
    </p>
    <ul>
      <li>Can Joe access the party photos?
      </li>
      <li>Who are all the people who can?
      </li>
      <li>Is there a green car for sale for around $15000 in
      Queensland?
      </li>
      <li>Did someone driving a blue car send us an invoice for
      over $10000?
      </li>
      <li>What was the average temperature in 1997 in Brisbane?
      </li>
      <li>Please fill in my tax form!
      </li>
    </ul>
    <p>
      All these involve bridging barriers between domains of
      knowledge, but they do not involve very complex logic --
      except for the tax form, that is. And who knows, perhaps in
      the future the tax code will have to be presented as a
      formula on the semantic web, just as it is expected now that
      one make such a public human-readable document available on
      the Web.
    </p>
    <h2 id="Conclusion">
      Conclusion
    </h2>
    <p>
      There are some requirements on the Semantic Web design which
      must be upheld if the technology is to be able to evolve
      smoothly. They involve both the introduction of new versions
      of one language, and also the merging of two originally
      independent languages. XML Namespaces and RDF are designed to
      meet these requirements, but a lot more thought and careful
      design will be needed before the system is complete.
    </p>
    <hr />
    <blockquote>
      <h4>
        <a name="within" id="within">The Space Within</a>
      </h4>
      <p>
        Thirty spokes share the wheel's hub;<br />
        It is the center hole that makes it useful.<br />
        Shape clay into a vessel;<br />
        It is the space within that makes it useful.<br />
        Cut doors and windows for a room;<br />
        It is the holes that make it useful.<br />
        Therefore profit comes from what is there;<br />
        Usefulness from what is not there.
      </p>
    </blockquote>
    <address>
      Lao-Tse
    </address>
    <p>
      (UU-STLT#600)
    </p>
    <p>
      ...
    </p>
    <p>
      Imagine that the EU and the US independently define RDF
      schemata for an invoice. Invoices are traded around Europe
      with a schema pointer at the top which identifies the schema.
      Indeed, the schema may be found on the web.
    </p>
    <hr />
    <hr />
    <p>
      <a href="Metadata.html">Next: &nbsp;Metadata architecture</a>
    </p>
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
    <p>
      <a href="../People/Berners-Lee">Tim BL</a>
    </p>
  </body>
</html>