<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>
      Semantic Web roadmap
    </title>
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii" />
    <link href="di.css" rel="stylesheet" type="text/css" />
  </head>
  <body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
    <address>
      Tim Berners-Lee
      <p>
        <small>Date: September 1998. Last modified: $Date:
        1998/10/14 20:17:13 $</small>
      </p>
      <p>
        Status: An attempt to give a high-level plan of the
        architecture of the Semantic WWW. Editing status: Draft.
        Comments welcome
      </p>
    </address>
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
    <hr />
    <h1>
      Semantic Web Road map
    </h1>
    <p>
      <i>A road map for the future, an architectural plan untested
      by anything except thought experiments.</i>
    </p>
    <p>
      This was written as part of a requested road map for future
      Web design, from a level of 20,000ft. It was spun off from an
      Architectural overview for an area which required more
      elaboration than that overview could afford.
    </p>
    <p>
      Necessarily, from 20,000 feet, large things seem to get a
      small mention. It is architecture, then, in the sense of how
      things hopefully will fit together. So we should recognize
      that while it might be slowly changing, this is also a living
      document.
    </p>
    <p>
      This document is a plan for achieving a set of connected
      applications for data on the Web in such a way as to form a
      consistent logical web of data (semantic web).
    </p>
    <h3>
      <a name="Introduction" id="Introduction">Introduction</a>
    </h3>
    <p>
      The Web was designed as an information space, with the goal
      that it should be useful not only for human-human
      communication, but also that machines would be able to
      participate and help. One of the major obstacles to this has
      been the fact that most information on the Web is designed
      for human consumption, and even if it was derived from a
      database with well defined meanings (in at least some terms)
      for its columns, that the structure of the data is not
      evident to a robot browsing the web. Leaving aside the
      artificial intelligence problem of training machines to
      behave like people, the Semantic Web approach instead
      develops languages for expressing information in a machine
      processable form.
    </p>
    <p>
      This document gives a road map - a sequence for the
      incremental introduction of technology to take us, step by
      step, from the Web of today to a Web in which machine
      reasoning will be ubiquitous and devastatingly powerful.
    </p>
    <p>
      It follows the note on the <a href=
      "Architecture.html">architecture</a> of the Web, which
      defines existing design decisions and principles for what has
      been accomplished to date.
    </p>
    <h2>
      <a name="SemanticWeb" id="SemanticWeb">Machine-Understandable
      information: Semantic Web</a>
    </h2>
    <p>
      The Semantic Web is a web of data, in some ways like a global
      database. The rationale for creating such an infrastructure
      is given elsewhere [Web future talks &amp;c]; here I only
      outline the architecture as I see it.
    </p>
    <h2>
      <a name="Assertion" id="Assertion">The basic assertion
      model</a>
    </h2>
    <p>
      When looking at a possible formulation of a universal Web of
      semantic assertions, the principle of minimalist design
      requires that it be based on a common model of great
      generality. Only when the common model is general can any
      prospective application be mapped onto the model. The general
      model is the Resource Description Framework.
    </p>
    <p>
      <i>See the</i> <a href="../TR/WD-rdf-syntax/"><i>RDF Model
      and Syntax Specification</i></a>
    </p>
    <p>
      Being general, this is very simple. Being simple, there is
      nothing much you can do with the model itself without
      layering many things on top. The basic model contains just
      the concept of an <b>assertion</b>, and the concept of
      <b>quotation</b> - making assertions about assertions. This
      is introduced because (a) it will be needed later anyway and
      (b) most of the initial RDF applications are for data about
      data ("metadata") in which assertions about assertions are
      basic, even before logic. (Because for the target
      applications of RDF, assertions are part of a description of
      some resource, that resource is often an implicit parameter
      and the assertion is known as a <b>property</b> of a
      resource).
    </p>
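    <p>
      To make this concrete: below is a minimal sketch in
      present-day Python using the rdflib library (an anachronism
      relative to this document, used purely for illustration;
      all the URIs and the "assertedBy" property are invented).
      It records one assertion as a triple, and then a quotation
      - an assertion about that assertion - using RDF's
      reification vocabulary.
    </p>
    <pre>
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

EX = Namespace("http://example.org/terms/")   # hypothetical vocabulary
g = Graph()
doc = URIRef("http://example.org/docs/report")

# An assertion: one (subject, predicate, object) triple.
g.add((doc, EX.title, Literal("Quarterly Report")))

# A quotation: describe the assertion itself as a resource,
# then assert something about it (here, who asserted it).
stmt = URIRef("http://example.org/statements/1")
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, doc))
g.add((stmt, RDF.predicate, EX.title))
g.add((stmt, RDF.object, Literal("Quarterly Report")))
g.add((stmt, EX.assertedBy, URIRef("http://example.org/people/alice")))
</pre>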
    <p>
      As far as mathematics goes, the language at this point has no
      negation or implication, and is therefore very limited. Given
      a set of facts, it is easy to say whether a proof exists or
      not for any given question, because neither the facts nor the
      questions can have enough power to make the problem
      intractable.
    </p>
    <p>
      Applications at this level are very numerous. Most of the
      <a href="Architecture.html#Metadata">applications for the
      representation of metadata</a> can be handled by RDF at this
      level. Examples include card index information (the Dublin
      Core), privacy information (P3P), associations of style
      sheets with documents, intellectual property rights
      labeling, and PICS labels. We are talking about the
      representation of
      data here, which is typically simple: not languages for
      expressing queries or inference rules.
    </p>
    <p>
      RDF documents at this level do not have great power, and
      sometimes it is less than evident why one should bother to
      map an application into RDF. The answer is that we expect
      this
      data, while limited and simple within an application, to be
      combined, later, with data from other applications into a
      Web. Applications which run over the whole web must be able
      to use a common framework for combining information from all
      these applications. For example, access control logic may use
      a combination of privacy and group membership and data type
      information to actually allow or deny access. Queries may
      later allow powerful logical expressions referring to data
      from domains in which, individually, the data representation
      language is not very expressive. The purpose of this document
      is partly to show the plan by which this might happen.
    </p>
    <h2>
      <a name="Schema" id="Schema">The Schema layer</a>
    </h2>
    <p>
      The basic model of RDF allows us to do a lot on the
      blackboard, but does not give us many tools. It gives us a
      model of assertions and quotations on which we can map the
      data in any new format.
    </p>
    <p>
      We next need a schema layer to declare the existence of a
      new property. We need at the same time to say a little more
      about it. We want to be able to constrain the way it is
      used. Typically we want to constrain the types of object it
      can
      apply to. These meta-assertions make it possible to do
      rudimentary checks on a document. Much as in SGML the "DTD"
      allows one to check whether elements have been used in
      appropriate positions, so in RDF a schema will allow us to
      check that, for example, a driver's license has the name of a
      person, and not a model of car, as its "name".
    </p>
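    <p>
      As a sketch of what such a check might look like (Python
      with rdflib again; the license vocabulary here is invented,
      and this is only one plausible rendering of the draft
      schema ideas): declare the property, constrain the type of
      thing it applies to, and flag uses which violate the
      constraint.
    </p>
    <pre>
from rdflib import Graph, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/license/")   # hypothetical schema

schema = Graph()
schema.add((EX.name, RDF.type, RDF.Property))
schema.add((EX.name, RDFS.domain, EX.Person))   # a "name" names a person
schema.add((EX.name, RDFS.range, RDFS.Literal))

def check(data: Graph) -> list:
    """Rudimentary check: flag any ex:name whose subject is not
    declared to be a Person (a model of car, say)."""
    return [s for s, _, _ in data.triples((None, EX.name, None))
            if (s, RDF.type, EX.Person) not in data]
</pre>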
    <p>
      It is not clear to me exactly what primitives have to be
      introduced, and whether much useful language can be defined
      at this level without also defining the next level. There
      is currently an <a href=
      "http://www.w3.org/RDF/Group/Schema/">RDF Schema working
      group</a> in this area. The schema language typically makes
      simple assertions about permitted combinations. If the SGML
      DTD is used as a model, the schema can be in a language of
      very limited power. The constraints expressed in the schema
      language are easily expanded into expressions of a more
      powerful logical layer (the next layer), but one chooses at
      this point, in order to limit the power, not to do that.
      For example: one can say in a schema that a property
      foo is unique. Expanded, that is that for any x, if y is the
      foo of x, and z is the foo of x, then y equals z. This uses
      logical expressions which are not available at this level,
      but that is OK so long as the schema language is, for the
      moment, going to be handled by specialized schema engines
      only, not by a general reasoning engine.
    </p>
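    <p>
      A specialized schema engine for just that uniqueness
      constraint is a few lines of code, with no general reasoner
      in sight. A sketch in plain Python ("foo" being the
      placeholder property from the text):
    </p>
    <pre>
from collections import defaultdict

def check_unique(triples, prop="foo"):
    """If y is the foo of x and z is the foo of x, then y must
    equal z: report every x with two distinct foo values."""
    values = defaultdict(set)
    for x, p, y in triples:
        if p == prop:
            values[x].add(y)
    return {x: vs for x, vs in values.items() if len(vs) > 1}

check_unique([("a", "foo", 1), ("a", "foo", 2)])   # {'a': {1, 2}}
</pre>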
    <p>
      When we do this sort of thing with a language - and I think
      it will be very common - we must be careful that the language
      is still well defined logically. Later on, we may want to
      make inferences which can only be made by understanding the
      semantics of the schema language in logical terms, and
      combining it with other logical information.
    </p>
    <h2>
      <a name="Conversion" id="Conversion">Conversion language</a>
    </h2>
    <p>
      A requirement of the namespaces work for <a href=
      "Evolution.html">evolvability</a> is that one must, with
      knowledge of common RDF at some level, be able to follow
      rules for converting a document in one RDF schema into
      another one (which presumably one has an innate understanding
      of how to process).
    </p>
    <p>
      By the principle of least power, this language can in fact be
      made to have implication (inference rules) without having
      negation. (This might seem a fine point to make, when in
      fact one can easily write a rule which infers from a
      statement A another statement B which actually happens to
      be false, even though the language has no way of actually
      stating "False". However, formally the language still does
      not have the power needed to write a paradox, which
      comforts some people. In the following, though, as the
      language gets
      more expressive, we rely not on an inherent ability to make
      paradoxical statements, but on applications specifically
      limiting the expressive power of particular documents.
      Schemas provide a convenient place to describe those
      restrictions.)
    </p>
    <p>
      <img src="diagrams/zipcode.png" alt=
      "Links between the friends, places and employees tables"
      align="left" />A simple
      example of the application of this layer is when two
      databases, constructed independently and then put on the web,
      are linked by semantic links which allow queries on one to
      be converted into queries on another. Here, someone noticed
      that "where" in the <em>friends</em> table and "zip" in a
      <em>places</em> table mean the same thing. Someone else
      documented that "zip" in the <em>places</em> table meant
      the same thing as "zip" in the <em>employees</em> table,
      and so on as shown by the arrows. Given this information, a
      search for any employee called Fred with zip 02139 can be
      widened from <em>employees</em> to include
      <em>friends</em>. All that is needed is some RDF
      "equivalent" property.
    </p>
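    <p>
      A sketch of how such an "equivalent" property might be used
      (plain Python; the table and column names follow the
      figure, and the equivalence data itself is hypothetical):
      the query engine chases the documented links to a canonical
      column, so a search over <em>employees</em> widens to
      <em>friends</em>.
    </p>
    <pre>
# Documented equivalences: friends.where = places.zip = employees.zip
EQUIV = {
    ("friends", "where"): ("places", "zip"),
    ("places", "zip"): ("employees", "zip"),
}

def canonical(col):
    """Chase equivalence links to a canonical (table, column)."""
    while col in EQUIV:
        col = EQUIV[col]
    return col

def widened_search(tables, value, target=("employees", "zip")):
    """Yield rows, from any table, whose equivalent column matches."""
    for tname, rows in tables.items():
        for row in rows:
            for cname, v in row.items():
                if canonical((tname, cname)) == target and v == value:
                    yield tname, row

tables = {"friends": [{"name": "Fred", "where": "02139"}],
          "employees": [{"name": "Ann", "zip": "02139"}]}
list(widened_search(tables, "02139"))   # finds both Fred and Ann
</pre>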
    <h2>
      <a name="Logical" id="Logical">The logical layer</a>
    </h2>
    <p>
      The next layer, then, is the logical layer. We need ways of
      writing logic into documents to allow such things as, for
      example, rules for the deduction of one type of document
      from a document of another type; the checking of a document
      against a set of rules of self-consistency; and the
      resolution of a query by conversion from terms unknown into
      terms known. Given that we have quotation in the language
      already, the next layer is predicate logic (not, and, etc.)
      and the next layer quantification (for all x, y(x)).
    </p>
    <p>
      The applications of RDF at this level are basically limited
      only by the imagination. Many things which may have seemed
      to have needed a new language become suddenly simply a
      question of writing down the right RDF. Once you have a
      language which has the
      great power of predicate calculus with quotation, then when
      defining a new language for a specific application, two
      things are required:
    </p>
    <ul>
      <li>One must settle on the (limited) power of the reasoning
      engine which the receiver must have, and define a subset of
      full RDF which will be expected to be understood;
      </li>
      <li>One will probably want to define some abbreviated
      functions to efficiently transmit expressions within the set
      of documents within the constrained language.
      </li>
    </ul>
    <p>
      <i>See also, if unconvinced:</i>
    </p>
    <ul>
      <li>
        <a href="RDFnot.html"><i>What the Semantic Web is
        not</i></a> - answering some FAQs
      </li>
    </ul>
    <p>
      The metro map below shows a key loop in the semantic web. The
      Web part, on the left, shows how a URI is, using HTTP, turned
      into a representation of a document as a string of bits with
      some MIME type. It is then parsed into XML and then into RDF,
      to produce an RDF graph or, at the logic level, a logical
      formula. The right-hand side, the Semantic part, shows how
      the RDF graph contains a reference to the URI. It is the
      trust from the key, combined with the meaning of the
      statements contained in the document, which may cause a
      Semantic Web engine to dereference another URI.
    </p>
    <p>
      <img src="diagrams/loop.gif" alt=
      "The Semantic Web loop: a URI is dereferenced to a
      document, which is parsed into an RDF graph that refers to
      further URIs" />
    </p>
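    <p>
      The Web half of the loop can be sketched as a small crawler
      (Python with rdflib; the starting URI would be supplied by
      the application, and a real agent would apply the trust
      test described above before following anything):
    </p>
    <pre>
from rdflib import Graph, URIRef

def follow_loop(start_uri, hops=2):
    """Dereference a URI into an RDF graph, then dereference the
    URIs which that graph itself mentions."""
    g, seen, frontier = Graph(), set(), {start_uri}
    for _ in range(hops):
        next_frontier = set()
        for uri in frontier - seen:
            seen.add(uri)
            try:
                g.parse(uri)   # HTTP GET, MIME type, parse into RDF
            except Exception:
                continue       # not RDF, unreachable, or untrusted
            for term in set(g.subjects()) | set(g.objects()):
                if isinstance(term, URIRef):
                    next_frontier.add(str(term))
        frontier = next_frontier
    return g
</pre>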
    <h3>
      <a name="Validation" id="Validation">Proof Validation - a
      language for proof</a>
    </h3>
    <p>
      The RDF model does not say anything about the form of
      reasoning engine, and it is obviously an open question, as
      there is no definitively perfect algorithm for answering
      questions - or, basically, finding proofs. At this stage in
      the development of the Semantic Web, though, we do not
      tackle that problem. In most applications, construction of
      a proof is done according to some fairly constrained rules,
      and all that the other party has to do is validate the
      proof. This is comparatively trivial.
    </p>
    <p>
      For example, when someone is granted access to a web site,
      they can be given a document which explains to the web server
      why they should have access. The proof will be a chain [well,
      DAG] of assertions and reasoning rules with pointers to all
      the supporting material.
    </p>
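    <p>
      A sketch of the shape such a proof document might take, and
      of why checking it is easy (plain Python; the facts and the
      single rule are invented for the example). Each step cites
      a rule and the earlier steps it builds on, so the server
      merely replays the steps:
    </p>
    <pre>
def member_access(premises):
    """If x is a member of g, and g may access r, x may access r."""
    (x, _, grp), (grp2, _, res) = premises
    return (x, "mayAccess", res) if grp == grp2 else None

RULES = {"member-access": member_access}

PROOF = [  # a DAG of steps; "given" leaves point at signed documents
    {"id": 1, "fact": ("alice", "memberOf", "staff"),
     "rule": "given", "from": []},
    {"id": 2, "fact": ("staff", "mayAccess", "/private/"),
     "rule": "given", "from": []},
    {"id": 3, "fact": ("alice", "mayAccess", "/private/"),
     "rule": "member-access", "from": [1, 2]},
]

def validate(proof):
    """Validation just replays each step; no search is involved."""
    facts = {}
    for step in proof:
        if step["rule"] != "given":  # for "given", check the signature
            premises = [facts[i] for i in step["from"]]
            if RULES[step["rule"]](premises) != step["fact"]:
                return False
        facts[step["id"]] = step["fact"]
    return True
</pre>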
    <p>
      The same will be true of transactions involving privacy, and
      most of electronic commerce. The documents sent across the
      net will be written in a complete language. However, they
      will be constrained so that, if queried, the results will
      be computable, and in most cases they will be proofs. The
      HTTP "GET" will contain a proof that the client has a right
      to the response. The response will contain a proof that it
      is indeed what was asked for.
    </p>
    <h3>
      <a name="Inference" id="Inference">Evolution rules
      Language</a>
    </h3>
    <p>
      RDF at the logical level already has the power to express
      inference rules. For example, you should be able to say such
      things as "If the zipcode of the organization of x is y then
      the work-zipcode of x is y". As noted above, just scattering
      the Web with such remarks will in the end be very
      interesting, but in the short term won't produce repeatable
      results unless we restrict the expressiveness of documents to
      solve particular application problems.
    </p>
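    <p>
      Restricted to that one application problem, the zipcode
      rule becomes mechanical to apply. A sketch with rdflib (the
      "hr" vocabulary is hypothetical):
    </p>
    <pre>
from rdflib import Graph, Namespace

HR = Namespace("http://example.org/hr/")   # hypothetical schema

def apply_zip_rule(g: Graph) -> int:
    """If the zipcode of the organization of x is y, then the
    work-zipcode of x is y."""
    added = 0
    for x, org in g.subject_objects(HR.organization):
        for y in g.objects(org, HR.zipcode):
            if (x, HR.workZipcode, y) not in g:
                g.add((x, HR.workZipcode, y))
                added += 1
    return added   # number of new assertions inferred
</pre>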
    <p>
      Two fundamental functions we require RDF engines to be able
      to perform are:
    </p>
    <ol>
      <li>for a version <i>n</i> implementation to be able to read
      enough RDF schema to be able to deduce how to read a version
      <i>n+1</i> document;
      </li>
      <li>for a type A application developed quite independently of
      a type B application which has the same or similar function
      to be able to read and process enough schema information to
      be able to process data from the type B application.
      </li>
    </ol>
    <p>
      (See <a href="Evolution.html">evolvability article</a>)
    </p>
    <p>
      The RDF logic level is sufficient to be usable as a language
      for making inference rules. Note it does not address the
      heuristics of any particular reasoning engine, which is an
      open field made all the more open and fruitful by the
      Semantic Web. In other words, RDF will allow you to write
      rules but won't tell anyone at this stage in which order to
      apply them.
    </p>
    <p>
      Where, for example, a Library of Congress schema talks of
      an "author" and a British Library schema talks of a
      "creator", a small bit of RDF would be able to say that for
      any person x and any resource y, if x is the (LoC) author
      of y, then x is the (BL) creator of y. This is the sort of
      rule which solves the evolvability problems. Where would a
      processor find it? In the case of a program which finds a
      version 2 document and wants to find the rules to convert
      it into a version 1 document, the version 2 schema would
      naturally contain or point to the rules. In the case of
      retrospective documentation of the relationship between two
      independently invented schemas, pointers to the rules could
      of course be added to either schema, but if that is not
      (socially) practical, then we have another example of the
      annotation problem. This can be solved by third-party
      indexes which can be searched for connections between two
      schemata. In practice, of course, search engines provide
      this function very effectively - you would just have to ask
      a search engine for all references to one schema and check
      the results for rules which link the two.
    </p>
    <h3>
      <a name="Query" id="Query">Query languages</a>
    </h3>
    <p>
      The next requirement is a query language. A query can be
      thought of as an
      assertion about the result to be returned. Fundamentally, RDF
      at the logical level is sufficient to represent this in any
      case. However, in practice a query engine has specific
      algorithms and indexes available with which to work, and can
      therefore answer specific sorts of query.
    </p>
    <p>
      It may of course be useful in practice to develop a
      vocabulary which helps in either of two ways:
    </p>
    <ol>
      <li>It allows common powerful query types to be expressed
      succinctly with fewer pages of mathematics, or
      </li>
      <li>It allows certain constrained queries to be expressed,
      which are interesting because they have certain computability
      properties.
      </li>
    </ol>
    <p>
      SQL is an example of a language which does both.
    </p>
    <p>
      It is clearly important that the query language be defined in
      terms of RDF logic. For example, to query a server for the
      author of a resource, one would ask for an assertion of the
      form "x is the author of p1" for some x. To ask for a
      definitive list of all authors, one would ask for a set of
      authors such that any author was in the set and everyone in
      the set was an author. And so on.
    </p>
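    <p>
      In code, the assertion-template view of a query is a
      pattern match over the graph. A sketch with rdflib, using
      the Dublin Core "creator" property for "author" (the SPARQL
      query language shown is a later development, used here only
      to make the point; the resource URI is invented):
    </p>
    <pre>
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")
g = Graph()
p1 = URIRef("http://example.org/p1")
g.add((p1, DC.creator, Literal("Alice")))

# "x is the author of p1, for some x", as a direct pattern match:
authors = list(g.objects(p1, DC.creator))

# The same assertion template, phrased as a declarative query:
rows = g.query("SELECT ?x WHERE { ?p dc:creator ?x }",
               initNs={"dc": DC},
               initBindings={"p": p1})
</pre>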
    <p>
      In practice, the diversity of algorithms in search engines on
      the web, and of proof-finding algorithms in pre-web logical
      systems, suggests that in a semantic web there will be many
      forms of agent able to provide answers to different forms
      of
      query.
    </p>
    <p>
      One useful step is the specification of specific query
      engines for, for example, searches to a finite level of
      depth in a specified subset of the Web (such as a web
      site). Of course
      there could be several alternatives for different occasions.
    </p>
    <p>
      Another metastep is the specification of a query engine
      description language - basically a general specification of
      the sort of query the engine can answer. This would
      open the door to agents chaining together searches and
      inference across many intermediate engines.
    </p>
    <h2>
      <a name="Signature" id="Signature">Digital Signature</a>
    </h2>
    <p>
      Public key cryptography is a remarkable technology which
      completely changes what is possible. While one can add a
      digital signature block as decoration on an existing
      document, attempts to add the logic of trust as icing on the
      cake of a reasoning system have to date been restricted to
      systems limited in their generality. For reasoning to be able
      to take trust into account, the common logical model requires
      extension to include the keys with which assertions have been
      signed.
    </p>
    <p>
      Like all logic, the basis of this may not seem appealing at
      first, until one has seen what can be built on top. This
      basis is the introduction of keys as first-class objects
      (where the URI can be the literal value of a public key),
      and the introduction of general reasoning about assertions
      attributable to keys.
    </p>
    <p>
      In an implementation, this means that the reasoning engine
      will have to be tied to the signature verification system.
      Documents will be parsed not just into trees of assertions,
      but into trees of assertions about who has signed what
      assertions. Proof validation will, for inference rules, check
      the logic, but for assertions that a document has been
      signed, check the signature.
    </p>
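    <p>
      A sketch of that tie-in, using the present-day Python
      cryptography package for the signature primitive (how the
      assertion is serialized and attributed to a key is invented
      for the example):
    </p>
    <pre>
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey,
)

key = Ed25519PrivateKey.generate()     # the key: a first-class object
public_key = key.public_key()

assertion = b"http://example.org/alice memberOf http://example.org/staff"
signature = key.sign(assertion)

def admit(assertion, signature, public_key):
    """Admit "key says assertion" into the reasoner only if the
    signature verifies; otherwise ignore the assertion."""
    try:
        public_key.verify(signature, assertion)
        return True
    except InvalidSignature:
        return False
</pre>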
    <p>
      The result will be a system which can express and reason
      about relationships across the whole range of public-key
      based security and trust systems.
    </p>
    <p>
      Digital signature becomes interesting when RDF is developed
      to the level that a proof language exists. However, it can be
      developed in parallel with RDF for the most part.
    </p>
    <p>
      In the W3C, input to the digital signature work comes from
      many directions, including experience with DSig 1.0 signed
      "PICS" labels, and various submissions for digitally signed
      documents.
    </p>
    <h3>
      <a name="Indexes" id="Indexes">Indexes of terms</a>
    </h3>
    <p>
      Given a worldwide semantic web of assertions, the search
      engine technology currently (1998) applied to HTML pages will
      presumably translate directly into indexes not of words, but
      of RDF objects. This itself will allow much more efficient
      searching of the Web as though it were one giant database,
      rather than one giant book.
    </p>
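    <p>
      The difference from a word index is what gets indexed: RDF
      terms, that is URIs with defined meaning, rather than
      character strings. A minimal sketch in plain Python (the
      document and term URIs are invented):
    </p>
    <pre>
from collections import defaultdict

term_index = defaultdict(set)   # RDF term -> documents using it

def index_document(doc_uri, triples):
    for triple in triples:
        for term in triple:
            term_index[term].add(doc_uri)

index_document("http://example.org/data1",
               [("http://example.org/fred",
                 "http://example.org/hr/workZipcode",
                 "02139")])
# Every document asserting anything with hr:workZipcode, however
# the zipcode is spelled out on screen:
term_index["http://example.org/hr/workZipcode"]
</pre>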
    <p>
      The Version A to Version B translation requirement has now
      been met, and so when two databases exist as, for example,
      large arrays of (probably virtual) RDF files, then even
      though the initial schemas may not have been the same, a
      retrospective documentation of their equivalence would allow
      a search engine to satisfy queries by searching across both
      databases.
    </p>
    <h2>
      <a name="Engines" id="Engines">Engines of the Future</a>
    </h2>
    <p>
      While search engines which index HTML pages find many
      answers to searches and cover a huge part of the Web, they
      return many inappropriate answers. There is no notion of
      "correctness" to such searches. By contrast, logical
      engines have typically been able to restrict their output
      to provably correct answers, but have suffered from the
      inability to rummage through the mass of intertwined data to
      construct valid answers. The combinatorial explosion of
      possibilities to be traced has been quite intractable.
    </p>
    <p>
      However, the scale upon which search engines have been
      successful may force us to reexamine our assumptions here. If
      an engine of the future combines a reasoning engine with a
      search engine, it may be able to get the best of both worlds,
      and actually be able to construct proofs in a certain number
      of cases of very real impact. It will be able to reach out to
      indexes which contain very complete lists of all occurrences
      of a given term, and then use logic to weed out all but those
      which can be of use in solving the given problem.
    </p>
    <p>
      So while nothing will make the combinatorial explosion go
      away, many real-life problems can be solved using just a
      few (say two) steps of inference out on the wild web, the
      rest of the reasoning being in a realm in which proofs are
      given, or there are constraints and well-understood
      computable algorithms. I also expect a strong commercial
      incentive to develop engines and algorithms which will
      efficiently tackle specific types of problem. This may
      involve making caches of intermediate results, much
      analogous to the search engines' indexes of today.
    </p>
    <p>
      Though there will still not be a machine which can guarantee
      to answer arbitrary questions, the power to answer real
      questions which are the stuff of our daily lives and
      especially of commerce may be quite remarkable.
    </p>
    <hr />
    <p>
      In this series:
    </p>
    <ul>
      <li>
        <a href="RDFnot.html"><i>What the Semantic Web is
        not</i></a> - answering some FAQs of the unconvinced.
      </li>
      <li>
        <a href="Evolution.html">Evolvability</a>: properties of
        the language for evolution of the technology
      </li>
      <li>
        <a href="Architecture.html">Web Architecture from 50,000
        feet</a>
      </li>
    </ul>
    <h2>
      <a name="Acknowledgements" id=
      "Acknowledgements">Acknowledgements</a>
    </h2>
    <p>
      This plan is based on discussions with the W3C team and
      various W3C member companies. Thanks also to David Karger and
      Daniel Jackson of MIT/LCS.
    </p>
    <hr />
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
  </body>
</html>