<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
    "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
    <title>
      What do HTTP URIs Identify? - Design Issues
    </title>
    <link rel="Stylesheet" href="di.css" type="text/css" />
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii" />
  </head>
  <body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
    <address>
      Tim Berners-Lee<br />
      Date: 2002-07-27, last change: $Date: 2007/01/15 20:05:15
      $<br />
      Status: personal view only. Editing status: first draft. This
      was a result of my being in a minority with this opinion on
      the Technical Architecture Group, and yet finding it the only
      one I could accept. This is related to TAG issue
      HTTPRange-14.
    </address>
    <p>
      <a href="./">Up to Design Issues</a>
    </p>
    <p>
      <strong>Note: (2006). This architectural question has now
      been <a href=
      "http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html">
      decided</a> by the W3C TAG, in a compromise which I think
      works quite well, and is described in a <a href=
      "HTTP-URI2">later short note</a> and a TAG finding.</strong>
    </p>
    <hr />
    <h1>
      What do HTTP URIs Identify?
    </h1>
    <h3>
      Background Note
    </h3>
    <p>
      This question has been addressed only vaguely in the
      specifications. However, the lack of a very concise logical
      definition of such things was not a problem until formal
      systems started to use them. There were no formal
      systems addressing this sort of issue (as far as I know,
      except for Dan Connolly's Larch work [@@]), until the
      <a href="/2001/sw">Semantic Web</a> introduced languages such
      as RDF which have well-defined logical properties and are
      used to describe (among other things) web operations.
    </p>
    <p>
      The efforts of the <a href="/2001/tag">Technical Architecture
      Group</a> to create an architecture document with common
      terms highlighted this problem. (It demonstrates the
      ambiguity of natural language that no significant problem had
      been noticed over the past decade, even though the original
      author of HTTP, and the later co-author of HTTP 1.1 who also
      did his PhD thesis on an analysis of the web, both of whom
      have worked with Web protocols ever since, had
      conflicting ideas of what the various terms actually mean.)
    </p>
    <p>
      This document explains why the author finds it difficult to
      work in the alternative proposed philosophies. If it
      misrepresents those others' arguments, then it fails, for
      which I apologize in advance and will endeavor to correct.
    </p>
    <h2>
      1. Web Concepts as here proposed
    </h2>
    <p>
      The WWW is a space of information objects. The URI was
      originally called a UDI, and originally all URIs identified
      information objects. Now, URI schemes exist which identify
      more or less anything (e.g. UUIDs) or electronic mailboxes
      (mailto:), but if we look purely at HTTP URIs, they define a
      web of information objects. Information objects -- perhaps in
      Cyc terms <a href="">ConceptualWorks</a> -- are normally
      things which
    </p>
    <ul>
      <li>Carry some sort of message, and
      </li>
      <li>Can be represented, to a greater or lesser authenticity,
      in bits
      </li>
    </ul>
    <p>
      I want to make it clear that such things are generic (see
      <a href="/DesignIssues/Generic">Generic Resources</a>) --
      while they are documents, they generally are abstractions
      which may have many different bit representations, as a
      function of, for example:
    </p>
    <ul>
      <li>Time -- the contents can vary with revision
      </li>
      <li>Content-type in which the bits are encoded
      </li>
      <li>Natural language in which a human-readable document is
      written
      </li>
      <li>Machine language in which a machine-processable document
      is written
      </li>
      <li>and a few more
      </li>
    </ul>
    <p>
      but the philosophy is that an HTTP URI may identify something
      with a vagueness as to the dimensions above, but it still
      must be used to refer to a unique conceptual object whose
      various representations have a very large amount in common.
      Formally, it is the publisher which defines what an HTTP
      URI identifies, and so one should look to the publisher for a
      commitment as to the exact nature of the identity along these
      axes.
    </p>
    <p>
      I'm going to refer to this as a <strong>document</strong>,
      because it needs a term and that is the best I have to date,
      but the reader should be sure to realize that this does not
      mean a conventional office document; it can be, for example,
    </p>
    <ul>
      <li>A poem
      </li>
      <li>An order for ball bearings
      </li>
      <li>A painting
      </li>
      <li>A Movie
      </li>
      <li>A review of a movie
      </li>
      <li>A sound clip
      </li>
      <li>A record of the temperature of the furnace
      </li>
      <li>An array of a million integers, all zero
      </li>
    </ul>
    <p>
      and so on, as limited only by our imagination.
    </p>
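    <p>
      As a rough illustration only, one document with several
      representations might be sketched in Python as follows. The
      media types, languages and strings are invented for the
      example, and real content negotiation is more involved than
      this.
    </p>
    <pre>
```python
# Hypothetical illustration: one document, several representations.
# The keys are (content type, natural language); the values stand in
# for the bits that would actually be sent.
REPRESENTATIONS = {
    ("text/html", "en"): "HTML page in English",
    ("text/html", "fr"): "HTML page in French",
    ("text/plain", "en"): "Plain text in English",
}


def select_representation(accepted_types, accepted_languages):
    """Return the first stored variant matching the client's preferences.

    Preferences are tried in the order given; None means that no
    acceptable representation of the document is available.
    """
    for content_type in accepted_types:
        for language in accepted_languages:
            variant = REPRESENTATIONS.get((content_type, language))
            if variant is not None:
                return variant
    return None


if __name__ == "__main__":
    print(select_representation(["text/html"], ["fr", "en"]))
```
    </pre>
    <p>
      Whichever variant is selected, the URI still refers to the
      one document; only the bits differ.
    </p>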
    <p>
      The Web works because, given an HTTP URI, one can, in a large
      number of cases, get a representation of the document. For a
      human-readable document, the person is presented with the
      information by virtue of some gadget which is given the bits
      of a representation. In the case of a hypertext document, a
      reference to another document is encoded such that, upon user
      request, the referenced document can in turn be automatically
      presented. In the case of a machine-readable document,
      identifiers of concepts, being HTTP URIs, will often allow
      definitive reference information about those concepts to be
      pulled in to guide further actions.
    </p>
    <p>
      The web, then, is made of documents as the internet is made
      of cables and routers. The documents can be about anything,
      so when we move to talk about the contents of documents we
      break away from talking about information space and the whole
      universe of human -- and machine -- discourse is open to us.
      Web pages can compare renaissance choral works with jazz
      pop hits, and discuss whether pigs have wings.
      Machine-processable documents can encode information about
      shoes, and ships, and sealing-wax. Until recently, the
      Internet protocol standards out of which the Web is built had
      little to say about such things. They were concerned only
      with the human-readable side, so it was people, reading
      natural language (not internet specs) who formed and
      communicated the concepts at this level. Nowadays, however,
      semantic web languages allow information to be expressed not
      only about URIs, TCP ports and documents, but also about
      arbitrary concepts - the shoes, and ships and sealing wax,
      and whether pigs have wings. Simple semantic web applications
      allow one to order shoes and travel on ships, and determine
      that, given the data, pigs do not have wings.
    </p>
    <p>
      For these purposes it is of course quite essential to
      distinguish between something described by a document and the
      document itself. Now that we -- for the first time -- have
      not only internet protocols which can talk about documents but
      also those which talk about real world things, we must either
      distinguish or be hopelessly fuzzy.
    </p>
    <p>
      And is this bad? Is it an inhibition to have to work our way
      through documents before we can talk about whatever we desire?
      I would argue not, because it is very important not to lose
      track of the reasons for our taking and processing any piece
      of information. The process of publishing and reading is a
      real social process between social entities, not mechanical
      agents. To be socially responsible, to be able to handle
      trust, and so on, we must be aware of these operations. The
      difference between a car and what some web page says about it
      is crucial - not only when you are buying a car.
    </p>
    <p>
      Some have opined that the abstraction of the document is
      nonsense, and all that exists, when a web page describes a
      car, is the car and various representations of it, the HTML,
      PNG and GIF bit streams. This is however very weak in my
      opinion. The various representations have much more in common
      than simply the car. And the relationship to the car can be
      many and varied: home page, picture, catalog entry, invoice,
      remote control panel, weblog, and so on. The document itself
      is an important part of society - to dismiss its existence is
      to prevent us being aware of human aspects of information
      without which we are impoverished. By contrast, the
      difference between different representations of the document
      (GIF or PNG image for example) is very small, and the
      relationship between versions of a document which changes
      through time is a very strong one.
    </p>
    <h2>
      2. Trying out the Alternatives
    </h2>
    <p>
      The folks who disagree with the model do so for a number of
      different reasons. This article, therefore, will have to
      take them one by one; the ones which come to mind are as
      follows:
    </p>
    <ol>
      <li>
        <a href="#L728">Every web page (or many of them) are in
        fact themselves representations of some abstract thing, and
        the URI really identifies that</a> thing, not a document at
        all.
      </li>
      <li>
        <a href="#L876">There are many levels of identification
        (representation as a set of bits, document, car which the
        web page is about) and the URI publisher, as owner of the
        URI, has the right to define it to mean whatever he or she
        likes;</a>
      </li>
      <li>
        <a href="#L883">Actually the URI has to, like in English,
        identify these different things ambiguously. Machines have
        to disambiguate using common sense and logic</a>
      </li>
      <li>
        <a href="#L890">Actually the URI has to, like in English,
        identify these different things ambiguously. Machines have
        to disambiguate using the fact that different properties
        will refer to different levels</a>.
      </li>
      <li>
        <a href="#L897">Actually the URI has to, like in English,
        identify these different things ambiguously. Machines have
        to disambiguate using extra information which will be
        provided in other ways along with the URI</a>
      </li>
      <li>
        <a href="#L909">Actually the URI has to, like in English,
        identify these different things ambiguously. Machines have
        to disambiguate them by context: A catalog card will talk
        about a document. A car catalog will talk about a car</a>.
      </li>
      <li>
        <a href="#L920">They may have been used to identify
        documents up till now, but for RDF and the Semantic Web, we
        should change that and start to use them as the Dublin Core
        and RDF Core groups have for abstract concepts</a>.
      </li>
    </ol>
    <h3 id="L728">
      2.1 Identify abstract things not documents
    </h3>
    <p>
      Let's take the alternatives in order. These alternatives all
      make sense. Each one, however, has problems I can't see any
      way around when we consider them as a basis.
    </p>
    <p>
      The first was,
    </p>
    <blockquote>
      <p>
        Every web page (or many of them) are in fact themselves
        representations of some abstract thing, and the URI really
        identifies that thing, not a document at all.
      </p>
    </blockquote>
    <p>
      Well, that wasn't the model I had when URIs were invented and
      HTTP was written. However, let's see how it flies. If we
      stick with the principle that a URI (or URIref) must
      unambiguously identify the same thing in any context, then we
      come to the conclusion that URIs can not identify the web
      page. If a web page is about a car, then the URI can't be
      used to refer to the web page.
    </p>
    <h4>
      2.1.1 <a name="s2.1.1" id="s2.1.1">Same URI can identify a
      web page and a car</a>
    </h4>
    <p>
      What, a web page can't be a car? At this point a pedantic
      line of reasoning suggests that we should allow web pages and
      cars to conceptually overlap, so that something can be both.
      This is counterintuitive, as a web page is, in common sense,
      not a concrete object whereas a car is. But sure, we could
      construct a mathematics in which we use the terms rather
      specially and something can be at the same time a web page
      and a car.
    </p>
    <p>
      Frankly, this doesn't serve the social purpose of the
      semantic web, to be able to deal with common sense concepts
      and objects. A web page about a car and a car are in most
      people's minds quite distinct (as I argue further below). A
      philosophy in which they are identical does not allow me to
      distinguish between them. It not only conflicts with reality
      as I see it, but also leaves us no way to make statements
      individually about the two things.
    </p>
    <h4>
      <img alt=
      "A car has a different identifier -- and very different properties."
      src="diagrams/http-uri-1.png" />
    </h4>
    <h4>
      2.1.2 <a name="identifies" id="identifies">The URI identifies
      the car, not the web page</a>
    </h4>
    <p>
      So let's fall back on the idea that the URI identifies the
      <em>subject</em> of the web page, but not the web page
      itself. This makes sense. We can build the semantic web on
      top of that easily.
    </p>
    <p>
      The problem with this is that there are a large number of
      systems which already do use URIs to identify the document.
      This is the whole metadata world. Think of a few:
    </p>
    <ul>
      <li>The Dublin Core
      </li>
      <li>RSS
      </li>
      <li>The HTTP headers
      </li>
      <li>The Adobe XML system
      </li>
      <li>Access control systems
      </li>
    </ul>
    <p>
      (I'm sticking with the machine-processable languages as
      examples because human-processable ones like HTML have a
      level of ambiguity traditional in human natural language but
      quite out of place in the WWW infrastructure -- or the
      Semantic Web. You can argue that people say "I work for
      w3.org" or "http://www.amazon.com/shrdlu?asin=314159265359
      is a great book", just as they happily say "<em>Moby Dick</em>
      weighs over three thousand tons", "<em>Moby Dick</em> was
      finished over a century ago" and "I left <em>Moby Dick</em>
      on the beach" without expecting to be misunderstood. So we
      won't use human language as a guide when defining
      unambiguously what a URI identifies. If we
      want to do that on the Semantic Web, we will say "I work for
      <em>the organization whose home page is</em>
      http://www.w3.org".)
    </p>
    <p>
      Some argue that the URI which I associate with someone's home
      page actually identifies that person. They argue that
      conventionally people use the identifier to identify the
      person. However, consider another page put together by
      friends who found a photograph of the same person. A lot of
      content filtering systems would collect that URI and put it
      into their list. Even though the photo had many
      representations which different devices could download using
      content negotiation and/or CC/PP (color or black and white
      and versions of different resolutions) the URI itself would
      be listed as containing nudity. The public are very aware of
      different works on the web, even though they have the same
      topic.
    </p>
    <h4>
      2.1.3 <a name="Indirect" id="Indirect">Indirect
      identification</a>
    </h4>
    <p>
      You can argue that a web page <em>indirectly</em> identifies
      something, of course, and I am quite happy with that. If you
      identify an organization as that which has home page
      http://www.w3.org, then you are not saying that
      http://www.w3.org/ itself is that organization. This scenario
      is very very common, just as we identify people and things by
      their "unambiguous properties": books by ISBN, people by
      email address, and so forth. So long as we don't think that
      the person <em>is</em> an email address, we are fine. Some
      people have thought that in saying "An HTTP URI can't
      identify an organization" I was ruling out this indirect
      identification, but not so: I am very much in favor of it.
      The whole SQL world, after all, only identified things
      indirectly by a key property. This causes no contradiction.
      Perhaps I should say "An HTTP URI can't directly identify an
      organization". But by "identify" I mean "directly identify",
      and "identity" is a fairly direct word and concept, so I will
      stick with it.
    </p>
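    <p>
      A minimal sketch of indirect identification, assuming a toy
      list of triples and made-up property names
      (<code>homePage</code>, <code>type</code>) rather than any
      real vocabulary. The organization gets a local placeholder of
      its own and is found through its home page; the URI itself
      stays the identifier of the document.
    </p>
    <pre>
```python
# Triples are (subject, property, value). "_:org" is a local placeholder
# for the organization; it is deliberately not the URI of the home page.
HOME_PAGE = "http://www.w3.org/"

triples = [
    ("_:org", "homePage", HOME_PAGE),
    ("_:org", "type", "Organization"),
    (HOME_PAGE, "type", "Document"),
]


def things_with_home_page(page_uri):
    """Find whatever is indirectly identified by having this home page."""
    return [s for (s, p, v) in triples if p == "homePage" and v == page_uri]


def types_of(subject):
    """Collect the types asserted directly about a subject."""
    return {v for (s, p, v) in triples if s == subject and p == "type"}


if __name__ == "__main__":
    print(things_with_home_page(HOME_PAGE))
    print(types_of(HOME_PAGE))
```
    </pre>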
    <p>
      Conclusion so far: the idea that a URI identifies the thing
      the document is about doesn't work, because we can only use a
      URI to identify one thing and we have used it, and already do
      use it, to identify documents on the web.
    </p>
    <h4>
      2.1.4 <a name="argument" id="argument">The argument for HTTP
      URIs identifying a Conceptual Work</a>
    </h4>
    <p>
      So what's wrong with the URI being taken to identify whatever
      the owner says?
    </p>
    <p>
      Let's look at what we mean by <em>identifies</em>. When we
      say there is identity, that means that there is some form of
      sameness that we associate with the identifier. Now, for all
      the philosophical argument, we can never test the identity of
      an abstract thing. What we can test is a representation which
      has been returned by the server when given that URI. When we
      use a URI, and get back several possible representations of
      it, then what expectation do we have about those
      representations?
    </p>
    <p>
      Take the test case that I see the web page which has a
      picture of a car, and I see the URI in the URL bar in the
      browser. I email you the URI, "you see, the car is a
      Toyota?". You click on the link. Your browser shows the same
      URI as mine in the "URL bar" but you see a table of the car's
      weight, length, height, color, and registration number. We
      are confused. The web didn't work because you didn't get the
      same information as me. I expected you to get the same
      information, basically. That is how the Web works. That is
      the expectation behind every hypertext link - that the
      follower of the link should get basically the same
      information as the person who made the link. I say,
      "basically" because I would not have cared whether you saw a
      JPEG or a GIF. It probably wouldn't have mattered if you had
      seen a lower resolution or even black-and-white copy of the
      picture. If you are visually impaired, you may have been able
      to manage with a well-written description of the picture. But
      the essential information is the same, not just the
      subject of the page.
    </p>
    <p>
      So now we have put the four corners on the expectation we
      have of a URI -- that all representations have essentially
      the same <em>information content</em>. And what we mean by
      "essentially" allows in fact some wriggle room, and in the
      end it rests on a common understanding between publisher of
      the information and quoter of the URI. The sameness we are
      after is the sameness of information content. <em>That</em>
      is what is identified by the URI. That is why we say that the
      URI identifies that conceptual information content,
      irrespective of its particular representation: the
      <em>conceptual work</em>. Without that common understanding,
      the web does not work.
    </p>
    <p>
      Some people have said, "If we say that URIs identify people,
      nothing breaks". But all the time they, day to day, rely on
      sameness of the information things on the web, and use URIs
      with that implicit assumption. As we formalize how the web
      works, we have to make that assumption explicit.
    </p>
    <h3 id="L876">
      2.2 Author definition
    </h3>
    <p>
      So how can we break free of that line of reasoning? We can
      try throwing away the rule that a URI identifies only one
      thing.
    </p>
    <blockquote>
      <p>
        There are many levels of identification (representation as
        a set of bits, document, car which the web page is about)
        and the URI publisher, as owner of the URI, has the right
        to define it to mean whatever he or she likes.
      </p>
    </blockquote>
    <p>
      Well, this one is tempting from the point of view that the
      owner of an identifier should reign supreme when it comes to
      saying what it identifies. It is quite a logically consistent
      position to take. After all, isn't this the case with
      <code>uuid</code>'s? And for a new scheme, this would be
      interesting. How can we do it though, with HTTP? The problem
      is an engineering one: I can't in practice use a URI until I
      have some definitive information from the publisher as to
      what it identifies.
    </p>
    <h4>
      2.2.1 Default
    </h4>
    <p>
      Why can't a URI default to identifying a web page until you
      know otherwise? Because the web is open and you will never
      know when you might learn some other information which will
      make the default incorrect. (You can't use such "closed
      world" reasoning).
    </p>
    <h4>
      2.2.2 Web operation
    </h4>
    <p>
      Why can't a URI identify a web page until you have done some
      well-defined operation -- such as HTTP HEAD or GET -- and
      checked for information in that? Well, that would certainly
      work logically. Suppose we define a return code or HTTP
      header which means "abstract object requested". It would mean
      that every web application which deals with web pages as web
      pages would actually be working under an ambiguity, and RDF
      processors could be programmed to look for that special
      information. We can't retrofit the millions of web servers
      out there, I assume.
    </p>
    <p>
      I feel that there is a great benefit to fixing this question
      at the spec level. Otherwise, what happens? I read a web
      page, I like it and I am going to annotate it as being a
      great one -- but first I have to find out whether the URI in my
      browser is used, conceptually, by the author of the page to
      represent some abstract idea. Before I recommend the
      <em>Vietnam War</em> page, I have to be careful I am not
      recommending the Vietnam War.
    </p>
    <p>
      There has been no way to do this before RDF, but then
      similarly no real need for it. (What, is this just a problem
      with RDF? No, it will happen with any webized knowledge
      representation system.). We really need to have communication
      in which two people use the same URI to mean the same thing.
      If there
    </p>
    <p>
      We could fix HTTP so that it would return me some extra
      semantic headers explaining the whole thing. And in the case
      that the URI was deemed to be some abstract thing, I would
      not have the option of recommending the web page. Too bad: it
      has no URI.
    </p>
    <p>
      The authors of document
      &lt;http://www.w3.org/2000/10/rdf-tests/rdfcore/Manifest.rdf&gt;
      certainly thought that they could use
      "http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest"
      to identify an abstract thing which is a type of software
      test. Now they have a choice as to what to make the server
      return for them when I ask for it. It returns 404 "doesn't
      match anything we have available". It can't really, because
      HTTP doesn't allow one to return a class, only a document.
      And if it were to return a document, then I wouldn't be able
      to refer to that document without accidentally referring to
      the class of negative parser tests.
    </p>
    <p>
      So, we could change HTTP to make this work. We could make a
      new form of redirect, <em>343 Abstract Object, please see . .
      .</em>, which would tell the client that the thing requested
      was abstract, and would suggest a document to read about it.
      This avenue of argument is still outstanding. We could take
      it. It isn't the status quo, but we could make changes in
      HTTP if the community felt that this was the way to go.
    </p>
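    <p>
      To make the proposal concrete, here is a sketch of how a
      client might interpret responses if such a code existed. The
      number 343 is only the hypothetical code floated above, not a
      status code defined by HTTP, and the function name and the
      labels it returns are assumptions made for illustration.
    </p>
    <pre>
```python
# 343 is the hypothetical "Abstract Object" code from the text above.
# It is NOT a real HTTP status code.
ABSTRACT_OBJECT = 343


def interpret_response(status, location=None):
    """Classify what a requested URI identifies under the proposed scheme.

    Returns a (kind, document_to_read) pair. For the hypothetical 343
    response, the second element is the document suggested by the server.
    """
    if status == ABSTRACT_OBJECT:
        return ("abstract thing", location)
    if status in range(200, 300):
        # A representation came back, so the URI identifies a document.
        return ("document", None)
    # Anything else tells us nothing about what the URI identifies.
    return ("unknown", None)


if __name__ == "__main__":
    print(interpret_response(200))
    print(interpret_response(343, "http://example.org/about-the-tests"))
```
    </pre>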
    <h3 id="L883">
      2.3 Logic disambiguates
    </h3>
    <p>
      Otherwise, we have to try another way of letting the URI mean
      sometimes one thing and sometimes another. Here is another.
    </p>
    <blockquote>
      <p>
        Actually the URI has to, like in English, identify these
        different things ambiguously. Machines have to disambiguate
        using common sense and logic
      </p>
    </blockquote>
    <p>
      This is possible in theory. It is a mess. It fails
      particularly spectacularly when a URI is used ambiguously to
      refer to a web page and the thing that web page is about,
      which happens to be another web page. <em>Anyone can write
      anything about anything</em> is a Web motto, but here it
      falls down. <em>Anyone can write anything about anything
      except those things which might get confused with the
      document they are writing</em>. It breaks the axiom that we
      mean the same thing by a URI - in all contexts. (And RDF has
      a model theory in which necessarily in any interpretation, a
      symbol always denotes one thing).
    </p>
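    <p>
      The constraint can be sketched as a simple consistency check.
      This assumes, purely for illustration, that
      <code>Document</code> and <code>Car</code> have been declared
      disjoint classes; the class names and the example.org URIs
      are invented, and a real RDF reasoner does far more than
      this.
    </p>
    <pre>
```python
# Assumption for the example: nothing can be both a Document and a Car.
DISJOINT = {frozenset({"Document", "Car"})}


def check_single_denotation(assertions):
    """Return the URIs asserted to belong to two disjoint classes.

    `assertions` is an iterable of (uri, class_name) pairs. If a symbol
    denotes exactly one thing, that thing cannot sit in both classes.
    """
    classes = {}
    for uri, cls in assertions:
        classes.setdefault(uri, set()).add(cls)
    return sorted(
        uri for uri, found in classes.items()
        if any(pair.issubset(found) for pair in DISJOINT)
    )


if __name__ == "__main__":
    statements = [
        ("http://example.org/car", "Document"),
        ("http://example.org/car", "Car"),
        ("http://example.org/page", "Document"),
    ]
    print(check_single_denotation(statements))
```
    </pre>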
    <h3 id="L890">
      2.4 Different Properties
    </h3>
    <blockquote>
      <p>
        Actually the URI has to, like in English, identify these
        different things ambiguously. Machines have to disambiguate
        using the fact that different properties will refer to
        different levels.
      </p>
    </blockquote>
    <p>
      One way of getting here is to start by considering that HTTP
      headers can be divided into those which refer to the
      representation (or the document) and those that refer to,
      say, a car or a donkey. We can look at all RDF properties and
      other attributes in other languages and divide them in
      such a way. So, when I say "http://example.com/albert is a
      color photo", I am referring to the representation; when I
      say "http://example.com/albert used to work down the mill" I
      am referring to the person; when I say
      "http://example.com/albert was taken on a rainy day" I am
      referring to the original photograph, which is basically the
      representation of Albert.
    </p>
    <p>
      This one has a problem when a web page refers to a web
      page. It can still be pursued, by having different verbs for
      talking about ownership of the web page and ownership of the
      car. This is a classic example of the 2-level syndrome (see
      also <em>Dictionaries in the Library</em>). The basic fallacy
      is that you can make the system general by introducing a
      second level - a new set of attributes, properties, or
      whatever, which allow you to refer to the metadata of
      something separately from the thing itself. These systems
      either turn out to be just limited 2-level systems (like XML
      and DTDs) or have to be extended to be recursive in some way
      later on such that in fact the two levels become unnecessary.
    </p>
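    <p>
      A sketch of what such a division might look like, with
      invented property names. It assumes each property is assigned
      a single level in advance, and it shows where the scheme runs
      out: a property such as ownership applies equally well at
      more than one level, so no single entry in the table is
      right.
    </p>
    <pre>
```python
# Invented property names, each assigned to the level it talks about.
# None marks a property that cannot be pinned to one level.
PROPERTY_LEVEL = {
    "isColorPhoto": "representation",
    "workedAtMill": "person",
    "takenOnRainyDay": "photograph",
    "owner": None,  # owner of the page, or of the thing it describes?
}


def level_for(prop):
    """Return the level a property refers to.

    Raises ValueError when the property is ambiguous or not in the table.
    """
    level = PROPERTY_LEVEL.get(prop)
    if level is None:
        raise ValueError("cannot tell which level %r refers to" % prop)
    return level


if __name__ == "__main__":
    print(level_for("isColorPhoto"))
    print(level_for("workedAtMill"))
```
    </pre>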
    <h3 id="L897">
      2.5 Extra info with URI
    </h3>
    <blockquote>
      <p>
        Actually the URI has to, like in English, identify these
        different things ambiguously. Machines have to disambiguate
        using extra information which will be provided in other
        ways along with the URI
      </p>
    </blockquote>
    <p>
      This twist now relies on sending extra information with a
      URI. Effectively, the URI scheme has now failed to identify
      anything by itself. Those most familiar with URIs as used by
      HTML have sometimes suggested adding new attributes to the
      anchor tags of HTML documents to disambiguate a reference. I
      guess it would work if HTML anchors were the only uses of
      URIs. By contrast, they are used in thousands of places and
      ways, many of which I am unaware of. The architecture,
      however, is not that
      way: the architecture of the WWW is that a URI is a global
      unambiguous identifier. Not a URI and something else.
    </p>
    <p>
      (The various designs such as WebDAV's PROPFIND which use HTTP
      methods apart from GET to retrieve information suffer from
      this same problem. The information does not have a URI: it is
      not on the web.)
    </p>
    <h3 id="L909">
      2.6 Different meaning in different context
    </h3>
    <blockquote>
      <p>
        Actually the URI has to, like in English, identify these
        different things ambiguously. Machines have to disambiguate
        them by context: A catalog card will talk about a document.
        A car catalog will talk about a car.
      </p>
    </blockquote>
    <p>
      This works in the short term, when the two contexts are
      disjoint groups who do not need to communicate. It is in fact
      the current state: the groups of people who use HTTP URIs to
      talk about documents, and those who have just started to use
      them to talk about abstract concepts haven't collided yet.
      (Well, they have in my code. I need to be able to model the
      metadata about an HTTP URI as that about a document, and it
      being a class at the same time doesn't jibe.)
    </p>
    <p>
      It doesn't work in the long term because it breaks the axiom
      that a URI must identify one thing.
    </p>
    <h3 id="L920">
      2.7 Change it for the Semantic Web
    </h3>
    <blockquote>
      <p>
        They may have been used to identify documents up till now,
        but for RDF and the Semantic Web, we should change that and
        start to use them as the Dublin Core and RDF Core groups
        have for abstract concepts.
      </p>
    </blockquote>
    <p>
      I think that we would have to design a new URI scheme before
      we change things that much. That is tempting of course. But
      then -- building a semantic web out of what we have is
      tempting too. It was tempting to rehash TCP a little when
      making HTTP. It wasn't practical, and we would have lost a
      lot more than we would have gained. There is a lot to be said
      for using common technology. We've got an infrastructure of
      documents. We want to build an infrastructure of knowledge.
      Let's build it using the documents. We might find that the
      commonality with the web of human-readable information is a
      boon.
    </p>
    <h3 id="L735">
      2.8 Abandon any identification of abstract things
    </h3>
    <p>
      An argument which surprised me is that yes, HTTP URIs
      identify documents, but in fact the fragment identifier must
      only be used to identify parts -- fragments -- of documents.
      This means that RDF cannot in fact use HTTP URI schemes at
      all. A completely different system would have to be put
      together -- either a new set of URIs, or RDF conventions in
      which the relationship to the part of a document in which
      something was described became explicit. In N3 this would
      look like
    </p>
    <p>
      [ is rdf:referent of &lt;#myCar&gt; ] [ is rdf:referent of
      &lt;#color&gt; ] [ is rdf:referent of &lt;#blue&gt; ]
    </p>
    <p>
      Of course, languages would quickly generate special syntax
      for this. Alternatively, the RDF system would be built entirely
      on the understanding that we were referring always to that
      denoted by a given bit of document, not the bit of document
      itself. This would mean that there would be no way for the
      RDF system to refer to documents themselves directly.
    </p>
    <p>
      This is actually a consistent way of working. It would be a
      change only for those people who use RDF to talk about
      documents as documents. We could change.
    </p>
    <h2>
      <a name="L409" id="L409">3. Conclusion</a>
    </h2>
    <p>
      I didn't have this thought out a few years ago. It has only
      been in actually building a relatively formal system on top
      of the web infrastructure that I have had to clarify these
      concepts in my own mind. I am forced to conclude that modeling
      the HTTP part of the web as a web of abstract documents is
      the only way to go which is practical and, by the
      philosophical underpinnings of the WWW, tenable.
    </p>
    <p>
      I apologize again if I have misunderstood or misrepresented
      others' arguments in the process of explaining my own
      position.
    </p>
    <p>
      Tim Berners-Lee
    </p>
    <p>
      2002-07-28Z
    </p>
    <hr />
    <h3>
      FAQ
    </h3>
    <p>
      <em>Q: But surely, if a document is identified by a namespace
      URI, then when we look up an RDF namespace with millions of
      words in it we will have too long a document to be
      practical!</em>
    </p>
    <p>
      A: It is arguable, for such a situation, whether the
      namespace itself is more cumbersome to manage than the
      document is to deliver. You can make an analogy with
      hypertext: Isn't the model of retrieving a document going to
      be inefficient when the documents are huge? Answers are
      twofold in each case.
    </p>
    <p>
      Firstly, yes it is likely to be less convenient, but that is
      no reason to skew what is a good engineering design for the
      vast proportion of namespaces (or hypertext documents) which
      are not huge.
    </p>
    <p>
      Secondly, the HTTP protocol actually does have methods of
      retrieving parts of a large document.
    </p>
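    <p>
      As a rough illustration of that second point, here is a
      minimal Python sketch of asking for only part of a large
      document with the HTTP/1.1 Range request header. The URI
      http://example.org/big-namespace and the helper names are
      hypothetical, and no request is actually sent; the sketch
      only shows how such a request would be put together.
    </p>

```python
import urllib.request


def byte_range_header(first, last):
    """Return a Range header value asking for bytes first..last inclusive."""
    # Reject ranges that are negative or that end before they start.
    if first not in range(0, last + 1):
        raise ValueError("expected 0 <= first <= last")
    return "bytes={}-{}".format(first, last)


def build_partial_request(url, first, last):
    """Build (but do not send) a GET request for part of a document."""
    headers = {"Range": byte_range_header(first, last)}
    return urllib.request.Request(url, headers=headers)


# Hypothetical namespace document; only the first 1024 bytes are requested.
request = build_partial_request("http://example.org/big-namespace", 0, 1023)
print(request.get_header("Range"))  # prints: bytes=0-1023
```

    <p>
      A server which honours the header answers with 206 Partial
      Content; one which does not simply returns the whole
      document, so a client has to be ready for either.
    </p>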
    <p>
      <em>Q: It seems strange that an HTTP URI should be limited to
      referring to documents, but that all one has to add is this
      little hash mark and suddenly you say it can be used to
      identify anything.</em>
    </p>
    <p>
      A: The hash is not a minor appendage to the URI: It is the
      most significant piece of punctuation in the whole URIref.
      The hash adds a whole new level of abstraction and
      specification! It is true that a hypertext page and that
      page scrolled to a given point seem very similar. The same
      applies to a graphic chart and an object within that chart,
      especially when it is displayed in the context of the
      original document. So I suppose it may be a shock when the
      technique is used with a semantic web language to refer to
      not the document, but something which the document discusses.
      That does allow it to break out of the whole concept of
      documents and into -- anything. But no one promised the
      Semantic Web would be boring. :-)
    </p>
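    <p>
      To make the role of the hash concrete, here is a small Python
      sketch which splits a URIref into the part that names the
      document and the fragment identifier. The example URIref is
      made up, and the function name split_uriref is only for
      illustration.
    </p>

```python
from urllib.parse import urldefrag


def split_uriref(uriref):
    """Split a URIref into (document URI, fragment identifier)."""
    document_uri, fragment = urldefrag(uriref)
    return document_uri, fragment


# Hypothetical URIref: the part before the hash names a document,
# the whole URIref names whatever that document says "myCar" is.
print(split_uriref("http://example.org/cars#myCar"))
# prints: ('http://example.org/cars', 'myCar')
```

    <p>
      An HTTP client sends only the first part to the server; the
      fragment is interpreted on the client side, according to the
      language of the document which comes back.
    </p>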
    <p>
      <em>Q: I thought you said "anything should be able to have a
      URI"?</em>
    </p>
    <p>
      Yes, and it should. There is nothing in the URI spec to say
      what an individual scheme should or should not be created to
      identify. A new URI scheme could for example be able to
      identify anything. But here we are talking about HTTP URIs.
      And remember that with semantic web languages, you can use a
      URIref (very different from a URI) to identify anything, for
      example with HTTP and RDF.
    </p>
    <p>
      <em>Q: But what about CGI scripts? Surely you don't mean the
      HTTP URI identifies the script?</em>
    </p>
    <p>
      A: Of course not. When we talk about the "document"
      identified by a URI it is very often a virtual document
      produced by, for example, a CGI script. The URI identifies
      the document on the web, with no regard to the process which
      causes representations of it to be served.
    </p>
    <p>
      <em>Q: Some HTTP URIs can be POSTed to. Can you still say
      they identify documents?</em>
    </p>
    <p>
      A: Well, some HTTP URIs can't be accessed at all, and some
      access is not allowed, and yes, some URIs are not only
      documents but also can be posted to. So the object is more
      complex than simply a document. But that it has this extra
      functionality doesn't make it any less an HTTP document
      formally. Something can have extra features and still remain
      in the same class of things.
    </p>
    <p>
      <em>Q: What do you mean by "identify", anyway, in Model
      Theory terms? (2003)</em>
    </p>
    <p>
      The closest term used in Model Theory to the way I am using
      <em>identify</em> is <em>denote</em>. Model theory analyses
      communication and understanding by imagining a set of
      <em>interpretations</em>, where an interpretation is a
      mapping from a symbol to that which it denotes. Model
      theorists and linguists tend to complain that one cannot talk
      about the meaning of a term, as you can never know what
      anyone means by anything, you can only see how they react. A
      given agent may have many possible interpretations, but new
      information the agent believes which mentions a symbol will
      rule out interpretations which are inconsistent with that
      information. By the process of exchange of a lot of information,
      one arrives at a state in which one behaves as though other
      agents have effectively the same set of interpretations.
      Under these conditions, one can think of the thing
      <em>identified</em> by the symbol in the community as being
      the set of things denoted by the symbol in the
      interpretations which agents in the community are left with.
      There has been much more discussion of this process (which is
      the essence of the writing of a standard and the purpose of
      documents like this) in email on www-tag with Pat Hayes and
      others in 2003.
    </p>
    <address>
      The rest are from Aaron Swartz
    </address>
    <p>
      <em>Q: Can you point to something in the spec that says HTTP
      URIs must identify a document?</em>
    </p>
    <p>
      There are many answers. I can point to things which could be
      interpreted to say that. The HTTP spec defines resources as
      <em>network data objects</em>. To me that "data" indicates
      the information nature of the thing. It precludes, in most
      people's minds, a car or the Andromeda Galaxy.
    </p>
    <p>
      I could explain that, as I originally wrote the HTTP spec,
      that was the author's intent.
    </p>
    <p>
      But I think the fairest thing is to say that when the spec was
      written it was not sufficiently clear about this particular
      ambiguity, and for reasons mentioned above, this hasn't been
      a problem until now.
    </p>
    <p>
      <em>Q: Isn't it a little weird to start making pronouncements
      about the entire HTTP Web when neither the spec nor the other
      TAG members agree?</em>
    </p>
    <p>
      Pronouncements about the whole Web are really important where
      they are needed. In that case the TAG has a duty to make
      them. And so do I. It seems to me that this assumption is one
      we have been implicitly making and are now breaking, in a way
      which will make the semantic web either inconsistent or much
      less efficient. The TAG members do not agree on this: that is
      why they asked me to write this document. It is written as a
      TAG action item about TAG issue HTTPRange-14. Things get a
      lot weirder than that. ;-)
    </p>
    <p>
      <em>Q: Why do we need to use URI-refs to identify abstract
      concepts in a protocol where we can get more information
      about them? I thought URIs were doing just fine. If we have
      to resort to UUIDs to identify things, I'll get annoyed
      because I won't be able to put them in my browser.</em>
    </p>
    <p>
      Well, there you are... you want to be able to put something
      in your browser, then you must have a representation of it.
      So somewhere in the picture, representations aside, is a
      ConceptualWork. If the ConceptualWork is important, then it
      needs a URI, in my opinion. The alternatives are attractive
      when you start to look at them, but each has a different
      snag. I have tried to explain above.
    </p>
    <p>
      <em>Q: How can you say that the Semantic Web can use the hash
      mark to make a URI-ref identify anything when the URI RFC is
      very clear that hash marks only work when you dereference the
      document?</em>
    </p>
    <p>
      I wouldn't say that hash marks "only work when you dereference
      a document" any more than your street address "only works
      when I visit you", or your date of birth "only worked when
      you were born". I can use your street address -- or your date
      of birth -- to help identify you. What the spec defines is a
      way of using this particular URI to get some information over
      the Internet. The whole web works by what someone recently
      referred to as a "confusion" between name and address. It
      isn't a confusion. It is a connection between two pieces of
      architecture without which the web would not be. Rethink. It
      is primarily a name. We have made a way of looking it up. So
      you don't have to look it up for the name to "work" as an
      identifier. Just as you don't go and look it up when someone
      quotes the RDF namespace -- it works because the same
      identifier identifies the same thing in any context. Looked
      up or not. The same thing is true for foo#bar. If the
      document foo is never served, one can still (if one owns it)
      talk about foo#bar with authority. It is of course good
      practice to serve documents.
    </p>
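    <p>
      A minimal Python sketch of this point: two independent sets
      of statements mention the same URIref, and they can be
      combined purely because the identifier is the same string,
      with nothing being looked up on the network. The URIref, the
      property names and the function merge_statements are all
      invented for the illustration; this is not an RDF
      implementation.
    </p>

```python
from collections import defaultdict


def merge_statements(*sources):
    """Group (subject, property, value) statements by subject identifier."""
    merged = defaultdict(list)
    for source in sources:
        for subject, prop, value in source:
            merged[subject].append((prop, value))
    return dict(merged)


# Hypothetical URIref; the document foo need never be served.
THING = "http://example.org/foo#bar"

from_one_author = [(THING, "color", "blue")]
from_another_author = [(THING, "owner", "Dan")]

merged = merge_statements(from_one_author, from_another_author)
print(merged[THING])  # prints: [('color', 'blue'), ('owner', 'Dan')]
```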
    <p>
      <em>Q: Are all Semantic Web agents going to start
      dereferencing every document they hear about?</em>
    </p>
    <p>
      No, no more than you have to dereference every hypertext
      link you see.
    </p>
    <p>
      <em>Q: Isn't the Semantic Web broken if we have to start
      disagreeing with major specifications like this?</em>
    </p>
    <p>
      This philosophy is quite consistent with the HTTP spec as it
      is.
    </p>
    <h3>
      Exercises
    </h3>
    <p>
      1) What does "<a href=
      "http://www.amazon.com/exec/obidos/ASIN/0679600108/qid=1027958807/sr=2-3/ref=sr_2_3/103-4363499-9407855">http://www.amazon.com/exec/obidos/ASIN/0679600108/qid=1027958807/sr=2-3/ref=sr_2_3/103-4363499-9407855</a>"
      identify?
    </p>
    <ol>
      <li>A whale
      </li>
      <li>"Moby Dick or the Whale" by Herman Melville
      </li>
      <li>A web page on Amazon offering a book for sale
      </li>
      <li>A URI string
      </li>
      <li>All the above
      </li>
    </ol>
    <p>
      When was the thing it identified last changed?
    </p>
    <p>
      Have you read the thing it identifies?
    </p>
    <p>
      2) What does "<a href=
      "http://www.vrc.iastate.edu/magritte.gif">http://www.vrc.iastate.edu/magritte.gif</a>"
      identify?
    </p>
    <ol>
      <li>A pipe
      </li>
      <li>I don't know, but whatever it is it isn't not a pipe.
      </li>
      <li>A contradiction
      </li>
      <li>
        <strong>A picture by Magritte</strong>
      </li>
      <li>
        <strong>A photograph of a picture by Magritte</strong>
      </li>
      <li>
        <strong>A representation as a series of 341632 bits of a
        photo of a painting</strong>
      </li>
      <li>Validly 4, 5 and 6 but not 1
      </li>
    </ol>
    <p>
      <img alt="Hint: This is not a pipe" src=
      "http://www.vrc.iastate.edu/magritte.gif" />
    </p>
    <p>
      3) What does "<a href=
      "http://dm93.org/2002/03/dans-car-23423423">http://dm93.org/2002/03/dans-car-23423423</a>"
      identify?
    </p>
    <ol>
      <li>An inaccessible web page
      </li>
      <li>A black Toyota
      </li>
    </ol>
    <p>
      4) What does "<a href=
      "http://dm93.org/y2002/myCar-232">http://dm93.org/y2002/myCar-232</a>"
      identify?
    </p>
    <ol>
      <li>A black Toyota
      </li>
      <li>A web page
      </li>
    </ol>
    <p>
      When was the thing identified last changed?
    </p>
    <p>
      What does the writing on Dan's car say?
    </p>
    <p>
      Answers: 1:3. 2:7. Note here the web tolerates vagueness along
      the axis of different representations of the same image, but
      not of semantic level between the image and the pipe. 3:1.
      4:2.
    </p>
    <h3>
      References
    </h3>
    <p>
      @@@links
    </p>
    <ul>
      <li>The huge discussion of this issue on www-tag@w3.org
      </li>
      <li>
        <a href="http://www.textuality.com/tag/s1.1.html">Tim
        Bray's text</a>
      </li>
      <li>RFC 1630 and points west
      </li>
      <li>Roy Fielding's short history of URI specifications
      </li>
      <li>Weaving the Web
      </li>
      <li>
        <a href=
        "http://www.cyc.com/cycdoc/vocab/info-vocab.html">Cyc's
        page about Conceptual Works</a> cyc:ConceptualWork <a href=
        "http://ilrt.org/discovery/chatlogs/rdfig/2002-07-31.html#T15-56-58-1">
        proposed as what I mean by document by DanC</a>.
      </li>
    </ul>
    <hr />
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
    <p>
      <a href="../People/Berners-Lee">Tim BL</a>
    </p>
  </body>
</html>