<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
    "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
    <title>
      Universal Resource Identifiers -- Axioms of Web architecture
    </title>
    <link href="di.css" rel="stylesheet" type="text/css" />
    <meta http-equiv="Content-Type" content=
    "text/html; charset=us-ascii" />
  </head>
  <body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
    <address>
      Tim Berners-Lee
      <p>
        Date: December 19, 1996
      </p>
      <p>
        Status: personal view. Editing status: Italic text is
        rough. Requires complete edit and possibly massaging, but
        content is basically there. Words such as "axiom" and
        "theorem" are used with gay abandon and the reverse of
        rigour here.
      </p>
    </address>
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
    <h3>
      Universal Resource Identifiers -- Axioms of Web Architecture
    </h3>
    <ul>
      <li>
        <a href="#uri">Universal Resource Identifiers</a>
        <ul>
          <li>
            <a href="#Universality">Universality</a>
          </li>
          <li>
            <a href="#unique">Global uniqueness</a>
          </li>
          <li>
            <a href="#same">Sameness</a>
          </li>
          <li>
            <a href="#identity">Identity</a>
          </li>
          <li>
            <a href="#canonicalization">Canonicalization - when is
            a URI the same URI?</a>
          </li>
          <li>
            <a href="#abuse">Identity abuse</a>
          </li>
          <li>
            <a href="#nonunique">Not a unique space, just
            universal</a>
          </li>
        </ul>
      </li>
      <li>
        <a href="#state">Identity, State and GET</a>
        <ul>
          <li>
            <a href="#opaque">Opacity</a>
          </li>
          <li>
            <a href="#Query">Query strings</a>
          </li>
        </ul>
      </li>
      <li>
        <a href="#relative">Relative URIs</a>
        <ul>
          <li>
            <a href="#matrix">Matrix spaces</a>
          </li>
        </ul>
      </li>
      <li>
        <a href="Axioms.html#Properties">The properties of
        different URI schemes</a>
      </li>
    </ul>
    <hr />
    <h1>
      Universal Resource Identifiers
    </h1>
    <p>
      The operation of the World Wide Web, and its interoperability
      between platforms of differing hardware and software
      manufacturers, depend on the specifications of protocols such
      as HTTP, data formats such as HTML, and other syntaxes such
      as the URL or, more generally, URI specifications. Behind
      these specifications lie some important rules of behavior
      which determine the foundation of the properties of the Web.
      These are rules and principles upon which new designs of
      programs and the behavior of people must rely. And it is that
      reliance which makes the Web both an information space which
      works now, and the foundation for future applications,
      protocols, and extensions. The more essential of these I
      refer to loosely as axioms, and the most basic of these have
      to do with URIs.<br />
    </p>
    <p>
      The aim of this article is to summarize in one place the
      axioms of Web architecture: those invariant aspects of Web
      design which are implied or stated in various specifications
      or in some cases simply part of the folk law of how the Web
      ought to be used. Especially for these latter cases, this
      article is designed to tie together the Web community in a
      common understanding of how we can progress, extend, and
      evolve the Web protocols. <i>Terms such as "axiom" and
      "theorem" are used with gay abandon rather than precision, as
      this is not a mathematical treatise.</i><br />
    </p>
    <h2>
      <a name="uri" id="uri">Universal Resource
      Identifiers</a><br />
    </h2>
    <p>
      The Web is a universal information space. It is a space in
      the sense that things in it have an address. The "addresses",
      "names", or as we call them here identifiers, are the subject
      of this article. &nbsp;They are called <b>Universal Resource
      Identifiers</b> (URIs).
    </p>
    <p>
      An information object is "on the web" if it has a URI.
      &nbsp;Objects which have URIs are sometimes known as "First
      Class Objects" (FCOs). &nbsp;The Web works best when any
      information object of value and identity is a first class
      object. &nbsp;If something does not have a URI, you can't
      refer to it, and the power of the Web is the less for that.
    </p>
    <p>
      By <em>Universal</em> I mean that the web is declared to be
      able to contain in principle every bit of information
      accessible by networks. It was designed to be able to include
      existing information systems such as FTP, and to be simply
      extensible in the future to include any new information
      system.
    </p>
    <p>
      The URI schemes identify various different types of
      information object, which play different roles in the
      protocols. Some identify services, connection end points, and
      so on, but a fundamental underlying architectural notion is
      of information objects - otherwise known as generic
      <strong>documents</strong>. These can be
      <strong>represented</strong> by strings of bits. An
      information object conveys something - it may be art, poetry,
      sensor values or mathematical equations.
    </p>
    <p>
      The Semantic Web allows an information object to give
      information about anything - real objects, abstract concepts.
      In this case, by combining the identifier of a document with
      the identifier, within that document, of something it
      describes, one forms an identifier for anything. This is done
      with "#" and fragment identifiers, discussed later.
    </p>
    <h4>
      <a name="Universality" id="Universality">Axiom 0:
      Universality 1</a>
    </h4>
    <p class="axiom">
      Any resource anywhere can be given a URI
    </p>
    <h4>
      <a name="Universality2" id="Universality2">Axiom 0a:
      Universality 2</a>
    </h4>
    <p class="axiom">
      Any resource of significance should be given a URI.
    </p>
    <p>
      (What sorts of things can be resources? A very wide variety.
      The URI concept itself puts no limits on this. However, URIs
      are divided into schemes, such as http: and telnet:, and the
      specification of each scheme determines what sort of things
      can be resources in that scheme. Schemes are discussed
      later.)
    </p>
    <p>
      This means that no information which has any significance and
      persistence should be made available in a way that one cannot
      refer to it with a URI.
    </p>
    <p>
      In fact, we take care before extending the URIs to include
      any old system, because URIs of any form must also be
      understood anywhere in the world.
    </p>
    <p>
      When you specify a URI, a Universal Resource Identifier
      (often people use the more restricted term "URL", Uniform
      Resource Locator), the first axiom is:
    </p>
    <h4>
      <a name="unique" id="unique">Axiom 1: Global scope</a>
    </h4>
    <p class="axiom">
      It doesn't matter to whom or where you specify that URI, it
      will have the same meaning.
    </p>
    <p>
      So, this means that there is no scope within which a URI must
      be placed for it to hold. All you need say is that something
      is "on the Web" and that is enough. Anyone can follow that
      hypertext link, anyone can look up that URI. Now, the sorts
      of URI that we find typically start "<code>http:</code>"
      indicating that this URI points into a space of objects
      accessed using the hypertext transfer protocol. But there are
      many other sorts of URI, and a key to the Universality is
      that this universal space of identifiers, whether you call
      them names, addresses or locators, is universal through the
      range of pre-existing protocols such as SMTP and NNTP,
      through protocols designed for the Web specifically (HTTP),
      through, in principle, to new protocols yet to be invented.
      So, there is a theorem, if you like, of URIs that:
    </p>
    <p class="axiom">
      Any new space of identifiers or address space can be
      represented as a subset of URI space.
    </p>
    <p>
      You can prove this easily because there is no limit to the
      length of a URI and any new name system or address system can
      be incorporated simply by encoding the value of names or
      addresses into an acceptable, printable string and prefacing
      that string with a standard prefix for that new scheme. So,
      you could replace http:, for example with ISBN: or X500:
      depending on the new scheme.<br />
    </p>
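    <p>
      As a minimal sketch of that argument, and not part of any
      specification: the Python below percent-encodes a name from
      some other name space into a printable string and puts a
      scheme prefix in front of it. The helper names
      <code>to_uri</code> and <code>from_uri</code>, the lower-case
      prefixes, and the sample names are assumptions made only for
      illustration.
    </p>
<pre>
```python
from urllib.parse import quote, unquote


def to_uri(scheme, name):
    """Embed a name from another name space into URI space."""
    # Encode everything except letters, digits and "-", ".", "_", "~",
    # so the result is an acceptable, printable string.
    return scheme + ":" + quote(name, safe="")


def from_uri(uri):
    """Recover the scheme prefix and the original name."""
    scheme, _, encoded = uri.partition(":")
    return scheme, unquote(encoded)


# Hypothetical names from two other name spaces.
isbn_uri = to_uri("isbn", "0-13-110362-8")
x500_uri = to_uri("x500", "cn=Jane Doe, o=Example, c=GB")

print(isbn_uri)
print(x500_uri)
print(from_uri(x500_uri))
```
</pre>
    <p>
      Because the encoding is reversible, no name in the original
      space is lost, which is all the subset argument needs.
    </p>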
    <p>
      There is a second axiom of URIs which is difficult to
      characterize exactly but is accepted in some form by everyone
      who uses the Web, and that is that:
    </p>
    <h4>
      <a name="same" id="same">Axiom 2a: sameness</a>
    </h4>
    <p class="axiom">
      a URI will repeatably refer to "the same" thing
    </p>
    <p>
      The same identifier string is expected from one day to the
      next to point to, in some sense, the same object. That is a
      very important axiom, and it leaves open the "in some sense"
      behind which is a very complicated discussion of the concept
      of identity. When are two things "in some sense" the
      same?<br />
    </p>
    <h4>
      <a name="identity" href="Axioms.html#Universality" id=
      "identity">Axiom 2b: identity</a>
    </h4>
    <p>
      Axiom 2b of URIs clears up the vagueness of 2a, and is that
    </p>
    <p class="axiom">
      the significance of identity for a given URI is determined by
      the person who owns the URI, who first determined what it
      points to.
    </p>
    <p>
      We do not discuss in detail here the definition of owner
      because the mechanism by which a person or agent comes to get
      or create or be allocated a new URI varies from scheme to
      scheme. But in every scheme in practice there is a way of
      making a new URI. In many schemes, the scheme itself implies
      or requires some properties of identity. The scheme, if you
      like, imposes constraints within which the owner of a URI is
      free to define identity.
    </p>
    <p>
      The implication here is that we will need protocols for
      exchanging any guarantees of the properties of given URIs:
      they are not simply laid down in the specification of the
      web. This is in tune with a general philosophical principle
      of design (after Bob Scheifler and others):
    </p>
    <blockquote>
      The technology should define <i>mechanisms</i> wherever
      possible without defining <i>policy</i>.
    </blockquote>
    <p>
      because we recognize here that many properties of URIs are
      social rather than technical in origin.
    </p>
    <p>
      Therefore, you will find pointers in hypertext which point to
      documents which never change but you will also find pointers
      to documents which change with time. You will find pointers
      to documents which are available in more than one format. You
      will find pointers to documents which look different
      depending on who is asking for them. There are ways to
      describe in a machine or human readable way exactly what sort
      of repeatability you would expect from a URI, but the
      architecture of the Web is that that is something for the
      owner of the URI to determine.
    </p>
    <h4>
      <a name="abuse" id="abuse">Identity abuse</a>
    </h4>
    <p>
      <i>All the same, a word of caution is appropriate about the
      indiscriminate or deliberately misleading abuse of the
      identity of the object referred to by a URI. A web server is
      often in a position to know a lot of context about a request.
      This can include, for example, the person who is asking and the
      document they were reading last, from which they followed the
      link. &nbsp;It is possible to use this information to
      dramatically change the content of the document referred to.
      &nbsp;This undermines the concept of identity and of
      reference in general. &nbsp;To do that without making it
      clear is misleading both to anyone who quotes the URI
      of&nbsp;a page and to anyone who follows the link.</i>
    </p>
    <p>
      <i>Unless it is clearly indicated on the page (or using a
      future protocol), to return differing information for the
      same URI must be considered a form of deception. &nbsp;It
      also of course messes up caches. Note the HTTP 1.1 "Vary"
      header allows this indication to be passed.</i>
    </p>
    <h4>
      <a name="canonicalization" id="canonicalization">When is a
      URI "the same URI"?</a>
    </h4>
    <p>
      Two URIs are the same if (and only if) they are the same
      character for character.
    </p>
    <p>
      Two URIs which are different may in fact be equivalent, in
      that they may refer to the same thing, and give the same
      result in all operations. In some cases any agent looking at
      two URIs can deduce, from knowledge of the various web
      standards, that they must be equivalent, in that they must
      refer to the same thing. For example, HTTP URIs contain
      domain names, and the Domain Name System is case-insensitive.
      Therefore, while it is normal practice to use lower case for
      domain names, any agent which comes across two URIs which
      differ only in the case of the domain name can conclude that
      they must refer to the same thing. In another case, a client
      agent may use out-of-band information about a web site to
      know that its URI paths are case-invariant, or that URIs
      ending in "/" and "/index.html" are equivalent. It is bad
      engineering practice to make new protocols require such
      processing.
    </p>
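    <p>
      The distinction between "the same URI" and "equivalent URIs"
      can be sketched in Python as follows. This is an
      illustration under stated assumptions, not a canonicalization
      algorithm: the function names are invented here, and the only
      knowledge built in is the case-insensitivity of domain names
      (Python's <code>urlsplit</code> also lower-cases the scheme
      name).
    </p>
<pre>
```python
from urllib.parse import urlsplit


def same_uri(a, b):
    """Two URIs are the same only if they match character for character."""
    return a == b


def equivalent_ignoring_host_case(a, b):
    """Deduce equivalence from the case-insensitivity of domain names alone."""
    pa, pb = urlsplit(a), urlsplit(b)
    # SplitResult.hostname is returned in lower case; every other part is
    # compared exactly, so paths and query strings stay case-sensitive.
    rest_a = (pa.scheme, pa.username, pa.password, pa.port,
              pa.path, pa.query, pa.fragment)
    rest_b = (pb.scheme, pb.username, pb.password, pb.port,
              pb.path, pb.query, pb.fragment)
    return pa.hostname == pb.hostname and rest_a == rest_b


# Hypothetical URIs used only for the demonstration.
u1 = "http://www.Example.org/Path"
u2 = "http://www.example.org/Path"
u3 = "http://www.example.org/path"

print(same_uri(u1, u2))
print(equivalent_ignoring_host_case(u1, u2))
print(equivalent_ignoring_host_case(u2, u3))
```
</pre>
    <p>
      Generic code would rely on <code>same_uri</code> only; the
      second function stands for the kind of extra deduction an
      agent may make when it happens to know the scheme involved.
    </p>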
    <p>
      There is a long series of such algorithms. Which ones an
      agent can apply depends on what information it has to hand,
      and on what knowledge of which protocols has been
      programmed into it. New schemes may be defined in the future,
      for which different forms of canonicalization can be done.
      There is, therefore, <strong>no definitive
      canonicalization</strong> algorithm for URIs. Generic URI
      handling code should handle URIs as case-sensitive character
      strings. It is <strong>not</strong> recommended that, for
      example, encryption and signature algorithms attempt to
      canonicalize URIs before signature, because of the
      arbitrariness of any attempt to define a canonicalization
      algorithm.
    </p>
    <p>
      The only canonicalization one could insist upon would be that
      defined by the algorithms in the URI specifications. This
      includes the generation of an absolute URI from a relative
      one, and the hex-encoding or decoding of all non-reserved
      characters.
    </p>
    <h3>
      URIs and the Test of Independent Invention
    </h3>
    <p>
      The concept of a web as a "space" is based on these axioms of
      design. As a result, the web behaves to a certain extent as a
      system with state, and an important part of the work of the
      system is the distribution of visible state rather than the
      execution of invisible remote operations.
    </p>
    <h4>
      <a name="nonunique" id="nonunique">Axiom 3: non unique</a>
    </h4>
    <p class="axiom">
      URI space does not have to be the only universal space
    </p>
    <p>
      The assertion that the space of URIs is a universal space
      sometimes encounters opposition from those who feel there
      should not be one universal space. These people need not
      oppose the concept, because it is not that of a single
      universal space. Indeed, the fact that URIs form a universal space does
      not prevent anyone else from forming their own universal
      space, which of course by definition would be able to envelop
      within it as a subset the universal URI space. Therefore the
      web meets the "independent design" test, that if a similar
      system had been concurrently and independently invented
      elsewhere, in such a way that the arbitrary design decisions
      were made differently, when they met later, the two systems
      could be made to interoperate.
    </p>
    <p>
      There may be in the world many universal spaces, and there
      need not be any particular quarrel about one particular one
      having a special status. (Of course, having very many may not
      be very useful, and in the World Wide Web, the URI space
      plays a special role by being the universal space chosen in
      that design.)
    </p>
    <p>
      For example, it would be possible to map all international
      telephone numbers into URI space very easily, by inventing a
      new URI scheme "phone:" followed by the phone number. It would
      in fact also conversely be possible to map URIs into
      international phone numbers by allocating a special phone
      number not used by anyone else, perhaps a special country
      code for URI space, and then converting all URIs into a
      decimal representation. In that case, both URIs and phone
      numbers would be universal spaces. Identifiers in one space
      would consist only of digits, and in the other of
      alphanumeric characters. One would be shorter than the other,
      but there is no reason why, in principle, the two could not
      co-exist, allowing you to dial any Web object from a
      telephone as a telephone number, and point to any phone from
      a hypertext document.<br />
    </p>
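    <p>
      The two-way mapping can be made concrete with a small Python
      sketch. Everything specific in it is an assumption chosen for
      illustration: the country code <code>999</code> reserved for
      URI space, the rule of three decimal digits per ASCII
      character, and the function names.
    </p>
<pre>
```python
URI_COUNTRY_CODE = "999"  # hypothetical prefix reserved for URI space


def phone_to_uri(number):
    """Map a phone number into URI space with a scheme prefix."""
    return "phone:" + number


def uri_to_phone(uri):
    """Map a URI into phone-number space as a string of decimal digits."""
    digits = "".join("%03d" % byte for byte in uri.encode("ascii"))
    return URI_COUNTRY_CODE + digits


def phone_to_original_uri(number):
    """Invert uri_to_phone: read the digits back three at a time."""
    digits = number[len(URI_COUNTRY_CODE):]
    codes = [int(digits[i:i + 3]) for i in range(0, len(digits), 3)]
    return bytes(codes).decode("ascii")


# Fictional number and URI used only for the demonstration.
as_uri = phone_to_uri("15555550100")
as_phone = uri_to_phone("http://example.org/")

print(as_uri)
print(as_phone)
print(phone_to_original_uri(as_phone))
```
</pre>
    <p>
      The digit string is much longer than the URI it encodes,
      which is the "one would be shorter than the other" trade-off,
      but the round trip loses nothing.
    </p>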
    <p>
      So, on this last axiom rests not specifically the operation
      of the web, but its acceptance as a non-domineering
      technology, and therefore our trust in its future
      evolvability.
    </p>
    <h3>
      <a name="state" id="state">Identity, State and GET</a>
    </h3>
    <p>
      The fact that the thing referenced by the URI is in some
      sense repeatably the same suggests that in a large number of
      cases the result of de-referencing the URI will be exactly
      the same, especially during a short period of time. This
      leads to the possibility of caching information. It leads to
      the whole concept of the Web as an information space rather
      than a computing program. It is a very fundamental concept.
      Not only do the concepts of navigation around the space
      remembering "places" in the Web and other humanly visible
      aspects of the nature of the Web depend on it, but also many
      technical architectural properties depend on it. For example,
      the implication is that the GET operation in HTTP is an
      operation which is expected to repeatably return the same
      result. As a result of that, anyone may know that, under
      certain circumstances, they may use the result of a previous
      operation instead of repeating an HTTP operation. The
      operation is "idempotent". This, in turn, allows software to
      use previously fetched copies of documents and it requires
      that the HTTP GET operation should have no <em>side
      effects</em>. For example, GET should never be used to
      initiate another operation which will change state. In
      general (see the HTTP 1.1 spec) the notion of side-effects is
      that of any significant communication between the parties. A
      user can never be held accountable for anything as a result of
      doing a GET. The server may for example log the number of
      requests, but the client user cannot be held responsible for
      that: it does not constitute communication between the two
      parties.
    </p>
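    <p>
      A toy cache shows why the argument needs GET to be free of
      side effects. The sketch below is only an illustration: the
      class <code>GetCache</code> and the <code>fetch</code>
      callable are invented here, and real caches also deal with
      expiry, validation and headers such as "Vary", all of which
      are left out.
    </p>
<pre>
```python
class GetCache:
    """Reuse the result of an earlier GET instead of repeating it."""

    def __init__(self, fetch):
        # fetch is a hypothetical callable taking (method, uri)
        # and returning the response body.
        self._fetch = fetch
        self._store = {}

    def request(self, method, uri):
        if method != "GET":
            # Other methods may change state, so they always go through.
            return self._fetch(method, uri)
        if uri not in self._store:
            self._store[uri] = self._fetch(method, uri)
        return self._store[uri]


calls = []


def counting_fetch(method, uri):
    """Stand-in for the network: records every request actually made."""
    calls.append((method, uri))
    return "response %d" % len(calls)


cache = GetCache(counting_fetch)
first = cache.request("GET", "http://example.org/doc")
second = cache.request("GET", "http://example.org/doc")
cache.request("POST", "http://example.org/order")
cache.request("POST", "http://example.org/order")

print(first == second, len(calls))
```
</pre>
    <p>
      The second GET never reaches the server. If GET were allowed
      to change state, that silent substitution would alter the
      behavior of the system, which is exactly what the rule
      forbids.
    </p>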
    <p>
      It is wrong to represent the user doing a GET as committing
      to something, putting themselves on a mailing list, or doing
      any operation which affects the state of the Web or the state
      of the user's relationship with the information provider or
      the server. To ignore this rule can be to introduce a serious
      security problem in a website.
    </p>
    <p>
      So, from this principle, we have a principle of the HTTP
      protocol that:
    </p>
    <h4>
      <a name="get" id="get">Axiom</a>
    </h4>
    <p style="axiom" class="axiom">
      In HTTP, GET must not have side effects.
    </p>
    <p>
      The introduction of any other method apart from GET which has
      no side effects is also incorrect, because the results of
      such an operation effectively form a separate address space,
      which violates the universality. A pragmatic symptom would be
      that hypertext links would have to contain the method as well
      as the URI in order to be able to address the new space, which
      people would soon want to do.
    </p>
    <h4>
      <a name="GET2" id="GET2">Axiom</a>
    </h4>
    <p class="axiom">
      In HTTP, anything which does not have side-effects should use
      GET
    </p>
    <p>
      This means that for people implementing systems in which
      users request information and execute operations using forms,
      when the form simply requests information it must result in a
      GET operation. Indeed this is very much to be favored over a
      POST operation, because the result of a GET operation has a
      URI and may be linked to or, for example, put into a
      bookmark. Using POST here violates the <a href=
      "Axioms.html#Universality2">axiom of universality</a> above.
    </p>
    <p>
      However, when the result of a form is to execute an
      operation, which changes the Web or a relationship of a user
      to anyone else, then the GET operation may not be used and
      POST or another method, either through HTTP or mail, must be
      used. Only by sticking to this rule can such systems
      interoperate with caches and other agents which exploit the
      repeatability of HTTP GET of URI dereferencing in the
      future.<br />
    </p>
    <p>
      The axiom above about URIs pointing in principle to
      conceptually the same thing has a corollary which says that
      URIs do not always have to point to <i>exactly</i> the same
      set of bits. This means that URIs can be "generic". See the
      <a href="Generic.html">discussion of generic URIs</a>.<br />
    </p>
    <h3>
      <a name="opaque" id="opaque">The Opacity Axiom</a><br />
    </h3>
    <p>
      The concept of an identifier referring to a resource is very
      fundamental in the World Wide Web. Identifiers will refer to
      resources of all different sorts. Any addressable thing will
      have an identifier. There are mechanisms we have just
      discussed for extending the spaces of identifiers into name
      spaces which have different properties. Different spaces may
      address different sorts of objects, and the relationship
      between the identifier and the object, such as the uniqueness
      of the object and the concept of identity, may vary. A very
      important axiom of the Web is that in general:
    </p>
    <h4>
      Axiom: Opacity of URIs
    </h4>
    <p class="axiom">
      The only thing you can use an identifier for is to refer to
      an object. When you are not dereferencing, you should not
      look at the contents of the URI string to gain other
      information.
    </p>
    <p>
      For the bulk of Web use, URIs are passed around without anyone
      looking at their internal contents, the content of the string
      itself. This is known as <b>opacity</b>. Software should
      be made to treat URIs as generally as possible, to allow the
      most reuse of existing or future schemes.
    </p>
    <p>
      For example, within an HTTP identifier, even when access is
      made to the object, the client machine looks at the first
      part of the identifier to determine which server machine to
      talk to and from then on the rest of the string is defined to
      be opaque to the client. That is, the client does not look
      inside it; it cannot deduce any information from the
      characters in that identifier. It has been very tempting from
      time to time for people to write software in which a client
      will look at a string such as ".html" on the end of an
      identifier, and come to a conclusion that it might be a
      hypertext markup file when dereferenced. But such breaking
      of the rule could lead to a broken architecture
      in which the generality of URIs is something one can no
      longer depend on.
    </p>
    <p>
      Opacity of the URIs opens the door to new URI schemes, and it
      opens the door to excitingly different interpretations of
      HTTP URI spaces. For example, servers can use the opaque
      string to carry all kinds of parameters to spaces with new
      topologies.
    </p>
    <p>
      As a result of this axiom, the many pieces of metadata (that is,
      information about the object) that a client might be tempted
      to infer from the actual string value of the URI, but can't,
      have to be made available through the HTTP protocol. That is
      the purpose of some of the headers of the HTTP protocol and
      is discussed in the next section.
    </p>
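    <p>
      As a hedged illustration of taking metadata from the protocol
      rather than from the URI string, the Python sketch below
      decides the media type from a response's "Content-Type"
      header and never inspects the URI. The function name, the
      plain dictionary standing in for response headers, and the
      fallback value used when the header is absent are assumptions
      of this sketch.
    </p>
<pre>
```python
def media_type(uri, headers):
    """Return the media type of a response without looking inside the URI."""
    # The uri argument is deliberately unused: a trailing ".html" or ".png"
    # carries no meaning for the client.
    value = headers.get("Content-Type", "application/octet-stream")
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    return value.split(";")[0].strip().lower()


# Hypothetical responses: the URI suffix and the real type disagree.
png_behind_html_name = media_type(
    "http://example.org/report.html",
    {"Content-Type": "image/png"},
)
html_behind_plain_name = media_type(
    "http://example.org/db/row/42",
    {"Content-Type": "Text/HTML; charset=us-ascii"},
)

print(png_behind_html_name)
print(html_behind_plain_name)
```
</pre>
    <p>
      A client written this way keeps working when a gateway
      happens to produce names ending in ".html" that are not
      hypertext at all.
    </p>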
    <p>
      Another reason for keeping the URI opaque is that, within an
      HTTP server's address space for example, the rest of the URI
      can be used as a coded representation of a name in some
      other, local space. Typically, when
      that is done, when the server serves as a gateway into an
      existing space, then it is extremely useful to be able to use
      the string in any way consistent with the URI syntax rules to
      represent coded names from the other space. The server can
      encode, within the URI, complex locations in some legacy
      system which is being mapped into an information space for
      the first time. So, for example, names which come from names
      in some sort of a database might by coincidence end up with
      .html with no implication that there is a hypertext markup
      language document involved, just that the particular encoding
      used happened to produce that string of bits.
    </p>
    <h4>
      <a name="Query" id="Query">Query strings</a>
    </h4>
    <p>
      An important case is the treatment of the question mark in
      HTML forms. There is a convention that information returned
      from HTML forms is returned by encoding it and appending it
      to the URI. The question mark within the URI is used to
      separate the basic URI from parameters which are appended to
      it to perform an operation. A typical use is for a search,
      and the string following the question mark is often known as
      a query string.
    </p>
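    <p>
      The form convention can be sketched with Python's standard
      library. The action URI and the field names below are made up
      for the example, and <code>form_get_uri</code> is a
      hypothetical helper, not something a browser exposes.
    </p>
<pre>
```python
from urllib.parse import parse_qs, urlencode, urlsplit


def form_get_uri(action, fields):
    """Build the URI a form with a GET action would dereference."""
    # The encoded fields are appended after a question mark.
    return action + "?" + urlencode(fields)


# Hypothetical search form submission.
search_uri = form_get_uri(
    "http://example.org/search",
    {"q": "web axioms", "lang": "en"},
)

# The resource behind the form can decode the same string again.
decoded = parse_qs(urlsplit(search_uri).query)

print(search_uri)
print(decoded)
```
</pre>
    <p>
      The result is an ordinary URI, so it can be bookmarked,
      quoted in a link, or cached like any other.
    </p>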
    <p>
      When a query string and fragment identifier are used, the
      function evaluated on dereferencing a URL
    </p>
    <pre>
         http://foo/bar?baz#frag
 is
         select(get("foo", query("bar", "baz")), "frag")
</pre>
    <p>
      where
    </p>
    <ul>
      <li>query(resource, querystring) is evaluated by the
      resource "bar";
      </li>
      <li>"bar" is opaque to all except the server "foo";
      </li>
      <li>"baz" is a format understood by client and by the
      resource "foo/bar";
      </li>
      <li>get(server, restofuri) is executed by the client engine
      which understands "foo" but not "bar";
      </li>
      <li>select(resource, fragmentid) is evaluated on the client
      by the resource's handling code.
      </li>
    </ul>
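    <p>
      The decomposition above can be sketched in Python with the
      standard <code>urllib.parse.urlsplit</code> function. The
      helper name <code>describe_dereference</code> is invented for
      this illustration; it only builds the nested expression as a
      string and does not fetch anything.
    </p>

```python
from urllib.parse import urlsplit


def describe_dereference(url):
    """Return the nested select/get/query expression for a URL."""
    parts = urlsplit(url)
    server = parts.netloc                # understood by the client engine
    resource = parts.path.lstrip("/")    # opaque to all except the server
    inner = '"%s"' % resource
    if parts.query:
        # The query string is interpreted by the resource itself.
        inner = 'query("%s", "%s")' % (resource, parts.query)
    expr = 'get("%s", %s)' % (server, inner)
    if parts.fragment:
        # The fragment is applied on the client, after retrieval.
        expr = 'select(%s, "%s")' % (expr, parts.fragment)
    return expr


print(describe_dereference("http://foo/bar?baz#frag"))
# prints: select(get("foo", query("bar", "baz")), "frag")
```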
    <p>
      Query strings are clearly not opaque to the client. However,
      they should be opaque to (for example) proxies.
    </p>
    <p>
      Apart from searches, other operations are performed with
      query strings, for example by HTML forms which are set up to
      have an HTTP "GET" action. This is done in situations in
      which the result of the operation is quasi-static. In other
      words, the resource referred to by the complete URI
      (including the question mark and the query string after it)
      follows the axiom of slow change above: the result of
      performing the operation is repeatable in some fashion.
    </p>
    <p>
      It is tempting, and common, to assume that the result of
      such an operation will be more transient than that of a URI
      with a shorter query string or with none at all. To make
      this assumption breaks the Opacity rule in general. Not only
      that, but it is in many cases a completely wrong assumption.
      For example, the query string is sometimes used to indicate
      parameters such as a personalized sub-space which is being
      browsed. Unfortunately, because the Opacity rule has been
      broken by clients and caches which don't cache documents
      whose URIs contain question marks, the question mark has
      sometimes been deliberately inserted in order to defeat
      caches. This creeping use of non-standard, axiom-breaking
      conventions could clearly be damaging to other systems which
      use the question mark for other reasons, in the URIs of
      perfectly cacheable documents.
    </p>
    <h2>
      <a name="relative" id="relative">Hierarchies and Relative
      URIs</a>
    </h2>
    <p>
      While discussing the universality of Universal Resource
      Identifiers, it is as well to discuss the place of the
      Universal Syntax, as its intent and advantages have been the
      source of some misunderstanding. The URI syntax, now famous
      through its HTTP form, uses slashes to indicate a
      hierarchically structured name or address. Apart
      from that, the strings between the slashes are opaque. There
      is nothing to say that the string between a double slash and
      a slash must be in all URI schemes a fully qualified domain
      name; there is nothing to relate strings between single
      slashes to parts of a unix file name. The reason that the
      slashes have been instituted as common universal syntax for a
      hierarchical boundary is that hierarchical schemes are common
      and that relative naming within a hierarchical space has many
      advantages.
    </p>
    <p>
      Relative naming allows small groups of documents which are
      located close within a tree to refer to each other without
      being aware of their absolute position within any absolute
      tree. It turns out that for scalability of the creation of
      material on the Web, this is essential. This has been found
      both for file names in most modern operating systems and for
      HTTP URLs, and one can also reasonably assume that it will be
      true for any other hierarchical scheme. Therefore, it is
      important that the generic concept of a hierarchical scheme
      is kept separate, for future use, from specific schemes which
      involve forms that may become outdated, such as fully
      qualified domain names.
    </p>
    <h4>
      An example of using relative URIs
    </h4>
    <p>
      Let us take, for example, the exercise of mapping an
      international telephone number onto URL syntax. International
      telephone numbers are hierarchical. The meaning and the
      format of a telephone number depend on the country, but
      there is a universal format for a telephone number which can
      be understood everywhere in the world. This
      format is, in fact, a plus sign indicating that one starts at
      the top of the hierarchy: that this is an absolute
      international telephone number. It is followed by the country
      code, the area code (if any) and the telephone number.
      Mapping this onto the URL syntax, the double slash would be
      used to indicate that one is starting from the top of the
      tree, so the number
    </p>
    <p>
      +1 (617) 253-5708
    </p>
    <p>
      would be written with a double slash instead of the plus and
      then slashes at the other hierarchical boundaries:
    </p>
    <pre>
        phone://1/617/253-5708
 
</pre>
    <p>
      (The dash here is used for decoration. In practice people
      like to break telephone numbers up in various ways for
      readability, even though the punctuation has no hierarchical
      significance.) Of course, there could be other mappings used,
      but let us look at how this particular mapping using the
      slashes would be used in relative URIs. Suppose we are in a
      context in which that telephone number is the default.
      Suppose, for example, we have declared that that number is the
      absolute base telephone number ("base URL") within a
      conversation: typically, we are talking to somebody who lives
      in the same area code.
    </p>
    <p>
      Using the relative URL parsing rules, we can refer to another
      local telephone number simply as, for example, 861-5000 with
      no punctuation. This is just what we do in practice. We can
      refer to a telephone number within the same country as
      /800/123-4567. Although these are not quite the conventions
      currently used for telephone numbers, they are just as
      compact as the various conventions of putting brackets around
      the area code, and would probably be parsed correctly by a
      human.
    </p>
    <p>
      To indicate an international number we simply start with a
      double slash. For example,
    </p>
    <pre>
        phone://41/22/767-6111
 
</pre>
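    <p>
      The three cases above (same area, same country, another
      country) can be sketched as a small resolver. This is only an
      illustration: <code>phone:</code> is the hypothetical scheme
      of this example, not a registered one, and the function name
      <code>resolve_phone</code> is invented here.
    </p>

```python
def resolve_phone(base, relative):
    """Resolve a relative phone reference against an absolute base.

    Assumes the base has the form scheme://country/area/local, as in
    the hypothetical phone: mapping described in the text.
    """
    scheme, rest = base.split(":", 1)
    if relative.startswith(scheme + ":"):
        return relative                          # already absolute
    if relative.startswith("//"):
        return scheme + ":" + relative           # starts at the top of the tree
    segments = rest[2:].split("/")               # [country, area, local]
    if relative.startswith("/"):
        return scheme + "://" + segments[0] + relative   # same country
    # No leading slash: replace only the last segment (same area code).
    return scheme + "://" + "/".join(segments[:-1] + [relative])


base = "phone://1/617/253-5708"
print(resolve_phone(base, "861-5000"))          # phone://1/617/861-5000
print(resolve_phone(base, "/800/123-4567"))     # phone://1/800/123-4567
print(resolve_phone(base, "//41/22/767-6111"))  # phone://41/22/767-6111
```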
    <p>
      Now, suppose instead we had used another system: suppose we
      had decided that, for consistency with existing practice, we
      would simply use the plus sign and, for example, parentheses
      around an area code. Then, although you could use the
      convention of simply omitting the area code for a local
      telephone number, and you could use a plus sign to indicate
      an international telephone number, as soon as you put
      <code>phone:</code> in front of it, to have it correctly
      parsed by a URL parser, you would always have to use the
      full international form. Now, there
      may be some who would prefer to always see the full
      international form in telephone numbers because telephone
      numbers are of fairly limited length. However, the principle
      of relative names or local telephone numbers being useful is
      established beyond question. There is also perhaps as much
      public use of the double slash in URIs as there is of the
      plus sign in international telephone numbers. (Within the
      United States of America there seem to be relatively few
      people who understand the significance of the plus sign or
      for that matter know what their country code is!).
    </p>
    <p>
      So, in general when looking at new naming schemes which may
      have a hierarchical nature we should regard the slash and the
      double slash as common syntax. It may be that we can
      transition to a shorter form in which, for example, a double
      slash is assumed after the colon in a fully qualified URL in
      order to address the worry that the URL syntax is clumsy when
      you include the scheme name prefix.
    </p>
    <h4>
      <a name="myth1" id="myth1">Myth:</a>
    </h4>
    <p class="axiom">
      Myth: "The // must only be used to introduce a fully
      qualified domain name."
    </p>
    <h4>
      Grandfathering hierarchies: generalizing the scheme
    </h4>
    <p>
      It is worth noting that the syntax with the double slash can
      in fact be extended for use with a triple slash if one wanted
      to be able to start at any level in a much more complicated
      hierarchical structure. For example, suppose international
      telephone numbers were to be extended to cover a planetary
      code in the future. Then the planetary code could be attached
      to the front of the international code. The triple slash
      could introduce the interplanetary code, and the double slash
      would introduce the international code. Indeed, this is how
      the double slash came to be: when hierarchical naming schemes
      such as those of unix file systems were extended to a network
      file system on the Apollo Domain, the extra slash was
      introduced. Similarly, Microsoft NT networking now uses a
      double backslash in exactly the same way.
    </p>
    <p>
      RFC1630 is an informational RFC I wrote about URIs in WWW
      because getting consensus on the philosophy of all this in
      open forum was going to take a long time at best. It contains
      an algorithm for parsing relative URIs which in fact would
      parse a relative URI in an environment with any arbitrary
      number of consecutive slashes. (The only problem with this
      scheme, as with others which use the same delimiter for
      beginning and ending strings, is that one cannot represent an
      empty string. This is already a problem with the file syntax
      when an empty string is used for the host name, resulting in
      three consecutive slashes.)
    </p>
    <p>
      To quote RFC1630:
    </p>
    <blockquote>
      If the scheme parts are different, the whole absolute URI
      must be given. Otherwise, the scheme is omitted, and:
      <p>
        If the partial URI starts with a non-zero number of
        consecutive slashes, then everything from the context URI
        up to (but not including) the first occurrence of exactly
        the same number of consecutive slashes which has no greater
        number of consecutive slashes anywhere to the right of it
        is taken to be the same and so prepended to the partial URL
        to form the full URL. Otherwise:
      </p>
      <p>
        The last part of the path of the context URI (anything
        following the rightmost slash) is removed, and the given
        partial URI appended in its place, and then:
      </p>
      <p>
        Within the result, all occurrences of "xxx/../" or "/." are
        recursively removed, where xxx, ".." and "." are complete
        path elements.
      </p>
    </blockquote>
    <p>
      The algorithm may not be perfect in its handling of "." and
      "..", but it applies to any number of slashes.
    </p>
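    <p>
      A rough Python sketch of the quoted rules follows, to show
      how the slash-counting step works for any number of slashes.
      It is one reading of the text, not a reference
      implementation: the function names are invented, the handling
      of "." and ".." is simplified, and the case where the context
      contains no run of the required length is not covered by the
      quotation, so the sketch assumes the partial URI then starts
      at the top of the hierarchy.
    </p>

```python
import re


def remove_dots(path):
    """Drop "." elements and cancel "xxx/.." pairs (simplified)."""
    out = []
    for seg in path.split("/"):
        if seg == ".":
            continue
        if seg == ".." and out and out[-1] not in ("", ".."):
            out.pop()
            continue
        out.append(seg)
    return "/".join(out)


def resolve(context, partial):
    """Resolve a partial URI against a context URI, RFC1630 style."""
    # A partial URI that carries its own scheme is already absolute.
    if re.match(r"[A-Za-z][A-Za-z0-9+.-]*:", partial):
        return partial
    scheme, _, rest = context.partition(":")
    n = len(partial) - len(partial.lstrip("/"))   # leading slashes
    if n:
        # Every run of consecutive slashes in the context: (position, length).
        runs = [(m.start(), len(m.group())) for m in re.finditer(r"/+", rest)]
        prefix = ""   # assumption: no matching run means start from the top
        for i, (pos, length) in enumerate(runs):
            longer_to_right = any(later > n for _, later in runs[i + 1:])
            if length == n and not longer_to_right:
                prefix = rest[:pos]
                break
        result = prefix + partial
    else:
        # Replace whatever follows the rightmost slash of the context.
        result = rest[:rest.rfind("/") + 1] + partial
    return scheme + ":" + remove_dots(result)


print(resolve("phone://1/617/253-5708", "/800/123-4567"))
# prints: phone://1/800/123-4567
# Hypothetical planetary code "3" introduced by a triple slash:
print(resolve("phone:///3//1/617/253-5708", "//41/22/767-6111"))
# prints: phone:///3//41/22/767-6111
```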
    <h3>
      <a name="matrix" id="matrix">Matrix spaces and Semicolons</a>
    </h3>
    <p>
      There are a lot of web sites in which documents -- often
      virtual documents -- vary along several dimensions. They are
      naturally arranged not on a tree but on a matrix. The URI for
      a map, for example, might be:
    </p>
    <pre>
         //moremaps.com/map/color;lat=50;long=20;scale=32000
</pre>
    <p>
      (I had an idea to make a special form of relative URIs for
      these. See <a href="MatrixURIs.html">Matrix URIs</a> for the
      idea, not a feature of the web as of 2001.)
    </p>
    <h2>
      <a name="Properties" id="Properties">The properties of
      different URI schemes</a>
    </h2>
    <p>
      As noted above, the concept of a URI itself does not define
      the particular identity properties which exist between a URI
      and the resource associated with it. The <a href=
      "Axioms.html#Universality">axiom above</a> leaves the owner
      of the URI to define it. However, different URI schemes are
      defined and implemented in different ways, and this itself
      can impose restrictions on the mapping.
    </p>
    <p>
      Some of the properties of URI to resource mappings which vary
      from space to space were discussed in RFC1630. Some schemes
      (such as HTTP) leave answers up to the information publisher
      (URI owner).
    </p>
    <p>
      There is a lot of flexibility and growth to be gained by
      allowing any sort of URI, not just one from a particular
      scheme, in most circumstances. Similarly, one should not make
      assumptions about the schemes involved. Which scheme is used
      is one of the parameters of how the technology is used. The
      choice of type of URI in a practical use of a language is an
      important flexibility point.
    </p>
    <table border="1">
      <caption>
        Comparison of some URI schemes
      </caption>
      <tbody>
        <tr>
          <th>
            Scheme prefix
          </th>
          <th>
            Identity relationship: what does the URI correspond to?
          </th>
          <th>
            Reuse
          </th>
          <th>
            Persistence
          </th>
        </tr>
        <tr>
          <td>
            http:
          </td>
          <td>
            Generic document as defined by publisher. Generic
            URIs possible with content negotiation
          </td>
          <td>
            defined by publisher
          </td>
          <td>
            defined by publisher
          </td>
        </tr>
        <tr>
          <td>
            ftp:
          </td>
          <td>
            sequence of bits
          </td>
          <td>
            defined by publisher
          </td>
          <td>
            defined by publisher
          </td>
        </tr>
        <tr>
          <td>
            uuid:
          </td>
          <td>
            expectation of uniqueness has to be upheld by publisher
          </td>
          <td>
            defined by publisher
          </td>
          <td>
            (no dereference)
          </td>
        </tr>
        <tr>
          <td>
            sha1:
          </td>
          <td>
            sequence of bits
          </td>
          <td>
            mathematically extremely unlikely
          </td>
          <td>
            (no dereference)
          </td>
        </tr>
        <tr>
          <td>
            mid:
          </td>
          <td>
            Email message. Should be 1:1 modulo recoding, and
            header addition/deletion
          </td>
          <td>
            Can happen after 2 years according to the spec, but
            absolutely not recommended
          </td>
          <td>
            (no dereference)
          </td>
        </tr>
        <tr>
          <td>
            mailto:
          </td>
          <td>
            mailbox as used in email protocols
          </td>
          <td>
            Socially unacceptable
          </td>
          <td>
            (no dereference)
          </td>
        </tr>
        <tr>
          <td>
            telnet:
          </td>
          <td>
            connection endpoint for interactive login service
          </td>
          <td>
            defined by publisher
          </td>
          <td>
            (no dereference)
          </td>
        </tr>
      </tbody>
    </table>
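    <p>
      As a small illustration of the identity relationships in the
      <code>sha1:</code> and <code>uuid:</code> rows, the sketch
      below derives one identifier from a sequence of bits and
      generates another at random, using Python's standard
      <code>hashlib</code> and <code>uuid</code> modules. The exact
      spelling of the URIs (prefix followed by a hexadecimal digest
      or a UUID string) is an assumption made for this example.
    </p>

```python
import hashlib
import uuid


def sha1_uri(data):
    """Identify a sequence of bits by its SHA-1 digest (assumed URI form)."""
    return "sha1:" + hashlib.sha1(data).hexdigest()


def new_uuid_uri():
    """Mint a fresh identifier; uniqueness relies on the generator."""
    return "uuid:" + str(uuid.uuid4())


document = b"abc"
# The same bits always give the same identifier.
print(sha1_uri(document) == sha1_uri(b"abc"))
# Different bits give a different identifier, barring a collision.
print(sha1_uri(document) == sha1_uri(b"abd"))
print(new_uuid_uri())
```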
    <h3 id="it">
      How not to do it
    </h3>
    <p>
      A typical abuse of URIs which breaks this rule occurs when a
      document format provides one URI space for a "name" and one
      for a "location":
    </p>
    <p>
      &lt;a href="uri1" urn="foo"&gt;
    </p>
    <p>
      or for example the SGML reference to a "public identifier"
      and a "system identifier".
    </p>
    <p>
      The Web way is to have a reference to one URI. If in the same
      document you want to include other information, such as
      identifiers which are in some ways equivalent, then you embed
      that in your document as <a href="Metadata.html">metadata</a>,
      to be discussed later.
    </p>
    <p>
      That allows the exact relationship to be expressed without
      ambiguity, with much more power and generality, and with
      consistency across applications.
    </p>
    <h3 id="also">
      See also
    </h3>
    <ul>
      <li>
        <a href="/Provider/Style/URI.html">Cool URIs don't
        change</a> - persistence in the HTTP space
      </li>
      <li>
        <a href="AxiomsHAF">Hall of Flame</a> -- how not to do it
      </li>
    </ul>
    <hr />
    <p>
      <small>$Id: Axioms.html,v 1.36 2009/03/17 17:25:35 timbl Exp
      $</small>
    </p>
    <p>
      <a href="Fragment.html">Next: Fragment Identifiers</a>
    </p>
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
  </body>
</html>