HTTP-URI
45.5 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
<title>
What do HTTP URIs Identify? - Design Issues
</title>
<link rel="Stylesheet" href="di.css" type="text/css" />
<meta http-equiv="Content-Type" content=
"text/html; charset=us-ascii" />
</head>
<body bgcolor="#DDFFDD" text="#000000" lang="en" xml:lang="en">
<address>
Tim Berners-Lee<br />
Date: 2002-07-27, last change: $Date: 2007/01/15 20:05:15
$<br />
Status: personal view only. Editing status: first draft. This
was a result of my being in a minority with this opinion on
the Technical Architecture Group, and yet finding it the only
one I could accept. This is related to TAG issue
HTTPRange-14.
</address>
<p>
<a href="./">Up to Design Issues</a>
</p>
<p>
<strong>Note: (2006). This architectural question has now
been <a href=
"http://lists.w3.org/Archives/Public/www-tag/2005Jun/0039.html">
decided</a> by the W3C TAG, in a compromise which I think
works quite well, and is described in a <a href=
"HTTP-URI2">later short note</a> and a TAG finding.</strong>
</p>
<hr />
<h1>
What do HTTP URIs Identify?
</h1>
<h3>
Background Note
</h3>
<p>
This question has been addressed only vaguely in the
specifications. However, the lack of very concise logical
definition of such things had not been a problem, until the
formal systems started to use them. There were no formal
systems addressing this sort of issue (as far as I know,
except for Dan Connolly's Larch work [@@]), until the
<a href="/2001/sw">Semantic Web</a> introduced languages such
as RDF which have well-defined logical properties and are
used to describe (among other things) web operations.
</p>
<p>
The efforts of the <a href="/2001/tag">Technical Architecture
Group</a> to create an architecture document with common
terms highlighted this problem. (It demonstrates the
ambiguity of natural language that no significant problem had
been noticed over the past decade, even though the original
author or HTTP , and later co-author of HTTP 1.1 who also did
his PhD thesis on an analysis of the web, and both of whom
have worked with Web protocols ever since, had had
conflicting ideas of what the various terms actually mean.)
</p>
<p>
This document explains why the author find it difficult to
work in the alternative proposed philosophies. If it
misrepresents those others' arguments, then it fails, for
which I apologize in advance and will endeavor to correct.
</p>
<h2>
1. Web Concepts as here proposed
</h2>
<p>
The WWW is a space of information objects. The URI was
originally called a UDI, and originally all URIs identified
information objects. Now, URI schemes exist which identify
more or less anything (e.g. UUIDs) or electronic mailboxes
(mailto:) but is we look purely at HTTP URIs, they define a
web of information objects. Information objects -- perhaps in
Cyc terms <a href="">ConceptualWorks</a> -- are normally
things which
</p>
<ul>
<li>Carry some sort of message, and
</li>
<li>Can be represented, to a greater or lesser authenticity,
in bits
</li>
</ul>
<p>
I want to make it clear that such things are generic (See
<a href="/DesignIssues/Generic">Generic Resources)</a> --
while they are documents, they generally are abstractions
which may have many different bit representations, as a
function of, for example:
</p>
<ul>
<li>Time -- the contents can vary with revision --
</li>
<li>Content-type in which the bits are encoded
</li>
<li>Natural language in which a human-readable document is
written
</li>
<li>Machine language in which a machine-processable document
is written
</li>
<li>and a few more
</li>
</ul>
<p>
but the philosophy is that an HTTP URI may identify something
with a vagueness as to the dimensions above, but it still
must be used to refer to a unique conceptual object whose
various representations have a very large a mount in common.
Formally, it is the publisher which defines the what an HTTP
URI identifies, and so one should look to the publisher for a
commitment as to the exact nature of the identity along these
axes.
</p>
<p>
I'm going to refer to this as a <strong>document</strong>,
because it needs a term and that is the best I have to date,
but the reader should be sure to realize that this does not
mean a conventional office document, it can be for example
</p>
<ul>
<li>A poem
</li>
<li>An order for ball bearings
</li>
<li>A painting
</li>
<li>A Movie
</li>
<li>A review of a movie
</li>
<li>A sound clip
</li>
<li>A record of the temperature of the furnace
</li>
<li>An array a million integers, all zero
</li>
</ul>
<p>
and so on, as limited only by our imagination.
</p>
<p>
The Web works because, given an HTTP URI, one can in a large
number of cases, get a representation of the document. For a
human readable document, the person is presented with the
information by virtue of some gadget which is given the bits
of a representation. In the case of a hypertext document, a
reference to another document is encoded such that, upon user
request, the referenced document can in turn be automatically
presented. In the case of a machine-readable document,
identifiers of concepts, being HTTP URIs, will often allow
definitive reference information about those concepts to be
pulled in to guide further actions.
</p>
<p>
The web, then, is made of documents as the internet is made
of cables and routers. The documents can be about anything,
so when we move to talk about the contents of documents we
break away from talking about information space and the whole
universe of human -- and machine -- discourse is open to us.
Web pages can compare a renaissance choral works with jazz
pop hits, and discuss whether pigs have wings.
Machine-processable documents can encode information about
shoes, and ships, and sealing-wax. Until recently, the
Internet protocol standards out of which the Web is built had
little to say about such things. They were concerned only
with the human-readable side, so it was people, reading
natural language (not internet specs) who formed and
communicated the concepts at this level. Nowadays, however,
semantic web languages allow information to be expressed not
only about URIs, TCP ports and documents, but also about
arbitrary concepts - the shoes, and ships and sealing wax,
and whether pigs have wings. Simple semantic web application
allow one to order shoes and travel on ships, and determine
that, given the data, pigs do not have wings.
</p>
<p>
For these purposes it is of course quite essential to
distinguish between something described by a document and the
document itself. Now that we -- for the first time -- have
not only internet protocols which can talk about document but
also those which talk about real world things, we must either
distinguish or be hopelessly fuzzy.
</p>
<p>
And is this bad, is it an inhibition to have to work our way
though documents before we can talk about whatever we desire?
I would argue not, because it is very important not to lose
track of the reasons for our taking and processing any piece
of information. The process of publishing and reading is a
real social process between social entities, not mechanical
agents. To be socially responsible, to be able to handle
trust, and so on, we must be aware of these operations. The
difference between a car and what some web page says about it
is crucial - not only when you are buying a car.
</p>
<p>
Some have opined that the abstraction of the document is
nonsense, and all that exists, when a web page describes a
car, is the car and various representations of it, the HTML,
PNG and GIF bit streams. This is however very weak in my
opinion. The various representations have much more in common
than simply the car. And the relationship to the car can be
many and varied: home page, picture, catalog entry, invoice,
remote control panel, weblog, and so on. The document itself
is an important part of society - to dismiss its existence is
to prevent us being aware of human and aspects of information
without which we are impoverished. By contrast, the
difference between different representations of the document
(GIF or PNG image for example) is very small, and the
relationship between versions of a document which changes
through time a very strong one.
</p>
<h2>
2. Trying out the Alternatives
</h2>
<p>
The folks who disagree with the model do so for a number of
different arguments. This article, therefore will have to
take them one by one but the ones which come to mind are as
follows:
</p>
<ol>
<li>
<a href="#L728">Every web page (or many of therm) are in
fact themselves representations of some abstract thing, and
the URI really identifies that</a> thing, not a document at
all.
</li>
<li>
<a href="#L876">There are many levels of identification
(representation as a set of bits, document, car which the
web page is about) and the URI publisher, as owner of the
URI, has the right to define it to mean whatever he or she
likes;</a>
</li>
<li>
<a href="#L883">Actually the URI has to, like in English,
identify these different things ambiguously. Machines have
to disambiguate using common sense and logic</a>
</li>
<li>
<a href="#L890">Actually the URI has to, like in English,
identify these different things ambiguously. Machines have
to disambiguate using the fact that different properties
will refer to different levels</a>.
</li>
<li>
<a href="#L897">Actually the URI has to, like in English,
identify these different things ambiguously. Machines have
to disambiguate using extra information which will be
provided in other ways along with the URI</a>
</li>
<li>
<a href="#L909">Actually the URI has to, like in English,
identify these different things ambiguously. Machines have
to disambiguate them by context: A catalog card will talk
about a document. A car catalog will talk about a car</a>.
</li>
<li>
<a href="#L920">They may have been used to identify
documents up till now, but for RDF and the Semantic Web, we
should change that and start to use them as the Dublin Core
and RDF Core groups have for abstract concepts</a>.
</li>
</ol>
<h3 id="L728">
2.1 Identify abstract things not documents
</h3>
<p>
Let's take the alternatives in order. These alternatives all
make sense. Each one, however, has problems I can't see any
way around when we consider them as a basis as
</p>
<p>
The first was,
</p>
<blockquote>
<p>
Every web page (or many of them) are in fact themselves
representations of some abstract thing, and the URI really
identifies that thing, not a document at all.
</p>
</blockquote>
<p>
Well, that wasn't the model I had when URIs were invented and
HTTP was written. However, let's see how it flies. If we
stick with the principle that a URI (or URIref) must
unambiguously identify the same thing in any context, then we
come to the conclusion that URIs can not identify the web
page. If a web page is about a car, then the URI can't be
used to refer to the web page.
</p>
<h4>
2.1.1 <a name="s2.1.1" id="s2.1.1">Same URI can identify a
web page and a car</a>
</h4>
<p>
What, a web page can't be a car? At this point a pedantic
line reasoning suggests that we should allow web pages and
cars to conceptually overlap, so that something can be both.
This is counterintuitive, as a web page is in common sense,
not a concrete object whereas a car is. But sure, we could
construct a mathematics in which we use the terms rather
specially and something can be at the same time a web page
and a car.
</p>
<p>
Frankly, this doesn't serve the social purpose of the
semantic web, to be able to deal with common sense concepts
and objects. A web page about a car and a car are in most
people's minds quite distinct (as I argue further below). A
philosophy in which they are identical does not allow me to
distinguish between them. not only conflicts with reality as
I see it, but also leaves us no way to make statements
individually about the two things.
</p>
<h4>
<img alt=
"A car has a different identifier -- and very different properties."
src="diagrams/http-uri-1.png" />
</h4>
<h4>
2.1.2 <a name="identifies" id="identifies">The URI identifies
the car, not the web page</a>
</h4>
<p>
So lets fall back on the idea that the URI identifies the
<em>subject</em> of the web page, but not the web page
itself. This makes sense. We can build the semantic web on
top of that easily.
</p>
<p>
The problem with this is that there are a large number of
systems which already do use URIs to identify the document.
This is the whole metadata world. Think of a few:
</p>
<ul>
<li>The Dublin Core
</li>
<li>RSS
</li>
<li>The HTTP headers
</li>
<li>The Adobe XML system
</li>
<li>Access control systems
</li>
</ul>
<p>
(I'm sticking with the machine-processable languages as
examples because human-processable ones like HTML have a
level of ambiguity traditional in human natural language but
quite out of place in the WWW infrastructure -- or the
Semantic Web. You can argue that people say "I work for
w3.org" or "http://www.amazon.com/shrdlu?asin=314159265359"
is a great book, just as they happily say "<em>Moby Dick</em>
weighs over three thousand tons", "<em>Moby Dick</em> was
finished over a century ago" and "I left <em>Moby Dick</em>
on the beach" without expecting to be misunderstood. So we
won't use human language as a guide when defining
unambiguously the question of what a URI identifies. If we
want to do that on the Semantic Web, we will say "I work for
<em>the organization whose home page is</em>
http://www.ww3.org.)
</p>
<p>
Some argue the the URI which I associate with someone's home
page actually identifies that person. They argue that
conventionally people use the identifier to identify the
person. However, consider another page put together by
friends who found a photograph of the same person. A lot of
content filtering systems would collect that URI and put put
into their list. Even though the photo had many
representations which different devices could download using
content negotiation and/or CC/PP (color or black and white
and versions of different resolutions) the URI itself would
be listed as containing nudity. The public are very aware of
different works on the web, even though they have the same
topic.
</p>
<h4>
2.1.3 <a name="Indirect" id="Indirect">Indirect
identification</a>
</h4>
<p>
You can argue that a web page <em>indirectly</em> identifies
something, of course, and I am quite happy with that. If you
identify an organization as that which has home page
http://www.w3.org, then you are not saying that
http://www.w3.org/ itself is that organization. This scenario
is very very common, just as we identify people and things by
their "unambiguous properties": books by ISBN, people by
email address, and so forth. So long as we don't think that
the person <em>is</em> an email address, we are fine. Some
people have thought that in saying "An HTTP URI can't
identify an organization" I was ruling out this indirect
identification, but not so: I am very much in favor of it.
The whole SQL world, after all, only identified things
indirectly by a key property. This causes no contradiction.
Perhaps I should say "An HTTP URI can't directly identify an
organization". But by "identify" I mean "directly identify",
and "identity" is a fairly direct word and concept, so I will
stick with it.
</p>
<p>
Conclusion so far: the idea that a URI identifies the thing
the document is about doesn't work because we can only use a
URI to identify one thing and we have and already do use it
to identify documents on the web.
</p>
<h4>
2.1.4 <a name="argument" id="argument">The argument for HTTP
URIs identifying a Conceptual Work</a>
</h4>
<p>
So what's wrong with the URI being taken to identify whatever
the owner says?
</p>
<p>
Let's look at what we mean by <em>identifies</em>. When we
say there is identity, that means that there is some form of
sameness that we associate with the identifier. Now, for all
the philosophical argument, we can never test the identity of
an abstract thing. What we can test is a representation which
has been returned by the server when given that URI. When we
use aURI, and get back several possible representations of
it, then what expectation do we have about those
representations?
</p>
<p>
Take the test case that I see the web page which has a
picture of a car, and I see in the URI in the URI bar in the
browser. I email you the URI, "you see, the car is a
Toyota?". You click on the link. Your browser shows the same
URI as mine in the "URL bar" but you see a table of the car's
weight, length, height, color, and registration number. We
are confused. The web didn't work because you didn't get the
same information as me. I expected you to get the same
information, basically. That is how the Web works. That is
the expectation behind every hypertext link - that the
follower of the link should get basically the same
information as the person who made the link. I say,
"basically" because I would not have cared whether you saw or
JPEG or a GIF. It probably wouldn't have mattered if you had
seen a lower resolution or even black-and-white copy of the
picture. If you are visually impaired, you may have been able
to manage with a well-written description of the picture. But
the the essential information is the same, not just the
subject of the page.
</p>
<p>
So now we have put the four corners on the expectation we
have of a URI -- that all representations have essentially
the same <em>information content</em>. And what we mean by
"essentially" allows in fact some wriggle room, and in the
end it rests on a common understanding between publisher of
the information and quoter of the URI. The sameness we are
after is the sameness of information content. <em>That</em>
is what is identified by the URI. That is why we say that the
URI identifies that conceptual information content,
irrespective of its particular representation: the
<em>conceptual work</em>. Without that common understanding,
the web does not work.
</p>
<p>
Some people have said, "If we say that URIs identify people,
nothing breaks". But all the time they, day to day, rely on
sameness of the information things on the web, and use URIs
with that implicit assumption. As we formalize how the web
works, we have to make that assumption explicit.
</p>
<h3 id="L876">
2.2 Author definition
</h3>
<p>
So how can we break free of that line of reasoning? We can
try throwing away the rule that a URI identifies only one
thing.
</p>
<blockquote>
<p>
There are many levels of identification (representation as
a set of bits, document, car which the web page is about)
and the URI publisher, as owner of the URI, has the right
to define it to mean whatever he or she likes.
</p>
</blockquote>
<p>
Well, this one is tempting from the point of view that the
owner of an identifier should reign supreme when it comes to
saying what it identifies. It is quite a logically consistent
position to take. After all, isn't this the case with
<code>uuid</code>'s? And for a new scheme, this would be
interesting. How can we do it though, with HTTP? the problem
is an engineering one: I can't in practice use a URI until I
have some definitive information from the publisher as to
what it identifies.
</p>
<p>
2.2.1 Default
</p>
<p>
Why can't a URI default to identifying a web page until you
know otherwise? Because the web is open and you will never
know when you might lean some other information which will
make the default incorrect. (You can't use such "closed
world" reasoning).
</p>
<p>
2.2.2 Web operation
</p>
<p>
Why can't a URI identify a web page until you have done some
well-defined operation -- such as HTTP HEAD or GET -- and
checked for information in that? Well, that would certainly
work logically. Suppose we we define a return code or HTTP
header which means "abstract object requested". It would mean
that every web application which deals with web pages as web
pages would actually be working under an ambiguity, and RDF
processors could be programmed to look for that special
information. We can't retrofit the millions of web servers
out there, I assume.
</p>
<p>
I feel that there is a great benefit to fixing this question
at the spec level. Otherwise, what happens? I read a web
page, I like it and I am going to annotate it as being a
great one -- but first I have to find out whether the URI my
browser is used, conceptually by the author of the page, to
represent some abstract idea? Before I recommend the
<em>Vietnam War</em> page, I have to be careful I am not
recommending the Vietnam War.
</p>
<p>
There has been no way to do this before RDF, but then
similarly no real need for it. (What, is this just a problem
with RDF? No, it will happen with any webized knowledge
representation system.). We really need to have communication
in which two people use the same URI to mean the same thing.
If there
</p>
<p>
We could fix HTTP so that it would return me some extra
semantic headers explaining the whole thing. And in the case
that the URI was deemed to be some abstract thing, I would
not have the option of recommending the web page. Too bad: it
has no URI.
</p>
<p>
The authors of document
<http://www.w3.org/2000/10/rdf-tests/rdfcore/Manifest.rdf>
certainly thought that they could use
"http://www.w3.org/2000/10/rdf-tests/TestSchema/NegativeParserTest"
to identify an abstract thing which is a type of software
test. Now they have a choice as to what to make the server
return for them when I ask for it. It returns 404 "doesn't
match anything we have available". It can't really, because
HTTP doesn't allow one to return a class, only a document.
And if it were to return a document, then I wouldn't be able
to refer to that document without accidentally referring to
the class of negative parser tests.
</p>
<p>
So, we could change HTTP to make this work. We could make a
new form of redirect, <em>343 Abstract Object, please see . .
.</em>, which would tell the client that the thing requested
was abstract, and would suggest a document to read about it.
This avenue of argument is still outstanding. We could take
it. It isn't the status quo, but we could make changes in
HTTP if the community felt that this was they way to go.
</p>
<h3 id="L883">
2.3 Logic disambiguates
</h3>
<p>
Otherwise,we have to try another way of letting the URI mean
sometimes one thing and sometimes another. Here is another.
</p>
<blockquote>
<p>
Actually the URI has to, like in English, identify these
different things ambiguously. Machines have to disambiguate
using common sense and logic
</p>
</blockquote>
<p>
This is possible in theory. It is a mess. It fails
particularly spectacularly when a URI is used ambiguously to
refer to a web page and the thing that web page is about,
which happens to be another web page. <em>Anyone can write
anything about anything</em> is a Web motto, but here it
falls down. <em>Anyone can write anything about anything
except those things which might get confused with the
document they are writing</em>. It breaks the axiom that we
mean the same thing by a URI - in all contexts. (And RDF has
a model theory in which necessarily in any interpretation, a
symbol always denotes one thing).
</p>
<h3 id="L890">
2.4 Different Properties
</h3>
<blockquote>
<p>
Actually the URI has to, like in English, identify these
different things ambiguously. Machines have to disambiguate
using the fact that different properties will refer to
different levels.
</p>
</blockquote>
<p>
One way of getting here is to start by considering that HTTP
headers can be divided into those which refer to the
representation (or the document) and those that refer to,
say, a car or a donkey. We can look at all RDF properties and
other attributes in other languages and divide them in in
such a way. So, when I say "http://example.com/albert is a
color photo", I am referring to the representation; when I
say "http://example.com/albert used to work down the mill" I
am referring to the person; when I say
"http://example.com/albert was taken on a rainy day" I am
revering to the original photograph, which is basically the
representation of Albert.
</p>
<p>
This one has the problem when a web page refers to a web
page. It can still be pursued, by having different verbs for
talking about ownership of the web page and ownership of the
car. This is a classic example of the 2-level syndrome (see
also <em>Dictionaries in the Library</em>). The basic fallacy
is that you can make the system general by introducing a
second level - a new set of attributes, properties, or
whatever, which allow you to refer to the metadata of
something separately from the thing itself. These systems
either turn out to be just limited 2-level systems (like XML
and DTDs) or have to be extended to be recursive in some way
later on such that in fact the two levels become unnecessary.
</p>
<h3 id="L897">
2.5 Extra info with URI
</h3>
<blockquote>
<p>
Actually the URI has to, like in English, identify these
different things ambiguously. Machines have to disambiguate
using extra information which will be provided in other
ways along with the URI
</p>
</blockquote>
<p>
This twist now relies on sending extra information with a
URI. Effectively, the URI scheme has now failed to identify
anything by itself. Those most familiar URIs as used by HTML
sometimes suggested adding new attributes to the anchor tags
of HTML documents to disambiguate a reference. I guess it
would work if HTML anchors were the only uses of URIs. By
contrast, they are used in thousands of places and way, many
of which I am unaware. The architecture, however, is not that
way: the architecture of the WWW is that a URI is a global
unambiguous identifier. Not a URI and something else.
</p>
<p>
(The various designs such a WebDav's propfind which use HTTP
methods apart from GET to retreive information suffer from
this same problem. the information does not have a URI: it is
not on the web.)
</p>
<h3 id="L909">
2.6 Different meaning in different context
</h3>
<blockquote>
<p>
Actually the URI has to, like in English, identify these
different things ambiguously. Machines have to disambiguate
them by context: A catalog card will talk about a document.
A car catalog will talk about a car.
</p>
</blockquote>
<p>
This works in the short term, when the two contexts are
disjoint groups who do not need to communicate. It is in fact
the current state: the groups of people who use HTTP URIs to
talk about documents, and those who have just started to use
them to talk about abstract concepts haven't collided yet.
(Well, they have in my code. I need to be able to model the
metadata about an HTTP URI as that about a document, and it
being a class at the same time doesn't jive.)
</p>
<p>
It doesn't work in the long term because it breaks the axiom
that a URI must identify one thing,
</p>
<h3 id="L920">
2.7 Change it for the Semantics Web
</h3>
<blockquote>
<p>
They may have been used to identify documents up till now,
but for RDF and the Semantic Web, we should change that and
start to use them as the Dublin Core and RDF Core groups
have for abstract concepts.
</p>
</blockquote>
<p>
I think that we would have to design a new URI scheme before
we change things that much. That is tempting of course. But
then -- building a semantic web out of what we have is
tempting too. It was tempting to rehash TCP a little when
making HTTP. It wasn't practical, and we would have lost a
lot more than we would have gained. There is a lot to be said
for using common technology. We've got an infrastructure of
documents. We want to build an infrastructure of knowledge.
Let's build it using the documents. We might find that the
commonality with the web of human-readable information is a
boon.
</p>
<h3 id="L735">
2.8 Abandon any identification of abstract things
</h3>
<p>
An argument which surprised me is that yes, HTTP URIs
identify documents, but in fact the frgament identifier must
only be used to identify parts -- fragments -- of documents.
This means that RDF cannot in fact use HTTP URI schemes at
all. A completely different system would have to be put
together -- either a new set of URIs, or RDF conventions in
which the relationship to the part of a document in which
something was described became explicit. In N3 this would
like like
</p>
<p>
[ is rdf:referent of <#fmyCar> ] [ is rdf:referent of
<#color> ] [ is rdf:referent of <#blue> ]
</p>
<p>
Of course, languages would quickly generate special syntax
for this. Alternatively, the RDF system would built entirely
on the understanding that we were referring always to that
denoted by a given bit of document, not the bit of document
itself. This would mean that there would be no way for the
RDF system to refer to documents themselves directly.
</p>
<p>
This is actually a consistent way of working. It would be a
change only for those people who use RDF to talk about
documents as documents. We could change.
</p>
<h2>
<a name="L409" id="L409">3. Conclusion</a>
</h2>
<p>
I didn't have this thought out a few years ago. It has only
been in actually building a relatively formal system on top
of the web infrastructure that I have had to clarify these
concepts my own mind. I am forced to conclude that modeling
the HTTP part of the web as a web of abstract documents if
the only way to go which is practical and, by the
philosophical underpinnings of the WWW, tenable.
</p>
<p>
I apologize again if I have misunderstood or misrepresented
other's arguments in this process of this explanation of my
own position.
</p>
<p>
Tim Berners-Lee
</p>
<p>
2002-07-28Z
</p>
<hr />
<h3>
FAQ
</h3>
<p>
<em>Q: But surely, if a document is identified by a namespace
URI, then when we look up an RDF namespace will millions of
words in it we will have too long a document to be
practical!</em>
</p>
<p>
A: It is arguable, for such as situation, whether the
namespace itself is more cumbersome to manage than the
document is to deliver. You can make an analogy with
hypertext: Isn't the model of retrieving a document going to
be inefficient when the documents are huge? Answers are
twofold in each case,
</p>
<p>
Firstly, yes it is likely to be less convenient, but that is
no reason to skew which is a good engineering design for the
vast proportion of namespaces (or hypertext documents) which
are not huge.
</p>
<p>
Secondly, the HTTP protocol actually does have methods of
retrieving parts of a large document.
</p>
<p>
<em>Q: It seems strange that an HTTP URI should be limited to
referring to documents, but that all one has to add is this
little hash mark and suddenly you say it can be used to
identify anything.</em>
</p>
<p>
A: The hash is not a minor appendage to the URI: It is the
most significant piece of punctuation in the whole URIref.
The hash adds a whole new level of abstraction and
specification! It is true that in a hypertext page and that
page scrolled to a given point seem very similar. The same
applies to a graphic chart and an object within that chart,
especially when it is displayed in the context of the
original document. So I suppose it may be a shock when the
technique is used with a semantic web language to refer to
not the document, but something which the document discusses.
That does allow it to break out of the whole concept of
documents and into -- anything. But no one promised the
Semantic Web would be boring. :-)
</p>
<p>
<em>Q: I thought you said "anything should be able to have a
URI"?</em>
</p>
<p>
Yes, and it should. There is nothing in the URI spec to say
what an individual scheme should or should not be created to
identify. A new URI scheme could for example be ale to
identify anything. But here we are talking about HTTP URIs.
And remember that with semantic web languages, you can use a
URIref (very different from a URI) to identify anything, for
example with HTTP and RDF.
</p>
<p>
<em>Q: But what about CGI scripts? Surely you don't mean the
HTTP URI identifies the script?</em>
</p>
<p>
A: Of course not. When we talk about the "document"
identified by a URI it is very often an virtual document
produced by, for example, a CGI script. The URI identifies
the document on the web, with no regard to the process which
causes representations of it to be served.
</p>
<p>
<em>Q: Some HTTP URIs can be POSTed to. Can you still say
they identify documents?</em>
</p>
<p>
A: Well, some HTTP URIs can't be accessed at all, and some
access is not allowed, and yes, some URIs are not only
documents but also can be posted to. So they object is more
complex than simply a document. But that it has this extra
functionality doesn't make it any less a HTTP document
formally. Something can have extra features and still remain
in the same class of things.
</p>
<p>
<em>Q: What do you mean by "identify", anyway, in Model
Theory terms? (2003)</em>
</p>
<p>
The closest term used in Model Theory to the way I am using
<em>identify</em> is <em>denote</em>. Model theory analyses
communication and understanding by imagining a set of
<em>interpretations</em>, where an interpretation is a
mapping from a symbol to that which it denotes. Model
theorists and linguists tend to complain that one cannot talk
about the meaning of a term, as you can never know what
anyone means by anything, you can only see how they react. A
given agent may have many possible interpretations, but new
information the agent believes which mentions a symbol will
rule out interpretations with which are inconsistent with the
symbol. By the process of exchange of a lot of information,
one arrives at a state in which one behaves as though other
agents has the effectively the same set of interpretations.
Under these conditions, one can think of the thing
<em>identified</em> by the symbol in the community as being
the set of things denoted by the symbol in the
interpretations which agents in the community are left with.
There has been much more discussion of this process (which is
the essence of the writing of a standard and the purpose of
documents like this) in email on www-tag with Pat Hayes and
others in 2003.
</p>
<address>
The rest are from Aaron Swartz
</address>
<p>
<em>Q: Can you point to something in the spec that says HTTP
URIs must identify a document?</em>
</p>
<p>
There are many answers. I can point to things which could be
interpreted to say that. The HTTP spec defines resources as
<em>network data objects</em>. To me that "data" indicates
the information nature of the thing. It precludes, in most
people's minds, a car or the Andromeda Galaxy.
</p>
<p>
I could explain that, as I originally wrote the HTTP spec,
that was the author's intent.
</p>
<p>
But I think the fairest thing is to say that the spec was
written it was not sufficiently clear about this particular
ambiguity, and for reasons mentioned above, this hasn't been
a problem until now.
</p>
<p>
<em>Q: Isn't it a little weird to start making pronouncements
about the entire HTTP Web when neither the spec nor the other
TAG members agree?</em>
</p>
<p>
Pronouncements about the whole Web are really important where
they are needed. In that case the TAG has a duty to make
them. And so do I. It seems to me that this assumption is one
we have been implicitly making and are now breaking, in a way
which will make the semantic web either inconsistent or much
less efficient. The TAG members do not agree on this: that is
why they asked my to write this document. It is written as a
TAG action item about tag issue HTTPRange-14. Things get a
lot weirder than that. ;-)
</p>
<p>
<em>Q: Why do we need to use URI-refs to identify abstract
concepts in a protocol where we can get more information
about them? .I thought URIs were doing just fine. If we have
to resort to UUIDs to identify things, I'll get annoyed
because I won't be able to put them in my browser.</em>
</p>
<p>
Well, there you are... you want to be able to put something
in your browser, then you must have a representation of it.
So somewhere in the picture, representations aside, is a
ConceptualWork. If the ConceptualWork is important, then it
needs a URI, in my opinion. The alternatives are attractive
when you start to look at them, but each has a different
snag. I have tried to explain above.
</p>
<p>
<em>Q: How can you say that the Semantic Web can use the hash
mark to make a URI-ref identify anything when the URI RFC is
very clear that hash marks only work when you dereference the
document.</em>
</p>
<p>
I wouldn't say that hash marks "only work when you deference
a document" any more than your street address "only works
when I visit you", or your date of birth "only worked when
you were born". I can use your street address -- or your data
of birth -- to help identify you. What the spec defines is a
way of using this particular URI to get some information over
the Internet. The whole web works by what someone recently
referred to as a "confusion" between name and address. It
isn't a confusion. It is a connection between two pieces of
architecture without which the web would not be. Rethink. It
is primarily a name. We have made a way of looking it up. So
you don't have to look it up for the name to "work" as an
identifier. Just as you don't go and look it up when someone
quotes the RDF namespace -- it works because the same
identifier identifies the same thing in any context. Looked
up or not. The same thing is true for foo#bar. If the
document foo is never served, one can still (if one owns it)
talk about foo#bar with authority. It is of course good
practice to serve documents.
</p>
<p>
<em>Q: Are all Semantic Web agents going to start
dereferencing every document they hear about?</em>
</p>
<p>
No, any more than you have to dereference every hypertext
link you see.
</p>
<p>
<em>Q: Isn't the Semantic Web broken if we have to start
disagreeing with major specifications like this?</em>
</p>
<p>
This philosophy is quite consistent with the HTTP spec as it
is.
</p>
<h3>
Exercises
</h3>
<p>
1) What does "<a href=
"http://www.amazon.com/exec/obidos/ASIN/0679600108/qid=1027958807/sr=2-3/ref=sr_2_3/103-4363499-9407855">http://www.amazon.com/exec/obidos/ASIN/0679600108/qid=1027958807/sr=2-3/ref=sr_2_3/103-4363499-9407855</a>"
identify?
</p>
<ol>
<li>A whale
</li>
<li>"Moby Dick or the Whale" by Herman Melville
</li>
<li>A web page on Amazon offering a book for sale
</li>
<li>A URI string
</li>
<li>All the above
</li>
</ol>
<p>
When was the thing it identified last changed?
</p>
<p>
Have you read the thing it identifies?
</p>
<p>
2) What does "<a href=
"http://www.vrc.iastate.edu/magritte.gif">http://www.vrc.iastate.edu/magritte.gif</a>"
identify?
</p>
<ol>
<li>A pipe
</li>
<li>I don't know, but whatever it is it isn't not a pipe.
</li>
<li>A contradiction
</li>
<li>
<strong>A picture by Magritte</strong>
</li>
<li>
<strong>A photograph of a picture by Magritte</strong>
</li>
<li>
<strong>A representation as a series of 341632 bits in of a
photo of a painting</strong>
</li>
<li>Validly 4, 5 and 6 but not 1
</li>
</ol>
<p>
<img alt="Hint: This is not a pipe" src=
"http://www.vrc.iastate.edu/magritte.gif" />
</p>
<p>
3) What does "<a href=
"http://dm93.org/2002/03/dans-car-23423423">http://dm93.org/2002/03/dans-car-23423423"</a>
identify?
</p>
<ol>
<li>An inaccessible web page
</li>
<li>A black Toyota
</li>
</ol>
<p>
4) What does "<a href=
"http://dm93.org/y2002/myCar-232">http://dm93.org/y2002/myCar-232</a>"
identify?
</p>
<ol>
<li>A black toyota
</li>
<li>A web page
</li>
</ol>
<p>
When was the thing identified last changed?
</p>
<p>
What does the writing on Dan's car say?
</p>
<p>
Answers: 1:3. 2:7 Note here the web tolerates vagueness along
the axis of different representations of the same image, but
not of semantic level between the image and the pipe. 3:1;
4:2
</p>
<h3>
References
</h3>
<p>
@@@links
</p>
<ul>
<li>The huge discussion of this issue on www-tag@w3.org
</li>
<li>
<a href="http://www.textuality.com/tag/s1.1.html">Tim
Bray's text</a>
</li>
<li>RFC 1634 and points west
</li>
<li>Roy Fielding's short history of URI specifications
</li>
<li>Weaving the Web
</li>
<li>
<a href=
"http://www.cyc.com/cycdoc/vocab/info-vocab.html">Cyc's
page about Conceptual Works</a> cyc:ConceptualWork <a href=
"http://ilrt.org/discovery/chatlogs/rdfig/2002-07-31.html#T15-56-58-1">
proposed as what I mean by document by DanC</a>.
</li>
</ul>
<hr />
<p>
<a href="Overview.html">Up to Design Issues</a>
</p>
<p>
<a href="../People/Berners-Lee">Tim BL</a>
</p>
</body>
</html>