<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content=
"HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
<title>
Putting Government Data online - Design Issues
</title>
<link rel="Stylesheet" href="di.css" type="text/css" />
<meta http-equiv="Content-Type" content="text/html" />
</head>
<body bgcolor="#DDFFDD" text="#000000">
<address>
Tim Berners-Lee<br />
Date: 2009-06, last change: $Date: 2009/06/30 15:49:50
$<br />
Status: personal view only. Editing status: Good enough for
folk. Notes after talking with various people in UK and US
governments who would like to put data on the web and want to
know the next steps.
</address>
<p>
<a href="./">Up to Design Issues</a>
</p>
<hr />
<h1>
Putting Government Data online
</h1>
<h4>
Abstract
</h4>
<p class="abstract">
Government data is being put online to increase
accountability, to contribute valuable information about the
world, and to enable government, the country, and the world
to function more efficiently. All of these purposes are
served by putting the information on the Web as Linked Data.
Start with the "low-hanging fruit". Whatever else, the raw
data should be made available as soon as possible.
Preferably, it should be put up as Linked Data. As a third
priority, it should be linked to other sources. As a lower
priority, nice user interfaces should be made to it -- if
interested communities outside government have not already
done it. The Linked Data technology, unlike any other
technology, allows any data communication to be composed of
many mixed vocabularies. Each vocabulary is from a community,
be it international, national, state or local; or specific to
an industry sector. This optimizes the usual trade-off
between the expense and difficulty of getting wide agreement,
and the practicality of working in a smaller community.
Effort toward interoperability can be spent where most
needed, making the evolution with time smoother and more
productive.
</p>
<h2>
Introduction
</h2>
<p>
This, 2009, is the year for putting government data online.
Both <a href=
"http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/">
US</a> and <a href=
"http://www.cabinetoffice.gov.uk/newsroom/news_releases/2009/090610_web.aspx">
UK</a> governments made public commitments toward open data.
The <a href=
"http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">
TED talk on Linked Data</a> was in February. Groups from the
<a href=
"http://www.guardian.co.uk/technology/free-our-data">Guardian</a>
to the <a href="http://www.sunlightfoundation.com/">Sunlight
Foundation</a> had already been pushing for it for a long
time. People like Watchdog.net, mysociety.org, and
govtrack.us had been pushing by publishing government data
themselves in various formats, including Linked Data.
</p>
<p>
So if you want to do this, what should you do? This article
addresses this question very briefly, and makes a set of
points which will probably be outdated by later developments,
but which answer a set of relevant questions, asked or not.
</p>
<h2>
Using Linked Data as the interconnection bus
</h2>
<p>
Government data is typically put online for three reasons:
</p>
<ol>
<li>Increasing citizen awareness of government functions to
enable greater accountability;
</li>
<li>Contributing valuable information about the world; and
</li>
<li>Enabling the government, the country, and the world to
function more efficiently.
</li>
</ol>
<p>
Each of these purposes is best served by using Linked Data
techniques.
</p>
<p>
In general Linked Data is:
</p>
<p>
<strong>Open</strong>: Linked Data is accessible through an
unlimited variety of applications because it
is expressed in open, non-proprietary formats.
</p>
<p>
<strong>Modular</strong>: Linked Data can be combined
(mashed-up) with any other piece of Linked Data. For example,
government data on health care expenditures for a given
geographical area can be combined with other data about the
characteristics of the population of that region in order to
assess effectiveness of the government programs. No advance
planning is required to integrate these data sources as long
as they both use Linked Data standards.
</p>
<p>
<strong>Scalable</strong>: It's easy to add more Linked Data
to what's already there, even when the terms and definitions
that are used change over time.
</p>
<p>
The essential message is that whatever data format people
want the data in, and whatever format they give it to you in,
you use the RDF model as the interconnection bus. That's
because RDF connects better than any other model.
</p>
<ul>
<li>It uses URIs and so allows linking of things and concepts
</li>
<li>It allows separate systems designed independently to be
later joined at the edges
</li>
<li>It allows interoperability to be added where
cost-effective
</li>
<li>It allows any data to be expressed in a mixture of
vocabularies.
</li>
</ul>
<p>
That's enough about why it is useful. That is elaborated
elsewhere, but it can be difficult for those familiar with
other technologies to understand the difference. Sometimes it
is better just to do it.
</p>
<h2>
Just do it
</h2>
<p>
The chances are quite high that the data your department or
agency runs on will be largely in relational databases, often
with a large amount in spreadsheets.
</p>
<p>
There are two philosophies to putting data on the web. The
top-down one is to make a corporate or national plan, by
getting committees together of all the interested parties,
and make a consistent set of terms (<em>ontology</em>) into
which everything fits. This takes so long that it is often
never finished, and in any case it does not get corporate or
national consensus in the end. The other method, which
experience recommends, is to do it bottom-up. A top-level
mandate is extremely valuable, but grass-roots action is
essential. Put the data up where it is: join it together later.
</p>
<p>
A wise and cautious step is to make a thorough inventory of
all the data you have, and figure out which dataset is going
to be most cost-effective to put up as linked data. However,
the survey may take longer than just doing it. So, take some
data.
</p>
<p>
A really important rule when considering which data could be
put on the web is not to threaten or disturb the systems and
the people who currently are responsible for that data. It
often takes years of negotiation to put together a given set
of data. The people involved may be very invested in it.
There are social as well as technical systems which have been
set up. So you leave the existing system undisturbed, and
find a way of extracting the data from it using existing
export or conversion facilities. You add a thin shim to
adapt the existing system to the standard.
</p>
<p>
Ok, so you have some data. What form is it in?
</p>
<h3>
Relational databases
</h3>
<p>
There are (2009) a number of open source tools for putting
relational databases up as Linked Data, <em>D2RServer</em>
and <em>Triplify</em> being two.
</p>
<p>
These each use a mapping file, in some language, to explain
how the database structure actually represents things and
their relationships.<sup><a href="#L419">1</a></sup>
</p>
<p>
You probably don't want to run a publicly available server
on your existing database unless it is generally set up for
high volume use. You might want to take a copy of the whole
database, and run a live semantic web server from it, or you
can generate the RDF once and make a copy of that to serve.
</p>
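<p>
As a rough sketch of the "generate the RDF once" route, the
Python below reads a copy of a table and applies a small
mapping to produce (subject, predicate, object) statements.
The table and column names, the id.example.gov URI scheme and
the shape of the mapping are all assumptions made for
illustration. This is not the mapping language of D2RServer
or Triplify, and writing the statements out as RDF/XML or N3
is left to an RDF library.
</p>
<pre>
```python
import sqlite3

# Hypothetical mapping: which table describes which kind of thing, which
# column identifies it, and which shared term each other column maps to.
MAPPING = {
    "table": "school",
    "key": "id",
    "uri_prefix": "http://id.example.gov/id/school/",  # assumed URI scheme
    "columns": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "latitude": "http://www.w3.org/2003/01/geo/wgs84_pos#lat",
        "longitude": "http://www.w3.org/2003/01/geo/wgs84_pos#long",
    },
}


def export_triples(conn, mapping):
    """Read every row of the mapped table and return a list of triples."""
    names = [mapping["key"]] + list(mapping["columns"])
    sql = "SELECT {} FROM {}".format(", ".join(names), mapping["table"])
    triples = []
    for row in conn.execute(sql):
        subject = mapping["uri_prefix"] + str(row[0])
        for column, value in zip(names[1:], row[1:]):
            if value is not None:  # skip empty cells rather than invent data
                triples.append((subject, mapping["columns"][column], value))
    return triples


# A throwaway in-memory database stands in for a copy of the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE school (id INTEGER, name TEXT, latitude REAL, longitude REAL)"
)
conn.execute("INSERT INTO school VALUES (7, 'Hill Road Primary', 51.5, -0.12)")
triples = export_triples(conn, MAPPING)
for triple in triples:
    print(triple)
```
</pre>
<p>
Because the script only reads from a copy, the existing
database and the people who look after it are left
undisturbed.
</p>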
<h4>
Using other people's terms
</h4>
<p>
It is wise and friendly and interoperable, when you publish
RDF data, to use terms other people are already sharing: for
example foaf:name for the name of a person, dc:title for the
title of something, or geo:lat and geo:long for latitude and
longitude<sup><a href="#L470">2</a></sup>. There are
a number of these, and the number is of course growing. The <a href=
"http://www.w3.org/2001/sw/interest/">Semantic Web Interest
Group</a> is a community which can help you find them: there
are also online tools such as <a href=
"http://swoogle.umbc.edu/">Swoogle</a>, Sindice, etc.
</p>
<h3>
Spreadsheets
</h3>
<p>
In many organizations a surprising amount of information,
sometimes critical information, is emailed around in
spreadsheets. Much of the early recovery.gov data was
published in spreadsheet form. Some of these are raw tables,
with a header in the top row. These are close to raw data.
You can export them as a comma-separated (or tab-separated)
file, CSV. Others are spreadsheets with a lot of
substructure, and little headings and notes all over them for
the human user. These are less easy to convert.
</p>
<p>
There are a number of <a href=
"http://esw.w3.org/topic/ConverterToRdf">tools</a> for
converting the format of a spreadsheet, typically in CSV
form, into RDF.
</p>
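<p>
For a raw table with a header in the top row, the conversion
can be very small. The sketch below, in Python, assumes a CSV
export with an id column and maps the other columns onto
shared terms (dc:title, geo:lat, geo:long). The column names,
the sample rows and the id.example.gov URIs are invented for
the example, and the output is a plain list of (subject,
predicate, object) statements rather than a finished RDF
file.
</p>
<pre>
```python
import csv
import io

# Shared vocabulary terms; the column-to-term table itself is an assumption.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
TERMS = {"title": DC_TITLE, "lat": GEO + "lat", "long": GEO + "long"}


def csv_to_triples(text, uri_prefix, key="id", terms=TERMS):
    """Turn each CSV row into triples about one thing, named by its key column."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = uri_prefix + row[key]
        for column, predicate in terms.items():
            value = row.get(column)
            if value:  # empty cells produce no statement
                triples.append((subject, predicate, value))
    return triples


sample = "id,title,lat,long\n12,Central Library,51.75,-1.26\n13,Branch Library,,\n"
rows = csv_to_triples(sample, "http://id.example.gov/id/library/")
for triple in rows:
    print(triple)
```
</pre>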
<h3>
XML
</h3>
<p>
If you have existing data in XML, first, put that XML up on
the web while you think. Then, figure out what the XML is
about, what things and what relationships. Then, commission
or write a program, possibly a simple script, maybe written
in XSLT, or your favorite scripting language, to convert each
XML file into RDF. You might need to add a file which points
to all the things you have data about, if they are not
already linked.
</p>
<h3>
Random application formats
</h3>
<p>
Ok, so your data is not in any of the above forms. It is in a
proprietary format, or managed by a proprietary program. But
there is some way you can get at it. So someone will have to
write a program somewhere, to get it out, and convert it to
one of the Linked Data standard forms.
</p>
<p>
(It is actually fairly simple. First, you think of what
things the data is about. You make up URIs for those things.
Suppose for example your data is about books and shelves. You
decide the URI for the books will be
http://id.example.com/id/isbn/123457890 and the URIs for
shelves will be like http://id.example.com/id/shelf/746 .
Then you write a (CGI) script which, when given a URI like
that, extracts the data about the book (including which
shelf it is on) and outputs it, or similarly for the shelf
(including a list of the books on the shelf). It outputs it
in RDF/XML or N3. That script is your web server of virtual
linked data.)
</p>
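<p>
The core of such a script might look like the Python sketch
below. The dictionary stands in for whatever the proprietary
system really holds, and the two terms under
http://id.example.com/terms/ are made up for the example; a
real script would also serialize the statements as RDF/XML or
N3 and deal with the HTTP side.
</p>
<pre>
```python
# Stand-in for the proprietary system: which shelf each book is on.
BOOK_SHELF = {"123457890": "746", "987654321": "746"}

BASE = "http://id.example.com/id/"
ON_SHELF = "http://id.example.com/terms/onShelf"  # hypothetical term
HOLDS = "http://id.example.com/terms/holds"       # hypothetical term


def describe(uri):
    """Return the triples about the book or shelf named by uri, or None."""
    if not uri.startswith(BASE):
        return None
    kind, _, ident = uri[len(BASE):].partition("/")
    if kind == "isbn" and ident in BOOK_SHELF:
        # A book links to the shelf it is on.
        return [(uri, ON_SHELF, BASE + "shelf/" + BOOK_SHELF[ident])]
    if kind == "shelf" and ident in BOOK_SHELF.values():
        # A shelf links back to every book on it.
        return [
            (uri, HOLDS, BASE + "isbn/" + isbn)
            for isbn, shelf in sorted(BOOK_SHELF.items())
            if shelf == ident
        ]
    return None


print(describe(BASE + "isbn/123457890"))
print(describe(BASE + "shelf/746"))
```
</pre>
<p>
Because the book points at the shelf and the shelf points
back at its books, a client can follow the links in either
direction.
</p>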
<h3>
Existing Web Site
</h3>
<p>
If you have an existing web site with, maybe, a page about
each thing, there is an easy way of putting the data in those
pages into Linked Data. You can change the scripts which
generate the site so that the data which is behind each page
is in fact put into the page so that it can be re-extracted
by others as data. The technology to do this is called
<a href="http://rdfa.info/">RDFa</a> <sup><a href=
"#L451">3</a></sup>. An alternative is for each web page
to have a parallel page which has the data in RDF/XML.
<sup><a href="#L454">4</a></sup>
</p>
<h2>
Giving access to data
</h2>
<p>
Ok, so you have your data in RDF as Linked Data. Now what?
</p>
<h3>
Index it
</h3>
<p>
The semantic web toolkit includes the SPARQL query language
which allows a client anywhere on the net to query a SPARQL
service. Some methods of publishing data, like D2RServer,
provide a built-in SPARQL service. If you have generated a
bunch of linked data, then there are various products, free
or commercial, which will scoop it up into a "triple store"
and provide a SPARQL service.
</p>
<p>
A SPARQL service is a generally useful tool for technically
aware users. Many clients and analytical tools just use a
SPARQL server. A SPARQL server looks for patterns in the data
and, for each match, outputs what it found in one of a
number of formats, including constructed RDF, XML and, in
some cases, JSON, and maybe even CSV.
</p>
<h3>
Generating XML with SPARQL
</h3>
<p>
SPARQL, then, can be used as an RDF to XML converter. You
amass a heap of linked data. Then you think of a combination
of data, involving connections across different data. There
is a SPARQL query for that data with the results expressed in
XML. That SPARQL query can be encoded into a long URI, a URI
for a virtual XML document for that particular view.
</p>
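<p>
Encoding a query into a URI needs nothing more than the
standard library. In the Python sketch below the endpoint
address is hypothetical and the query is deliberately
trivial; a real one would name specific terms. The only thing
relied on is that the SPARQL protocol carries the query text
in a parameter called "query".
</p>
<pre>
```python
from urllib.parse import parse_qs, urlencode, urlsplit

ENDPOINT = "http://data.example.gov/sparql"  # hypothetical endpoint

# A deliberately simple query; it just lists ten statements.
QUERY = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""


def query_uri(endpoint, sparql):
    """Encode a SPARQL query into one URI using the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": sparql})


uri = query_uri(ENDPOINT, QUERY)
print(uri)

# Decoding the URI again gives back exactly the query that went in.
decoded = parse_qs(urlsplit(uri).query)["query"][0]
```
</pre>
<p>
The resulting URI can be bookmarked, linked to, or fetched by
any HTTP client, which is what makes it behave like a virtual
document.
</p>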
<h3>
Generating CSV files and JSON
</h3>
<p>
Some SPARQL servers also support JSON, and sometimes CSV, as
an output format. JSON in particular is easy to use in Web
Applications.
</p>
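<p>
To give an idea of how little work the JSON route is for an
application, the Python sketch below flattens a result
document in the SPARQL JSON results layout (a "head" listing
the variables and a list of "bindings") into plain rows. The
town names and amounts are made up, and the helper name
rows_from_results is ours.
</p>
<pre>
```python
import json

# A small document in the SPARQL JSON results layout; the values are made up.
RESPONSE = """
{
  "head": {"vars": ["town", "amount"]},
  "results": {"bindings": [
    {"town": {"type": "literal", "value": "Springfield"},
     "amount": {"type": "literal", "value": "120000"}},
    {"town": {"type": "literal", "value": "Shelbyville"}}
  ]}
}
"""


def rows_from_results(text):
    """Flatten SPARQL JSON results into one plain dict per solution."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    rows = []
    for binding in doc["results"]["bindings"]:
        # A variable may be unbound in a given solution; map it to None.
        rows.append(
            {v: (binding[v]["value"] if v in binding else None) for v in variables}
        )
    return rows


for row in rows_from_results(RESPONSE):
    print(row)
```
</pre>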
<h3>
Generating nice web pages
</h3>
<p>
The priority first is to get raw data onto the net, and
preferably converted into Linked Data form. This is partly
because there may be other sites, commercial or not, who pick
it up and make great interfaces to that data. Of course there
are times when the government site must provide an easy human
interface for ordinary users to access the data.
</p>
<p>
There are many routes to pretty HTML for real users. Tools
like Exhibit provide faceted browser views, given a
configuration set up by the web master, for example.
</p>
<p>
Webmasters can run scripts in languages (not standardized
yet) like XSPARQL or N3 rules, or write custom code in their
favorite programming language such as PHP, Python, Ruby, or
server-side JavaScript.
</p>
<p>
Note, though, that there are two ways in which a department
or agency web site can never be expected to compete with
external sites. The first is that there are as yet no user
interface techniques which allow a normal user to create
their own query (though tools like Tabulator are getting
close).
</p>
<p>
The second is that an external site will add value to the
data by joining it to other data from different sites for a
particular purpose. If the Department of Transport publishes
road accident data, a cycling site can select the cycle
accident subset and publish it as a map, adding cycle routes,
hills, and cycle shops. An agency publishes data about the
amount of money given to different towns; another maps it
against the per capita income levels in those towns. And so
on in uncountable permutations.
</p>
<p>
An informal random sample of some public feedback suggests
that there are users who would prefer each of these formats
above, so a system which generates them automatically is
clearly called for.
</p>
<h2>
Metadata
</h2>
<p>
When you write or generate a small RDF file for each dataset
exported, the results can be harvested as more useful linked
data to form a catalog. Like the data, this can be
distributed in the form of linked data, and also sucked into a
repository to be indexed and SPARQLed. Remember that, as with
the data, RDF allows you to mix vocabularies, so you can
record everything you or others may feel is important about
the datasets. This provenance information is very valuable.
It clearly is one of the many areas this note touches on
about which much more could be said.
</p>
<p>
Neither does this note really address licensing issues. In the
US, government data is generally in the Public Domain. It is
good to state the fact that a given resource has a given
license in a machine-readable way. The Creative Commons
cc:license term is appropriate. Creative Commons have also
produced a "CC0" waiver which disclaims all rights
appropriately (and where possible) for each country.
</p>
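<p>
A catalog entry of this kind can be very small. The Python
sketch below builds the statements for one dataset, giving it
a dc:title and a cc:license that points at the CC0 waiver.
The dataset URI and title are invented, the choice of CC0 as
the default is only an example, and the statements would
still need to be written out as RDF/XML or N3.
</p>
<pre>
```python
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
CC_LICENSE = "http://creativecommons.org/ns#license"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def catalog_entry(dataset_uri, title, license_uri=CC0):
    """Return the metadata statements describing one exported dataset."""
    return [
        (dataset_uri, DC_TITLE, title),
        (dataset_uri, CC_LICENSE, license_uri),
    ]


# Hypothetical dataset, used only to show the shape of the result.
entry = catalog_entry(
    "http://data.example.gov/dataset/road-accidents",
    "Road accidents",
)
for triple in entry:
    print(triple)
```
</pre>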
<h2>
Privacy
</h2>
<p>
A very common and important concern is the privacy of data
which contains personally identifiable information (PII). This
article does not suggest that all data should be made public,
nor does it discuss issues with anonymisation of data.
Systems where PII is an issue will probably not be an early
choice when selecting those to put on the web. However, in
cases in which these issues have already been resolved and
the data is already public but not in the standard form,
converting it to Linked Data is an excellent idea. In
general, new government systems should be built to be aware
of the provenance of the data they use, and of the
appropriate use to which it may be put. But the design of
these <a href=
"http://dig.csail.mit.edu/2008/06/info-accountability-cacm-weitzner.pdf">
accountable systems</a> is another topic we do not have space
for here.
</p>
<h2>
Conclusion
</h2>
<p>
This brief note is too short to go into great detail, and has
ignored many important topics. It has stressed the practical
technical steps. Deeper information, about techniques and
also about the social issues and challenges, is being
produced frequently elsewhere. Many cities have Semantic Web
gatherings or <a href="http://semweb.meetup.com/">meetup
groups</a>, which can be a source of mutual support for those
involved in or interested in the technology. The W3C eGov
Interest Group is an international group of people sharing
challenges and solutions.
</p>
<hr />
<h4>
Footnote: Do's and Don'ts
</h4>
<ul>
<li>Do pick URIs which are likely to be <a href=
"../Provider/Style/URI">persistent</a>
</li>
<li>Do put RDF metadata giving the license.
</li>
<li>Do use the RDF and SPARQL standards
</li>
<li>Make sure your human readable pages are <a href=
"http://www.w3.org/WAI">accessible</a>.
</li>
</ul>
<ul>
<li>Do NOT hide data files inside zip files unless they are
also available directly.
</li>
<li>Do NOT put data up in proprietary formats.
</li>
<li>Do NOT wait until you have a complete schema or ontology
to publish data.
</li>
<li>Do NOT seek to replace existing data systems.
</li>
</ul>
<p>
<a name="L419" id="L419">[1]</a> D2RServer will generate a
default mapping file, which will not make a very good RDF
graph. Browsing the resulting RDF with an RDF browser (such
as Tabulator) will however often show up the deficiencies and
suggest improvements.
</p>
<p>
<a name="L470" id="L470">[2]</a> WGS84 latitude and
longitude, like you get from a normal GPS unit. (<a href=
"http://www.w3.org/2003/01/geo/">more</a>)
</p>
<p>
<a name="L451" id="L451">[3]</a> RDFa is used, for example,
in the UK <a href=
"http://www.civilservice.gov.uk/jobs/index.aspx">Civil
Service Jobs</a> web site. (<a href=
"http://www.civilservice.gov.uk/jobs/careers-detail.aspx?JobId=4730">example</a>)
</p>
<p>
<a name="L454" id="L454">[4]</a> Separate RDF/XML web pages
are used, for example, in the <a href=
"http://www.bbc.co.uk/programmes">BBC programmes</a> data.
Here content negotiation gives RDF/XML to data clients, and
HTML to document browsers. (<a href=
"http://www.bbc.co.uk/programmes/genres/comedy#genre">example</a>)
</p>
<h2>
References and Resources
</h2>
<ul>
<li>
<a href=
"http://www.thenationaldialogue.org/ideas/linked-open-data">
Linked Open Data</a>, in "The National Dialogue" about US
recovery transparency.
</li>
<li>
<a href=
"http://ShowUsABetterWay.com/">ShowUsABetterWay.com</a>
(UK)
</li>
<li>
<a href=
"http://www.showusabetterway.co.uk/call/data.html">Example
UK Data available for reuse</a>
</li>
<li>
<a href=
"http://TheNationalDialog.org/">TheNationalDialog</a>.org
(US)
</li>
<li>
<a href="http://www.whitehouse.gov/open/">Open Government
Initiative</a> (US)
</li>
<li>
<a href=
"http://www.cabinetoffice.gov.uk/reports/power_of_information.aspx">
The Power of Information Taskforce Report</a> (UK Gov) one
of whose recommendations is linked government data
</li>
<li>
<a href="http://www.w3.org/2007/eGov/">eGovernment at
W3C</a>
</li>
<li>
<a href="http://www.w3.org/2007/eGov/IG/">W3C eGovernment
Interest Group</a>
</li>
<li>
<a href="http://www.w3.org/TR/egov-improving/">Improving
Access to Government through Better Use of the Web</a>, W3C
eGov IG
</li>
<li>
<a href=
"http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/">
Transparency and Open Government</a>, Memorandum for the
Heads of Executive Departments and Agencies, Barack Obama,
2009-01-21
</li>
<li>
<a href="http://eprints.ecs.soton.ac.uk/14429/">Paper on
the lessons from the UK AKTivePSI project</a>
</li>
<li>
<a href="http://esw.w3.org/topic/SemanticWebTools">Semantic
Web Development Tools</a>, eSW Wiki.
</li>
<li>
<a href="http://esw.w3.org/topic/ConverterToRdf">Tools to
convert data into RDF</a>, in eSW Wiki. Don't just look in
the wiki for things -- add things you have found!
</li>
<li>
<a href="http://rdfa.info/">RDFA.info</a> a resource about
RDFa. Ben Adida.
</li>
</ul>
<h4>
Acknowledgements
</h4>
<p>
<small>Thanks for input to this article from Nigel Shadbolt
and Danny Weitzner. Thanks also to the chairs (John Sheridan
and Kevin Novak) and members of the W3C eGov interest group,
and all those in UK and US governments with whom we have
discussed these issues at these early stages.</small>
</p>
<hr />
<p>
<a href="Overview.html">Up to Design Issues</a>
</p>
<p>
<a href="../People/Berners-Lee">Tim BL</a>
</p>
</body>
</html>