<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta name="generator" content=
    "HTML Tidy for Mac OS X (vers 31 October 2006 - Apple Inc. build 13), see www.w3.org" />
    <title>
      Putting Government Data online - Design Issues
    </title>
    <link rel="Stylesheet" href="di.css" type="text/css" />
    <meta http-equiv="Content-Type" content="text/html" />
  </head>
  <body bgcolor="#DDFFDD" text="#000000">
    <address>
      Tim Berners-Lee<br />
      Date: 2009-06, last change: $Date: 2009/06/30 15:49:50
      $<br />
      Status: personal view only. Editing status: Good enough for
      folk. Notes after talking with various people in UK and US
      governments who would like to put data on the web and want to
      know the next steps.
    </address>
    <p>
      <a href="./">Up to Design Issues</a>
    </p>
    <hr />
    <h1>
      Putting Government Data online
    </h1>
    <h4>
      Abstract
    </h4>
    <p class="abstract">
      Government data is being put online to increase
      accountability, contribute valuable information about the
      world, and to enable government, the country, and the world
      to function more efficiently. All of these purposes are
      served by putting the information on the Web as Linked Data.
      Start with the "low-hanging fruit". Whatever else, the raw
      data should be made available as soon as possible.
      Preferably, it should be put up as Linked Data. As a third
      priority, it should be linked to other sources. As a lower
      priority, nice user interfaces should be made to it -- if
      interested communities outside government have not already
      done it. The Linked Data technology, unlike any other
      technology, allows any data communication to be composed of
      many mixed vocabularies. Each vocabulary is from a community,
      be it international, national, state or local; or specific to
      an industry sector. This optimizes the usual trade-off
      between the expense and difficulty of getting wide agreement,
      and the practicality of working in a smaller community.
      Effort toward interoperability can be spent where most
      needed, making the evolution with time smoother and more
      productive.
    </p>
    <h2>
      Introduction
    </h2>
    <p>
      This, 2009, is the year for putting government data online.
      Both the <a href=
      "http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/">
      US</a> and <a href=
      "http://www.cabinetoffice.gov.uk/newsroom/news_releases/2009/090610_web.aspx">
      UK</a> governments made public commitments toward open data.
      The <a href=
      "http://www.ted.com/index.php/talks/tim_berners_lee_on_the_next_web.html">
      TED talk on Linked Data</a> was in February. Groups from the
      <a href=
      "http://www.guardian.co.uk/technology/free-our-data">Guardian</a>
      to the <a href="http://www.sunlightfoundation.com/">Sunlight
      Foundation</a> had already been pushing for it for a long
      time. People like Watchdog.net, mysociety.org, and
      govtrack.us had been pushing by publishing government data
      themselves in various formats, including Linked Data.
    </p>
    <p>
      So if you want to do this, what should you do? This article
      addresses that question very briefly. It makes a set of
      points which will probably be outdated by later developments,
      but which answer a set of relevant questions, asked or not.
    </p>
    <h2>
      Using Linked Data as the interconnection bus
    </h2>
    <p>
      Government data is put online typically for 3 reasons:
    </p>
    <ol>
      <li>Increasing citizen awareness of government functions to
      enable greater accountability;
      </li>
      <li>Contributing valuable information about the world; and
      </li>
      <li>Enabling the government, the country, and the world to
      function more efficiently.
      </li>
    </ol>
    <p>
      Each of these purposes is best served by using Linked Data
      techniques.
    </p>
    <p>
      In general Linked Data is:
    </p>
    <p>
      <strong>Open</strong>: Linked Data is accessible through an
      unlimited variety of applications because it is expressed in
      open, non-proprietary formats.
    </p>
    <p>
      <strong>Modular</strong>: Linked Data can be combined
      (mashed-up) with any other piece of Linked Data. For example,
      government data on health care expenditures for a given
      geographical area can be combined with other data about the
      characteristics of the population of that region in order to
      assess effectiveness of the government programs. No advance
      planning is required to integrate these data sources as long
      as they both use Linked Data standards.
    </p>
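    <p>
      As a rough illustration of this modularity, the sketch below
      treats two independently published datasets as sets of
      (subject, predicate, object) triples and merges them by set
      union. The URIs, property names and figures are all invented
      for the example; the only point is that the two sources
      connect because they use the same URI for the region.
    </p>

```python
# Hypothetical URIs and figures, invented purely for illustration.
REGION = "http://example.gov/id/region/42"
EXPENDITURE = "http://example.gov/def/health#expenditure"
POPULATION = "http://example.gov/def/census#population"

# Triples published by a (hypothetical) health department.
health_triples = {(REGION, EXPENDITURE, 1200000)}

# Triples published separately by a (hypothetical) statistics office.
census_triples = {(REGION, POPULATION, 30000)}


def merge(*graphs):
    """Merge graphs by taking the set union of their triples."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def properties_of(graph, subject):
    """Collect the predicate/object pairs known about one subject."""
    return {p: o for (s, p, o) in graph if s == subject}


combined = merge(health_triples, census_triples)
facts = properties_of(combined, REGION)
spend_per_head = facts[EXPENDITURE] / facts[POPULATION]
print(spend_per_head)
```

    <p>
      Neither publisher needed to know about the other in advance;
      agreeing on the identifier for the region was enough.
    </p>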
    <p>
      <strong>Scalable</strong>: It's easy to add more Linked Data
      to what's already there, even when the terms and definitions
      that are used change over time.
    </p>
    <p>
      The essential message is that whatever data format people
      want the data in, and whatever format they give it to you in,
      you use the RDF model as the interconnection bus. That's
      because RDF connects better than any other model.
    </p>
    <ul>
      <li>It uses URIs and so allows linking of things and concepts
      </li>
      <li>It allows separate systems designed independently to be
      later joined at the edges
      </li>
      <li>It allows interoperability to be added where
      cost-effective
      </li>
      <li>It allows any data to be expressed in a mixture of
      vocabularies.
      </li>
    </ul>
    <p>
      That's enough about why it is useful. That is elaborated
      elsewhere, but it can be difficult for those familiar with
      other technologies to understand the difference. Sometimes it
      is better just to do it.
    </p>
    <h2>
      Just do it
    </h2>
    <p>
      The chances are quite high that the data your
      department/agency runs off will be largely in relational
      databases, often with a large amount in spreadsheets.
    </p>
    <p>
      There are two philosophies of putting data on the web. The
      top-down one is to make a corporate or national plan, by
      getting committees together of all the interested parties,
      and to make a consistent set of terms (<em>ontology</em>)
      into which everything fits. This takes so long that it is
      often never finished, and in any case it does not get
      corporate or national consensus in the end. The other method,
      which experience recommends, is to do it bottom-up. A
      top-level mandate is extremely valuable, but grass-roots
      action is essential. Put the data up where it is: join it
      together later.
    </p>
    <p>
      A wise and cautious step is to make a thorough inventory of
      all the data you have, and figure out which dataset is going
      to be most cost-effective to put up as linked data. However,
      the survey may take longer than just doing it. So, take some
      data.
    </p>
    <p>
      A really important rule when considering which data could be
      put on the web is not to threaten or disturb the systems and
      the people who currently are responsible for that data. It
      often takes years of negotiation to put together a given set
      of data. The people involved may be very invested in it.
      There are social as well as technical systems which have been
      set up. So you leave the existing system undisturbed, and
      find a way of extracting the data from it using existing
      export or conversion facilities. You add a thin shim to
      adapt the existing system to the standard.
    </p>
    <p>
      Ok, so you have some data. What form is it in?
    </p>
    <h3>
      Relational databases
    </h3>
    <p>
      There are (2009) a number of open source tools for putting
      relational databases up as Linked Data, <em>D2RServer</em>
      and <em>Triplify</em> being two.
    </p>
    <p>
      These each use a mapping file, in some language, to explain
      how the database structure actually represents things and
      their relationships. <sup><a href="#L419">1</a></sup>
    </p>
    <p>
      You probably don't want to run a publicly available server
      on your existing database unless it is generally set up for
      high volume use. You might want to take a copy of the whole
      database, and run a live semantic web server from it, or you
      can generate the RDF once and make a copy of that to serve.
    </p>
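    <p>
      To give a feel for what such a mapping does, here is a
      minimal sketch in Python. It is not the mapping language of
      D2RServer or Triplify: the <code>MAPPING</code> structure,
      the table, the URI pattern and the example.gov terms are all
      assumptions made up for the illustration. It reads a copy of
      the data (here an in-memory SQLite database) and emits one
      triple per mapped column.
    </p>

```python
import sqlite3

# Hypothetical mapping: which table holds the things, how to build a
# URI from the primary key, and which column corresponds to which term.
MAPPING = {
    "table": "school",
    "key": "id",
    "uri_pattern": "http://example.gov/id/school/{}",
    "columns": {
        "name": "http://xmlns.com/foaf/0.1/name",
        "pupils": "http://example.gov/def/school#pupilCount",
    },
}


def rows_to_triples(conn, mapping):
    """Yield one triple per mapped, non-null column of every row."""
    cols = list(mapping["columns"])
    sql = "SELECT {}, {} FROM {}".format(
        mapping["key"], ", ".join(cols), mapping["table"]
    )
    for row in conn.execute(sql):
        subject = mapping["uri_pattern"].format(row[0])
        for col, value in zip(cols, row[1:]):
            if value is not None:
                yield (subject, mapping["columns"][col], value)


# Work on a copy of the data, not on the live system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE school (id INTEGER PRIMARY KEY, name TEXT, pupils INTEGER)"
)
conn.execute("INSERT INTO school VALUES (7, 'Hill Road Primary', 210)")
triples = list(rows_to_triples(conn, MAPPING))
for triple in triples:
    print(triple)
```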
    <h4>
      Using other people's terms
    </h4>
    <p>
      It is wise and friendly and interoperable, when you publish
      RDF data, to use terms other people are already sharing: for
      example, foaf:name for the name of a person, dc:title for the
      title of something, and geo:lat and geo:long for latitude and
      longitude<sup><a href="#L470">2</a></sup>. There are a number
      of these, growing of course. The <a href=
      "http://www.w3.org/2001/sw/interest/">Semantic Web Interest
      Group</a> is a community which can help you find them: there
      are also online tools such as <a href=
      "http://swoogle.umbc.edu/">Swoogle</a>, Sindice, etc.
    </p>
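    <p>
      A short sketch of what reusing shared terms looks like in
      practice: prefixed names such as foaf:name are shorthand for
      full URIs in a vocabulary's namespace. The namespace URIs
      below are the published ones for FOAF, Dublin Core elements
      and the W3C WGS84 geo vocabulary; the office URI and its
      values are hypothetical.
    </p>

```python
# Namespace URIs of some widely shared vocabularies.
PREFIXES = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}


def expand(term):
    """Turn a prefixed name such as 'foaf:name' into a full URI."""
    prefix, _, local = term.partition(":")
    return PREFIXES[prefix] + local


# Hypothetical subject URI and values, for illustration only.
office = "http://example.gov/id/office/12"
triples = [
    (office, expand("dc:title"), "Central Records Office"),
    (office, expand("geo:lat"), "51.5"),
    (office, expand("geo:long"), "-0.12"),
]
for triple in triples:
    print(triple)
```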
    <h3>
      Spreadsheets
    </h3>
    <p>
      In many organizations a surprising amount of information,
      sometimes critical information, is emailed around in
      spreadsheets. Much of the early recovery.gov data was
      published in spreadsheet form. Some of these are raw tables,
      with a header in the top row. These are close to raw data.
      You can export them as a comma-separated (or tab-separated)
      file, CSV. Others are spreadsheets with a lot of
      substructure, and little headings and notes all over them for
      the human user. These are less easy to convert.
    </p>
    <p>
      There are a number of <a href=
      "http://esw.w3.org/topic/ConverterToRdf">tools</a> for
      converting the format of a spreadsheet, typically in CSV
      form, into RDF.
    </p>
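    <p>
      For the simple case of a raw table with a header row, the
      conversion can be sketched in a few lines. This is not one of
      the tools linked above, only an illustration of the idea:
      each row becomes a thing with its own URI, and each column
      heading becomes a property. The column names, values and the
      example.gov URI spaces are invented.
    </p>

```python
import csv
import io

# A raw table with a header row, as exported from a spreadsheet.
# The columns and values are invented for illustration.
CSV_TEXT = """town,grant,year
Ambridge,25000,2009
Borchester,40000,2009
"""

BASE = "http://example.gov/id/grant/"      # hypothetical URI space
VOCAB = "http://example.gov/def/grant#"    # hypothetical vocabulary


def csv_to_triples(text):
    """Make one subject per row and one triple per non-empty cell."""
    triples = []
    reader = csv.DictReader(io.StringIO(text))
    for number, row in enumerate(reader, start=1):
        subject = BASE + str(number)
        for column, value in row.items():
            if value:
                triples.append((subject, VOCAB + column, value))
    return triples


triples = csv_to_triples(CSV_TEXT)
for triple in triples:
    print(triple)
```

    <p>
      In real data it is usually better to build the URI from a
      stable key in the row than from the row number, which changes
      whenever the sheet is re-sorted.
    </p>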
    <h3>
      XML
    </h3>
    <p>
      If you have existing data in XML, first, put that XML up on
      the web while you think. Then, figure out what the XML is
      about, what things and what relationships. Then, commission
      or write a program, possibly a simple script, maybe written
      in XSLT, or your favorite scripting language, to convert each
      XML file into RDF. You might need to add a file which points
      to all the things you have data about, if they are not
      already linked.
    </p>
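    <p>
      The conversion script can be quite small. The sketch below
      uses Python rather than XSLT, and assumes an invented XML
      layout in which the things are roads and the relationship of
      interest is each road's length. The element names, the
      <code>lengthKm</code> property and the URI spaces are all
      hypothetical.
    </p>

```python
import xml.etree.ElementTree as ET

# Invented sample of existing XML; real element names will differ.
XML_TEXT = """
<roads>
  <road ref="R1"><length unit="km">423</length></road>
  <road ref="R2"><length unit="km">219</length></road>
</roads>
"""

BASE = "http://example.gov/id/road/"      # hypothetical URI space
VOCAB = "http://example.gov/def/road#"    # hypothetical vocabulary


def xml_to_triples(text):
    """Turn each road element into a subject with a length property."""
    triples = []
    for road in ET.fromstring(text).findall("road"):
        subject = BASE + road.get("ref")
        length = int(road.findtext("length"))
        triples.append((subject, VOCAB + "lengthKm", length))
    return triples


triples = xml_to_triples(XML_TEXT)
for triple in triples:
    print(triple)
```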
    <h3>
      Random application formats
    </h3>
    <p>
      Ok, so your data is not in any of the above forms. It is in a
      proprietary format, or managed by a proprietary program. But
      there is some way you can get at it. So someone will have to
      write a program somewhere, to get it out, and convert it to
      one of the Linked Data standard forms.
    </p>
    <p>
      (It is actually fairly simple. First, you think of what
      things the data is about. You make up URIs for those things.
      Suppose for example your data is about books and shelves. You
      decide the URI for the books will be
      http://id.example.com/id/isbn/123457890 and the URIs for
      shelves will be like http://id.example.com/id/shelf/746 .
      Then you write a (CGI) script which, when given a URI like
      that, extracts the data about the book (including which
      shelf it is on) and outputs it, or similarly for the shelf
      (including a list of the books on the shelf). It outputs it
      in RDF/XML or N3. That script is your web server of virtual
      linked data.)
    </p>
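    <p>
      The core of such a script might look like the following
      sketch. It uses the book and shelf URIs from the example
      above; the <code>BOOKS</code> dictionary stands in for
      whatever proprietary system really holds the data, and the
      property names under <code>/def/</code> are made up for the
      illustration. Given a URI, it returns a description in N3.
    </p>

```python
# In-memory stand-in for whatever program manages the real data.
BOOKS = {"123457890": {"title": "An Example Book", "shelf": "746"}}

ID = "http://id.example.com/id/"
VOCAB = "http://id.example.com/def/"   # hypothetical vocabulary


def describe(uri):
    """Return N3 describing the book or shelf named by uri, or None."""
    if not uri.startswith(ID):
        return None
    kind, _, key = uri[len(ID):].partition("/")
    lines = []
    if kind == "isbn" and key in BOOKS:
        book = BOOKS[key]
        lines.append('<{}> <{}title> "{}" .'.format(uri, VOCAB, book["title"]))
        lines.append(
            "<{}> <{}onShelf> <{}shelf/{}> .".format(uri, VOCAB, ID, book["shelf"])
        )
    elif kind == "shelf":
        # A shelf is described by listing the books that sit on it.
        for isbn, book in sorted(BOOKS.items()):
            if book["shelf"] == key:
                lines.append(
                    "<{}> <{}holds> <{}isbn/{}> .".format(uri, VOCAB, ID, isbn)
                )
    return "\n".join(lines) if lines else None


print(describe("http://id.example.com/id/isbn/123457890"))
print(describe("http://id.example.com/id/shelf/746"))
```

    <p>
      Because the description of the book links to the shelf, and
      the description of the shelf links back to its books, a
      client can follow the links from one to the other.
    </p>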
    <h3>
      Existing Web Site
    </h3>
    <p>
      If you have an existing web site with, maybe, a page about
      each thing, there is an easy way of putting the data in those
      pages into Linked Data. You can change the scripts which
      generate the site so that the data which is behind each page
      is in fact put into the page so that it can be re-extracted
      by others as data. The technology to do this is called
      <a href="http://rdfa.info/">RDFa</a> <sup><a href=
      "#L451">3</a></sup>. An alternative is for each web page
      to have a parallel page which has the data in RDF/XML.
      <sup><a href="#L454">4</a></sup>
    </p>
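    <p>
      One way to serve the parallel pages from a single URI is
      content negotiation on the HTTP Accept header. The sketch
      below is deliberately simplified, and is an assumption about
      how a site might do it rather than a description of any
      particular server: it ignores q-values and lets the first
      recognised media type win. The file-name suffixes are
      hypothetical.
    </p>

```python
def choose_representation(accept_header):
    """Pick RDF/XML for data clients and HTML for document browsers.

    Simplified: q-values are ignored and the first recognised
    media type in the Accept header wins.
    """
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type == "application/rdf+xml":
            return "rdf"
        if media_type in ("text/html", "application/xhtml+xml"):
            return "html"
    return "html"   # default to the human-readable page


def page_for(thing_id, accept_header):
    """Map a thing and an Accept header to a (hypothetical) file name."""
    suffixes = {"rdf": ".rdf", "html": ".html"}
    return thing_id + suffixes[choose_representation(accept_header)]


print(page_for("comedy", "application/rdf+xml"))
print(page_for("comedy", "text/html,application/xhtml+xml;q=0.9"))
```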
    <h2>
      Giving access to data
    </h2>
    <p>
      Ok, so you have your data in RDF as Linked Data. Now what?
    </p>
    <h3>
      Index it
    </h3>
    <p>
      The semantic web toolkit includes the SPARQL query language
      which allows a client anywhere on the net to query a SPARQL
      service. Some methods of publishing data, like D2RServer,
      provide a built-in SPARQL service. If you have generated a
      bunch of linked data, then there are various products, free
      or commercial, which will scoop it up into a "triple store"
      and provide a SPARQL service.
    </p>
    <p>
      A SPARQL service is a generally useful tool for technically
      aware users. Many clients and analytical tools just use a
      SPARQL server. A SPARQL server looks for patterns in the data
      and, for each match, outputs what it found in one of a
      number of formats, including constructed RDF, XML and, in
      some cases, JSON, and maybe even CSV.
    </p>
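    <p>
      The idea of looking for patterns can be shown with a toy
      matcher. This is only an illustration of a single triple
      pattern over an in-memory list, not an implementation of
      SPARQL, which also joins several patterns, filters results
      and more. Terms beginning with "?" are variables; the ex:
      names and the data are invented.
    </p>

```python
def match(pattern, triples):
    """Return variable bindings for every triple matching the pattern.

    Pattern terms starting with '?' are variables; all other terms
    must equal the corresponding term of the triple.
    """
    results = []
    for triple in triples:
        binding = {}
        for want, have in zip(pattern, triple):
            if isinstance(want, str) and want.startswith("?"):
                # A repeated variable must bind to the same value.
                if binding.setdefault(want, have) != have:
                    break
            elif want != have:
                break
        else:
            results.append(binding)
    return results


# Invented data, using abbreviated names for readability.
DATA = [
    ("ex:road1", "ex:accidents", 12),
    ("ex:road2", "ex:accidents", 3),
    ("ex:road1", "ex:name", "High Street"),
]

rows = match(("?road", "ex:accidents", "?n"), DATA)
for row in rows:
    print(row)
```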
    <h3>
      Generating XML with SPARQL
    </h3>
    <p>
      SPARQL, then, can be used as an RDF to XML converter. You
      amass a heap of linked data. Then you think of a combination
      of data, involving connections across different data. There
      is a SPARQL query for that data with the results expressed in
      XML. That SPARQL query can be encoded into a long URI, a URI
      for a virtual XML document for that particular view.
    </p>
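    <p>
      Encoding the query into a URI is a matter of putting it,
      percent-encoded, into the <code>query</code> parameter of a
      GET request to the service. In this sketch the endpoint
      address and the ex: vocabulary are hypothetical, and how a
      given server is asked for XML results rather than another
      format varies from product to product.
    </p>

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "http://example.gov/sparql"   # hypothetical SPARQL service

QUERY = """
PREFIX ex: <http://example.gov/def/grant#>
SELECT ?town ?amount
WHERE { ?grant ex:town ?town ; ex:amount ?amount . }
"""


def query_uri(endpoint, query):
    """Encode a SPARQL query into the 'query' parameter of a GET URI."""
    return endpoint + "?" + urlencode({"query": query.strip()})


uri = query_uri(ENDPOINT, QUERY)
print(uri)

# Decoding the URI recovers the original query text.
decoded = parse_qs(urlsplit(uri).query)["query"][0]
print(decoded == QUERY.strip())
```

    <p>
      The resulting URI is long but is an ordinary link: it can be
      bookmarked, cited, or given to any tool which fetches a
      document from the web.
    </p>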
    <h3>
      Generating CSV files and JSON
    </h3>
    <p>
      Some SPARQL servers also support JSON as an output format.
      This is easy to use in Web Applications.
    </p>
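    <p>
      Where a server offers JSON but not CSV, the conversion is
      small enough to do on the client side. The sketch below
      assumes the standard SPARQL JSON results layout, with a
      <code>head.vars</code> list and a
      <code>results.bindings</code> list; the response itself is
      hand-written sample data, not output from a real service.
    </p>

```python
import csv
import io
import json

# A hand-written response in the SPARQL JSON results layout.
RESPONSE = """
{"head": {"vars": ["town", "amount"]},
 "results": {"bindings": [
   {"town": {"type": "literal", "value": "Ambridge"},
    "amount": {"type": "literal", "value": "25000"}},
   {"town": {"type": "literal", "value": "Borchester"}}
 ]}}
"""


def bindings_to_rows(text):
    """Flatten the bindings into rows; unbound variables become ''."""
    doc = json.loads(text)
    names = doc["head"]["vars"]
    rows = [
        [binding.get(name, {}).get("value", "") for name in names]
        for binding in doc["results"]["bindings"]
    ]
    return names, rows


def rows_to_csv(names, rows):
    """Write a header row and the data rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()


names, rows = bindings_to_rows(RESPONSE)
csv_text = rows_to_csv(names, rows)
print(csv_text)
```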
    <h3>
      Generating nice web pages
    </h3>
    <p>
      The priority first is to get raw data onto the net, and
      preferably converted into Linked Data form. This is partly
      because there may be other sites, commercial or not, who pick
      it up and make great interfaces to that data. Of course there
      are times when the government site must provide an easy human
      interface for ordinary users to access the data.
    </p>
    <p>
      There are many routes to pretty HTML for real users. Tools
      like Exhibit provide facetted browser views, given a
      configuration set up by the web master, for example.
    </p>
    <p>
      Webmasters can run scripts in languages (not standardized
      yet) like XSPARQL or N3 rules, or write custom code in their
      favorite programming language such as PHP, Python, Ruby, or
      server-side JavaScript.
    </p>
    <p>
      Note, though, that there are two ways in which a department
      or agency web site can never be expected to compete with
      external sites. One is that there are as yet no user
      interface techniques which allow a normal user to create
      their own query (though tools like Tabulator are getting
      close).
    </p>
    <p>
      The second is that an external site will add value to the
      data by joining it to other data from different sites for a
      particular purpose. If the Department of Transport publishes
      road accident data, a cycling site selects the cycle accident
      subset, and can publish it as a map adding cycle routes and
      hills, and cycle shops. An agency publishes data about the
      amount of money given to different towns, another maps it
      against the per capita income levels in those towns. And so
      on in uncountable permutations.
    </p>
    <p>
      An informal random sample of some public feedback suggests
      that there are users who would prefer each of these formats
      above, so a system which generates them automatically is
      clearly called for.
    </p>
    <h2>
      Metadata
    </h2>
    <p>
      When you write or generate a small RDF file for each dataset
      exported, the results can be harvested as more useful linked
      data to form a catalog. Like the data, this can be
      distributed as linked data, and also sucked into a
      repository to be indexed and SPARQLed. Remember that, as with
      the data, RDF allows you to mix vocabularies, so you can
      record everything you or others may feel is important about
      the datasets. This provenance information is very valuable.
      It clearly is one of the many areas this note touches on
      about which much more could be said.
    </p>
    <p>
      Neither does this note really address licensing issues. In
      the US, government data is generally in the Public Domain. It
      is good to state the fact that a given resource has a given
      license in a machine-readable way. The Creative Commons
      cc:license term is appropriate. Creative Commons has also
      produced a "CC0" waiver which disclaims all rights
      appropriately (and where possible) for each country.
    </p>
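    <p>
      A per-dataset metadata record, including its license, can be
      very small. The sketch below builds such records as triples
      and merges them into a catalog. The dc: and cc: namespace
      URIs and the CC0 URI are the published ones; the dataset URIs
      and titles are invented, and a real record would usually
      carry more, such as the publisher and date.
    </p>

```python
DC = "http://purl.org/dc/elements/1.1/"
CC = "http://creativecommons.org/ns#"
CC0 = "http://creativecommons.org/publicdomain/zero/1.0/"


def dataset_metadata(dataset_uri, title, license_uri):
    """Describe one exported dataset as a handful of triples."""
    return [
        (dataset_uri, DC + "title", title),
        (dataset_uri, CC + "license", license_uri),
    ]


def build_catalog(*records):
    """Harvest the per-dataset records into a single list of triples."""
    catalog = []
    for record in records:
        catalog.extend(record)
    return catalog


# Hypothetical datasets, for illustration only.
catalog = build_catalog(
    dataset_metadata("http://example.gov/data/grants", "Town grants", CC0),
    dataset_metadata("http://example.gov/data/roads", "Road lengths", CC0),
)
licensed = [s for (s, p, o) in catalog if p == CC + "license" and o == CC0]
print(licensed)
```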
    <h2>
      Privacy
    </h2>
    <p>
      A very common and important concern is the privacy of data
      which contains personally identifiable information. This
      article does not suggest that all data should be made public,
      nor does it discuss issues with anonymisation of data.
      Systems where PII is an issue will probably not be an early
      choice when selecting those to put on the web. However, in
      cases in which these issues have already been resolved and
      the data is already public but not in the standard form,
      converting it to Linked Data is an excellent idea. In
      general, new government systems should be built to be aware
      of the provenance of the data they use, and of the
      appropriate use to which it may be put. But the design of
      these <a href=
      "http://dig.csail.mit.edu/2008/06/info-accountability-cacm-weitzner.pdf">
      accountable systems</a> is another topic we do not have space
      for here.
    </p>
    <h2>
      Conclusion
    </h2>
    <p>
      This brief note is too short to go into great detail, and has
      ignored many important topics. It has stressed the practical
      technical steps. Deeper information, about techniques and
      also about the social issues and challenges, is being
      produced frequently elsewhere. Many cities have Semantic Web
      gatherings or <a href="http://semweb.meetup.com/">meetup
      groups</a>, which can be a source of mutual support for those
      involved in or interested in the technology. The W3C eGov
      Interest Group is an international group of people sharing
      challenges and solutions.
    </p>
    <hr />
    <h4>
      Footnote: Do's and Don'ts
    </h4>
    <ul>
      <li>Do pick URIs which are likely to be <a href=
      "../Provider/Style/URI">persistent</a>
      </li>
      <li>Do put RDF metadata giving the license.
      </li>
      <li>Do use the RDF and SPARQL standards
      </li>
      <li>Make sure your human readable pages are <a href=
      "http://www.w3.org/WAI">accessible</a>.
      </li>
    </ul>
    <ul>
      <li>Do NOT hide data files inside zip files unless they are
      also available directly.
      </li>
      <li>Do NOT put data up in proprietary formats.
      </li>
      <li>Do NOT wait until you have a complete schema or ontology
      to publish data.
      </li>
      <li>Do NOT seek to replace existing data systems.
      </li>
    </ul>
    <p>
      <a name="L419" id="L419">[1]</a> D2RServer will generate a
      default mapping file, which will not make a very good RDF
      graph. Browsing the resulting RDF with an RDF browser (such
      as Tabulator) will however often show up the deficiencies and
      suggest improvements.
    </p>
    <p>
      <a name="L470" id="L470">[2]</a> WGS84 latitude and
      longitude, like you get from a normal GPS unit. (<a href=
      "http://www.w3.org/2003/01/geo/">more</a>)
    </p>
    <p>
      <a name="L451" id="L451">[3]</a> RDFa is used, for example,
      in the UK <a href=
      "http://www.civilservice.gov.uk/jobs/index.aspx">Civil
      Service Jobs</a> web site. (<a href=
      "http://www.civilservice.gov.uk/jobs/careers-detail.aspx?JobId=4730">example</a>)
    </p>
    <p>
      <a name="L454" id="L454">[4]</a> Separate RDF/XML web pages
      are used, for example, in the <a href=
      "http://www.bbc.co.uk/programmes">BBC programmes</a> data.
      Here content negotiation gives RDF/XML to data clients, and
      HTML to document browsers. (<a href=
      "http://www.bbc.co.uk/programmes/genres/comedy#genre">example</a>)
    </p>
    <h2>
      References and Resources
    </h2>
    <ul>
      <li>
        <a href=
        "http://www.thenationaldialogue.org/ideas/linked-open-data">
        Linked Open Data</a>, in "The National Dialogue" about US
        recovery transparency.
      </li>
      <li>
        <a href=
        "http://ShowUsABetterWay.com/">ShowUsABetterWay.com</a>
        (UK)
      </li>
      <li>
        <a href=
        "http://www.showusabetterway.co.uk/call/data.html">Example
        UK Data available for reuse</a>
      </li>
      <li>
        <a href=
        "http://TheNationalDialog.org/">TheNationalDialog.org</a>
        (US)
      </li>
      <li>
        <a href="http://www.whitehouse.gov/open/">Open Government
        Initiative</a> (US)
      </li>
      <li>
        <a href=
        "http://www.cabinetoffice.gov.uk/reports/power_of_information.aspx">
        The Power of Information Taskforce Report</a> (UK Gov) one
        of whose recommendations is linked government data
      </li>
      <li>
        <a href="http://www.w3.org/2007/eGov/">eGovernment at
        W3C</a>
      </li>
      <li>
        <a href="http://www.w3.org/2007/eGov/IG/">W3C eGovernment
        Interest Group</a>
      </li>
      <li>
        <a href="http://www.w3.org/TR/egov-improving/">Improving
        Access to Government through Better Use of the Web</a>, W3C
        eGov IG
      </li>
      <li>
        <a href=
        "http://www.whitehouse.gov/the_press_office/Transparency_and_Open_Government/">
        Transparency and Open Government</a>, Memorandum for the
        Heads of Executive Departments and Agencies, Barack Obama,
        2009-01-21
      </li>
      <li>
        <a href="http://eprints.ecs.soton.ac.uk/14429/">Paper on
        the lessons from the UK AKTivePSI project</a>
      </li>
      <li>
        <a href="http://esw.w3.org/topic/SemanticWebTools">Semantic
        Web Development Tools</a>, eSW Wiki.
      </li>
      <li>
        <a href="http://esw.w3.org/topic/ConverterToRdf">Tools to
        convert data into RDF</a>, in eSW Wiki. Don't just look in
        the wiki for things -- add things you have found!
      </li>
      <li>
        <a href="http://rdfa.info/">RDFA.info</a> a resource about
        RDFa. Ben Adida.
      </li>
    </ul>
    <h4>
      Acknowledgements
    </h4>
    <p>
      <small>Thanks for input to this article from Nigel Shadbolt
      and Danny Weitzner. Thanks also to the chairs (John Sheridan
      and Kevin Novak) and members of the W3C eGov interest group,
      and all those in UK and US governments with whom we have
      discussed these issues at these early stages.</small>
    </p>
    <hr />
    <p>
      <a href="Overview.html">Up to Design Issues</a>
    </p>
    <p>
      <a href="../People/Berners-Lee">Tim BL</a>
    </p>
  </body>
</html>