<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
<html>
  <head about="http://www.w3.org/2010/02/rdfa/Overview.html">
    <title>Profile search data</title>
	<link rel="stylesheet" href="/Guide/pubrules-style.css" type="text/css" />
	<script src="/People/Ivan/JS/sorttable-small.js" type="text/javascript"></script>
	<style type="text/css">
	  td {text-align: right}
	  td:first-child { text-align: left}
	  td:first-child + td a { color:darkGreen; text-align: left}
	  td:first-child + td { color:darkGreen; text-align: left}
	  
	  td {
		  border: inset 1pt;
	  }
	  th { 
		  text-align: center; 
		  font-weight: bold
	  }
	  tr:nth-child(even) { background: lavender}
	  ol { counter-reset: item }
	  ol > li { display: block; }
	  ol > li:before { content: counters(item, ".") ".   "; counter-increment: item }
	</style>
  </head>
  <body>
		<p class="banner">
			<a href=".">
				<img src="/Icons/SW/sw-horz-w3c.png" width="241" height="48" alt="W3C SW Logo"/>
			</a>
		</p>
		<p style="color:red">WARNING: this is still a draft, and some details are still in discussion.</p>
		<h1>Vocabulary Search on the Semantic Web for RDFa Default Profiles</h1>
		
		<p>$Date: 2011/06/08 21:41:47 $</p>
		
		<p>The set of vocabulary prefixes included in the <a href="/profile/rdfa-1.1">RDFa 1.1 Default Profile</a> is determined
		by how widely those vocabularies are used on the Semantic Web. This usage is established from search crawl data, courtesy of
		<a href="http://sindice.com">Sindice</a> and of <a href="http://search.yahoo.com">Yahoo!</a>. This page describes the methodology used during the crawls as well
		as the post-processing steps.</p>
		
	  <h2 id="method">How Was the Data Collected?</h2>
	  
	  <p>The methodology used for the Sindice and the Yahoo! cases was essentially the same, namely:</p>
	  
	  <ol>
	  <li>A crawl of the respective search engine produced a set of URI-s from the Semantic Web. 
	    <ul><li>In the case of Sindice the crawl was on the Semantic Web, yielding around 10B triples.</li>
	  <li>In the case of Yahoo!, the original generic crawl size was around 12B pages, with 431M documents using RDF (excluding trivial RDFa markup, i.e., pages containing triples in the xhtml namespace only). 
	  The measurement results are based on the RDF extracted from these RDFa pages.</li>
		</ul>	  
	  </li>
	  <li>The result of the crawl was subject to a number of processing steps, namely:
		<ol>
		  <li>The vocabulary URI-s were established using some simple heuristics and, in some cases, explicit processing rules.</li>
		  <li>A number of vocabularies were eliminated as unsuitable for an RDFa profile.</li>
		  <li>Some common mistakes in the datasets were handled, too. For example, a missing '#' or a '/' at the end of a property yields,
		  formally, a different property URI but, in many cases, it was fairly clear that those were mistakes and could be merged with the intended URI.
		  Another, somewhat more controversial, case is when a known vocabulary has changed its official URI at some point
		  (e.g., Facebook’s <a href="http://ogp.me">Open Graph Protocol</a>), and all data are merged into the current, official URI.</li>
		</ol>
	  </li>
	  <li>The resulting set of vocabularies is ordered by the number of <em>effective second level domains</em> using each entry.
	  The <a href="http://publicsuffix.org/">Public Suffix List</a>, maintained by the community at large, was used in both the Sindice and
	  the Yahoo! cases to identify the highest domain (i.e., second-level domain) that is directly below a top-level domain.
	  Using this metric rather than,
	  for example, the number of domains or graphs avoids artefacts caused by a few very important sites publishing a large number of triples or
	  graphs with local vocabularies that would not be appropriate for a generic RDFa profile.
	  </li>
	  </ol>
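	<p>The counting in step 3 above can be sketched roughly as follows. This is only an illustration, not the code used by either crawl: 
	<code>SAMPLE_SUFFIXES</code> is a tiny hypothetical stand-in for the Public Suffix List, and the function names and the input format
	(pairs of page URL and vocabulary URI) are assumptions made for the example.</p>
<pre>
```python
from collections import defaultdict
from urllib.parse import urlsplit

# Tiny illustrative stand-in for the Public Suffix List (assumption: the
# real list is far longer and also contains wildcard and exception rules).
SAMPLE_SUFFIXES = {"com", "org", "net", "uk", "co.uk"}


def effective_sld(url):
    """Return the longest matching public suffix plus one more label, or None."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".")
    # Candidates are tried from the longest suffix to the shortest one.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in SAMPLE_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # no known suffix matched


def count_domains(observations):
    """Map each vocabulary URI to the number of distinct effective SLDs using it."""
    domains = defaultdict(set)
    for page_url, vocab_uri in observations:
        esld = effective_sld(page_url)
        if esld is not None:
            domains[vocab_uri].add(esld)
    return {vocab: len(found) for vocab, found in domains.items()}
```
</pre>
	<p>With this kind of counting, a site that publishes millions of pages under many subdomains still contributes a single unit
	to a vocabulary's score, which is the property the metric is chosen for.</p>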

	<p>The most complex and possibly controversial step is 2.2 above. Here are the categories of vocabularies that were removed from the result set:</p>
	
	<ul>
	  <li>Vocabularies defined through a W3C Recommendation or Working/Interest Group Note (those are part of a default profile “ex officio”)</li>
	  <li>Vocabularies whose URI-s are not publicly dereferenceable, or that do not refer to proper documentation (at the minimum a commented RDF file)</li>
	  <li>Vocabularies marked as “draft” or “experimental”</li>
	  <li>Vocabularies used for a very specific and specialized purpose (e.g., major ontologies used in medical or drug discovery applications). Note that this
	  is <em>not</em> a judgement on the quality or the usefulness of that vocabulary, simply a reflection of the fact that the vocabulary should not
	  be part of an RDFa profile of general use.</li>		
	</ul>
	
	<p>The rules used for the <a href="sindice/rules.properties">Sindice</a> and the <a href="yahoo/rules.properties">Yahoo!</a> cases, respectively, are
	available for download. The final results of the two crawls and subsequent processing are also available; see the <a href="sindice/">Sindice</a> and
	the <a href="yahoo/">Yahoo!</a> pages for further details.</p>
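	<p>As an illustration of the clean-up described in step 2.3 above, the merging of mistyped or retired vocabulary URI-s might look like the sketch below.
	It is not derived from the rules files: the <code>ALIASES</code> table, its single entry, and the function names are hypothetical,
	and the input is assumed to map each raw URI to the set of domains on which it was seen.</p>
<pre>
```python
# Hypothetical alias table: retired vocabulary URI -> current official URI.
# The entry is an assumed example, not copied from the published rules files.
ALIASES = {"http://opengraphprotocol.org/schema/": "http://ogp.me/ns#"}


def canonical(uri, known):
    """Map a raw vocabulary URI to the known URI it most plausibly stands for."""
    uri = ALIASES.get(uri, uri)
    if uri in known:
        return uri
    # A missing trailing '#' or '/' formally yields a different URI;
    # treat it as a typo when adding the character gives a known vocabulary.
    for terminator in ("#", "/"):
        if uri + terminator in known:
            return uri + terminator
    return uri  # unknown vocabulary: leave untouched


def merge_domains(raw, known):
    """Union the domain sets of all raw URIs that share one canonical URI."""
    merged = {}
    for uri, domains in raw.items():
        merged.setdefault(canonical(uri, known), set()).update(domains)
    return merged
```
</pre>
	<p>Merging sets of domains, rather than adding up counts, keeps a domain that uses both the old and the new URI from being counted twice.</p>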
	
	<h2>Merging the Results and Establishing the Final Profile Content</h2>
	
	<p>Both crawl results have a relatively natural cut-off point separating the vocabularies that should be considered for a default profile from those that should not,
	bearing in mind that the number of default prefixes should not be very high (in the range of 10, considering the fact that the list might grow
	as time goes by). For Yahoo! a cut-off value of 10 seems to be a natural choice. The choice is slightly less clear for Sindice; at present, the value of 12 is used.</p>

	<p>However, the two data sets should be considered together; an entry from one dataset that scores very low on the other should not be added. Based on
	this, the following algorithm is used:</p>
	<ol>
	  <li>The cut-off points S (for Sindice) and Y (for Yahoo!) are established (as said above, these are 12 and 10, respectively).</li>
	  <li>The two data sets are traversed in parallel from the top; each entry is checked for whether it appears in the other list within twice that list's cut-off value.
	  I.e., a top entry in the Sindice list (meaning that its index is under S) should be present in the Yahoo! list with an index of at most 2*Y; similarly
	  for the Yahoo! entries. If such an entry is found, it is added to the final list.</li>
	</ol>
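	<p>The two steps above can be sketched as follows. This is a minimal illustration and not the merge script itself: the function name,
	the list-based input (vocabulary URI-s ordered from most to least used), the inclusive reading of “top S entries”, and the order in which accepted
	entries are appended are all assumptions.</p>
<pre>
```python
from itertools import zip_longest


def merge_rankings(sindice, yahoo, s_cut=12, y_cut=10):
    """Keep top entries of one list that rank within twice the cut-off of the other."""
    # Rank of every vocabulary in each list, starting at 1.
    s_rank = {uri: rank for rank, uri in enumerate(sindice, start=1)}
    y_rank = {uri: rank for rank, uri in enumerate(yahoo, start=1)}
    missing = float("inf")  # rank used for a vocabulary absent from a list
    final = []
    # Walk the top of both lists in parallel.
    for s_uri, y_uri in zip_longest(sindice[:s_cut], yahoo[:y_cut]):
        picks = []
        if s_uri is not None and 2 * y_cut >= y_rank.get(s_uri, missing):
            picks.append(s_uri)
        if y_uri is not None and 2 * s_cut >= s_rank.get(y_uri, missing):
            picks.append(y_uri)
        for uri in picks:
            if uri not in final:  # an entry may qualify from both sides
                final.append(uri)
    return final
```
</pre>
	<p>A vocabulary that is near the top of one list but absent from the other, or ranked far down in it, is therefore dropped.</p>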
	<p>This means that the number of final entries is under max(S,Y). (The Python script executing the merge is also <a href="merge">available</a>.) The current results are:</p>
	  <div style="text-align: center" id="table-1">
		<table width="80%" border="0" style="text-align: center">
		 <thead><tr><th></th><th>Vocabulary URI</th><th>Effective Second Level Domains in the <a href="yahoo/">Yahoo! dataset</a></th><th>Effective Second Level Domains in the <a href="sindice/">Sindice dataset</a></th></tr></thead>
		 <tbody>
                 <tr><td>1.</td><td><a href="http://purl.org/dc/terms/">http://purl.org/dc/terms/</a></td><td>344545</td><td>32848</td></tr>
                <tr><td>2.</td><td><a href="http://ogp.me/ns#">http://ogp.me/ns#</a></td><td>177761</td><td>18954</td></tr>
                <tr><td>3.</td><td><a href="http://creativecommons.org/ns#">http://creativecommons.org/ns#</a></td><td>37890</td><td>743</td></tr>
                <tr><td>4.</td><td><a href="http://xmlns.com/foaf/0.1/">http://xmlns.com/foaf/0.1/</a></td><td>2545</td><td>3630</td></tr>
                <tr><td>5.</td><td><a href="http://rdf.data-vocabulary.org/#">http://rdf.data-vocabulary.org/#</a></td><td>6083</td><td>845</td></tr>
                <tr><td>6.</td><td><a href="http://rdfs.org/sioc/ns#">http://rdfs.org/sioc/ns#</a></td><td>1633</td><td>1305</td></tr>
                <tr><td>7.</td><td><a href="http://www.w3.org/2006/vcard/ns#">http://www.w3.org/2006/vcard/ns#</a></td><td>1349</td><td>559</td></tr>
                <tr><td>8.</td><td><a href="http://purl.org/goodrelations/v1#">http://purl.org/goodrelations/v1#</a></td><td>488</td><td>390</td></tr>
                <tr><td>9.</td><td><a href="http://purl.org/stuff/rev#">http://purl.org/stuff/rev#</a></td><td>369</td><td>73</td></tr>
                <tr><td>10.</td><td><a href="http://commontag.org/ns#">http://commontag.org/ns#</a></td><td>272</td><td>168</td></tr>
                <tr><td>11.</td><td><a href="http://www.w3.org/2002/12/cal/icaltzd#">http://www.w3.org/2002/12/cal/icaltzd#</a></td><td>62</td><td>50</td></tr>
		 </tbody>
		</table>
	  </div>
	  
	<p>This list has been included in the <a href="/profile/rdfa-1.1">RDFa 1.1 Default Profile</a> (also available in <a href="/profile/rdfa-1.1.ttl">Turtle</a> and
	<a href="/profile/rdfa-1.1.rdf">RDF/XML</a>). In most cases the prefixes are well known and widely used (e.g., <code>foaf</code>); in other cases 
	the <a href="http://prefix.cc">prefix.cc</a> service was used to establish the default prefix (e.g., <code>ctag</code> for the <a href="http://commontag.org/ns#">http://commontag.org/ns#</a> vocabulary).</p>
	
	<hr />
	<address >
	  <span rel="dcterms:creator" >
		<span about="http://www.ivan-herman.net/foaf#me" typeof="foaf:Person">
			<span property="foaf:name" datatype=""><a href="http://www.w3.org/People/Ivan">Ivan Herman</a></span>,
			<a rel="foaf:mbox" href="mailto:ivan@w3.org">ivan@w3.org</a>,
			<a rel="foaf:workplaceHomepage" href="http://www.w3.org">W3C</a>,
			<span rel="rdfs:seeAlso" resource="http://www.ivan-herman.net/foaf" property="foaf:title">Semantic Web Activity Lead</span>,
		</span>
	  </span>
	  thanks to Giovanni Tummarello, Robert Fuller, Diego Ceccarelli, and Renaud Delbru, Sindice, and Péter Mika, Yahoo!
	  <br />
	  <span>$Date: 2011/06/08 21:41:47 $</span>
	</address>
		
		
 </body>
</html>