<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css" media="all">
@import "/QA/2006/01/blogstyle.css";
</style>
<meta name="keywords" content='' />
<meta name="description" content="Every so often, someone writes to me or to the public-qa-dev mailing list to report bugs, or simply to give thanks on the semantic data extractor. I'm always pleasantly surprised when I hear that, what started as a 10 minutes..." />
<meta name="revision" content="$Id: semantic_data_extractor.html,v 1.42 2011/12/05 17:18:25 mirror Exp $" />
<link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.w3.org/QA/atom.xml" />
<link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="http://www.w3.org/QA/news.rss" />
<title>Semantic Data Extractor - W3C Blog</title>
<link rel="start" href="http://www.w3.org/QA/" title="Home" />
<link rel="prev" href="http://www.w3.org/QA/2009/02/social_networking_workshop_rep.html" title="Social Networking Workshop Report" />
<link rel="next" href="http://www.w3.org/QA/2009/02/palm_webos_approach_to_html_ex.html" title="Palm webOS approach to HTML extensibility: x-mojo-*" />
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.w3.org/QA/2009/02/semantic_data_extractor.html"
trackback:ping="http://www.w3.org/QA/sununga/mt-tb.cgi/259"
dc:title="Semantic Data Extractor"
dc:identifier="http://www.w3.org/QA/2009/02/semantic_data_extractor.html"
dc:subject="Tools"
dc:description="Every so often, someone writes to me or to the public-qa-dev mailing list to report bugs, or simply to give thanks on the semantic data extractor. I'm always pleasantly surprised when I hear that, what started as a 10 minutes..."
dc:creator="Dominique Hazaël-Massieux"
dc:date="2009-02-12T10:27:09+00:00" />
</rdf:RDF>
-->
<!-- <script type="text/javascript" src="http://www.w3.org/QA/mt.js"></script>-->
</head>
<body class="layout-one-column">
<div id="banner">
<h1 id="title">
<a href="http://www.w3.org/"><img height="48" alt="W3C" id="logo" src="http://www.w3.org/Icons/WWW/w3c_home_nb" /></a>
W3C Blog
</h1>
</div>
<ul class="navbar" id="menu">
<li><strong><a href="/QA/" title="W3C Blog Home">[ W3C Blog ]</a></strong></li>
<li><a href="/QA/Library/" title="Documents and Publications on Web and Quality">Documents</a></li>
<li><a href="/QA/Tools/" accesskey="3" title="Validators and other Tools">Tools</a></li>
<li><a href="/2007/12/qa-blog-help/index#feedback">Feedback</a></li>
</ul>
<div id="searchbox">
<form method="get" action="http://www.google.com/custom" enctype="application/x-www-form-urlencoded">
<p id="formbox"><input type="text" size="15" class="textfield" name="q" accesskey="E" maxlength="255" /> <input type="submit" class="submitfield" value="Search" id="goButton" name="sa" accesskey="G" /> <input type="hidden" name="cof" value="T:black;LW:72;ALC:#ff3300;L:http://www.w3.org/Icons/w3c_home;LC:#000099;LH:48;BGC:white;AH:left;VLC:#660066;GL:0;AWFID:0b9847e42caf283e;" /><input type="hidden" id="searchW3C" name="sitesearch" checked="checked" value="www.w3.org/QA" /><input type="hidden" name="domains" value="www.w3.org/QA" /></p>
</form>
</div>
<div id="main"><!-- This DIV encapsulates everything in this page - necessary for the positioning -->
<p class="content-nav">
<a href="http://www.w3.org/QA/2009/02/social_networking_workshop_rep.html">« Social Networking Workshop Report</a> |
<a href="http://www.w3.org/QA/">Main</a>
| <a href="http://www.w3.org/QA/2009/02/palm_webos_approach_to_html_ex.html">Palm webOS approach to HTML extensibility: x-mojo-* »</a>
</p>
<h2 class="entry-header">Semantic Data Extractor</h2>
<div class="entry-body">
<p>Every so often, someone writes to me or to the <a href="http://lists.w3.org/Archives/Public/public-qa-dev/">public-qa-dev mailing list</a> to report bugs, or simply to give thanks for the <a href="http://www.w3.org/2003/12/semantic-extractor.html">semantic data extractor</a>.</p>
<p>I'm always pleasantly surprised to hear that what started as a 10-minute demonstrator of the semantics attached to HTML is actually used as a tool by a number of developers.</p>
<p>With a name such as "semantic data extractor", it was a bit of a shame that the tool didn't highlight the usage of <a href="http://www.w3.org/TR/2007/NOTE-grddl-primer-20070628/">GRDDL</a> or <a href="http://www.w3.org/TR/2008/NOTE-xhtml-rdfa-primer-20081014/">RDFa</a> on pages that use either of these technologies; I have just added detection of both of these to the extractor.</p>
<p>As a bonus, I have also added detection of non-semantic markup: at this time, it will detect purely-wrapping <code>&lt;div&gt;</code>, empty <code>&lt;span&gt;</code>, and tables with a single row or a single column (which are likely to be layout tables); if you have suggestions for detecting other non-semantic markup, let me know!</p>
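<p>For readers curious what such checks could look like, here is a minimal Python sketch of the three heuristics described above. It is not the extractor's actual code: the function names (<code>is_wrapping_div</code>, <code>is_empty_span</code>, <code>is_probable_layout_table</code>, <code>count_non_semantic</code>) are hypothetical, and it assumes the input is well-formed XHTML that <code>xml.etree.ElementTree</code> can parse.</p>
<pre>
```python
import xml.etree.ElementTree as ET


def local_name(element):
    """Return the tag name without its XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def is_wrapping_div(element):
    """True for a div holding exactly one child element and no text of its own."""
    if local_name(element) != "div":
        return False
    children = list(element)
    if len(children) != 1:
        return False
    own_text = (element.text or "") + (children[0].tail or "")
    return own_text.strip() == ""


def is_empty_span(element):
    """True for a span with no child elements and only whitespace inside."""
    if local_name(element) != "span":
        return False
    return len(element) == 0 and (element.text or "").strip() == ""


def table_rows(table):
    """Collect the rows of a table, looking through thead/tbody/tfoot only."""
    rows = []
    for child in table:
        name = local_name(child)
        if name == "tr":
            rows.append(child)
        elif name in ("thead", "tbody", "tfoot"):
            rows.extend(r for r in child if local_name(r) == "tr")
    return rows


def is_probable_layout_table(element):
    """True for a table with a single row, or with one cell in every row."""
    if local_name(element) != "table":
        return False
    rows = table_rows(element)
    if not rows:
        return False
    if len(rows) == 1:
        return True
    cells_per_row = [
        sum(1 for cell in row if local_name(cell) in ("td", "th"))
        for row in rows
    ]
    return all(count == 1 for count in cells_per_row)


def count_non_semantic(xhtml_source):
    """Parse well-formed XHTML and count the three suspicious patterns."""
    root = ET.fromstring(xhtml_source)
    counts = {"wrapping_div": 0, "empty_span": 0, "layout_table": 0}
    for element in root.iter():
        if is_wrapping_div(element):
            counts["wrapping_div"] += 1
        elif is_empty_span(element):
            counts["empty_span"] += 1
        elif is_probable_layout_table(element):
            counts["layout_table"] += 1
    return counts
```
</pre>
<p>In this sketch, "purely wrapping" is read as "exactly one child element and no text", which is only one possible interpretation of the check. A table nested inside a cell is judged on its own rows, independently of the table that contains it.</p>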
</div>
<div id="more" class="entry-more">
</div>
<p class="postinfo">Filed by <a href="http://www.w3.org/People/Dom/">Dominique Hazaël-Massieux</a> on February 12, 2009 10:27 AM in <a href="http://www.w3.org/QA/archive/technology/semantic_web/">Semantic Web</a>, <a href="http://www.w3.org/QA/archive/w3cqa_news/tools/">Tools</a><br />
<span class="separator">|</span> <a class="permalink" href="http://www.w3.org/QA/2009/02/semantic_data_extractor.html">Permalink</a>
| <a href="http://www.w3.org/QA/2009/02/semantic_data_extractor.html#comments">Comments (7)</a>
| <a href="http://www.w3.org/QA/2009/02/semantic_data_extractor.html#trackback">TrackBacks (0)</a>
</p>
<h3 class="comments-header" id="comments">Comments</h3>
<div class="comment" id="comment-176213">
<p class="comment-meta" id="c176213">
<span class="comment-meta-author"><strong>Carlo </strong></span>
<span class="comment-meta-date"><a href="#c176213">#</a> 2009-03-13</span>
</p>
<div class="comment-bulk">
<p>The tool doesn't work!
This is the error message:</p>
<p>Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.DynamicError: org.xml.sax.SAXParseException: Content is not allowed in prolog.
org.xml.sax.SAXParseException: Content is not allowed in prolog.</p>
</div>
</div>
<div class="comment" id="comment-176245">
<p class="comment-meta" id="c176245">
<span class="comment-meta-author"><strong>Dom </strong></span>
<span class="comment-meta-date"><a href="#c176245">#</a> 2009-03-13</span>
</p>
<div class="comment-bulk">
<p>Hi Carlo,</p>
<p>Please report bugs and errors on public-qa-dev@w3.org, with details on the URI you tried the tool on.</p>
<p>Thanks,</p>
<p>Dom</p>
</div>
</div>
<div class="comment" id="comment-180797">
<p class="comment-meta" id="c180797">
<span class="comment-meta-author"><strong>OP </strong></span>
<span class="comment-meta-date"><a href="#c180797">#</a> 2009-04-09</span>
</p>
<div class="comment-bulk">
<p>Hi there.</p>
<p>I am a big fan of this tool. But something that is puzzling me is this message we are getting for our site(s) when running them through.</p>
<p>"&lt;div&gt; with no additional content to their unique child"</p>
<p>Gotta love it. Naturally I checked out a few things and tried what I thought might fix this from appearing, and to no avail.</p>
<p>The divs I have are empty in a sense... they have this. Since the include file is in </p>
<p>So I tried to add a few things into the div container... nbsp's, transparent gifs, other content, and still nothing was reducing the amount of divs with no additional content.</p>
<p>The included file doesn't consist of having empty divs inside it.</p>
<p>Other divs have content in them with headers and such, so I think that the div I mentioned may be throwing it off.</p>
<p>Any suggestions what to do to remedy this? </p>
</div>
</div>
<div class="comment" id="comment-182614">
<p class="comment-meta" id="c182614">
<span class="comment-meta-author"><strong>Joe </strong></span>
<span class="comment-meta-date"><a href="#c182614">#</a> 2009-07-08</span>
</p>
<div class="comment-bulk">
<p>I second OP's comment.
I have no clue what this error means, having tried the same as OP:</p>
<p>Non-semantic markup</p>
<p>The following suspiciously non-semantic markup has been detected:</p>
<pre>* 6 &lt;div&gt; with no additional content to their unique child
</pre>
<p>Any advice?
Someone?
Thanks!</p>
</div>
</div>
<div class="comment" id="comment-182615">
<p class="comment-meta" id="c182615">
<span class="comment-meta-author"><strong>Dom </strong></span>
<span class="comment-meta-date"><a href="#c182615">#</a> 2009-07-08</span>
</p>
<div class="comment-bulk">
<p>I have removed the test for empty div, as it was both confusing and misleading.</p>
</div>
</div>
<div class="comment" id="comment-209881">
<p class="comment-meta" id="c209881">
<span class="comment-meta-author"><strong>James Sanders </strong></span>
<span class="comment-meta-date"><a href="#c209881">#</a> 2011-01-01</span>
</p>
<div class="comment-bulk">
<p>Hello All,</p>
<p>Well, after 3 hours of searching through W3C to try to figure out what is wrong and why the semantics validator keeps throwing errors, I finally decided to post here in hopes that some answers might be forthcoming. Hopefully, with Q@A being closed, this still gets some attention. So without further delay, the error I get is as follows:</p>
<p>Using org.apache.xerces.parsers.SAXParser
Exception net.sf.saxon.trans.XPathException: org.xml.sax.SAXParseException: The markup declarations contained or pointed to by the document type declaration must be well-formed.
org.xml.sax.SAXParseException: The markup declarations contained or pointed to by the document type declaration must be well-formed.</p>
<p>URI link to the file in question is as follows:
<a href="http://www.sanders-consultation-group-plus.com/redesign/grail-template.html" rel="nofollow">http://www.sanders-consultation-group-plus.com/redesign/grail-template.html</a></p>
<p>Any help in this matter would be greatly appreciated because I do so much love the idea of running pages through the validator. It really does give a designer an idea of what spiders might think a page is about based on the semantics, and lets me know, as a designer, if I have done my job to make sure they know.</p>
<p>Thanks in advance</p>
</div>
</div>
<div class="comment" id="comment-209937">
<p class="comment-meta" id="c209937">
<span class="comment-meta-author"><strong>Dom </strong></span>
<span class="comment-meta-date"><a href="#c209937">#</a> 2011-01-03</span>
</p>
<div class="comment-bulk">
<p>The problem reported by James seems to be coming from an invalid/ill-formed XHTML document, see <a href="http://lists.w3.org/Archives/Public/public-qa-dev/2011Jan/0001.html" rel="nofollow">http://lists.w3.org/Archives/Public/public-qa-dev/2011Jan/0001.html</a></p>
</div>
</div>
<div class="comments-open" id="comments-open">
<h3 class="comments-open-header">Leave a comment</h3>
<div class="comments-open-moderated">
<p>
Note: this blog is intended to foster <strong>polite
on-topic discussions</strong>. Comments failing these
requirements and spam will not get published. Please,
enter your real name and email address. Every
individual comment is reviewed by the W3C staff.
This may take some time, thank you for your patience.
</p>
<p>
You can use the following HTML markup (a href, b, i,
br/, p, strong, em, ul, ol, li, blockquote, pre)
and/or <a href="http://daringfireball.net/projects/markdown/syntax">Markdown syntax</a>.</p>
</div>
<div id="comments-open-data">
<form method="post" action="http://www.w3.org/QA/sununga/beach.pl" id="comments-form">
<h4>Your comment</h4>
<div id="comments-open-text">
<textarea id="comment-text" name="text" rows="20" cols="100"></textarea><br />
<label for="comment-text">Write your comment text here. Remember, keep the discussion on topic and courteous.</label>
</div>
<h4>About you</h4>
<div id="comment-form-name">
<input type="hidden" name="static" value="1" />
<input type="hidden" name="entry_id" value="6306" />
<input type="hidden" name="__lang" value="en" />
<label for="comment-author">Your Name</label>
<input id="comment-author" name="author" size="30" value="" />
</div>
<div id="comment-form-email">
<label for="comment-email">Your Email Address</label>
<input id="comment-email" name="email" size="30" value="" />
</div>
<div id="comments-open-footer">
<input type="submit" accesskey="s" name="post" id="comment-submit" value="Submit" />
</div>
</form>
</div>
</div>
<p id="gentime">This page was last generated on $Date: 2011/12/05 17:18:25 $</p>
</div><!-- End of "main" DIV. -->
<address>
This blog is written by W3C staff and working group participants,<br />
and maintained by <a href="/People/CMercier/">Coralie Mercier</a>.<br />
Authorized parties may <a href="/QA/new">log in</a> to create a new entry.<br/>
<span id="poweredby">Powered by Movable Type, magpierss and a lot of Web Technology</span>
</address>
<p class="copyright">
<a rel="Copyright" href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © 1994-2011
<a href="http://www.w3.org/"><acronym title="World Wide Web Consortium">W3C</acronym></a>®
(<a href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>,
<a href="http://www.ercim.eu/"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>),
All Rights Reserved.
W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
<a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
and <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a>
rules apply. Your interactions with this site are in accordance
with our <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Public">public</a> and
<a href="http://www.w3.org/Consortium/Legal/privacy-statement#Members">Member</a> privacy
statements.
</p>
</body>
</html>