<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css" media="all">
@import "/QA/2006/01/blogstyle.css";
</style>
<meta name="keywords" content='dom, html, html5, rdfa, tools' />
<meta name="description" content="You have read a lot about the html 5 specification. You heard that there were hidden dragons and acid rains. But what about looking by yourself practically how html 5 parsing is working? There are already some tools to play with html 5." />
<meta name="revision" content="$Id: html5-parsing-howto.html,v 1.37 2011/12/16 03:02:57 gerald Exp $" />
<link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.w3.org/QA/atom.xml" />
<link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="http://www.w3.org/QA/news.rss" />
<title>The How-To for html 5 parsing - W3C Blog</title>
<link rel="start" href="http://www.w3.org/QA/" title="Home" />
<link rel="prev" href="http://www.w3.org/QA/2008/07/interoperability-release-cycle.html" title="Improving Interoperability by Short Release Cycle " />
<link rel="next" href="http://www.w3.org/QA/2008/07/life_without_mime_type_sniffin.html" title="life without MIME type sniffing?" />
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.w3.org/QA/2008/07/html5-parsing-howto.html"
trackback:ping="http://www.w3.org/QA/sununga/mt-tb.cgi/194"
dc:title="The How-To for html 5 parsing"
dc:identifier="http://www.w3.org/QA/2008/07/html5-parsing-howto.html"
dc:subject="HTML"
dc:description="You have read a lot about the html 5 specification. You heard that there were hidden dragons and acid rains. But what about looking by yourself practically how html 5 parsing is working? There are already some tools to play with html 5."
dc:creator="Karl Dubost"
dc:date="2008-07-07T02:35:04+00:00" />
</rdf:RDF>
-->
<!-- <script type="text/javascript" src="http://www.w3.org/QA/mt.js"></script>-->
</head>
<body class="layout-one-column">
<div id="banner">
<h1 id="title">
<a href="http://www.w3.org/"><img height="48" alt="W3C" id="logo" src="http://www.w3.org/Icons/WWW/w3c_home_nb" /></a>
W3C Blog
</h1>
</div>
<ul class="navbar" id="menu">
<li><strong><a href="/QA/" title="W3C Blog Home">[ W3C Blog ]</a></strong></li>
<li><a href="/QA/Library/" title="Documents and Publications on Web and Quality">Documents</a></li>
<li><a href="/QA/Tools/" accesskey="3" title="Validators and other Tools">Tools</a></li>
<li><a href="/2007/12/qa-blog-help/index#feedback">Feedback</a></li>
</ul>
<div id="searchbox">
<form method="get" action="http://www.google.com/custom" enctype="application/x-www-form-urlencoded">
<p id="formbox"><input type="text" size="15" class="textfield" name="q" accesskey="E" maxlength="255" /> <input type="submit" class="submitfield" value="Search" id="goButton" name="sa" accesskey="G" /> <input type="hidden" name="cof" value="T:black;LW:72;ALC:#ff3300;L:http://www.w3.org/Icons/w3c_home;LC:#000099;LH:48;BGC:white;AH:left;VLC:#660066;GL:0;AWFID:0b9847e42caf283e;" /><input type="hidden" id="searchW3C" name="sitesearch" checked="checked" value="www.w3.org/QA" /><input type="hidden" name="domains" value="www.w3.org/QA" /></p>
</form>
</div>
<div id="main"><!-- This DIV encapsulates everything in this page - necessary for the positioning -->
<p class="content-nav">
<a href="http://www.w3.org/QA/2008/07/interoperability-release-cycle.html">« Improving Interoperability by Short Release Cycle </a> |
<a href="http://www.w3.org/QA/">Main</a>
| <a href="http://www.w3.org/QA/2008/07/life_without_mime_type_sniffin.html">life without MIME type sniffing? »</a>
</p>
<h2 class="entry-header">The How-To for html 5 parsing</h2>
<div class="entry-body">
<p>You have read a lot about the html 5 specification. You have heard that there are hidden dragons and acid rain. But why not see for yourself, in practice, how <a href="http://www.w3.org/TR/html5/parsing.html#parsing">html 5 parsing</a> works? There are already some tools for experimenting with html 5.</p>
<h3>DOM in actual browsers</h3>
<p>The <a href="http://www.w3.org/DOM/faq.html#what">DOM</a> (Document Object Model) is the in-memory representation that browsers use to manipulate Web content. Browsers have <a href="http://www.w3.org/QA/2008/07/interoperability-release-cycle">bugs</a>, and much of the content on the Web is non-conforming. As a result, the same document can produce very different DOM trees in different browsers. If you want to see what a document looks like in different browsers, you can use the <a href="http://software.hixie.ch/utilities/js/live-dom-viewer/">Live DOM Viewer</a>. Open this link in each browser you have and paste markup into the window.</p>
<p>This helps you see how different tools understand Web content today.</p>
<h3>DOM after html 5 parsing</h3>
<p>Now you might want to see how a tool implementing the html 5 parsing rules will represent a document. An important note: html 5 is a specification <strong>in development</strong>. Things might change. The following tools might be incomplete and might contain bugs as well, but they will give you an idea of the resulting DOM. This is very practical when you are developing another language that is not html 5 but might be sent as text/html (by mistake or by practical choice).</p>
<p>There are at least two online services:</p>
<ul>
<li><a href="http://philip.html5.org/tools/parser/">Live html 5 parser</a> by Philip Taylor</li>
<li><a href="http://james.html5.org/parsetree.html">html5lib Based HTML5 Parser</a></li>
</ul>
<p><a href="http://hsivonen.iki.fi/">Henri Sivonen</a> developed a <a href="http://lists.w3.org/Archives/Public/www-archive/2008Jun/0145">standalone application</a> that you can use on your desktop. Here are the instructions to get it running. It worked fine on my Macintosh.</p>
<ol>
<li>Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser</li>
<li>Download and untar GWT 1.5 RC1: http://code.google.com/webtoolkit/versions.html</li>
<li>On Linux, install libstdc++5 and a JDK (Ubuntu's OpenJDK-based package worked for me).</li>
<li>Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.</li>
<li>Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux).</li>
</ol>
<p>Henri gave a list of <a href="http://lists.w3.org/Archives/Public/www-archive/2008Jun/0145">limitations and bugs</a>.</p>
<h3>Using html 5 parsing in your own code</h3>
<p>For now, there are three implementations of the html 5 parsing algorithm:</p>
<ul>
<li><a href="http://html5lib.googlecode.com/files/html5lib-0.11.1.zip">html5lib python</a> 0.11.1</li>
<li><a href="http://html5lib.googlecode.com/files/html5-0.10.0.gem">html5lib ruby</a> 0.10.0</li>
<li><a href="http://about.validator.nu/htmlparser/">html 5 parser java</a></li>
</ul>
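<p>As a rough illustration of what using one of these libraries can look like, here is a minimal Python sketch. It assumes that the third-party html5lib package is installed and that it exposes <code>html5lib.parse()</code> as its current releases do; the 0.11.1 release linked above may differ. The helper names <code>parse_html5</code> and <code>dump_tree</code> are made up for this example and are not part of html5lib.</p>
<pre><code class="python">def parse_html5(markup):
    """Parse markup with html5lib and return the root ElementTree element.

    Assumption: the third-party html5lib package is installed and provides
    html5lib.parse() as in its current releases.
    """
    import html5lib  # imported lazily so dump_tree works without it

    # namespaceHTMLElements=False keeps tag names short ("p", not "{ns}p")
    return html5lib.parse(markup, namespaceHTMLElements=False)


def dump_tree(element, depth=0):
    """Return one indented line per element and per non-empty text node."""
    # ElementTree marks comment nodes with a non-string tag
    tag = element.tag if isinstance(element.tag, str) else "#comment"
    lines = ["  " * depth + tag]
    if element.text and element.text.strip():
        lines.append("  " * (depth + 1) + repr(element.text.strip()))
    for child in element:
        lines.extend(dump_tree(child, depth + 1))
        # Text that follows a child element is stored as its "tail"
        if child.tail and child.tail.strip():
            lines.append("  " * (depth + 1) + repr(child.tail.strip()))
    return lines
</code></pre>
<p>For example, <code>print("\n".join(dump_tree(parse_html5("&lt;p&gt;Hello"))))</code> should list the <code>html</code>, <code>head</code>, <code>body</code> and <code>p</code> elements that the parser creates, even though the input contains only a <code>p</code> start tag.</p>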
<p>There is also an attempt at implementing it in C# for .NET 2.0, but no code has been released yet:</p>
<ul>
<li><a href="http://code.google.com/p/twintsam/">Twintsam</a></li>
</ul>
<p>If you know of other tools implementing the html 5 parsing algorithm, leave a comment.</p>
</div>
<div id="more" class="entry-more">
</div>
<p class="postinfo">Filed by <a href="http://www.w3.org/People/karl/">Karl Dubost</a> on July 7, 2008 2:35 AM in <a href="http://www.w3.org/QA/archive/technology/html/">HTML</a>, <a href="http://www.w3.org/QA/archive/w3cqa_news/technology_101/">Technology 101</a>, <a href="http://www.w3.org/QA/archive/w3cqa_news/tools/">Tools</a><br />
<span class="separator">|</span> <a class="permalink" href="http://www.w3.org/QA/2008/07/html5-parsing-howto.html">Permalink</a>
| <a href="http://www.w3.org/QA/2008/07/html5-parsing-howto.html#comments">Comments (0)</a>
| <a href="http://www.w3.org/QA/2008/07/html5-parsing-howto.html#trackback">TrackBacks (0)</a>
</p>
<div class="comments-open" id="comments-open">
<h3 class="comments-open-header">Leave a comment</h3>
<div class="comments-open-moderated">
<p>
Note: this blog is intended to foster <strong>polite
on-topic discussions</strong>. Comments failing these
requirements and spam will not get published. Please,
enter your real name and email address. Every
individual comment is reviewed by the W3C staff.
This may take some time, thank you for your patience.
</p>
<p>
You can use the following HTML markup (a href, b, i,
br/, p, strong, em, ul, ol, li, blockquote, pre)
and/or <a href="http://daringfireball.net/projects/markdown/syntax">Markdown syntax</a>.</p>
</div>
<div id="comments-open-data">
<form method="post" action="http://www.w3.org/QA/sununga/beach.pl" id="comments-form">
<h4>Your comment</h4>
<div id="comments-open-text">
<textarea id="comment-text" name="text" rows="20" cols="100"></textarea><br />
<label for="comment-text">Write your comment text here. Remember, keep the discussion on topic and courteous.</label>
</div>
<h4>About you</h4>
<div id="comment-form-name">
<input type="hidden" name="static" value="1" />
<input type="hidden" name="entry_id" value="204" />
<input type="hidden" name="__lang" value="en" />
<label for="comment-author">Your Name</label>
<input id="comment-author" name="author" size="30" value="" />
</div>
<div id="comment-form-email">
<label for="comment-email">Your Email Address</label>
<input id="comment-email" name="email" size="30" value="" />
</div>
<div id="comments-open-footer">
<input type="submit" accesskey="s" name="post" id="comment-submit" value="Submit" />
</div>
</form>
</div>
</div>
<p id="gentime">This page was last generated on $Date: 2011/12/16 03:02:57 $</p>
</div><!-- End of "main" DIV. -->
<address>
This blog is written by W3C staff and working group participants,<br />
and maintained by <a href="/People/CMercier/">Coralie Mercier</a>.<br />
Authorized parties may <a href="/QA/new">log in</a> to create a new entry.<br/>
<span id="poweredby">Powered by Movable Type, magpierss and a lot of Web Technology</span>
</address>
<p class="copyright">
<a rel="Copyright" href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © 1994-2011
<a href="http://www.w3.org/"><acronym title="World Wide Web Consortium">W3C</acronym></a>®
(<a href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>,
<a href="http://www.ercim.eu/"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>),
All Rights Reserved.
W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
<a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
and <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a>
rules apply. Your interactions with this site are in accordance
with our <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Public">public</a> and
<a href="http://www.w3.org/Consortium/Legal/privacy-statement#Members">Member</a> privacy
statements.
</p>
</body>
</html>