<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <style type="text/css" media="all">
    @import "/QA/2006/01/blogstyle.css";
    </style>
    <meta name="keywords" content='dom, html, html5, rdfa, tools' />
    <meta name="description" content="You have read a lot about the html 5 specification. You heard that there were hidden dragons and acid rains. But what about looking by yourself practically how html 5 parsing is working? There are already some tools to play with html 5." />
    <meta name="revision" content="$Id: html5-parsing-howto.html,v 1.37 2011/12/16 03:02:57 gerald Exp $" />    
   <link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.w3.org/QA/atom.xml" />
   <link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="http://www.w3.org/QA/news.rss" />   
   <title>The How-To for html 5 parsing - W3C Blog</title>

   <link rel="start" href="http://www.w3.org/QA/" title="Home" />
   <link rel="prev" href="http://www.w3.org/QA/2008/07/interoperability-release-cycle.html" title="Improving Interoperability by Short Release Cycle " />
   <link rel="next" href="http://www.w3.org/QA/2008/07/life_without_mime_type_sniffin.html" title="life without MIME type sniffing?" />

   <!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
    rdf:about="http://www.w3.org/QA/2008/07/html5-parsing-howto.html"
    trackback:ping="http://www.w3.org/QA/sununga/mt-tb.cgi/194"
    dc:title="The How-To for html 5 parsing"
    dc:identifier="http://www.w3.org/QA/2008/07/html5-parsing-howto.html"
    dc:subject="HTML"
    dc:description="You have read a lot about the html 5 specification. You heard that there were hidden dragons and acid rains. But what about looking by yourself practically how html 5 parsing is working? There are already some tools to play with html 5."
    dc:creator="Karl Dubost"
    dc:date="2008-07-07T02:35:04+00:00" />
</rdf:RDF>
-->

    <!-- <script type="text/javascript" src="http://www.w3.org/QA/mt.js"></script>-->

</head>
<body class="layout-one-column">
      <div id="banner">
      <h1 id="title">
	<a href="http://www.w3.org/"><img height="48" alt="W3C" id="logo" src="http://www.w3.org/Icons/WWW/w3c_home_nb" /></a>
W3C Blog
</h1>
    </div>
    
    <ul class="navbar" id="menu">
        <li><strong><a href="/QA/" title="W3C Blog Home">[ W3C Blog ]</a></strong></li>
        <li><a href="/QA/Library/" title="Documents and Publications on Web and Quality">Documents</a></li>
        <li><a href="/QA/Tools/" accesskey="3" title="Validators and other Tools">Tools</a></li>
        <li><a href="/2007/12/qa-blog-help/index#feedback">Feedback</a></li>
    </ul>
<div id="searchbox">
<form method="get" action="http://www.google.com/custom" enctype="application/x-www-form-urlencoded">
<p id="formbox"><input type="text" size="15" class="textfield" name="q" accesskey="E" maxlength="255" /> <input type="submit" class="submitfield" value="Search" id="goButton" name="sa" accesskey="G" /> <input type="hidden" name="cof" value="T:black;LW:72;ALC:#ff3300;L:http://www.w3.org/Icons/w3c_home;LC:#000099;LH:48;BGC:white;AH:left;VLC:#660066;GL:0;AWFID:0b9847e42caf283e;" /><input type="hidden" id="searchW3C" name="sitesearch" checked="checked" value="www.w3.org/QA" /><input type="hidden" name="domains" value="www.w3.org/QA" /></p>
</form>
</div>


    <div id="main"><!-- This DIV encapsulates everything in this page - necessary for the positioning -->

                     <p class="content-nav">
                        <a href="http://www.w3.org/QA/2008/07/interoperability-release-cycle.html">&laquo; Improving Interoperability by Short Release Cycle </a> |
                        <a href="http://www.w3.org/QA/">Main</a>
                        | <a href="http://www.w3.org/QA/2008/07/life_without_mime_type_sniffin.html">life without MIME type sniffing? &raquo;</a>
                     </p>

                        <h2 class="entry-header">The How-To for html 5 parsing</h2>
                           <div class="entry-body">
                              <p>You have read a lot about the html 5 specification. You have heard that there are hidden dragons and acid rains. But why not see for yourself how <a href="http://www.w3.org/TR/html5/parsing.html#parsing">html 5 parsing</a> works in practice? There are already some tools that let you experiment with html 5.</p>

<h3>DOM in actual browsers</h3>

<p><a href="http://www.w3.org/DOM/faq.html#what">DOM</a> (Document Object Model) is the representation that browsers use in memory to manipulate Web content. Browsers have <a href="http://www.w3.org/QA/2008/07/interoperability-release-cycle">bugs</a> and much of the content on the Web is non-conforming. As a result, the same document can produce very different DOM representations in different browsers. If you are interested in seeing what a document looks like in different browsers, you can use the <a href="http://software.hixie.ch/utilities/js/live-dom-viewer/">Live DOM Viewer</a>. Open this link in each browser you have and paste markup into the window.</p>

<p>This shows you how different tools understand Web content today.</p>

<h3>DOM after html 5 parsing</h3>

<p>Now you might be interested in seeing how a document will be represented by a tool implementing the html 5 parsing rules. An important note: html 5 is a specification <strong>in development</strong>, and things might change. The following tools might be incomplete and contain bugs as well. Still, they will give you an idea of the resulting DOM. This is particularly useful when you are developing another language that is not html 5 but might be sent as text/html (by mistake or by practical choice).</p>

<p>There are at least two online services:</p>

<ul>
<li><a href="http://philip.html5.org/tools/parser/">Live html 5 parser</a> by Philip Taylor</li>
<li><a href="http://james.html5.org/parsetree.html">html5lib Based HTML5 Parser</a></li>
</ul>

<p><a href="http://hsivonen.iki.fi/">Henri Sivonen</a> developed a <a href="http://lists.w3.org/Archives/Public/www-archive/2008Jun/0145">standalone application</a> that you can use on your desktop. Here are the instructions to get it running. It worked fine on my Macintosh.</p>

<ol>
<li>Check out the source: svn co http://svn.versiondude.net/whattf/htmlparser/trunk/ htmlparser</li>
<li>Download and untar GWT 1.5 RC1: http://code.google.com/webtoolkit/versions.html</li>
<li>On Linux, install libstdc++5 and a JDK (Ubuntu's OpenJDK-based package worked for me).</li>
<li>Edit the paths in HtmlParser-shell (Mac) or HtmlParser-linux (Linux) to point to the location of GWT.</li>
<li>Run HtmlParser-shell (Mac) or HtmlParser-linux (Linux).</li>
</ol>

<p>Henri also gave a list of <a href="http://lists.w3.org/Archives/Public/www-archive/2008Jun/0145">limitations and bugs</a>.</p>

<h3>Using html 5 parsing in your own code</h3>

<p>For now, there are three implementations of the html 5 parsing algorithm:</p>

<ul>
<li><a href="http://html5lib.googlecode.com/files/html5lib-0.11.1.zip">html5lib python</a> 0.11.1</li>
<li><a href="http://html5lib.googlecode.com/files/html5-0.10.0.gem">html5lib ruby</a> 0.10.0</li>
<li><a href="http://about.validator.nu/htmlparser/">html 5 parser java</a></li>
</ul>
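
<p>As a rough sketch of what calling one of these libraries looks like, the following uses html5lib for Python. It assumes a recent html5lib release is installed; the API of the 0.11.1 release linked above may differ. The helper name outline is made up for this illustration. It parses a bare string and lists the elements found in the resulting tree, which shows the parser adding the html, head and body elements that the input never mentioned.</p>

<pre>
```python
import html5lib  # third-party; assumed to be a recent release


def outline(markup):
    """Parse markup with html5lib and return its element tree as indented tag names."""
    # namespaceHTMLElements=False keeps tag names short ("body" rather than
    # "{http://www.w3.org/1999/xhtml}body").
    root = html5lib.parse(markup, treebuilder="etree", namespaceHTMLElements=False)
    lines = []

    def walk(element, depth):
        lines.append("  " * depth + element.tag)
        for child in element:
            # Comment nodes have a non-string tag; skip them in this outline.
            if isinstance(child.tag, str):
                walk(child, depth + 1)

    walk(root, 0)
    return lines


if __name__ == "__main__":
    # No tags at all in the input, yet the tree has html, head and body.
    print("\n".join(outline("Hello")))
```
</pre>

<p>Markup with unclosed or misnested tags goes through the same call, so you can compare its outline with what the Live DOM Viewer shows in your browser.</p>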

<p>There is also an attempt at implementing it in C# for .NET 2.0, but no code has been released yet:</p>

<ul>
<li><a href="http://code.google.com/p/twintsam/">Twintsam</a></li>
</ul>

<p>If you know of other tools implementing the html 5 parsing algorithm, leave a comment.</p>

                           </div>
                           <div id="more" class="entry-more">
                              

                           </div>
                       <p class="postinfo">Filed by <a href="http://www.w3.org/People/karl/">Karl Dubost</a> on July  7, 2008  2:35 AM in <a href="http://www.w3.org/QA/archive/technology/html/">HTML</a>, <a href="http://www.w3.org/QA/archive/w3cqa_news/technology_101/">Technology 101</a>, <a href="http://www.w3.org/QA/archive/w3cqa_news/tools/">Tools</a><br />
<span class="separator">|</span> <a class="permalink" href="http://www.w3.org/QA/2008/07/html5-parsing-howto.html">Permalink</a>
                                 | <a href="http://www.w3.org/QA/2008/07/html5-parsing-howto.html#comments">Comments (0)</a>
                                 | <a href="http://www.w3.org/QA/2008/07/html5-parsing-howto.html#trackback">TrackBacks (0)</a>
</p>





  <div class="comments-open" id="comments-open">
<h3 class="comments-open-header">Leave a comment</h3>

<div class="comments-open-moderated">
   <p>
   Note: this blog is intended to foster <strong>polite
   on-topic discussions</strong>. Comments failing these
   requirements and spam will not get published. Please
   enter your real name and email address. Every
   individual comment is reviewed by the W3C staff.
   This may take some time; thank you for your patience.
   </p>
   <p>
   You can use the following HTML markup (a href, b, i, 
   br/, p, strong, em, ul, ol, li, blockquote, pre) 
   and/or <a href="http://daringfireball.net/projects/markdown/syntax">Markdown syntax</a>.</p>
</div>

<div id="comments-open-data">
<form method="post" action="http://www.w3.org/QA/sununga/beach.pl" id="comments-form">
<h4>Your comment</h4>
<div id="comments-open-text">
  <textarea id="comment-text" name="text" rows="20" cols="100"></textarea><br />
<label for="comment-text">Write your comment text here. Remember, keep the discussion on topic and courteous.</label>
</div>

<h4>About you</h4>
<div id="comment-form-name">
  <input type="hidden" name="static" value="1" />
<input type="hidden" name="entry_id" value="204" />
<input type="hidden" name="__lang" value="en" /> 
<label for="comment-author">Your Name</label>
<input id="comment-author" name="author" size="30" value="" />
</div>
<div id="comment-form-email">
<label for="comment-email">Your Email Address</label>
<input id="comment-email" name="email" size="30" value="" />
</div>

<div id="comments-open-footer">
<input type="submit" accesskey="s" name="post" id="comment-submit" value="Submit" />

</div>
</form>
</div>
</div>



<p id="gentime">This page was last generated on $Date: 2011/12/16 03:02:57 $</p> 

      </div><!-- End of "main" DIV. -->

<address>

This blog is written by W3C staff and working group participants,<br />
&nbsp;and maintained by <a href="/People/CMercier/">Coralie Mercier</a>.<br />
Authorized parties may <a href="/QA/new">log in</a> to create a new entry.<br/>
<span id="poweredby">Powered by Movable Type, magpierss and a lot of Web Technology</span>
    </address>


    
    <p class="copyright">
      <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> &copy; 1994-2011
      <a href="http://www.w3.org/"><acronym title="World Wide Web Consortium">W3C</acronym></a>&reg;
      (<a href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>,
      <a href="http://www.ercim.eu/"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
      <a href="http://www.keio.ac.jp/">Keio</a>),
      All Rights Reserved.
      W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
      <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
      <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
      and <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a>
      rules apply. Your interactions with this site are in accordance
      with our <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Public">public</a> and
      <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Members">Member</a> privacy
      statements.
    </p>

  </body>
</html>