<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <style type="text/css" media="all">
    @import "/QA/2006/01/blogstyle.css";
    </style>
    <meta name="keywords" content='html, html5, i18n, implementation, Internationalization, unicode, validator' />
    <meta name="description" content="utf-8 is taking over traditional encodings on the Web." />
    <meta name="revision" content="$Id: utf8-web-growth.html,v 1.43 2011/12/16 03:02:51 gerald Exp $" />    
   <link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.w3.org/QA/atom.xml" />
   <link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="http://www.w3.org/QA/news.rss" />   
   <title>utf-8 Growth On The Web - W3C Blog</title>

   <link rel="start" href="http://www.w3.org/QA/" title="Home" />
   <link rel="prev" href="http://www.w3.org/QA/2008/05/canvas-text-and-cjk.html" title="Vertical Layouts for Canvas Text (CJK)" />
   <link rel="next" href="http://www.w3.org/QA/2008/05/syntax_for_aria_costbenefit_an.html" title="Syntax for ARIA: Cost-benefit analysis" />

   <!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
    rdf:about="http://www.w3.org/QA/2008/05/utf8-web-growth.html"
    trackback:ping="http://www.w3.org/QA/sununga/mt-tb.cgi/166"
    dc:title="utf-8 Growth On The Web"
    dc:identifier="http://www.w3.org/QA/2008/05/utf8-web-growth.html"
    dc:subject="HTML"
    dc:description="utf-8 is taking over traditional encodings on the Web."
    dc:creator="Karl Dubost"
    dc:date="2008-05-06T23:51:49+00:00" />
</rdf:RDF>
-->

    <!-- <script type="text/javascript" src="http://www.w3.org/QA/mt.js"></script>-->

</head>
<body class="layout-one-column">
      <div id="banner">
      <h1 id="title">
	<a href="http://www.w3.org/"><img height="48" alt="W3C" id="logo" src="http://www.w3.org/Icons/WWW/w3c_home_nb" /></a>
W3C Blog
</h1>
    </div>
    
    <ul class="navbar" id="menu">
        <li><strong><a href="/QA/" title="W3C Blog Home">[ W3C Blog ]</a></strong></li>
        <li><a href="/QA/Library/" title="Documents and Publications on Web and Quality">Documents</a></li>
        <li><a href="/QA/Tools/" accesskey="3" title="Validators and other Tools">Tools</a></li>
        <li><a href="/2007/12/qa-blog-help/index#feedback">Feedback</a></li>
    </ul>
<div id="searchbox">
<form method="get" action="http://www.google.com/custom" enctype="application/x-www-form-urlencoded">
<p id="formbox"><input type="text" size="15" class="textfield" name="q" accesskey="E" maxlength="255" /> <input type="submit" class="submitfield" value="Search" id="goButton" name="sa" accesskey="G" /> <input type="hidden" name="cof" value="T:black;LW:72;ALC:#ff3300;L:http://www.w3.org/Icons/w3c_home;LC:#000099;LH:48;BGC:white;AH:left;VLC:#660066;GL:0;AWFID:0b9847e42caf283e;" /><input type="hidden" id="searchW3C" name="sitesearch" checked="checked" value="www.w3.org/QA" /><input type="hidden" name="domains" value="www.w3.org/QA" /></p>
</form>
</div>


    <div id="main"><!-- This DIV encapsulates everything in this page - necessary for the positioning -->

                     <p class="content-nav">
                        <a href="http://www.w3.org/QA/2008/05/canvas-text-and-cjk.html">&laquo; Vertical Layouts for Canvas Text (CJK)</a> |
                        <a href="http://www.w3.org/QA/">Main</a>
                        | <a href="http://www.w3.org/QA/2008/05/syntax_for_aria_costbenefit_an.html">Syntax for ARIA: Cost-benefit analysis &raquo;</a>
                     </p>

                        <h2 class="entry-header">utf-8 Growth On The Web</h2>
                           <div class="entry-body">
                              <p>On Google's blog, Mark Davis explains that Google is <a href="http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html">moving to Unicode 5.1</a>. Unfortunately, the article conflates Unicode and utf-8, as David Goodger noted in <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=230157">Unicode misinformation</a>. But the really interesting bit is the growth of utf-8 on the Web. These data should be useful for the development of HTTP, HTML 5 and validators.</p>

<p><img src="/QA/2008/05/utf8-growth-google" width="432" height="458" alt="utf-8 growth on the Web compared to other encodings"/></p>

<p>© graph from Google.</p>

                           </div>
                           <div id="more" class="entry-more">
                              

                           </div>
                       <p class="postinfo">Filed by <a href="http://www.w3.org/People/karl/">Karl Dubost</a> on May  6, 2008 11:51 PM in <a href="http://www.w3.org/QA/archive/technology/html/">HTML</a>, <a href="http://www.w3.org/QA/archive/technology/http/">HTTP</a>, <a href="http://www.w3.org/QA/archive/web_spotting/opinions_editorial/">Opinions &amp; Editorial</a>, <a href="http://www.w3.org/QA/archive/w3cqa_news/tools/">Tools</a><br />
<span class="separator">|</span> <a class="permalink" href="http://www.w3.org/QA/2008/05/utf8-web-growth.html">Permalink</a>
                                 | <a href="http://www.w3.org/QA/2008/05/utf8-web-growth.html#comments">Comments (8)</a>
                                 | <a href="http://www.w3.org/QA/2008/05/utf8-web-growth.html#trackback">TrackBacks (0)</a>
</p>



<h3 class="comments-header" id="comments">Comments</h3>
<div class="comment" id="comment-139548">
<p class="comment-meta" id="c139548">
<span class="comment-meta-author"><strong>Mark Nottingham </strong></span>
<span class="comment-meta-date"><a href="#c139548">#</a> 2008-05-07</span>
</p>
<div class="comment-bulk">
<p>I wonder how they determined the encoding of pages for purposes of the graph.</p>

</div>
</div>


<div class="comment" id="comment-139574">
<p class="comment-meta" id="c139574">
<span class="comment-meta-author"><strong>Fwolf </strong></span>
<span class="comment-meta-date"><a href="#c139574">#</a> 2008-05-07</span>
</p>
<div class="comment-bulk">
<p>The rise of utf-8 matches the decline of US-only (ASCII).
Another sad point is that the change for Chinese (gb2312) is not obvious.</p>

<p>PS: Glad to see another comment support <a href="http://daringfireball.net/projects/markdown/syntax" rel="nofollow">Markdown Syntax</a> (<a href="http://michelf.com/projects/php-markdown/extra/" rel="nofollow">Extra</a>)!</p>

</div>
</div>


<div class="comment" id="comment-139790">
<p class="comment-meta" id="c139790">
<span class="comment-meta-author"><strong>Frank </strong></span>
<span class="comment-meta-date"><a href="#c139790">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>If you add <b>US-ASCII</b> and <b>UTF-8</b> together, the picture is almost stable over the last seven years: a valid US-ASCII page (with NCRs) is also a valid UTF-8 page. Likely folks updated their defaults and dare to use more <i>non-ASCII UTF-8</i> than in 2001; to be sure, you'd have to look into the pages. <a href="http://en.wikipedia.org/wiki/Lies%2C_damned_lies%2C_and_statistics" rel="nofollow">Lies, damned lies, and statistics</a>...</p>

</div>
</div>


<div class="comment" id="comment-139793">
<p class="comment-meta" id="c139793">
<span class="comment-meta-author"><strong>Karl Dubost <a class="commenter-profile" href="http://www.w3.org/People/karl/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c139793">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>@Mark </p>

<p>I don't know either, but it would be interesting to know more. The data have been compiled by Erik van der Poel. Maybe he could chime in and explain a bit more. I have sent him a <a href="http://lists.w3.org/Archives/Public/www-archive/2008May/0014" rel="nofollow">pointer to this thread</a>.</p>

</div>
</div>


<div class="comment" id="comment-139843">
<p class="comment-meta" id="c139843">
<span class="comment-meta-author"><strong>Brian Wilson </strong></span>
<span class="comment-meta-date"><a href="#c139843">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>[I drafted this up earlier and no one had responded yet...now I'm a little late to the game]</p>

<p>This is an interesting look at encoding, but many questions remain, namely 
their methodology and their URL set.</p>

<p>There are a number of ways to specify the encoding for a document, including 
just doing some blind scanning of the raw document. The Google blog post makes 
zero mention of how they detected the encoding. Based on the graph and the 
wording, it makes me think that they did not look at any of the stated 
encodings from the document (the "charset" parameter of the Content-Type HTTP 
header, the same value via the META markup element and the "encoding" attribute
of the XML declaration).</p>

<p>I've also been doing some studies of stated encodings recently using mainly the 
Open Directory Project (DMoz) as the URL set. [Note: unlike the Google research, 
the results I have found are currently a snapshot only and do not represent any 
trends over time.] Results from that study will be released soon, but, I've 
noticed a few things about character encoding after analyzing about 3.5 million 
URLs so far. Here are a few highlights:</p>

<ul>
<li><p>A minority of documents (~20%) use the HTTP Header to declare the 
encoding[1]. The "utf-8" value IS the dominant value, but only slightly 
(318351 for "utf-8" versus 286967 for the next most-popular value of 
"iso-8859-1"). This agrees with Google's research. </p></li>
<li><p>The majority of documents (~66%) use the META element to declare the 
encoding. In this situation, the result is much different. The #1 and 
#2 values are "iso-8859-1" and "windows-1252", which combined are 
represented in 1754820 cases, which dominates over the third place 
"utf-8" at 249084 (a 7:1 ratio!)</p></li>
<li><p>Most documents that use XML also specify an encoding in the XML declaration. 
Even there, "iso-8859-1" dominates over "utf-8", 54572 instances to 27052 
(although "utf-8" is the default encoding for XML documents...)</p></li>
</ul>

<p>Stated encodings:
Most browsers will use a stated charset encoding from the HTTP header in 
preference to using auto-detection methods (scanning all or parts of the 
entire document to look for encoding hints), so if Google used some other 
method, they may be ignoring how a browser would actually treat the encoding 
of the document, and hence how it is actually (and accurately) displayed.</p>

<p>Please note that the value of "us-ascii" - the closest value to what they are 
claiming is on a big decline - was very rarely encountered...less than 1% of 
all cases where encodings are specified in any way. So...what does Google mean 
when they say that "ASCII" has such a high usage? Do they just mean the first 
128 code points shared in common between the ASCII, iso-8859-* and UTF-8 
encodings? For UTF-8, did they also use the Byte Order Mark to detect UTF 
usage? </p>

<p>Now, certainly the DMoz URL set has its own issues, namely skewing more toward 
western web pages, and skewing heavily toward top-level pages of a site (about 
3/4). These problems with DMoz are known...but are there any known issues with 
Google's URL set? We don't know anything about Google's URL set other than "it 
is big" and it "<em>probably</em> represents the universe of the Web-at-large" in 
some way.</p>

<p>[1] "utf-8" dominance here may actually be more impressive than it seems. Many 
    Web servers have a default encoding used by the HTTP header. The default 
    encoding for Apache 2.2 for example is not "utf-8", but "iso-8859-1".</p>

</div>
</div>
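<p>Editorial sketch, not part of the comment above: the "stated encodings" it lists could be read roughly as follows. This is a minimal illustration using only the Python standard library. The helper names <code>charset_from_header</code>, <code>MetaCharsetParser</code> and <code>meta_charset</code> are hypothetical, and the XML declaration is left out for brevity. It is not the method used in the DMoz study.</p>

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()


class MetaCharsetParser(HTMLParser):
    """Remember the first charset stated in a META http-equiv element."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        http_equiv = (attributes.get("http-equiv") or "").lower()
        content = attributes.get("content") or ""
        if http_equiv == "content-type" and content:
            self.charset = charset_from_header(content)


def meta_charset(html_text):
    """Return the charset stated in the markup, or None if there is none."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    return parser.charset
```

<p>Comparing the two results for the same response is one way to spot pages whose HTTP header and META element disagree.</p>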


<div class="comment" id="comment-139855">
<p class="comment-meta" id="c139855">
<span class="comment-meta-author"><strong>Erik van der Poel </strong></span>
<span class="comment-meta-date"><a href="#c139855">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>I used an encoding detector that looks at the entire HTML file (not just the "charset" label). We (Google) have samples of Web documents from 2001 onwards, so I ran the detector on those. The detector reports the "lowest" encoding. I.e. it would not report US-ASCII if there were any non-ASCII characters (bytes with value greater than 127). An NCR analysis might also be interesting, I agree. What would you like to see? Unicode scripts over time? Languages over time?</p>

</div>
</div>
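<p>Editorial sketch, not part of the comment above: the "lowest encoding" rule it describes can be illustrated in a few lines. This is an assumption-laden simplification, not Google's detector. The function name <code>lowest_encoding</code> and the catch-all label "other" are hypothetical, and a real detector would distinguish many more encodings.</p>

```python
def lowest_encoding(data):
    """Classify a byte string by the 'lowest' encoding that can hold it.

    Hypothetical three-way split: pure ASCII, valid UTF-8, anything else.
    """
    # US-ASCII is reported only if no byte has a value greater than 127.
    if data.isascii():
        return "us-ascii"
    # Otherwise, check whether the bytes form well-formed UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # A real detector would go on to tell legacy encodings apart here.
        return "other"
```

<p>Under this rule a page labelled utf-8 that contains only bytes 0-127 is still counted as US-ASCII, which is why labels and detected encodings can give different totals.</p>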


<div class="comment" id="comment-139919">
<p class="comment-meta" id="c139919">
<span class="comment-meta-author"><strong>Philip Taylor </strong></span>
<span class="comment-meta-date"><a href="#c139919">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>A while ago I collected very similar data to Brian Wilson, and did <a href="http://philip.html5.org/data/charsets.html" rel="nofollow">some analysis</a>. (I was primarily looking for common errors, and for how many bytes you need to check before finding the meta charset, rather than comparing the frequency of encodings.)</p>

<p>An interesting point is that of the pages declared as UTF-8 (in HTTP headers or in &lt;meta&gt;), 4% are not actually valid UTF-8 and are relying on browsers doing error correction. GB2312 is worse, with 16% of the pages I looked at containing invalid byte sequences - it seems many people label their pages as GB2312 when actually they're using GBK/GB18030.</p>

</div>
</div>


<div class="comment" id="comment-139948">
<p class="comment-meta" id="c139948">
<span class="comment-meta-author"><strong>Erik van der Poel </strong></span>
<span class="comment-meta-date"><a href="#c139948">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>Hello Brian, thank you for reporting your results. It's always nice to see the results from other samples, since I worry that Google's sample may be somewhat biased toward our mechanisms for choosing a subset of the Web. As you know, we compute a value called Page Rank, and this is used in our systems.</p>

<p>We do use the HTTP and HTML META charsets, but only as initial hints. The rest of the detection is based on the byte stream itself (after the HTTP response headers). Many encoding detectors use the frequencies of occurrences of certain byte sequences in a "base" set to build a model during training, and then compute the probability that a document is in a certain encoding based on the byte sequences in that document. Note that the HTTP and HTML charset labels are sometimes wrong or missing. One initial measurement of the fraction of documents that have an "incorrect" label was roughly 5%, I believe, but I will have to go back and confirm that some day. Of course, our own detector may be getting it wrong sometimes too, but when we actually look at the documents, we do find some incorrect charsets, and even some documents that mix UTF-8 with ISO-8859-1. Note also that browsers offer an encoding menu that the user can use to "correct" a garbled display (though novices may never use that feature).</p>

<p>In another study, I found that the HTTP charset was present in 11% of responses in 2001, and 43% in 2007. For the HTML charset, those numbers were 44% and 74%, respectively, while for XML encoding they were 0.39% and 2.7%, respectively.</p>

<p>Some documents come with charset labels declaring iso-8859-1, windows-1252 or even utf-8 when the byte values themselves are never greater than 127. Such documents are pure US-ASCII (if no ISO 2022 escape sequences are encountered). One might even argue that the charset labels are incorrect in such cases. I realize that this is debatable, but I don't think such debate is valuable.</p>

<p>Note that the first 128 values (0-127) are common to many charsets, not just ascii, iso-8859-* and utf-8. Windows-*, euc-*, shift_jis and big5 all come to mind. Major browsers treat the first 128 values as ascii even if the spec for the charset itself has a few non-ascii characters in that range, such as Yen sign instead of Backslash, and so on.</p>

<p>Yes, we do feed the various BOMs (utf-8, utf-16, etc) into our probability computation.</p>

</div>
</div>
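<p>Editorial sketch, not part of the comment above: a byte order mark can be read from the first bytes of a document. The function name <code>sniff_bom</code> is hypothetical. The comment describes BOMs as one input to a probability computation, whereas this sketch simply returns the matching encoding name as a single hint.</p>

```python
import codecs

# UTF-32 marks come first: the UTF-32 LE mark begins with the same two
# bytes as the UTF-16 LE mark, so the longer marks must be tested first.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

<p>A result of None only means that no mark is present; most utf-8 pages carry no BOM, so the rest of the byte stream still has to be examined.</p>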



  <div class="comments-open" id="comments-open">
<h3 class="comments-open-header">Leave a comment</h3>

<div class="comments-open-moderated">
   <p>
   Note: this blog is intended to foster <strong>polite
   on-topic discussions</strong>. Comments failing these
   requirements and spam will not get published. Please,
   enter your real name and email address. Every
   individual comment is reviewed by the W3C staff.
   This may take some time, thank you for your patience.
   </p>
   <p>
   You can use the following HTML markup (a href, b, i, 
   br/, p, strong, em, ul, ol, li, blockquote, pre) 
   and/or <a href="http://daringfireball.net/projects/markdown/syntax">Markdown syntax</a>.</p>
</div>

<div id="comments-open-data">
<form method="post" action="http://www.w3.org/QA/sununga/beach.pl" id="comments-form">
<h4>Your comment</h4>
<div id="comments-open-text">
  <textarea id="comment-text" name="text" rows="20" cols="100"></textarea><br />
<label for="comment-text">Write your comment text here. Remember, keep the discussion on topic and courteous.</label>
</div>

<h4>About you</h4>
<div id="comment-form-name">
  <input type="hidden" name="static" value="1" />
<input type="hidden" name="entry_id" value="176" />
<input type="hidden" name="__lang" value="en" /> 
<label for="comment-author">Your Name</label>
<input id="comment-author" name="author" size="30" value="" />
</div>
<div id="comment-form-email">
<label for="comment-email">Your Email Address</label>
<input id="comment-email" name="email" size="30" value="" />
</div>

<div id="comments-open-footer">
<input type="submit" accesskey="s" name="post" id="comment-submit" value="Submit" />

</div>
</form>
</div>
</div>



<p id="gentime">This page was last generated on $Date: 2011/12/16 03:02:51 $</p> 

      </div><!-- End of "main" DIV. -->

<address>

This blog is written by W3C staff and working group participants,<br />
&nbsp;and maintained by <a href="/People/CMercier/">Coralie Mercier</a>.<br />
Authorized parties may <a href="/QA/new">log in</a> to create a new entry.<br/>
<span id="poweredby">Powered by Movable Type, magpierss and a lot of Web Technology</span>
    </address>


    
    <p class="copyright">
      <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> &copy; 1994-2011
      <a href="http://www.w3.org/"><acronym title="World Wide Web Consortium">W3C</acronym></a>&reg;
      (<a href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>,
      <a href="http://www.ercim.eu/"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
      <a href="http://www.keio.ac.jp/">Keio</a>),
      All Rights Reserved.
      W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
      <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
      <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
      and <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a>
      rules apply. Your interactions with this site are in accordance
      with our <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Public">public</a> and
      <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Members">Member</a> privacy
      statements.
    </p>

  </body>
</html>