<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css" media="all">
@import "/QA/2006/01/blogstyle.css";
</style>
<meta name="keywords" content='html, html5, i18n, implementation, Internationalization, unicode, validator' />
<meta name="description" content="utf-8 is taking over traditional encodings on the Web." />
<meta name="revision" content="$Id: utf8-web-growth.html,v 1.43 2011/12/16 03:02:51 gerald Exp $" />
<link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.w3.org/QA/atom.xml" />
<link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="http://www.w3.org/QA/news.rss" />
<title>utf-8 Growth On The Web - W3C Blog</title>
<link rel="start" href="http://www.w3.org/QA/" title="Home" />
<link rel="prev" href="http://www.w3.org/QA/2008/05/canvas-text-and-cjk.html" title="Vertical Layouts for Canvas Text (CJK)" />
<link rel="next" href="http://www.w3.org/QA/2008/05/syntax_for_aria_costbenefit_an.html" title="Syntax for ARIA: Cost-benefit analysis" />
<!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
rdf:about="http://www.w3.org/QA/2008/05/utf8-web-growth.html"
trackback:ping="http://www.w3.org/QA/sununga/mt-tb.cgi/166"
dc:title="utf-8 Growth On The Web"
dc:identifier="http://www.w3.org/QA/2008/05/utf8-web-growth.html"
dc:subject="HTML"
dc:description="utf-8 is taking over traditional encodings on the Web."
dc:creator="Karl Dubost"
dc:date="2008-05-06T23:51:49+00:00" />
</rdf:RDF>
-->
<!-- <script type="text/javascript" src="http://www.w3.org/QA/mt.js"></script>-->
</head>
<body class="layout-one-column">
<div id="banner">
<h1 id="title">
<a href="http://www.w3.org/"><img height="48" alt="W3C" id="logo" src="http://www.w3.org/Icons/WWW/w3c_home_nb" /></a>
W3C Blog
</h1>
</div>
<ul class="navbar" id="menu">
<li><strong><a href="/QA/" title="W3C Blog Home">[ W3C Blog ]</a></strong></li>
<li><a href="/QA/Library/" title="Documents and Publications on Web and Quality">Documents</a></li>
<li><a href="/QA/Tools/" accesskey="3" title="Validators and other Tools">Tools</a></li>
<li><a href="/2007/12/qa-blog-help/index#feedback">Feedback</a></li>
</ul>
<div id="searchbox">
<form method="get" action="http://www.google.com/custom" enctype="application/x-www-form-urlencoded">
<p id="formbox"><input type="text" size="15" class="textfield" name="q" accesskey="E" maxlength="255" /> <input type="submit" class="submitfield" value="Search" id="goButton" name="sa" accesskey="G" /> <input type="hidden" name="cof" value="T:black;LW:72;ALC:#ff3300;L:http://www.w3.org/Icons/w3c_home;LC:#000099;LH:48;BGC:white;AH:left;VLC:#660066;GL:0;AWFID:0b9847e42caf283e;" /><input type="hidden" id="searchW3C" name="sitesearch" checked="checked" value="www.w3.org/QA" /><input type="hidden" name="domains" value="www.w3.org/QA" /></p>
</form>
</div>
<div id="main"><!-- This DIV encapsulates everything in this page - necessary for the positioning -->
<p class="content-nav">
<a href="http://www.w3.org/QA/2008/05/canvas-text-and-cjk.html">« Vertical Layouts for Canvas Text (CJK)</a> |
<a href="http://www.w3.org/QA/">Main</a>
| <a href="http://www.w3.org/QA/2008/05/syntax_for_aria_costbenefit_an.html">Syntax for ARIA: Cost-benefit analysis »</a>
</p>
<h2 class="entry-header">utf-8 Growth On The Web</h2>
<div class="entry-body">
<p>On Google's blog, Mark Davis explains that Google is <a href="http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html">moving to Unicode 5.1</a>. Unfortunately, the article conflates Unicode with utf-8, as David Goodger points out in <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=230157">Unicode misinformation</a>. The really interesting part, though, is the growth of utf-8 on the Web. These data should be of interest for the development of HTTP, HTML 5 and validators.</p>
<p><img src="/QA/2008/05/utf8-growth-google" width="432" height="458" alt="utf-8 growth on the Web compared to other encodings"/></p>
<p>© graph from Google.</p>
</div>
<div id="more" class="entry-more">
</div>
<p class="postinfo">Filed by <a href="http://www.w3.org/People/karl/">Karl Dubost</a> on May 6, 2008 11:51 PM in <a href="http://www.w3.org/QA/archive/technology/html/">HTML</a>, <a href="http://www.w3.org/QA/archive/technology/http/">HTTP</a>, <a href="http://www.w3.org/QA/archive/web_spotting/opinions_editorial/">Opinions &amp; Editorial</a>, <a href="http://www.w3.org/QA/archive/w3cqa_news/tools/">Tools</a><br />
<span class="separator">|</span> <a class="permalink" href="http://www.w3.org/QA/2008/05/utf8-web-growth.html">Permalink</a>
| <a href="http://www.w3.org/QA/2008/05/utf8-web-growth.html#comments">Comments (8)</a>
| <a href="http://www.w3.org/QA/2008/05/utf8-web-growth.html#trackback">TrackBacks (0)</a>
</p>
<h3 class="comments-header" id="comments">Comments</h3>
<div class="comment" id="comment-139548">
<p class="comment-meta" id="c139548">
<span class="comment-meta-author"><strong>Mark Nottingham </strong></span>
<span class="comment-meta-date"><a href="#c139548">#</a> 2008-05-07</span>
</p>
<div class="comment-bulk">
<p>I wonder how they determined the encoding of pages for purposes of the graph.</p>
</div>
</div>
<div class="comment" id="comment-139574">
<p class="comment-meta" id="c139574">
<span class="comment-meta-author"><strong>Fwolf </strong></span>
<span class="comment-meta-date"><a href="#c139574">#</a> 2008-05-07</span>
</p>
<div class="comment-bulk">
<p>The rise of utf-8 mirrors the decline of US-only ASCII.
Another sad point is that the change for Chinese (gb2312) is not obvious.</p>
<p>PS: Glad to see another comment support <a href="http://daringfireball.net/projects/markdown/syntax" rel="nofollow">Markdown Syntax</a> (<a href="http://michelf.com/projects/php-markdown/extra/" rel="nofollow">Extra</a>)!</p>
</div>
</div>
<div class="comment" id="comment-139790">
<p class="comment-meta" id="c139790">
<span class="comment-meta-author"><strong>Frank </strong></span>
<span class="comment-meta-date"><a href="#c139790">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>If you added <b>US-ASCII</b> and <b>UTF-8</b> together, the picture would be almost stable over the last seven years: a valid US-ASCII page (with NCRs) is also a valid UTF-8 page. Most likely people updated their defaults and dare to use more <i>non-ASCII-UTF-8</i> than in 2001; to be sure, you would have to look into the pages. <a href="http://en.wikipedia.org/wiki/Lies%2C_damned_lies%2C_and_statistics" rel="nofollow">Lies, damned lies, and statistics</a>...</p>
</div>
</div>
<div class="comment" id="comment-139793">
<p class="comment-meta" id="c139793">
<span class="comment-meta-author"><strong>Karl Dubost <a class="commenter-profile" href="http://www.w3.org/People/karl/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c139793">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>@Mark </p>
<p>Me neither, but it would be interesting to know more. The data were compiled by Erik van der Poel. Maybe he could chime in and explain a bit more. I have sent him a <a href="http://lists.w3.org/Archives/Public/www-archive/2008May/0014" rel="nofollow">pointer to this thread</a>.</p>
</div>
</div>
<div class="comment" id="comment-139843">
<p class="comment-meta" id="c139843">
<span class="comment-meta-author"><strong>Brian Wilson </strong></span>
<span class="comment-meta-date"><a href="#c139843">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>[I drafted this up earlier and no one had responded yet...now I'm a little late to the game]</p>
<p>This is an interesting look at encoding, but many questions remain, namely
their methodology and their URL set.</p>
<p>There are a number of ways to specify the encoding for a document, including
just doing some blind scanning of the raw document. The Google blog post makes
zero mention of how they detected the encoding. Based on the graph and the
wording, it makes me think that they did not look at any of the stated
encodings from the document (the "charset" parameter of the Content-Type HTTP
header, the same value via the META markup element and the "encoding" attribute
of the XML declaration).</p>
<p>I've also been doing some studies of stated encodings recently using mainly the
Open Directory Project (DMoz) as the URL set. [Note: unlike the Google research,
the results I have found are currently a snapshot only and do not represent any
trends over time.] Results from that study will be released soon, but, I've
noticed a few things about character encoding after analyzing about 3.5 million
URLs so far. Here are a few highlights:</p>
<ul>
<li><p>A minority of documents (~20%) use the HTTP header to declare the
encoding[1]. The "utf-8" value IS the dominant value, but only slightly
(318351 for "utf-8" versus 286967 for the next most popular value,
"iso-8859-1"). This agrees with Google's research. </p></li>
<li><p>The majority of documents (~66%) use the META element to declare the
encoding. In this situation, the result is much different. The #1 and #2
values are "iso-8859-1" and "windows-1252", which combined are
represented in 1754820 cases, which dominates over the third place
"utf-8" at 249084 (a 7:1 ratio!)</p></li>
<li><p>Most documents that use XML also specify an encoding in the XML declaration.
Even there, "iso-8859-1" dominates over "utf-8", 54572 instances to 27052
(although "utf-8" is the default encoding for XML documents...)</p></li>
</ul>
<p>Stated encodings:
Most browsers will use a stated charset encoding from the HTTP header in
preference to using auto-detection methods (scanning all or parts of the
entire document to look for encoding hints), so if Google used some other
method, they may be ignoring how a browser would actually treat the encoding
of the document, and hence how it is actually (and accurately) displayed.</p>
<p>Please note that the value of "us-ascii" - the closest value to what they are
claiming is on a big decline - was very rarely encountered...less than 1% of
all cases where encodings are specified in any way. So...what does Google mean
when they say that "ASCII" has such a high usage? Do they just mean the first
128 code points shared in common between the ASCII, iso-8859-* and UTF-8
encodings? For UTF-8, did they also use the Byte Order Mark to detect UTF
usage? </p>
<p>Now, certainly the DMoz URL set has its own issues, namely skewing more toward
western web pages, and skewing heavily toward top-level pages of a site (about
3/4). These problems with DMoz are known...but are there any known issues with
Google's URL set? We don't know anything about Google's URL set other than "it
is big" and it "<em>probably</em> represents the universe of the Web-at-large" in
some way.</p>
<p>[1] "utf-8" dominance here may actually be more impressive than it seems. Many
Web servers have a default encoding used by the HTTP header. The default
encoding for Apache 2.2 for example is not "utf-8", but "iso-8859-1".</p>
</div>
</div>
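<div class="comment-bulk">
<p>A minimal sketch of reading the first of the stated encodings Brian Wilson lists above, the "charset" parameter of the Content-Type HTTP header. It leans on Python's standard email.message module to parse the header value. The function name http_charset is hypothetical and is not taken from his study, and the META element and the XML declaration are not handled here.</p>
<pre>
```python
from email.message import Message


def http_charset(content_type):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(http_charset("text/html; charset=UTF-8"))  # utf-8
print(http_charset("text/html"))                 # None
```
</pre>
<p>Lowercasing the value matters when counting, since "UTF-8" and "utf-8" would otherwise be tallied as two different labels.</p>
</div>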
<div class="comment" id="comment-139855">
<p class="comment-meta" id="c139855">
<span class="comment-meta-author"><strong>Erik van der Poel </strong></span>
<span class="comment-meta-date"><a href="#c139855">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>I used an encoding detector that looks at the entire HTML file (not just the "charset" label). We (Google) have samples of Web documents from 2001 onwards, so I ran the detector on those. The detector reports the "lowest" encoding. I.e. it would not report US-ASCII if there were any non-ASCII characters (bytes with value greater than 127). An NCR analysis might also be interesting, I agree. What would you like to see? Unicode scripts over time? Languages over time?</p>
</div>
</div>
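<div class="comment-bulk">
<p>The "lowest encoding" rule Erik van der Poel describes above can be illustrated with a rough sketch. This is not Google's detector, which is not public in this thread. It only assumes that pure 7-bit content is reported as US-ASCII and that anything else is checked for UTF-8 validity. The function name and the "unknown-8bit" fallback label are hypothetical.</p>
<pre>
```python
def lowest_encoding(data, fallback="unknown-8bit"):
    """Report the narrowest label that fits the bytes in `data`.

    `fallback` is a hypothetical placeholder for everything a real
    detector would still have to tell apart (iso-8859-*, gb2312, ...).
    """
    if data.isascii():
        # No byte above 127, so the page is plain US-ASCII.
        return "us-ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


print(lowest_encoding(b"plain text"))                # us-ascii
print(lowest_encoding("caf\u00e9".encode("utf-8")))  # utf-8
print(lowest_encoding(b"caf\xe9"))                   # unknown-8bit
```
</pre>
<p>Under this rule a page that writes every non-ASCII character as an NCR such as &amp;#233; still counts as us-ascii, which is why an NCR analysis would change the picture.</p>
</div>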
<div class="comment" id="comment-139919">
<p class="comment-meta" id="c139919">
<span class="comment-meta-author"><strong>Philip Taylor </strong></span>
<span class="comment-meta-date"><a href="#c139919">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>A while ago I collected very similar data to Brian Wilson, and did <a href="http://philip.html5.org/data/charsets.html" rel="nofollow">some analysis</a>. (I was primarily looking for common errors, and for how many bytes you need to check before finding the meta charset, rather than comparing the frequency of encodings.)</p>
<p>An interesting point is that of the pages declared as UTF-8 (in HTTP headers or in &lt;meta&gt;), 4% are not actually valid UTF-8 and are relying on browsers doing error correction. GB2312 is worse, with 16% of the pages I looked at containing invalid byte sequences - it seems many people label their pages as GB2312 when actually they're using GBK/GB18030.</p>
</div>
</div>
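<div class="comment-bulk">
<p>A hedged sketch of the kind of check behind the figures Philip Taylor reports above: try a strict decode with the declared label and see whether it fails. It assumes Python's built-in codecs as a stand-in for the encoding definitions, which may differ in detail from the ones he used. Both function names are hypothetical.</p>
<pre>
```python
def decodes_as(data, label):
    """Return True if `data` is a valid byte sequence in the encoding `label`."""
    try:
        data.decode(label)
    except UnicodeDecodeError:
        return False
    return True


def mislabelled_gb2312(data):
    """True when bytes fail as gb2312 but decode cleanly as gb18030."""
    return not decodes_as(data, "gb2312") and decodes_as(data, "gb18030")
```
</pre>
<p>A page labelled GB2312 that contains a character available only in GBK would be flagged by mislabelled_gb2312, while a page that is simply corrupt would fail both decodes and not be flagged.</p>
</div>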
<div class="comment" id="comment-139948">
<p class="comment-meta" id="c139948">
<span class="comment-meta-author"><strong>Erik van der Poel </strong></span>
<span class="comment-meta-date"><a href="#c139948">#</a> 2008-05-08</span>
</p>
<div class="comment-bulk">
<p>Hello Brian, thank you for reporting your results. It's always nice to see the results from other samples, since I worry that Google's sample may be somewhat biased by our mechanisms for choosing a subset of the Web. As you know, we compute a value called PageRank, and this is used in our systems.</p>
<p>We do use the HTTP and HTML META charsets, but only as initial hints. The rest of the detection is based on the byte stream itself (after the HTTP response headers). Many encoding detectors use the frequencies of occurrence of certain byte sequences in a "base" set to build a model during training, and then compute the probability that a document is in a certain encoding based on the byte sequences in that document. Note that the HTTP and HTML charset labels are sometimes wrong or missing. One initial measurement of the fraction of documents that have an "incorrect" label was roughly 5%, I believe, but I will have to go back and confirm that some day. Of course, our own detector may be getting it wrong sometimes too, but when we actually look at the documents, we do find some incorrect charsets, and even some documents that mix UTF-8 with ISO-8859-1. Note also that browsers offer an encoding menu that the user can use to "correct" a garbled display (though novices may never use that feature).</p>
<p>In another study, I found that the HTTP charset was present in 11% of responses in 2001, and 43% in 2007. For the HTML charset, those numbers were 44% and 74%, respectively, while for XML encoding they were 0.39% and 2.7%, respectively.</p>
<p>Some documents come with charset labels declaring iso-8859-1, windows-1252 or even utf-8 when the byte values themselves are never greater than 127. Such documents are pure US-ASCII (if no ISO 2022 escape sequences are encountered). One might even argue that the charset labels are incorrect in such cases. I realize that this is debatable, but I don't think such debate is valuable.</p>
<p>Note that the first 128 values (0-127) are common to many charsets, not just ascii, iso-8859-* and utf-8. Windows-*, euc-*, shift_jis and big5 all come to mind. Major browsers treat the first 128 values as ascii even if the spec for the charset itself has a few non-ascii characters in that range, such as Yen sign instead of Backslash, and so on.</p>
<p>Yes, we do feed the various BOMs (utf-8, utf-16, etc) into our probability computation.</p>
</div>
</div>
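<div class="comment-bulk">
<p>For the BOMs mentioned above, a small sketch of how a leading byte order mark can be recognised with the constants in Python's standard codecs module. Note the difference from what Erik van der Poel describes: here the BOM is treated as decisive, whereas his detector only feeds it into a probability computation. The names BOMS and sniff_bom are hypothetical.</p>
<pre>
```python
import codecs

# Longest marks first: the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_bom(data):
    """Return the encoding named by a leading byte order mark, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```
</pre>
<p>A None result says nothing about the encoding, since most pages carry no BOM at all, so the byte stream still has to be examined.</p>
</div>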
<div class="comments-open" id="comments-open">
<h3 class="comments-open-header">Leave a comment</h3>
<div class="comments-open-moderated">
<p>
Note: this blog is intended to foster <strong>polite
on-topic discussions</strong>. Comments failing these
requirements and spam will not get published. Please,
enter your real name and email address. Every
individual comment is reviewed by the W3C staff.
This may take some time, thank you for your patience.
</p>
<p>
You can use the following HTML markup (a href, b, i,
br/, p, strong, em, ul, ol, li, blockquote, pre)
and/or <a href="http://daringfireball.net/projects/markdown/syntax">Markdown syntax</a>.</p>
</div>
<div id="comments-open-data">
<form method="post" action="http://www.w3.org/QA/sununga/beach.pl" id="comments-form">
<h4>Your comment</h4>
<div id="comments-open-text">
<textarea id="comment-text" name="text" rows="20" cols="100"></textarea><br />
<label for="comment-text">Write your comment text here. Remember, keep the discussion on topic and courteous.</label>
</div>
<h4>About you</h4>
<div id="comment-form-name">
<input type="hidden" name="static" value="1" />
<input type="hidden" name="entry_id" value="176" />
<input type="hidden" name="__lang" value="en" />
<label for="comment-author">Your Name</label>
<input id="comment-author" name="author" size="30" value="" />
</div>
<div id="comment-form-email">
<label for="comment-email">Your Email Address</label>
<input id="comment-email" name="email" size="30" value="" />
</div>
<div id="comments-open-footer">
<input type="submit" accesskey="s" name="post" id="comment-submit" value="Submit" />
</div>
</form>
</div>
</div>
<p id="gentime">This page was last generated on $Date: 2011/12/16 03:02:51 $</p>
</div><!-- End of "main" DIV. -->
<address>
This blog is written by W3C staff and working group participants,<br />
and maintained by <a href="/People/CMercier/">Coralie Mercier</a>.<br />
Authorized parties may <a href="/QA/new">log in</a> to create a new entry.<br/>
<span id="poweredby">Powered by Movable Type, magpierss and a lot of Web Technology</span>
</address>
<p class="copyright">
<a rel="Copyright" href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> © 1994-2011
<a href="http://www.w3.org/"><acronym title="World Wide Web Consortium">W3C</acronym></a>®
(<a href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>,
<a href="http://www.ercim.eu/"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>),
All Rights Reserved.
W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
<a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
and <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a>
rules apply. Your interactions with this site are in accordance
with our <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Public">public</a> and
<a href="http://www.w3.org/Consortium/Legal/privacy-statement#Members">Member</a> privacy
statements.
</p>
</body>
</html>