<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    <style type="text/css" media="all">
    @import "/QA/2006/01/blogstyle.css";
    </style>
    <meta name="keywords" content='' />
    <meta name="description" content="In this first issue in the cookbook for the web series, we look at character encoding, or &quot;charset&quot;s. Discussing the ingredients, giving a reliable recipe for the detection of character encodings in (x)html, and a quick tip for web authors on an html diet." />
    <meta name="revision" content="$Id: html-charset.html,v 1.63 2011/12/16 03:02:44 gerald Exp $" />
   <link rel="alternate" type="application/atom+xml" title="Atom" href="http://www.w3.org/QA/atom.xml" />
   <link rel="alternate" type="application/rss+xml" title="RSS 1.0" href="http://www.w3.org/QA/news.rss" />
   <title>Character encoding in HTML - W3C Blog</title>

   <link rel="start" href="http://www.w3.org/QA/" title="Home" />
   <link rel="prev" href="http://www.w3.org/QA/2008/03/web-templating-language.html" title="Templating Language for Authoring Tools" />
   <link rel="next" href="http://www.w3.org/QA/2008/03/browser_wars_html_test_jam_and.html" title="Browser wars, HTML test jam, and CSS awards at SXSW Interactive in Austin" />

   <!--
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
<rdf:Description
    rdf:about="http://www.w3.org/QA/2008/03/html-charset.html"
    trackback:ping="http://www.w3.org/QA/sununga/mt-tb.cgi/143"
    dc:title="Character encoding in HTML"
    dc:identifier="http://www.w3.org/QA/2008/03/html-charset.html"
    dc:subject="HTML"
    dc:description="In this first issue in the &lt;em&gt;cookbook for the web&lt;/em&gt; series, we look at character encoding, or &quot;charset&quot;s. Discussing the ingredients, giving a reliable recipe for the detection of character encodings in (x)html, and a quick tip for web authors on an html diet."
    dc:creator="olivier Théreaux"
    dc:date="2008-03-10T16:11:10+00:00" />
</rdf:RDF>
-->

    <!-- <script type="text/javascript" src="http://www.w3.org/QA/mt.js"></script>-->

</head>
<body class="layout-one-column">
      <div id="banner">
      <h1 id="title">
	<a href="http://www.w3.org/"><img height="48" alt="W3C" id="logo" src="http://www.w3.org/Icons/WWW/w3c_home_nb" /></a>
W3C Blog
</h1>
    </div>

    <ul class="navbar" id="menu">
        <li><strong><a href="/QA/" title="W3C Blog Home">[ W3C Blog ]</a></strong></li>
        <li><a href="/QA/Library/" title="Documents and Publications on Web and Quality">Documents</a></li>
        <li><a href="/QA/Tools/" accesskey="3" title="Validators and other Tools">Tools</a></li>
        <li><a href="/2007/12/qa-blog-help/index#feedback">Feedback</a></li>
    </ul>
<div id="searchbox">
<form method="get" action="http://www.google.com/custom" enctype="application/x-www-form-urlencoded">
<p id="formbox"><input type="text" size="15" class="textfield" name="q" accesskey="E" maxlength="255" /> <input type="submit" class="submitfield" value="Search" id="goButton" name="sa" accesskey="G" /> <input type="hidden" name="cof" value="T:black;LW:72;ALC:#ff3300;L:http://www.w3.org/Icons/w3c_home;LC:#000099;LH:48;BGC:white;AH:left;VLC:#660066;GL:0;AWFID:0b9847e42caf283e;" /><input type="hidden" id="searchW3C" name="sitesearch" checked="checked" value="www.w3.org/QA" /><input type="hidden" name="domains" value="www.w3.org/QA" /></p>
</form>
</div>


    <div id="main"><!-- This DIV encapsulates everything in this page - necessary for the positioning -->

                     <p class="content-nav">
                        <a href="http://www.w3.org/QA/2008/03/web-templating-language.html">&laquo; Templating Language for Authoring Tools</a> |
                        <a href="http://www.w3.org/QA/">Main</a>
                        | <a href="http://www.w3.org/QA/2008/03/browser_wars_html_test_jam_and.html">Browser wars, HTML test jam, and CSS awards at SXSW Interactive in Austin &raquo;</a>
                     </p>

                        <h2 class="entry-header">Character encoding in HTML</h2>
                           <div class="entry-body">
                              <p>In the beginning the Web had ASCII. And that was good. But then, not really. The Europeans and their strange accents were a bit of a problem. </p>

<p>So then the Web had iso-latin1. And HTML could be assumed to be using that, by default (<a href="http://www.ietf.org/rfc/rfc2854.txt">RFC2854, section 4</a>). And that was good. But then, not really. There was a whole world out there, with a lot of writing systems, tons of different characters. Many different character encodings...</p>

<p>Today we have Unicode, at long last well adopted in most modern computing systems, and a basic <a href="http://www.w3.org/TR/xml/#charsets">building block</a> of a lot of web technologies. And although there are still a lot of different character encodings available for documents on the web, this is not an issue, as there are mechanisms, both in HTTP and within HTML for instance, to declare the encoding used, and help tools determine how to decode the content.</p>

<p>All is not always rosy, however. The first issue is that there are quite a lot of mechanisms to declare encoding, and that they don't necessarily agree. The second issue is that not everyone can configure a Web server to declare encoding of HTML documents at the HTTP level.</p>

<h3>Many sources, One encoding</h3>

<h4>If the box says "dangerous, do not open", don't peek inside the box...</h4>

<p>A long (web) time ago, there was a very serious discussion to try and determine whether a Web resource was supposed to know its encoding best, or whether the Web server should be the authoritative source.</p>

<p>In the "resource" camp, some were pushing the rather logical argument that a specific document surely knew more about its own metadata than a misconfigured Web server. Who cares if the server thinks that all HTML documents it serves are <code>iso-8859-1</code>, when I, as document author, know full well that I am authoring this particular resource as <code>utf-8</code>?</p>

<p>The other camp had two killer arguments. </p>
<ol>
<li><p>The first, and perhaps the simplest, argument was: what's the point of having user agents sniff garbage in the hope of finding content, and perhaps a character encoding declaration, when the transport protocol has a way of declaring it? This is the basis for the <a href="http://www.w3.org/2001/tag/doc/mime-respect">authoritative metadata</a> principle. This principle is also sometimes summarized as: If I want to show an HTML document as plain text source, rather than have it interpreted by browsers, I should be able to do so. I should be able to serve any document as <code>text/plain</code> if that is my choice. </p></li>

<li><p>The second killer argument was <em>transcoding</em>. A lot of proxies, they said, transform the content they proxy, sometimes from one character encoding to another. So even though a document might say "I am encoded in <code>iso-2022-jp</code>", the proxy should be able to say "actually, trust me, the content I am delivering to you is in <code>utf-8</code>".</p></li>
</ol>

<p>In the end, the apparent consensus was that the "server knows best" camp had the sound architectural arguments behind them, and so, for everything on the web served via the HTTP protocol, HTTP has precedence over any other method in determining the encoding (and content type, etc.) of resources.</p>
<p>This means that regardless of what is in an (x)html document, if the server says "this is a <code>text/html</code> document encoded as <code>utf-8</code>", user agents should follow that information. Second guessing is likely to cause more harm than good.</p>
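<p>To make the "HTTP first" rule concrete, here is a minimal sketch in Python of how a tool might read the charset out of a <code>Content-Type</code> header value. The function name <code>charset_from_content_type</code> is made up for this illustration; the parsing itself is delegated to the standard library's <code>email.message.Message</code>, which understands MIME-style parameters.</p>

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper, written for this illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


# The server said utf-8, so utf-8 it is, whatever the document claims.
print(charset_from_content_type("text/html; charset=UTF-8"))
# No charset at the HTTP level: time to look elsewhere.
print(charset_from_content_type("text/html"))
```

<p>A <code>None</code> result is the signal that the transport protocol had nothing to say, which is exactly the case the next section deals with.</p>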

<h4>Unlabeled boxes can be full of treasures, or full of trouble</h4>

<p>But what if there is no character encoding declared at the HTTP level? This is where it gets tricky.</p>
<p>"Old school" HTML introduced a <a href="http://www.w3.org/TR/html401/struct/global.html#h-7.4.4" title="The global structure of an HTML document - the META element">specific <code>meta</code> tag</a> for the declaration of the encoding within the document:</p>
<pre>
    &lt;META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5"&gt;
</pre>
<p>Over the years, we have seen that this method was plagued by two serious issues:</p>
<ol>
    <li><p>Its syntax.</p><p>Nobody seems to get it right (it is just... too complicated!) and the Web is littered with approximate, sometimes comical, variants of this syntax. This is no laughing matter for user agents, however, which can't even expect to find this encoding declaration properly marked up!</p></li>
    <li><p>The <code>meta</code> element has to be within the <code>head</code> of a document, but there is no guarantee that it will be anywhere near the top of the document. The <code>head</code> of a document can have lots of other metadata, title, description, scripts and stylesheets, before declaring the encoding. This means a lot of sniffing and pseudo-parsing of undecoded garbage. In some cases, it can have dreadful consequences, such as security flaws in the approximate sniffing code.</p></li>
</ol>
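<p>As an illustration of what this sniffing looks like in practice, here is a rough Python sketch. It is not how any particular browser does it: the tolerant regular expression, the <code>sniff_meta_charset</code> name and the 1024-byte cut-off are all assumptions made for this example. The cut-off is there so the sketch does not wade through an entire document of undecoded bytes.</p>

```python
import re

# Deliberately tolerant pattern (an assumption of this sketch): it accepts
# any "charset=" token found inside a <meta ...> tag, quoted or not.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9._:-]+)",
    re.IGNORECASE,
)


def sniff_meta_charset(raw, limit=1024):
    """Look for a charset declaration in the first `limit` bytes of `raw`.

    Returns the lower-cased charset name, or None if nothing was found.
    """
    match = META_CHARSET.search(raw[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<html><head><META http-equiv="Content-Type" content="text/html; charset=ISO-8859-5"></head>'
print(sniff_meta_charset(page))
```

<p>Note that a declaration buried after a few kilobytes of scripts and stylesheets is simply missed by this sketch, which is the second issue above in a nutshell.</p>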

<p>It is worth noting that current work on <a href="http://www.w3.org/TR/html5/#meta" title="HTML 5 - the meta element">html5</a> tries to work around these issues by providing a simpler alternate syntax, and by making sure that the declaration of encoding is present at the very beginning of the <code>head</code>.</p>

<p>XML, on the other hand, has a way to declare encoding at the document level, in the XML declaration. The good thing about it is that this declaration MUST be at the very beginning of the document, which alleviates the pain of having to sniff the content.</p>
<pre>
    &lt;?xml version="1.0" encoding="UTF-8"?&gt;
</pre>
<p>The XML specification also defines, in its <a href="http://www.w3.org/TR/xml/#sec-guessing" title="Extensible Markup Language (XML) 1.0 (Fourth Edition) - appendix F">Appendix F</a>, a recommended algorithm for the encoding detection.</p>
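<p>Because the declaration sits at byte zero, reading it takes very little code. The Python sketch below is a simplified take, not a full implementation of Appendix F: it checks for a byte order mark, then reads the <code>encoding</code> pseudo-attribute, and it assumes the declaration itself is written in an ASCII-compatible encoding (so a UTF-16 document without a BOM is not handled). The helper names are invented for this example.</p>

```python
import codecs
import re

# Byte order marks, with the UTF-32 ones first so that UTF-32 LE is not
# mistaken for UTF-16 LE (both start with the bytes FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Anchored at the very first byte: nothing may precede the declaration.
XML_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def encoding_from_bom(raw):
    """Return the encoding signalled by a leading BOM, or None."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    return None


def encoding_from_xml_declaration(raw):
    """Return the declared encoding (lower-cased), or None if absent."""
    match = XML_DECL.match(raw)
    return match.group(1).decode("ascii").lower() if match else None


print(encoding_from_xml_declaration(b'<?xml version="1.0" encoding="UTF-8"?><root/>'))
```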

<h3>The Recipe</h3>

<p>Given all these potential sources for the declaration (or automatic detection) of the document character encoding, all potentially contradicting the others, what should be the recipe to reliably figure out which encoding to use?</p>

<ol>
    <li><p>The charset info in the HTTP <code>Content-Type</code> header should have precedence. Always.</p></li>
    <li><p>Next in line is the charset information in the XML declaration. Which may be there, or may not.</p>
        <p>For XHTML documents, and in particular for the XHTML documents served as <code>text/html</code>,
        it is <a href="http://www.w3.org/TR/xhtml1/#C_1" title="XHTML 1.0: The Extensible HyperText Markup Language (Second Edition)">recommended to avoid using an XML declaration</a>.</p>
        <p>But let's remember: XHTML is XML, and XML
        <a href="http://www.w3.org/TR/xml/#sec-guessing-no-ext-info">requires an XML declaration</a> or some other method of declaration for XML documents using encodings other than UTF-8 or UTF-16
        (or ascii, which is a convenient subset...).</p>
<p>As a result, anything served as <code>application/xhtml+xml</code> (or as <code>text/html</code> and looking a lot like XHTML),
        with an encoding declared neither at the HTTP level nor in an XML declaration, is quite likely to be UTF-8 or UTF-16.</p>
        <p>Then there is the <a href="http://www.unicode.org/unicode/faq/utf_bom.html"><acronym title="Byte Order Mark">BOM</acronym>, a signature for Unicode character encodings.</a></p>
    </li>
    <li><p>Then comes the search for the <code>meta</code> information that might, just might, provide a character encoding declaration.</p></li>
    <li><p>Beyond that point, it's the land of defaults and heuristics. You may choose to default to
        <code>iso-8859-1</code> for <code>text/html</code> resources, <code>utf-8</code>
        for <code>application/xhtml+xml</code>.</p>
        <p>The rest is heuristics. You could venture towards fallback encodings such as <code>windows-1252</code>,
            which many consider a safe bet, but a bet nonetheless.</p></li>
    <li><p>There are quite a few algorithms to determine the likelihood of one specific encoding based on matching at byte level. Martin Dürst wrote <a href="http://www.w3.org/International/questions/qa-forms-utf-8">a regexp to check whether a document will "fit" as utf-8</a>. If you know other reliable algorithms, feel free to mention them in the comments, and I will list them here.</p></li>
</ol>
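<p>Put together, the steps above might look like the following Python sketch. Treat it as one possible reading of the recipe rather than a reference implementation: the <code>detect_encoding</code> name, the 1024-byte window for the <code>meta</code> search and the returned "source" labels are assumptions of this example. For <code>text/html</code> it also tries the "does it fit as UTF-8?" check (here done by simply attempting to decode, not with the regexp linked above) before falling back to <code>iso-8859-1</code>, which is a choice of this sketch and not something the recipe mandates.</p>

```python
import codecs
import re
from email.message import Message

XML_DECL = re.compile(rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([\w.:-]+)[\"']")
META = re.compile(rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([\w.:-]+)", re.I)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_encoding(content_type, body):
    """Return (encoding, source), trying each step of the recipe in order."""
    msg = Message()
    msg["Content-Type"] = content_type or "application/octet-stream"

    # 1. The charset in the HTTP Content-Type header has precedence.
    charset = msg.get_content_charset()
    if charset:
        return charset, "http"

    # 2. The XML declaration, then the BOM.
    match = XML_DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower(), "xml-declaration"
    for bom, name in BOMS:
        if body.startswith(bom):
            return name, "bom"

    # 3. A meta declaration, searched only near the top of the document.
    match = META.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower(), "meta"

    # 4. Defaults: utf-8 for application/xhtml+xml.
    if msg.get_content_type() == "application/xhtml+xml":
        return "utf-8", "default"

    # 5. Heuristic: if the bytes decode cleanly as UTF-8, go with that;
    #    otherwise fall back to the iso-8859-1 default.
    try:
        body.decode("utf-8")
        return "utf-8", "heuristic"
    except UnicodeDecodeError:
        return "iso-8859-1", "default"


print(detect_encoding("text/html; charset=utf-8", b'<meta charset="iso-8859-1">'))
```

<p>Returning the source along with the encoding makes it easy for a checker to warn when, say, the HTTP header and the <code>meta</code> declaration disagree.</p>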

<p>Does this seem really ugly and complicated to you? You will <em>love</em> the excellent <a href="http://nikitathespider.com/articles/EncodingDivination.html">Encoding Divination Flow Chart</a> by Philip Semanchuk, the developer of the Web quality checker "Nikita the spider".</p>

<p>Or, if this is still horribly fuzzy after looking at the flow chart, why not let a tool do that for you? The <a href="http://search.cpan.org/dist/HTML-Encoding/" title="HTML-Encoding - search.cpan.org">HTML::Encoding Perl module by Bj&#246;rn H&#246;hrmann</a> does just that.</p>

<h3>Last word... for HTML authors</h3>
<p>If you create content on the Web and never have to read and parse content on the web, and if you have read this far, you are probably considering yourself very lucky right now. But you can make a difference by making sure the content you put on the web uses consistent character encodings, and declares them properly. Your job is actually <em>much easier</em> than the tricky winding road to determining a document's encoding. In the proverbial three steps:</p>

<ol>
    <li>Use <code>utf-8</code>. Unless you have very specific needs such as very rare character variants in Asian languages,
        this should be your charset of choice. Most modern text, web or code editors are likely to support UTF-8, and some actually
        <em>only</em> support this encoding. If possible, choose an editor or setup that will <a href="http://www.w3.org/International/questions/qa-utf8-bom" title="W3C I18N FAQ: Display problems caused by the UTF-8 BOM">not output a BOM in UTF-8 files</a>, as this is known to cause some ugly display issues with some agents, and can even crash PHP includes.</li>
    <li>If you have access to the configuration of your web server, make sure that it <a href="http://www.w3.org/International/techniques/server-setup#setting" title="W3C I18N Server Setup Techniques">serves html as utf-8</a>.</li>
    <li>That's all.</li>
</ol>
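<p>If your editor insists on writing a BOM, a small pre-publication check can take care of it. The Python sketch below is only an illustration, and <code>prepare_utf8_document</code> is a made-up name: it verifies that the bytes really are valid UTF-8 and drops a leading BOM if there is one.</p>

```python
import codecs


def prepare_utf8_document(raw):
    """Check that `raw` is valid UTF-8 and drop a leading BOM if present.

    Hypothetical helper: raises UnicodeDecodeError when the bytes are
    not valid UTF-8, otherwise returns the bytes without the BOM.
    """
    raw.decode("utf-8")  # validation only; the decoded text is discarded
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


print(prepare_utf8_document(b"\xef\xbb\xbf<html></html>"))
```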
                           </div>
                           <div id="more" class="entry-more">

                           </div>
                       <p class="postinfo">Filed by <a href="http://www.w3.org/People/olivier/">olivier Théreaux</a> on March 10, 2008  4:11 PM in <a href="http://www.w3.org/QA/archive/technology/html/">HTML</a>, <a href="http://www.w3.org/QA/archive/technology/http/">HTTP</a><br />
<span class="separator">|</span> <a class="permalink" href="http://www.w3.org/QA/2008/03/html-charset.html">Permalink</a>
                                 | <a href="http://www.w3.org/QA/2008/03/html-charset.html#comments">Comments (14)</a>
                                 | <a href="http://www.w3.org/QA/2008/03/html-charset.html#trackback">TrackBacks (0)</a>
</p>


<h3 class="comments-header" id="comments">Comments</h3>
<div class="comment" id="comment-122102">
<p class="comment-meta" id="c122102">
<span class="comment-meta-author"><strong>Dana Lee Ling </strong></span>
<span class="comment-meta-date"><a href="#c122102">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<p>In regards the final note to HTML authors, our server serves HTML as iso-8859-1. If I follow rule one and use utf-8, then I get a warning from validation engines that the page meta charset tag encoding declaration disagrees with the HTTP header charset encoding declaration. So to stay clean and valid I have to use iso-8859-1 in my html pages, including <a href="http://www.comfsm.fm/~dleeling/statistics/s81/q06.html" rel="nofollow">those I write in html5.</a></p>

<p>The server cannot be changed without potentially breaking the large number of existing HTML4 pages that declare themselves to be iso-8859-1 encoding. My guess is that this situation is fairly common. Thus to use utf-8, I have to code using xml and send pages with a .xhtml extension. <a href="http://www.comfsm.fm/~dleeling/tech/mathml-in-svg.xhtml" rel="nofollow">My XHTML pages</a> are sent as application/xhtml+xml with utf-8 encoding by our server for which IE happily shows only the page code. </p>

<p>Thanks for the excellent article in any case!</p>

</div>
</div>


<div class="comment" id="comment-122193">
<p class="comment-meta" id="c122193">
<span class="comment-meta-author"><strong>Frank </strong></span>
<span class="comment-meta-date"><a href="#c122193">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<p>The first point of the recipe, a <em>"charset info in the HTTP Content-Type header should have precedence"</em>, won't fly <strong>if</strong> an explicit Latin-1 actually means "dunno", while no charset means default Latin-1, or in practice windows-1252 as far as HTML 5 is concerned. For this part of the madness folks could still try to fix it in the HTTP WG working on 2616bis.</p>

<p>A serious problem is the alleged "rough consensus" for "the server knows best". One of several premises is that there are no other protocols and URI schemes, only HTTP exists. </p>

<p>In the real world many HTTP servers don't know best, have no time to guess, if they try it anyway they will often get it wrong, many users have no way to fix it, and if the server says it is Latin-1 this likely means "dunno", while "dunno" means default Latin-1, see above.</p>

</div>
</div>


<div class="comment" id="comment-122228">
<p class="comment-meta" id="c122228">
<span class="comment-meta-author"><strong>Anne van Kesteren </strong></span>
<span class="comment-meta-date"><a href="#c122228">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<p>XML still allows infinite whitespace within the XML declaration...</p>

</div>
</div>


<div class="comment" id="comment-122275">
<p class="comment-meta" id="c122275">
<span class="comment-meta-author"><strong>olivier Théreaux <a class="commenter-profile" href="http://www.w3.org/People/olivier/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c122275">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<p>@ Dana Lee Ling: good point indeed. A web server should either give the content managers the possibility of overriding the default character encoding, or not set a default at all. You may want to point whoever manages your web server to the <a href="http://www.w3.org/TR/chips/">chips</a> document, particularly the section on character encoding...</p>

<p>On the W3C Web server we solved this issue thanks to dated space URIs. The default used to be iso-8859-1 for all documents, but for anything published into, say, /2007/, the default is utf-8.</p>

</div>
</div>


<div class="comment" id="comment-122277">
<p class="comment-meta" id="c122277">
<span class="comment-meta-author"><strong>olivier Théreaux <a class="commenter-profile" href="http://www.w3.org/People/olivier/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c122277">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<p>@ Frank: I agree that the default of latin-1 in HTTP is problematic, and it looks like the WG working on HTTPbis is not refusing to look into it, but they can't find a good workaround. You may be able to help by drafting one (or several) replacements and bring that to their consideration?</p>

</div>
</div>


<div class="comment" id="comment-122280">
<p class="comment-meta" id="c122280">
<span class="comment-meta-author"><strong>olivier Théreaux <a class="commenter-profile" href="http://www.w3.org/People/olivier/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c122280">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<p>@ Anne: Yes, an xml declaration can have whitespace in it. </p>

<p>I'm not sure I get your point, though… I certainly wouldn't agree that the whitespace <em>within the declaration</em> puts a heavy burden on the parser, and since the spec very clearly forbids having whitespace (or anything, for that matter) <em>before</em> the declaration, I think charset info in an XML declaration is pretty much the easiest in-document encoding declaration, ever.</p>

</div>
</div>


<div class="comment" id="comment-122294">
<p class="comment-meta" id="c122294">
<span class="comment-meta-author"><strong>David Zülke </strong></span>
<span class="comment-meta-date"><a href="#c122294">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<blockquote><p>But let's remember: XHTML is XML, and XML forbids the use of any character encoding other than UTF-8 or UTF-16 (or ascii, which is a convenient subset...) unless one uses an XML declaration</p></blockquote>

<p>That is not correct. The article you're referring to in the link talks about entities, by the way.</p>

<p>Anyways, XML documents actually <em>must not</em> use UTF-16 if their MIME type is text/xml. The assumed defaults are UTF-8 for XML documents served with an application/xml type, and ASCII with text/xml (also overriding the HTTP iso-8859-1 default).</p>

</div>
</div>


<div class="comment" id="comment-122316">
<p class="comment-meta" id="c122316">
<span class="comment-meta-author"><strong>Brian Repko </strong></span>
<span class="comment-meta-date"><a href="#c122316">#</a> 2008-03-11</span>
</p>
<div class="comment-bulk">
<p>utf-8 for asian languages can be a much larger payload than say Big5 or an specific encoding for that language.  I've seen requirements for lots of websites that specify character encodings specific to the "locale" being displayed.</p>

</div>
</div>


<div class="comment" id="comment-122447">
<p class="comment-meta" id="c122447">
<span class="comment-meta-author"><strong>olivier Théreaux <a class="commenter-profile" href="http://www.w3.org/People/olivier/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c122447">#</a> 2008-03-12</span>
</p>
<div class="comment-bulk">
<p>@ David Zülke</p>

<blockquote><p>The article you're referring to in the link talks about entities, by the way.</p></blockquote>

<p>I think the term “entity” in the XML specification is sometimes confusing… </p>

<p>I am rewording the article to be clearer about the fact that each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding <a href="http://www.w3.org/TR/xml/#sec-guessing-no-ext-info">must begin with an XML encoding declaration</a> and that in this context, as far as I (and every expert I asked about this), entity is <a href="http://www.w3.org/TR/xml/#sec-documents">a physical building block of an XML document</a>.</p>

<blockquote><p>XML documents actually _must not_ use UTF-16 if their MIME type is text/xml</p></blockquote>

<p>I don't think that's true. See <a href="http://tools.ietf.org/html/rfc3023#section-8.2">the RFC on text/xml</a>.</p>

</div>
</div>


<div class="comment" id="comment-122448">
<p class="comment-meta" id="c122448">
<span class="comment-meta-author"><strong>olivier Théreaux <a class="commenter-profile" href="http://www.w3.org/People/olivier/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c122448">#</a> 2008-03-12</span>
</p>
<div class="comment-bulk">
<p>@ Brian Repko</p>

<blockquote><p> utf-8 for asian languages can be a much larger payload than say Big5 or an specific encoding for that language</p></blockquote>

<p>I see your point here. </p>

<p>As a developer of content, e.g in Japanese, I could indeed just work with iso-2022-jp or shift-jis, but the fact is <em>I don't know</em> who is going to want to parse/read/use that content. </p>

<p>And in order to keep options open, I think I'd rather bet on internationalized tools that support unicode – even if, admittedly, some local tools are sometimes supporting only the local encodings, and not utf-8…</p>

</div>
</div>


<div class="comment" id="comment-122450">
<p class="comment-meta" id="c122450">
<span class="comment-meta-author"><strong>olivier Théreaux <a class="commenter-profile" href="http://www.w3.org/People/olivier/"><img alt="Author Profile Page" src="http://www.w3.org/QA/sununga/mt-static/images/comment/mt_logo.png" width="16" height="16" /></a></strong></span>
<span class="comment-meta-date"><a href="#c122450">#</a> 2008-03-12</span>
</p>
<div class="comment-bulk">
<p>Some followup discussion on the <a href="http://www.w3.org/QA/2008/03/html-charset.html#c122193">comment</a> sent by Frank is taking place on the <a href="http://lists.w3.org/Archives/Public/www-validator/2008Mar/thread.html#msg27">w3c validators mailing-list</a>, too.</p>

</div>
</div>


<div class="comment" id="comment-122814">
<p class="comment-meta" id="c122814">
<span class="comment-meta-author"><strong>Philip Semanchuk </strong></span>
<span class="comment-meta-date"><a href="#c122814">#</a> 2008-03-14</span>
</p>
<div class="comment-bulk">
<p>I'm glad you like the encoding divination flowchart. It took me a while to puzzle out those rules and writing the article helped to reinforce what I'd learned. Looking at it now with fresh eyes, I see that it's more complicated than it needs to be and I'll simplify it when I get a chance.</p>

<p>I'm curious if you know of any specification that states what precedence a BOM has relative to a META http-equiv or XML encoding declaration. I have not been able to find one. In my flowchart, I gave BOMs second priority after the HTTP Content-Type header because I figured that a BOM written by a text editor was more likely to be correct than the page author.</p>

<p>In "The Recipe" above, BOMs are listed in item #2, but almost as an afterthought. Don't you think they deserve their own item in the list? After all, one can find them in documents that have no pretensions whatsoever to being XML.</p>

</div>
</div>


<div class="comment" id="comment-124109">
<p class="comment-meta" id="c124109">
<span class="comment-meta-author"><strong>Chris Lilley </strong></span>
<span class="comment-meta-date"><a href="#c124109">#</a> 2008-03-20</span>
</p>
<div class="comment-bulk">
<p>"In the beginning the Web had ASCII."</p>

<p>When was that? My recollection was that the Web started with Latin-1 - which was seen as one of its advantages (by European folks) and then became a disadvantage (eg harder to introduce UTF-8).</p>

<p>The "last word" is sound advice. There is very little reason to use anything other than UTF-8 nowadays for any new content.</p>

</div>
</div>


<div class="comment" id="comment-124111">
<p class="comment-meta" id="c124111">
<span class="comment-meta-author"><strong>Chris Lilley </strong></span>
<span class="comment-meta-date"><a href="#c124111">#</a> 2008-03-20</span>
</p>
<div class="comment-bulk">
<p>@Brian: Is the Big5 vs UTF-8 size difference for plain text, or for markup (HTML, SVG, whatever)? Interested to see some stats on that.</p>

<p>Regarding text/* and UTF-16 - yes, actually, the requirements for text/* top level type on fallback, and the decision in RFC 3023 for HTTP charset to override the character encoding in the content  <em>even if the charset is missing</em> means that, in theory, UTF-16 content could be displayed as fallback text/plain in US-ASCII with every other character having code point zero. In practice people seem to believe the content, if HTTP supplies no charset.</p>

</div>
</div>


  <div class="comments-open" id="comments-open">
<h3 class="comments-open-header">Leave a comment</h3>

<div class="comments-open-moderated">
   <p>
   Note: this blog is intended to foster <strong>polite
   on-topic discussions</strong>. Comments failing these
   requirements and spam will not get published. Please,
   enter your real name and email address. Every
   individual comment is reviewed by the W3C staff.
   This may take some time, thank you for your patience.
   </p>
   <p>
   You can use the following HTML markup (a href, b, i,
   br/, p, strong, em, ul, ol, li, blockquote, pre)
   and/or <a href="http://daringfireball.net/projects/markdown/syntax">Markdown syntax</a>.</p>
</div>

<div id="comments-open-data">
<form method="post" action="http://www.w3.org/QA/sununga/beach.pl" id="comments-form">
<h4>Your comment</h4>
<div id="comments-open-text">
  <textarea id="comment-text" name="text" rows="20" cols="100"></textarea><br />
<label for="comment-text">Write your comment text here. Remember, keep the discussion on topic and courteous.</label>
</div>

<h4>About you</h4>
<div id="comment-form-name">
  <input type="hidden" name="static" value="1" />
<input type="hidden" name="entry_id" value="153" />
<input type="hidden" name="__lang" value="en" />
<label for="comment-author">Your Name</label>
<input id="comment-author" name="author" size="30" value="" />
</div>
<div id="comment-form-email">
<label for="comment-email">Your Email Address</label>
<input id="comment-email" name="email" size="30" value="" />
</div>

<div id="comments-open-footer">
<input type="submit" accesskey="s" name="post" id="comment-submit" value="Submit" />

</div>
</form>
</div>
</div>


<p id="gentime">This page was last generated on $Date: 2011/12/16 03:02:44 $</p>

      </div><!-- End of "main" DIV. -->

<address>

This blog is written by W3C staff and working group participants,<br />
&nbsp;and maintained by <a href="/People/CMercier/">Coralie Mercier</a>.<br />
Authorized parties may <a href="/QA/new">log in</a> to create a new entry.<br/>
<span id="poweredby">Powered by Movable Type, magpierss and a lot of Web Technology</span>
    </address>


    <p class="copyright">
      <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> &copy; 1994-2011
      <a href="http://www.w3.org/"><acronym title="World Wide Web Consortium">W3C</acronym></a>&reg;
      (<a href="http://www.csail.mit.edu/"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>,
      <a href="http://www.ercim.eu/"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
      <a href="http://www.keio.ac.jp/">Keio</a>),
      All Rights Reserved.
      W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
      <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
      <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
      and <a rel="Copyright" href="http://www.w3.org/Consortium/Legal/copyright-software">software licensing</a>
      rules apply. Your interactions with this site are in accordance
      with our <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Public">public</a> and
      <a href="http://www.w3.org/Consortium/Legal/privacy-statement#Members">Member</a> privacy
      statements.
    </p>

  </body>
</html>