charset-harmful.html 14.4 KB
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<HTML>
<HEAD>
<TITLE>"Character Set" Considered Harmful</TITLE>
<META NAME="author" CONTENT="Connolly">
<META NAME="status" CONTENT="Internet Draft">
<META NAME="title" CONTENT="Character Terminology">
<META NAME="date" CONTENT="May, 1995">
<base href="http://www.w3.org/hypertext/WWW/MarkUp/html-spec/charset-harmful.html">
<!-- $Id: charset-harmful.html,v 1.6 2004/12/02 15:18:12 liam Exp $ -->
</HEAD>

<body>
<PRE>
HTML Working Group                                           <a href="http://www.w3.org/hypertext/WWW/People/Connolly/">D. Connolly</a>
INTERNET-DRAFT                                               <A HREF="../../Consortium/Prospectus/Overview.html">MIT/W3C</a>
draft-ietf-html-charset-harmful-00.txt                       May 2, 1995
Expires November, 1995

</pre>
<h1><em>Character Set</em> Considered Harmful</h1>
<h2>Status of this Document</h2>

<p>This document is an Internet-Draft. Internet-Drafts are working 
   documents of the Internet Engineering Task Force (IETF), its areas, 
   and its working groups. Note that other groups may also distribute 
   working documents as Internet-Drafts.

<p>Internet-Drafts are draft documents valid for a maximum of six 
   months and may be updated, replaced, or obsoleted by other 
   documents at any time. It is inappropriate to use Internet-Drafts 
   as reference material or to cite them other than as "work in 
   progress."

<p>To learn the current status of any Internet-Draft, please check the 
   "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow 
   Directories on ftp.is.co.za (Africa), nic.nordu.net (Europe), 
   munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or 
   ftp.isi.edu (US West Coast).

<p>Distribution of this document is unlimited. Please send comments to 
   the HTML working group (HTML-WG) of the Internet Engineering Task 
   Force (IETF) at <tt>&lt;html-wg@oclc.org&gt;</tt>;. Discussions of
the group are
   archived at <tt><a
href="http://www.acl.lanl.gov/HTML_WG/archives.html">http://www.acl.lanl.gov/HTML_WG/archives.html</a></tt>.

<h2>Abstract</h2>

<p> The term <em>character set</em> is often used to describe a ditigal
representation of text. ASCII is perhaps the most widely deployed
representation of text, and in the interest of interoperability,
information systems on the Internet traditionally rely on it
exclusively.

<p> The Multipurpose Internet Mail Extensions (MIME) introduces
Internet Media Types, including text representations besides
ASCII. The Hypertext Markup Language (HTML) used in the World-Wide Web
is a proposed Internet Media Type. But HTML is also an application of
Standard Generalized Markup Language (SGML).

<p> In the MIME and SGML specifications, the discussion of characters
representation is notoriously complex, and apparently subtly
inconsistent or incompatible. This document presents a collection of
terms intended to reconcile the two specifications and serve as a
basis for rigorous discussion of characters and their digital
representations.


<h2>Introduction</h2>

<p> The term <em>character set</em> is often used to describe a ditigal
representation of text. The specification of such a representation
typically involves identifying a sufficiently expressive collection of
characters, and giving each of them a number.

<p> In conventional mathematics terminology then, a "character set"
is not just a set of characters, but a <em>function</em> whose domain
is a set of integers, and whose range is a set of characters.

<p> Some standards documents, including the SGML standard, make little
or no use of such conventional mathematical terms as function, domain
and range. Perhaps the authors of those documents intend the documents
to be comprehensible without a prior understanding of mathematics. But
the specification of notions such as the conformance of an SGML
document or SGML system are much more complex than the basics of logic
and mathematics.

<p> In his text on Calculus<a href="#Spivak">[Spivak]</a>, Michael
Spivak writes:

<blockquote>

<p> Every aspect of this book was influenced by the desire to present
calculus not merely as a prelude to but as the first real encounter
with mathematics. Since the foundation of analysis provided the arena
in which modern modes of mathematical thinking developed, calculus
ought to be the place in which to expect, rather than avoid, the
strengthening of insight with logic. In addition to developing the
students' intuition about the beautiful concepts of analysis, it is
surely equally important to persuade them that precision and rigor are
neither deterrents to intuition, nor ends in themselves, but the
natural medium in which to formulate and think about mathematical
questions.

</blockquote>

<p> This document is not intended as the first real encounter with
mathematics. But neither will we make any effort to avoid or apologize
for mathematical terminology. The reader is referred to the large body
of literature on logic and set theory, including a history of writings
on math and logic[SET] and Douglas Hofstadter's fascinating book<a
href="#GEB">[GEB]</a>.

<h2>Coded Character Sets</h2>

<p> Using "character set" rather than something such as <em>character
table</em> or even <em>character sequence</em> to denote the functions
that maps integers to characters is unfortunate, but it is water under
the bridge, and a lot of it by now. Rather than attempting to divert
all that water at this point, we introduce the primitive notion of
character and use it to define the term <em>coded character set</em>
from [ISO10646] and other standards:

<dl>

<dt> character

<dd> An atom of information

<dt> coded character set

<dd> A function whose domain is a subset of the integers, and whose
range is a set of characters.

</dl>

<p> Note that by the term character, we do not mean a glyph, a name, a
phoneme, nor a bit combination. A character is simply an atomic unit
of communication. It is typically a symbol whose various
representations are understood to mean the same thing by a community
of people.

<p> It might seem more intuitive to map from characters to integers,
rather than the way it is defined here. But in practice there are
some coded character sets that assign two different numbers to the
same character<a href="lee">[Lee]</a>, and so the inverse is not a
function in the general case.

<p> There are two other terms used in standards such as [ISO10646]
that we define in relation to the first two:

<dl>

<dt> code position

<dd> An integer. A coded character set and a code position from
its domain determine a character.

<dt> character repertoire

<dd> A set of characters; that is, the range of a coded character set.

</dl>

<h2>Character Encoding Schemes</h2>

<p> The only practical means for exchanging information on the
Internet is to represent it as a sequence of octets (bytes).

<p> One way to transmit a sequence of characters is to agree on
a coded character set and transmit the character numbers of each
of the characters.

<p> But in practice, characters are encoded using a variety of
optimizations of this brute-force approach: code switching techniques,
escape sequences, etc. The encoding of a sequence of characters is
not, in general, the result of encoding each character independently
and then concatenating them. But it is sufficiently general to note
that sequences of characters are encoded as a sequence of bytes. So we
define:

<dl>

<dt>octet

<dd>an element of the set {0, 1, 2, ..., 255}

<dt>character encoding scheme

<dd>a function whose domain is the set of sequences of octets, and
whose range is the set of sequences of characters over some character
repertoire.

</dl>


<h2>Representation of SGML Text Entities</h2>

<p> An SGML document is made up of entities: a text entity called the
document entity, and possibly some other text entities and data
entities.

<p> A text entity is a sequence of characters. The representation of a
text entity is not specified by the SGML standard. For the purpose of
MIME-based interchange of SGML text entities, we define the following:

<dl>

<dt>text entity

<dd>a sequence of characters

<dt>message entity

<dd>a pair (T, OS) where T is an Internet Media Type and OS is a
sequence of octets.

</dl>

<p> Note that each <tt>text/*</tt> media type has an associated
<tt>charset</tt> parameter, which designates a character encoding
scheme. The character encoding scheme maps the body -- a sequence of
octets -- to a text entity -- a sequence of characters. Hence any
message entity of type <tt>text/*</tt> is equivalent to a text entity.


<h2>Numeric Character References</h2>

<p> Numeric character references are a great source of confusion. The
key insights are that:

<ul>

<li> Every SGML document has exactly one document character set, which
is a coded character set

<li> Numeric character references give code positions in the document
character set

</ul>

<h3>Example: ISO2022 Encoding with ISO10646 Coded Character Set</h3>

<p> Consider the following message entity:

<pre>
Date: Saturday, 29-Apr-95 03:53:33 GMT
MIME-version: 1.0
Content-Type: text/html; charset=iso-2022-jp

&lt;TITLE>...&lt;/TITLE>
&lt;BODY>
Here is some normal text.
Here is a 10646 numeric character reference &#2432;.
Here is some ISO-2022-JP text: ...
&lt;/BODY>

</pre>

<p> To interpret the message entity, we notice that the
<tt>Content-Type</tt> is <tt>text/html</tt>, so this represents a text
entity. The <tt>charset</tt> parameter <tt>iso-2022-jp</tt>, along
with the octet sequence of the body, determines a sequence of
characters. The octets denoted above by '...' represent characters,
as per <tt>iso-2022-jp</tt>.

<p> To parse the resulting text entity as per SGML, the sender and
receiver must agree on an SGML declaration, since none is present in
the document entity. For this example, we assume that SGML declaration
specifies ISO10646 as the document character set. So the numeric
character reference <tt>&#2432;</tt> is resolved with respect to
ISO10646.

<p> It may seem contradictory that the <tt>ISO-2022-JP</tt> character
encoding scheme is defined in terms of a collection of coded character
sets, none of which is ISO10646. But there is no contradiction. Each
character encoded by <tt>ISO-2022-JP</tt> is in the repertoire of one
of those coded character sets, each of which is a subset of the
repertoire of ISO10646.

<p> So while <tt>ISO-2022-JP</tt> is not sufficient for every ISO10646
document, it is the case that ISO10646 is a sufficient document
character set for any entity encoded with <tt>ISO-2022-JP</tt>.

<h3>Example: Reducing the Repertoire of an Entity</h3>

<p> Suppose we have an SGML document D whose document character set is
the coded character set ISO10646. We find the document entity DE in
the form of sequence of octets OS in a disk file, encoded using the
Unicode-UCS-2 character encoding scheme.

<pre>
	Unicode-UCS-2(OS) = DE
</pre>

<p> We can reduce the character repertoire necessary to represent the
document entity by replacing characters outside the ISO-646-IRV
character repertoire with numeric character references:

<pre>
	DE' = reduce(DE, ISO10646, ISO-646-IRV)

where

  reduce : SEQ(char) X Coded Character Set X Character Repertoire -> SEQ(char)

and

  reduce(c . rest, CCS, R) = if c in R, c . reduce(rest, CCS, R)
					else &#38;#N; . reduce(rest, CCS, R)
					where CCS(N) = c
</pre>

<p> The resulting entity, DE' can then be endoded using US-ASCII


<pre>
	US-ASCII(OS') = DE' = reduce(DE, ISO10646, ISO-646-IRV)
</pre>

<p> Hence, we can represent the document D as a message entity whose
content type is "text/plain; charset=US-ASCII" and whose body is OS'.


<h2>Conclusion</h2>

<p> It is critical to keep separate the notion of a simple table of
characters and their numbers, i.e. a coded character set, separate
from the various algorithms to encoded sequences of characters,
i.e. character encoding schemes. This separation allows a
representation of a text entity which is consistent with both the MIME
and SGML specifications.

<h2>Acknowledgements</h2>


<p> The idea for the title of this document actually came from John
Klensin. The notion of character encoding scheme was inspired by the
MIME specification by Ned Freed. James Clark, Ed Levinson, and several
other members of the MIMESGML working group collaborated in
discussions leading up to this draft. Liam Quin from SoftQuad
and Gavin Nicol from EBT have provided guidance on these issues in the
past. Erik Naggum has provided invaluable aid in understanding the
SGML standard.

<h2>References</h2>

<dl>

<dt><a name="MIME">[MIME]</a>

<dd>N. Borenstein and N. Freed. "MIME (Multipurpose Internet Mail
Extensions) Part One: Mechanisms for Specifying and Describing the
Format of Internet Message Bodies." RFC 1521, Bellcore, Innosoft,
September 1993.

<dt><a name="ASCII">[ASCII]</a>

<dd> US-ASCII. Coded Character Set - 7-Bit American Standard Code for
Information Interchange. Standard ANSI X3.4-1986, ANSI, 1986.

<dt><a name="ISO8859">[ISO-8859]</a>

<dd> ISO 8859. International Standard -- Information Processing --
        8-bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin 
        Alphabet No. 1, ISO 8859-1:1987. Part 2: Latin alphabet No. 2, 
        ISO 8859-2, 1987. Part 3: Latin alphabet No. 3, ISO 8859-3, 
        1988. Part 4: Latin alphabet No. 4, ISO 8859-4, 1988. Part 5: 
        Latin/Cyrillic alphabet, ISO 8859-5, 1988. Part 6: Latin/Arabic 
        alphabet, ISO 8859-6, 1987. Part 7: Latin/Greek alphabet, ISO 
        8859-7, 1987. Part 8: Latin/Hebrew alphabet, ISO 8859-8, 1988. 
        Part 9: Latin alphabet No. 5, ISO 8859-9, 1990.

<dt><a name="SGML">[SGML]</a>

<dd> ISO 8879. Information Processing -- Text and Office Systems -- 
     Standard Generalized Markup Language (SGML), 1986.

<dt><a name="Nicol">[Nicol]</a>

<dd> <a href="http://www.ebt.com:8080/docs/multilingual-www.html">The
Multilingual World Wide Web</a>, Gavin T. Nicol, Electronic Book
Technologies, Japan <tt>gtn@ebt.com</tt>

<!-- @@ DSSSL draft spec -->

<!-- @@ O'Reilly's internationalization book -->

<!-- @@ ISO 10646 -->

<!-- @@ ISO 2022-jp RFC -->

<dt> <a name="lee">[Lee]</a>

<dd> Private communication with Liam Quin, from SoftQuad.

<dt> <a name="Spivak">[Spivak]</a>
<dd> Spivak, Michael. Calculus. 2nd Ed. 1967 ISBN 0-914098-77-2

<dt> <a name="GEB">[GEB]</a>

<dd> Hofstadter, Douglas R.
G&ouml;del, Escher, Bach: An Eternal Golden Braid, 1979
ISBN 0-394-75682-7

<dt>[SET]
<dd>
"Investigations in the foundations of set theory I", in Jean van
Heijenoort (ed.) _From Frege to Godel: A Source Book in Mathematical
Logic, 1879-1931_ (Harvard U.P., 1967)


</dl>

</body>
</html>