<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="ProgId" content="FrontPage.Editor.Document">
  <style type="text/css">
.unicode     { font-style: normal }
.unicode:link { color: #FF0000; background-color: #FFFFFF }
.unicode:visited { color: #808080; background-color: #FFFFFF }
.unicode:active { color: #0000FF; background-color: #FFFFFF }
em.unicode   { font-style: normal }
 </style>
  <title>Unicode in XML and other Markup Languages</title>
  <link rel="stylesheet" type="text/css"
  href="http://www.w3.org/StyleSheets/TR/W3C-WG-NOTE.css">
</head>

<body>

<div class="head">
<p><a href="http://www.w3.org/"><img alt="W3C"
src="http://www.w3.org/Icons/w3c_home" align="middle" border="0" height="48"
width="72"></a> <a href="http://www.unicode.org/"><img alt="Unicode"
src="http://www.unicode.org/img/unilogo-72.gif" align="middle" border="0"
height="72" width="72"></a> </p>

<h1>Unicode in XML and other Markup Languages</h1>

<h2 class="unicode" id="utr20">Unicode Technical Report #20</h2>

<h2>W3C Working Group Note 16 May 2007</h2>
<dl>
  <dt class="unicode">Revision (Unicode):</dt>
    <dd>8</dd>
  <dt>This version:</dt>
    <dd class="unicode"><a
      href="http://www.unicode.org/reports/tr20/tr20-8.html">http://www.unicode.org/reports/tr20/tr20-8.html</a></dd>
    <dd><a
      href="http://www.w3.org/TR/2007/NOTE-unicode-xml-20070516/">http://www.w3.org/TR/2007/NOTE-unicode-xml-20070516/</a></dd>
  <dt>Latest version:</dt>
    <dd class="unicode"><a
      href="http://www.unicode.org/reports/tr20/">http://www.unicode.org/reports/tr20/</a></dd>
    <dd><a
      href="http://www.w3.org/TR/unicode-xml/">http://www.w3.org/TR/unicode-xml/</a></dd>
  <dt>Previous version:</dt>
    <dd class="unicode"><a
      href="http://www.unicode.org/reports/tr20/tr20-7.html">http://www.unicode.org/reports/tr20/tr20-7.html</a></dd>
    <dd><a
      href="http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/">http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/</a></dd>
  <dt>Date (Unicode):</dt>
    <dd>2007-05-16</dd>
  <dt>Authors:</dt>
    <dd>Martin Dürst (<a
      href="mailto:duerst@it.aoyama.ac.jp">duerst@it.aoyama.ac.jp</a>)</dd>
    <dd>Asmus Freytag (<a
      href="mailto:asmus@unicode.org">asmus@unicode.org</a>)</dd>
</dl>

<p class="copyright">Copyright © 2007 Unicode®, and <a
href="http://www.w3.org/"><acronym
title="World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a
href="http://www.csail.mit.edu/"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. <a
href="#Copyright">Detailed copyright information</a> is available.</p>
<hr title="Separator from Header">
</div>

<h2><a name="Abstract" id="Abstract"></a>Abstract</h2>

<p>This document contains guidelines on the use of the Unicode Standard in
conjunction with markup languages such as XML.</p>

<h2><a name="CommonStatus">Status of This Document (common)</a></h2>
<!--PROPOSED UPDATE
<p><font color="#FF0000">This is a proposed update to a Technical Report
published jointly by the <a href="http://www.unicode.org/unicode/consortium/utc.html">Unicode
Technical Committee</a> and by the <a href="http://www.w3.org/International/Group/">W3C
Internationalization Working Group/Interest Group</a> (<a href="http://cgi.w3.org/MemberAccess/AccessRequest">W3C
Members only</a>) in the context of the <a href="http://www.w3.org/International/Activity">W3C
Internationalization Activity</a>. This is a draft document which may be
updated, replaced, or superseded by other documents at any time. This is not a
stable document; it is inappropriate to cite this document as other than a work
in progress.&nbsp;</font></p>
-->
<!-- APPROVED -->

<p>This is a Technical Report published jointly by the <a
href="http://www.unicode.org/unicode/consortium/utc.html">Unicode Technical
Committee</a> and by the <a href="http://www.w3.org/International/core/">W3C
Internationalization Core Working Group</a>, which is part of the <a
href="http://www.w3.org/International/Activity">W3C Internationalization
Activity</a>.</p>

<p>The base version of the Unicode Standard for this document is <a
href="#Unicode50">Version 5.0</a>. For more information about versions of the
Unicode Standard, see <a
href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions/</a>.
Both the Unicode Standard and markup technologies are evolving. When
appropriate, a new version of this document may be published.</p>
<p>Please mail corrigenda and other comments to the authors or use the <a
href="http://www.unicode.org/reporting.html">reporting form</a>.</p>

<h2 class="unicode"><a name="UnicodeStatus">Status of This Document (Unicode
Consortium)</a></h2>

<div>
<!-- PROPOSED UPDATE <font color="#FF0000">This document is a proposed
update of a previously approved <b>Unicode Technical Report</b>. Publication
does not imply endorsement by the Unicode Consortium. </font>
-->
<!-- APPROVED -->
This document has been reviewed by Unicode members and other interested
parties, and has been approved by the Unicode Technical Committee as a
<b>Unicode Technical Report</b>. It is a stable document and may be used as
reference material or cited as a normative reference from another document. <!-- -->
 </div>

<div>

<blockquote>
  <p><b>A Unicode Technical Report (UTR) </b>contains informative material.
  Conformance to the Unicode Standard does not imply conformance to any UTR.
  Other specifications, however, are free to make normative references to a
  UTR.</p>
</blockquote>
</div>

<div>
<p>For a list of current Unicode Technical Reports see <a
href="http://www.unicode.org/reports/">http://www.unicode.org/reports/</a>.</p>

<h2><a name="W3CStatus">Status of This Document (W3C)</a></h2>

<p><em>This section describes the status of this document at the time of its
publication. Other documents may supersede this document. A list of current
W3C publications and the latest revision of this technical report can be
found in the <a href="http://www.w3.org/TR/">W3C technical reports index</a>
at http://www.w3.org/TR/.</em></p>
<!--PROPOSED UPDATE
<p><font color="#FF0000">This is a proposed update to a Note that has been
previously endorsed by the W3C Internationalization Working Group/Interest
Group, but has not been reviewed or endorsed by W3C Members.</font></p>
-->
<!--APPROVED -->

<p>This document contains guidelines on the use of the Unicode Standard in
conjunction with markup languages such as XML.</p>

<p>This <a href="http://www.w3.org/2005/10/Process-20051014/tr.html#q75">W3C
Working Group Note</a> was produced by the <a
href="http://www.w3.org/International/core/" shape="rect">i18n Core Working
Group</a>, part of the <a
href="http://www.w3.org/International/">Internationalization Activity</a>.
Please send comments related to this document to <a
href="mailto:www-i18n-comments@w3.org?subject=%5Bunicode-xml%5D"
shape="rect">www-i18n-comments@w3.org</a> (<a
href="http://lists.w3.org/Archives/Public/www-i18n-comments/"
shape="rect">public archive</a>). Use "[unicode-xml]" in the subject line of
your email.</p>

<p>Publication as a <a
href="http://www.w3.org/2005/10/Process-20051014/tr.html#tr-end">Working
Group Note</a> does not imply endorsement by the W3C Membership. At the time
of publication, work on this document was considered complete and no further
revisions are anticipated. It is a stable document and may be used as
reference material or cited from another document. However, this document may
be updated, replaced, or made obsolete by other documents at any time.</p>

<p>This document was produced by a group operating under the <a
href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 February 2004
W3C Patent Policy</a>. W3C maintains a <a
href="http://www.w3.org/2004/01/pp-impl/32113/status">public list of any
patent disclosures</a> made in connection with the deliverables of the group;
that page also includes instructions for disclosing a patent. An individual
who has actual knowledge of a patent which the individual believes contains
<a
href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential
Claim(s)</a> must disclose the information in accordance with <a
href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section
6 of the W3C Patent Policy</a>.</p>
</div>
<!-- -->

<h2><a name="Contents">Table of Contents</a></h2>
<ol>
  <li><a href="#Introduction">Introduction</a><br>
    1.1 <a href="#Notation">Notation</a></li>
  <li><a href="#General">General Considerations</a><br>
    2.1 <a href="#Linearity">Linearity versus Structure</a><br>
    2.2 <a href="#Overlap">Overlap of Control Code and Markup
    Semantics</a><br>
    2.3 <a href="#Markup">Markup and Styling</a><br>
    2.4 <a href="#Coincidence">Coincidence of Markup and Functions</a><br>
    2.5 <a href="#Extensibility">Extensibility of Markup</a><br>
    2.6 <a href="#Suitability">Suitability of Characters in Markup</a></li>
  <li><a href="#Suitable">Characters not Suitable for Use With Markup</a><br>
    3.1 <a href="#Charlist">Table of Characters not Suitable for Use With
    Markup</a><br>
    3.2 <a href="#Line">Line and Paragraph Separator</a><br>
    3.3 <a href="#Bidi">Bidi Embedding Controls</a><br>
    3.4 <a href="#Deprecated">Deprecated Formatting Characters</a><br>
    3.5 <a href="#BOM">Byte Order Mark</a><br>
    3.6 <a href="#Interlinear">Interlinear Annotation Characters</a><br>
    3.7 <a href="#Object">Object Replacement Character</a><br>
    3.8 <a href="#Musical">Musical Controls</a><br>
    3.9 <a href="#Language">Language Tag Characters</a><br>
    3.10 <a href="#OtherDeprecated">Other Deprecated Characters</a></li>
  <li><a href="#Format">Format Characters Suitable for Use With Markup</a>
     <br>
    4.1 <a href="#Subtending">Subtending Marks</a><br>
    4.2 <a href="#Fraction">Fraction Slash</a><br>
    4.3 <a href="#Variation">Variation Selector</a><br>
    4.4 <a href="#Ideographic">Ideographic Description Characters</a><br>
    4.5 <a href="#Invisible">Invisible Mathematical Operators</a><br>
    4.6 <a href="#LineBreak">Line Break Controls</a><br>
    4.7 <a href="#Fillers">Hangul Fillers</a></li>
  <li><a href="#Compatibility">Characters with Compatibility Mappings</a><br>
    5.1 <a href="#Overview">Overview</a><br>
    5.2 <a href="#Generating">Generating New Text</a><br>
    5.3 <a href="#List">List Item Marker Characters</a><br>
    5.4 <a href="#Fractions">Fractions</a><br>
    5.5 <a href="#Squared">Squared or Horizontal</a><br>
    5.6 <a href="#Superscripts">Superscripts and Subscripts</a><br>
    5.7 <a href="#Other">Other Characters Marked &lt;compat&gt;</a></li>
  <li><a href="#Noncharacters">Noncharacters</a></li>
  <li><a href="#White">White Space</a><br>
    7.1 <a href="#converting-nl-to-ws">Converting Newline Functions to White
    Space</a></li>
  <li><a href="#Versioning">Versioning</a></li>
  <li><a href="#Conformance">Conformance</a></li>
  <li><a href="#References">References</a></li>
  <li><a href="#Acknowledgements">Acknowledgements</a></li>
  <li><a href="#ChangeHistory">Change History</a></li>
  <li><a href="#Copyright">Copyright</a></li>
</ol>

<h2><a name="Introduction">1. Introduction</a></h2>

<p>The Unicode Standard  [<a href="#Unicode">Unicode</a>] defines the
universal character set. Its primary goal is to provide an unambiguous
encoding of the content of plain text, ultimately covering all languages in
the world, but also major text-based notational systems for science,
technology, music, and scholarship.</p>

<p>Currently in its <a href="#Unicode50">fifth major version</a>, Unicode
contains a large number of characters covering most of the scripts currently
in use in the world. It also contains additional characters for
interoperability with older character encodings, and characters with
control-like functions, included primarily to allow an unambiguous
interpretation of plain text. Unicode provides specifications for the
use of all of these characters.</p>

<p>For document and data interchange, the Internet and the World Wide Web
make extensive use of marked-up text such as <a href="#html4.01">HTML 4.01</a>
and <a href="#xml10">XML</a>. In many instances, markup provides features
that are the same as, or essentially similar to, those provided by format
characters in the Unicode Standard for use in plain text. Compatibility
characters are another special category of characters provided by Unicode.
While there may be valid reasons to support these characters and their
specifications in plain text, their use in marked-up text can conflict with
the rules of the markup language. Format characters are discussed in
Section 3, <i><a href="#Suitable">Characters not Suitable for Use With
Markup</a></i>, and Section 4, <i><a href="#Format">Format Characters
Suitable for Use With Markup</a></i>; compatibility characters are discussed
in Section 5, <i><a href="#Compatibility">Characters with Compatibility
Mappings</a></i>. Section 6 briefly discusses noncharacters, and Section 7 is
devoted to white space.</p>

<p>Issues resulting from canonical equivalences and Normalization [<a
href="#UTR15">Normalization</a>] as well as the interaction of character
encoding and methods of escaping characters in markup are discussed in the
Character Model for the World Wide Web [<a href="#Charmod">Charmod</a>] and
[<a href="#Charmodnorm">Charmodnorm</a>].</p>

<p>The issues of using Unicode characters with marked-up text depend to some
degree on the rules of the markup language in question and the set of
elements it contains. In a narrow sense, this document concerns itself only
with XML, and to some extent HTML. However, much of the general information
presented here should be useful in a broader context, including some page
layout languages.</p>

<blockquote>
  <p><b><a name="Note">Note:</a></b> Many of the recommendations of this
  report depend on the availability of particular markup or styling. Where
  possible, appropriate DTDs or Schemas should be used or designed to make
  such markup or styling available, or the DTDs or Schemas used should be
  appropriately extended. The current version of this document makes no
  specific recommendations for the design of DTDs or Schemas, or for the use
  of particular DTDs or Schemas, but the information presented here may be
  useful to designers of DTDs and Schemas, and to people selecting DTDs or
  Schemas for their applications. </p>

  <p><b>Note: </b>The recommendations of this report do not apply in the case
  of XML used for blind data transport and similar cases.</p>
</blockquote>

<h3><a name="Notation">1.1 Notation</a></h3>

<p>This report uses XML [<a href="#xml10">XML</a>] as a prominent and general
example of markup. The XML namespace notation [<a
href="#Namespace">Namespace</a>] is used to indicate that a certain element
is taken from a specific markup language. As an example, the prefix 'xhtml:'
indicates that an element is taken from [<a href="#XHTML">XHTML</a>].
Examples containing the namespace prefix 'xhtml:' are therefore assumed
to include a namespace declaration of the form xmlns:xhtml="...".</p>

<p>Characters are denoted using the notation used in the Unicode Standard,
that is, an optional U+ followed by their hexadecimal number, using at least
4 digits, such as "U+1234" or "U+10FFFD". In XML or HTML these could be
expressed as "&amp;#x1234;" or "&amp;#x10FFFD;".</p>
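
<p>As a small illustration of this notation (a sketch only, not part of the
recommendations of this report), the fragment below writes the same two code
points as numeric character references inside an element. The choice of a
paragraph element is arbitrary, and binding the 'xhtml:' prefix to the XHTML
namespace URI on the element itself is an assumption made here so that the
fragment is self-contained.</p>

<pre>&lt;xhtml:p xmlns:xhtml="http://www.w3.org/1999/xhtml"&gt;
  U+1234 is written as &amp;#x1234; and U+10FFFD is written as &amp;#x10FFFD;
&lt;/xhtml:p&gt;</pre>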

<h2><a name="General">2. General Considerations</a></h2>

<p>There are several general points to consider when looking at the
interaction between character encoding and markup. </p>
<ul>
  <li>Linearity of text vs. hierarchy of markup structure</li>
  <li>Overlap of control codes and markup semantics</li>
  <li>Markup <i>vs.</i> Styling</li>
  <li>Coincidence of semantic markup and functions </li>
  <li>Extensibility of markup</li>
</ul>

<h3 align="left"><a name="Linearity">2.1 Linearity versus Structure</a></h3>

<p align="left">Encoding text as a sequence of characters without further
information leads to a linear sequence, commonly called plain text. Character
follows character, without any particular structure. Markup, on the other
hand, defines a hierarchical structure for the text or data. In the case of
XML and most other, similar markup languages, the markup defines a tree
structure. While this tree structure is linearized for transmission in the
XML document, once the document has been parsed, the tree is available
directly.</p>

<p align="left">Operations that are easy to perform on trees are often
difficult to perform on linear sequences and vice versa. By separating
functionality between character encoding and markup appropriately, the
architecture becomes simpler, more powerful and longer-lasting.</p>

<p align="left">In particular, operations on hierarchical structures can
easily make sure that information is kept in context. Attributes assigned to
parts of a document are moved together with the associated part of the
document. Assigning an attribute to a part of a document limits the scope of
the attribute to that part of the document. Performing the same operations on
linear sequences of characters using control codes to set attributes and to
delimit their scope requires much more work and is error prone. Locating the
start or end of a span of text of the same attribute requires scanning
backwards and forwards for the embedded delimiter or control code. Moving or
editing text often results in mismatched control codes, so that an attribute
might suddenly apply to text it was not intended for.</p>
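
<p align="left">A minimal sketch of this contrast, using bidirectional
embedding as the example. In the first paragraph the embedded span is
delimited by markup, so the scope of the attribute is the element itself. In
the second it is delimited by the paired control characters U+202B
RIGHT-TO-LEFT EMBEDDING and U+202C POP DIRECTIONAL FORMATTING, written here
as numeric character references. The element names and the elided text are
illustrative only.</p>

<pre>&lt;!-- Scope carried by the tree: moving the span moves its attribute too. --&gt;
&lt;xhtml:p&gt;A quotation: &lt;xhtml:span dir="rtl"&gt;...&lt;/xhtml:span&gt; ends here.&lt;/xhtml:p&gt;

&lt;!-- Scope carried by paired control codes: cutting or pasting part of the
     text can leave an unmatched U+202B or U+202C behind. --&gt;
&lt;xhtml:p&gt;A quotation: &amp;#x202B;...&amp;#x202C; ends here.&lt;/xhtml:p&gt;</pre>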

<h3 align="left"><a name="Overlap">2.2 Overlap of Control Code and Markup
Semantics</a></h3>

<p align="left">When markup is not available, plain text may require control
characters. This is usually the case where plain text must contain some
scoping or attribute information in order to be legible, <i>i.e.</i> to be
able to transmit the same content between originator and receiver. Many of
these control characters have direct equivalents in particular markup
languages, since markup handles these concerns efficiently. If both the
characters and their markup equivalents may be present in the same text, the
question arises of which takes priority. It is therefore important to
identify and resolve these ambiguities at the time markup is first applied.</p>

<h3 align="left"><a name="Markup">2.3 Markup and Styling</a></h3>

<p align="left">Besides the basic character encoding and text markup there is
a third contributor to text functionality, namely styling. Markup is
concerned with the logical structure of the text or data, <i>e.g. </i>to
indicate sections, subsections, and headers in a document, or to indicate the
various fields of an address record. Styling is used to present the
information in various ways, <i>e.g.</i> in different fonts, different type
styles (italic, bold), different colors, <i>etc. </i>Some character codes do
not encode a generic character, but a styled character. Where these
characters are used, styling information is frozen, <i>i.e.</i> it is no
longer possible to alter the appearance of the text by applying style
information. However, there are many examples where a historically free
stylistic variation has over time become a semantic distinction that is
properly encoded as plain text. Sometimes what is a free variation in some
contexts implies a strict semantic distinction in others. In all such
instances, altering the appearance of the text by styling information would
irreparably alter the content of the text. This is of particular concern with
mathematical notation or systems for phonetic and phonemic transcription
which make extensive semantic use of styles on a character by character
basis.</p>

<h3 align="left"><a name="Coincidence">2.4 Coincidence of Markup and
Functions</a></h3>

<p align="left">Dealing with various functionalities on the markup level has
the additional advantage that in most cases, text portions that need some
particular attribute (or styling) are actually those text portions identified
by markup. A paragraph may be in French, a citation may need a bidi
embedding, a keyword may be in italics, a list number may be circled, and so
on. This makes it very efficient to associate those attributes with
markup.</p>

<p align="left">However, where local or point-like functionality is needed,
markup is <i>not</i> very efficient and its main benefit, easy manipulation
of scope, is not required. On the contrary, the intrusion of markup in the
middle of words can make search or sort operations more difficult. For these
cases expressing the information as character codes is not only a viable, but
often the preferred alternative, which needs to be considered in the design
of markup languages.</p>
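
<p align="left">The following sketch illustrates the point-like case with a
hyphenation opportunity inside a word. The first paragraph uses the character
U+00AD SOFT HYPHEN; the second uses an empty element, 'ex:softhyphen', which
is hypothetical and introduced here only for comparison. With the character,
the word remains a single run of text; with the element, it is split across
separate text nodes, which a search or sort operation would have to
reassemble.</p>

<pre>&lt;!-- Character code: the word stays in one text node. --&gt;
&lt;xhtml:p&gt;hyphen&amp;#xAD;ation&lt;/xhtml:p&gt;

&lt;!-- Hypothetical markup equivalent: the word is split around an element. --&gt;
&lt;xhtml:p&gt;hyphen&lt;ex:softhyphen/&gt;ation&lt;/xhtml:p&gt;</pre>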

<h3 align="left"><a name="Extensibility">2.5 Extensibility of Markup</a></h3>

<p align="left">Character encoding works with a range of integers used as
character codes. This is extremely efficient, but has some limitations.
Markup, on the other hand, is much more extensible. Using technologies such
as XML Namespaces [<a href="#Namespace">Namespace</a>] and their application
in schema languages like [<a href="#XMLSchema">XML Schema</a>], various
vocabularies can be mixed.</p>

<h3><a name="Suitability">2.6 Suitability of Characters in Markup</a></h3>

<p>The suitability of a particular character for markup depends on its status
in the Unicode Standard, the nature of its behavior in text and the
availability of equivalent markup. Many format characters that are needed for
advanced plain text are not suitable for use with markup. <a
href="#Suitable">Section 3</a> gives a list and detailed descriptions.
However, not all format characters are unsuitable for use with markup. <a
href="#Format">Section 4</a> provides a list of format characters that are
suitable for use with markup and gives some discussion about their use. In
addition to format characters, the Unicode Standard also has compatibility
characters, some of which may be replaceable by suitable markup. These
characters are discussed in <a href="#Compatibility">Section 5</a>.</p>

<h2><a name="Suitable">3. Characters not Suitable for Use With Markup</a></h2>

<p>There are characters which are unsuitable in the context of markup in
XML/HTML and whose use is discouraged, because one or more of the following
conditions apply:</p>
<ul>
  <li>They are deprecated in the Unicode Standard.</li>
  <li>They are unsupportable without additional data.</li>
  <li>They are difficult to handle because they are stateful.</li>
  <li>They are better handled by markup.</li>
  <li>They are undesirable because of conflict with equivalent markup.</li>
</ul>

<p><a href="#Charlist">Section 3.1</a> provides a list of such characters.
Sections <a href="#Line">3.2</a> through <a href="#OtherDeprecated">3.10</a>
discuss in more detail the following points for the discouraged
characters.</p>
<ul>
  <li>Short description of semantics</li>
  <li>Reason for inclusion in Unicode</li>
  <li>Specific problems when used with markup</li>
  <li>Other areas where problems may occur (<i>e.g.</i> plain text)</li>
  <li>What kind of markup to use instead</li>
  <li>What to do if detected in a particular context</li>
</ul>

<h3><a name="Charlist">3.1 Table of Characters not Suitable for use With
Markup</a></h3>

<p>The following table contains the characters currently considered not
suitable for use with markup in XML or HTML. (See however the <a
href="#Note">note</a> in the <a href="#Introduction">Introduction</a>.) They
may also be unsuitable for other markup or page layout languages. For
determining possible conflict this report uses the markup available in
HTML.</p>

<p align="center"><b>Table 3.1 Characters not suitable for use with
markup</b></p>

<table border="1" cellpadding="2" cellspacing="0" width="95%">
  <tbody>
    <tr>
      <th align="left" bgcolor="#ccffcc" width="210"><p
        align="left">Codepoints</p>
      </th>
      <th align="left" bgcolor="#ccffcc" width="273"><p
        align="left">Names/Description</p>
      </th>
      <th align="left" bgcolor="#ccffcc" width="341"><p align="left">Short
        Comment</p>
      </th>
    </tr>
    <tr>
      <td width="210">U+0340..U+0341</td>
      <td width="273">Clones of grave and acute</td>
      <td width="341">Deprecated in Unicode</td>
    </tr>
    <tr>
      <td width="210">U+17A3, U+17D3</td>
      <td width="273">Obsolete characters for Khmer</td>
      <td width="341">Deprecated in Unicode</td>
    </tr>
    <tr>
      <td width="210">U+2028..U+2029</td>
      <td width="273">Line and paragraph separator</td>
      <td width="341">use &lt;xhtml:br /&gt;,
        &lt;xhtml:p&gt;&lt;/xhtml:p&gt;, or equivalent</td>
    </tr>
    <tr>
      <td width="210">U+202A..U+202E</td>
      <td width="273">BIDI embedding controls <br>
        (LRE, RLE, LRO, RLO, PDF)</td>
      <td width="341">Strongly discouraged in [<a
        href="#html4.01">HTML4.01</a>]</td>
    </tr>
    <tr>
      <td width="210">U+206A..U+206B</td>
      <td width="273">Activate/Inhibit Symmetric swapping</td>
      <td width="341">Deprecated in Unicode</td>
    </tr>
    <tr>
      <td width="210">U+206C..U+206D</td>
      <td width="273">Activate/Inhibit Arabic form shaping</td>
      <td width="341">Deprecated in Unicode</td>
    </tr>
    <tr>
      <td width="210">U+206E..U+206F</td>
      <td width="273">Activate/Inhibit National digit shapes</td>
      <td width="341">Deprecated in Unicode</td>
    </tr>
    <tr>
      <td width="210">U+FFF9..U+FFFB</td>
      <td width="273">Interlinear annotation characters</td>
      <td width="341">Use ruby markup [<a href="#Ruby">Ruby</a>]</td>
    </tr>
    <tr>
      <td rowspan="2" width="210">U+FEFF</td>
      <td width="273">as ZWNBSP</td>
      <td width="341">Use U+2060 Word Joiner instead</td>
    </tr>
    <tr>
      <td width="273">as Byte Order Mark</td>
      <td width="341">Use only at the start of a file, not as part of
      markup</td>
    </tr>
    <tr>
      <td width="210">U+FFFC</td>
      <td width="273">Object replacement character</td>
      <td width="341">Use markup, e.g. HTML &lt;object&gt; or HTML
      &lt;img&gt;</td>
    </tr>
    <tr>
      <td width="210">U+1D173..U+1D17A</td>
      <td width="273">Scoping for Musical Notation</td>
      <td width="341">Use an appropriate markup language</td>
    </tr>
    <tr>
      <td width="210">U+E0000..U+E007F</td>
      <td width="273">Language Tag code points </td>
      <td width="341">Use xhtml:lang or xml:lang</td>
    </tr>
  </tbody>
</table>

<p>Except for Line and Paragraph Separator, or the Byte Order Mark, it is
acceptable for browsers and similar user agents to ignore the presence of
discouraged characters in HTML or XML. It is up to authoring tools to ensure
proper conversion between these characters and equivalent markup where it
exists.</p>
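<p>As an illustration of how an authoring tool might locate the characters in
Table 3.1, the following Python sketch scans a string for the listed code point
ranges. The function name <code>find_discouraged</code> and the short labels
are hypothetical and not part of this report; a leading U+FEFF used as a byte
order mark would also be reported, so a caller is assumed to handle the start
of the file separately.</p>

```python
# Ranges copied from Table 3.1; the labels are shortened for illustration only.
DISCOURAGED_RANGES = [
    (0x0340, 0x0341, "deprecated in Unicode"),
    (0x17A3, 0x17A3, "deprecated in Unicode"),
    (0x17D3, 0x17D3, "deprecated in Unicode"),
    (0x2028, 0x2029, "line or paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated in Unicode"),
    (0xFFF9, 0xFFFB, "interlinear annotation character"),
    (0xFEFF, 0xFEFF, "ZWNBSP or byte order mark"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical scoping control"),
    (0xE0000, 0xE007F, "language tag code point"),
]


def find_discouraged(text):
    """Return (index, 'U+XXXX', label) for each discouraged character in text."""
    hits = []
    for index, char in enumerate(text):
        code = ord(char)
        for first, last, label in DISCOURAGED_RANGES:
            if first <= code <= last:
                hits.append((index, f"U+{code:04X}", label))
                break
    return hits
```

<p>An editor could use the returned positions to offer a conversion to markup
or to issue a warning, as described in the subsections that follow.</p>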

<h3><a name="Line">3.2 Line and Paragraph Separator, U+2028..U+2029</a></h3>

<p><em>Short description</em>: The line and paragraph separator provide
unambiguous means to denote hard line breaks and paragraph delimiters in
plain text.</p>

<p><em>Reason for inclusion</em>: These characters were introduced into the
Unicode Standard to overcome the ambiguous and widely divergent use of
control codes for this purpose. See <i>Section
5.8, Newline Guidelines,</i> in [<a href="#Unicode">Unicode</a>].</p>

<p><em>Problems when used in markup</em>: Including these characters in
markup text does not work where it would duplicate the existing markup
commands for delimiting paragraphs and lines.</p>

<p><em>Problems with other uses</em>: The separator characters can also be
problematic when used in plain text, because legacy data is usually converted
code point for code point into Unicode, so all receivers of Unicode plain
text must in effect be able to interpret the existing use of control
codes for this purpose. As a result, fewer Unicode implementations support
these characters than would otherwise be the case.</p>

<p><em>Replacement markup</em>: In HTML, use &lt;xhtml:br /&gt; instead of
U+2028 and surround paragraphs by &lt;xhtml:p&gt; and &lt;/xhtml:p&gt;
instead of separating them with U+2029.</p>

<p><em>What to do if detected</em>: In a browser context, treat as white
space, or ignore. When received in an editing context, replace the character
by the corresponding markup. </p>
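<p>The replacement described above can be sketched in Python as follows. This
is a minimal illustration for an editing context, not a definitive
implementation: the function name <code>separators_to_markup</code> is
hypothetical, the input is assumed to be plain text, and unprefixed
<code>p</code> and <code>br</code> elements are emitted instead of the
namespace-prefixed forms.</p>

```python
from html import escape

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def separators_to_markup(text):
    """Wrap U+2029-delimited paragraphs in <p> and turn U+2028 into <br />."""
    rendered = []
    for paragraph in text.split(PARAGRAPH_SEPARATOR):
        if not paragraph:
            continue  # skip empty paragraphs, e.g. after a trailing separator
        # Escape markup-significant characters in the plain text first.
        lines = [escape(line, quote=False)
                 for line in paragraph.split(LINE_SEPARATOR)]
        rendered.append("<p>" + "<br />".join(lines) + "</p>")
    return "".join(rendered)
```

<p>Because U+2029 separates paragraphs while the markup surrounds them, the
sketch splits on the separator and wraps each non-empty piece.</p>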

<h3><a name="Bidi">3.3 Bidi Embedding Controls (LRE, RLE, LRO, RLO, PDF),
U+202A..U+202E</a></h3>

<p><em>Short description</em>: The bidi embedding controls are required to
supplement the Unicode Bidirectional Algorithm in plain text.</p>

<p><em>Reason for inclusion</em>: The Unicode Bidirectional algorithm
unambiguously resolves the display direction for bidirectional text. It does
so by assigning all characters directional categories and then resolving
these in context. In a small number of circumstances this <i>implicit</i>
method does not produce satisfactory results and embedding controls are
needed to ensure that sender and receiver agree on the display direction for
a given text. See Unicode Standard Annex #9, The Bidirectional Algorithm <a
href="#UTR9">[UAX 9]</a>.</p>

<p><em>Problems when used in markup</em>: These characters duplicate
available markup, which is better suited to handle the stateful nature of
their effect. </p>

<p><em>Problems with other uses</em>: The embedding controls introduce a
state into the plain text, which must be maintained when editing or
displaying the text. Processes that are modifying the text without being
aware of this state may inadvertently affect the rendering of large portions
of the text, for example by removing a PDF.</p>

<p><em>Replacement markup</em>: The following table gives the replacement
markup:<br>
</p>

<blockquote>

  <table border="1" cellspacing="0">
    <tbody>
      <tr>
        <td bgcolor="#ccffcc" width="15"><b>Unicode</b></td>
        <td bgcolor="#ccffcc" width="30%"><b>Equivalent markup</b></td>
        <td bgcolor="#ccffcc" width="55%"><b>Comment</b></td>
      </tr>
      <tr>
        <td width="15"><p>RLO</p>
        </td>
        <td width="30%">&lt;xhtml:bdo dir = "rtl"&gt;</td>
        <td width="55%"> </td>
      </tr>
      <tr>
        <td width="15"><p>LRO</p>
        </td>
        <td width="30%">&lt;xhtml:bdo dir = "ltr"&gt;</td>
        <td width="55%"> </td>
      </tr>
      <tr>
        <td width="15">PDF</td>
        <td width="30%">&lt;/xhtml:bdo&gt;</td>
        <td width="55%">when used to terminate RLO or LRO only, otherwise
          ignore</td>
      </tr>
      <tr>
        <td width="15">RLE</td>
        <td width="30%">dir = "rtl"</td>
        <td width="55%">attribute on block or inline element</td>
      </tr>
      <tr>
        <td width="15">LRE</td>
        <td width="30%">dir = "ltr"</td>
        <td width="55%">attribute on block or inline element</td>
      </tr>
    </tbody>
  </table>
</blockquote>

<p>For details on bidi markup, please see Section 8.2 of HTML [<a
href="#HTML4.0-8.2">HTML 4.0-8.2</a>]. The text of HTML 4.0 gives this
recommendation: </p>

<blockquote>
  <p><em><strong>Using HTML directionality markup with Unicode
  characters.</strong> Authors and designers of authoring software should be
  aware that conflicts can arise if the <a
  href="http://www.w3.org/TR/html401/struct/dirlang.html#adef-dir"
  class="noxref"><samp class="ainst">dir</samp></a> attribute is used on
  inline elements (including <a
  href="http://www.w3.org/TR/html401/struct/dirlang.html#edef-BDO"
  class="noxref"><samp class="einst">BDO</samp></a>) concurrently with the
  corresponding <a rel="biblioentry" href="#Unicode"
  class="normref">[UNICODE]</a> formatting characters. Preferably one or the
  other should be used exclusively. The markup method offers a better
  guarantee of document structural integrity and alleviates some problems
  when editing bidirectional HTML text with a simple text editor, but some
  software may be more apt at using the <a rel="biblioentry" href="#Unicode"
  class="normref">[UNICODE]</a> characters. If both methods are used, great
  care should be exercised to insure proper nesting of markup and directional
  embedding or override, otherwise, rendering results are undefined.</em></p>
</blockquote>

<p>This document goes beyond HTML and recommends that <i>only</i> the markup
should be used.</p>

<blockquote>
  <p><b>Note:</b> The interpretation of how to handle directionality markup
  for block level elements differs in different versions of [<a
  href="#CSS">CSS</a>].</p>
</blockquote>

<p><em>What to do if detected</em>: In a browser context, ignore. When
received in an editing context, replace the characters by the appropriate
markup. </p>
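<p>The conversion in an editing context can be sketched in Python with a stack
of open elements. This is an illustration under stated assumptions rather than
a definitive implementation: the function name
<code>bidi_controls_to_markup</code> is hypothetical, the input is assumed to
be the plain text of a single paragraph, and RLE and LRE are mapped to a
<code>span</code> element carrying the <code>dir</code> attribute, which is
only one way of placing that attribute on an inline element. In this sketch a
PDF therefore closes whichever element was opened last, and an unmatched PDF
is dropped.</p>

```python
from html import escape

# Embedding and override controls mapped to (start tag, end tag).
OPENERS = {
    "\u202A": ('<span dir="ltr">', "</span>"),  # LRE
    "\u202B": ('<span dir="rtl">', "</span>"),  # RLE
    "\u202D": ('<bdo dir="ltr">', "</bdo>"),    # LRO
    "\u202E": ('<bdo dir="rtl">', "</bdo>"),    # RLO
}
PDF = "\u202C"


def bidi_controls_to_markup(text):
    """Replace bidi embedding controls in one paragraph by nested markup."""
    out = []
    pending_end_tags = []  # end tags of elements that are still open
    for char in text:
        if char in OPENERS:
            start_tag, end_tag = OPENERS[char]
            out.append(start_tag)
            pending_end_tags.append(end_tag)
        elif char == PDF:
            if pending_end_tags:
                out.append(pending_end_tags.pop())
            # An unmatched PDF has no effect and is dropped.
        else:
            out.append(escape(char, quote=False))
    # Close anything still open at the end of the paragraph.
    while pending_end_tags:
        out.append(pending_end_tags.pop())
    return "".join(out)
```

<p>The stack makes the stateful nature of the controls explicit: every opening
control is paired with an end tag, so later edits cannot leave a dangling
embedding level.</p>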

<h3><a name="Deprecated">3.4 Deprecated Formatting Characters,
U+206A..U+206F</a></h3>

<p><em>Short description</em>: These characters are deprecated. They were
originally intended to allow explicit activation of contextual shaping,
numeric digit rendering and symmetric swapping.</p>

<p><em>Reason for inclusion</em>: These characters were retained from draft
versions of ISO 10646.</p>

<p><em>Problems when used in markup</em>: The processing model for these
characters is not supported in markup.</p>

<p><em>Problems with other uses</em>: The Unicode Standard requires that
symmetric swapping, contextual shaping, and alternate digit shapes are
enabled by default and no longer supports inhibiting any of them by use of
these character codes. The most likely effect of their occurrence in
generated text would be that of a 'garbage' character.</p>

<p><em>Conversion for use with markup</em>: Apply the appropriate conversion
to bring the data stream in line with the Unicode text model for
bidirectional text and cursively-connected scripts.</p>

<p><em>What to do if detected</em>: When received by a browser as part of
marked up text, they may be ignored. When received in an editing context,
they may be removed, possibly with a warning. Alternatively, an appropriate
conversion from the legacy text model may be provided. This will most likely
be limited to applications directly interfacing with and knowledgeable of the
particular legacy implementation that inspired these characters.</p>

<h3><a name="BOM">3.5 Byte Order Mark, ZWNBSP, U+FEFF</a></h3>

<p><em>Short description</em>: U+FEFF has two functions. It is formally known
as <span style="font-variant: small-caps;">zero width no-break space</span>
(ZWNBSP), and can act as a word joiner, but its primary use is as <i>byte
order mark (BOM)</i>, to indicate in a file signature at the start of a file
that a file is in a particular Unicode encoding form and of a particular byte
order. Using U+FEFF as a word joiner in new data is deprecated  as of [<a
href="#Unicode32">Unicode3.2</a>] in favor of U+2060 <span
style="font-variant: small-caps;">word joiner</span> (WJ). The use as byte
order mark remains unaffected.</p>

<p><em>Reason for inclusion</em>: Originally included in Unicode for the sole
purpose of indicating byte order or use in file signatures, the character
acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and
Unicode. When used as a byte order mark the character is placed at the
beginning of a file. If a recipient views it as FEFF then the byte order
of sender and receiver matches. If the recipient views it as FFFE (a
non-character code point) then the sender used opposite byte order from the
recipient, and the recipient needs to invert the byte order or refuse to read
the file. When used as a ZWNBSP the character is intended to prevent breaks
between adjacent characters. This function is now provided by U+2060 <span
style="font-variant: small-caps;">word joiner</span> (WJ) making it
unnecessary to insert U+FEFF in the middle of a file. For more information
see Chapter 16 of [<a href="#Unicode">Unicode</a>].</p>
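<p>The byte order check described above can be illustrated with a short Python
sketch that inspects the first two bytes of a UTF-16 data stream. The function
name <code>detect_utf16_byte_order</code> is hypothetical, and the sketch
assumes the data is known to be UTF-16; other encoding forms, whose signatures
may begin with the same bytes, are not considered.</p>

```python
def detect_utf16_byte_order(data):
    """Return 'big', 'little', or None based on a leading U+FEFF signature."""
    signature = data[:2]
    if signature == b"\xFE\xFF":
        return "big"     # bytes read in big-endian order give U+FEFF
    if signature == b"\xFF\xFE":
        return "little"  # a big-endian reader would see FFFE and must swap
    return None          # no signature present
```

<p>A recipient that gets <code>None</code> has to rely on other information,
such as a declared encoding, to determine the byte order.</p>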

<p><em>Problems when used in markup</em>: Using U+FEFF as ZWNBSP makes it
impossible to distinguish it from the case where a byte order mark was left
in the middle of a file inadvertently due to incorrect splicing. U+FEFF can
and in some cases (XML encoded in UTF-16) must be used at the start of a file
containing markup, but as a signature, this is not part of actual markup or
marked-up content. Some older versions of browsers and parsers may not
correctly recognize U+FEFF at the start of a file encoded in UTF-8. For
details of how U+FEFF participates in encoding detection of XML files, see
Appendix F of <a href="#xml10">[XML 1.0]</a>. </p>

<p><em>Problems with other uses</em>: The use of byte order mark as ZWNBSP is
also problematic when used in plain text, and has been deprecated for that
purpose in favor of U+2060 <span style="font-variant: small-caps;">word
joiner</span>. The use of U+FEFF in file signatures to indicate byte order is
the only recommended use of this character.</p>

<p><em>Replacement markup</em>: None. In locations other than the beginning
of a text file, U+FEFF can be removed or replaced by U+2060 in an editing
environment.</p>

<p><em>What to do if detected</em>:  When received by a browser as part of
marked-up text, treat depending on location. At the start of an external
entity, treat as byte order mark (i.e. as part of the character encoding, not
as part of the parsed character stream, see e.g. Section 4.3.3 of <a
href="#xml10">[XML 1.0]</a>). Otherwise, assume it is older data using it as
ZWNBSP. When receiving plain text in an editing environment, editors may take
one or more of several actions: replace ZWNBSP in the middle of a file with
WJ or issue a warning to the user.</p>
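<p>For the editing case, a minimal Python sketch might keep a leading U+FEFF
as a signature and replace any later occurrence with U+2060. The function name
<code>normalize_feff</code> is hypothetical; the returned count is there so
that the caller can decide whether to warn the user.</p>

```python
BOM = "\ufeff"
WORD_JOINER = "\u2060"


def normalize_feff(text):
    """Return (new_text, replaced); only non-initial U+FEFF is replaced."""
    if text.startswith(BOM):
        head, rest = BOM, text[1:]  # leading U+FEFF is treated as a signature
    else:
        head, rest = "", text
    replaced = rest.count(BOM)
    return head + rest.replace(BOM, WORD_JOINER), replaced
```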

<h3><a name="Interlinear">3.6 Interlinear Annotation Characters,
U+FFF9..U+FFFB</a></h3>

<p><em>Short description</em>: The interlinear annotation characters are used
to delimit interlinear annotations in certain circumstances. They are
intended to provide text anchors and delimiters for interlinear annotation
for in-process use and are not intended for interchange.</p>

<p><em>Reason for inclusion</em>: The interlinear annotation characters were
included in Unicode only in order to reserve code points for very frequent
application-internal use. The interlinear annotation characters are used to
delimit interlinear annotations in contexts where other delimiters are not
available, and where non-textual means exist to carry formatting information.
Many text-processing applications store the text and the associated markup
(or in some cases styling information) of a document in separate structures.
The actual text is kept in a single linear structure; additional information
is kept separately with pointers to the appropriate text positions. This is
called out-of-band information. The overall implementation makes sure that
these two structures are kept in sync. If the text contains interlinear
annotations, it is extremely helpful for implementations to have delimiters
in the text itself; even though delimiters are not otherwise used for style
markup. With this method, and unlike the case of the object replacement
character, all textual information can remain in the standard text stream,
but any additional formatting information is kept separately. In addition,
the Interlinear Annotation Anchor serves as a placeholder for formatting
information for the whole annotation object, the same way a paragraph mark
can be a placeholder to attach paragraph formatting information.</p>

<p><em>Problems when used in markup</em>: Including interlinear annotation
characters in marked-up text does not work because the additional formatting
information (how to position the annotation,...) is not available.</p>

<p><em>Problems with other uses</em>: The interlinear annotation characters
are also problematic when used in plain text, and are not intended for that
purpose. In particular, on older display systems that simply ignore or
replace the Interlinear Annotation Characters, the meaning of the text may be
changed.</p>

<p><em>Replacement markup</em>: The markup to be used in place of the
Interlinear Annotation Characters depends on the formatting and nature of the
interlinear annotation in question. For ruby, please see [<a
href="#Ruby">Ruby</a>].</p>

<p><em>What to do if detected</em>:  When received by a browser as part of
marked-up text, they may be ignored. When receiving plain text in an editing
environment, editors may take one or more of several actions: remove U+FFF9
together with everything from U+FFFA through the following U+FFFB;
ignore U+FFF9 and turn U+FFFA and U+FFFB into "[" and "]" respectively, or
into similar characters; issue a warning to the user; or tentatively convert
into appropriate ruby markup for further editing and formatting by the
user.</p>
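<p>Two of the editing actions listed above can be sketched in Python. This is
an illustration only: the function names are hypothetical, the input is
assumed to be plain text, and only the simple case of one annotation per
anchor, without nesting, is handled. The ruby output uses the simple
<code>ruby</code>, <code>rb</code> and <code>rt</code> elements as a tentative
conversion for further editing.</p>

```python
import re
from html import escape

ANCHOR, SEPARATOR, TERMINATOR = "\ufff9", "\ufffa", "\ufffb"

# anchor, base text, separator, annotation text, terminator (no nesting)
ANNOTATION = re.compile(
    ANCHOR + "([^\ufff9\ufffa\ufffb]*)"
    + SEPARATOR + "([^\ufff9\ufffa\ufffb]*)" + TERMINATOR
)


def annotations_to_ruby(text):
    """Tentatively convert simple interlinear annotations to ruby markup."""
    out = []
    position = 0
    for match in ANNOTATION.finditer(text):
        out.append(escape(text[position:match.start()], quote=False))
        base = escape(match.group(1), quote=False)
        annotation = escape(match.group(2), quote=False)
        out.append("<ruby><rb>" + base + "</rb><rt>" + annotation + "</rt></ruby>")
        position = match.end()
    out.append(escape(text[position:], quote=False))
    return "".join(out)


def annotations_to_brackets(text):
    """Plain-text fallback: drop U+FFF9, show the annotation in brackets."""
    return (text.replace(ANCHOR, "")
                .replace(SEPARATOR, "[")
                .replace(TERMINATOR, "]"))
```

<p>Annotation characters that do not form a complete anchor, separator,
terminator sequence are left in the output of
<code>annotations_to_ruby</code>, so an editor would still need to flag them
to the user.</p>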

<h3><a name="Object">3.7 Object Replacement Character, U+FFFC</a></h3>

<p><em>Short description</em>: The object replacement character is used to
stand in place of an object (e.g. an image) included in a text.</p>

<p><em>Reason for inclusion</em>: The object replacement character was
included in Unicode only in order to reserve a codepoint for a very frequent
application-internal use. Many text-processing applications store the text
and the associated markup (or in some cases styling information) of a
document in separate structures. The actual text is kept in a single linear
structure; additional information is kept separately with pointers to the
appropriate text positions. The overall implementation makes sure that these
two structures are kept in sync. If the text contains objects such as images,
it is extremely helpful for implementations to have a sentinel in the text
itself; any additional information is kept separately.</p>

<p><em>Problems when used in markup</em>: Including an object replacement
character in markup text does not work because the additional information
(what object to include,...) is not available.</p>

<p><em>Problems with other uses</em>: The object replacement character is
also problematic when used in plain text, because there is no way in plain
text to provide the actual object information or a reference to it.</p>

<p><em>Replacement markup</em>: The markup to be used in place of the Object
Replacement Character depends on the object in question and the markup
context it is used in. Typical cases are &lt;xhtml:img src='...' /&gt;,
&lt;xhtml:object ...&gt;, or &lt;xhtml:applet ...&gt;. These constructs allow
providing all additional information needed to identify and use the object in
question.</p>

<p><em>What to do if detected</em>: Browsers may ignore this character. When
received in an editing context, if the actual object is accessible, editors
may either replace the character by the appropriate markup for that object,
or otherwise remove it, ideally providing a warning.</p>

<h3><a name="Musical">3.8 Musical Controls</a>, U+1D173..U+1D17A</h3>

<p><em>Short description</em>: A series of characters for controlling scope
in musical notation.</p>

<p><em>Reason for inclusion</em>: These characters designate the start and
end of common musical constructs. Full musical layout depends on additional
information, for example pitch, that cannot be encoded using Unicode.
However, many musical symbols may be depicted in isolation (and without
assigning pitch) as part of a textual discussion of music. Plain text use of
Unicode characters is primarily intended for this latter purpose. The scoping
operators can be used to support limited renderings of beams, slurs, phrases,
etc. in this context. However, in the context of markup languages, musical
scoring calls for a dedicated markup language (analogous to MathML) which
would be expected to contain markup for these constructs.</p>

<p><em>Problems when used in markup</em>: These characters duplicate
information that can in principle be expressed in markup.</p>

<p><em>Problems with other uses</em>: Their special code range allows them to
be easily filtered, but applications that do not expect them will treat them
as garbage characters.</p>

<p><em>Replacement markup</em>: Replace with equivalent markup if
available.</p>

<p><em>What to do if detected</em>: Browsers may ignore these characters.
When received in an editing context, editors may remove or replace them by
equivalent markup.</p>

<h3><a name="Language">3.9 Language Tag Characters</a>, U+E0000..U+E007F</h3>

<p><em>Short description</em>: A series of characters for expressing language
tags, based on existing standards for language tags using the rules in
Chapter 16 of [<a href="#Unicode">Unicode</a>].</p>

<p><em>Reason for inclusion</em>: These characters allow in-band language
tagging in situations where full markup is not available, while allowing easy
filtering by applications that do not support them. They were solely included
for the benefit of those Internet protocols, such as ACAP, which require a
standard mechanism for marking language in UTF-8 strings, and at the same
time to avoid the use of other tagging schemes that relied on specific
details of the encoding form used.</p>

<p><em>Problems when used in markup</em>: These characters duplicate
information that can be expressed in markup.</p>

<p><em>Problems with other uses</em>: Their special code range allows them to
be easily filtered, but applications that do not expect them will treat them
as garbage characters.</p>

<p><em>Replacement markup</em>: Replace with equivalent language markup. XML
and XHTML have the xml:lang attribute. HTML has the lang attribute. These
attributes follow different scoping rules than the tag characters, therefore
this replacement will generally not be a simple 1:1 substitution.</p>

<p><em>What to do if detected</em>: Browsers may ignore these characters.
When received in an editing context, editors may remove or replace them by
equivalent markup.</p>
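<p>As a first step toward such a replacement, an editor has to separate the
tag characters from the text and recover the language identifiers they spell.
The Python sketch below illustrates this under stated assumptions: the
function name <code>extract_language_tags</code> is hypothetical, U+E0001 is
taken to introduce a language tag, U+E0020..U+E007E are taken to mirror the
corresponding ASCII characters, and U+E007F is taken to cancel the tag in
effect, which is reported here as an empty language. Because the scoping rules
differ, the sketch returns offsets into the cleaned text rather than attempting
to place <code>xml:lang</code> or <code>lang</code> attributes.</p>

```python
LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F


def extract_language_tags(text):
    """Return (clean_text, tags); tags is a list of (offset, language)."""
    clean = []
    tags = []       # (offset into clean text, list of ASCII characters)
    current = None  # characters of the tag being read, or None
    for char in text:
        code = ord(char)
        if 0xE0000 <= code <= 0xE007F:
            if code == LANGUAGE_TAG:
                current = []
                tags.append((len(clean), current))
            elif code == CANCEL_TAG:
                if current != []:  # not directly after LANGUAGE TAG
                    tags.append((len(clean), []))
                current = None
            elif 0xE0020 <= code <= 0xE007E and current is not None:
                current.append(chr(code - 0xE0000))
            # Any other tag code point is dropped.
            continue
        current = None  # ordinary text ends the tag being read
        clean.append(char)
    return "".join(clean), [(offset, "".join(chars)) for offset, chars in tags]
```

<p>The offsets mark where each language takes effect in the cleaned text; an
editor would then choose suitable elements to carry the attribute.</p>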

<h3><a name="OtherDeprecated">3.10 Other Characters Deprecated in
Unicode</a></h3>

<p><em>Short description</em>: The Unicode Character Database [<a
href="#UnicodeData">UnicodeData</a>] lists all characters that have been
deprecated in [<a href="#Unicode">Unicode</a>]. This list may grow (slowly)
over time. Deprecated characters remain valid characters forever, but their
use is strongly discouraged. Deprecation of characters is applied only in
exceptional circumstances. It is never the result of historical changes of a
writing system: characters no longer in current, modern use are retained in
Unicode, as they are needed for the representation of historical
documents.</p>

<p><em>Reason for inclusion</em>: Usually, characters that are deprecated
were never needed, but were inadvertently added to the Unicode Standard,
perhaps based on incomplete information available at the time of encoding.</p>

<p><em>Problems when used in markup</em>: Except where noted elsewhere in
this document, their presence in markup presents the same problems as in
plain text, usually that of an unnecessary duplicate encoding.</p>

<p><em>Problems with other uses</em>: Depends on the character and the reason
for its deprecation. For more information see [<a
href="#Unicode">Unicode</a>].</p>

<p><em>Conversion for use with markup</em>: For deprecated characters not
discussed elsewhere in this document, see the relevant descriptions of those
characters in [<a href="#Unicode">Unicode</a>] for information on the
recommended alternatives.</p>

<p><em>What to do if detected</em>:  Unless a specific recommendation is
given elsewhere, deprecated characters are not ignored; where possible, in an
editing environment, a preferred alternate encoding may be substituted.</p>

<h2><a name="Format">4. Format Characters Suitable for Use with
Markup</a></h2>

<p>The following table contains format characters that do not exhibit the
problems discussed at the start of <a href="#Suitable">Section 3</a>. Despite
their apparent relation to or similarity with characters in table <a
href="#Charlist">3.1</a>, they are considered suitable for use with markup.
It is not acceptable for user agents to ignore the characters in table 4.1.
For a description of these characters see [<a
href="#Unicode">Unicode</a>].</p>

<p align="center"><b>Table 4.1: Some characters that affect text format but
are suitable for use with markup</b></p>

<table border="1" cellpadding="2" cellspacing="0" width="95%">
  <tbody>
    <tr>
      <th align="left" bgcolor="#ccffcc" width="198"><p align="left">Code
        points</p>
      </th>
      <th align="left" bgcolor="#ccffcc" width="362"><p
        align="left">Names/Description</p>
      </th>
      <th align="left" bgcolor="#ccffcc" width="280"><p align="left">Short
        Comment</p>
      </th>
    </tr>
    <tr>
      <td width="198">U+00A0</td>
      <td width="362">No-break Space</td>
      <td width="280">Line break control</td>
    </tr>
    <tr>
      <td width="198">U+00AD</td>
      <td width="362">Soft Hyphen</td>
      <td width="280">Line break control</td>
    </tr>
    <tr>
      <td width="198">U+034F</td>
      <td width="362">Combining Grapheme Joiner</td>
      <td width="280">Used in sorting</td>
    </tr>
    <tr>
      <td width="198">U+0600</td>
      <td width="362">Arabic Number Sign</td>
      <td width="280">Subtending mark</td>
    </tr>
    <tr>
      <td width="198">U+0601</td>
      <td width="362">Arabic Sign Sanah</td>
      <td width="280">Subtending mark</td>
    </tr>
    <tr>
      <td width="198">U+0602</td>
      <td width="362">Arabic Footnote Marker</td>
      <td width="280">Subtending mark</td>
    </tr>
    <tr>
      <td width="198">U+0603</td>
      <td width="362">Arabic Sign Safha</td>
      <td width="280">Subtending mark</td>
    </tr>
    <tr>
      <td width="198">U+06DD</td>
      <td width="362">Arabic End of Ayah</td>
      <td width="280">Enclosing mark</td>
    </tr>
    <tr>
      <td width="198">U+070F</td>
      <td width="362">Syriac Abbreviation Mark (SAM)</td>
      <td width="280">Supertending mark</td>
    </tr>
    <tr>
      <td width="198">U+0F0C</td>
      <td width="362">Tibetan Mark Delimiter Tsheg Bstar</td>
      <td width="280">Non-breaking form of 0F0B</td>
    </tr>
    <tr>
      <td width="198">U+115F..U+1160</td>
      <td width="362">Hangul Jamo Fillers</td>
      <td width="280">Filler</td>
    </tr>
    <tr>
      <td width="198">U+180B..U+180E</td>
      <td width="362">Mongolian Variation Selectors (FVS1..FVS3), Mongolian
        Vowel Separator</td>
      <td width="280">Required for Mongolian</td>
    </tr>
    <tr>
      <td width="198">U+200B</td>
      <td width="362">Zero-width Space</td>
      <td width="280">Line break control</td>
    </tr>
    <tr>
      <td width="198">U+200C..U+200D</td>
      <td width="362">Zero-width Join Controls (ZWJ and ZWNJ)</td>
      <td width="280">Required for, among others, Persian and many Indic scripts</td>
    </tr>
    <tr>
      <td width="198">U+200E..U+200F</td>
      <td width="362">Implicit Directional Marks (LRM and RLM)</td>
      <td width="280">LRM and RLM are allowed</td>
    </tr>
    <tr>
      <td width="198">U+2011</td>
      <td width="362">Non-breaking Hyphen</td>
      <td width="280">Line break control</td>
    </tr>
    <tr>
      <td width="198">U+202F</td>
      <td width="362">Narrow No-break Space</td>
      <td width="280">Line break control/Mongolian</td>
    </tr>
    <tr>
      <td width="198">U+2044</td>
      <td width="362">Fraction Slash</td>
      <td width="280">Or use markup (MathML)</td>
    </tr>
    <tr>
      <td width="198">U+2060</td>
      <td width="362">Word Joiner</td>
      <td width="280">Use for that purpose instead of U+FEFF ZWNBSP</td>
    </tr>
    <tr>
      <td width="198">U+2061..U+2064</td>
      <td width="362">Invisible Mathematical Operators</td>
      <td width="280">Mathematical use</td>
    </tr>
    <tr>
      <td width="198">U+2FF0..U+2FFB</td>
      <td width="362">Ideographic Description Characters</td>
      <td width="280">Graphic characters (not controls)</td>
    </tr>
    <tr>
      <td width="198">U+303E</td>
      <td width="362">Ideographic Variation Indicator</td>
      <td width="280">Graphic character (not a control)</td>
    </tr>
    <tr>
      <td width="198">U+FFA0</td>
      <td width="362">Halfwidth Hangul Filler</td>
      <td width="280">Filler, not generally required</td>
    </tr>
    <tr>
      <td width="198">U+FE00..U+FE0F</td>
      <td width="362">Variation Selectors</td>
      <td width="280">Modify graphic characters</td>
    </tr>
    <tr>
      <td width="198">U+E0100..U+E01EF</td>
      <td width="362">Variation Selectors</td>
      <td width="280">Modify graphic characters</td>
    </tr>
  </tbody>
</table>

<p>The following subsections briefly discuss some of the characters from the
above list, particularly those that affect more than their immediately
adjacent neighbors. Please see the Unicode Standard [<a
href="#Unicode">Unicode</a>] for full details.</p>

<h3><a name="Subtending">4.1 Subtending Marks</a></h3>

<p>Subtending marks are needed to represent a common feature in the Arabic
and Syriac scripts where a mark can be placed below a range of characters,
for example below a sequence of digits, to indicate a year. The Syriac
abbreviation mark is placed above a series of characters, making it
technically a supertending mark, and the <span
style="font-variant: small-caps;">ARABIC END OF AYAH</span> is an enclosing
mark. In the character stream, a subtending mark precedes the affected
characters. The end of the affected range of characters is defined implicitly,
usually by the first non-alphanumeric character. </p>

<p align="left">Unlike subtending marks, the scope of combining enclosing
marks, such as <span
style="text-transform: uppercase; font-variant: small-caps;">combining
enclosing circle,</span> is limited to the preceding default grapheme
cluster. For details on grapheme clusters see Unicode Standard Annex #29:
"Text Boundaries" [<a href="#UAX29">UAX 29</a>].</p>

<p align="left">There is currently no existing markup that can represent the
scoping and layout functions defined by these characters, so they cannot be
substituted. It is unresolved to what degree intervening markup affects the
scope of these marks.</p>

<h3 align="left"><a name="Fraction">4.2 Fraction Slash</a></h3>

<p align="left">The fraction slash is used between sequences of decimal
digits to form fractions. Whether the resulting fraction has a horizontal or
diagonal fraction line is unspecified. The fallback is to leave the digits
unchanged and display a regular slash. In order to separate a digit from a
following fraction, as in 1¾, the use of <span
style="font-variant: small-caps;">U+2009 THIN SPACE</span> is recommended.</p>

<p align="left">For better control of fractions the use of [<a
href="#MathML">MathML</a>] is suggested where appropriate.</p>
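
<p align="left">The thin-space recommendation above can be sketched in a few
lines of Python. This is only an illustration: the function name
format_fraction and its parameters are hypothetical, and whether the result
is drawn as a built-up fraction depends on the display engine.</p>

```python
FRACTION_SLASH = "\u2044"  # U+2044 FRACTION SLASH
THIN_SPACE = "\u2009"      # U+2009 THIN SPACE


def format_fraction(numerator, denominator, whole=None):
    """Return a plain-text fraction built with U+2044 (hypothetical helper)."""
    fraction = f"{numerator}{FRACTION_SLASH}{denominator}"
    if whole is None:
        return fraction
    # Without a separator, the digit 1 followed by 3/4 would read as 13/4.
    return f"{whole}{THIN_SPACE}{fraction}"
```

<p align="left">Under these assumptions, format_fraction(3, 4, whole=1)
yields the digit 1, a thin space, and then the digits 3 and 4 joined by the
fraction slash.</p>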

<h3><a name="Variation">4.3 Variation Selectors</a></h3>

<p>A variation selector is intended to cause a specific variant form (or
range of variant forms) when applied to a base character. For a variation
selector to have an effect it must immediately follow its base character.
Only pre-determined combinations of selected base characters and specific
variation selectors have a defined effect. All other combinations are
ill-formed and are to be ignored. The list of standardized combinations is
documented in the Unicode Character Database, see [<a
href="#Variants">Variants</a>]. In addition to the 256 generic variation
selectors, there are 3 Mongolian <i>free variation selectors</i>. They
function in all other ways like variation selectors, except they only apply
to base characters from the Mongolian script. Since Mongolian, like Arabic,
has positional character shapes, the variations are limited to particular
shaping contexts.</p>
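
<p>As a rough sketch of the rule that a variation selector must immediately
follow its base character, the following Python groups each selector with
the character before it. The function names are hypothetical, the Mongolian
range assumes the three free variation selectors mentioned above, and the
sketch does not check whether a combination is one of the standardized
ones.</p>

```python
def is_variation_selector(ch):
    """True for the 256 generic selectors and the 3 Mongolian ones."""
    cp = ord(ch)
    return (
        cp in range(0xFE00, 0xFE10)        # VS1..VS16
        or cp in range(0xE0100, 0xE01F0)   # VS17..VS256
        or cp in range(0x180B, 0x180E)     # Mongolian FVS1..FVS3
    )


def split_variation_sequences(text):
    """Group each selector with the character it immediately follows."""
    units = []
    for ch in text:
        follows_base = bool(units) and not is_variation_selector(units[-1][-1])
        if is_variation_selector(ch) and follows_base:
            units[-1] += ch
        else:
            # Either an ordinary character or a selector without a base.
            units.append(ch)
    return units
```

<p>A selector that appears at the start of the text, or directly after
another selector, is left as a unit of its own so that a caller can decide
to ignore it.</p>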

<h3><a name="Ideographic">4.4 Ideographic Description Characters</a></h3>

<p>Ideographic Description Characters are included in the Unicode Standard as
a means to indicate the composition of ideographs from a combination of
pieces (terms), where each piece or term is either a Unicode character or
itself a composed sequence. Ordinarily the result would be a human readable
description of a
character, perhaps one for which a font is not available. However, at least
some vendors are interested in automatic conversion of these sequences into
single ideographs.</p>

<h3><a name="Invisible">4.5 Invisible Mathematical Operators</a></h3>

<p>These characters are needed to convey the intended meaning of a
mathematical expression to an automated parser whenever two elements are
simply written next to each other. See Unicode Technical Report #25: "Unicode
Support for Mathematics" [<a href="#UTR25">UTR25</a>] for more details.</p>

<h3><a name="LineBreak">4.6 Line Break Controls</a></h3>

<p>Most of these characters prevent line breaks adjacent to them, but ZWSP
and SHY provide invisible line break opportunities. The detailed function of
these characters is described in Unicode Standard Annex #14: "Line Breaking
Properties" [<a href="#UAX14">UAX14</a>]. While high-end applications may be
able to deduce line breaking opportunities automatically solely with the help
of very generic markup or styling properties, the use of these characters
currently provides the most reliable and straightforward way to control line
breaking and hyphenation. Note that [<a href="#html4.01">HTML4.01</a>] uses
U+00A0 NO-BREAK SPACE also as a "hard space" (i.e. a space with a fixed
width), something that is not part of its character semantics in [<a
href="#Unicode">Unicode</a>].</p>

<p>U+2011 NON-BREAKING HYPHEN (NBHY) is used to encode a hyphen that does not
provide a line break opportunity. In several languages, the sequence &lt;SHY,
NBHY&gt; may be used to handle special line breaking behavior for explicit
hyphens; see [<a href="#UAX14">UAX14</a>].</p>
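
<p>A minimal Python sketch of using these characters when preparing content
follows. The helper names and the choice of "/" as the default separator are
assumptions made for illustration only; the actual break behavior is
determined by the line breaking algorithm of the display engine.</p>

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE: invisible break opportunity
NBHY = "\u2011"  # U+2011 NON-BREAKING HYPHEN: hyphen without a break


def allow_breaks_after(text, separators="/"):
    """Insert ZWSP after each separator so long strings can wrap there."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in separators:
            out.append(ZWSP)
    return "".join(out)


def keep_hyphens_together(text):
    """Replace U+002D HYPHEN-MINUS by NBHY to suppress breaks at hyphens."""
    return text.replace("-", NBHY)
```

<p>Both helpers change only the character content; no markup or styling is
involved.</p>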

<h3><a name="Fillers">4.7 Hangul Fillers</a></h3>

<p>These should not be needed except for texts that need to have a fixed
number of jamos per Korean syllable block. See the description of Korean
Syllable Blocks in [<a href="#Unicode">Unicode</a>].</p>

<h2><a name="Compatibility">5. Characters with Compatibility Mappings</a></h2>

<p>The Unicode Standard provides compatibility mappings for a number of
characters. Compatibility mappings indicate a relationship to another
character, but the exact nature of the relationship varies. In some cases the
relationship means "is based on"; in other cases it denotes a property.
When plain text is marked up, it may make sense to map some of these
characters to a combination of their compatibility equivalents <em
style="font-style: normal;">and</em> suitable markup. It is important to
understand the nature of the distinctions between characters and their
compatibility equivalents and the context in which these distinctions matter.
It is never advisable to apply compatibility mappings indiscriminately. This
section provides guidance on when and how to apply compatibility mappings in
the case of importing text from non-XML (non-marked-up) sources. The section
is organized by the "compatibility tag" associated with each compatibility
mapping.</p>

<h3><a name="Overview">5.1 Overview</a></h3>

<p>The following table gives an overview of the various compatibility
characters, organized by "compatibility tag". The first column, <i>Tag
value,</i> contains the value of the "compatibility tag" from the Unicode
Character Database [<a href="#UnicodeData">UnicodeData</a>]. Although these
tags use "&lt;" and "&gt;", they do not appear as such in markup and should
not be confused with XML tags. <em>Code range</em> indicates a further break
down by code points. <i>Action</i> summarizes the recommended action to be
taken whenever markup is first applied to non-XML text. Each entry indicates
whether the characters can be substituted using the compatibility equivalent
according to Normalization Form KC of [<a href="#UAX15">UAX 15</a>], can be
replaced by equivalent markup where available, or should be retained. For
some cases, instead of or in addition to markup, style information [<a
href="#CSS">CSS</a>] is needed. <i>Description and usage</i> provides
additional information. Sections <a href="#List">5.3</a> through <a
href="#Superscripts">5.6</a> provide additional information for some of these
sets of compatibility characters including detailed recommended actions.</p>

<p align="center"><b>Table 5.1 Characters with compatibility mappings</b></p>

<table border="1" cellpadding="2" cellspacing="0" width="95%">
  <tbody>
    <tr>
      <th align="left" bgcolor="#ccffcc" width="80">Tag value</th>
      <th align="left" bgcolor="#ccffcc" width="97">Code range</th>
      <th align="left" bgcolor="#ccffcc" width="83">Action</th>
      <th align="left" bgcolor="#ccffcc">Description and usage</th>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;circled&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Circled letters and digits used for list
        item markers, and in running text</td>
    </tr>
    <tr>
      <td rowspan="12" valign="top" width="80">&lt;compat&gt;</td>
      <td valign="top" width="97">2002..200A</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Fixed width spaces</td>
    </tr>
    <tr>
      <td valign="top" width="97">2100..2101</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Variant letter forms that are used as
        symbols</td>
    </tr>
    <tr>
      <td valign="top" width="97">2105..2106</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Variant letter forms that are used as
        symbols</td>
    </tr>
    <tr>
      <td valign="top" width="97">2121, 213B</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">For use as single code point in vertical
        layout</td>
    </tr>
    <tr>
      <td valign="top" width="97">2160..217F</td>
      <td valign="top" width="83">retain, or use list item marker style, or
        normalize</td>
      <td valign="top" width="572">For use as single code point in vertical
        layout, or as list item marker</td>
    </tr>
    <tr>
      <td valign="top" width="97">2474..249B</td>
      <td valign="top" width="83">retain, or use list item marker style, or
        normalize</td>
      <td valign="top" width="572">Parenthesized or dotted number used as
        list item marker</td>
    </tr>
    <tr>
      <td valign="top" width="97">249C..24B5</td>
      <td valign="top" width="83">retain, or use list item marker style, or
        normalize</td>
      <td valign="top" width="572">Parenthesized letters used as list item
        markers</td>
    </tr>
    <tr>
      <td valign="top" width="97">3131..318E</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Compatibility Hangul Jamo. These do not
        conjoin</td>
    </tr>
    <tr>
      <td valign="top" width="97">3200..3229</td>
      <td valign="top" width="83">retain, or use list item marker style, or
        normalize</td>
      <td valign="top" width="572">Parenthesized characters used as list item
        markers</td>
    </tr>
    <tr>
      <td height="26" valign="top" width="97">322A..3243</td>
      <td height="26" valign="top" width="83">retain</td>
      <td height="26" valign="top" width="572">Parenthesized characters used
        as symbols in vertical layout</td>
    </tr>
    <tr>
      <td valign="top" width="97">32C0..32CB</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">String used as single code point in
        vertical layout</td>
    </tr>
    <tr>
      <td valign="top">all other</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Maintain, semantic distinctions apply</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;final&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">normalize</td>
      <td valign="top" width="572">Arabic Presentation forms</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;font&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Variant letter forms that are used as
        symbols</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;fraction&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">normalize</td>
      <td valign="top" width="572">Single character fractions. Normalize only
        where fraction slash is supported</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;initial&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">normalize</td>
      <td valign="top" width="572">Arabic Presentation forms</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;isolated&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">normalize</td>
      <td valign="top" width="572">Arabic Presentation forms</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;medial&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">normalize</td>
      <td valign="top" width="572">Arabic Presentation forms</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;narrow&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Half-width characters</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;noBreak&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">The compatibility mapping merely indicates
        the equivalent breaking character. The noBreak distinction must be
        preserved</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;small&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Precise usage unknown. Maintain, but do
        not generate</td>
    </tr>
    <tr>
      <td rowspan="4" valign="top" width="80">&lt;square&gt;</td>
      <td valign="top" width="97">3300..3357</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Single display cell cluster containing
        multiple lines of kana for vertical layout</td>
    </tr>
    <tr>
      <td valign="top" width="97">3358..337D</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">For use as single code point in vertical
        layout</td>
    </tr>
    <tr>
      <td valign="top" width="97">33E0..33FE</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">For use as single code point in vertical
        layout</td>
    </tr>
    <tr>
      <td valign="top" width="97">all other</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Variant letter form used as symbol in
        vertical layout</td>
    </tr>
    <tr>
      <td rowspan="2" valign="top" width="80">&lt;sub&gt;</td>
      <td valign="top" width="97">2080..208E</td>
      <td valign="top" width="83">retain, or use markup</td>
      <td valign="top" width="572">Subscript digits 0-9, as well as minus,
        plus, equal and parens</td>
    </tr>
    <tr>
      <td valign="top" width="97">all other</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Subscript characters, usually used as
        modifier letters in phonetic notation</td>
    </tr>
    <tr>
      <td rowspan="5" valign="top" width="80">&lt;super&gt;</td>
      <td valign="top" width="97">00B2..00B3</td>
      <td rowspan="4" valign="top" width="83">retain, or use  markup</td>
      <td rowspan="4" valign="top" width="572">Superscript digits 0-9, as
        well as minus, plus, equal and parens</td>
    </tr>
    <tr>
      <td valign="top" width="97">00B9</td>
    </tr>
    <tr>
      <td valign="top" width="97">2070</td>
    </tr>
    <tr>
      <td valign="top" width="97">2074..207E</td>
    </tr>
    <tr>
      <td valign="top" width="97">all other</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Superscript characters, usually used as
        modifier letters in phonetic notation</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;vertical&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">normalize</td>
      <td valign="top" width="572">East Asian Presentation forms</td>
    </tr>
    <tr>
      <td valign="top" width="80">&lt;wide&gt;</td>
      <td valign="top" width="97">all</td>
      <td valign="top" width="83">retain</td>
      <td valign="top" width="572">Full-width characters</td>
    </tr>
  </tbody>
</table>

<blockquote>
  <p><b>Note: </b>Some symbols used in vertical layout exist as single code
  points in legacy systems, but can also be composed on the fly by more
  advanced display engines. There are currently no style properties that
  could be used to express squared Kana clusters (<i>kumimoji</i>) or
  horizontal in vertical writing mode (<i>tate-chu-yoko</i>).</p>
</blockquote>

<h3><a name="Generating">5.2 Generating New Text</a></h3>

<p>Presentation forms and characters for which adequate representation exists
as marked up text should never be entered into new data. Many of the
characters with &lt;font&gt; tag are however suitable for new data, as long
as they are used in the manner they are intended, that is as symbols, with
definite semantic differentiation between the different forms. The largest
set of these characters exists to carry essential semantic distinctions in
mathematical notation, where any loss of markup during text export would
compromise the meaning of the text. Most of the characters with &lt;super&gt;
and &lt;sub&gt; tag have been encoded for use in phonetic or phonemic
transcriptions, where they act as ordinary letters and the use of style
markup is therefore deemed inappropriate. However, it is inappropriate to use
any of these classes of characters to create the appearance of styled text
runs.</p>

<p>For example, to write <i>hello</i>, one should use &lt;i&gt;hello&lt;/i&gt;
and not the sequence of Unicode characters U+210E, U+212F, U+2113, U+2113,
U+2134. Conversely, to indicate <i>Planck's constant</i> one should use
U+210E and not &lt;i&gt;h&lt;/i&gt;.</p>

<p>When style is applied across entire words, sentences or paragraphs, the
use of markup is preferred. When style is applied to individual letters,
especially to letters inside a word, giving them a particular interpretation,
the use of character codes is preferred. See also <a
href="#Superscripts">Section 5.6</a>.</p>
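
<p>The reason such symbols must not be folded indiscriminately can be seen
with a short Python check, using the standard unicodedata module. The
variable names are illustrative only.</p>

```python
import unicodedata

PLANCK_CONSTANT = "\u210e"  # U+210E PLANCK CONSTANT, tagged as a font variant

# NFKC replaces the symbol by its compatibility equivalent, a plain "h",
# so the semantic distinction is lost.
folded = unicodedata.normalize("NFKC", PLANCK_CONSTANT)

# NFC leaves the symbol alone, because the mapping is not canonical.
preserved = unicodedata.normalize("NFC", PLANCK_CONSTANT)
```

<p>After running this, folded holds the ordinary letter h while preserved
still holds U+210E.</p>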

<h3><a name="List">5.3 List Item Marker Characters</a></h3>

<p><em>Short description</em>: Characters with a &lt;circled&gt; tag or
characters with &lt;compat&gt; tag and compatibility mapping to a
parenthesized string.</p>

<p><em>Reason for inclusion</em>: They are most frequently used for marking
enumerated list items, but the characters with a &lt;circled&gt; tag often
occur as dingbats or footnote markers in tables. The same characters are used
in regular text when citing an item from a corresponding ordered list.</p>

<p><em>Problems when used in markup</em>: These characters do not cause undue
interaction with markup.</p>

<p><em>Problems with other uses</em>: None</p>

<p><em>Replacement markup</em>: (in text use) these characters are often used
in running text; sometimes, but not exclusively, in situations where the text
is to be associated with an item from a nearby numbered list. Replacement
markup may not be available, and the support for such markup is much more
limited today than was anticipated when this document was first written.</p>

<p>(list item style) When generating marked up text these characters occur
only internal to the user agent when list item styles are rendered. When
marking up plain text data they could be converted to suitable list item
styles, if such use can be properly inferred. The default recommendation is
to retain the original character.</p>

<p>(characters with compatibility mappings of the form "(<em>n</em>)" or
"<em>n</em>." or roman numerals) Unlike circled characters, these could be
rendered by sequences of regular characters. Using a list item marker style
would in theory allow the support of longer lists (the Unicode characters are
limited to the set (1) to (20) and "1." to "20."). Using regular character
sequences would also allow the use of fonts that match the text of the
list.</p>

<p><em>What to do if detected</em>: No action needs to be taken by browsers.
When received in an editing context, substitution of a list item marker style
may be appropriate. However, the same characters are very often used as
dingbat-like symbols in tables, or may appear in general text, whether or not
referring to an item from a list. Therefore the user must have the choice of
whether to replace the character.</p>

<h3><a name="Fractions">5.4 Fractions</a></h3>

<p><em>Short description</em>: Single character fractions such as ½ or ¼.</p>

<p><em>Reason for inclusion</em>: Subsets of these occur in practically all
legacy character sets.</p>

<p><em>Problems when used in markup</em>: The character repertoire is limited
to a few common fractions. When used with more general methods of generating
fractions such as MathML [<a href="#MathML">MathML</a>] the usual problem of
dual representation arises.</p>

<p><em>Problems with other uses</em>: Other than normalization issues, these
characters present no undue problems in plain text. Where fraction slash is
supported, these can be expressed by substituting their compatibility
mappings. </p>

<p><em>Replacement markup</em>: MathML can represent fractions unambiguously.
When using fraction slash, care must be taken that values like 3½ do not
turn into 31/2 (=15.5).</p>

<p><em>What to do if detected</em>: No action needs to be taken by browsers
or editors, except when converting plain text to MathML.</p>
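
<p>The pitfall mentioned under replacement markup can be reproduced, and
avoided, with a short Python sketch. Plain NFKC turns 3½ into the digits 3
and 1, a fraction slash, and the digit 2. The hypothetical helper
expand_fractions below instead inserts U+2009 THIN SPACE whenever the
preceding character is a digit; it relies on the standard unicodedata
module.</p>

```python
import unicodedata

THIN_SPACE = "\u2009"


def expand_fractions(text):
    """Expand single-character fractions, keeping them apart from digits."""
    out = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        if fields and fields[0][1:-1] == "fraction":
            # Insert a thin space so that 3 followed by 1/2 is not read as 31/2.
            if out and out[-1][-1].isdigit():
                out.append(THIN_SPACE)
            out.append("".join(chr(int(f, 16)) for f in fields[1:]))
        else:
            out.append(ch)
    return "".join(out)
```

<p>All characters other than single character fractions pass through
unchanged.</p>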

<h3><a name="Squared">5.5 Squared or Horizontal</a></h3>

<p><em>Short description</em>: Characters that are symbols composed of groups
of characters, typically kana or Latin letters and digits plus slash, for use
in a single display cell in vertical display of text.</p>

<p><em>Reason for inclusion</em>: Many existing character sets contain these
as precomposed characters since for simple implementations this is the only
way to support the common use of providing metric units and other
abbreviations in a single character cell for vertical text layout. </p>

<p><em>Problems when used in markup</em>: Proposed markup, including CSS
styling, would be able to express an unbounded set of these abbreviations,
obviating the need of cataloguing these in the character encoding standard
and making them more directly accessible to text based processing, for
example searching.</p>

<p><em>Problems with other uses</em>: The repertoire of these legacy
characters is limited; many more combinations are in actual use than are
accounted for in character sets. Pre-composed symbols do not make their text
content available to search engines. They also require re-encoding for text
laid out horizontally.</p>

<p><em>Replacement markup</em>: None available.</p>

<p><em>What to do if detected</em>: No action required. (Subject to change
pending the outcome of current proposals.)</p>

<h3><a name="Superscripts">5.6 Superscripts and Subscripts</a></h3>

<p><em>Short description</em>: Mainly super and subscript digits, but also
signs, parentheses and a large number of letters.</p>

<p><em>Reason for inclusion</em>: Super and subscripted letters and digits
are quite common in some forms of phonetic or phonemic transcriptions, where
the use of styles is both awkward and prone to data integrity issues when
exported to plain text. For super or subscripted letters in phonetic
transcription in particular, a change from superscript or subscript to
regular style would alter the meaning. Note that such use in transcription is
not limited to letters: superscripted small digits are often used to indicate
tone. When used for these purposes, these characters should be retained and
markup should <i>not</i> be used.</p>

<p>A few super and subscript characters, primarily the digits, also occur in
many legacy character sets, including Latin-1. Their use in pure plain text
is common in databases, e.g. for metric units in part descriptions
(such as cm<sup>2</sup>) or for (usually simplified) formulae as occur in titles
of scientific publications.</p>

<p>When used in mathematical context (MathML) it is recommended to
consistently use style markup for superscripts and subscripts. This is
because mathematical layout allows not just individual symbols, but entire
expressions to be superscripted or subscripted in a regular, nested
manner.</p>

<p><em>Problems when used in markup</em>: Mixing direct use of these
characters with the use of style markup provides multiple representations of
the same text, leading to potentially different treatment by search and
display engines.</p>

<p>However, when super and sub-scripts are to reflect semantic distinctions,
it is easier to work with these meanings encoded in text rather than markup,
for example, in phonetic or phonemic transcription. Otherwise, they would
require markup in the middle of words, and they may also be inadvertently
changed to normal style text when exporting to plain text. This applies to
the majority of super and subscripted characters in Unicode. On the other
hand, some user agents may support certain superscripted or subscripted
characters only when used as marked up text, for example because of a lack of
font support for them.</p>

<p><em>Problems with other uses</em>: none</p>

<p><em>Replacement markup</em>: Unless used as letters, &lt;xhtml:sup&gt; and
&lt;xhtml:sub&gt; or &lt;mathml:msup&gt; and &lt;mathml:msub&gt; may be
used.</p>

<p><em>What to do if detected</em>: Both representations (with or without
style markup) should be equivalent for search purposes. Input methods for
mathematical texts might enforce the use of styles. If superscript
characters are encountered during display of mathematical formulae, it is
recommended that they be displayed in a manner indistinguishable from that
achieved by using regular characters with corresponding style markup.</p>
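
<p>One possible way to make both representations equivalent for search is to
fold the superscript and subscript digits and signs to regular characters
when building a search key. The Python sketch below is an assumption about
how this could be done, not a prescribed method; the name search_key is
hypothetical, and only the code ranges that Table 5.1 lists as "retain, or
use markup" are folded, so that modifier letters used in phonetic notation
stay intact.</p>

```python
import unicodedata

# Code ranges that Table 5.1 lists as "retain, or use markup".
SCRIPT_DIGIT_RANGES = (
    range(0x00B2, 0x00B4),  # superscript two and three
    range(0x00B9, 0x00BA),  # superscript one
    range(0x2070, 0x2071),  # superscript zero
    range(0x2074, 0x207F),  # superscript 4-9 and signs
    range(0x2080, 0x208F),  # subscript 0-9 and signs
)


def search_key(text):
    """Fold super/subscript digits and signs to regular characters."""
    out = []
    for ch in text:
        if any(ord(ch) in r for r in SCRIPT_DIGIT_RANGES):
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)
    return "".join(out)
```

<p>A search engine would apply the same folding to the text content of
superscript and subscript markup, so that both forms produce the same
key.</p>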

<h3><a name="Other">5.7 Other Characters Marked &lt;compat&gt;</a></h3>

<p><em>Short description</em>: The &lt;compat&gt; label was given to a set of
compatibility characters whose further classification was not settled at the
time the standard was created. The largest components are list item marker
characters.</p>

<p><em>Reason for inclusion</em>: These characters occur in many legacy
character sets.</p>

<p><em>Problems when used in markup</em>: none. There usually is no
equivalent markup.</p>

<p><em>Problems with other uses</em>: none</p>

<p><em>Replacement markup</em>: none.</p>

<p><em>What to do if detected</em>: No action required.</p>

<h2><a name="Noncharacters">6.  Noncharacters</a></h2>

<p>The Unicode Standard defines 66 noncharacter code points, or
<i>noncharacters</i>. These are the last two positions on each of the 17
planes, in other words, all code points that end in ...FFFE or
...FFFF, as well as the 32 code points from U+FDD0 to U+FDEF. Applications
are free to use any of these code points internally but should never attempt
to interchange them. In effect, noncharacters can be thought of as
application-internal private-use code points.</p>
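
<p>The definition above translates directly into a small test. The following
Python sketch takes an integer code point; the function name is
illustrative.</p>

```python
def is_noncharacter(cp):
    """True for the 66 noncharacter code points."""
    if cp in range(0xFDD0, 0xFDF0):  # U+FDD0..U+FDEF
        return True
    # Last two code points of each of the 17 planes.
    return cp in range(0x110000) and cp % 0x10000 in (0xFFFE, 0xFFFF)
```

<p>A process that exports text could use such a test to detect
noncharacters before interchange.</p>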

<h2>7. <a name="White">White Space</a></h2>

<p>This section presents common issues with white space characters in markup
languages, mostly based on their difference in function as part of the
structure of the markup source (syntactic white space) on the one hand and as
part of the document content on the other hand.</p>

<p>The set of characters in the Unicode standard that have the property
"White_Space" (see 'White Space' in the [<a href="#UnicodeData">UCD</a>]) is
quite large. It includes white space characters with different line breaking
properties, different ligating properties, and different widths. It is
appropriate to use these characters as part of markup content for their very
specific purpose. It is preferable to place them in the markup source so
that they are surrounded by ordinary characters rather than, for example,
line breaks. The set of white space characters defined by typical markup
language specifications is a subset of the characters that are considered
white space by [<a href="#Unicode">Unicode</a>].</p>

<p>Each markup language defines the set of characters that it accepts as part
of the markup syntax; this is usually a very small set. The XML [<a
href="#xml10">XML1.0</a>] and [<a href="#xml11">XML1.1</a>] specifications
define white space as a combination of one or more of the following
characters: U+0020 SPACE, carriage return (U+000D), line feed (U+000A), or
tab (U+0009). [<a href="#html4.01">HTML4.01</a>] adds to these the form feed
character (U+000C), but that character cannot be used in any XHTML
version.</p>

<p>In addition, markup languages may use conventions for converting or
removing some kinds of white space. XML processors replace some combinations
of end-of-line characters by a single line feed character. [<a
href="#xml10">XML1.0</a>] normalizes any two-character sequence (U+000D
U+000A) or any U+000D not followed by U+000A to a single U+000A. [<a
href="#xml11">XML1.1</a>] also normalizes NEL (U+0085) and U+2028 LINE
SEPARATOR, but U+2029 PARAGRAPH SEPARATOR is not treated that way. Additional
processing of white space before it is handed to an application also occurs
for attribute values: line breaks are replaced by spaces, leading and
trailing spaces are removed, and sequences of spaces are replaced by a single
space.</p>
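
<p>The processing described above can be approximated in Python as follows.
This is a sketch of the behavior, not a replacement for an XML processor.
The function names are hypothetical. The handling of the pair U+000D U+0085
follows the XML 1.1 specification rather than the summary above, and in XML
itself the trimming and collapsing of spaces applies to attributes whose
declared type is not CDATA.</p>

```python
XML_WHITE_SPACE = " \t\n\r"  # U+0020, U+0009, U+000A, U+000D


def normalize_line_ends(text, xml11=False):
    """Translate line ends to a single U+000A, as an XML processor would."""
    if xml11:
        # XML 1.1 additionally recognizes NEL and LINE SEPARATOR.
        text = text.replace("\r\x85", "\n")
        text = text.replace("\x85", "\n").replace("\u2028", "\n")
    return text.replace("\r\n", "\n").replace("\r", "\n")


def normalize_attribute_value(value):
    """Replace line breaks and tabs by spaces, then trim and collapse."""
    for ch in "\t\n\r":
        value = value.replace(ch, " ")
    return " ".join(part for part in value.split(" ") if part)
```

<p>Note that U+2029 PARAGRAPH SEPARATOR is left untouched in both modes.</p>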

<p>In XML, white space is purely syntactic inside tags, for example, to
separate the element name from attributes, and between elements in element
content models (as they are typical for data-oriented applications). White
space in element content models is used to lay out the markup source, using
line breaks and indentation, to improve readability. The same use of white
space is possible in many cases in mixed content (typical for text-oriented
applications).</p>

<p>Because XML is used for a very wide range of applications, after the
processing steps mentioned above it passes all white space to the
application. Some XML applications such as [<a href="#XHTML">XHTML</a>] may
have their own rules for processing white space
characters. Also, applications and software transforming XML (e.g. [<a
href="#XSLT">XSLT</a>]) have specific conventions of how they handle white
space, and specific ways of how to control this behavior. To appropriately
use white space characters, readers are advised to examine all involved
standards and software.</p>

<p>If the characters U+2028 and U+2029 appear in text, they may be treated as
zero-width characters without semantic meaning (see Section 3.2).</p>

<h3 id="converting-nl-to-ws">7.1 Converting Newline Functions to White
Space</h3>

<p>White space that is not purely syntactic, including control codes that
define a newline function (see <i>Section 5.8, Newline Guidelines,</i> in [<a
href="#Unicode">Unicode</a>]), can be handled in three main ways.</p>
<ol>
  <li>For data-oriented applications, the textual content of elements is
    treated according to the needs of the data type in question. In many
    cases, processing by the application includes aspects similar to those of
    the processing of attribute values by the XML parser itself. For some
    types of data, in particular small data items, some applications may also
    simply prohibit the use of white space.</li>
  <li>For running text in text-oriented applications, reflowing is used, i.e.
    the line breaks in the markup source are removed and the text is reflowed
    into lines whose length is determined by the output medium and styling
    properties. In the context of Unicode, this reflowing process requires
    care; it is described in more detail below.</li>
  <li>For preformatted text, such as program source code, line breaks must be
    preserved. Text-oriented applications usually contain special markup for
    preformatted text, e.g. &lt;xhtml:pre&gt;. XML itself defines an
    xml:space attribute that applications may use for a similar purpose.</li>
</ol>

<p>When reflowing, line breaks and adjacent white space can be treated as
space, removed, collapsed with adjacent control characters of the same type,
or treated as zero-width space. Which choice is appropriate depends on the
script of the surrounding text. The assumption is that line breaks and
adjacent white space (in particular following white space, used for
indentation) were added to make the markup source more readable, in particular
to make each line fit on a line of a plain text editor. For scripts that use
spaces, line breaks will have been inserted where there originally was a
space; treating them as spaces therefore preserves the intended separation
between words. For scripts which do not use spaces, such as Ideographic
scripts or certain South East Asian scripts, such as Thai, line feeds should
be removed, or replaced by U+200B zero width space. The choice of treatment
can depend on the script value of the characters preceding and following the
line feed character, assuming these characters belong to the same run of
text.</p>
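
<p>A rough sketch of such script-dependent reflowing is given below in
Python. It assumes that line ends have already been normalized to U+000A.
Because the standard unicodedata module does not expose the script property,
the sketch falls back on character name prefixes, which is only a crude
heuristic; the prefix list, the function names, and the joiner parameter are
all assumptions made for illustration.</p>

```python
import unicodedata

# Crude stand-in for a script lookup: name prefixes of some scripts
# that do not separate words by spaces.
NO_SPACE_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "THAI")


def uses_no_spaces(ch):
    """Guess from the character name whether the script omits spaces."""
    return unicodedata.name(ch, "").startswith(NO_SPACE_PREFIXES)


def reflow(text, joiner=""):
    """Join source lines, choosing the separator by the adjacent scripts."""
    lines = [line.strip(" \t") for line in text.split("\n")]
    lines = [line for line in lines if line]
    if not lines:
        return ""
    out = lines[0]
    for line in lines[1:]:
        if uses_no_spaces(out[-1]) and uses_no_spaces(line[0]):
            # Remove the break, or pass joiner="\u200b" to keep a
            # zero-width break opportunity instead.
            out += joiner + line
        else:
            out += " " + line
    return out
```

<p>When the characters on the two sides of a line break belong to different
kinds of scripts, this sketch falls back to a space; other policies are
equally conceivable.</p>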

<blockquote>
  <p><b>Note:</b> The Unicode Standard [<a href="#Unicode">Unicode</a>]
  specifies that the zero width space is considered a valid line-break point
  and that if two characters with a zero width space in between are placed on
  the same line they are placed with no space between them; and that if they
  are placed on two lines no additional glyph area is created at the
  line-break.</p>
</blockquote>

<p>The details of reflowing are the responsibility of the various markup
applications (e.g. [<a href="#XHTML">XHTML</a>]). However, there is a
tendency to move this functionality from markup applications to styling, so
that it can be shared across applications.</p>

<p>Authors should be aware that the script-specific treatment of line breaks
described above is not yet available in all implementations (e.g. browsers).
For scripts that do not use white space to separate words, it may therefore
still be advisable not to split long lines.</p>

<p>Editing tools should try to support the user in the appropriate use of
white space. Some white space characters cannot easily be entered via a
keyboard, but some others, e.g. U+3000 Ideographic Space, can. Editing tools
should try to make sure that only line breaks and white space that is
accepted as syntactic white space by the relevant markup language are used to
improve markup source readability.</p>
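
<p>As an illustration of such tool support, the following Python sketch
reports white space characters that XML 1.0 does not accept as syntactic
white space. The function name <code>find_nonsyntactic_whitespace</code> is
hypothetical, and the choice of the general categories Zs, Zl and Zp as the
definition of "white space" is an assumption made for this example.</p>

```python
import unicodedata

# White space accepted as syntactic by XML 1.0 (production S):
# space, tab, carriage return, line feed.
XML_SYNTACTIC_WS = {"\u0020", "\u0009", "\u000d", "\u000a"}


def find_nonsyntactic_whitespace(source):
    """Return (offset, code point label, name) for every white space
    character in source that is not XML 1.0 syntactic white space."""
    findings = []
    for offset, ch in enumerate(source):
        if ch in XML_SYNTACTIC_WS:
            continue
        # Assumption: treat the space, line and paragraph separator
        # categories as the set of characters worth flagging.
        if unicodedata.category(ch) in ("Zs", "Zl", "Zp"):
            label = "U+%04X" % ord(ch)
            findings.append((offset, label, unicodedata.name(ch, "unnamed")))
    return findings
```

<p>An editing tool could run such a check on the white space between tags
and warn the user when, for example, U+3000 Ideographic Space has been typed
where an ordinary space was intended for indentation.</p>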

<p>While the styling possibilities provided by CSS and its implementations
have not reached the level of professional typesetting systems, they offer a
wide range of ways to control layout and spacing of text. A very simple
example is text centering, which would have been done by inserting an
appropriate number of spaces on each line in pure plain text.</p>

<h2><a name="Versioning">8. Versioning</a></h2>

<p>This report will be updated by the Unicode Technical Committee in
cooperation with the W3C Internationalization Activity whenever the tables of
characters in this document need to be updated as a result of the addition of
characters to the Unicode Standard, as a result of a revised determination of
the suitability of a given character for use with markup, or when additional
background information or recommendations become available.</p>

<p>Each report carries a revision number, which may be used to refer to a
specific version of the report. Older versions of the report will remain
available. Each version of this report specifies the underlying version of
the Unicode Standard.</p>

<p>For more information on the Unicode Standard and its versions, see:</p>
<ul class="unicode">
  <li><a href="http://www.unicode.org/unicode/standard/versions/">Versions of
    the Unicode Standard</a> [<a
  href="#UnicodeVersions">UnicodeVersions</a>]</li>
  <li><a href="http://www.unicode.org/ucd/">About the Unicode Character
    Database</a> [<a href="#UCD">UCD</a>]</li>
  <li><a href="http://www.unicode.org/Public/UNIDATA/UCD.html">Unicode
    Character Database</a> [<a href="#UnicodeData">UnicodeData</a>]</li>
</ul>

<h2><a name="Conformance">9. Conformance</a></h2>

<p>In the context of the Unicode Standard, the material in this technical
report is <em>informative</em>. However, other documents, particularly markup
language specifications, may specify conformance including normative
references to this document. Such references may have to be updated as a
result of future updates to this report as discussed in Section 8, <i><a
href="#Versioning">Versioning</a></i>.</p>

<h2><a name="References">10. References</a></h2>
<dl>
  <dt><a name="Charmod">[Charmod]</a></dt>
    <dd>Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex
      Texin, Eds., <cite>Character Model for the World Wide Web 1.0:
      Fundamentals</cite>, W3C Recommendation, 15-February-2005, &lt;<a
      href="http://www.w3.org/TR/2005/REC-charmod-20050215/">http://www.w3.org/TR/2005/REC-charmod-20050215/</a>&gt;.</dd>
  <dt>[<a name="Charmodnorm">Charmodnorm</a>]</dt>
    <dd>François Yergeau, Martin J. Dürst, Richard Ishida, Addison Phillips,
      Misha Wolf, and Tex Texin, Eds., <i>Character Model for the World Wide
      Web 1.0: Normalization,</i> W3C Working Draft, 27-October-2005, &lt;<a
      href="http://www.w3.org/TR/2005/WD-charmod-norm-20051027/">http://www.w3.org/TR/2005/WD-charmod-norm-20051027/</a>&gt;.</dd>
  <dt><a name="CharReq">[CharReq]</a></dt>
    <dd>Martin J. Dürst, <cite>Requirements for String Identity and Character
      Indexing Definitions for the WWW</cite>, W3C Working Draft,
      10-July-1998, &lt;<a
      href="http://www.w3.org/TR/WD-charreq">http://www.w3.org/TR/WD-charreq</a>&gt;.</dd>
  <dt>[<a name="CSS">CSS</a>]</dt>
    <dd>For information on cascading style sheet specifications, see &lt;<a
      href="http://www.w3.org/Style/CSS/">http://www.w3.org/Style/CSS/</a>&gt;.</dd>
  <dt>[<a name="Feedback">Feedback</a>]</dt>
    <dd>Reporting Errors and Requesting Information Online to the Unicode
      Consortium, &lt;<a
      href="http://www.unicode.org/reporting.html">http://www.unicode.org/reporting.html</a>&gt;.</dd>
  <dt><a name="html4.01">[HTML4.01]</a></dt>
    <dd>Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., <cite>HTML 4.01
      Specification</cite>, W3C Recommendation, 18-Dec-1997 (revised on
      24-Dec-1999), &lt;<a
      href="http://www.w3.org/TR/1999/REC-html401-19991224/">http://www.w3.org/TR/1999/REC-html401-19991224/</a>&gt;.</dd>
  <dt><a name="HTML4.0-8.2">[HTML 4.0 - 8.2]</a></dt>
    <dd>Section 8.2 of [HTML4.01], <i>Specifying the direction of text and
      tables: the dir attribute</i> &lt;<a
      href="http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.2">http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.2</a>&gt;.</dd>
  <dt><a name="MathML">[MathML]</a></dt>
    <dd>David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Eds.,
      <i>Mathematical Markup Language (MathML) Version 2.0
      (Second Edition)</i>, W3C Recommendation, 21-Oct-2003, &lt;<a
      href="http://www.w3.org/TR/2003/REC-MathML2-20031021/">http://www.w3.org/TR/2003/REC-MathML2-20031021/</a>&gt;.</dd>
  <dt><a name="Namespace">[Namespace]</a></dt>
    <dd>Tim Bray, Dave Hollander, Andrew Layman, Eds., <i>Namespaces in XML
      (Second Edition)</i>, W3C Recommendation, 16-Aug-2006, &lt;<a
      href="http://www.w3.org/TR/2006/REC-xml-names-20060816/">http://www.w3.org/TR/2006/REC-xml-names-20060816/</a>&gt;.</dd>
  <dt><a name="Ruby">[Ruby]</a></dt>
    <dd>Marcin Sawicki, Michel Suignard, Masayasu Ishikawa, Martin Dürst, Tex
      Texin, Eds., <i>Ruby Annotation</i>, W3C Recommendation, 31-May-2001,
      &lt;<a
      href="http://www.w3.org/TR/2001/REC-ruby-20010531/">http://www.w3.org/TR/2001/REC-ruby-20010531/</a>&gt;.</dd>
  <dt><a name="UTR9">[UAX 9]</a></dt>
    <dd>Mark Davis, <cite>Unicode Standard Annex #9, The Bidirectional
      Algorithm</cite>, &lt;<a
      href="http://www.unicode.org/reports/tr9/">http://www.unicode.org/reports/tr9/</a>&gt;.</dd>
  <dt>[<a name="UAX14">UAX14</a>]</dt>
    <dd>Asmus Freytag, <i>Unicode Standard Annex #14, Line Breaking
      Properties</i>, &lt;<a
      href="http://www.unicode.org/reports/tr14/">http://www.unicode.org/reports/tr14/</a>&gt;.</dd>
  <dt><a name="UTR15">[UAX 15]</a><a name="UAX15"></a></dt>
    <dd>Mark Davis, Martin Dürst, <cite>Unicode Standard Annex #15, Unicode
      Normalization Forms</cite>, &lt;<a
      href="http://www.unicode.org/reports/tr15/">http://www.unicode.org/reports/tr15/</a>&gt;.</dd>
  <dt>[<a name="UAX29">UAX 29</a>]</dt>
    <dd>Mark Davis, <i>Unicode Standard Annex #29, Text Boundaries</i>,
      &lt;<a
      href="http://www.unicode.org/reports/tr29/">http://www.unicode.org/reports/tr29/</a>&gt;.</dd>
  <dt>[<a name="UCD">UCD</a>]</dt>
    <dd><cite>About the Unicode Character Database</cite>, &lt;<a
      href="http://www.unicode.org/ucd/">http://www.unicode.org/ucd/</a>&gt;.</dd>
  <dt><a name="Unicode">[Unicode]</a></dt>
    <dd>The Unicode Consortium. <i><a
      href="http://www.unicode.org/versions/Unicode5.0.0/">The Unicode
      Standard, Version 5.0</a></i> (Boston, MA: Addison-Wesley, 2007, ISBN
      0-321-48091-0).</dd>
  <dt><a name="Unicode32">[Unicode32]</a></dt>
    <dd><cite>Unicode Standard Annex #28 <a
      href="http://www.unicode.org/reports/tr28/">Unicode 3.2</a></cite>, The
      Unicode Consortium, 2002.</dd>
  <dt><a name="Unicode40">[Unicode40]</a></dt>
    <dd><cite><a
      href="http://www.unicode.org/unicode/standard/standard.html">The
      Unicode Standard</a>, <a
      href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">Version
      4.0</a></cite> (Reading,
      Massachusetts: Addison-Wesley Developers Press, 2003, ISBN
      0-321-18578-1) or online as &lt;<a
      href="http://www.unicode.org/versions/Unicode4.0.0/">http://www.unicode.org/versions/Unicode4.0.0/</a>&gt;.</dd>
  <dt>[<a name="Unicode50">Unicode50</a>]</dt>
    <dd>The Unicode Consortium. <i><a
      href="http://www.unicode.org/versions/Unicode5.0.0/">The Unicode
      Standard, Version 5.0</a></i> (Boston, MA: Addison-Wesley, 2007, ISBN
      0-321-48091-0) or online as &lt;<a
      href="http://www.unicode.org/versions/Unicode5.0.0/">http://www.unicode.org/versions/Unicode5.0.0/</a>&gt;.</dd>
  <dt><a name="UnicodeData">[UnicodeData]</a></dt>
    <dd><cite>Unicode Character Database</cite>, &lt;<a
      href="http://www.unicode.org/Public/UNIDATA/UCD.html">http://www.unicode.org/Public/UNIDATA/UCD.html</a>&gt;.</dd>
  <dt><a name="UnicodeVersions">[UnicodeVersions]</a></dt>
    <dd><cite>Versions of the Unicode Standard</cite>, &lt;<a
      href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions/</a>&gt;.</dd>
  <dt>[<a name="UTR25">UTR25</a>]</dt>
    <dd>Asmus Freytag, Barbara Beeton, Murray Sargent, <i>Unicode Technical
      Report #25, Unicode Support for Mathematics</i>, &lt;<a
      href="http://www.unicode.org/reports/tr25/">http://www.unicode.org/reports/tr25/</a>&gt;.</dd>
  <dt>[<a name="Variants">Variants</a>]</dt>
    <dd>Standardized Variants &lt;<a
      href="http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html">http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html</a>&gt;.</dd>
  <dt><a name="XHTML">[XHTML]</a></dt>
    <dd>Steven Pemberton, et al., Eds.,
      <cite>XHTML&trade; 1.0: The Extensible
      HyperText Markup Language - A Reformulation of HTML 4.0 in XML
      1.0</cite>, W3C Recommendation, 01-Aug-2002, &lt;<a
      href="http://www.w3.org/TR/2002/REC-xhtml1-20020801/">http://www.w3.org/TR/2002/REC-xhtml1-20020801/</a>&gt;.</dd>
  <dt><a name="xml10">[XML 1.0]</a></dt>
    <dd>Tim Bray, Jean Paoli, Eve Maler, C. M. Sperberg-McQueen, François
      Yergeau, Eds., <i>Extensible Markup Language (XML) 1.0 (Fourth
      Edition)</i>, W3C Recommendation, 16-August-2006, &lt;<a
      href="http://www.w3.org/TR/2006/REC-xml-20060816/">http://www.w3.org/TR/2006/REC-xml-20060816/</a>&gt;.</dd>
  <dt>[<a name="XSLT">XSLT</a>]</dt>
    <dd>Michael Kay, Ed., <i>XSL Transformations (XSLT) Version 2.0</i>, W3C
      Recommendation, 23-January-2007, &lt;<a
      href="http://www.w3.org/TR/2007/REC-xslt20-20070123/">http://www.w3.org/TR/2007/REC-xslt20-20070123/</a>&gt;</dd>
  <dt><a name="xml11">[XML 1.1]</a></dt>
    <dd>Jean Paoli, Eve Maler, Tim Bray, C. M. Sperberg-McQueen, François
      Yergeau, John Cowan, Eds., <i>Extensible Markup Language (XML) 1.1
      (Second Edition)</i>, W3C Recommendation 16-August-2006, &lt;<a
      href="http://www.w3.org/TR/2006/REC-xml11-20060816/">http://www.w3.org/TR/2006/REC-xml11-20060816/</a>&gt;.
    </dd>
  <dt>[<a name="XMLSchema">XML Schema</a>]</dt>
    <dd>Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn,
      Eds., <i>XML Schema Part 1: Structures Second Edition</i>, W3C
      Recommendation 28-October-2004, &lt;<a
      href="http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/">http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/</a>&gt;.</dd>
</dl>

<h2><a name="Acknowledgements">11. Acknowledgements</a></h2>

<p>Mark Davis and Hideki Hiura contributed to the early drafts. Jukka Korpela
and Felix Sasaki provided input to the current document.</p>

<h2><a name="ChangeHistory">12. Change History (last changes first)</a></h2>

<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-7.html">http://www.unicode.org/reports/tr20/tr20-7.html</a>:
Added entries for new characters in Unicode 5.0. Updated references to use
new chapter/section numbers in Unicode 5.0. Updated the discussion of
superscript and subscript characters, accounting for the differences between
their use in phonetic or phonemic transcription and mathematics. Added
Sections 3.10, 4.5, 4.6 and 4.7. Added Section 7 on handling white space.
Updated references to W3C publications (AF). More work on white space
section; moved everything about BOM to one place. (MJD)</p>

<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-6.html">http://www.unicode.org/reports/tr20/tr20-6.html</a>:
Added entries for new characters in Unicode 4.0. Separated out, and
extended, the discussion of format characters suitable for markup. This
resulted in a new section 2.6, moving section 3.2 to 4, and renumbering, as
well as new sections 4.1, 4.2, 4.3, 4.4. Added a discussion on noncharacters
in a new section 6. Updated reference from Unicode 3.1 and 3.2 to Unicode
4.0. Improved the layout and description of what is now table 5.1. Changed the
recommended action in 5.6 to none. Updated the Unicode status section.
Changed http://www.unicode.org/unicode/reports/ to <a
href="http://www.unicode.org/reports/">http://www.unicode.org/reports</a>
throughout to reflect the preferred style of URL (older style URLs continue
to be valid). Updated references to W3C publications. (AF/MJD)</p>

<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-5.html">http://www.unicode.org/reports/tr20/tr20-5.html</a>:
Updated reference from Unicode 3.0 to 3.1 and 3.2 where appropriate. Added
sections 3.6 and 3.9. Minor wording fixes in sections 2.3, 3.1, 3.2, 3.6,
3.10, 4.3, 4.5 and 5. (AF/MJD)</p>

<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-4.html">http://www.unicode.org/reports/tr20/tr20-4.html</a>:
Added a note to the introduction to limit the scope. Reorganized section 3
and clarified the language. Renamed some sections and tables. Updated the
document to prepare for publication as Unicode Technical Report and W3C Note
(AF/MJD). Minor editorial changes to the text, added section 4.7, fixed some
dates, plus a few typos. (AF)</p>

<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-3.html">http://www.unicode.org/reports/tr20/tr20-3.html</a>:
Minor editorial changes to the introduction, fixed some references, links,
and dates, plus a few typos. (AF/MJD)</p>

<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-2.html">http://www.unicode.org/reports/tr20/tr20-2.html</a>:
Added sections 2.1-2.6 (MJD), sections 3.1-3.5, and 3.8, as well as
sections 4.4-4.6 and 8 (AF). Edited text for publication as DRAFT Unicode
Technical Report. (AF)</p>

<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-1.html">http://www.unicode.org/reports/tr20/tr20-1.html</a>:
Completed references, linked TOC. Various wording changes. Added W3C WD
stylesheet, logo, copyright, status of this document. Streamlined authors'
section. (MJD) Added material on compatibility characters. (AF)</p>

<p>Changes from the initial draft: Fixed the header. Fixed the numbering.
Fixed the title. Put references to final version of data files based on
naming conventions. Minor wording changes. Added proposed language on
annotation characters to match example on FFFC. Posted for internal review by
UTC and W3C. (AF)</p>

<h2><a name="Copyright">13. Copyright</a></h2>

<p>Copyright © 1999-2007 Unicode<sup>®</sup>, Inc. and <a
href="http://www.w3.org/">W3C</a><sup>®</sup> (<a
href="http://www.csail.mit.edu/index.php"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved.</p>

<p>This document is available under the <a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">W3C
Document License</a> or the <a
href="http://www.unicode.org/unicode/copyright.html">Unicode License</a>.
Documents available from the W3C have additional <a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer">warranties,
liability</a>, and <a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks">trademark</a>
policies associated with them. The <a
href="http://www.unicode.org/unicode/copyright.html">Unicode License</a>
specifies warranty/liability and trademark terms including:</p>

<blockquote>
  <p class="unicode">The Unicode Consortium makes no expressed or implied
  warranty of any kind, and assumes no liability for errors or omissions. No
  liability is assumed for incidental and consequential damages in connection
  with or arising out of the use of the information or programs contained or
  accompanying this technical report.</p>

  <p class="unicode">Unicode and the Unicode logo are trademarks of Unicode,
  Inc., and are registered in some jurisdictions.</p>
</blockquote>
</body>
</html>