<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="ProgId" content="FrontPage.Editor.Document">
<style type="text/css">
.unicode { font-style: normal }
.unicode:link { color: #FF0000; background-color: #FFFFFF }
.unicode:visited { color: #808080; background-color: #FFFFFF }
.unicode:active { color: #0000FF; background-color: #FFFFFF }
em.unicode { font-style: normal }
</style>
<title>Unicode in XML and other Markup Languages</title>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WG-NOTE.css">
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img alt="W3C"
src="http://www.w3.org/Icons/w3c_home" align="middle" border="0" height="48"
width="72"></a> <a href="http://www.unicode.org/"><img alt="Unicode"
src="http://www.unicode.org/img/unilogo-72.gif" align="middle" border="0"
height="72" width="72"></a> </p>
<h1>Unicode in XML and other Markup Languages</h1>
<h2 class="unicode" id="utr20">Unicode Technical Report #20</h2>
<h2>W3C Working Group Note 16 May 2007</h2>
<dl>
<dt class="unicode">Revision (Unicode):</dt>
<dd>8</dd>
<dt>This version:</dt>
<dd class="unicode"><a
href="http://www.unicode.org/reports/tr20/tr20-8.html">http://www.unicode.org/reports/tr20/tr20-8.html</a></dd>
<dd><a
href="http://www.w3.org/TR/2007/NOTE-unicode-xml-20070516/">http://www.w3.org/TR/2007/NOTE-unicode-xml-20070516/</a></dd>
<dt>Latest version:</dt>
<dd class="unicode"><a
href="http://www.unicode.org/reports/tr20/">http://www.unicode.org/reports/tr20/</a></dd>
<dd><a
href="http://www.w3.org/TR/unicode-xml/">http://www.w3.org/TR/unicode-xml/</a></dd>
<dt>Previous version:</dt>
<dd class="unicode"><a
href="http://www.unicode.org/reports/tr20/tr20-7.html">http://www.unicode.org/reports/tr20/tr20-7.html</a></dd>
<dd><a
href="http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/">http://www.w3.org/TR/2003/NOTE-unicode-xml-20030613/</a></dd>
<dt>Date (Unicode):</dt>
<dd>2007-05-16</dd>
<dt>Authors:</dt>
<dd>Martin Dürst (<a
href="mailto:duerst@it.aoyama.ac.jp">duerst@it.aoyama.ac.jp</a>)</dd>
<dd>Asmus Freytag (<a
href="mailto:asmus@unicode.org">asmus@unicode.org</a>)</dd>
</dl>
<p class="copyright">Copyright © 2007 Unicode®, and <a
href="http://www.w3.org/"><acronym
title="World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a
href="http://www.csail.mit.edu/"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. <a
href="#Copyright">Detailed copyright information</a> is available.</p>
<hr title="Separator from Header">
</div>
<h2><a name="Abstract" id="Abstract"></a>Abstract</h2>
<p>This document contains guidelines on the use of the Unicode Standard in
conjunction with markup languages such as XML.</p>
<h2><a name="CommonStatus">Status of This Document (common)</a></h2>
<!--PROPOSED UPDATE
<p><font color="#FF0000">This is a proposed update to a Technical Report
published jointly by the <a href="http://www.unicode.org/unicode/consortium/utc.html">Unicode
Technical Committee</a> and by the <a href="http://www.w3.org/International/Group/">W3C
Internationalization Working Group/Interest Group</a> (<a href="http://cgi.w3.org/MemberAccess/AccessRequest">W3C
Members only</a>) in the context of the <a href="http://www.w3.org/International/Activity">W3C
Internationalization Activity</a>. This is a draft document which may be
updated, replaced, or superseded by other documents at any time. This is not a
stable document; it is inappropriate to cite this document as other than a work
in progress. </font></p>
-->
<!-- APPROVED -->
<p>This is a Technical Report published jointly by the <a
href="http://www.unicode.org/unicode/consortium/utc.html">Unicode Technical
Committee</a> and by the <a href="http://www.w3.org/International/core/">W3C
Internationalization Core Working Group</a>, which is part of the <a
href="http://www.w3.org/International/Activity">W3C Internationalization
Activity</a>.</p>
<p>The base version of the Unicode Standard for this document is <a
href="#Unicode50">Version 5.0</a>. For more information about versions of the
Unicode Standard, see <a
href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions/</a>.
Both the Unicode Standard and markup technologies are evolving. When
appropriate, a new version of this document may be published.</p>
<p>Please mail corrigenda and other comments to the authors or use the <a
href="http://www.unicode.org/reporting.html">reporting form</a>.</p>
<h2 class="unicode"><a name="UnicodeStatus">Status of This Document (Unicode
Consortium)</a></h2>
<div>
<!-- PROPOSED UPDATE <font color="#FF0000">This document is a proposed
update of a previously approved <b>Unicode Technical Report</b>. Publication
does not imply endorsement by the Unicode Consortium. </font>
-->
<!-- APPROVED -->
This document has been reviewed by Unicode members and other interested
parties, and has been approved by the Unicode Technical Committee as a
<b>Unicode Technical Report</b>. It is a stable document and may be used as
reference material or cited as a normative reference from another document. <!-- -->
</div>
<div>
<blockquote>
<p><b>A Unicode Technical Report (UTR) </b>contains informative material.
Conformance to the Unicode Standard does not imply conformance to any UTR.
Other specifications, however, are free to make normative references to a
UTR.</p>
</blockquote>
</div>
<div>
For a list of current Unicode Technical Reports see <a
href="http://www.unicode.org/reports/">http://www.unicode.org/reports</a>.
<h2><a name="W3CStatus">Status of This Document (W3C)</a></h2>
<p><em>This section describes the status of this document at the time of its
publication. Other documents may supersede this document. A list of current
W3C publications and the latest revision of this technical report can be
found in the <a href="http://www.w3.org/TR/">W3C technical reports index</a>
at http://www.w3.org/TR/.</em></p>
<!--PROPOSED UPDATE
<p><font color="#FF0000">This is a proposed update to a Note that has been
previously endorsed by the W3C Internationalization Working Group/Interest
Group, but has not been reviewed or endorsed by W3C Members.</font></p>
-->
<!--APPROVED -->
<p>This document contains guidelines on the use of the Unicode Standard in
conjunction with markup languages such as XML.</p>
<p>This <a href="http://www.w3.org/2005/10/Process-20051014/tr.html#q75">W3C
Working Group Note</a> was produced by the <a
href="http://www.w3.org/International/core/" shape="rect">i18n Core Working
Group</a>, part of the <a
href="http://www.w3.org/International/">Internationalization Activity</a>.
Please send comments related to this document to <a
href="mailto:www-i18n-comments@w3.org?subject=%5Bunicode-xml%5D"
shape="rect">www-i18n-comments@w3.org</a> (<a
href="http://lists.w3.org/Archives/Public/www-i18n-comments/"
shape="rect">public archive</a>). Use "[unicode-xml]" in the subject line of
your email.</p>
<p>Publication as a <a
href="http://www.w3.org/2005/10/Process-20051014/tr.html#tr-end">Working
Group Note</a> does not imply endorsement by the W3C Membership. At the time
of publication, work on this document was considered complete and no further
revisions are anticipated. It is a stable document and may be used as
reference material or cited from another document. However, this document may
be updated, replaced, or made obsolete by other documents at any time.</p>
<p>This document was produced by a group operating under the <a
href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 February 2004
W3C Patent Policy</a>. W3C maintains a <a
href="http://www.w3.org/2004/01/pp-impl/32113/status">public list of any
patent disclosures</a> made in connection with the deliverables of the group;
that page also includes instructions for disclosing a patent. An individual
who has actual knowledge of a patent which the individual believes contains
<a
href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential
Claim(s)</a> must disclose the information in accordance with <a
href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section
6 of the W3C Patent Policy</a>.</p>
</div>
<!-- -->
<h2><a name="Contents">Table of Contents</a></h2>
<ol>
<li><a href="#Introduction">Introduction</a><br>
1.1 <a href="#Notation">Notation</a></li>
<li><a href="#General">General Considerations</a><br>
2.1 <a href="#Linearity">Linearity versus Structure</a><br>
2.2 <a href="#Overlap">Overlap of Control Code and Markup
Semantics</a><br>
2.3 <a href="#Markup">Markup and Styling</a><br>
2.4 <a href="#Coincidence">Coincidence of Markup and Functions</a><br>
2.5 <a href="#Extensibility">Extensibility of Markup</a><br>
2.6 <a href="#Suitability">Suitability of Characters in Markup</a></li>
<li><a href="#Suitable">Characters not Suitable for Use With Markup</a><br>
3.1 <a href="#Charlist">Table of Characters not Suitable for Use With
Markup</a><br>
3.2 <a href="#Line">Line and Paragraph Separator</a><br>
3.3 <a href="#Bidi">Bidi Embedding Controls</a><br>
3.4 <a href="#Deprecated">Deprecated Formatting Characters</a><br>
3.5 <a href="#BOM">Byte Order Mark</a><br>
3.6 <a href="#Interlinear">Interlinear Annotation Characters</a><br>
3.7 <a href="#Object">Object Replacement Character</a><br>
3.8 <a href="#Musical">Musical Controls</a><br>
3.9 <a href="#Language">Language Tag Characters</a><br>
3.10 <a href="#OtherDeprecated">Other Deprecated Characters</a></li>
<li><a href="#Format">Format Characters Suitable for Use With Markup</a>
<br>
4.1 <a href="#Subtending">Subtending Marks</a><br>
4.2 <a href="#Fraction">Fraction Slash</a><br>
4.3 <a href="#Variation">Variation Selector</a><br>
4.4 <a href="#Ideographic">Ideographic Description Characters</a><br>
4.5 <a href="#Invisible">Invisible Mathematical Operators</a><br>
4.6 <a href="#LineBreak">Line Break Controls</a><br>
4.7 <a href="#Fillers">Hangul Fillers</a></li>
<li><a href="#Compatibility">Characters with Compatibility Mappings</a><br>
5.1 <a href="#Overview">Overview</a><br>
5.2 <a href="#Generating">Generating New Text</a><br>
5.3 <a href="#List">List item Marker Characters</a><br>
5.4 <a href="#Fractions">Fractions</a><br>
5.5 <a href="#Squared">Squared or Horizontal</a><br>
5.6 <a href="#Superscripts">Superscripts and Subscripts</a><br>
5.7 <a href="#Other">Other Characters Marked &lt;compat&gt;</a></li>
<li><a href="#Noncharacters">Noncharacters</a></li>
<li><a href="#White">White Space</a><br>
7.1 <a href="#converting-nl-to-ws">Converting Newline Functions to White
Space</a></li>
<li><a href="#Versioning">Versioning</a></li>
<li><a href="#Conformance">Conformance</a></li>
<li><a href="#References">References</a></li>
<li><a href="#Acknowledgements">Acknowledgements</a></li>
<li><a href="#ChangeHistory">Change History</a></li>
<li><a href="#Copyright">Copyright</a></li>
</ol>
<h2><a name="Introduction">1. Introduction</a></h2>
<p>The Unicode Standard [<a href="#Unicode">Unicode</a>] defines the
universal character set. Its primary goal is to provide an unambiguous
encoding of the content of plain text, ultimately covering all languages in
the world, but also major text-based notational systems for science,
technology, music, and scholarship.</p>
<p>Currently in its <a href="#Unicode50">fifth major version</a>, Unicode
contains a large number of characters covering most of the currently used
scripts in the world. It also contains additional characters for
interoperability with older character encodings, and characters with
control-like functions included primarily for reasons of providing
unambiguous interpretation of plain text. Unicode provides specifications for
use of all of these characters.</p>
<p>For document and data interchange, the Internet and the World Wide Web
make extensive use of marked-up text such as <a href="#html4.01">HTML4.01</a>
and <a href="#xml10">XML</a>. In many instances, markup provides the same, or
essentially similar, features to those provided by format characters in the
Unicode Standard for use in plain text. Compatibility characters are another
special category of characters provided by Unicode. While there may be valid
reasons to support these characters and their specifications in plain text,
their use in marked-up text can conflict with the rules of the markup
language. Formatting characters are discussed in Section 3, <i><a
href="#Suitable">Characters not Suitable for Use With Markup</a></i>, and
Section 4, <i><a href="#Format">Format Characters Suitable for Use With
Markup</a></i>; compatibility characters are discussed in Section 5, <i><a
href="#Compatibility">Characters with Compatibility Mappings</a></i>.
Section 6 briefly discusses noncharacters, and Section 7 is devoted to white
space.</p>
<p>Issues resulting from canonical equivalences and Normalization [<a
href="#UTR15">Normalization</a>] as well as the interaction of character
encoding and methods of escaping characters in markup are discussed in the
Character Model for the World Wide Web [<a href="#Charmod">Charmod</a>] and
[<a href="#Charmodnorm">Charmodnorm</a>].</p>
<p>The issues of using Unicode characters with marked-up text depend to some
degree on the rules of the markup language in question and the set of
elements it contains. In a narrow sense, this document concerns itself only
with XML, and to some extent HTML. However, much of the general information
presented here should be useful in a broader context, including some page
layout languages.</p>
<blockquote>
<p><b><a name="Note">Note:</a></b> Many of the recommendations of this
report depend on the availability of particular markup or styling. Where
possible, appropriate DTDs or Schemas should be used or designed to make
such markup or styling available, or the DTDs or Schemas used should be
appropriately extended. The current version of this document makes no
specific recommendations for the design of DTDs or Schemas, or for the use
of particular DTDs or Schemas, but the information presented here may be
useful to designers of DTDs and Schemas, and to people selecting DTDs or
Schemas for their applications. </p>
<p><b>Note: </b>The recommendations of this report do not apply in the case
of XML used for blind data transport and similar cases.</p>
</blockquote>
<h3><a name="Notation">1.1 Notation</a></h3>
<p>This report uses XML [<a href="#xml10">XML</a>] as a prominent and general
example of markup. The XML namespace notation [<a
href="#Namespace">Namespace</a>] is used to indicate that a certain element
is taken from a specific markup language. For example, the prefix 'xhtml:'
indicates that an element is taken from [<a href="#XHTML">XHTML</a>].
Examples containing the namespace prefix 'xhtml:' are therefore assumed
to include a namespace declaration of the form xmlns:xhtml="...".</p>
<p>Characters are denoted using the notation used in the Unicode Standard,
that is, an optional U+ followed by their hexadecimal number, using at least
4 digits, such as "U+1234" or "U+10FFFD". In XML or HTML this could be
expressed as "&amp;#x1234;" or "&amp;#x10FFFD;".</p>
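<p>As a brief illustration of this notation, the fragment below declares the
'xhtml:' prefix and writes U+00E9 and U+10FFFD as numeric character
references. It is a sketch only: the namespace URI shown is the XHTML
namespace and is an assumption made for this example, since the report itself
leaves the declaration elided.</p>
<pre>
&lt;xhtml:p xmlns:xhtml="http://www.w3.org/1999/xhtml"&gt;
  caf&amp;#xE9;        &lt;!-- U+00E9 LATIN SMALL LETTER E WITH ACUTE --&gt;
  &amp;#x10FFFD;     &lt;!-- U+10FFFD, a private use code point --&gt;
&lt;/xhtml:p&gt;
</pre>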
<h2><a name="General">2. General Considerations</a></h2>
<p>There are several general points to consider when looking at the
interaction between character encoding and markup. </p>
<ul>
<li>Linearity of text vs. hierarchy of markup structure</li>
<li>Overlap of control codes and markup semantics</li>
<li>Markup <i>vs.</i> Styling</li>
<li>Coincidence of semantic markup and functions </li>
<li>Extensibility of markup</li>
</ul>
<h3 align="left"><a name="Linearity">2.1 Linearity versus Structure</a></h3>
<p align="left">Encoding text as a sequence of characters without further
information leads to a linear sequence, commonly called plain text. Character
follows character, without any particular structure. Markup, on the other
hand, defines a hierarchical structure for the text or data. In the case of
XML and most other, similar markup languages, the markup defines a tree
structure. While this tree structure is linearized for transmission in the
XML document, once the document has been parsed, the tree is available
directly.</p>
<p align="left">Operations that are easy to perform on trees are often
difficult to perform on linear sequences and vice versa. By separating
functionality between character encoding and markup appropriately, the
architecture becomes simpler, more powerful and longer-lasting.</p>
<p align="left">In particular, operations on hierarchical structures can
easily make sure that information is kept in context. Attributes assigned to
parts of a document are moved together with the associated part of the
document. Assigning an attribute to a part of a document limits the scope of
the attribute to that part of the document. Performing the same operations on
linear sequences of characters using control codes to set attributes and to
delimit their scope requires much more work and is error prone. Locating the
start or end of a span of text of the same attribute requires scanning
backwards and forwards for the embedded delimiter or control code. Moving or
editing text often results in mismatched control codes, so that an attribute
might suddenly apply to text it was not intended for.</p>
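<p align="left">The difference can be sketched with a right-to-left
embedding. The example is illustrative only. It assumes the HTML
<code>dir</code> attribute on an 'xhtml:span' element as the markup
counterpart of the control codes U+202B (RLE) and U+202C (PDF), which are
written as numeric character references so that they are visible, and it uses
"ABC DEF" as a stand-in for right-to-left text.</p>
<pre>
&lt;!-- Control codes: the scope runs from U+202B to the matching U+202C --&gt;
&lt;xhtml:p&gt;He said &amp;#x202B;ABC DEF&amp;#x202C; and left.&lt;/xhtml:p&gt;

&lt;!-- After " DEF&amp;#x202C; and" is cut out of the text, the embedding is
     never closed and now also covers " left." --&gt;
&lt;xhtml:p&gt;He said &amp;#x202B;ABC left.&lt;/xhtml:p&gt;

&lt;!-- Markup: the attribute is attached to the element, so its scope
     moves with the element and cannot be left open --&gt;
&lt;xhtml:p&gt;He said &lt;xhtml:span dir="rtl"&gt;ABC DEF&lt;/xhtml:span&gt; and left.&lt;/xhtml:p&gt;
</pre>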
<h3 align="left"><a name="Overlap">2.2 Overlap of Control Code and Markup
Semantics</a></h3>
<p align="left">When markup is not available, plain text may require control
characters. This is usually the case where plain text must contain some
scoping or attribute information in order to be legible, <i>i.e.</i> to be
able to transmit the same content between originator and receiver. Many of
these control characters have direct equivalents in particular markup
languages, since markup handles these concerns efficiently. If both the
characters and their markup equivalents can be present in the same text, the
question arises of which takes priority. It is therefore important to identify
and resolve these ambiguities at the time markup is first applied.</p>
<h3 align="left"><a name="Markup">2.3 Markup and Styling</a></h3>
<p align="left">Besides the basic character encoding and text markup there is
a third contributor to text functionality, namely styling. Markup is
concerned with the logical structure of the text or data, <i>e.g. </i>to
indicate sections, subsections, and headers in a document, or to indicate the
various fields of an address record. Styling is used to present the
information in various ways, <i>e.g.</i> in different fonts, different type
styles (italic, bold), different colors, <i>etc. </i>Some character codes do
not encode a generic character, but a styled character. Where these
characters are used, styling information is frozen, <i>i.e.</i> it is no
longer possible to alter the appearance of the text by applying style
information. However, there are many examples where a historically free
stylistic variation has over time become a semantic distinction that is
properly encoded as plain text. Sometimes, what is a free variation in some
contexts, implies strict semantic differentiation in others. In all such
instances, altering the appearance of the text by styling information would
irreparably alter the content of the text. This is of particular concern with
mathematical notation or systems for phonetic and phonemic transcription
which make extensive semantic use of styles on a character by character
basis.</p>
<h3 align="left"><a name="Coincidence">2.4 Coincidence of Markup and
Functions</a></h3>
<p align="left">Dealing with various functionalities on the markup level has
the additional advantage that in most cases, text portions that need some
particular attribute (or styling) are actually those text portions identified
by markup. A paragraph may be in French, a citation may need a bidi
embedding, a keyword may be in italics, a list number may be circled, and so
on. This makes it very efficient to associate those attributes with
markup.</p>
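<p align="left">A minimal sketch of the cases just listed, assuming XHTML
element and attribute names (<code>xml:lang</code>, <code>dir</code>,
'xhtml:q', 'xhtml:em'); other vocabularies may provide different but
equivalent markup. In each case the attribute is carried by an element that
the document structure needs anyway.</p>
<pre>
&lt;!-- a paragraph in French --&gt;
&lt;xhtml:p xml:lang="fr"&gt;Ceci est un paragraphe.&lt;/xhtml:p&gt;

&lt;!-- a citation that needs a bidi embedding, and a keyword that is
     typically rendered in italics --&gt;
&lt;xhtml:p&gt;The text reads &lt;xhtml:q dir="rtl"&gt;...&lt;/xhtml:q&gt;,
  where &lt;xhtml:em&gt;embedding&lt;/xhtml:em&gt; is the keyword.&lt;/xhtml:p&gt;
</pre>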
<p align="left">However, where local or point-like functionality is needed,
markup is <i>not</i> very efficient and its main benefit, easy manipulation
of scope, is not required. On the contrary, the intrusion of markup in the
middle of words can make search or sort operations more difficult. For these
cases expressing the information as character codes is not only a viable, but
often the preferred alternative, which needs to be considered in the design
of markup languages.</p>
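<p align="left">The following sketch contrasts the two approaches for a
point-like function, an optional hyphenation point inside a word. The
character U+00AD SOFT HYPHEN is written as a numeric character reference for
visibility. The element 'ex:shy' is purely hypothetical, invented for this
illustration, and is not part of any markup language discussed in this
report.</p>
<pre>
&lt;!-- character code: the word remains one uninterrupted run of text --&gt;
&lt;xhtml:p&gt;extra&amp;#xAD;ordinary&lt;/xhtml:p&gt;

&lt;!-- hypothetical element: the word is split into two text nodes, which a
     search or sort operation has to reassemble --&gt;
&lt;xhtml:p&gt;extra&lt;ex:shy/&gt;ordinary&lt;/xhtml:p&gt;
</pre>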
<h3 align="left"><a name="Extensibility">2.5 Extensibility of Markup</a></h3>
<p align="left">Character encoding works with a range of integers used as
character codes. This is extremely efficient, but has some limitations.
Markup, on the other hand, is much more extensible. Using technologies such
as XML Namespaces [<a href="#Namespace">Namespace</a>] and their application
in schema languages like [<a href="#XMLSchema">XML Schema</a>], various
vocabularies can be mixed.</p>
<h3><a name="Suitability">2.6 Suitability of Characters in Markup</a></h3>
<p>The suitability of a particular character for markup depends on its status
in the Unicode Standard, the nature of its behavior in text and the
availability of equivalent markup. Many format characters that are needed for
advanced plain text are not suitable for use with markup. <a
href="#Suitable">Section 3</a> gives a list and detailed descriptions.
However, not all format characters are unsuitable for use with markup. <a
href="#Format">Section 4</a> provides a list of format characters that are
suitable for use with markup and gives some discussion about their use. In
addition to format characters, the Unicode Standard also has compatibility
characters, some of which may be replaceable by suitable markup. These
characters are discussed in <a href="#Compatibility">Section 5</a>.</p>
<h2><a name="Suitable">3. Characters not Suitable for Use With Markup</a></h2>
<p>There are characters which are unsuitable in the context of markup in
XML/HTML and whose use is discouraged, because one or more of the following
conditions apply:</p>
<ul>
<li>They are deprecated in the Unicode Standard.</li>
<li>They are unsupportable without additional data.</li>
<li>They are difficult to handle because they are stateful.</li>
<li>They are better handled by markup.</li>
<li>They are undesirable because of conflict with equivalent markup.</li>
</ul>
<p><a href="#Charlist">Section 3.1</a> provides a list of such characters.
Sections <a href="#Line">3.2</a> through <a href="#OtherDeprecated">3.10</a>
discuss in more detail the following points for the discouraged
characters.</p>
<ul>
<li>Short description of semantics</li>
<li>Reason for inclusion in Unicode</li>
<li>Specific problems when used with markup</li>
<li>Other areas where problems may occur (<i>e.g.</i> plain text)</li>
<li>What kind of markup to use instead</li>
<li>What to do if detected in a particular context</li>
</ul>
<h3><a name="Charlist">3.1 Table of Characters Not Suitable for Use with
Markup</a></h3>
<p>The following table contains the characters currently considered not
suitable for use with markup in XML or HTML. (See however the <a
href="#Note">note</a> in the <a href="#Introduction">Introduction</a>.) They
may also be unsuitable for other markup or page layout languages. For
determining possible conflict this report uses the markup available in
HTML.</p>
<p align="center"><b>Table 3.1 Characters not suitable for use with
markup</b></p>
<table border="1" cellpadding="2" cellspacing="0" width="95%">
<tbody>
<tr>
<th align="left" bgcolor="#ccffcc" width="210"><p
align="left">Code points</p>
</th>
<th align="left" bgcolor="#ccffcc" width="273"><p
align="left">Names/Description</p>
</th>
<th align="left" bgcolor="#ccffcc" width="341"><p align="left">Short
Comment</p>
</th>
</tr>
<tr>
<td width="210">U+0340..U+0341</td>
<td width="273">Clones of grave and acute</td>
<td width="341">Deprecated in Unicode</td>
</tr>
<tr>
<td width="210">U+17A3, U+17D3</td>
<td width="273">Obsolete characters for Khmer</td>
<td width="341">Deprecated in Unicode</td>
</tr>
<tr>
<td width="210">U+2028..U+2029</td>
<td width="273">Line and paragraph separator</td>
<td width="341">Use &lt;xhtml:br /&gt;,
&lt;xhtml:p&gt;&lt;/xhtml:p&gt;, or equivalent</td>
</tr>
<tr>
<td width="210">U+202A..U+202E</td>
<td width="273">BIDI embedding controls <br>
(LRE, RLE, LRO, RLO, PDF)</td>
<td width="341">Strongly discouraged in [<a
href="#html4.01">HTML4.01</a>]</td>
</tr>
<tr>
<td width="210">U+206A..U+206B</td>
<td width="273">Activate/Inhibit Symmetric swapping</td>
<td width="341">Deprecated in Unicode</td>
</tr>
<tr>
<td width="210">U+206C..U+206D</td>
<td width="273">Activate/Inhibit Arabic form shaping</td>
<td width="341">Deprecated in Unicode</td>
</tr>
<tr>
<td width="210">U+206E..U+206F</td>
<td width="273">Activate/Inhibit National digit shapes</td>
<td width="341">Deprecated in Unicode</td>
</tr>
<tr>
<td width="210">U+FFF9..U+FFFB</td>
<td width="273">Interlinear annotation characters</td>
<td width="341">Use ruby markup [<a href="#Ruby">Ruby</a>]</td>
</tr>
<tr>
<td rowspan="2" width="210">U+FEFF</td>
<td width="273">as ZWNBSP</td>
<td width="341">Use U+2060 Word Joiner instead</td>
</tr>
<tr>
<td width="273">as Byte Order Mark</td>
<td width="341">Use only at the start of a file, not as part of
markup</td>
</tr>
<tr>
<td width="210">U+FFFC</td>
<td width="273">Object replacement character</td>
<td width="341">Use markup, e.g. HTML &lt;object&gt; or HTML
&lt;img&gt;</td>
</tr>
<tr>
<td width="210">U+1D173..U+1D17A</td>
<td width="273">Scoping for Musical Notation</td>
<td width="341">Use an appropriate markup language</td>
</tr>
<tr>
<td width="210">U+E0000..U+E007F</td>
<td width="273">Language Tag code points </td>
<td width="341">Use xhtml:lang or xml:lang</td>
</tr>
</tbody>
</table>
<p>Except for Line and Paragraph Separator, or the Byte Order Mark, it is
acceptable for browsers and similar user agents to ignore the presence of
discouraged characters in HTML or XML. It is up to authoring tools to ensure
proper conversion between these characters and equivalent markup where it
exists.</p>
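<p>As an illustration of how an authoring tool might locate the characters in
Table 3.1, the following is a minimal sketch in Python. The function name
<code>find_discouraged</code> and the short labels are assumptions made for
this example and are not part of any standard. The sketch only reports
positions; deciding whether a U+FEFF at the very start of a file is a
legitimate byte order mark is left to the caller.</p>

```python
# Ranges copied from Table 3.1; the labels are shortened for illustration.
DISCOURAGED = [
    (0x0340, 0x0341, "clone of grave/acute"),
    (0x17A3, 0x17A3, "obsolete Khmer character"),
    (0x17D3, 0x17D3, "obsolete Khmer character"),
    (0x2028, 0x2029, "line/paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated format character"),
    (0xFFF9, 0xFFFB, "interlinear annotation"),
    (0xFEFF, 0xFEFF, "ZWNBSP / byte order mark"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical scoping control"),
    (0xE0000, 0xE007F, "language tag code point"),
]


def find_discouraged(text):
    """Return (index, 'U+XXXX', label) for each discouraged character."""
    hits = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        for lo, hi, label in DISCOURAGED:
            if lo <= cp <= hi:
                hits.append((i, f"U+{cp:04X}", label))
                break
    return hits
```

<p>Characters from Table 4.1, such as U+200E LEFT-TO-RIGHT MARK, are
deliberately absent from the list and are therefore not reported.</p>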
<h3><a name="Line">3.2 Line and Paragraph Separator, U+2028..U+2029</a></h3>
<p><em>Short description</em>: The line and paragraph separator provide
unambiguous means to denote hard line breaks and paragraph delimiters in
plain text.</p>
<p><em>Reason for inclusion</em>: These characters were introduced into the
Unicode Standard to overcome the ambiguous and widely divergent use of
control codes for this purpose. See <i>Section
5.8, Newline Guidelines,</i> in [<a href="#Unicode">Unicode</a>].</p>
<p><em>Problems when used in markup</em>: Including these characters in
markup text does not work where it would duplicate the existing markup
commands for delimiting paragraphs and lines.</p>
<p><em>Problems with other uses</em>: The separator characters can also be
problematic when used in plain text, because legacy data is usually converted
code point for code point into Unicode and all receivers of Unicode plain
text effectively have to be able to interpret the existing use of control
codes for this purpose. As a result, fewer Unicode implementations support
these characters than would otherwise be the case.</p>
<p><em>Replacement markup</em>: In HTML, use &lt;xhtml:br /&gt; instead of
U+2028 and surround paragraphs by &lt;xhtml:p&gt; and &lt;/xhtml:p&gt;
instead of separating them with U+2029.</p>
<p><em>What to do if detected</em>: In a browser context, treat as white
space, or ignore. When received in an editing context, replace the character
by the corresponding markup. </p>
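<p>The replacement described above can be sketched as follows for an editing
context. This is only an illustration: the function name
<code>separators_to_markup</code> is hypothetical, unprefixed element names
are used for brevity, and empty paragraphs are silently dropped, which a
real tool might not want.</p>

```python
from html import escape


def separators_to_markup(text):
    """Wrap U+2029-delimited paragraphs in <p> and turn U+2028 into <br />."""
    # Splitting on PARAGRAPH SEPARATOR yields the paragraph bodies;
    # empty pieces (e.g. after a trailing separator) are skipped.
    paragraphs = [p for p in text.split("\u2029") if p]
    return "\n".join(
        # Escape markup-significant characters first, then replace
        # LINE SEPARATOR, which escape() leaves untouched.
        "<p>" + escape(p).replace("\u2028", "<br />") + "</p>"
        for p in paragraphs
    )
```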
<h3><a name="Bidi">3.3 Bidi Embedding Controls (LRE, RLE, LRO, RLO, PDF),
U+202A..U+202E</a></h3>
<p><em>Short description</em>: The bidi embedding controls are required to
supplement the Unicode Bidirectional Algorithm in plain text.</p>
<p><em>Reason for inclusion</em>: The Unicode Bidirectional Algorithm
unambiguously resolves the display direction for bidirectional text. It does
so by assigning all characters directional categories and then resolving
these in context. In a small number of circumstances this <i>implicit</i>
method does not produce satisfactory results and embedding controls are
needed to ensure that sender and receiver agree on the display direction for
a given text. See Unicode Standard Annex #9, The Bidirectional Algorithm <a
href="#UTR9">[UAX 9]</a>.</p>
<p><em>Problems when used in markup</em>: These characters duplicate
available markup, which is better suited to handle the stateful nature of
their effect. </p>
<p><em>Problems with other uses</em>: The embedding controls introduce a
state into the plain text, which must be maintained when editing or
displaying the text. Processes that are modifying the text without being
aware of this state may inadvertently affect the rendering of large portions
of the text, for example by removing a PDF.</p>
<p><em>Replacement markup</em>: The following table gives the replacement
markup:<br>
</p>
<blockquote>
<table border="1" cellspacing="0">
<tbody>
<tr>
<td bgcolor="#ccffcc" width="15"><b>Unicode</b></td>
<td bgcolor="#ccffcc" width="30%"><b>Equivalent markup</b></td>
<td bgcolor="#ccffcc" width="55%"><b>Comment</b></td>
</tr>
<tr>
<td width="15"><p>RLO</p>
</td>
<td width="30%">&lt;xhtml:bdo dir = "rtl"&gt;</td>
<td width="55%"> </td>
</tr>
<tr>
<td width="15"><p>LRO</p>
</td>
<td width="30%">&lt;xhtml:bdo dir = "ltr"&gt;</td>
<td width="55%"> </td>
</tr>
<tr>
<td width="15">PDF</td>
<td width="30%">&lt;/xhtml:bdo&gt;</td>
<td width="55%">when used to terminate RLO or LRO only, otherwise
ignore</td>
</tr>
<tr>
<td width="15">RLE</td>
<td width="30%">dir = "rtl"</td>
<td width="55%">attribute on block or inline element</td>
</tr>
<tr>
<td width="15">LRE</td>
<td width="30%">dir = "ltr"</td>
<td width="55%">attribute on block or inline element</td>
</tr>
</tbody>
</table>
</blockquote>
<p>For details on bidi markup, please see Section 8.2 of HTML [<a
href="#HTML4.0-8.2">HTML 4.0-8.2</a>]. The text of HTML 4.0 gives this
recommendation: </p>
<blockquote>
<p><em><strong>Using HTML directionality markup with Unicode
characters.</strong> Authors and designers of authoring software should be
aware that conflicts can arise if the <a
href="http://www.w3.org/TR/html401/struct/dirlang.html#adef-dir"
class="noxref"><samp class="ainst">dir</samp></a> attribute is used on
inline elements (including <a
href="http://www.w3.org/TR/html401/struct/dirlang.html#edef-BDO"
class="noxref"><samp class="einst">BDO</samp></a>) concurrently with the
corresponding <a rel="biblioentry" href="#Unicode"
class="normref">[UNICODE]</a> formatting characters. Preferably one or the
other should be used exclusively. The markup method offers a better
guarantee of document structural integrity and alleviates some problems
when editing bidirectional HTML text with a simple text editor, but some
software may be more apt at using the <a rel="biblioentry" href="#Unicode"
class="normref">[UNICODE]</a> characters. If both methods are used, great
care should be exercised to insure proper nesting of markup and directional
embedding or override, otherwise, rendering results are undefined.</em></p>
</blockquote>
<p>This document goes beyond HTML and recommends that <i>only</i> the markup
should be used.</p>
<blockquote>
<p><b>Note:</b> The interpretation of how to handle directionality markup
for block level elements differs in different versions of [<a
href="#CSS">CSS</a>].</p>
</blockquote>
<p><em>What to do if detected</em>: In a browser context, ignore. When
received in an editing context, replace the characters by the appropriate
markup. </p>
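<p>Because the embedding controls are stateful, converting them in an
editing context calls for a stack. The following Python sketch follows the
replacement table above under some stated assumptions: RLE and LRE are
mapped to a <code>dir</code> attribute on a &lt;span&gt; element (the table
only says "block or inline element"), unprefixed element names are used, an
unmatched PDF is dropped, and anything still open at the end of the text is
closed there. The name <code>bidi_controls_to_markup</code> is
hypothetical.</p>

```python
from html import escape

# Opening controls and the (element, direction) chosen for each.
OPENERS = {
    "\u202A": ("span", "ltr"),  # LRE
    "\u202B": ("span", "rtl"),  # RLE
    "\u202D": ("bdo", "ltr"),   # LRO
    "\u202E": ("bdo", "rtl"),   # RLO
}
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def bidi_controls_to_markup(text):
    """Replace bidi embedding controls with nested dir/bdo markup."""
    out = []
    stack = []  # names of the elements that are still open
    for ch in text:
        if ch in OPENERS:
            tag, direction = OPENERS[ch]
            out.append(f'<{tag} dir="{direction}">')
            stack.append(tag)
        elif ch == PDF:
            if stack:  # a PDF with nothing to terminate is ignored
                out.append(f"</{stack.pop()}>")
        else:
            out.append(escape(ch))
    # Close whatever is still open so the result stays well-formed.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)
```

<p>For example, the input "a", RLO, "bc", PDF, "d" yields
<code>a&lt;bdo dir="rtl"&gt;bc&lt;/bdo&gt;d</code>. The sketch does not
model paragraph boundaries, at which embeddings also end in plain text.</p>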
<h3><a name="Deprecated">3.4 Deprecated Formatting Characters,
U+206A..U+206F</a></h3>
<p><em>Short description</em>: These characters are deprecated. They were
originally intended to allow explicit activation of contextual shaping,
numeric digit rendering and symmetric swapping.</p>
<p><em>Reason for inclusion</em>: These characters were retained from draft
versions of ISO 10646.</p>
<p><em>Problems when used in markup</em>: The processing model for these
characters is not supported in markup.</p>
<p><em>Problems with other uses</em>: The Unicode Standard requires that
symmetric swapping, contextual shaping, and alternate digit shapes are
enabled by default and no longer supports inhibiting any of them by use of
these character codes. The most likely effect of their occurrence in
generated text would be that of a 'garbage' character.</p>
<p><em>Conversion for use with markup</em>: Apply the appropriate conversion
to bring the data stream in line with the Unicode text model for
bidirectional text and cursively-connected scripts.</p>
<p><em>What to do if detected</em>: When received by a browser as part of
marked up text, they may be ignored. When received in an editing context,
they may be removed, possibly with a warning. Alternatively, an appropriate
conversion from the legacy text model may be provided. This will most likely
be limited to applications directly interfacing with and knowledgeable of the
particular legacy implementation that inspired these characters.</p>
<h3><a name="BOM">3.5 Byte Order Mark, ZWNBSP, U+FEFF</a></h3>
<p><em>Short description</em>: U+FEFF has two functions. It is formally known
as <span style="font-variant: small-caps;">zero width no-break space</span>
(ZWNBSP), and can act as a word joiner, but its primary use is as <i>byte
order mark (BOM)</i>, to indicate in a file signature at the start of a file
that a file is in a particular Unicode encoding form and of a particular byte
order. Using U+FEFF as a word joiner in new data is deprecated as of [<a
href="#Unicode32">Unicode3.2</a>] in favor of U+2060 <span
style="font-variant: small-caps;">word joiner</span> (WJ). The use as byte
order mark remains unaffected.</p>
<p><em>Reason for inclusion</em>: Originally included in Unicode for the sole
purpose of indicating byte order or use in file signatures, the character
acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and
Unicode. When used as a byte order mark the character is placed at the
beginning of a file. If the recipient reads it as FEFF, then the byte orders
of sender and receiver match. If the recipient reads it as FFFE (a
noncharacter code point), then the sender used the opposite byte order from
the recipient, and the recipient needs to invert the byte order or refuse to
read the file. When used as a ZWNBSP the character is intended to prevent breaks
between adjacent characters. This function is now provided by U+2060 <span
style="font-variant: small-caps;">word joiner</span> (WJ) making it
unnecessary to insert U+FEFF in the middle of a file. For more information
see Chapter 16 of [<a href="#Unicode">Unicode</a>].</p>
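<p>The byte order check described above can be illustrated with a small
Python sketch for UTF-16 data. The function name
<code>utf16_byte_order</code> is an assumption for this example. The sketch
looks only at the first two bytes, so it does not distinguish UTF-16 from
UTF-32, whose little-endian signature begins with the same two bytes.</p>

```python
def utf16_byte_order(data):
    """Guess the byte order of UTF-16 data from a leading U+FEFF signature."""
    if data[:2] == b"\xFE\xFF":
        return "big-endian"     # bytes read in order give FEFF
    if data[:2] == b"\xFF\xFE":
        return "little-endian"  # bytes read in order give FFFE, so swap
    return None                 # no signature present
```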
<p><em>Problems when used in markup</em>: Using U+FEFF as ZWNBSP makes it
impossible to distinguish it from the case where a byte order mark was left
in the middle of a file inadvertently due to incorrect splicing. U+FEFF can
and in some cases (XML encoded in UTF-16) must be used at the start of a file
containing markup, but as a signature, this is not part of actual markup or
marked-up content. Some older versions of browsers and parsers may not
correctly recognize U+FEFF at the start of a file encoded in UTF-8. For
details of how U+FEFF participates in encoding detection of XML files, see
Appendix F of <a href="#xml10">[XML 1.0]</a>. </p>
<p><em>Problems with other uses</em>: The use of U+FEFF as ZWNBSP is
also problematic when used in plain text, and has been deprecated for that
purpose in favor of U+2060 <span style="font-variant: small-caps;">word
joiner</span>. The use of U+FEFF in file signatures to indicate byte order is
the only recommended use of this character.</p>
<p><em>Replacement markup</em>: None. In locations other than the beginning
of a text file, U+FEFF can be removed or replaced by U+2060 in an editing
environment.</p>
<p><em>What to do if detected</em>: When received by a browser as part of
marked-up text, treat depending on location. At the start of an external
entity, treat as byte order mark (i.e. as part of the character encoding, not
as part of the parsed character stream, see e.g. Section 4.3.3 of <a
href="#xml10">[XML 1.0]</a>). Otherwise, assume it is older data using it as
ZWNBSP. When receiving plain text in an editing environment, editors may take
one or more of several actions: replace ZWNBSP in the middle of a file with
WJ or issue a warning to the user.</p>
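<p>One possible treatment of already decoded text in an editing environment
is sketched below in Python. The function name
<code>normalize_feff</code> is hypothetical. The sketch assumes that a
leading U+FEFF is a signature and that any later occurrence is legacy
ZWNBSP; the returned count lets the caller decide whether to warn the
user.</p>

```python
def normalize_feff(text):
    """Drop a leading U+FEFF and replace later ones with U+2060 WORD JOINER.

    Returns the cleaned text and the number of replacements made.
    """
    # A U+FEFF at the very start is treated as a byte order mark.
    if text.startswith("\uFEFF"):
        text = text[1:]
    # Anything left is assumed to be older data using U+FEFF as ZWNBSP.
    replaced = text.count("\uFEFF")
    return text.replace("\uFEFF", "\u2060"), replaced
```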
<h3><a name="Interlinear">3.6 Interlinear Annotation Characters,
U+FFF9..U+FFFB</a></h3>
<p><em>Short description</em>: The interlinear annotation characters are used
to delimit interlinear annotations in certain circumstances. They are
intended to provide text anchors and delimiters for interlinear annotation
for in-process use and are not intended for interchange.</p>
<p><em>Reason for inclusion</em>: The interlinear annotation characters were
included in Unicode only in order to reserve code points for very frequent
application-internal use. The interlinear annotation characters are used to
delimit interlinear annotations in contexts where other delimiters are not
available, and where non-textual means exist to carry formatting information.
Many text-processing applications store the text and the associated markup
(or in some cases styling information) of a document in separate structures.
The actual text is kept in a single linear structure; additional information
is kept separately with pointers to the appropriate text positions. This is
called out-of-band information. The overall implementation makes sure that
these two structures are kept in sync. If the text contains interlinear
annotations, it is extremely helpful for implementations to have delimiters
in the text itself; even though delimiters are not otherwise used for style
markup. With this method, and unlike the case of the object replacement
character, all textual information can remain in the standard text stream,
but any additional formatting information is kept separately. In addition,
the Interlinear Annotation Anchor serves as a placeholder for formatting
information for the whole annotation object, the same way a paragraph mark
can be a placeholder to attach paragraph formatting information.</p>
<p><em>Problems when used in markup</em>: Including interlinear annotation
characters in marked-up text does not work because the additional formatting
information (how to position the annotation,...) is not available.</p>
<p><em>Problems with other uses</em>: The interlinear annotation characters
are also problematic when used in plain text, and are not intended for that
purpose. In particular, on older display systems that simply ignore or
replace the Interlinear Annotation Characters, the meaning of the text may be
changed.</p>
<p><em>Replacement markup</em>: The markup to be used in place of the
Interlinear Annotation Characters depends on the formatting and nature of the
interlinear annotation in question. For ruby, please see [<a
href="#Ruby">Ruby</a>].</p>
<p><em>What to do if detected</em>: When received by a browser as part of
marked-up text, they may be ignored. When receiving plain text in an editing
environment, editors may take one or more of several actions: remove U+FFF9
together with U+FFFA, the following U+FFFB, and all characters between them;
ignore U+FFF9 and turn U+FFFA and U+FFFB into "[" and "]" respectively, or
into similar characters; issue a warning to the user; or tentatively convert
into appropriate ruby markup for further editing and formatting by the
user.</p>
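<p>Two of the actions listed above can be sketched in Python as follows. The
function names are hypothetical, and the ruby structure shown (an
unprefixed &lt;ruby&gt; element with &lt;rb&gt; and &lt;rt&gt; children) is
only one tentative rendering that a user would be expected to review.
Nested annotations and annotations with more than one separator are not
handled and are left unchanged by the ruby conversion.</p>

```python
import re
from html import escape

# U+FFF9 anchor, base text, U+FFFA separator, annotation text, U+FFFB terminator.
ANNOTATION = re.compile(
    "\uFFF9([^\uFFF9-\uFFFB]*)\uFFFA([^\uFFF9-\uFFFB]*)\uFFFB"
)


def annotations_to_brackets(text):
    """Drop U+FFF9 and show the annotation in square brackets."""
    return text.translate({0xFFF9: None, 0xFFFA: "[", 0xFFFB: "]"})


def annotations_to_ruby(text):
    """Tentatively convert simple annotations into ruby markup."""
    def to_ruby(match):
        base, note = match.group(1), match.group(2)
        return f"<ruby><rb>{base}</rb><rt>{note}</rt></ruby>"

    # Escape first; escape() leaves the annotation characters untouched.
    return ANNOTATION.sub(to_ruby, escape(text))
```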
<h3><a name="Object">3.7 Object Replacement Character, U+FFFC</a></h3>
<p><em>Short description</em>: The object replacement character is used to
stand in place of an object (e.g. an image) included in a text.</p>
<p><em>Reason for inclusion</em>: The object replacement character was
included in Unicode only in order to reserve a codepoint for a very frequent
application-internal use. Many text-processing applications store the text
and the associated markup (or in some cases styling information) of a
document in separate structures. The actual text is kept in a single linear
structure; additional information is kept separately with pointers to the
appropriate text positions. The overall implementation makes sure that these
two structures are kept in sync. If the text contains objects such as images,
it is extremely helpful for implementations to have a sentinel in the text
itself; any additional information is kept separately.</p>
<p><em>Problems when used in markup</em>: Including an object replacement
character in markup text does not work because the additional information
(what object to include,...) is not available.</p>
<p><em>Problems with other uses</em>: The object replacement character is
also problematic when used in plain text, because there is no way in plain
text to provide the actual object information or a reference to it.</p>
<p><em>Replacement markup</em>: The markup to be used in place of the Object
Replacement Character depends on the object in question and the markup
context it is used in. Typical cases are &lt;xhtml:img src='...' /&gt;,
&lt;xhtml:object ...&gt;, or &lt;xhtml:applet ...&gt;. These constructs allow
providing all additional information needed to identify and use the object in
question.</p>
<p><em>What to do if detected</em>: Browsers may ignore this character. When
received in an editing context, if the actual object is accessible, editors
may either replace the character by the appropriate markup for that object,
or otherwise remove it, ideally providing a warning.</p>
<h3><a name="Musical">3.8 Musical Controls</a>, U+1D173..U+1D17A</h3>
<p><em>Short description</em>: A series of characters for controlling scope
in musical notation.</p>
<p><em>Reason for inclusion</em>: These characters designate the start and
end of common musical constructs. Full musical layout depends on additional
information, for example pitch, that cannot be encoded using Unicode.
However, many musical symbols may be depicted in isolation (and without
assigning pitch) as part of a textual discussion of music. Plain text use of
Unicode characters is primarily intended for this latter purpose. The scoping
operators can be used to support limited renderings of beams, slurs, phrases,
etc. in this context. However, in the context of markup languages, musical
scoring calls for a dedicated markup language (analogous to MathML) which
would be expected to contain markup for these constructs.</p>
<p><em>Problems when used in markup</em>: These characters duplicate
information that can in principle be expressed in markup.</p>
<p><em>Problems with other uses</em>: Their special code range allows them to
be easily filtered, but applications that do not expect them will treat them
as garbage characters.</p>
<p><em>Replacement markup</em>: Replace with equivalent markup if
available.</p>
<p><em>What to do if detected</em>: Browsers may ignore these characters.
When received in an editing context, editors may remove or replace them by
equivalent markup.</p>
<h3><a name="Language">3.9 Language Tag Characters</a>, U+E0000..U+E007F</h3>
<p><em>Short description</em>: A series of characters for expressing language
tags, based on existing standards for language tags using the rules in
Chapter 16 of [<a href="#Unicode">Unicode</a>].</p>
<p><em>Reason for inclusion</em>: These characters allow in-band language
tagging in situations where full markup is not available, while allowing easy
filtering by applications that do not support them. They were solely included
for the benefit of those Internet protocols, such as ACAP, which require a
standard mechanism for marking language in UTF-8 strings, and at the same
time to avoid the use of other tagging schemes that relied on specific
details of the encoding form used.</p>
<p><em>Problems when used in markup</em>: These characters duplicate
information that can be expressed in markup.</p>
<p><em>Problems with other uses</em>: Their special code range allows them to
be easily filtered, but applications that do not expect them will treat them
as garbage characters.</p>
<p><em>Replacement markup</em>: Replace with equivalent language markup. XML
and XHTML have the xml:lang attribute. HTML has the lang attribute. These
attributes follow different scoping rules than the tag characters, therefore
this replacement will generally not be a simple 1:1 substitution.</p>
<p><em>What to do if detected</em>: Browsers may ignore these characters.
When received in an editing context, editors may remove or replace them by
equivalent markup.</p>
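<p>As a first step toward such a replacement, an editor needs to recover the
language tags and strip the tag characters. The Python sketch below relies
on the fact that the tag characters U+E0020..U+E007E mirror ASCII at an
offset of 0xE0000 and that a tag is introduced by U+E0001 LANGUAGE TAG. The
function name <code>extract_language_tags</code> is an assumption, and the
sketch deliberately does not try to map tag scope onto element scope, which
is the part that is not a simple 1:1 substitution.</p>

```python
import re

# U+E0001 LANGUAGE TAG followed by tag characters that mirror ASCII.
TAG_RUN = re.compile("\U000E0001([\U000E0020-\U000E007E]*)")
# Any code point in the language tag block, including U+E007F CANCEL TAG.
ANY_TAG = re.compile("[\U000E0000-\U000E007F]")


def extract_language_tags(text):
    """Return (text without tag characters, list of decoded language tags)."""
    tags = [
        "".join(chr(ord(c) - 0xE0000) for c in match.group(1))
        for match in TAG_RUN.finditer(text)
    ]
    return ANY_TAG.sub("", text), tags
```

<p>The decoded strings, for example "fr", could then be offered as candidate
values for an xml:lang or lang attribute on a suitable enclosing
element.</p>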
<h3><a name="OtherDeprecated">3.10 Other Characters Deprecated in
Unicode</a></h3>
<p><em>Short description</em>: The Unicode Character Database [<a
href="#UnicodeData">UnicodeData</a>] lists all characters that have been
deprecated in [<a href="#Unicode">Unicode</a>]. This list may grow (slowly)
over time. Deprecated characters remain valid characters forever, but their
use is strongly discouraged. Deprecation of characters is applied only in
exceptional circumstances. It is never the result of historical changes of a
writing system: characters no longer in current, modern use are retained in
Unicode, as they are needed for the representation of historical
documents.</p>
<p><em>Reason for inclusion</em>: Usually, characters that are deprecated
were never needed, but were inadvertently added to the Unicode Standard,
perhaps based on incomplete information available at the time of encoding.</p>
<p><em>Problems when used in markup</em>: Except where noted elsewhere in
this document, their presence in markup presents the same problems as in
plain text, usually that of an unnecessary duplicate encoding.</p>
<p><em>Problems with other uses</em>: Depends on the character and the reason
for its deprecation. For more information see [<a
href="#Unicode">Unicode</a>].</p>
<p><em>Conversion for use with markup</em>: For deprecated characters not
discussed elsewhere in this document, see the relevant descriptions of those
characters in [<a href="#Unicode">Unicode</a>] for information on the
recommended alternatives.</p>
<p><em>What to do if detected</em>: Unless a specific recommendation is
given elsewhere, deprecated characters should not be ignored; where possible,
in an editing environment, a preferred alternate encoding may be substituted.</p>
<h2><a name="Format">4. Format Characters Suitable for Use with
Markup</a></h2>
<p>The following table contains format characters that do not exhibit the
problems discussed at the start of <a href="#Suitable">Section 3</a>. Despite
their apparent relation to or similarity with characters in table <a
href="#Charlist">3.1</a>, they are considered suitable for use with markup.
It is not acceptable for user agents to ignore the characters in table 4.1.
For a description of these characters see [<a
href="#Unicode">Unicode</a>].</p>
<p align="center"><b>Table 4.1: Some characters that affect text format but
are suitable for use with markup</b></p>
<table border="1" cellpadding="2" cellspacing="0" width="95%">
<tbody>
<tr>
<th align="left" bgcolor="#ccffcc" width="198"><p align="left">Code
points</p>
</th>
<th align="left" bgcolor="#ccffcc" width="362"><p
align="left">Names/Description</p>
</th>
<th align="left" bgcolor="#ccffcc" width="280"><p align="left">Short
Comment</p>
</th>
</tr>
<tr>
<td width="198">U+00A0</td>
<td width="362">No-break Space</td>
<td width="280">Line break control</td>
</tr>
<tr>
<td width="198">U+00AD</td>
<td width="362">Soft Hyphen</td>
<td width="280">Line break control</td>
</tr>
<tr>
<td width="198">U+034F</td>
<td width="362">Combining Grapheme Joiner</td>
<td width="280">Used in sorting</td>
</tr>
<tr>
<td width="198">U+0600</td>
<td width="362">Arabic Number Sign</td>
<td width="280">Subtending mark</td>
</tr>
<tr>
<td width="198">U+0601</td>
<td width="362">Arabic Sign Sanah</td>
<td width="280">Subtending mark</td>
</tr>
<tr>
<td width="198">U+0602</td>
<td width="362">Arabic Footnote Marker</td>
<td width="280">Subtending mark</td>
</tr>
<tr>
<td width="198">U+0603</td>
<td width="362">Arabic Sign Safha</td>
<td width="280">Subtending mark</td>
</tr>
<tr>
<td width="198">U+06DD</td>
<td width="362">Arabic End of Ayah</td>
<td width="280">Enclosing mark</td>
</tr>
<tr>
<td width="198">U+070F</td>
<td width="362">Syriac Abbreviation Mark (SAM)</td>
<td width="280">Supertending mark</td>
</tr>
<tr>
<td width="198">U+0F0C</td>
<td width="362">Tibetan Mark Delimiter Tsheg Bstar</td>
<td width="280">Non-breaking form of 0F0B</td>
</tr>
<tr>
<td width="198">U+115F..U+1160</td>
<td width="362">Hangul Jamo Fillers</td>
<td width="280">Filler</td>
</tr>
<tr>
<td width="198">U+180B..U+180E</td>
<td width="362">Mongolian Variation Selectors (FVS1..FVS3), Mongolian
Vowel Separator</td>
<td width="280">Required for Mongolian</td>
</tr>
<tr>
<td width="198">U+200B</td>
<td width="362">Zero-width Space</td>
<td width="280">Line break control</td>
</tr>
<tr>
<td width="198">U+200C..U+200D</td>
<td width="362">Zero-width Join Controls (ZWJ and ZWNJ)</td>
<td width="280">Required for Persian and many Indic scripts, among others</td>
</tr>
<tr>
<td width="198">U+200E..U+200F</td>
<td width="362">Implicit Directional Marks (LRM and RLM)</td>
<td width="280">LRM and RLM are allowed</td>
</tr>
<tr>
<td width="198">U+2011</td>
<td width="362">Non-breaking Hyphen</td>
<td width="280">Line break control</td>
</tr>
<tr>
<td width="198">U+202F</td>
<td width="362">Narrow No-break Space</td>
<td width="280">Line break control/Mongolian</td>
</tr>
<tr>
<td width="198">U+2044</td>
<td width="362">Fraction Slash</td>
<td width="280">Or use markup (MathML)</td>
</tr>
<tr>
<td width="198">U+2060</td>
<td width="362">Word Joiner</td>
<td width="280">Use for that purpose instead of U+FEFF ZWNBSP</td>
</tr>
<tr>
<td width="198">U+2061..U+2064</td>
<td width="362">Invisible Mathematical Operators</td>
<td width="280">Mathematical use</td>
</tr>
<tr>
<td width="198">U+2FF0..U+2FFB</td>
<td width="362">Ideographic Character Description</td>
<td width="280">Graphic characters (not controls)</td>
</tr>
<tr>
<td width="198">U+303E</td>
<td width="362">Ideographic Variation Indicator</td>
<td width="280">Graphic character (not a control)</td>
</tr>
<tr>
<td width="198">U+FFA0</td>
<td width="362">Halfwidth Hangul Filler</td>
<td width="280">Filler, not generally required</td>
</tr>
<tr>
<td width="198">U+FE00..U+FE0F</td>
<td width="362">Variation Selectors</td>
<td width="280">Modify graphic characters</td>
</tr>
<tr>
<td width="198">U+E0100..U+E01EF</td>
<td width="362">Variation Selectors</td>
<td width="280">Modify graphic characters</td>
</tr>
</tbody>
</table>
<p>The following subsections briefly discuss some of the characters from the
above list, particularly those that affect more than their immediately
adjacent neighbors. Please see the Unicode Standard [<a
href="#Unicode">Unicode</a>] for full details.</p>
<h3><a name="Subtending">4.1 Subtending Marks</a></h3>
<p>Subtending marks are needed to represent a common feature in the Arabic
and Syriac scripts where a mark can be placed below a range of characters,
for example below a sequence of digits, to indicate a year. The Syriac
abbreviation mark is placed above a series of characters, making it
technically a supertending mark, and the <span
style="font-variant: small-caps;">ARABIC END OF AYAH</span> is an enclosing
mark. In the character stream, a subtending mark precedes the affected
characters. The end of the affected range of characters is defined implicitly,
usually by the first non-alphanumeric character. </p>
<p align="left">Unlike subtending marks, the scope of combining enclosing
marks, such as <span
style="text-transform: uppercase; font-variant: small-caps;">combining
enclosing circle,</span> is limited to the preceding default grapheme
cluster. For details on grapheme clusters see Unicode Standard Annex #29:
"Text Boundaries" [<a href="#UAX29">UAX 29</a>].</p>
<p align="left">There is currently no existing markup that can represent the
scoping and layout functions defined by these characters, so they cannot be
substituted. It is unresolved to what degree intervening markup affects the
scope of these marks.</p>
<h3 align="left"><a name="Fraction">4.2 Fraction Slash</a></h3>
<p align="left">The fraction slash is used between sequences of decimal
digits to form fractions. Whether the resulting fraction has a horizontal or
diagonal fraction line is unspecified. The fallback is to leave the digits
unchanged and display a regular slash. In order to separate a digit from a
following fraction, as in 1¾, the use of <span
style="font-variant: small-caps;">U+2009 THIN SPACE</span> is recommended.</p>
<p align="left">For better control of fractions the use of [<a
href="#MathML">MathML</a>] is suggested where appropriate.</p>
<h3><a name="Variation">4.3 Variation Selectors</a></h3>
<p>A variation selector is intended to cause a specific variant form (or
range of variant forms) when applied to a base character. For a variation
selector to have an effect it must immediately follow its base character.
Only pre-determined combinations of selected base characters and specific
variation selectors have a defined effect. All other combinations are
ill-formed and are to be ignored. The list of standardized combinations is
documented in the Unicode Character Database, see [<a
href="#Variants">Variants</a>]. In addition to the 256 generic variation
selectors, there are 3 Mongolian <i>free variation selectors</i>. They
function in all other ways like variation selectors, except they only apply
to base characters from the Mongolian script. Since Mongolian, like Arabic,
has positional character shapes, the variations are limited to particular
shaping contexts.</p>
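<p>The positional rule above, that a selector must immediately follow its
base character, can be illustrated with a short Python sketch. The function
names are hypothetical. The generic selectors are taken here as
U+FE00..U+FE0F plus U+E0100..U+E01EF, which together give the 256 selectors
mentioned in the text (16 + 240). The sketch only pairs selectors with the
preceding character; it does not check whether a pair is one of the
standardized combinations.</p>

```python
def is_variation_selector(ch):
    """True for the 256 generic variation selectors."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def is_mongolian_fvs(ch):
    """True for the three Mongolian free variation selectors."""
    return 0x180B <= ord(ch) <= 0x180D


def variation_sequences(text):
    """Return (base, selector) pairs for each selector that follows a character."""
    pairs = []
    for prev, cur in zip(text, text[1:]):
        if is_variation_selector(cur) or is_mongolian_fvs(cur):
            pairs.append((prev, cur))
    return pairs
```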
<h3><a name="Ideographic">4.4 Ideographic Description Characters</a></h3>
<p>Ideographic Description Characters are included in the Unicode Standard as
a means to indicate the composition of ideographs from a combination of
pieces (terms), where each piece or term is either a Unicode character or
is itself composed of further pieces. Ordinarily the result would be a
human-readable description of a
character, perhaps one for which a font is not available. However, at least
some vendors are interested in automatic conversion of these sequences into
single ideographs.</p>
<h3><a name="Invisible">4.5 Invisible Mathematical Operators</a></h3>
<p>These characters are needed to convey the intended meaning of a
mathematical expression to an automated parser whenever two elements are
simply written next to each other. See Unicode Technical Report #25: "Unicode
Support for Mathematics" [<a href="#UTR25">UTR25</a>] for more details.</p>
<h3><a name="LineBreak">4.6 Line Break Controls</a></h3>
<p>Most of these characters prevent line breaks adjacent to them, but ZWSP
and SHY provide invisible line break opportunities. The detailed function of
these characters is described in Unicode Standard Annex #14: "Line Breaking
Properties" [<a href="#UAX14">UAX14</a>]. While high-end applications may be
able to deduce line breaking opportunities automatically solely with the help
of very generic markup or styling properties, the use of these characters
currently provides the most reliable and straightforward way to control line
breaking and hyphenation. Note that [<a href="#html4.01">HTML4.01</a>] uses
U+00A0 NO-BREAK SPACE also as a "hard space" (i.e. a space with a fixed
width), something that is not part of its character semantics in [<a
href="#Unicode">Unicode</a>].</p>
<p>U+2011 NON-BREAKING HYPHEN (NBHY) is used to encode a hyphen that does not
provide a line break opportunity. In several languages, the sequence &lt;SHY,
NBHY&gt; may be used to handle special line breaking behavior for explicit
hyphens; see [<a href="#UAX14">UAX14</a>].</p>
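<p>The difference between characters that provide a break opportunity and
those that prevent one can be illustrated with a deliberately simplified
sketch. It is not an implementation of the algorithm in [UAX14]: it only looks
at pairs of adjacent characters, the choice of characters is an assumption
made for illustration, and the function name is hypothetical.</p>

```python
SPACE, HYPHEN = " ", "-"
ZWSP, SHY = "\u200b", "\u00ad"                  # invisible break opportunities
NBSP, NBHY, WJ = "\u00a0", "\u2011", "\u2060"   # no break opportunity

# A break is allowed after these characters ...
BREAK_AFTER = {SPACE, HYPHEN, ZWSP, SHY}
# ... unless the next character glues itself to what precedes it.
GLUE = {NBSP, WJ}


def break_opportunities(text):
    """Return every index i where a line may break before text[i]."""
    found = []
    for i in range(1, len(text)):
        before, after = text[i - 1], text[i]
        if before in BREAK_AFTER and after not in GLUE:
            found.append(i)
    return found


print(break_opportunities("foo\u00a0bar baz"))
print(break_opportunities("co\u00adop"))
```

<p>In this sketch NBSP and NBHY never yield a break because they are simply
absent from the set of characters after which a break is allowed.</p>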
<h3><a name="Fillers">4.7 Hangul Fillers</a></h3>
<p>These should not be needed except for texts that need to have a fixed
number of jamos per Korean syllable block. See the description of Korean
Syllable Blocks in [<a href="#Unicode">Unicode</a>].</p>
<h2><a name="Compatibility">5. Characters with Compatibility Mappings</a></h2>
<p>The Unicode Standard provides compatibility mappings for a number of
characters. Compatibility mappings indicate a relationship to another
character, but the exact nature of the relationship varies. In some cases the
relationship means "is based on"; in other cases it denotes a property.
When plain text is marked up, it may make sense to map some of these
characters to a combination of their compatibility equivalents <em
style="font-style: normal;">and</em> suitable markup. It is important to
understand the nature of the distinctions between characters and their
compatibility equivalents and the context in which these distinctions matter.
It is never advisable to apply compatibility mappings indiscriminately. This
section provides guidance on when and how to apply compatibility mappings in
the case of importing text from non-XML (non-marked-up) sources. The section
is organized by the "compatibility tag" associated with each compatibility
mapping.</p>
<h3><a name="Overview">5.1 Overview</a></h3>
<p>The following table gives an overview of the various compatibility
characters, organized by "compatibility tag". The first column, <i>Tag
value,</i> contains the value of the "compatibility tag" from the Unicode
Character Database [<a href="#UnicodeData">UnicodeData</a>]. Although these
tags use "&lt;" and "&gt;", they do not appear as such in markup and should
not be confused with XML tags. <em>Code range</em> indicates a further
breakdown by code points. <i>Action</i> summarizes the recommended action to be
taken whenever markup is first applied to non-XML text. Each entry indicates
whether the characters can be substituted using the compatibility equivalent
according to Normalization Form KC of [<a href="#UAX15">UAX 15</a>], can be
replaced by equivalent markup where available, or should be retained. For
some cases, instead of or in addition to markup, style information [<a
href="#CSS">CSS</a>] is needed. <i>Description and usage</i> provides
additional information. Sections <a href="#List">5.3</a> through <a
href="#Superscripts">5.6</a> provide additional information for some of these
sets of compatibility characters including detailed recommended actions.</p>
<p align="center"><b>Table 5.1 Characters with compatibility mappings</b></p>
<table border="1" cellpadding="2" cellspacing="0" width="95%">
<tbody>
<tr>
<th align="left" bgcolor="#ccffcc" width="80">Tag value</th>
<th align="left" bgcolor="#ccffcc" width="97">Code range</th>
<th align="left" bgcolor="#ccffcc" width="83">Action</th>
<th align="left" bgcolor="#ccffcc">Description and usage</th>
</tr>
<tr>
<td valign="top" width="80">&lt;circled&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Circled letters and digits used for list
item markers, and in running text</td>
</tr>
<tr>
<td rowspan="12" valign="top" width="80">&lt;compat&gt;</td>
<td valign="top" width="97">2002..200A</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Fixed width spaces</td>
</tr>
<tr>
<td valign="top" width="97">2100..2101</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Variant letter forms that are used as
symbols</td>
</tr>
<tr>
<td valign="top" width="97">2105..2106</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Variant letter forms that are used as
symbols</td>
</tr>
<tr>
<td valign="top" width="97">2121, 213B</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">For use as single code point in vertical
layout</td>
</tr>
<tr>
<td valign="top" width="97">2160..217F</td>
<td valign="top" width="83">retain, or use list item marker style, or
normalize</td>
<td valign="top" width="572">For use as single code point in vertical
layout, or as list item marker</td>
</tr>
<tr>
<td valign="top" width="97">2474..249B</td>
<td valign="top" width="83">retain, or use list item marker style, or
normalize</td>
<td valign="top" width="572">Parenthesized or dotted number used as
list item marker</td>
</tr>
<tr>
<td valign="top" width="97">249C..24B5</td>
<td valign="top" width="83">retain, or use list item marker style, or
normalize</td>
<td valign="top" width="572">Parenthesized letters used as list item
markers</td>
</tr>
<tr>
<td valign="top" width="97">3131..318E</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Compatibility Hangul Jamo. These do not
conjoin</td>
</tr>
<tr>
<td valign="top" width="97">3200..3229</td>
<td valign="top" width="83">retain, or use list item marker style, or
normalize</td>
<td valign="top" width="572">Parenthesized characters used as list item
markers</td>
</tr>
<tr>
<td height="26" valign="top" width="97">322A..3243</td>
<td height="26" valign="top" width="83">retain</td>
<td height="26" valign="top" width="572">Parenthesized characters used
as symbols in vertical layout</td>
</tr>
<tr>
<td valign="top" width="97">32C0..32CB</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">String used as single code point in
vertical layout</td>
</tr>
<tr>
<td valign="top">all other</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Maintain; semantic distinctions apply</td>
</tr>
<tr>
<td valign="top" width="80">&lt;final&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">normalize</td>
<td valign="top" width="572">Arabic Presentation forms</td>
</tr>
<tr>
<td valign="top" width="80">&lt;font&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Variant letter forms that are used as
symbols</td>
</tr>
<tr>
<td valign="top" width="80">&lt;fraction&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">normalize</td>
<td valign="top" width="572">Normalize only as long as fraction slash is
supported</td>
</tr>
<tr>
<td valign="top" width="80">&lt;initial&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">normalize</td>
<td valign="top" width="572">Arabic Presentation forms</td>
</tr>
<tr>
<td valign="top" width="80">&lt;isolated&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">normalize</td>
<td valign="top" width="572">Arabic Presentation forms</td>
</tr>
<tr>
<td valign="top" width="80">&lt;medial&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">normalize</td>
<td valign="top" width="572">Arabic Presentation forms</td>
</tr>
<tr>
<td valign="top" width="80">&lt;narrow&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Half-width characters</td>
</tr>
<tr>
<td valign="top" width="80">&lt;noBreak&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">The compatibility mapping merely indicates
the equivalent breaking character. The noBreak distinction must be
preserved</td>
</tr>
<tr>
<td valign="top" width="80">&lt;small&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Precise usage unknown. Maintain, but do
not generate</td>
</tr>
<tr>
<td rowspan="4" valign="top" width="80">&lt;square&gt;</td>
<td valign="top" width="97">3300..3357</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Single display cell cluster containing
multiple lines of kana for vertical layout</td>
</tr>
<tr>
<td valign="top" width="97">3358..337D</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">For use as single code point in vertical
layout</td>
</tr>
<tr>
<td valign="top" width="97">33E0..33FE</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">For use as single code point in vertical
layout</td>
</tr>
<tr>
<td valign="top" width="97">all other</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Variant letter form used as symbol in
vertical layout</td>
</tr>
<tr>
<td rowspan="2" valign="top" width="80">&lt;sub&gt;</td>
<td valign="top" width="97">2080..208E</td>
<td valign="top" width="83">retain, or use markup</td>
<td valign="top" width="572">Subscript digits 0-9, as well as minus,
plus, equals sign and parentheses</td>
</tr>
<tr>
<td valign="top" width="97">all other</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Subscript characters, usually used as
modifier letters in phonetic notation</td>
</tr>
<tr>
<td rowspan="5" valign="top" width="80">&lt;super&gt;</td>
<td valign="top" width="97">00B2..00B3</td>
<td rowspan="4" valign="top" width="83">retain, or use markup</td>
<td rowspan="4" valign="top" width="572">Superscript digits 0-9, as
well as minus, plus, equals sign and parentheses</td>
</tr>
<tr>
<td valign="top" width="97">00B9</td>
</tr>
<tr>
<td valign="top" width="97">2070</td>
</tr>
<tr>
<td valign="top" width="97">2074..207E</td>
</tr>
<tr>
<td valign="top" width="97">all other</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Superscript characters, usually used as
modifier letters in phonetic notation</td>
</tr>
<tr>
<td valign="top" width="80">&lt;vertical&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">normalize</td>
<td valign="top" width="572">East Asian Presentation forms</td>
</tr>
<tr>
<td valign="top" width="80">&lt;wide&gt;</td>
<td valign="top" width="97">all</td>
<td valign="top" width="83">retain</td>
<td valign="top" width="572">Full-width characters</td>
</tr>
</tbody>
</table>
<blockquote>
<p><b>Note: </b>Some symbols used in vertical layout exist as single code
points in legacy systems, but can also be composed on the fly by more
advanced display engines. There are currently no style properties that
could be used to express squared Kana clusters (<i>kumimoji</i>) or
horizontal in vertical writing mode (<i>tate-chu-yoko</i>).</p>
</blockquote>
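<p>The compatibility tag of a character can be read from the decomposition
field of the Unicode Character Database. The following sketch uses Python's
standard <code>unicodedata</code> module; the helper names and the
<code>SIMPLE_ACTIONS</code> dictionary are hypothetical and cover only those
rows of Table 5.1 whose code range is "all". Note that the database spells the
tag for circled characters as &lt;circle&gt;, which is the key used here.</p>

```python
import unicodedata

# Default action for tags that Table 5.1 treats uniformly ("all" rows only).
SIMPLE_ACTIONS = {
    "circle": "retain",
    "final": "normalize",
    "font": "retain",
    "fraction": "normalize",
    "initial": "normalize",
    "isolated": "normalize",
    "medial": "normalize",
    "narrow": "retain",
    "noBreak": "retain",
    "small": "retain",
    "vertical": "normalize",
    "wide": "retain",
}


def compat_tag(ch):
    """Return the compatibility tag of ch without angle brackets, or None."""
    decomposition = unicodedata.decomposition(ch)
    if not decomposition.startswith("<"):
        return None  # no mapping at all, or a canonical mapping
    return decomposition.split()[0].strip("<>")


def default_action(ch):
    """Return the table action, or None if the row depends on the code range."""
    return SIMPLE_ACTIONS.get(compat_tag(ch))


for ch in "\u2460\u00bd\u00a0\uff71\ufb01":
    print("U+%04X" % ord(ch), compat_tag(ch), default_action(ch))
```

<p>Tags such as &lt;compat&gt;, &lt;square&gt;, &lt;sub&gt; and &lt;super&gt;
return no action in this sketch because the table distinguishes them by code
range.</p>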
<h3><a name="Generating">5.2 Generating New Text</a></h3>
<p>Presentation forms and characters for which adequate representation exists
as marked up text should never be entered into new data. Many of the
characters with a &lt;font&gt; tag are, however, suitable for new data, as long
as they are used in the manner they are intended, that is as symbols, with
definite semantic differentiation between the different forms. The largest
set of these characters exists to carry essential semantic distinctions in
mathematical notation, where any loss of markup during text export would
compromise the meaning of the text. Most of the characters with &lt;super&gt;
and &lt;sub&gt; tags have been encoded for use in phonetic or phonemic
transcriptions, where they act as ordinary letters and the use of style
markup is therefore deemed inappropriate. However, it is inappropriate to use
any of these classes of characters to create the appearance of styled text
runs.</p>
<p>For example, to write <i>hello</i>, one should use &lt;i&gt;hello&lt;/i&gt;
and not the sequence of Unicode characters U+210E, U+212F, U+2113, U+2113,
U+2134. Conversely, to indicate <i>Planck's constant</i> one should use
U+210E and not &lt;i&gt;h&lt;/i&gt;.</p>
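<p>The cost of misusing these characters, and of normalizing them
indiscriminately, can be seen in a short sketch using Python's standard
<code>unicodedata</code> module. The variable names are illustrative
only.</p>

```python
import unicodedata

# "hello" spelled with letterlike symbols instead of ordinary letters.
styled = "\u210e\u212f\u2113\u2113\u2134"

# Compatibility normalization folds the symbols to plain letters, so the
# string now matches an ordinary search for "hello" ...
folded = unicodedata.normalize("NFKC", styled)
print(folded)

# ... but the same folding erases the meaning of a genuine symbol.
print(unicodedata.name("\u210e"))
print(unicodedata.normalize("NFKC", "\u210e"))
```

<p>Before folding, the styled string does not compare equal to the ordinary
word; after folding, U+210E can no longer be told apart from the letter h.
Both effects are reasons to use markup for styled words and the symbol only
for its intended meaning.</p>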
<p>When style is applied across entire words, sentences or paragraphs, the
use of markup is preferred. When style is applied to individual letters,
especially to letters inside a word, giving them a particular interpretation,
the use of character codes is preferred. See also <a
href="#Superscripts">Section 5.6</a>.</p>
<h3><a name="List">5.3 List Item Marker Characters</a></h3>
<p><em>Short description</em>: Characters with a &lt;circled&gt; tag or
characters with a &lt;compat&gt; tag and compatibility mapping to a
parenthesized string.</p>
<p><em>Reason for inclusion</em>: They are most frequently used for marking
enumerated list items, but the characters with a &lt;circled&gt; tag often
occur as dingbats or footnote markers in tables. The same characters are used
in regular text when citing an item from a corresponding ordered list.</p>
<p><em>Problems when used in markup</em>: These characters do not cause undue
interaction with markup.</p>
<p><em>Problems with other uses</em>: None</p>
<p><em>Replacement markup</em>: (in text use) these characters are often used
in running text; sometimes, but not exclusively, in situations where the text
is to be associated with an item from a nearby numbered list. Replacement
markup may not be available, and the support for such markup is much more
limited today than was anticipated when this document was first written.</p>
<p>(list item style) When generating marked up text these characters occur
only internally within the user agent when list item styles are rendered. When
marking up plain text data they could be converted to suitable list item
styles, if such use can be properly inferred. The default recommendation is
to retain the original character.</p>
<p>(characters with compatibility mappings of the form "(<em>n</em>)" or
"<em>n</em>." or roman numerals) Unlike circled characters, these could be
rendered by sequences of regular characters. Using a list item marker style
would in theory allow the support of longer lists (the Unicode characters are
limited to the set (1) to (20) and "1." to "20."). Using regular character
sequences would also allow the use of fonts that match the text of the
list.</p>
<p><em>What to do if detected</em>: No action needs to be taken by browsers.
When received in an editing context, substitution of a list item marker style
may be appropriate. However, the same characters are very often used as
dingbat-like symbols in tables, or may appear in general text, whether or not
referring to an item from a list. Therefore the user must have the choice of
whether to replace the character.</p>
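<p>The regular character sequences behind the parenthesized and dotted
markers can be obtained from their compatibility mappings. The sketch below
uses Python's standard <code>unicodedata</code> module and assumes the ranges
U+2474..U+2487 and U+2488..U+249B for the two sets; the function name is
hypothetical.</p>

```python
import unicodedata


def marker_text(ch):
    """Return the regular character sequence a list marker maps to."""
    return unicodedata.normalize("NFKC", ch)


parenthesized = [marker_text(chr(cp)) for cp in range(0x2474, 0x2488)]
dotted = [marker_text(chr(cp)) for cp in range(0x2488, 0x249C)]

print(parenthesized[0], parenthesized[-1])
print(dotted[0], dotted[-1])

# A circled digit maps to the bare digit, so the circle itself is lost.
print(marker_text("\u2460"))
```

<p>The last line shows why circled characters are treated differently: their
mapping drops the enclosing circle, which no sequence of regular characters
reproduces.</p>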
<h3><a name="Fractions">5.4 Fractions</a></h3>
<p><em>Short description</em>: Single-character fractions such as ½ or ¼.</p>
<p><em>Reason for inclusion</em>: Subsets of these occur in practically all
legacy character sets.</p>
<p><em>Problems when used in markup</em>: The character repertoire is limited
to a few common fractions. When used with more general methods of generating
fractions such as MathML [<a href="#MathML">MathML</a>] the usual problem of
dual representation arises.</p>
<p><em>Problems with other uses</em>: Other than normalization issues, these
characters present no undue problems in plain text. Where fraction slash is
supported, these can be expressed by substituting their compatibility
mappings. </p>
<p><em>Replacement markup</em>: MathML can represent fractions unambiguously.
When using fraction slash, care must be taken such that values like 3½ do not
turn into 31/2 (=15.5).</p>
<p><em>What to do if detected</em>: No action needs to be taken by browsers
or editors, except when converting plain text to MathML.</p>
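<p>The hazard with values like 3½ can be reproduced directly. The sketch
below first shows what plain compatibility normalization does and then a
hypothetical helper, <code>expand_fractions</code>, that substitutes the
compatibility mapping of each fraction character and inserts U+2009 THIN
SPACE when a digit precedes it. The helper is an assumption for illustration,
not a prescribed algorithm.</p>

```python
import unicodedata

THIN_SPACE = "\u2009"


def expand_fractions(text):
    """Replace single-character fractions by digit, fraction slash, digit."""
    out = []
    for ch in text:
        fields = unicodedata.decomposition(ch).split()
        if fields and fields[0] == "<fraction>":
            # Keep a preceding integer part separate from the numerator.
            if out and out[-1].isdecimal():
                out.append(THIN_SPACE)
            out.extend(chr(int(cp, 16)) for cp in fields[1:])
        else:
            out.append(ch)
    return "".join(out)


# Plain normalization runs the integer part into the numerator.
print(ascii(unicodedata.normalize("NFKC", "3\u00bd")))
# The helper keeps them apart.
print(ascii(expand_fractions("3\u00bd")))
```

<p>The thin space has to be inserted after any normalization step, because
Normalization Form KC itself maps U+2009 to an ordinary space.</p>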
<h3><a name="Squared">5.5 Squared or Horizontal</a></h3>
<p><em>Short description</em>: Characters that are symbols composed of groups
of typically kana or Latin letters, digits plus slash for use in a single
display cell in vertical display of text. </p>
<p><em>Reason for inclusion</em>: Many existing character sets contain these
as precomposed characters since for simple implementations this is the only
way to support the common use of providing metric units and other
abbreviations in a single character cell for vertical text layout. </p>
<p><em>Problems when used in markup</em>: Proposed markup, including CSS
styling, would be able to express an unbounded set of these abbreviations,
obviating the need of cataloguing these in the character encoding standard
and making them more directly accessible to text based processing, for
example searching.</p>
<p><em>Problems with other uses</em>: The repertoire of these legacy
characters is limited; many more combinations are in actual use than are
accounted for in character sets. Pre-composed symbols do not make their text
content available to search engines. They also require re-encoding for text
laid out horizontally.</p>
<p><em>Replacement markup</em>: None available.</p>
<p><em>What to do if detected</em>: No action required. (Subject to change
pending the outcome of current proposals.)</p>
<h3><a name="Superscripts">5.6 Superscripts and Subscripts</a></h3>
<p><em>Short description</em>: Mainly super and subscript digits, but also
signs, parentheses and a large number of letters.</p>
<p><em>Reason for inclusion</em>: Super and subscripted letters and digits
are quite common in some forms of phonetic or phonemic transcriptions, where
the use of styles is both awkward and prone to data integrity issues when
exported to plain text. For super or subscripted letters in phonetic
transcription in particular, a change from superscript or subscript to
regular style would alter the meaning. Note that such use in transcription is
not limited to letters: superscripted small digits are often used to indicate
tone. When used for these purposes, these characters should be retained and
markup should <i>not</i> be used. </p>
<p>A few super and subscript characters, primarily the digits, also occur in
many legacy character sets, including Latin-1. Their use in pure plain text
is common for databases, e.g. including metric units for part descriptions
(viz. cm<sup>2</sup>) or for (usually simplified) formulae as occur in titles
of scientific publications. </p>
<p>When used in mathematical context (MathML) it is recommended to
consistently use style markup for superscripts and subscripts. This is
because mathematical layout allows not just individual symbols, but entire
expressions to be superscripted or subscripted in a regular, nested
manner.</p>
<p><em>Problems when used in markup</em>: Mixing direct use of these
characters with the use of style markup provides multiple representations of
the same text, leading to potentially different treatment by search and
display engines.</p>
<p>However, when super and sub-scripts are to reflect semantic distinctions,
it is easier to work with these meanings encoded in text rather than markup,
for example, in phonetic or phonemic transcription. Otherwise, they would
require markup in the middle of words, and they may also be inadvertently
changed to normal style text, when exporting to plain text. This applies to
the majority of super and subscripted characters in Unicode. On the other
hand, some user agents may support certain superscripted or subscripted
characters only when used as marked up text, for example because of a lack of
font support for them.</p>
<p><em>Problems with other uses</em>: none</p>
<p><em>Replacement markup</em>: Unless used as letters, &lt;xhtml:sup&gt; and
&lt;xhtml:sub&gt; or &lt;mathml:msup&gt; and &lt;mathml:msub&gt; may be
used.</p>
<p><em>What to do if detected</em>: Both representations (with or without
style markup) should be equivalent for search purposes. Input methods for
mathematical texts might enforce the use of styles. If superscript
characters are encountered during display of mathematical formulae, it is
recommended that they be displayed in a manner indistinguishable from that
achieved by using regular characters with corresponding style markup.</p>
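<p>One way to make both representations equivalent for search is to compare
folded keys instead of the raw text. The sketch below assumes that markup has
already been stripped, so that a superscripted "2" in markup contributes the
plain character "2" to the text content; the function name
<code>search_key</code> is hypothetical.</p>

```python
import unicodedata


def search_key(text):
    """Fold compatibility characters and case for comparison purposes only."""
    return unicodedata.normalize("NFKC", text).casefold()


# U+00B2 SUPERSCRIPT TWO versus the text content of marked up "cm2".
print(search_key("cm\u00b2") == search_key("cm2"))
# U+2082 SUBSCRIPT TWO versus the text content of marked up "H2O".
print(search_key("H\u2082O") == search_key("H2O"))
```

<p>Such a key is suitable only for matching. The stored text should keep the
original characters, since folding also erases the distinctions that matter in
phonetic transcription.</p>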
<h3><a name="Other">5.7 Other Characters Marked &lt;compat&gt;</a></h3>
<p><em>Short description</em>: The &lt;compat&gt; label was given to a set of
compatibility characters whose further classification was not settled at the
time the standard was created. The largest components are list item marker
characters.</p>
<p><em>Reason for inclusion</em>: These characters occur in many legacy
character sets.</p>
<p><em>Problems when used in markup</em>: none. There usually is no
equivalent markup.</p>
<p><em>Problems with other uses</em>: none</p>
<p><em>Replacement markup</em>: none.</p>
<p><em>What to do if detected</em>: No action required.</p>
<h2><a name="Noncharacters">6. Noncharacters</a></h2>
<p>The Unicode Standard defines 66 non-character code points, or
<i>noncharacters</i>. These are the last two positions on each of the 17
planes, in other words, all code points that end in ...FFFE or
...FFFF, as well as the 32 code points from U+FDD0 to U+FDEF. Applications
are free to use any of these code points internally but should never attempt
to interchange them. In effect, noncharacters can be thought of as
application-internal private-use code points.</p>
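<p>The definition above translates into a short test that can be applied
before text is interchanged. This is a minimal sketch; the function names are
hypothetical.</p>

```python
def is_noncharacter(cp):
    """True for the last two code points of each plane and U+FDD0..U+FDEF."""
    return (cp & 0xFFFE) == 0xFFFE or cp in range(0xFDD0, 0xFDF0)


def find_noncharacters(text):
    """List (index, code point label) for every noncharacter in text."""
    return [
        (i, "U+%04X" % ord(ch))
        for i, ch in enumerate(text)
        if is_noncharacter(ord(ch))
    ]


# Counting over all 17 planes reproduces the total given above.
print(sum(is_noncharacter(cp) for cp in range(0x110000)))
print(find_noncharacters("a\ufdd0b\uffff"))
```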
<h2>7. <a name="White">White Space</a></h2>
<p>This section presents common issues with white space characters in markup
languages, mostly based on their difference in function as part of the
structure of the markup source (syntactic white space) on the one hand and as
part of the document content on the other hand.</p>
<p>The set of characters in the Unicode Standard that have the property
"White_Space" (see 'White Space' in the [<a href="#UnicodeData">UCD</a>]) is
quite large. It includes white space characters with different line breaking
properties, different ligating properties, and different widths. It is
appropriate to use these characters as part of markup content for their very
specific purpose. It is preferable to place them in the markup source so
that they are surrounded by ordinary characters rather than line breaks for
example. The set of white space characters defined by typical markup
language specifications is a subset of the characters that are considered
white space by [<a href="#Unicode">Unicode</a>].</p>
<p>Each markup language defines the set of characters that it accepts as part
of the markup syntax; this is usually a very small set. The XML [<a
href="#xml10">XML1.0</a>] and [<a href="#xml11">XML1.1</a>] specifications
define white space as a combination of one or more of the following
characters: U+0020 SPACE, carriage return (U+000D), line feed (U+000A), or
tab (U+0009). [<a href="#html4.01">HTML4.01</a>] adds to these the form feed
character (U+000C), but that character cannot be used in any XHTML
version.</p>
<p>In addition, markup languages may use conventions for converting or
removing some kinds of white space. XML processors replace some combinations
of end-of-line characters by a single line feed character. [<a
href="#xml10">XML1.0</a>] normalizes any two-character sequence (U+000D
U+000A) or any U+000D not followed by U+000A to a single U+000A. [<a
href="#xml11">XML1.1</a>] also normalizes NEL (U+0085) and U+2028 LINE
SEPARATOR, but U+2029 PARAGRAPH SEPARATOR is not treated that way. Additional
processing of white space before it is handed to an application also occurs
for attribute values: line breaks are replaced by spaces, leading and
trailing spaces are removed, and sequences of spaces are replaced by a single
space.</p>
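<p>The end-of-line rules described above can be sketched with regular
expressions. The function names are hypothetical, and
<code>normalize_attribute</code> is a simplification: it ignores entity and
character references, and the stripping and collapsing of spaces applies in
XML only to attributes whose declared type is not CDATA.</p>

```python
import re


def normalize_eol_xml10(text):
    # CR LF pairs and any CR not followed by LF both become a single LF.
    return re.sub("\r\n?", "\n", text)


def normalize_eol_xml11(text):
    # XML 1.1 additionally folds NEL (U+0085), CR NEL pairs and U+2028.
    # U+2029 PARAGRAPH SEPARATOR is left untouched.
    return re.sub("\r[\n\x85]?|[\x85\u2028]", "\n", text)


def normalize_attribute(value):
    # Line breaks and tabs become spaces, then leading and trailing spaces
    # are removed and runs of spaces are collapsed to one.
    value = re.sub("[\t\n]", " ", normalize_eol_xml10(value))
    return re.sub(" +", " ", value).strip(" ")


print(ascii(normalize_eol_xml10("a\r\nb\rc\nd")))
print(ascii(normalize_eol_xml11("a\x85b\u2028c\u2029d")))
print(ascii(normalize_attribute("  a\r\n  b ")))
```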
<p>In XML, white space is purely syntactic inside tags, for example, to
separate the element name from attributes, and between elements in element
content models (as they are typical for data-oriented applications). White
space in element content models is used to lay out the markup source, using
line breaks and indentation, to improve readability. The same use of white
space is possible in many cases in mixed content (typical for text-oriented
applications).</p>
<p>Because XML is used for a very wide range of applications, after the
processing steps mentioned above it passes all white space to the
application. Some XML applications such as [<a href="#XHTML">XHTML</a>] may
have their own white space processing rules when processing white space
characters. Also, applications and software transforming XML (e.g. [<a
href="#XSLT">XSLT</a>]) have specific conventions of how they handle white
space, and specific ways of how to control this behavior. To appropriately
use white space characters, readers are advised to examine all involved
standards and software.</p>
<p>If the characters U+2028 and U+2029 appear in text, they may be treated as
zero-width characters without semantic meaning (see Section 3.2).</p>
<h3 id="converting-nl-to-ws">7.1 Converting Newline Functions to White
Space</h3>
<p>White space that is not purely syntactic, including control codes that
define a newline function (see <i>Section 5.8, Newline Guidelines,</i> in [<a
href="#Unicode">Unicode</a>]), can be handled in three main ways.</p>
<ol>
<li>For data-oriented applications, the textual content of elements is
treated according to the needs of the data type in question. In many
cases, processing by the application includes aspects similar to those of
the processing of attribute values by the XML parser itself. For some
types of data, in particular small data items, some applications may also
simply prohibit the use of white space.</li>
<li>For running text in text-oriented applications, reflowing is used, i.e.
the line breaks in the markup source are removed and the text is reflowed
into lines whose length is determined by the output medium and styling
properties. In the context of Unicode, this reflowing process requires
care; it is described in more detail below.</li>
<li>For preformatted text, such as program source code, line breaks must be
preserved. Text-oriented applications usually contain special markup for
preformatted text, e.g. &lt;xhtml:pre&gt;. XML itself defines an
xml:space attribute that applications may use for a similar purpose.</li>
</ol>
<p>When reflowing, line breaks and adjacent white space can be treated as
space, removed, collapsed with adjacent control characters of the same type,
or treated as zero-width space. Which choice is appropriate depends on the
script of the surrounding text. The assumption is that line breaks and
adjacent white space (in particular following white space, used for
indentation) were added to make the markup source more readable, in particular
to make each line fit on a line of a plain text editor. For scripts that use
spaces, line breaks will have been inserted where there originally was a
space; treating them as spaces therefore preserves the intended separation
between words. For scripts which do not use spaces, such as Ideographic
scripts or certain South East Asian scripts, such as Thai, line feeds should
be removed, or replaced by U+200B zero width space. The choice of treatment
can depend on the script value of the characters preceding and following the
line feed character, assuming these characters belong to the same run of
text.</p>
<blockquote>
<p><b>Note:</b> The Unicode Standard [<a href="#Unicode">Unicode</a>]
specifies that the zero width space is considered a valid line-break point
and that if two characters with a zero width space in between are placed on
the same line they are placed with no space between them; and that if they
are placed on two lines no additional glyph area is created at the
line-break.</p>
</blockquote>
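<p>The script-dependent treatment of line breaks can be sketched as follows.
Python's standard library does not expose the script property, so this sketch
approximates it from character names; that approximation, the list of name
prefixes, and the function names are all assumptions made for illustration.
The <code>joiner</code> argument selects between removing the line break and
replacing it with U+200B.</p>

```python
import re
import unicodedata

# Crude stand-in for "scripts that do not use spaces between words".
NO_SPACE_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "THAI", "LAO", "KHMER", "MYANMAR")


def uses_no_spaces(ch):
    return unicodedata.name(ch, "").startswith(NO_SPACE_PREFIXES)


def reflow(text, joiner=""):
    """Replace each line break and adjacent spaces or tabs in running text."""

    def replace(match):
        start, end = match.span()
        before = text[start - 1] if start else ""
        after = text[end] if end != len(text) else ""
        if before and after and uses_no_spaces(before) and uses_no_spaces(after):
            return joiner  # remove, or insert e.g. U+200B
        return " "

    return re.sub(r"[ \t]*\n[ \t]*", replace, text)


print(reflow("hello\n    world"))
print(ascii(reflow("\u65e5\u672c\n    \u8a9e")))
print(ascii(reflow("\u65e5\u672c\n    \u8a9e", joiner="\u200b")))
```

<p>When the characters on the two sides of the break belong to different
kinds of script, this sketch falls back to a space; other policies are
possible.</p>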
<p>The details of reflowing are the responsibility of the various markup
applications (e.g. [<a href="#XHTML">XHTML</a>]). However, there is a
tendency to move this functionality from markup applications to styling, so
that it can be shared across applications.</p>
<p>Authors should be aware of the fact that the above script-specific
treatment of line breaks when reflowing text is not yet available in all
implementations (e.g. browsers). For scripts that do not use white space to
separate words, it may therefore still be advisable to not split long
lines.</p>
<p>Editing tools should try to support the user in the appropriate use of
white space. Some white space characters cannot easily be entered via a
keyboard, but some others, e.g. U+3000 Ideographic Space, can. Editing tools
should try to make sure that only line breaks and white space that are
accepted as syntactic white space by the relevant markup language are used to
improve markup source readability.</p>
<p>While the styling possibilities provided by CSS and its implementations
have not reached the level of professional typesetting systems, they offer a
wide range of ways to control layout and spacing of text. A very simple
example is text centering, which would have been done by inserting an
appropriate number of spaces on each line in pure plain text.</p>
<h2><a name="Versioning">8. Versioning</a></h2>
<p>This report will be updated by the Unicode Technical Committee in
cooperation with the W3C Internationalization Activity whenever the tables of
characters in this document need to be updated as a result of the addition of
characters to the Unicode Standard, as a result of a revised determination of
the suitability of a given character for use with markup, or when additional
background information or recommendations become available.</p>
<p>Each report carries a revision number, which may be used to refer to a
specific version of the report. Older versions of the report will remain
available. Each version of this report specifies the underlying version of
the Unicode Standard.</p>
<p>For more information on the Unicode Standard and its versions, see:</p>
<ul class="unicode">
<li><a href="http://www.unicode.org/unicode/standard/versions/">Versions of
the Unicode Standard</a> [<a
href="#UnicodeVersions">UnicodeVersions</a>]</li>
<li><a href="http://www.unicode.org/ucd/">About the Unicode Character
Database</a> [<a href="#UCD">UCD</a>]</li>
<li><a href="http://www.unicode.org/Public/UNIDATA/UCD.html">Unicode
Character Database</a> [<a href="#UnicodeData">UnicodeData</a>]</li>
</ul>
<h2><a name="Conformance">9. Conformance</a></h2>
<p>In the context of the Unicode Standard, the material in this technical
report is <em>informative. </em>However, other documents, particularly markup
language specifications, may specify conformance including normative
references to this document. Such references may have to be updated as a
result of future updates to this report as discussed in Section 8<i>, <a
href="#Versioning">Versioning</a>.</i></p>
<h2><a name="References">10. References</a></h2>
<dl>
<dt><a name="Charmod">[Charmod]</a></dt>
<dd>Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Tex
Texin, Eds., <cite>Character Model for the World Wide Web 1.0:
Fundamentals</cite>, W3C Recommendation, 15-February-2005, &lt;<a
href="http://www.w3.org/TR/2005/REC-charmod-20050215/">http://www.w3.org/TR/2005/REC-charmod-20050215/</a>&gt;.</dd>
<dt>[<a name="Charmodnorm">Charmodnorm</a>]</dt>
<dd>François Yergeau, Martin J. Dürst, Richard Ishida, Addison Phillips,
Misha Wolf, and Tex Texin, Eds., <i>Character Model for the World Wide
Web 1.0: Normalization,</i> W3C Working Draft, 27-October-2005, &lt;<a
href="http://www.w3.org/TR/2005/WD-charmod-norm-20051027/">http://www.w3.org/TR/2005/WD-charmod-norm-20051027/</a>&gt;.</dd>
<dt><a name="CharReq">[CharReq]</a></dt>
<dd>Martin J. Dürst, <cite>Requirements for String Identity and Character
Indexing Definitions for the WWW</cite>, W3C Working Draft,
10-July-1998, <<a
href="http://www.w3.org/TR/WD-charreq">http://www.w3.org/TR/WD-charreq</a>>.</dd>
<dt>[<a name="CSS">CSS</a>]</dt>
<dd>For information on cascading style sheet specifications, see <<a
href="http://www.w3.org/Style/CSS/">http://www.w3.org/Style/CSS/</a>>.</dd>
<dt>[<a name="Feedback">Feedback</a>]</dt>
<dd>Reporting Errors and Requesting Information Online to the Unicode
Consortium, <<a
href="http://www.unicode.org/reporting.html">http://www.unicode.org/reporting.html</a>>.</dd>
<dt><a name="html4.01">[HTML4.01]</a></dt>
<dd>Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., <cite>HTML 4.01
Specification</cite>, W3C Recommendation, 18-Dec-1997 (revised on
24-Dec-1999), <<a
href="http://www.w3.org/TR/1999/REC-html401-19991224/">http://www.w3.org/TR/1999/REC-html401-19991224/</a>>.</dd>
<dt><a name="HTML4.0-8.2">[HTML 4.0 - 8.2]</a></dt>
<dd>Section 8.2 of [HTML4.01], <i>Specifying the direction of text and
tables: the dir attribute</i> <<a
href="http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.2">http://www.w3.org/TR/1999/REC-html401-19991224/struct/dirlang.html#h-8.2</a>>.</dd>
<dt><a name="MathML">[MathML]</a></dt>
<dd>David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Eds.,
<i>Mathematical Markup Language (MathML) Version 2.0
(Second Edition)</i>, W3C Recommendation, 21-Oct-2003, <<a
href="http://www.w3.org/TR/2003/REC-MathML2-20031021/">http://www.w3.org/TR/2003/REC-MathML2-20031021/</a>>.</dd>
<dt><a name="Namespace">[Namespace]</a></dt>
<dd>Tim Bray, Dave Hollander, Andrew Layman, Eds., <i>Namespaces in XML
(Second Edition)</i>, W3C Recommendation, 16-Aug-2006, <<a
href="http://www.w3.org/TR/2006/REC-xml-names-20060816/">http://www.w3.org/TR/2006/REC-xml-names-20060816/</a>>.</dd>
<dt><a name="Ruby">[Ruby]</a></dt>
<dd>Marcin Sawicki, Michel Suignard, Masayasu Ishikawa, Martin Dürst, Tex
Texin, Eds., <i>Ruby Annotation</i>, W3C Recommendation, 31-May-2001,
<<a
href="http://www.w3.org/TR/2001/REC-ruby-20010531/">http://www.w3.org/TR/2001/REC-ruby-20010531/</a>>.</dd>
<dt><a name="UTR9">[UAX 9]</a></dt>
<dd>Mark Davis, <cite>Unicode Standard Annex #9, The Bidirectional
Algorithm</cite>, <<a
href="http://www.unicode.org/reports/tr9/">http://www.unicode.org/reports/tr9/</a>>.</dd>
<dt>[<a name="UAX14">UAX14</a>]</dt>
<dd>Asmus Freytag, <i>Unicode Standard Annex #14, Line Breaking
Properties</i>, <<a
href="http://www.unicode.org/reports/tr14/">http://www.unicode.org/reports/tr14/</a>>.</dd>
<dt><a name="UTR15">[UAX 15]</a><a name="UAX15"></a></dt>
<dd>Mark Davis, Martin Dürst, <cite>Unicode Standard Annex #15, Unicode
Normalization Forms</cite>, <<a
href="http://www.unicode.org/reports/tr15/">http://www.unicode.org/reports/tr15/</a>>.</dd>
<dt>[<a name="UAX29">UAX 29</a>]</dt>
<dd>Mark Davis, <i>Unicode Standard Annex #29, Text Boundaries</i>,
<<a
href="http://www.unicode.org/reports/tr29/">http://www.unicode.org/reports/tr29/</a>>.</dd>
<dt>[<a name="UCD">UCD</a>]</dt>
<dd><cite>About the Unicode Character Database</cite>, <<a
href="http://www.unicode.org/ucd/">http://www.unicode.org/ucd/</a>>.</dd>
<dt><a name="Unicode">[Unicode]</a></dt>
<dd>The Unicode Consortium. <i><a
href="http://www.unicode.org/versions/Unicode5.0.0/">The Unicode
Standard, Version 5.0</a></i> (Boston, MA, Addison-Wesley, 2007. ISBN
0-321-48091-0). </dd>
<dt><a name="Unicode32">[Unicode32]</a></dt>
<dd><cite>Unicode Standard Annex #28 <a
href="http://www.unicode.org/reports/tr28/">Unicode 3.2</a></cite>, The
Unicode Consortium, 2002.</dd>
<dt><a name="Unicode40">[Unicode40]</a></dt>
<dd><cite><a
href="http://www.unicode.org/unicode/standard/standard.html">The
Unicode Standard</a>, <a
href="http://www.unicode.org/unicode/standard/versions/Unicode3.0.html">Version
4.0</a></cite>, <i>The Unicode Standard, Version 4.0, </i>(Reading,
Massachusetts: Addison-Wesley Developers Press, 2003, ISBN
0-321-18578-1) or online as <<a
href="http://www.unicode.org/versions/Unicode4.0.0/">http://www.unicode.org/versions/Unicode4.0.0/</a>>.</dd>
<dt>[<a name="Unicode50">Unicode50</a>]</dt>
<dd>The Unicode Consortium. <i><a
href="http://www.unicode.org/versions/Unicode5.0.0/">The Unicode
Standard, Version 5.0</a></i> (Boston, MA, Addison-Wesley, 2007. ISBN
0-321-48091-0) or online as <<a
href="http://www.unicode.org/versions/Unicode5.0.0/">http://www.unicode.org/versions/Unicode5.0.0/</a>>.</dd>
<dt><a name="UnicodeData">[UnicodeData]</a></dt>
<dd><cite>Unicode Character Database</cite>, <<a
href="http://www.unicode.org/Public/UNIDATA/UCD.html">http://www.unicode.org/Public/UNIDATA/UCD.html</a>>.</dd>
<dt><a name="UnicodeVersions">[UnicodeVersions]</a></dt>
<dd><cite>Versions of the Unicode Standard</cite>, <<a
href="http://www.unicode.org/unicode/standard/versions/">http://www.unicode.org/unicode/standard/versions/</a>>.</dd>
<dt>[<a name="UTR25">UTR25</a>]</dt>
<dd>Asmus Freytag, Barbara Beeton, Murray Sargent, <i>Unicode Technical
Report #25, Unicode Support for Mathematics</i>, <<a
href="http://www.unicode.org/reports/tr25/">http://www.unicode.org/reports/tr25/</a>>.</dd>
<dt>[<a name="Variants">Variants</a>]</dt>
<dd>Standardized Variants <<a
href="http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html">http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html</a>>.</dd>
<dt><a name="XHTML">[XHTML]</a></dt>
<dd>Steven Pemberton, et al., Eds.,
<cite>XHTML</cite><i><cite>™</cite></i><cite> 1.0: The Extensible
HyperText Markup Language - A Reformulation of HTML 4.0 in XML
1.0</cite>, W3C Recommendation, 01-Aug-2002, <<a
href="http://www.w3.org/TR/2002/REC-xhtml1-20020801/">http://www.w3.org/TR/2002/REC-xhtml1-20020801/</a>>.</dd>
<dt><a name="xml10">[XML 1.0]</a></dt>
<dd>Tim Bray, Jean Paoli, Eve Maler, C. M. Sperberg-McQueen, François
Yergeau, Eds., <i>Extensible Markup Language (XML) 1.0 (Fourth
Edition)</i>, W3C Recommendation, 16-August-2006, <<a
href="http://www.w3.org/TR/2006/REC-xml-20060816/">http://www.w3.org/TR/2006/REC-xml-20060816/</a>>.</dd>
<dt>[<a name="XSLT">XSLT</a>]</dt>
<dd>Michael Kay, Ed., <i>XSL Transformations (XSLT) Version 2.0</i>, W3C
Recommendation, 23-January-2007, <<a
href="http://www.w3.org/TR/2007/REC-xslt20-20070123/">http://www.w3.org/TR/2007/REC-xslt20-20070123/</a>>.</dd>
<dt><a name="xml11">[XML 1.1]</a></dt>
<dd>Jean Paoli, Eve Maler, Tim Bray, C. M. Sperberg-McQueen, François
Yergeau, John Cowan, Eds., <i>Extensible Markup Language (XML) 1.1
(Second Edition)</i>, W3C Recommendation 16-August-2006, <<a
href="http://www.w3.org/TR/2006/REC-xml11-20060816/">http://www.w3.org/TR/2006/REC-xml11-20060816/</a>>.
</dd>
<dt>[<a name="XMLSchema">XML Schema</a>]</dt>
<dd>Henry S. Thompson, David Beech, Murray Maloney, Noah Mendelsohn,
Eds., <i>XML Schema Part 1: Structures Second Edition</i>, W3C
Recommendation 28-October-2004, <<a
href="http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/">http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/</a>>.</dd>
</dl>
<h2><a name="Acknowledgements">11. Acknowledgements</a></h2>
<p>Mark Davis and Hideki Hiura contributed to the early drafts. Jukka Korpela
and Felix Sasaki provided input to the current document.</p>
<h2><a name="ChangeHistory">12. Change History (last changes first)</a></h2>
<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-7.html">http://www.unicode.org/reports/tr20/tr20-7.html</a>
: Added entries for new characters in Unicode 5.0. Updated references to use
new chapter/section numbers in Unicode 5.0. Updated the discussion of
superscript and subscript characters, accounting for the differences between
their use in phonetic or phonemic transcription and mathematics. Added
Sections 3.10, 4.5, 4.6, and 4.7. Added Section 7 on handling white space.
Updated references to W3C publications (AF). More work on the white space
section; moved everything about the BOM to one place (MJD).</p>
<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-6.html">http://www.unicode.org/reports/tr20/tr20-6.html</a>
: Added entries for new characters in Unicode 4.0. Separated out, and
extended, the discussion of format characters suitable for markup. This
resulted in a new section 2.6, moving section 3.2 to 4, and renumbering, as
well as new sections 4.1, 4.2, 4.3, 4.4. Added a discussion on noncharacters
in a new section 6. Updated reference from Unicode 3.1 and 3.2 to Unicode
4.0. Improved the layout and description of what is now table 5.1. Changed the
recommended action in 5.6 to none. Updated the Unicode status section.
Changed http://www.unicode.org/unicode/reports/ to <a
href="http://www.unicode.org/reports/">http://www.unicode.org/reports</a>
throughout to reflect the preferred style of URL (older style URLs continue
to be valid). Updated references to W3C publications. (AF/MJD)</p>
<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-5.html">http://www.unicode.org/reports/tr20/tr20-5.html</a>
: Updated reference from Unicode 3.0 to 3.1 and 3.2 where appropriate. Added
sections 3.6 and 3.9. Minor wording fixes in sections 2.3, 3.1, 3.2, 3.6,
3.10, 4.3, 4.5 and 5. (AF/MJD)</p>
<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-4.html">http://www.unicode.org/reports/tr20/tr20-4.html</a>
: Added a note to the introduction to limit the scope. Reorganized section 3
and clarified the language. Renamed some sections and tables. Updated the
document to prepare for publication as Unicode Technical Report and W3C Note
(AF/MJD). Minor editorial changes to the text, added section 4.7, fixed some
dates, plus a few typos. (AF)</p>
<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-3.html">http://www.unicode.org/reports/tr20/tr20-3.html</a>
: Minor editorial changes to the introduction, fixed some references, links,
and dates, plus a few typos. (AF/MJD)</p>
<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-2.html">http://www.unicode.org/reports/tr20/tr20-2.html</a>
: Added sections 2.1-2.6 (MJD), sections 3.1-3.5, and 3.8, as well as
sections 4.4-4.6 and 8 (AF). Edited text for publication as DRAFT Unicode
Technical Report. (AF)</p>
<p>Changes from <a class="unicode"
href="http://www.unicode.org/reports/tr20/tr20-1.html">http://www.unicode.org/reports/tr20/tr20-1.html</a>
: Completed references, linked TOC. Various wording changes. Added W3C WD
stylesheet, logo, copyright, status of this document. Streamlined authors'
section. (MJD) Added material on compatibility characters. (AF)</p>
<p>Changes from the initial draft: Fixed the header. Fixed the numbering.
Fixed the title. Put references to final version of data files based on
naming conventions. Minor wording changes. Added proposed language on
annotation characters to match example on FFFC. Posted for internal review by
UTC and W3C. (AF)</p>
<h2><a name="Copyright">13. Copyright</a></h2>
<p>Copyright © 1999-2007 Unicode<sup>®</sup>, Inc. and <a
href="http://www.w3.org/">W3C</a><sup>®</sup> (<a
href="http://www.csail.mit.edu/index.php"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved.</p>
<p>This document is available under the <a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">W3C
Document License</a> or the <a
href="http://www.unicode.org/unicode/copyright.html">Unicode License</a>.
Documents available from the W3C have additional <a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer">warranties,
liability</a>, and <a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks">trademark</a>
policies associated with them. The <a
href="http://www.unicode.org/unicode/copyright.html">Unicode License</a>
specifies warranty/liability and trademark terms including:</p>
<blockquote>
<p class="unicode">The Unicode Consortium makes no expressed or implied
warranty of any kind, and assumes no liability for errors or omissions. No
liability is assumed for incidental and consequential damages in connection
with or arising out of the use of the information or programs contained or
accompanying this technical report.</p>
<p class="unicode">Unicode and the Unicode logo are trademarks of Unicode,
Inc., and are registered in some jurisdictions.</p>
</blockquote>
</body>
</html>