<!--OFFLINE
<!DOCTYPE html SYSTEM "DTD/xhtml1-strict.dtd">
OFFLINE-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Speech Synthesis Markup Language (SSML) Version 1.0</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
/*<![CDATA[*/
pre.example {
font-family: monospace;
white-space: pre;
background: #CCCCFF;
border: solid black thin;
margin-left: 0;
padding: 0.5em;
font-size: 85%;
width: 97%;
}
pre.dtd {
font-family: "Lucida Console", "Courier New", monospace;
white-space: pre;
background: #CCFFCC;
border: solid black thin;
margin-left: 0;
padding: 0.5em;
}
.ipa { font-family: "Lucida Sans Unicode", monospace; }
table { width: 100% }
td { background: #EAFFEA }
.tocline { list-style: disc; list-style: none; }
.hide { display: none }
.issues { font-style: italic; color: green }
.recentremove {
text-decoration: line-through;
color: black;
}
.recentnew {
color: red;
}
.remove {
text-decoration: line-through;
color: maroon;
}
.new {
color: fuchsia;
}
.elements {
font-family: monospace;
font-weight: bold;
}
.attributes {
font-family: monospace;
font-weight: bold;
}
code.att {
font-family: monospace;
font-weight: bold;
}
a.adef {
font-family: monospace;
font-weight: bold;
}
a.aref {
font-family: monospace;
font-weight: bold;
}
a.edef {
font-family: monospace;
font-weight: bold;
}
a.eref {
font-family: monospace;
font-weight: bold;
}
/*]]>*/
</style>
<link rel="stylesheet" type="text/css" href="http://www.w3.org/StyleSheets/TR/W3C-REC" />
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img height="48" alt="W3C" src="http://www.w3.org/Icons/w3c_home" width="72" />
</a></p>
<h1 class="notoc" id="h1">Speech Synthesis Markup Language (SSML) Version 1.0</h1>
<h2 class="notoc" id="date">W3C Recommendation 7 September 2004</h2>
<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/">http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/</a></dd>
<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/speech-synthesis/">http://www.w3.org/TR/speech-synthesis/</a></dd>
<dt>Previous version:</dt>
<dd><a href="http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/">http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/</a></dd>
<dt><br />
Editors:</dt>
<dd>Daniel C. Burnett, Nuance Communications</dd>
<dd>Mark R. Walker, Intel</dd>
<dd>Andrew Hunt, ScanSoft</dd>
</dl>
<p>Please refer to the <a href="
http://www.w3.org/2004/09/ssml-errata.html"><strong>errata</strong></a>
for this document, which may include some normative corrections.</p>
<p>See also <a href="http://www.w3.org/2003/03/Translations/byTechnology?technology=speech-synthesis"><strong>translations</strong></a>.</p>
<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a> ©1999 - 2004 <a href="http://www.w3.org/"><abbr title="World Wide Web Consortium">W3C</abbr></a> <sup>®</sup> (<a href="http://www.csail.mit.edu/"><abbr title="Massachusetts Institute of Technology">MIT</abbr></a> , <a href="http://www.ercim.org/"><abbr lang="fr" title="European Research Consortium for Informatics and Mathematics">ERCIM</abbr></a> , <a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>, <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>, <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a> rules apply.</p>
<!--
<p>Sun, Sun Microsystems, Inc., the Sun logo, Java and all
Java-based marks and logos are trademarks or registered trademarks
of Sun Microsystems, Inc. in the United States and other countries.
©2000 Sun Microsystems.</p>
-->
<hr title="Separator from Header" />
</div>
<h2 class="notoc" id="abstr"><a id="abstract" name="abstract">Abstract</a></h2>
<p>The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.</p>
<h2 class="notoc" id="status">Status of this Document</h2>
<p><em>This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the <a href="http://www.w3.org/TR/">W3C technical reports index</a> at http://www.w3.org/TR/.</em></p>
<p>This document contains the Speech Synthesis Markup Language (SSML) 1.0
specification and is a <a href="http://www.w3.org/2004/02/Process-20040205/tr.html#RecsW3C">W3C Recommendation</a>. It has been produced as part of the
<a href="http://www.w3.org/Voice/Activity.html">Voice Browser Activity</a>.
The authors of this document are participants in the <a href="http://www.w3.org/Voice/Group/">Voice
Browser Working Group</a> (<a href="http://cgi.w3.org/MemberAccess/AccessRequest">W3C members only</a>).
For more information see the <a href="http://www.w3.org/Voice/#faq">Voice Browser FAQ</a>. This is a stable document and has been endorsed by the W3C Membership
and the participants of the Voice Browser Working Group.</p>
<p>The design of SSML 1.0 has been widely reviewed (see the
<a href="http://www.w3.org/TR/2004/PR-speech-synthesis-20040715/ssml-disposition.html">
disposition of comments</a>) and satisfies the Working Group's
technical requirements. A list of implementations is included in the
<a href="http://www.w3.org/Voice/2004/ssml-ir/">SSML 1.0 Implementation
Report</a>, along with the associated test suite.</p>
<p>Comments are welcome on <a
href="mailto:www-voice@w3.org">www-voice@w3.org</a> (<a
href="http://lists.w3.org/Archives/Public/www-voice/">archive</a>).
See <a href="http://www.w3.org/Mail/">W3C mailing list and archive usage
guidelines</a>.</p>
<p>Patent disclosures relevant to this specification may be found on the
Working Group's <a href="http://www.w3.org/2001/09/voice-disclosures.html">patent disclosure page</a>. This document has been produced under the <a href="http://www.w3.org/TR/2002/NOTE-patent-practice-20020124">24 January 2002 CPP</a> as amended by the <a href="http://www.w3.org/2004/02/05-pp-transition">W3C Patent Policy Transition Procedure</a>. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) with respect to this specification should disclose the information in accordance with <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section 6 of the W3C Patent Policy</a>.</p>
<h2><a id="S0" name="S0">0.</a> Table of Contents</h2>
<ul class="toc">
<li class="tocline">1. <a href="#S1">Introduction</a>
<ul class="toc">
<li class="tocline">1.1 <a href="#S1.1">Design Concepts</a></li>
<li class="tocline">1.2 <a href="#S1.2">Speech Synthesis Process Steps</a></li>
<li class="tocline">1.3 <a href="#S1.3">Document Generation, Applications and Contexts</a></li>
<li class="tocline">1.4 <a href="#S1.4">Platform-Dependent Output Behavior of SSML Content</a></li>
<li class="tocline">1.5 <a href="#S1.5">Terminology</a></li>
</ul>
</li>
<li class="tocline">2. <a href="#S2">SSML Documents</a>
<ul class="toc">
<li class="tocline">2.1 <a href="#S2.1">Document Form</a></li>
<li class="tocline">2.2 <a href="#S2.2">Conformance</a>
<ul class="toc">
<li class="tocline">2.2.1 <a href="#S2.2.1">Conforming Speech Synthesis Markup Language Fragments</a></li>
<li class="tocline">2.2.2 <a href="#S2.2.2">Conforming Stand-Alone Speech Synthesis Markup Language Documents</a></li>
<li class="tocline">2.2.3 <a href="#S2.2.3">Using SSML With Other Namespaces</a></li>
<li class="tocline">2.2.4 <a href="#S2.2.4">Conforming Speech Synthesis Markup Language Processors</a></li>
<li class="tocline">2.2.5 <a href="#S2.2.5">Conforming User Agent</a></li>
</ul>
</li>
<li class="tocline">2.3 <a href="#S2.3">Integration With Other Markup Languages</a>
<ul class="toc">
<li class="tocline">2.3.1 <a href="#S2.3.1">SMIL</a></li>
<li class="tocline">2.3.2 <a href="#S2.3.2">ACSS</a></li>
<li class="tocline">2.3.3 <a href="#S2.3.3">VoiceXML</a></li>
</ul>
</li>
<li class="tocline">2.4 <a href="#S2.4">Fetching SSML Documents</a></li>
</ul>
</li>
<li class="tocline">3. <a href="#S3">Elements and Attributes</a>
<ul class="toc">
<li class="tocline">3.1 <a href="#S3.1">Document Structure, Text Processing and Pronunciation</a>
<ul class="toc">
<li class="tocline">3.1.1 <a href="#S3.1.1">"speak" Root Element</a></li>
<li class="tocline">3.1.2 <a href="#S3.1.2">Language: "xml:lang" Attribute</a></li>
<li class="tocline">3.1.3 <a href="#S3.1.3">Base URI: "xml:base" Attribute</a></li>
<li class="tocline">3.1.4 <a href="#S3.1.4">Pronunciation Lexicon: "lexicon" Element</a></li>
<li class="tocline">3.1.5 <a href="#S3.1.5">"meta" Element</a></li>
<li class="tocline">3.1.6 <a href="#S3.1.6">"metadata" Element</a></li>
<li class="tocline">3.1.7 <a href="#S3.1.7">Text Structure: "p" and "s" Elements</a></li>
<li class="tocline">3.1.8 <a href="#S3.1.8">"say-as" Element</a></li>
<li class="tocline">3.1.9 <a href="#S3.1.9">"phoneme" Element</a></li>
<li class="tocline">3.1.10 <a href="#S3.1.10">"sub" Element</a></li>
</ul>
</li>
<li class="tocline">3.2 <a href="#S3.2">Prosody and Style</a>
<ul class="toc">
<li class="tocline">3.2.1 <a href="#S3.2.1">"voice" Element</a></li>
<li class="tocline">3.2.2 <a href="#S3.2.2">"emphasis" Element</a></li>
<li class="tocline">3.2.3 <a href="#S3.2.3">"break" Element</a></li>
<li class="tocline">3.2.4 <a href="#S3.2.4">"prosody" Element</a></li>
</ul>
</li>
<li class="tocline">3.3 <a href="#S3.3">Other Elements</a>
<ul class="toc">
<li class="tocline">3.3.1 <a href="#S3.3.1">"audio" Element</a></li>
<li class="tocline">3.3.2 <a href="#S3.3.2">"mark" Element</a></li>
<li class="tocline">3.3.3 <a href="#S3.3.3">"desc" Element</a></li>
</ul>
</li>
</ul>
</li>
<li class="tocline">4. <a href="#S4">References</a></li>
<li class="tocline">5. <a href="#S5">Acknowledgments</a></li>
<li class="tocline">Appendix A. <a href="#AppA">Audio File Formats</a> (normative)</li>
<li class="tocline">Appendix B. <a href="#AppB">Internationalization</a> (normative)</li>
<li class="tocline">Appendix C. <a href="#AppC">MIME Types and File Suffix</a> (normative)</li>
<li class="tocline">Appendix D. <a href="#AppD">Schema for the Speech Synthesis Markup Language</a> (normative)</li>
<li class="tocline">Appendix E. <a href="#AppE">DTD for the Speech Synthesis Markup Language</a> (informative)</li>
<li class="tocline">Appendix F. <a href="#AppF">Example SSML</a> (informative)</li>
<li class="tocline">Appendix G. <a href="#AppG">Summary of changes since the Candidate Recommendation</a> (informative)</li>
</ul>
<h2><a id="S1" name="S1">1.</a> Introduction</h2>
<p>This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [<a href="#ref-jsml">JSML</a>].</p>
<p>SSML is part of a larger set of markup specifications for <a href="#term-voicebrowser">voice browsers</a> developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [<a href="#ref-sable">SABLE</a>], which tried to integrate many different XML-based markups for <a href="#term-synthesis">speech synthesis</a> into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages [<a href="#ref-reqs">REQS</a>]. Since then, SABLE itself has not undergone any further development.</p>
<p>The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process (see <a href="#S1.2">Section 1.2</a>). The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document (see <a href="#S2.2.2">Section 2.2.2</a>) or as part of a fragment (see <a href="#S2.2.1">Section 2.2.1</a>) embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like <a href="#edef_phoneme" class="eref">phoneme</a> and <a href="#edef_prosody" class="eref">prosody</a> (e.g. for speech contour design) may require specialized knowledge.</p>
<h3><a id="S1.1" name="S1.1">1.1</a> Design Concepts</h3>
<p>The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [<a href="#ref-reqs">REQS</a>].</p>
<p>The following items were the key design criteria.</p>
<ul>
<li><em>Consistency:</em> provide predictable control of voice output across platforms and across <a href="#term-synthesis">speech synthesis</a> implementations.</li>
<li><em>Interoperability:</em> support use along with other W3C specifications including (but not limited to) VoiceXML, aural Cascading Style Sheets and SMIL.</li>
<li><em>Generality:</em> support speech output for a wide range of applications with varied speech content.</li>
<li><em>Internationalization:</em> Enable speech output in a large number of languages within or across documents.</li>
<li><em>Generation and Readability:</em> Support automatic generation and hand authoring of documents. The documents should be human-readable.</li>
<li><em>Implementable:</em> The specification should be implementable with existing, generally available technology, and the number of optional features should be minimal.</li>
</ul>
<h3><a id="S1.2" name="S1.2">1.2</a> Speech Synthesis Process Steps</h3>
<p>A <a href="#term-tts">Text-To-Speech</a> system (a <a href="#term-processor">synthesis processor</a>) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.</p>
<p><em>Document creation:</em> A text document provided as input to the <a href="#term-processor">synthesis processor</a> may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.</p>
<p><em>Document processing:</em> The following are the six major processing steps undertaken by a <a href="#term-processor">synthesis processor</a> to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output. Although each step below is divided into "markup support" and "non-markup behavior", actual behavior is usually a mix of the two and varies depending on the tag. The processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is then up to the processor to determine whether and in what way to use the information.</p>
<ol>
<li>
<p><b>XML parse:</b> An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps. Tokens (words) in SSML cannot span markup tags. A simple English example is "cup<break/>board"; the <a href="#term-processor">synthesis processor</a> will treat this as the two words "cup" and "board" rather than as one word with a pause in the middle. Breaking one token into multiple tokens this way will likely affect how the processor treats it.</p>
</li>
<li>
<p><b>Structure analysis:</b> The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.</p>
<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_paragraph" class="eref">p</a> and <a href="#edef_sentence" class="eref">s</a> elements defined in SSML explicitly indicate document structures that affect the speech output.</p>
</li>
<li>
<p><em>Non-markup behavior:</em> In documents and parts of documents where these elements are not used, the <a href="#term-processor">synthesis processor</a> is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.</p>
</li>
</ul>
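<p>A short fragment using these structural elements might look as follows (an illustrative sketch):</p>
<pre class="example">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xml:lang="en-US"&gt;
  &lt;p&gt;
    &lt;s&gt;This is the first sentence of the paragraph.&lt;/s&gt;
    &lt;s&gt;Here is another sentence.&lt;/s&gt;
  &lt;/p&gt;
&lt;/speak&gt;
</pre>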
</li>
<li>
<p><a name="text_normalization" id="text_normalization"><b>Text normalization:</b></a> All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the <a href="#term-processor">synthesis processor</a> that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on. By the end of this step the text to be spoken has been converted completely into tokens. The exact details of what constitutes a token are language-specific. In English, tokens are usually separated by white space and are typically words. For languages with different tokenization behavior, the term "word" in this specification is intended to mean an appropriately comparable unit.</p>
<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_say-as" class="eref">say-as</a> element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked has not yet been defined but might include dates, times, numbers, acronyms, currency amounts and more. Note that many acronyms and abbreviations can be handled by the author via direct text replacement or by use of the <a href="#edef_sub" class="eref">sub</a> element, e.g. "BBC" can be written as "B B C" and "AAA" can be written as "triple A". These replacement written forms will likely be pronounced as one would want the original acronyms to be pronounced. In the case of Japanese text, if you have a <a href="#term-processor">synthesis processor</a> that supports both Kanji and kana, you may be able to use the <a href="#edef_sub" class="eref">sub</a> element to identify whether 今日は should be spoken as きょうは ("kyou wa" = "today") or こんにちは ("konnichiwa" = "hello").</p>
</li>
<li>
<p><em>Non-markup behavior:</em> For text content that is not marked with the <a href="#edef_say-as" class="eref">say-as</a> element the <a href="#term-processor">synthesis processor</a> is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different processors to render the same document differently.</p>
</li>
</ul>
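<p>For example, the <a href="#edef_sub" class="eref">sub</a> replacement approach described above might be marked as follows (an illustrative sketch):</p>
<pre class="example">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
    xml:lang="en-US"&gt;
  &lt;sub alias="World Wide Web Consortium"&gt;W3C&lt;/sub&gt;
&lt;/speak&gt;
</pre>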
</li>
<li>
<p><b>Text-to-phoneme conversion:</b> Once the <a href="#term-processor">synthesis processor</a> has determined the set of words to be spoken, it must derive pronunciations for each word. Word pronunciations may be conveniently described as sequences of phonemes, which are units of sound in a language that serve to distinguish one word from another. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes, Hawai'ian has between 12 and 18 (depending on who you ask), and some languages have more than 100! This conversion is made complex by a number of issues. One issue is that there are differences between written and spoken forms of a language, and these differences can lead to indeterminacy or ambiguity in the pronunciation of written words. For example, compared with their spoken form, words in Hebrew and Arabic are usually written with no vowels, or only a few vowels specified. In many languages the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Both human speakers and synthesis processors can pronounce these words correctly in context but may have difficulty without context (see "Non-markup behavior" below). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English <a href="#term-processor">synthesis processor</a> will often have trouble determining how to speak some non-English-origin names, e.g. "Caius College" (pronounced "keys college") and President Tito (pronounced "sutto"), the president of the Republic of Kiribati (pronounced "kiribass").</p>
<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_phoneme" class="eref">phoneme</a> element allows a phonemic sequence to be provided for any word or word sequence. This provides the content creator with explicit control over pronunciations. The <a href="#edef_say-as" class="eref">say-as</a> element might also be used to indicate that text is a proper name that may allow a <a href="#term-processor">synthesis processor</a> to apply special rules to determine a pronunciation. The <a href="#edef_lexicon" class="eref">lexicon</a> element can be used to reference external definitions of pronunciations. These elements can be particularly useful for acronyms and abbreviations that the processor is unable to resolve via its own <a href="#text_normalization">text normalization</a> and that are not addressable via direct text substitution or the <a href="#edef_sub" class="eref">sub</a> element (see paragraph 3, above).</p>
</li>
<li>
<p><em>Non-markup behavior:</em> In the absence of a <a href="#edef_phoneme" class="eref">phoneme</a> element the <a href="#term-processor">synthesis processor</a> must apply automated capabilities to determine pronunciations. This is typically achieved by looking up words in a pronunciation dictionary (which may be language-dependent) and applying rules to determine other pronunciations. <a href="#term-processor">Synthesis processors</a> are designed to perform text-to-phoneme conversions so most words of most documents can be handled automatically. As an alternative to relying upon the processor, authors may choose to perform some conversions themselves prior to encoding in SSML. Written words with indeterminate or ambiguous pronunciations could be replaced by words with an unambiguous pronunciation; for example, in the case of "read", "I will reed the book". Authors should be aware, however, that the resulting SSML document may not be optimal for visual display.</p>
</li>
</ul>
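<p>For instance, the "read"/"red" ambiguity described above could be resolved explicitly with the <a href="#edef_phoneme" class="eref">phoneme</a> element, and an abbreviation could be expanded with the <a href="#edef_sub" class="eref">sub</a> element. The following fragment is a sketch; the IPA string is illustrative only:</p>
<pre class="example">
<s>I have <phoneme alphabet="ipa" ph="rɛd">read</phoneme> the book.</s>
<s>The <sub alias="World Wide Web Consortium">W3C</sub> publishes SSML.</s>
</pre>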
</li>
<li>
<p><b>Prosody analysis:</b> Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.</p>
<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_emphasis" class="eref">emphasis</a> element, <a href="#edef_break" class="eref">break</a> element and <a href="#edef_prosody" class="eref">prosody</a> element may all be used by document creators to guide the <a href="#term-processor">synthesis processor</a> in generating appropriate prosodic features in the speech output.</p>
</li>
<li>
<p><em>Non-markup behavior:</em> In the absence of these elements, <a href="#term-processor">synthesis processors</a> are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.</p>
</li>
</ul>
<p>While most of the elements of SSML can be considered high-level in that they provide either content to be spoken or logical descriptions of style, the <a href="#edef_break" class="eref">break</a> and <a href="#edef_prosody" class="eref">prosody</a> elements mentioned above operate at a later point in the process and thus must coexist both with uses of the <a href="#edef_emphasis" class="eref">emphasis</a> element and with the processor's own determinations of prosodic behavior. Unless specified in the appropriate sections, details of the interactions between the processor's own determinations and those provided by the author at this level are processor-specific. Authors are encouraged not to mix these two levels of control casually or arbitrarily.</p>
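<p>A sketch of how these elements might be combined follows; the exact rendering remains processor-specific:</p>
<pre class="example">
<emphasis>Attention:</emphasis> the meeting begins
<break time="300ms"/>
<prosody rate="slow" volume="loud">at ten o'clock</prosody>.
</pre>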
</li>
<li>
<p><b>Waveform production:</b> The phonemes and prosodic information are used by the <a href="#term-processor">synthesis processor</a> in the production of the audio waveform. There are many approaches to this processing step so there may be considerable processor-specific variation.</p>
<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_voice" class="eref">voice</a> element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The <a href="#edef_audio" class="eref">audio</a> element allows for insertion of recorded audio data into the output stream.</p>
</li>
</ul>
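<p>A fragment combining the two elements might look like the following (the audio file name is hypothetical; the alternate text inside <a href="#edef_audio" class="eref">audio</a> may be rendered if the recording cannot be fetched):</p>
<pre class="example">
<voice gender="female" age="25">
  <audio src="greeting.wav">Welcome to our service.</audio>
</voice>
</pre>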
</li>
</ol>
<h3><a id="S1.3" name="S1.3">1.3</a> Document Generation, Applications and Contexts</h3>
<p>There are many classes of document creator that will produce marked-up documents to be spoken by a <a href="#term-processor">synthesis processor</a>. Not all document creators (whether human or machine) have access to information that can be used in all of the elements or in each of the processing steps described in the <a href="#S1.2">previous section</a>. The following are some of the common cases.</p>
<ul>
<li>
<p>The document creator has no access to information to mark up the text. All processing steps in the <a href="#term-processor">synthesis processor</a> must be performed fully automatically on <em>raw text</em>. The document requires only the containing <a href="#edef_speak" class="eref">speak</a> element to indicate the content is to be spoken.</p>
</li>
<li>
<p>When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, <a href="#text_normalization">text normalization</a>, prosody and possibly text-to-phoneme conversion.</p>
</li>
<li>
<p>Some document creators make considerable effort to mark as many details of the document as possible to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and <a href="#term-voicebrowser">voice browser</a> applications may be fine-tuned to maximize the effectiveness of the overall system.</p>
</li>
<li>
<p>The most advanced document creators may skip the higher-level markup (structure, <a href="#text_normalization">text normalization</a>, text-to-phoneme conversion, and prosody analysis) and produce low-level <a href="#term-synthesis">speech synthesis</a> markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.</p>
</li>
</ul>
<p>The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.</p>
<ul>
<li>
<p><em>Dialog language</em>: It is a requirement that it be possible to include documents marked up with SSML in the dialog description document to be produced by the Voice Browser Working Group.</p>
</li>
<li>
<p><em>Interoperability with aural CSS (ACSS)</em>: Any HTML processor that is aural CSS-enabled can produce SSML. ACSS is covered in <a href="http://www.w3.org/TR/1998/REC-CSS2-19980512/aural.html">Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification</a> [<a href="#ref-css2">CSS2</a> §19]. This usage of <a href="#term-synthesis">speech synthesis</a> facilitates improved accessibility to existing HTML and XHTML content.</p>
</li>
<li>
<p><em>Application-specific style sheet processing</em>: As mentioned above, there are classes of applications that have knowledge of the text content to be spoken, knowledge that can be incorporated into the <a href="#term-synthesis">speech synthesis</a> markup to enhance rendering of the document. In many cases, it is expected that the application will use style sheets to perform transformations of existing XML documents to SSML. This is equivalent to the use of ACSS with HTML and once again SSML is the resulting representation to be passed to the <a href="#term-processor">synthesis processor</a>. In this context, SSML may be viewed as a superset of <a href="http://www.w3.org/TR/1998/REC-CSS2-19980512/aural.html">ACSS</a> [<a href="#ref-css2">CSS2</a> §19] capabilities, excepting spatial audio.</p>
</li>
</ul>
<h3><a id="S1.4" name="S1.4">1.4</a> Platform-Dependent Output Behavior of SSML Content</h3>
<p>SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.</p>
<p>Unless otherwise specified, markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the <a href="#term-processor">synthesis processor</a> cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.</p>
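<p>For example, in a fragment like the following (durations chosen purely for illustration), the inner five-second request cannot fit within the outer three-second one, so the processor may adjust one or both values as needed:</p>
<pre class="example">
<prosody duration="3s">
  This whole sentence should take three seconds,
  <prosody duration="5s">but this part alone requests five.</prosody>
</prosody>
</pre>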
<h3><a id="S1.5" name="S1.5">1.5</a> Terminology</h3>
<dl>
<dt><br />
<b><em><a id="term-requirements" name="term-requirements">Requirements terms</a></em></b></dt>
<dd>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [<a href="#ref-rfc2119">RFC2119</a>]. However, for readability, these words do not appear in all uppercase letters in this specification.</dd>
<dt><br />
<b><em><a id="term-useroption" name="term-useroption">At user option</a></em></b></dt>
<dd>A conforming <a href="#term-processor">synthesis processor</a> may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users with a means to enable or disable the behavior described.</dd>
<dt><br />
<b><em><a id="term-error" name="term-error">Error</a></em></b></dt>
<dd>Results are undefined. A conforming <a href="#term-processor">synthesis processor</a> may detect and report an error and may recover from it.</dd>
<dt><br />
<b><em><a id="term-media-type" name="term-media-type">Media Type</a></em></b></dt>
<dd>A <em>media type</em> (defined in [<a href="#ref-rfc2045">RFC2045</a>] and [<a href="#ref-rfc2046">RFC2046</a>]) specifies the nature of a linked resource. Media types are case insensitive. A list of registered media types is available for download [<a href="#ref-mimetypes">TYPES</a>].
See <a href="#AppC">Appendix C</a> for information on media types for SSML.
</dd>
<dt><br />
<b><em><a id="term-synthesis" name="term-synthesis">Speech Synthesis</a></em></b></dt>
<dd>The process of automatic generation of speech output from data input which may include plain text, marked up text or binary objects.</dd>
<dt><br />
<b><em><a id="term-processor" name="term-processor">Synthesis Processor</a></em></b></dt>
<dd>A <a href="#term-tts">Text-To-Speech</a> system that accepts SSML documents as input and renders them as spoken output.</dd>
<dt><br />
<b><em><a id="term-tts" name="term-tts">Text-To-Speech</a></em></b></dt>
<dd>The process of automatic generation of speech output from text or annotated text input.</dd>
<dt><br />
<b><em><a id="term-uri" name="term-uri">URI: Uniform Resource Identifier</a></em></b></dt>
<dd>A URI is a unifying syntax for the expression of names and addresses of objects on the network as used in the World Wide Web. A URI is defined as any legal <code><a href="http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI">anyURI</a></code> primitive as defined in XML Schema Part 2: Datatypes [<a href="#ref-schema2">SCHEMA2</a> §3.2.17]. For informational purposes only, [<a href="#ref-rfc2396">RFC2396</a>] and [<a href="#ref-rfc2732">RFC2732</a>] may be useful in understanding the structure, format, and use of URIs. Any relative URI reference must be resolved according to the rules given in <a href="#S3.1.3.1">Section 3.1.3.1</a>. In this specification URIs are provided as attributes to elements, for example in the <a href="#edef_audio" class="eref">audio</a> and <a href="#edef_lexicon" class="eref">lexicon</a> elements.</dd>
<dt><br />
<b><em><a id="term-voicebrowser" name="term-voicebrowser">Voice Browser</a></em></b></dt>
<dd>A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.</dd>
</dl>
<h2 id="g28"><a id="S2" name="S2">2.</a> SSML Documents</h2>
<h3 id="g29"><a id="S2.1" name="S2.1">2.1</a> Document Form</h3>
<p>A legal stand-alone Speech Synthesis Markup Language document must have a legal <a href="http://www.w3.org/TR/2000/REC-xml-20001006#sec-prolog-dtd">XML Prolog</a> [<a href="#ref-xml">XML</a> §2.8]. If present, the optional DOCTYPE must read as follows:</p>
<pre>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
</pre>
<p>The XML prolog is followed by the root <a href="#edef_speak" class="eref">speak</a> element. See <a href="#S3.1.1">Section 3.1.1</a> for details on this element.</p>
<p>The <a href="#edef_speak" class="eref">speak</a> element must designate the SSML namespace. This can be achieved by declaring an <code class="att">xmlns</code> attribute or an attribute with an "xmlns" prefix. See [<a href="#ref-xmlns">XMLNS</a> §2] for details. Note that when the <code class="att">xmlns</code> attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements. The namespace for SSML is defined to be <a href="http://www.w3.org/2001/10/synthesis">http://www.w3.org/2001/10/synthesis</a>.</p>
<p>It is recommended that the <a href="#edef_speak" class="eref">speak</a> element also indicate the location of the SSML schema (see <a href="#AppD">Appendix D</a>) via the <code class="att"><a href="http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/#xsi_schemaLocation">xsi:schemaLocation</a></code> attribute from [<a href="#ref-schema1">SCHEMA1</a> §2.6.3]. Although such indication is not required, this document encourages the practice by including it in all of the examples.</p>
<p>The following are two examples of legal SSML headers:</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
</pre>
<pre class="example">
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
</pre>
<p>The <a href="#edef_meta" class="eref">meta</a>, <a href="#edef_metadata" class="eref">metadata</a> and <a href="#edef_lexicon" class="eref">lexicon</a> elements must occur before all other elements and text contained within the root <a href="#edef_speak" class="eref">speak</a> element. There are no other ordering constraints on the elements in this specification.</p>
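<p>For example, a document honoring this ordering constraint might begin as follows (the lexicon and metadata URIs are hypothetical):</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.xml"/>
  <meta name="seeAlso" content="http://www.example.com/metadata.xml"/>
  Hello world.
</speak>
</pre>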
<h2 id="g34"><a id="S2.2" name="S2.2">2.2</a> Conformance</h2>
<h3 id="g35"><a id="S2.2.1" name="S2.2.1">2.2.1</a> Conforming Speech Synthesis Markup Language Fragments</h3>
<p>A document fragment is a <em>Conforming Speech Synthesis Markup Language Fragment</em> if:</p>
<ul>
<li>it conforms to the criteria for <a href="#S2.2.2">Conforming Stand-Alone Speech Synthesis Markup Language Documents</a> after:
<ul>
<li>with the exception of <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> and <a href="#adef_xmlbase" class="aref"><code class="att">xml:base</code></a>, all non-synthesis namespace elements and attributes and all <code class="att">xmlns</code> attributes which refer to non-synthesis namespace elements are removed from the document,</li>
<li>and, if the <a href="#edef_speak" class="eref">speak</a> element does not already designate the synthesis namespace using the <code class="att">xmlns</code> attribute, then <code>xmlns="http://www.w3.org/2001/10/synthesis"</code> is added to the element.</li>
</ul>
</li>
</ul>
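<p>As an illustration, the following fragment (with a hypothetical non-synthesis attribute) is conforming: once the foreign namespace declaration and attribute are removed and <code>xmlns="http://www.w3.org/2001/10/synthesis"</code> is added to the <a href="#edef_speak" class="eref">speak</a> element, the result is a Conforming Stand-Alone document:</p>
<pre class="example">
<speak version="1.0" xml:lang="en-US"
       xmlns:ex="http://www.example.com/extension"
       ex:note="annotation ignored by the processor">
  Hello world.
</speak>
</pre>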
<h3 id="g36"><a id="S2.2.2" name="S2.2.2">2.2.2</a> Conforming Stand-Alone Speech Synthesis Markup Language Documents</h3>
<p>A document is a <em>Conforming Stand-Alone Speech Synthesis Markup Language Document</em> if it meets both the following conditions:</p>
<ul>
<li>It is a <a href="http://www.w3.org/TR/2000/REC-xml-20001006#sec-well-formed">well-formed XML document</a> [<a href="#ref-xml">XML</a> §2.1] conforming to Namespaces in XML <a href="#ref-xmlns">[XMLNS]</a>.</li>
<li>It is a <a href="http://www.w3.org/TR/2000/REC-xml-20001006#sec-prolog-dtd" shape="rect">valid XML document</a> [<a href="#ref-xml">XML</a> §2.8] which adheres to the specification described in this document (<a href="#S1" shape="rect">Speech Synthesis Markup Language Specification</a>) including the constraints expressed in the Schema (see <a href="#AppD" shape="rect">Appendix D</a>) and having an XML Prolog and <a href="#edef_speak" class="eref">speak</a> root element as specified in <a href="#S2.1">Section 2.1</a>.</li>
</ul>
<p>The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.</p>
<h3><a id="S2.2.3" name="S2.2.3">2.2.3</a> Using SSML with other Namespaces</h3>
<p>The synthesis namespace may be used with other XML namespaces as per the Namespaces in XML Recommendation [<a href="#ref-xmlns">XMLNS</a>]. Future work by W3C is expected to address ways to specify conformance for documents involving multiple namespaces.</p>
<h3 id="g38"><a id="S2.2.4" name="S2.2.4">2.2.4</a> Conforming Speech Synthesis Markup Language Processors</h3>
<p>A Speech Synthesis Markup Language processor is a program that can parse and process <a href="#S2.2.2">Conforming Stand-Alone Speech Synthesis Markup Language documents</a>.</p>
<p>In a <em>Conforming Speech Synthesis Markup Language Processor</em>, the XML parser must be able to parse and process all XML constructs defined by XML 1.0 [<a href="#ref-xml">XML</a>] and Namespaces in XML [<a href="#ref-xmlns">XMLNS</a>]. This XML parser is not required to perform validation of an SSML document as per its schema or DTD; this implies that during processing of an SSML document it is optional to apply or expand external entity references defined in an external DTD.</p>
<p>A Conforming Speech Synthesis Markup Language Processor must correctly understand and apply the semantics of each markup element as described by this document.</p>
<p>A Conforming Speech Synthesis Markup Language Processor must meet the following requirements for handling of natural (human) languages:</p>
<ul>
<li>A Conforming Speech Synthesis Markup Language Processor is required to parse all legal natural language declarations successfully.</li>
<li>A Conforming Speech Synthesis Markup Language Processor may be able to apply the semantics of markup languages which refer to more than one natural language. When a processor is able to support each natural language in the set but is unable to handle them concurrently it should inform the hosting environment. When the set includes one or more natural languages that are not supported by the processor it should inform the hosting environment.</li>
<li>A Conforming Speech Synthesis Markup Language Processor may implement natural languages by approximate substitutions according to a documented, processor-specific behavior. For example, a US English synthesis processor could process British English input.</li>
</ul>
<p>When a Conforming Speech Synthesis Markup Language Processor encounters elements or attributes, other than <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> and <a href="#adef_xmlbase" class="aref"><code class="att">xml:base</code></a>, in a non-synthesis namespace it may:</p>
<ul>
<li>ignore the non-standard elements and/or attributes</li>
<li>or, process the non-standard elements and/or attributes</li>
<li>or, reject the document containing those elements and/or attributes</li>
</ul>
<p>There is, however, no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.</p>
<h3 id="g388"><a id="S2.2.5" name="S2.2.5">2.2.5</a> Conforming User Agent</h3>
<p>A <em>Conforming User Agent</em> is a <a href="#S2.2.4">Conforming Speech Synthesis Markup Language Processor</a> that is capable of accepting an SSML document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author. A Conforming User Agent must support at least one natural language.</p>
<p>Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input, there is no conformance requirement regarding accuracy. A conformance test may, however, require some examples of correct synthesis of a reference document to determine conformance.</p>
<h3 id="g30"><a id="S2.3" name="S2.3">2.3</a> Integration With Other Markup Languages</h3>
<h4 id="g31"><a id="S2.3.1" name="S2.3.1">2.3.1</a> SMIL</h4>
<p>The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [<a href="#ref-smil">SMIL</a>] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in <a href="#AppF">Appendix F</a>.</p>
<h4 id="g32"><a id="S2.3.2" name="S2.3.2">2.3.2</a> ACSS</h4>
<p>Aural Cascading Style Sheets [<a href="#ref-css2">CSS2</a> §19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.</p>
<h4 id="g2.3.3t"><a id="S2.3.3" name="S2.3.3">2.3.3</a> VoiceXML</h4>
<p>The Voice Extensible Markup Language [<a href="#ref-vxml">VXML</a>] enables Web-based development and content-delivery for interactive voice response applications (see <a href="#term-voicebrowser"><em>voice browser</em></a> ). VoiceXML supports <a href="#term-synthesis">speech synthesis</a>, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see <a href="#AppF">Appendix F</a>.</p>
<h3 id="g33"><a id="S2.4" name="S2.4">2.4</a> Fetching SSML Documents</h3>
<p>The fetching and caching behavior of SSML documents is defined by the environment in which the <a href="#term-processor">synthesis processor</a> operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.</p>
<h2><a id="S3" name="S3">3.</a> Elements and Attributes</h2>
<p>The following elements and attributes are defined in this specification.</p>
<ul class="toc">
<li class="tocline">3.1 <a href="#S3.1">Document Structure, Text Processing and Pronunciation</a>
<ul class="toc">
<li class="tocline">3.1.1 <a href="#S3.1.1">"speak" Root Element</a></li>
<li class="tocline">3.1.2 <a href="#S3.1.2">Language: "xml:lang" Attribute</a></li>
<li class="tocline">3.1.3 <a href="#S3.1.3">base URI: "xml:base" Attribute</a></li>
<li class="tocline">3.1.4 <a href="#S3.1.4">Pronunciation Lexicon: "lexicon" Element</a></li>
<li class="tocline">3.1.5 <a href="#S3.1.5">"meta" Element</a></li>
<li class="tocline">3.1.6 <a href="#S3.1.6">"metadata" Element</a></li>
<li class="tocline">3.1.7 <a href="#S3.1.7">Text Structure: "p" and "s" Elements</a></li>
<li class="tocline">3.1.8 <a href="#S3.1.8">"say-as" Element</a></li>
<li class="tocline">3.1.9 <a href="#S3.1.9">"phoneme" Element</a></li>
<li class="tocline">3.1.10 <a href="#S3.1.10">"sub" Element</a></li>
</ul>
</li>
<li class="tocline">3.2 <a href="#S3.2">Prosody and Style</a>
<ul class="toc">
<li class="tocline">3.2.1 <a href="#S3.2.1">"voice" Element</a></li>
<li class="tocline">3.2.2 <a href="#S3.2.2">"emphasis" Element</a></li>
<li class="tocline">3.2.3 <a href="#S3.2.3">"break" Element</a></li>
<li class="tocline">3.2.4 <a href="#S3.2.4">"prosody" Element</a></li>
</ul>
</li>
<li class="tocline">3.3 <a href="#S3.3">Other Elements</a>
<ul class="toc">
<li class="tocline">3.3.1 <a href="#S3.3.1">"audio" Element</a></li>
<li class="tocline">3.3.2 <a href="#S3.3.2">"mark" Element</a></li>
<li class="tocline">3.3.3 <a href="#S3.3.3">"desc" Element</a></li>
</ul>
</li>
</ul>
<h2><a id="S3.1" name="S3.1">3.1</a> Document Structure, Text Processing and Pronunciation</h2>
<h3><a id="S3.1.1" name="S3.1.1">3.1.1</a> <a name="edef_speak" id="edef_speak" class="edef">speak</a> Root Element</h3>
<p>The Speech Synthesis Markup Language is an XML application. The root element is <a href="#edef_speak" class="eref">speak</a>. <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> is a required attribute specifying the language of the root document. <a href="#adef_xmlbase" class="aref"><code class="att">xml:base</code></a> is an optional attribute specifying the Base <a href="#term-uri">URI</a> of the root document. The <code class="att">version</code> attribute is a required attribute that indicates the version of the specification to be used for the document and must have the value "1.0".</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
... the body ...
</speak>
</pre>
<p>The <a href="#edef_speak" class="eref">speak</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref">audio</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_lexicon" class="eref">lexicon</a>, <a href="#edef_mark" class="eref">mark</a>, <a href="#edef_meta" class="eref">meta</a>, <a href="#edef_metadata" class="eref">metadata</a>, <a href="#edef_paragraph" class="eref">p</a>, <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_prosody" class="eref">prosody</a>, <a href="#edef_say-as" class="eref">say-as</a>, <a href="#edef_sub" class="eref">sub</a>, <a href="#edef_sentence" class="eref">s</a>, <a href="#edef_voice" class="eref">voice</a>.</p>
<h3><a id="S3.1.2" name="S3.1.2">3.1.2</a> Language: <a class="adef" id="adef_xmllang" name="adef_xmllang"><code>xml:lang</code></a> Attribute</h3>
<p>The <a href="http://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag"><code class="att">xml:lang</code> attribute</a>, as defined by XML 1.0 [<a href="#ref-xml">XML</a> §2.12], can be used in SSML to indicate the natural language of the enclosing element and its attributes and subelements. RFC 3066 [<a href="#ref-rfc3066">RFC3066</a>] may be of some use in understanding how to use this attribute.</p>
<p>Language information is inherited down the document hierarchy, i.e. it has to be given only once if the whole document is in one language, and language information nests, i.e. inner attributes overwrite outer attributes.</p>
<p><a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> is a defined attribute for the <a href="#edef_voice" class="eref">voice</a>, <a href="#edef_speak" class="eref">speak</a>, <a href="#edef_paragraph" class="eref">p</a>, and <a href="#edef_sentence" class="eref">s</a> elements. For vocal rendering, a language change can have an effect on various other parameters (including gender, speed, age, pitch, etc.) which may be disruptive to the listener. There might even be unnatural breaks between language shifts. For this reason authors are encouraged to use the <a href="#edef_voice" class="eref">voice</a> element to change the language. <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> is permitted on <a href="#edef_paragraph" class="eref">p</a> and <a href="#edef_sentence" class="eref">s</a> only because it is common to change the language at those levels.</p>
<p>Although this attribute is also permitted on the <a href="#edef_desc" class="eref">desc</a> element, none of the voice-change behavior described in this section applies when used with that element.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>I don't speak Japanese.</p>
<p xml:lang="ja">日本語が分かりません。</p>
</speak>
</pre>
<p>In the case that a document requires speech output in a language not supported by the processor, the <a href="#term-processor">synthesis processor</a> largely determines behavior. Specifying <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> does not imply a change in voice, though this may indeed occur. When a given voice is unable to speak content in the indicated language, a new voice may be selected by the processor. No change in the voice or prosody should occur if the <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> value is the same as the inherited value. Further information about voice selection appears in <a href="#S3.2.1">Section 3.2.1</a>.</p>
<p>There may be variation across conforming processors in the implementation of <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> voice changes for different markup elements (e.g. <a href="#edef_paragraph" class="eref">p</a> and <a href="#edef_sentence" class="eref">s</a> elements).</p>
<p>All elements should process their contents specific to the enclosing language. For instance, the <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_paragraph" class="eref">p</a> and <a href="#edef_sentence" class="eref">s</a> elements should each be rendered in a manner that is appropriate to the current language.</p>
<p>The <a href="#text_normalization">text normalization</a> processing step may be affected by the enclosing language. This is true for both markup support by the <a href="#edef_say-as" class="eref">say-as</a> element and non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.</p>
<pre class="example">
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<s>Today, 2/1/2000.</s>
<!-- Today, February first two thousand -->
<s xml:lang="it">Un mese fa, 2/1/2000.</s>
<!-- Un mese fa, il due gennaio duemila -->
<!-- One month ago, the second of January two thousand -->
</speak>
</pre>
<h3 id="s10"><a id="S3.1.3" name="S3.1.3">3.1.3</a> base URI: <a href="#adef_xmlbase" class="aref"><code class="att">xml:base</code></a> Attribute</h3>
<p>Relative <a href="#term-uri">URIs</a> are resolved according to a <em>base URI</em>, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See <a href="#S3.1.3.1">Section 3.1.3.1</a> for details on the resolution of relative URIs.</p>
<p>The <a href="#xmlbase">base URI declaration</a> is permitted but optional. The two elements affected by it are</p>
<blockquote>
<dl>
<dt><a href="#edef_audio" class="eref">audio</a></dt>
<dd>The optional <code class="att">src</code> attribute can specify a relative URI.</dd>
<dt><a href="#edef_lexicon" class="eref">lexicon</a></dt>
<dd>The <code class="att">uri</code> attribute can specify a relative URI.</dd>
</dl>
</blockquote>
<h4 id="id-S4.9-abnf"><a name="xmlbase" id="xmlbase"></a>The <a name="adef_xmlbase" id="adef_xmlbase" class="adef">xml:base</a> attribute</h4>
<p>The base <a href="#term-uri">URI</a> declaration follows [<a href="#ref-xml-base">XML-BASE</a>] and is indicated by an <a href="#adef_xmlbase" class="aref"><code class="att">xml:base</code></a> attribute on the root <a href="#edef_speak" class="eref">speak</a> element.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:base="http://www.example.com/base-file-path">
</pre>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:base="http://www.example.com/another-base-file-path">
</pre>
<h4 id="s25"><a id="S3.1.3.1" name="S3.1.3.1">3.1.3.1</a> Resolving Relative URIs</h4>
<p>User agents must calculate the base <a href="#term-uri">URI</a> for resolving relative URIs according to [<a href="#ref-rfc2396">RFC2396</a>]. The following describes how RFC2396 applies to synthesis documents.</p>
<p>User agents must calculate the base URI according to the following precedences (highest priority to lowest):</p>
<ol>
<li>The base URI is set by the <a href="#adef_xmlbase" class="aref"><code class="att">xml:base</code></a> attribute on the <a href="#edef_speak" class="eref">speak</a> element (see <a href="#S3.1.3">Section 3.1.3</a>).</li>
<li>The base URI is given by metadata discovered during a protocol interaction, such as an HTTP header (see [<a href="#ref-rfc2616">RFC2616</a>]).</li>
<li>By default, the base URI is that of the current document. Not all synthesis documents have a base URI (e.g., a valid synthesis document may appear in an email and may not be designated by a URI). It is an <a href="#term-error">error</a> if such documents contain relative URIs.</li>
</ol>
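<p>Informative: the precedence order above can be sketched as follows. This is a minimal, non-normative Python illustration (the function and parameter names are invented for this sketch); actual resolution must follow [<a href="#ref-rfc2396">RFC2396</a>]:</p>

```python
from urllib.parse import urljoin

def resolve(relative_uri, xml_base=None, protocol_base=None, document_uri=None):
    """Resolve a relative URI using the first available base, in
    precedence order: xml:base attribute, protocol metadata (e.g. an
    HTTP header), then the document's own URI."""
    base = xml_base or protocol_base or document_uri
    if base is None:
        # A document with no base URI must not contain relative URIs
        raise ValueError("error: relative URI with no base URI")
    return urljoin(base, relative_uri)

# xml:base takes precedence over the document URI
resolve("lexicon.file",
        xml_base="http://www.example.com/base-file-path/",
        document_uri="http://other.example.org/doc.ssml")
# -> "http://www.example.com/base-file-path/lexicon.file"
```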
<h3 id="s12"><a id="S3.1.4" name="S3.1.4">3.1.4</a> Pronunciation Lexicon: <a href="#edef_lexicon" class="eref">lexicon</a> Element</h3>
<p>An SSML document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a <a href="#term-uri">URI</a> with an optional <a href="#term-media-type">media type</a>. No standard lexicon media type has yet been defined as the default for this specification.</p>
<p>The W3C Voice Browser Working Group is developing the Pronunciation Lexicon Markup Language [<a href="#ref-lex">LEX</a>]. The specification is expected to address the matching process between tokens and lexicon entries and the mechanism by which a <a href="#term-processor">synthesis processor</a> handles multiple pronunciations from internal and synthesis-specified lexicons. Pronunciation handling with proprietary lexicon formats will necessarily be specific to the <a href="#term-processor">synthesis processor</a>.</p>
<p>A lexicon document contains pronunciation information for tokens that can appear in a text to be spoken. The pronunciation information contained within a lexicon is used for tokens appearing within the referencing document.</p>
<p>Pronunciation lexicons are necessarily language-specific. Pronunciation lookup in a lexicon and pronunciation inference for any token may use an algorithm that is language-specific. As mentioned in <a href="#S1.2">Section 1.2</a>, the definition of what constitutes a "token" may itself be language-specific.</p>
<p>When multiple lexicons are referenced, their precedence goes from lower to higher with document order. Precedence means that a token is looked up first in the lexicon with the highest precedence. Only if the token is not found there is the next lexicon searched, and so on, until a match is found or all lexicons have been used for lookup.</p>
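<p>Informative: the lookup order can be sketched as follows, assuming for illustration that each lexicon is modeled as a simple token-to-pronunciation mapping (the dictionaries and pronunciations below are hypothetical):</p>

```python
def lookup(token, lexicons):
    """Look up a token in lexicons given in document order.
    Later lexicons have higher precedence, so search in reverse."""
    for lexicon in reversed(lexicons):
        if token in lexicon:
            return lexicon[token]
    return None  # fall back to the processor's built-in rules

general = {"read": "r eh d"}
overrides = {"read": "r iy d"}
# Document order: general first, overrides second -> overrides wins
lookup("read", [general, overrides])  # -> "r iy d"
```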
<h4>The <a name="edef_lexicon" id="edef_lexicon" class="edef">lexicon</a> element</h4>
<p>Any number of <a href="#edef_lexicon" class="eref">lexicon</a> elements may occur as immediate children of the <a href="#edef_speak" class="eref">speak</a> element. The <a href="#edef_lexicon" class="eref">lexicon</a> element must have a <code class="att">uri</code> attribute specifying a <a href="#term-uri">URI</a> that identifies the location of the pronunciation lexicon document.</p>
<p>The <a href="#edef_lexicon" class="eref">lexicon</a> element may have a <code class="att">type</code> attribute that specifies the <a href="#term-media-type">media type</a> of the pronunciation lexicon document.</p>
<pre class="example">
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<lexicon uri="http://www.example.com/lexicon.file"/>
<lexicon uri="http://www.example.com/strange-words.file"
type="media-type"/>
...
</speak>
</pre>
<h4 id="lexicon_type">Details of the type attribute</h4>
<p><i>Note: the description and table that follow use an imaginary vendor-specific lexicon type of <code>x-vnd.example.lexicon</code>. This is intended to represent whatever format is returned/available, as appropriate.</i></p>
<p>A lexicon resource indicated by a <a href="#term-uri">URI</a> reference may be available in one or more <a href="#term-media-type">media types</a>. The SSML author can specify the preferred media type via the <code class="att">type</code> attribute. When the content represented by a URI is available in many data formats, a <a href="#term-processor">synthesis processor</a> may use the preferred type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the type to order the preferences in the negotiation.</p>
<p>Upon delivery, the resource indicated by a URI reference may be considered in terms of two types. The <i>declared media type</i> is the alleged value for the resource and the <i>actual media type</i> is the true format of its content. The actual type should be the same as the declared type, but this is not always the case (e.g. a misconfigured HTTP server might return <code>text/plain</code> for a document following the vendor-specific <code>x-vnd.example.lexicon</code> format). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. Whenever a type is returned, it is treated as authoritative. The declared media type is determined by the value returned by the resource owner or, if none is returned, by the preferred media type given in the SSML document.</p>
<p>Three special cases may arise. The declared type may not be supported by the processor; this is an <a href="#term-error">error</a>. The declared type may be supported but the actual type may not match; this is also an <a href="#term-error">error</a>. Finally, no media type may be declared; the behavior depends on the specific URI scheme and the capabilities of the <a href="#term-processor">synthesis processor</a>. For instance, HTTP 1.1 allows document introspection (see [<a href="#ref-rfc2616">RFC2616</a> §7.2.1]), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples:</p>
<table width="100%" border="1" cellpadding="3" summary="This table presents some informative examples of possible media type interpretations when the source document is of type x-vnd.example.lexicon.">
<caption>Media type examples</caption>
<tr>
<td width="20%"></td>
<th colspan="2" scope="col">
<div align="center"><b>HTTP 1.1 request</b></div>
</th>
<th colspan="2" scope="col">
<div align="center"><b>Local file access</b></div>
</th>
</tr>
<tr>
<th width="20%" scope="row">Media type returned by the resource owner</th>
<td width="20%">text/plain</td>
<td width="20%">x-vnd.example.lexicon</td>
<td width="20%"><none></td>
<td><none></td>
</tr>
<tr>
<th width="20%" scope="row">Preferred media type from the SSML document</th>
<td colspan="2">Not applicable; the returned type is authoritative.</td>
<td width="20%">x-vnd.example.lexicon</td>
<td><none></td>
</tr>
<tr>
<th width="20%" scope="row">Declared media type</th>
<td width="20%">text/plain</td>
<td width="20%">x-vnd.example.lexicon</td>
<td width="20%">x-vnd.example.lexicon</td>
<td><none></td>
</tr>
<tr>
<th width="20%" scope="row">Behavior for an actual media type of x-vnd.example.lexicon</th>
<td width="20%">The document must be processed as text/plain. This will generate an <a href="#term-error">error</a> if text/plain is not supported or if the document does not follow the expected format.</td>
<td colspan="2">The declared and actual types match; success if x-vnd.example.lexicon
is supported by the synthesis processor; otherwise an <a href="#term-error">error</a>.</td>
<td>Scheme specific; the synthesis processor might introspect the document to determine the type.</td>
</tr>
</table>
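<p>Informative: the declared-type determination and the error cases above can be sketched as follows. This is a non-normative Python illustration; the type strings are those used in the table, and the function names are invented for this sketch:</p>

```python
def declared_type(returned_type, preferred_type):
    """A type returned by the resource owner is authoritative;
    otherwise fall back to the SSML document's preferred type."""
    return returned_type if returned_type is not None else preferred_type

def check(declared, actual, supported):
    """Classify the three special cases described in the text."""
    if declared is None:
        return "scheme-specific"  # e.g. introspect the document
    if declared not in supported:
        return "error: declared type unsupported"
    if declared != actual:
        return "error: declared and actual types differ"
    return "ok"

supported = {"x-vnd.example.lexicon"}
# A misconfigured server returns text/plain for a lexicon document:
check(declared_type("text/plain", "x-vnd.example.lexicon"),
      "x-vnd.example.lexicon", supported)
# -> "error: declared type unsupported"
```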
<p>The <a href="#edef_lexicon" class="eref">lexicon</a> element is an empty element.</p>
<h3 id="S3.1.5t"><a id="S3.1.5" name="S3.1.5">3.1.5</a> <a class="edef" id="edef_meta" name="edef_meta">meta</a> Element</h3>
<p>The <a href="#edef_metadata" class="eref">metadata</a> and <a href="#edef_meta" class="eref">meta</a> elements are containers in which information about the document can be placed. The <a href="#edef_metadata" class="eref">metadata</a> element provides more general and powerful treatment of metadata information than <a href="#edef_meta" class="eref">meta</a> by using a metadata schema.</p>
<p>A <a href="#edef_meta" class="eref">meta</a> declaration associates a string with a declared meta property or declares "http-equiv" content. Either a <code class="att">name</code> or <code class="att">http-equiv</code> attribute is required. It is an <a href="#term-error">error</a> to provide both <code class="att">name</code> and <code class="att">http-equiv</code> attributes. A <code class="att">content</code> attribute is required. The <code class="att">seeAlso</code> property is the only defined <a href="#edef_meta" class="eref">meta</a> property name. It is used to specify a resource that might provide additional metadata information about the content. This property is modelled on the <a href="http://www.w3.org/TR/2004/REC-rdf-schema-20040210/#ch_seealso"><code class="att">seeAlso</code></a> property of the Resource Description Framework (RDF) Schema Specification 1.0 [<a href="#ref-rdf-schema">RDF-SCHEMA</a> §5.4.1]. The <code class="att">http-equiv</code> attribute has special significance when documents are retrieved via HTTP. Although the preferred method of providing HTTP header information is by using HTTP header fields, the "http-equiv" content may be used in situations where the SSML document author is unable to configure HTTP header fields associated with their document on the origin server, for example, cache control information. Note that HTTP servers and caches are not required to introspect the contents of <a href="#edef_meta" class="eref">meta</a> in SSML documents and thereby override the header values they would send otherwise.</p>
<p>Informative: This is an example of how <a href="#edef_meta" class="eref">meta</a> elements can be included in an SSML document to specify a resource that provides additional metadata information and also indicate that the document must not be cached.</p>
<pre class="example">
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/>
<meta http-equiv="Cache-Control" content="no-cache"/>
</speak>
</pre>
<p>The <a href="#edef_meta" class="eref">meta</a> element is an empty element.</p>
<h3 id="S3.1.6t"><a id="S3.1.6" name="S3.1.6">3.1.6</a> <a name="edef_metadata" id="edef_metadata" class="edef">metadata</a> Element</h3>
<p>The <a href="#edef_metadata" class="eref">metadata</a> element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with <a href="#edef_metadata" class="eref">metadata</a>, it is recommended that the XML syntax of the Resource Description Framework (RDF) [<a href="#ref-rdf-xml">RDF-XMLSYNTAX</a>] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [<a href="#ref-dc">DC</a>].</p>
<p>The Resource Description Framework [<a href="#ref-rdf">RDF</a>] is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [<a href="#ref-rdf-xml">RDF-XMLSYNTAX</a>] and [<a href="#ref-rdf-schema">RDF-SCHEMA</a>] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [<a href="#ref-dc">DC</a>], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).</p>
<p>Document properties declared with the <a href="#edef_metadata" class="eref">metadata</a> element can use any metadata schema.</p>
<p>Informative: This is an example of how <a href="#edef_metadata" class="eref">metadata</a> can be included in an SSML document using the Dublin Core version 1.0 RDF schema [<a href="#ref-dc">DC</a>] describing general document information such as title, description, date, and so on:</p>
<pre class="example">
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<metadata>
<rdf:RDF
xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
xmlns:dc = "http://purl.org/dc/elements/1.1/">
<!-- Metadata about the synthesis document -->
<rdf:Description rdf:about="http://www.example.com/meta.ssml"
dc:Title="Hamlet-like Soliloquy"
dc:Description="Aldine's Soliloquy in the style of Hamlet"
dc:Publisher="W3C"
dc:Language="en-US"
dc:Date="2002-11-29"
dc:Rights="Copyright 2002 Aldine Turnbet"
dc:Format="application/ssml+xml" >
<dc:Creator>
<rdf:Seq ID="CreatorsAlphabeticalBySurname">
<rdf:li>William Shakespeare</rdf:li>
<rdf:li>Aldine Turnbet</rdf:li>
</rdf:Seq>
</dc:Creator>
</rdf:Description>
</rdf:RDF>
</metadata>
</speak>
</pre>
<p>The <a href="#edef_metadata" class="eref">metadata</a> element can have arbitrary content, although none of the content will be rendered by the <a href="#term-processor">synthesis processor</a>.</p>
<h3><a id="S3.1.7" name="S3.1.7">3.1.7</a> Text Structure: <a class="edef" id="edef_paragraph" name="edef_paragraph">p</a> and <a class="edef" id="edef_sentence" name="edef_sentence">s</a> Elements</h3>
<p>A <a href="#edef_paragraph" class="eref">p</a> element represents a paragraph. An <a href="#edef_sentence" class="eref">s</a> element represents a sentence.</p>
<p><a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> is a defined attribute on the <a href="#edef_paragraph" class="eref">p</a> and <a href="#edef_sentence" class="eref">s</a> elements.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>This is the first sentence of the paragraph.</s>
<s>Here's another sentence.</s>
</p>
</speak>
</pre>
<p>The use of <a href="#edef_paragraph" class="eref">p</a> and <a href="#edef_sentence" class="eref">s</a> elements is optional. Where text occurs without an enclosing <a href="#edef_paragraph" class="eref">p</a> or <a href="#edef_sentence" class="eref">s</a> element the <a href="#term-processor">synthesis processor</a> should attempt to determine the structure using language-specific knowledge of the format of plain text.</p>
<p>The <a href="#edef_paragraph" class="eref">p</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref">audio</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_mark" class="eref">mark</a>, <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_prosody" class="eref">prosody</a>, <a href="#edef_say-as" class="eref">say-as</a>, <a href="#edef_sub" class="eref">sub</a>, <a href="#edef_sentence" class="eref">s</a>, <a href="#edef_voice" class="eref">voice</a>.</p>
<p>The <a href="#edef_sentence" class="eref">s</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref">audio</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_mark" class="eref">mark</a>, <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_prosody" class="eref">prosody</a>, <a href="#edef_say-as" class="eref">say-as</a>, <a href="#edef_sub" class="eref">sub</a>, <a href="#edef_voice" class="eref">voice</a>.</p>
<h3><a id="S3.1.8" name="S3.1.8">3.1.8</a> <a id="edef_say-as" name="edef_say-as" class="edef">say-as</a> Element</h3>
<p>The <a href="#edef_say-as" class="eref">say-as</a> element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.</p>
<p>Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the <a href="#edef_say-as" class="eref">say-as</a> element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.</p>
<p>The <a href="#edef_say-as" class="eref">say-as</a> element has three attributes: <code class="att">interpret-as</code>, <code class="att">format</code>, and <code class="att">detail</code>. The <code class="att">interpret-as</code> attribute is always required; the other two attributes are optional. The legal values for the <code class="att">format</code> attribute depend on the value of the <code class="att">interpret-as</code> attribute.</p>
<p>The <a href="#edef_say-as" class="eref">say-as</a> element can only contain text to be rendered.</p>
<h4 id="g972">The <code class="att">interpret-as</code> and <code class="att">format</code> attributes</h4>
<p>The <code class="att">interpret-as</code> attribute indicates the content type of the contained text construct. Specifying the content type helps the <a href="#term-processor">synthesis processor</a> to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional <code class="att">format</code> attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.</p>
<p>When specified, the <code class="att">interpret-as</code> and <code class="att">format</code> values are to be interpreted by the <a href="#term-processor">synthesis processor</a> as hints provided by the markup document author to aid <a href="#text_normalization">text normalization</a> and pronunciation.</p>
<p>In all cases, the text enclosed by any <a href="#edef_say-as" class="eref">say-as</a> element is intended to be a standard, orthographic form of the language currently in context. A <a href="#term-processor">synthesis processor</a> should be able to support the common, orthographic forms of the specified language for every content type that it supports.</p>
<p>When the value for the <code class="att">interpret-as</code> attribute is unknown or unsupported by a processor, it must render the contained text as if no <code class="att">interpret-as</code> value were specified.</p>
<p>When the value for the <code class="att">format</code> attribute is unknown or unsupported by a processor, it must render the contained text as if no <code class="att">format</code> value were specified, and should render it using the <code class="att">interpret-as</code> value that is specified.</p>
<p>When the content of the <a href="#edef_say-as" class="eref">say-as</a> element contains additional text next to the content that is in the indicated <code class="att">format</code> and <code class="att">interpret-as</code> type, then this additional text must be rendered. The processor may make the rendering of the additional text dependent on the <code class="att">interpret-as</code> type of the element in which it appears.<br />
When the content of the <a href="#edef_say-as" class="eref">say-as</a> element contains no content in the indicated <code class="att">interpret-as</code> type or <code class="att">format</code>, the processor must render the content either as if the <code class="att">format</code> attribute were not present, or as if the <code class="att">interpret-as</code> attribute were not present, or as if neither the <code class="att">format</code> nor <code class="att">interpret-as</code> attributes were present. The processor should also notify the environment of the mismatch.</p>
<p>Indicating the content type or format does not necessarily affect the way the information is pronounced. A <a href="#term-processor">synthesis processor</a> should pronounce the contained text in a manner in which such content is normally produced for the language.</p>
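<p>Informative: the fallback rules above can be sketched as follows. This is a non-normative Python illustration; the "date" renderer and its <code>mdy</code> format are hypothetical, since this specification does not enumerate <code class="att">interpret-as</code> or <code class="att">format</code> values:</p>

```python
def render_say_as(text, interpret_as, fmt, renderers):
    """Fall back gracefully: an unknown format drops to the
    interpret-as value alone; an unknown interpret-as value drops
    to rendering the text as if no say-as markup were present."""
    if interpret_as in renderers:
        handler = renderers[interpret_as]
        if fmt is not None and fmt in handler:
            return handler[fmt](text)
        # Unknown/unsupported format: use interpret-as alone
        return handler[None](text)
    # Unknown/unsupported interpret-as: render as plain text
    return text

# Hypothetical "date" renderer supporting an "mdy" format
renderers = {"date": {
    "mdy": lambda t: "date(mdy): " + t,
    None:  lambda t: "date: " + t,
}}
render_say_as("2/1/2000", "date", "dmy", renderers)
# "dmy" is unsupported here, so this falls back to "date: 2/1/2000"
```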
<h4 id="g1000">The <code class="att">detail</code> attribute</h4>
<p>The <code class="att">detail</code> attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the <code class="att">detail</code> attribute must render all of the informational content in the contained text; however, specific values for the <code class="att">detail</code> attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a <a href="#term-processor">synthesis processor</a> will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.</p>
<p>The <code class="att">detail</code> attribute can be used for all <code class="att">interpret-as</code> types.</p>
<p>If the <code class="att">detail</code> attribute is not specified, the level of detail that is produced by the <a href="#term-processor">synthesis processor</a> depends on the text content and the language.</p>
<p>When the value for the <code class="att">detail</code> attribute is unknown or unsupported by a processor, it must render the contained text as if no value were specified for the <code class="att">detail</code> attribute.</p>
<h3 id="g9"><a id="S3.1.9" name="S3.1.9">3.1.9</a> <a name="edef_phoneme" id="edef_phoneme" class="edef">phoneme</a> Element</h3>
<p>The <a href="#edef_phoneme" class="eref">phoneme</a> element provides a phonemic/phonetic pronunciation for the contained text. The <a href="#edef_phoneme" class="eref">phoneme</a> element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.</p>
<p>The <code class="att">ph</code> attribute is a required attribute that specifies the phoneme/phone string.</p>
<p>This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo <a href="#text_normalization">text normalization</a> and is not treated as a token for lookup in the lexicon (see <a href="#S3.1.4">Section 3.1.4</a>), while values in <a href="#edef_say-as" class="eref">say-as</a> and <a href="#edef_sub" class="eref">sub</a> may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.</p>
<p>The <code class="att">alphabet</code> attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "<strong>ipa</strong>" (see the next paragraph) and vendor-defined strings of the form "<strong>x-organization</strong>" or "<strong>x-organization-alphabet</strong>". For example, the Japan Electronics and Information Technology Industries Association [<a href="#ref-jeita">JEITA</a>] might wish to encourage the use of an alphabet such as "x-JEITA" or "x-JEITA-2000" for their phoneme alphabet [<a href="#ref-jeidaalphabet">JEIDAALPHABET</a>].</p>
<p><a href="#term-processor">Synthesis processors</a> should support a value for <code class="att">alphabet</code> of "<strong>ipa</strong>", corresponding to Unicode representations of the phonetic characters developed by the International Phonetic Association [<a href="#ref-ipa">IPA</a>]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal <code class="att">ph</code> values are strings of the values specified in Appendix 2 of [<a href="#ref-ipahndbk">IPAHNDBK</a>]. Informative tables of the IPA-to-Unicode mappings can be found at [<a href="#ref-ipaunicode1">IPAUNICODE1</a>] and [<a href="#ref-ipaunicode2">IPAUNICODE2</a>]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet,</p>
<ul>
<li>The processor must syntactically accept all legal <code class="att">ph</code> values.</li>
<li>The processor should produce output when given Unicode IPA codes that can reasonably be considered to belong to the current language.</li>
<li>The production of output when given other codes is entirely at processor discretion.</li>
</ul>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
<!-- This is an example of IPA using character entities -->
<!-- Because many platform/browser/text editor combinations do not
correctly cut and paste Unicode text, this example uses the entity
escape versions of the IPA characters. Normally, one would directly
use the UTF-8 representation of these symbols: "təmei̥ɾou̥". -->
</speak>
</pre>
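<p>Informative: the character entities in the <code class="att">ph</code> value above decode to the Unicode IPA string given in the comment. This can be verified with, for example, Python's standard library:</p>

```python
import html

# The ph attribute value exactly as written in the example above
ph = "t&#x259;mei&#x325;&#x27E;ou&#x325;"
decoded = html.unescape(ph)
print(decoded)  # təmei̥ɾou̥

# The individual code points: schwa (U+0259), combining ring below
# (U+0325), and the alveolar tap (U+027E)
print([hex(ord(c)) for c in decoded])
```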
<p>It is an <a href="#term-error">error</a> if a value for <code class="att">alphabet</code> is specified that is not known or cannot be applied by a <a href="#term-processor">synthesis processor</a>. The default behavior when the <code class="att">alphabet</code> attribute is left unspecified is processor-specific.</p>
<p>The <a href="#edef_phoneme" class="eref">phoneme</a> element itself can only contain text (no elements).</p>
<h3 id="g11"><a id="S3.1.10" name="S3.1.10">3.1.10</a> <a name="edef_sub" id="edef_sub" class="edef">sub</a> Element</h3>
<p>The <a href="#edef_sub" class="eref">sub</a> element is employed to indicate that the text in the <code class="att">alias</code> attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required <code class="att">alias</code> attribute specifies the string to be spoken instead of the enclosed string. The processor should apply <a href="#text_normalization">text normalization</a> to the <code class="att">alias</code> value.</p>
<p>The <a href="#edef_sub" class="eref">sub</a> element can only contain text (no elements).</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<sub alias="World Wide Web Consortium">W3C</sub>
<!-- World Wide Web Consortium -->
</speak>
</pre>
<h2 id="g12"><a id="S3.2" name="S3.2">3.2</a> Prosody and Style</h2>
<h3 id="g13"><a id="S3.2.1" name="S3.2.1">3.2.1</a> <a name="edef_voice" id="edef_voice" class="edef">voice</a> Element</h3>
<p>The <a href="#edef_voice" class="eref">voice</a> element is a production element that requests a change in speaking voice. Attributes are:</p>
<ul>
<li>
<p><a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a>: optional language specification attribute.</p>
</li>
<li>
<p><code class="att">gender</code>: optional attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: <b>"male"</b>, <b>"female"</b>, <b>"neutral"</b>.</p>
</li>
<li>
<p><code class="att">age</code>: optional attribute indicating the preferred age in years (since birth) of the voice to speak the contained text. Acceptable values are of type <a href="http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#nonNegativeInteger"><strong>xsd:nonNegativeInteger</strong></a> [<a href="#ref-schema2">SCHEMA2</a> §3.3.20].</p>
</li>
<li>
<p><code class="att">variant</code>: optional attribute indicating a preferred variant of the other voice characteristics to speak the contained text (e.g. the second male child voice). Valid values of <code>variant</code> are of type <a href="http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#positiveInteger"><strong>xsd:positiveInteger</strong></a> [<a href="#ref-schema2">SCHEMA2</a> §3.3.25].</p>
</li>
<li>
<p><code class="att">name</code>: optional attribute indicating a processor-specific voice name to speak the contained text. The value may be a space-separated list of names ordered from top preference down. As a result, a name must not contain any white space.</p>
</li>
</ul>
<p>Although each attribute individually is optional, it is an <a href="#term-error">error</a> if no attributes are specified when the <a href="#edef_voice" class="eref">voice</a> element is used.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<voice gender="female">Mary had a little lamb,</voice>
<!-- now request a different female child's voice -->
<voice gender="female" variant="2">
Its fleece was white as snow.
</voice>
<!-- processor-specific voice selection -->
<voice name="Mike">I want to be like Mike.</voice>
</speak>
</pre>
<p>The <a href="#edef_voice" class="eref">voice</a> element is commonly used to change the language. When no voice is available that exactly matches the attributes specified in the document, or when multiple voices match the criteria, the following voice selection algorithm must be used. There are cases in the algorithm that are ambiguous; in such cases voice selection may be processor-specific. Roughly speaking, the <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> attribute has the highest priority and all other attributes are equal in priority but below <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a>. The complete algorithm is:</p>
<ol>
<li>If a voice is available for a requested <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a>, a <a href="#term-processor">synthesis processor</a> must use it. If there are multiple such voices available, the processor should use the voice that best matches the specified values for <code class="att">name</code>, <code class="att">variant</code>, <code class="att">gender</code> and <code class="att">age</code>.</li>
<li>If there is no voice available for the requested <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a>, the processor should select a voice that is closest to the requested language (e.g. a variant or dialect of the same language). If there are multiple such voices available, the processor should use a voice that best matches the specified values for <code class="att">name</code>, <code class="att">variant</code>, <code class="att">gender</code> and <code class="att">age</code>.</li>
<li>It is an <a href="#term-error">error</a> if the processor decides it does not have a voice that sufficiently matches the above criteria.</li>
</ol>
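<p>The priority ordering above can be sketched in Python (an illustrative, non-normative sketch: the voice-dictionary keys are hypothetical, and tie-breaking among equally good matches is processor-specific):</p>

```python
def select_voice(voices, lang, name=None, variant=None, gender=None, age=None):
    """Sketch of the selection algorithm: xml:lang dominates, and the
    remaining attributes (name, variant, gender, age) are weighed
    equally below it. `voices` is a list of dicts with hypothetical keys."""
    # Step 1: voices that exactly match the requested xml:lang.
    exact = [v for v in voices if v.get("lang") == lang]
    # Step 2: otherwise fall back to the closest language, approximated
    # here as a matching primary subtag ("en-GB" for a requested "en-US").
    candidates = exact or [
        v for v in voices
        if v.get("lang", "").split("-")[0] == lang.split("-")[0]
    ]
    # Step 3: it is an error if nothing sufficiently matches.
    if not candidates:
        raise LookupError("no voice sufficiently matches the request")
    requested = {"name": name, "variant": variant, "gender": gender, "age": age}
    def score(v):
        return sum(1 for k, want in requested.items()
                   if want is not None and v.get(k) == want)
    return max(candidates, key=score)
```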
<p>Note that simple cases of foreign-text embedding (where a voice change is not needed or is undesirable) are possible. See <a href="#AppF">Appendix F</a> for examples.</p>
<p><a href="#edef_voice" class="eref">voice</a> attributes are inherited down the tree including to within elements that change the language.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<voice gender="female">
Any female voice here.
<voice age="6">
A female child voice here.
<p xml:lang="ja">
<!-- A female child voice in Japanese. -->
</p>
</voice>
</voice>
</speak>
</pre>
<p>Relative changes in prosodic parameters should be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.</p>
<p>The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.</p>
<p>The <a href="#edef_voice" class="eref">voice</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref">audio</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_mark" class="eref">mark</a>, <a href="#edef_paragraph" class="eref">p</a>, <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_prosody" class="eref">prosody</a>, <a href="#edef_say-as" class="eref">say-as</a>, <a href="#edef_sub" class="eref">sub</a>, <a href="#edef_sentence" class="eref">s</a>, <a href="#edef_voice" class="eref">voice</a>.</p>
<h3 id="g15"><a id="S3.2.2" name="S3.2.2">3.2.2</a> <a name="edef_emphasis" id="edef_emphasis" class="edef">emphasis</a> Element</h3>
<p>The <a href="#edef_emphasis" class="eref">emphasis</a> element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The <a href="#term-processor">synthesis processor</a> determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:</p>
<ul>
<li>
<p><code class="att">level</code>: the optional <code class="att">level</code> attribute indicates the strength of emphasis to be applied. Defined values are <strong>"strong"</strong>, <strong>"moderate"</strong>, <strong>"none"</strong> and <strong>"reduced"</strong>. The default <code class="att">level</code> is <strong>"moderate"</strong>. The meaning of <strong>"strong"</strong> and <strong>"moderate"</strong> emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The <strong>"reduced"</strong> <code class="att">level</code> is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The <strong>"none"</strong> <code class="att">level</code> is used to prevent the <a href="#term-processor">synthesis processor</a> from emphasizing words that it might typically emphasize. The values <strong>"none"</strong>, <strong>"moderate"</strong>, and <strong>"strong"</strong> are monotonically non-decreasing in strength.</p>
</li>
</ul>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
That is a <emphasis> big </emphasis> car!
That is a <emphasis level="strong"> huge </emphasis>
bank account!
</speak>
</pre>
<p>The <a href="#edef_emphasis" class="eref">emphasis</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref">audio</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_mark" class="eref">mark</a>, <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_prosody" class="eref">prosody</a>, <a href="#edef_say-as" class="eref">say-as</a>, <a href="#edef_sub" class="eref">sub</a>, <a href="#edef_voice" class="eref">voice</a>.</p>
<h3 id="g16"><a id="S3.2.3" name="S3.2.3">3.2.3</a> <a name="edef_break" id="edef_break" class="edef">break</a> Element</h3>
<p>The <a href="#edef_break" class="eref">break</a> element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the <a href="#edef_break" class="eref">break</a> element between any pair of words is optional. If the element is not present between words, the <a href="#term-processor">synthesis processor</a> is expected to automatically determine a break based on the linguistic context. In practice, the <a href="#edef_break" class="eref">break</a> element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:</p>
<ul>
<li>
<p><code class="att">strength</code>: the <code class="att">strength</code> attribute is an optional attribute having one of the following values: <strong>"none"</strong>, <strong>"x-weak"</strong>, <strong>"weak"</strong>, <strong>"medium"</strong> (default value), <strong>"strong"</strong>, or <strong>"x-strong"</strong>. This attribute is used to indicate the strength of the prosodic break in the speech output. The value <strong>"none"</strong> indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break which the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between words. The stronger boundaries are typically accompanied by pauses. "<strong>x-weak</strong>" and "<strong>x-strong</strong>" are mnemonics for "extra weak" and "extra strong", respectively.</p>
</li>
<li>
<p><code class="att">time</code>: the <code class="att">time</code> attribute is an optional attribute indicating the duration of a pause to be inserted in the output in seconds or milliseconds. It follows the time value format from the Cascading Style Sheets Level 2 Recommendation [<a href="#ref-css2">CSS2</a>], e.g. "250ms", "3s".</p>
</li>
</ul>
<p>The <code class="att">strength</code> attribute is used to indicate the prosodic strength of the break. For example, the breaks between paragraphs are typically much stronger than the breaks between words within a sentence. The <a href="#term-processor">synthesis processor</a> may insert a pause as part of its implementation of the prosodic break. A pause of a specific length can also be inserted by using the <code class="att">time</code> attribute.</p>
<p>If a <a class="eref" href="#edef_break">break</a> element is used with neither <code class="att">strength</code> nor <code class="att">time</code> attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no <a class="eref" href="#edef_break">break</a> element was supplied.</p>
<p>If both <code class="att">strength</code> and <code class="att">time</code> attributes are supplied, the processor will insert a break with a duration as specified by the <code class="att">time</code> attribute, with other prosodic changes in the output based on the value of the <code class="att">strength</code> attribute.</p>
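<p>The CSS2 time value format used by the <code class="att">time</code> attribute can be validated and normalized with a short helper; a minimal, non-normative sketch in Python (the function name is illustrative):</p>

```python
import re

def css2_time_to_ms(value):
    """Convert a CSS2-style time value such as "250ms" or "3s" to
    milliseconds. A time value is a number immediately followed by
    the unit "ms" or "s"; anything else is rejected."""
    m = re.fullmatch(r"(\d+\.?\d*|\.\d+)(ms|s)", value.strip())
    if not m:
        raise ValueError(f"not a CSS2 time value: {value!r}")
    number, unit = float(m.group(1)), m.group(2)
    return number * 1000.0 if unit == "s" else number
```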
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
Take a deep breath <break/>
then continue.
Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you! <break strength="weak"/> Please repeat.
</speak>
</pre>
<h3 id="g18"><a id="S3.2.4" name="S3.2.4">3.2.4</a> <a name="edef_prosody" id="edef_prosody" class="edef">prosody</a> Element</h3>
<p>The <a href="#edef_prosody" class="eref">prosody</a> element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all optional, are:</p>
<ul>
<li>
<p><code class="att">pitch</code>: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a <a href="#number_values">number</a> followed by "Hz", a <a href="#relative_values">relative change</a> or <strong>"x-low"</strong>, <strong>"low"</strong>, <strong>"medium"</strong>, <strong>"high"</strong>, <strong>"x-high"</strong>, or <strong>"default"</strong>. Labels <strong>"x-low"</strong> through <strong>"x-high"</strong> represent a sequence of monotonically non-decreasing pitch levels.</p>
</li>
<li>
<p><code class="att">contour</code>: sets the actual pitch contour for the contained text. The format is specified in <a href="#pitch_contour">Pitch contour</a> below.</p>
</li>
<li>
<p><code class="att">range</code>: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a <a href="#number_values">number</a> followed by "Hz", a <a href="#relative_values">relative change</a> or <strong>"x-low"</strong>, <strong>"low"</strong>, <strong>"medium"</strong>, <strong>"high"</strong>, <strong>"x-high"</strong>, or <strong>"default"</strong>. Labels <strong>"x-low"</strong> through <strong>"x-high"</strong> represent a sequence of monotonically non-decreasing pitch ranges.</p>
</li>
<li>
<p><code class="att">rate</code>: a change in the speaking rate for the contained text. Legal values are: a <a href="#relative_values">relative change</a> or <strong>"x-slow"</strong>, <strong>"slow"</strong>, <strong>"medium"</strong>, <strong>"fast"</strong>, <strong>"x-fast"</strong>, or <strong>"default"</strong>. Labels <strong>"x-slow"</strong> through <strong>"x-fast"</strong> represent a sequence of monotonically non-decreasing speaking rates. When a <a href="#number_values">number</a> is used to specify a <a href="#relative_values">relative change</a> it acts as a multiplier of the default rate. For example, a value of 1 means no change in speaking rate, a value of 2 means a speaking rate twice the default rate, and a value of 0.5 means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice should be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.</p>
</li>
<li>
<p><code class="att">duration</code>: a value in seconds or milliseconds for the desired time to take to read the element contents. Follows the time value format from the Cascading Style Sheets Level 2 Recommendation [<a href="#ref-css2">CSS2</a>], e.g. "250ms", "3s".</p>
</li>
<li>
<p><code class="att">volume</code>: the volume for the contained text in the range 0.0 to 100.0 (higher values are louder and specifying a value of zero is equivalent to specifying <strong>"silent"</strong>). Legal values are: <a href="#number_values">number</a>, a <a href="#relative_values">relative change</a> or <strong>"silent"</strong>, <strong>"x-soft"</strong>, <strong>"soft"</strong>, <strong>"medium"</strong>, <strong>"loud"</strong>, <strong>"x-loud"</strong>, or <strong>"default"</strong>. The volume scale is linear amplitude. The default is 100.0. Labels <strong>"silent"</strong> through <strong>"x-loud"</strong> represent a sequence of monotonically non-decreasing volume levels.</p>
</li>
</ul>
<p>Although each attribute individually is optional, it is an <a href="#term-error">error</a> if no attributes are specified when the <a href="#edef_prosody" class="eref">prosody</a> element is used. The "<strong>x-<em>foo</em></strong>" attribute value names are intended to be mnemonics for "extra <em>foo</em>". All units ("Hz", "st") are case-sensitive. Note also that customary pitch levels and standard pitch ranges may vary significantly by language, as may the meanings of the labelled values for pitch targets and ranges.</p>
<h4 id="g499"><a name="number_values" id="number_values">Number</a></h4>
<p>A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.</p>
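<p>The four legal formats can be captured in a single pattern; an illustrative (non-normative) sketch:</p>

```python
import re

# "n", "n.", ".n" and "n.n", where "n" is one or more digits.
# No sign, no exponent.
NUMBER = re.compile(r"\d+\.?\d*|\.\d+")

def is_number(s):
    """True if s is a legal number as defined above."""
    return NUMBER.fullmatch(s) is not None
```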
<h4 id="g19"><a name="relative_values" id="relative_values">Relative values</a></h4>
<p>Relative changes for the attributes above can be specified</p>
<ul>
<li>as a percentage (a <a href="#number_values">number</a> optionally preceded by "+" or "-" and followed by "%"), e.g. "3%", "+15.2%", "-8.0%", or</li>
<li>as a relative number:
<ul>
<li>For the <code class="att">rate</code> attribute, relative changes are a <a href="#number_values">number</a>.</li>
<li>For the <code class="att">volume</code> attribute, relative changes are a <a href="#number_values">number</a> preceded by "+" or "-", e.g. "+10", "-5.5".</li>
<li>For the <code class="att">pitch</code> and <code class="att">range</code> attributes, relative changes can be given in semitones (a <a href="#number_values">number</a> preceded by "+" or "-" and followed by "st") or in Hertz (a <a href="#number_values">number</a> preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz". A semitone is half of a tone (a half step) on the standard diatonic scale.</li>
</ul>
</li>
</ul>
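<p>The relative-change forms above can be classified mechanically; a non-normative sketch in Python (attribute-specific validation, and absolute forms such as a plain "Hz" pitch value, are deliberately omitted):</p>

```python
import re

def parse_relative(value):
    """Classify a relative change as a percentage, semitone offset,
    Hertz offset, signed offset (the volume form), or bare multiplier
    (the rate form). Returns a (kind, number) pair. Units are
    case-sensitive, matching the spec ("Hz", "st")."""
    m = re.fullmatch(r"([+-]?)(\d+\.?\d*|\.\d+)(%|st|Hz)?", value)
    if not m:
        raise ValueError(f"not a relative change: {value!r}")
    sign, digits, unit = m.groups()
    number = float(sign + digits)
    if unit == "%":
        return ("percent", number)
    if unit in ("st", "Hz"):
        if not sign:
            # Semitone/Hertz relative changes require an explicit + or -.
            raise ValueError(f"missing sign on relative change: {value!r}")
        return ("semitones" if unit == "st" else "hertz", number)
    # No unit: signed bare numbers are volume offsets,
    # unsigned bare numbers are rate multipliers.
    return ("offset" if sign else "multiplier", number)
```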
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
The price of XYZ is <prosody rate="-10%">$45</prosody>
</speak>
</pre>
<h4 id="g20"><a name="pitch_contour" id="pitch_contour">Pitch contour</a></h4>
<p>The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form <code>(time position,target)</code>, the first value is a percentage of the period of the contained text (a <a href="#number_values">number</a> followed by "%") and the second value is the value of the <code class="att">pitch</code> attribute (a <a href="#number_values">number</a> followed by "Hz", a <a href="#relative_values">relative change</a>, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">
good morning
</prosody>
</speak>
</pre>
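<p>Splitting a <code class="att">contour</code> value into its pairs is straightforward; a non-normative sketch (targets are kept as raw strings, since a target may be a Hertz value, a relative change, or a label, and interpolation between targets is processor-specific):</p>

```python
import re

def parse_contour(contour):
    """Split a contour attribute into (percent, target) pairs,
    dropping pairs whose time position falls outside 0%..100%."""
    pairs = []
    for m in re.finditer(r"\(\s*(\d+\.?\d*|\.\d+)%\s*,\s*([^)]+)\)", contour):
        position = float(m.group(1))
        if 0.0 <= position <= 100.0:  # out-of-range positions are ignored
            pairs.append((position, m.group(2).strip()))
    return pairs
```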
<p>The <code class="att">duration</code> attribute takes precedence over the <code class="att">rate</code> attribute. The <code class="att">contour</code> attribute takes precedence over the <code class="att">pitch</code> and <code class="att">range</code> attributes.</p>
<p>The default value of all prosodic attributes is no change. For example, omitting the <code class="att">rate</code> attribute means that the rate is the same within the element as outside.</p>
<p>The <a href="#edef_prosody" class="eref">prosody</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref">audio</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_mark" class="eref">mark</a>, <a href="#edef_paragraph" class="eref">p</a>, <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_prosody" class="eref">prosody</a>, <a href="#edef_say-as" class="eref">say-as</a>, <a href="#edef_sub" class="eref">sub</a>, <a href="#edef_sentence" class="eref">s</a>, <a href="#edef_voice" class="eref">voice</a>.</p>
<h4 id="g22">Limitations</h4>
<p>All prosodic attribute values are indicative. If a <a href="#term-processor">synthesis processor</a> is unable to accurately render a document as specified (e.g., trying to set the pitch to 1 MHz or the speaking rate to 1,000,000 words per minute), it must make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and may inform the host environment when such limits are exceeded.</p>
<p>In some cases, <a href="#term-processor">synthesis processors</a> may elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units may reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.</p>
<h2 id="g23"><a id="S3.3" name="S3.3">3.3</a> Other Elements</h2>
<h3 id="g24"><a id="S3.3.1" name="S3.3.1">3.3.1</a> <a name="edef_audio" id="edef_audio" class="edef">audio</a> Element</h3>
<p>The <a href="#edef_audio" class="eref">audio</a> element supports the insertion of recorded audio files (see <a href="#AppA">Appendix A</a> for required formats) and the insertion of other audio formats in conjunction with synthesized speech output. The <a href="#edef_audio" class="eref">audio</a> element may be empty. If the <a href="#edef_audio" class="eref">audio</a> element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content may include text, speech markup, <a href="#edef_desc" class="eref">desc</a> elements, or other <a href="#edef_audio" class="eref">audio</a> elements. The alternate content may also be used when rendering the document to non-audible output and for accessibility (see the <a href="#edef_desc" class="eref">desc</a> element). The required attribute is <code class="att">src</code>, which is the <a href="#term-uri">URI</a> of a document with an appropriate MIME type.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<!-- Empty element -->
Please say your name after the tone. <audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>
<audio src="welcome.wav">
<emphasis>Welcome</emphasis> to the Voice Portal.
</audio>
</speak>
</pre>
<p>An <a href="#edef_audio" class="eref">audio</a> element is successfully rendered:</p>
<ol>
<li>If the referenced audio source is played, or</li>
<li>If the <a href="#term-processor">synthesis processor</a> is unable to execute #1 but the alternative content is successfully rendered, or</li>
<li>If the processor can detect that text-only output is required and the alternative content is successfully rendered.</li>
</ol>
<p>Deciding which conditions result in the alternative content being rendered is processor-dependent. If the <a href="#edef_audio" class="eref">audio</a> element is not successfully rendered, a <a href="#term-processor">synthesis processor</a> should continue processing and should notify the hosting environment. The processor may determine after beginning playback of an audio source that the audio cannot be played in its entirety. For example, encoding problems, network disruptions, etc. may occur. The processor may designate this either as successful or unsuccessful rendering, but it must document this behavior.</p>
<p>The <a href="#edef_audio" class="eref">audio</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref">audio</a>, <a href="#edef_break" class="eref">break</a>, <a href="#edef_desc" class="eref">desc</a>, <a href="#edef_emphasis" class="eref">emphasis</a>, <a href="#edef_mark" class="eref">mark</a>, <a href="#edef_paragraph" class="eref">p</a>, <a href="#edef_phoneme" class="eref">phoneme</a>, <a href="#edef_prosody" class="eref">prosody</a>, <a href="#edef_say-as" class="eref">say-as</a>, <a href="#edef_sub" class="eref">sub</a>, <a href="#edef_sentence" class="eref">s</a>, <a href="#edef_voice" class="eref">voice</a>.</p>
<h3 id="g26"><a id="S3.3.2" name="S3.3.2">3.3.2</a> <a name="edef_mark" id="edef_mark" class="edef">mark</a> Element</h3>
<p>A <a href="#edef_mark" class="eref">mark</a> element is an empty element that places a marker into the text/tag sequence. It has one required attribute, <code class="att">name</code>, which is of type <code><a href="http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#token">xsd:token</a></code> [<a href="#ref-schema2">SCHEMA2</a> §3.3.2]. The <a href="#edef_mark" class="eref">mark</a> element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a <a href="#edef_mark" class="eref">mark</a> element, a <a href="#term-processor">synthesis processor</a> must do one or both of the following:</p>
<ul>
<li>inform the hosting environment with the value of the <code class="att">name</code> attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.</li>
<li>when audio output of the SSML document reaches the <a href="#edef_mark" class="eref">mark</a>, issue an event that includes the required <code class="att">name</code> attribute of the element. The hosting environment defines the destination of the event.</li>
</ul>
<p>The <a href="#edef_mark" class="eref">mark</a> element does not affect the speech output process.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
Go from <mark name="here"/> here, to <mark name="there"/> there!
</speak>
</pre>
<h3 id="g233"><a id="S3.3.3" name="S3.3.3">3.3.3</a> <a name="edef_desc" id="edef_desc" class="edef">desc</a> Element</h3>
<p>The <a href="#edef_desc" class="eref">desc</a> element can only occur within the content of the <a href="#edef_audio" class="eref">audio</a> element. When the audio source referenced in <a href="#edef_audio" class="eref">audio</a> is not speech, e.g. audio wallpaper or sonicon punctuation, it should contain a <a href="#edef_desc" class="eref">desc</a> element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the <a href="#term-processor">synthesis processor</a>, the content of the <a href="#edef_desc" class="eref">desc</a> element(s) should be rendered instead of other alternative content in <a href="#edef_audio" class="eref">audio</a>. The optional <a href="#adef_xmllang" class="aref">xml:lang</a> attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. Unlike all other uses of <a href="#adef_xmllang" class="aref">xml:lang</a> in this document, the presence or absence of this attribute will have no effect on the output in the normal case of audio (rather than text) output.</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<!-- Normal use of <desc> -->
Heads of State often make mistakes when speaking in a foreign language.
One of the most well-known examples is that of John F. Kennedy:
<audio src="ichbineinberliner.wav">If you could hear it, this would be
a recording of John F. Kennedy speaking in Berlin.
<desc>Kennedy's famous German language gaffe</desc>
</audio>
<!-- Suggesting the language of the recording -->
<!-- Although there is no requirement that a recording be in the current language
(since it might even be non-speech such as music), an author might wish to
suggest the language of the recording by marking the entire <audio> element
using <voice>. In this case, the xml:lang attribute on <desc> can be used
to put the description back into the original language. -->
Here's the same thing again but with a different fallback:
<voice xml:lang="de-DE">
<audio src="ichbineinberliner.wav">Ich bin ein Berliner.
<desc xml:lang="en-US">Kennedy's famous German language gaffe</desc>
</audio>
</voice>
</speak>
</pre>
<p>The <a href="#edef_desc" class="eref">desc</a> element can only contain descriptive text.</p>
<h2 id="g40"><a id="S4" name="S4">4.</a> References</h2>
<h3 id="g41"><a id="S4.1" name="S4.1">4.1</a> Normative References</h3>
<dl>
<dt><a id="ref-css2" name="ref-css2">[CSS2]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/1998/REC-CSS2-19980512/">Cascading Style Sheets, level 2: CSS2 Specification</a></cite>, B. Bos, et al., Editors. World Wide Web Consortium, 12 May 1998. This version of the CSS2 Recommendation is http://www.w3.org/TR/1998/REC-CSS2-19980512/. The <a href="http://www.w3.org/TR/REC-CSS2/">latest version of CSS2</a> is available at http://www.w3.org/TR/REC-CSS2/.</dd>
<dt><a id="ref-ipahndbk" name="ref-ipahndbk">[IPAHNDBK]</a></dt>
<dd><cite><a href="http://www.arts.gla.ac.uk/ipa/handbook.html">Handbook of the International Phonetic Association</a></cite>, International Phonetic Association, Editors. Cambridge University Press, July 1999. Information on the Handbook is available at http://www.arts.gla.ac.uk/ipa/handbook.html.</dd>
<dt><a id="ref-rfc1521" name="ref-rfc1521">[RFC1521]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc1521.txt">MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies</a></cite>, N. Borenstein and N. Freed, Editors. IETF, September 1993. This RFC is available at http://www.ietf.org/rfc/rfc1521.txt.</dd>
<dt><a id="ref-rfc2045" name="ref-rfc2045">[RFC2045]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc2045.txt">Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies.</a></cite>, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2045.txt.</dd>
<dt><a id="ref-rfc2046" name="ref-rfc2046">[RFC2046]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc2046.txt">Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types</a></cite>, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2046.txt.</dd>
<dt><a id="ref-rfc2119" name="ref-rfc2119">[RFC2119]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc2119.txt">Key words for use in RFCs to Indicate Requirement Levels</a></cite>, S. Bradner, Editor. IETF, March 1997. This RFC is available at http://www.ietf.org/rfc/rfc2119.txt.</dd>
<dt><a id="ref-rfc2396" name="ref-rfc2396">[RFC2396]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc2396.txt">Uniform Resource Identifiers (URI): Generic Syntax</a></cite>, T. Berners-Lee et al., Editors. IETF, August 1998. This RFC is available at http://www.ietf.org/rfc/rfc2396.txt.</dd>
<dt><a id="ref-rfc3066" name="ref-rfc3066">[RFC3066]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc3066.txt">Tags for the Identification of Languages</a></cite>, H. Alvestrand, Editor. IETF, January 2001. This RFC is available at http://www.ietf.org/rfc/rfc3066.txt.</dd>
<dt><a id="ref-schema1" name="ref-schema1">[SCHEMA1]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/">XML Schema Part 1: Structures</a></cite>, H. S. Thompson, et al., Editors. World Wide Web Consortium, 2 May 2001. This version of the XML Schema Part 1 Recommendation is http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/. The <a href="http://www.w3.org/TR/xmlschema-1/">latest version of XML Schema 1</a> is available at http://www.w3.org/TR/xmlschema-1/.</dd>
<dt><a id="ref-schema2" name="ref-schema2">[SCHEMA2]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/">XML Schema Part 2: Datatypes</a></cite>, P.V. Biron and A. Malhotra, Editors. World Wide Web Consortium, 2 May 2001. This version of the XML Schema Part 2 Recommendation is http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/. The <a href="http://www.w3.org/TR/xmlschema-2/">latest version of XML Schema 2</a> is available at http://www.w3.org/TR/xmlschema-2/.</dd>
<dt><a id="ref-mimetypes" name="ref-mimetypes">[TYPES]</a></dt>
<dd><cite><a href="http://www.iana.org/assignments/media-types/index.html">MIME Media types</a></cite>, IANA. This continually-updated list of media types registered with IANA is available at http://www.iana.org/assignments/media-types/index.html.</dd>
<dt><a id="ref-xml" name="ref-xml">[XML]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2000/REC-xml-20001006">Extensible Markup Language (XML) 1.0 (Second Edition)</a></cite>, T. Bray et al., Editors. World Wide Web Consortium, 6 October 2000. This version of the XML 1.0 Recommendation is http://www.w3.org/TR/2000/REC-xml-20001006. The <a href="http://www.w3.org/TR/REC-xml">latest version of XML 1.0</a> is available at http://www.w3.org/TR/REC-xml.</dd>
<dt><a id="ref-xml-base" name="ref-xml-base">[XML-BASE]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2001/REC-xmlbase-20010627/">XML Base</a></cite>, J. Marsh, Editor. World Wide Web Consortium, 27 June 2001. This version of the XML Base Recommendation is http://www.w3.org/TR/2001/REC-xmlbase-20010627/. The <a href="http://www.w3.org/TR/xmlbase/">latest version of XML Base</a> is available at http://www.w3.org/TR/xmlbase/.</dd>
<dt><a id="ref-xmlns" name="ref-xmlns">[XMLNS]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/1999/REC-xml-names-19990114/">Namespaces in XML</a></cite>, T. Bray et al., Editors. World Wide Web Consortium, 14 January 1999. This version of the XML Namespaces Recommendation is http://www.w3.org/TR/1999/REC-xml-names-19990114/. The <a href="http://www.w3.org/TR/REC-xml-names/">latest version of XML Namespaces</a> is available at http://www.w3.org/TR/REC-xml-names/.</dd>
</dl>
<h3 id="g43"><a id="S4.2" name="S4.2">4.2</a> Informative References</h3>
<dl>
<dt><a id="ref-dc" name="ref-dc">[DC]</a></dt>
<dd><cite>Dublin Core Metadata Initiative.</cite> See <a href="http://dublincore.org/">http://dublincore.org/</a></dd>
<dt><a id="ref-html" name="ref-html">[HTML]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/1999/REC-html401-19991224/">HTML 4.01 Specification</a></cite>, D. Raggett et al., Editors. World Wide Web Consortium, 24 December 1999. This version of the HTML 4 Recommendation is http://www.w3.org/TR/1999/REC-html401-19991224/. The <a href="http://www.w3.org/TR/html4/">latest version of HTML 4</a> is available at http://www.w3.org/TR/html4/.</dd>
<dt><a id="ref-ipa" name="ref-ipa">[IPA]</a></dt>
<dd><cite><a href="http://www.arts.gla.ac.uk/ipa/ipa.html">International Phonetic Association</a></cite>. See http://www.arts.gla.ac.uk/ipa/ipa.html for the organization's website.</dd>
<dt><a id="ref-ipaunicode1" name="ref-ipaunicode1">[IPAUNICODE1]</a></dt>
<dd><cite><a href="http://web.uvic.ca/ling/resources/ipa/charts/unicode_ipa-chart.htm">The International Phonetic Alphabet</a></cite>, J. Esling. This table of IPA characters in Unicode is available at http://web.uvic.ca/ling/resources/ipa/charts/unicode_ipa-chart.htm.</dd>
<dt><a id="ref-ipaunicode2" name="ref-ipaunicode2">[IPAUNICODE2]</a></dt>
<dd><cite><a href="http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm">The International Phonetic Alphabet in Unicode</a></cite>, J. Wells. This table of Unicode values for IPA characters is available at http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.</dd>
<dt><a id="ref-jeidaalphabet" name="ref-jeidaalphabet">[JEIDAALPHABET]</a></dt>
<dd><cite><a href="http://it.jeita.or.jp/document/publica/standard/summary/JEIDA-62-2000.pdf">JEIDA-62-2000 Phoneme Alphabet</a></cite>. JEITA. An abstract of this document (in Japanese) is available at http://it.jeita.or.jp/document/publica/standard/summary/JEIDA-62-2000.pdf.</dd>
<dt><a id="ref-jeita" name="ref-jeita">[JEITA]</a></dt>
<dd><cite><a href="http://www.jeita.or.jp">Japan Electronics and Information Technology Industries Association</a></cite>. See http://www.jeita.or.jp/.</dd>
<dt><a id="ref-jsml" name="ref-jsml">[JSML]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2000/NOTE-jsml-20000605/">JSpeech Markup Language</a></cite>, A. Hunt, Editor. World Wide Web Consortium, 5 June 2000. Copyright ©2000 Sun Microsystems, Inc. This version of the JSML submission is http://www.w3.org/TR/2000/NOTE-jsml-20000605/. The <a href="http://www.w3.org/TR/jsml/">latest W3C Note of JSML</a> is available at http://www.w3.org/TR/jsml/.</dd>
<dt><a id="ref-lex" name="ref-lex">[LEX]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2001/WD-lexicon-reqs-20010312/">Pronunciation Lexicon Markup Requirements</a></cite>, F. Scahill, Editor. World Wide Web Consortium, 12 March 2001. This document is a work in progress. This version of the Lexicon Requirements is http://www.w3.org/TR/2001/WD-lexicon-reqs-20010312/. The <a href="http://www.w3.org/TR/lexicon-reqs/">latest version of the Lexicon Requirements</a> is available at http://www.w3.org/TR/lexicon-reqs/.</dd>
<dt><a id="ref-rdf" name="ref-rdf">[RDF]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2004/REC-rdf-primer-20040210/">RDF Primer</a></cite>, F. Manola and E. Miller, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Primer Recommendation is http://www.w3.org/TR/2004/REC-rdf-primer-20040210/. The <a href="http://www.w3.org/TR/rdf-primer/">latest version of the RDF Primer</a> is available at http://www.w3.org/TR/rdf-primer/.</dd>
<dt><a id="ref-rdf-xml" name="ref-rdf-xml">[RDF-XMLSYNTAX]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/">RDF/XML Syntax Specification</a></cite>, D. Beckett, Editor. World Wide Web Consortium, 10 February 2004. This version of the RDF/XML Syntax Recommendation is http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/. The <a href="http://www.w3.org/TR/rdf-syntax-grammar">latest version of the RDF XML Syntax</a> is available at http://www.w3.org/TR/rdf-syntax-grammar/.</dd>
<dt><a id="ref-rdf-schema" name="ref-rdf-schema">[RDF-SCHEMA]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2004/REC-rdf-schema-20040210/">RDF Vocabulary Description Language 1.0: RDF Schema</a></cite>, D. Brickley and R. Guha, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Schema Recommendation is http://www.w3.org/TR/2004/REC-rdf-schema-20040210/. The <a href="http://www.w3.org/TR/rdf-schema/">latest version of RDF Schema</a> is available at http://www.w3.org/TR/rdf-schema/.</dd>
<dt><a id="ref-reqs" name="ref-reqs">[REQS]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/1999/WD-voice-tts-reqs-19991223/">Speech Synthesis Markup Requirements for Voice Markup Languages</a></cite>, A. Hunt, Editor. World Wide Web Consortium, 23 December 1999. This document is a work in progress. This version of the Synthesis Requirements is http://www.w3.org/TR/1999/WD-voice-tts-reqs-19991223/. The <a href="http://www.w3.org/TR/voice-tts-reqs/">latest version of the Synthesis Requirements</a> is available at http://www.w3.org/TR/voice-tts-reqs/.</dd>
<dt><a id="ref-rfc2616" name="ref-rfc2616">[RFC2616]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc2616.txt">Hypertext Transfer Protocol -- HTTP/1.1</a></cite>, R. Fielding, et al., Editors. IETF, June 1999. This RFC is available at http://www.ietf.org/rfc/rfc2616.txt.</dd>
<dt><a id="ref-rfc2732" name="ref-rfc2732">[RFC2732]</a></dt>
<dd><cite><a href="http://www.ietf.org/rfc/rfc2732.txt">Format for Literal IPv6 Addresses in URL's</a></cite>, R. Hinden, et al., Editors. IETF, December 1999. This RFC is available at http://www.ietf.org/rfc/rfc2732.txt.</dd>
<dt><a id="ref-sable" name="ref-sable">[SABLE]</a></dt>
<dd>"SABLE: A Standard for TTS Markup", Richard Sproat, et al. <cite>Proceedings of the International Conference on Spoken Language Processing</cite>, R. Mannell and J. Robert-Ribes, Editors. <a href="http://www.causalproductions.com/">Causal Productions Pty Ltd</a> (Adelaide), 1998. Vol. 5, pp. 1719-1722. Conference proceedings are available from the publisher at http://www.causalproductions.com/.</dd>
<dt><a id="ref-smil" name="ref-smil">[SMIL]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2001/REC-smil20-20010807/">Synchronized Multimedia Integration Language (SMIL 2.0)</a></cite>, J. Ayars, et al., Editors. World Wide Web Consortium, 7 August 2001. This version of the SMIL 2 Recommendation is http://www.w3.org/TR/2001/REC-smil20-20010807/. The <a href="http://www.w3.org/TR/smil20/">latest version of SMIL2</a> is available at http://www.w3.org/TR/smil20/.</dd>
<dt><a id="ref-unicode" name="ref-unicode">[UNICODE]</a></dt>
<dd><cite><a href="http://www.unicode.org/standard/standard.html">The Unicode Standard</a></cite>. The Unicode Consortium. Information about the Unicode Standard and its versions can be found at http://www.unicode.org/standard/standard.html.</dd>
<dt><a id="ref-vxml" name="ref-vxml">[VXML]</a></dt>
<dd><cite><a href="http://www.w3.org/TR/2004/REC-voicexml20-20040316/">Voice Extensible Markup Language (VoiceXML) Version 2.0</a></cite>, S. McGlashan, et al., Editors. World Wide Web Consortium, 16 March 2004. This version of the VoiceXML 2.0 Recommendation is http://www.w3.org/TR/2004/REC-voicexml20-20040316/. The <a href="http://www.w3.org/TR/voicexml20/">latest version of VoiceXML 2</a> is available at http://www.w3.org/TR/voicexml20/.</dd>
</dl>
<h2 id="g44"><a id="S5" name="S5">5.</a> Acknowledgments</h2>
<p>This document was written with the participation of the following members of the W3C Voice Browser Working Group <em>(listed in alphabetical order)</em>:</p>
<dl>
<dd>Paolo Baggia, Loquendo<br />
Dave Burke, Voxpilot<br />
Dan Burnett, Nuance<br />
Jerry Carter, Independent Consultant<br />
Sasha Caskey, IBM<br />
Brian Eberman, ScanSoft<br />
Andrew Hunt, ScanSoft<br />
Jim Larson, Intel<br />
Bruce Lucas, IBM<br />
Scott McGlashan, HP<br />
Dave Raggett, W3C/Canon<br />
T.V. Raman, IBM<br />
Laura Ricotti, Loquendo<br />
Richard Sproat, ATT<br />
Luc Van Tichelen, ScanSoft<br />
Mark Walker, Intel<br />
Kuansan Wang, Microsoft<br />
Dave Wood, Microsoft<br />
</dd>
</dl>
<h2 id="g49"><a id="AppA" name="AppA">Appendix A</a>: Audio File Formats</h2>
<p><b>This appendix is normative.</b></p>
<p>SSML requires that a platform support the playing of the audio formats specified below.</p>
<table cellspacing="0" cellpadding="5" width="80%" border="1" summary="This table lists the audio formats, with associated media types, that synthesis processors are required to support.">
<caption>Required audio formats</caption>
<tr>
<th scope="col">Audio Format</th>
<th scope="col">Media Type</th>
</tr>
<tr>
<td>Raw (headerless) 8kHz 8-bit mono mu-law (PCM) single channel. (G.711)</td>
<td>audio/basic (from [<a href="#ref-rfc1521">RFC1521</a>])</td>
</tr>
<tr>
<td>Raw (headerless) 8kHz 8-bit mono A-law (PCM) single channel. (G.711)</td>
<td>audio/x-alaw-basic</td>
</tr>
<tr>
<td>WAV (RIFF header) 8kHz 8-bit mono mu-law (PCM) single channel.</td>
<td>audio/x-wav</td>
</tr>
<tr>
<td>WAV (RIFF header) 8kHz 8-bit mono A-law (PCM) single channel.</td>
<td>audio/x-wav</td>
</tr>
</table>
<p>The 'audio/basic' MIME type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this MIME type is specified for playing, the mu-law format must be used. For playback with the 'audio/basic' MIME type, processors must support the mu-law format and may support the 'au' format.</p>
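<p>For illustration, an SSML document references a recording in one of these formats via the audio element; in the following sketch the URI and file are hypothetical, and the processor infers the format from the media type returned for the resource:</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Please hold while we connect you.
  <!-- hypothetical URI; the referenced file would be, e.g.,
       a WAV (RIFF header) 8kHz 8-bit mono mu-law recording -->
  <audio src="http://www.example.com/hold-music.wav"/>
</speak>
</pre>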
<h2 id="g51"><a id="AppB" name="AppB">Appendix B</a>: Internationalization</h2>
<p><b>This appendix is normative.</b></p>
<p>SSML is an application of XML 1.0 [<a href="#ref-xml">XML</a>] and thus supports [<a href="#ref-unicode">UNICODE</a>], which defines a standard universal character set.</p>
<p>SSML provides a mechanism for control of the spoken language via the use of the <a href="#adef_xmllang" class="aref"><code class="att">xml:lang</code></a> attribute. Language changes can occur as frequently as per word, although excessive language changes can diminish the output audio quality. SSML also permits finer control over output pronunciations via the <a href="#edef_lexicon" class="eref">lexicon</a> and <a href="#edef_phoneme" class="eref">phoneme</a> elements, features that can help to mitigate poor quality default lexicons for languages with only minimal commercial support today.</p>
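<p>As a minimal sketch of such a per-word language change (the sentence text is illustrative), the <code class="att">xml:lang</code> attribute on the <a href="#edef_voice" class="eref">voice</a> element switches the spoken language for a single word:</p>
<pre class="example">
<?xml version="1.0"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <s>The French word for cat is
     <voice xml:lang="fr">chat</voice>.</s>
</speak>
</pre>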
<h2 id="g50"><a id="AppC" name="AppC">Appendix C</a>: MIME Types and File Suffix</h2>
<p><b>This appendix is normative.</b></p>
<p>The W3C Voice Browser Working Group has applied to the IETF to register a MIME type for the Speech Synthesis Markup Language. The current proposal is to use "application/ssml+xml".</p>
<p>The W3C Voice Browser Working Group has adopted the convention of using the ".ssml" filename suffix for Speech Synthesis Markup Language documents where <a href="#edef_speak" class="eref">speak</a> is the root element.</p>
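<p>For example, a server administrator might associate the proposed media type with the ".ssml" suffix; a hypothetical sketch for an Apache-style configuration:</p>
<pre class="example">
# hypothetical configuration: serve ".ssml" documents
# with the proposed SSML media type
AddType application/ssml+xml .ssml
</pre>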
<h2 id="g48"><a id="AppD" name="AppD">Appendix D</a>: Schema for the Speech Synthesis Markup Language</h2>
<p><b>This appendix is normative.</b></p>
<p>The synthesis schema is located at <a href="http://www.w3.org/TR/speech-synthesis/synthesis.xsd">http://www.w3.org/TR/speech-synthesis/synthesis.xsd</a>.</p>
<p>Note: the synthesis schema includes a no-namespace core schema, located at <a href="http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd">http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd</a>, which may be used as a basis for specifying Speech Synthesis Markup Language Fragments (<a href="#S2.2.1">Sec. 2.2.1</a>) embedded in non-synthesis namespace schemas.</p>
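<p>A host-language schema could reuse the no-namespace core in "chameleon" fashion: when a schema with a target namespace includes a no-namespace schema document, the included components adopt the including schema's target namespace. The following sketch is illustrative only; the target namespace URI is hypothetical:</p>
<pre class="example">
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://www.example.com/host-language"
            xmlns="http://www.example.com/host-language"
            elementFormDefault="qualified">
  <!-- chameleon include: the no-namespace SSML core components
       take on this schema's target namespace -->
  <xsd:include
    schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis-core.xsd"/>
</xsd:schema>
</pre>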
<h2 id="g47"><a id="AppE" name="AppE">Appendix E</a>: DTD for the Speech Synthesis Markup Language</h2>
<p><b>This appendix is informative.</b></p>
<p>The SSML DTD is located at <a href="http://www.w3.org/TR/speech-synthesis/synthesis.dtd">http://www.w3.org/TR/speech-synthesis/synthesis.dtd</a>.</p>
<p>Due to DTD limitations, the SSML DTD does not correctly express that the <a href="#edef_metadata" class="eref">metadata</a> element can contain elements from other XML namespaces.</p>
<h2 id="g45"><a id="AppF" name="AppF">Appendix F</a>: Example SSML</h2>
<p><b>This appendix is informative.</b><br />
</p>
<p>The following is an example of reading headers of email messages. The <a href="#edef_paragraph" class="eref">p</a> and <a href="#edef_sentence" class="eref">s</a> elements are used to mark the text structure. The <a href="#edef_break" class="eref">break</a> element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The <a href="#edef_prosody" class="eref">prosody</a> element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.</p>
<pre class="example">
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<s>You have 4 new messages.</s>
<s>The first is from Stephanie Williams and arrived at <break/> 3:45pm.
</s>
<s>
The subject is <prosody rate="-20%">ski trip</prosody>
</s>
</p>
</speak>
</pre>
<p>The following example combines audio files and different spoken voices to provide information on a collection of music.</p>
<pre class="example">
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<p>
<voice gender="male">
<s>Today we preview the latest romantic music from Example.</s>
<s>Hear what the Software Reviews said about Example's newest hit.</s>
</voice>
</p>
<p>
<voice gender="female">
He sings about issues that touch us all.
</voice>
</p>
<p>
<voice gender="male">
Here's a sample. <audio src="http://www.example.com/music.wav"/>
Would you like to buy it?
</voice>
</p>
</speak>
</pre>
<p>It is often the case that an author wishes to include a bit of foreign text (say, a movie title) in an application without having to switch languages (for example via the <a href="#edef_voice" class="eref">voice</a> element). A simple way to do this is shown here. In this example the synthesis processor would render the movie name using the pronunciation rules of the container language ("en-US" in this case), similar to how a reader who doesn't know the foreign language might try to read (and pronounce) it.</p>
<pre class="example">
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
The title of the movie is:
"La vita è bella"
(Life is beautiful),
which is directed by Roberto Benigni.
</speak>
</pre>
<p>With some additional work the output quality can be improved tremendously either by creating a custom pronunciation in an external lexicon (see <a href="#S3.1.4">Section 3.1.4</a>) or via the <a href="#edef_phoneme" class="eref">phoneme</a> element as shown in the next example.</p>
<p>It is worth noting that IPA alphabet support is an optional feature and that phonemes for an external language may be rendered with some approximation (see <a href="#S3.1.4">Section 3.1.4</a> for details). The following example only uses phonemes common to US English.</p>
<pre class="example">
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
The title of the movie is:
<phoneme alphabet="ipa"
ph="&#x2C8;l&#x251; &#x2C8;vi&#x2D0;&#x27E;&#x259; &#x2C8;&#x294;e&#x26A; &#x2C8;b&#x25B;l&#x259;">
La vita è bella </phoneme>
<!-- The IPA pronunciation is <span class="ipa">ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</span> -->
(Life is beautiful),
which is directed by
<phoneme alphabet="ipa"
ph="&#x279;&#x259;&#x2C8;b&#x25B;&#x2D0;&#x279;&#x27E;o&#x28A; b&#x25B;&#x2C8;ni&#x2D0;nji">
Roberto Benigni </phoneme>
<!-- The IPA pronunciation is <span class="ipa">ɹəˈbɛːɹɾoʊ bɛˈniːnji</span> -->
<!-- Note that in actual practice an author might change the
encoding to UTF-8 and directly use the Unicode characters in
the document rather than using the escapes as shown.
The escaped values are shown for ease of copying. -->
</speak>
</pre>
<h4 id="g46">SMIL Integration Example</h4>
<p>The SMIL language [<a href="#ref-smil">SMIL</a>] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.</p>
<p>File <b>'greetings.ssml'</b> contains the following:</p>
<pre class="example">
<?xml version="1.0"?>
<!DOCTYPE speak PUBLIC "-//W3C//DTD SYNTHESIS 1.0//EN"
"http://www.w3.org/TR/speech-synthesis/synthesis.dtd">
<speak version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
xml:lang="en-US">
<s>
<mark name="greetings"/>
<emphasis>Greetings</emphasis> from the <sub alias="World Wide Web Consortium">W3C</sub>!
</s>
</speak>
</pre>
<p><em>SMIL Example 1:</em> W3C logo image appears, and then one second later, the speech sequence is rendered. File <b>'greetings.smil'</b> contains the following:</p>
<pre class="example">
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
<head>
<top-layout width="640" height="320">
<region id="whole" width="640" height="320"/>
</top-layout>
</head>
<body>
<par>
<img src="http://www.w3.org/Icons/w3c_home" region="whole" begin="0s"/>
<ref src="greetings.ssml" begin="1s"/>
</par>
</body>
</smil>
</pre>
<p><em>SMIL Example 2:</em> W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File <b>'greetings.smil'</b> contains the following:</p>
<pre class="example">
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
<head>
<top-layout width="640" height="320">
<region id="whole" width="640" height="320"/>
</top-layout>
</head>
<body>
<seq>
<img id="logo" src="http://www.w3.org/Icons/w3c_home" region="whole" begin="0s" end="logo.activateEvent"/>
<ref src="greetings.ssml"/>
</seq>
</body>
</smil>
</pre>
<h4 id="AppFVoiceXML">VoiceXML Integration Example</h4>
<p>The following is an example of SSML in VoiceXML (see <a href="#S2.3.3">Section 2.3.3</a>) for <a href="#term-voicebrowser">voice browser</a> applications. It is worth noting that the VoiceXML namespace includes the SSML namespace elements and attributes. See Appendix O of [<a href="#ref-vxml">VXML</a>] for details.</p>
<pre class="example">
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml
http://www.w3.org/TR/voicexml20/vxml.xsd">
<form>
<block>
<prompt>
<emphasis>Welcome</emphasis> to the Bird Seed Emporium.
<audio src="rtsp://www.birdsounds.example.com/thrush.wav"/>
We have 250 kilogram drums of thistle seed for
$299.95
plus shipping and handling this month.
<audio src="http://www.birdsounds.example.com/mourningdove.wav"/>
</prompt>
</block>
</form>
</vxml>
</pre>
<h2 id="AppGt"><a id="AppG" name="AppG">Appendix G</a>: Summary of changes since the Candidate Recommendation</h2>
<p>This is a list of the major changes to the specification since the Candidate Recommendation:</p>
<ul>
<li>Removed unused definition of "Fatal Error" in section 1.5. (CR155)</li>
<li>Updated references to and descriptions of RDF in sections 3.1.5, 3.1.6, and 4. (CR166)</li>
<li>Removed "addition of XML declaration" in section 2.2.1. (CR170)</li>
<li>Revised 1.2, bullet 4, to remove implication that all processors use phonemes as the base acoustic units. (CR183)</li>
<li>Clarified in 3.2.4 that use of the <prosody> element with no attributes is an error. (CR184)</li>
<li>Clarified in 3.2.1 that use of the <voice> element with no attributes is an error. (CR185)</li>
<li>Changed audio/wav to audio/x-wav in Appendix A because the former type is not yet registered with IANA. (CR186)</li>
<li>Updated ToC and section titles to match. (CR190)</li>
<li>Updated informative bibliography reference to VoiceXML 2.0 to point to Recommendation.</li>
<li>Added missing "(SSML)" to title.</li>
<li>Linked instances of "error" to definition in section 1.5.</li>
<li>Removed reference to HTML from 3.1.5. (CR191)</li>
<li>Miscellaneous editorial fixes</li>
</ul>
<p>
<a href="http://validator.w3.org/check?uri=referer"><img
src="http://www.w3.org/Icons/valid-xhtml10"
alt="Valid XHTML 1.0!" height="31" width="88" /></a>
</p>
<p>
<a href="http://jigsaw.w3.org/css-validator/">
<img style="border:0;width:88px;height:31px"
src="http://jigsaw.w3.org/css-validator/images/vcss"
alt="Valid CSS!" />
</a>
</p>
</body>
</html>