<?xml version="1.0" encoding="UTF-8"?>
<!--OFFLINE
<!DOCTYPE html SYSTEM "xhtml1-transitional.dtd"> 
OFFLINE-->
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Speech Synthesis Markup Language (SSML) Version 1.1</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css" xml:space="preserve">
/**/
pre.example {
   font-family: monospace;
   white-space: pre;
   background: #CCCCFF;
   border: solid black thin;
   margin-left: 0;
   padding: 0.5em;
   font-size: 85%;
   width: 97%;
}
pre.dtd {
   font-family: "Lucida Console", "Courier New", monospace;
   white-space: pre;
   background: #CCFFCC;
   border: solid black thin;
   margin-left: 0;
   padding: 0.5em;
}

.ipa { font-family: "Lucida Sans Unicode", monospace; }
.RFC2119 {
  text-transform: lowercase;
  font-style: italic;
}
table { width: 100% }
td { background: #EAFFEA }

.tocline { list-style: none; }
.hide { display: none }
.issues { font-style: italic; color: green }

.recentremove {
    text-decoration: line-through;
    color: black;
}
.recentnew {
    color: red;
}

.remove {
    text-decoration: line-through;
    color: maroon;
}
.new {
    color: fuchsia;
}
.elements {
    font-family: monospace;
    font-weight: bold;
}
.attributes {
    font-family: monospace;
    font-weight: bold;
}
code.att {
    font-family: monospace;
    font-weight: bold;
}
a.adef {
    font-family: monospace;
    font-weight: bold;
}
a.aref {
    font-family: monospace;
    font-weight: bold;
}
a.edef {
    font-family: monospace;
    font-weight: bold;
}
a.eref {
    font-family: monospace;
    font-weight: bold;
}


    /**/
</style>
<link rel="stylesheet" type="text/css" href="http://www.w3.org/StyleSheets/TR/W3C-REC" />
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/" shape="rect"><img height="48" alt="W3C" src="http://www.w3.org/Icons/w3c_home" width="72" /> 
</a></p>

<h1 class="notoc" id="h1">Speech Synthesis Markup Language (SSML) Version 1.1</h1>

<h2 class="notoc" id="date">W3C Recommendation 7 September 2010</h2>

<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/" shape="rect">http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/</a></dd>

<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/speech-synthesis11/" shape="rect">http://www.w3.org/TR/speech-synthesis11/</a></dd>

<dt>Previous version:</dt>
<dd><a href="http://www.w3.org/TR/2010/PR-speech-synthesis11-20100223/" shape="rect">http://www.w3.org/TR/2010/PR-speech-synthesis11-20100223/</a></dd>

<dt><br clear="none" />
 Editors:</dt>

<dd>Daniel C. Burnett,  Voxeo (formerly of Vocalocity and Nuance) </dd>
<dd>双志伟 (Zhi Wei Shuang), IBM </dd>
<dt>Authors:</dt>
<dd>Paolo Baggia, Loquendo</dd>
<dd>Paul Bagshaw, France Telecom </dd>
<dd>Michael Bodell,  Microsoft </dd>
<dd>黄德智 (De Zhi Huang), France Telecom</dd>
<dd>楼晓雁 (Lou Xiaoyan), Toshiba </dd>
<dd>Scott McGlashan, HP</dd>
<dd>陶建华 (Jianhua Tao), Chinese Academy of Sciences</dd>
<dd>严峻 (Yan Jun), iFLYTEK</dd>

<dd>胡方 (Hu Fang) (until 20 October 2009 while an Invited Expert)</dd>
<dd>康永国 (Yongguo Kang) (until 5 December 2007 while at Panasonic Corporation)</dd>
<dd>蒙美玲 (Helen Meng) (until 29 July 2009 while at Chinese University of Hong Kong)</dd>
<dd>王霞 (Wang Xia) (until 30 October 2006 while at Nokia)</dd>
<dd>夏海荣 (Xia Hairong) (until 2 August 2006 while at Panasonic Corporation)</dd>
<dd>吴志勇 (Zhiyong Wu) (until 29 July 2009 while at Chinese University of Hong Kong)</dd>
</dl>


    <p>Please refer to the
    <a href="http://www.w3.org/2010/04/speech-synthesis11-errata.html">
    <strong>errata</strong></a>
    for this document, which may include some normative
    corrections.</p>

    <p>See also
    <a href="http://www.w3.org/2003/03/Translations/byTechnology?technology=speech-synthesis11">
    <strong>translations</strong></a>.</p>


<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright" shape="rect">Copyright</a> &copy; 2010 <a href="http://www.w3.org/" shape="rect"><acronym title="World Wide Web Consortium">W3C</acronym></a><sup>&reg;</sup> (<a href="http://www.csail.mit.edu/" shape="rect"><acronym title="Massachusetts Institute of Technology">MIT</acronym></a>, <a href="http://www.ercim.eu/" shape="rect"><acronym title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>, <a href="http://www.keio.ac.jp/" shape="rect">Keio</a>), All Rights Reserved. W3C <a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer" shape="rect">liability</a>, <a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks" shape="rect">trademark</a> and <a href="http://www.w3.org/Consortium/Legal/copyright-documents" shape="rect">document use</a> rules apply.</p>
<!--
      <p>Sun, Sun Microsystems, Inc., the Sun logo, Java and all
      Java-based marks and logos are trademarks or registered trademarks
      of Sun Microsystems, Inc. in the United States and other countries.
      &copy;2000 Sun Microsystems.</p>
      -->
<hr title="Separator from Header" />
</div>

<h2 class="notoc" id="abstr"><a id="abstract" name="abstract" shape="rect">Abstract</a></h2>

<p>The Voice Browser Working Group has sought to develop standards to enable access to the Web using spoken interaction. The Speech Synthesis Markup Language Specification is one of these standards and is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to provide authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms.</p>

<h2 class="notoc" id="status">Status of this Document</h2>

<p><em>This section describes the status of this document at the time of its publication. Other documents may supersede this document. A list of current W3C publications and the latest revision of this technical report can be found in the <a href="http://www.w3.org/TR/" shape="rect">W3C technical reports index</a> at http://www.w3.org/TR/.</em></p>

<p>This is the
<a href="http://www.w3.org/2005/10/Process-20051014/tr.html#RecsW3C">
Recommendation
</a>
of "Speech Synthesis Markup Language (SSML) Version 1.1".

It has been produced by the
<a href="http://www.w3.org/Voice/">Voice Browser Working Group</a>,
which is part of the
<a href="http://www.w3.org/Voice/Activity.html">Voice Browser Activity</a>.
</p>

<p>Comments are welcome on <a
href="mailto:www-voice@w3.org">www-voice@w3.org</a> (<a
href="http://lists.w3.org/Archives/Public/www-voice/">archive</a>).
See <a href="http://www.w3.org/Mail/">W3C mailing list and archive
usage guidelines</a>.</p>

<p>The design of SSML 1.1 has been widely reviewed (see the
<a href="http://www.w3.org/TR/2009/CR-speech-synthesis11-20090827/ssml11-disp.html">disposition of comments</a>)
and satisfies the Working Group's technical requirements.

A list of implementations is included in the
<a href="http://www.w3.org/Voice/2009/ssml11-ir/">
SSML 1.1 Implementation Report</a>,
along with the associated test suite.

The Working Group made a few editorial changes to the
<a href="http://www.w3.org/TR/2010/PR-speech-synthesis11-20100223/">
23 February 2010 Proposed Recommendation</a> in response to comments.

Changes from the Proposed Recommendation can be found in
<a href="#AppG">Appendix G</a>.

Changes from SSML 1.0, including a note on backwards compatibility
with SSML 1.0, can be found in
<a href="#AppF">Appendix F</a>.
</p>

<p>This document enhances SSML 1.0 [<a href="#ref-ssml" shape="rect">SSML</a>] to provide better support for a broader set of natural (human) languages. The W3C held three workshops on the Internationalization of SSML to determine in what ways, if any, the design of SSML limits its support for languages that are in large commercial or emerging markets for speech synthesis technologies but for which there was limited or no participation by either native speakers or experts during the development of SSML 1.0. The first workshop [<a href="#ref-WS">WS</a>], in Beijing, PRC, in October 2005, focused primarily on Chinese, Korean, and Japanese languages, and the second [<a href="#ref-WS2">WS2</a>], in Crete, Greece, in May 2006, focused primarily on Arabic, Indian, and Eastern European languages. The third workshop [<a href="#ref-WS3">WS3</a>], in Hyderabad, India, in January 2007, focused heavily on Indian and Middle Eastern languages. Information collected during these workshops was used to develop a requirements document [<a href="#ref-reqs11">REQS11</a>]. Changes from SSML 1.0 are motivated by these requirements.</p>

  <p>This document has been reviewed by W3C Members, by software
  developers, and by other W3C groups and interested parties, and is
  endorsed by the Director as a W3C Recommendation. It is a stable
  document and may be used as reference material or cited from another
  document. W3C's role in making the Recommendation is to draw
  attention to the specification and to promote its widespread
  deployment. This enhances the functionality and interoperability of
  the Web.</p>

<p>This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/" shape="rect">5 February 2004 W3C Patent Policy</a>. W3C maintains a <a rel="disclosure" href="http://www.w3.org/2004/01/pp-impl/34665/status" shape="rect">public list of any patent disclosures</a> made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential" shape="rect">Essential Claim(s)</a> must disclose the information in accordance with <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure" shape="rect">section 6 of the W3C Patent Policy</a>. </p>

  <p>The sections in the main body of this document are normative
  unless otherwise specified.  The appendices in this document are
  informative unless otherwise indicated explicitly.</p>


<h2 id="toc">Table of Contents</h2>

<ul class="toc">
<li class="tocline">1. <a href="#S1" shape="rect">Introduction</a> 
<ul class="toc">
<li class="tocline">1.1 <a href="#S1.1" shape="rect">Design Concepts</a></li>

<li class="tocline">1.2 <a href="#S1.2" shape="rect">Speech Synthesis Process Steps</a></li>

<li class="tocline">1.3 <a href="#S1.3" shape="rect">Document Generation, Applications and Contexts</a></li>

<li class="tocline">1.4 <a href="#S1.4" shape="rect">Platform-Dependent Output Behavior of SSML Content</a></li>

<li class="tocline">1.5 <a href="#S1.5" shape="rect">Terminology</a></li>
</ul>
</li>

<li class="tocline">2. <a href="#S2" shape="rect">SSML Documents</a> 
<ul class="toc">
<li class="tocline">2.1 <a href="#S2.1" shape="rect">Document Form</a></li>

<li class="tocline">2.2 <a href="#S2.2" shape="rect">Conformance</a> 
<ul class="toc">
<li class="tocline">2.2.1 <a href="#S2.2.1" shape="rect">Conforming Speech Synthesis Markup Language Fragments</a></li>

<li class="tocline">2.2.2 <a href="#S2.2.2" shape="rect">Conforming Stand-Alone Speech Synthesis Markup Language Documents</a></li>

<li class="tocline">2.2.3 <a href="#S2.2.3" shape="rect">Using SSML With Other Namespaces</a></li>

<li class="tocline">2.2.4 <a href="#S2.2.4" shape="rect">Conforming Speech Synthesis Markup Language Processors</a></li>

<li class="tocline">2.2.5 <a href="#S2.2.5" shape="rect">Profiles</a></li>

<li class="tocline">2.2.6 <a href="#S2.2.6" shape="rect">Conforming User Agent</a></li>
</ul>
</li>

<li class="tocline">2.3 <a href="#S2.3" shape="rect">Integration With Other Markup Languages</a> 
<ul class="toc">
<li class="tocline">2.3.1 <a href="#S2.3.1" shape="rect">SMIL</a></li>

<li class="tocline">2.3.2 <a href="#S2.3.2" shape="rect">ACSS</a></li>

<li class="tocline">2.3.3 <a href="#S2.3.3" shape="rect">VoiceXML</a></li>
</ul>
</li>

<li class="tocline">2.4 <a href="#S2.4" shape="rect">Fetching SSML Documents</a></li>
</ul>
</li>

<li class="tocline">3. <a href="#S3" shape="rect">Elements and Attributes</a> 
<ul class="toc">
<li class="tocline">3.1 <a href="#S3.1" shape="rect">Document Structure, Text Processing and Pronunciation</a> 
<ul class="toc">
<li class="tocline">3.1.1 <a href="#S3.1.1" shape="rect">"speak" Root Element</a>
<ul class="toc">
<li class="tocline">3.1.1.1 <a href="#S3.1.1.1" shape="rect">Trimming Attributes</a></li>
</ul>
</li>

<li class="tocline">3.1.2 <a href="#S3.1.2" shape="rect">Language: "xml:lang" Attribute</a></li>

<li class="tocline">3.1.3 <a href="#S3.1.3" shape="rect">Base URI: "xml:base" Attribute</a>
<ul class="toc">
<li class="tocline">3.1.3.1 <a href="#S3.1.3.1" shape="rect">Resolving Relative URIs</a></li>
</ul>
</li>

<li class="tocline">3.1.4 <a href="#S3.1.4" shape="rect">Identifier: "xml:id" Attribute</a></li>

<li class="tocline">3.1.5 <a href="#S3.1.5" shape="rect">Lexicon Documents</a>
<ul class="toc">
<li class="tocline">3.1.5.1 <a href="#S3.1.5.1" shape="rect">"lexicon" Element</a></li>
<li class="tocline">3.1.5.2 <a href="#S3.1.5.2" shape="rect">"lookup" Element</a></li>
</ul>
</li>

<li class="tocline">3.1.6 <a href="#S3.1.6" shape="rect">"meta" Element</a></li>

<li class="tocline">3.1.7 <a href="#S3.1.7" shape="rect">"metadata" Element</a></li>

<li class="tocline">3.1.8 <a href="#S3.1.8" shape="rect">Text Structure</a>
<ul class="toc">
<li class="tocline">3.1.8.1 <a href="#S3.1.8.1" shape="rect">"p" and "s" Elements</a></li>
<li class="tocline">3.1.8.2 <a href="#S3.1.8.2" shape="rect">"token" and "w" Elements</a></li>
</ul>
</li>

<li class="tocline">3.1.9 <a href="#S3.1.9" shape="rect">"say-as" Element</a></li>

<li class="tocline">3.1.10 <a href="#S3.1.10" shape="rect">"phoneme" Element</a>
<ul class="toc">
<li class="tocline">3.1.10.1 <a href="#S3.1.10.1" shape="rect">Pronunciation Alphabet Registry</a></li>
</ul>
</li>

<li class="tocline">3.1.11 <a href="#S3.1.11" shape="rect">"sub" Element</a></li>
<li class="tocline">3.1.12 <a href="#S3.1.12" shape="rect">"lang" Element</a></li>
<li class="tocline">3.1.13 <a href="#S3.1.13" shape="rect">Language Speaking Failure: "onlangfailure" Attribute</a></li>
</ul>
</li>

<li class="tocline">3.2 <a href="#S3.2" shape="rect">Prosody and Style</a> 
<ul class="toc">
<li class="tocline">3.2.1 <a href="#S3.2.1" shape="rect">"voice" Element</a></li>

<li class="tocline">3.2.2 <a href="#S3.2.2" shape="rect">"emphasis" Element</a></li>

<li class="tocline">3.2.3 <a href="#S3.2.3" shape="rect">"break" Element</a></li>

<li class="tocline">3.2.4 <a href="#S3.2.4" shape="rect">"prosody" Element</a></li>
</ul>
</li>

<li class="tocline">3.3 <a href="#S3.3" shape="rect">Other Elements</a> 
<ul class="toc">
<li class="tocline">3.3.1 <a href="#S3.3.1" shape="rect">"audio" Element</a>
<ul class="toc">
<li class="tocline">3.3.1.1 <a href="#S3.3.1.1" shape="rect">Trimming Attributes</a></li>
<li class="tocline">3.3.1.2 <a href="#S3.3.1.2" shape="rect">"soundLevel" Attribute</a></li>
<li class="tocline">3.3.1.3 <a href="#S3.3.1.3" shape="rect">"speed" Attribute</a></li>
</ul>
</li>

<li class="tocline">3.3.2 <a href="#S3.3.2" shape="rect">"mark" Element</a></li>

<li class="tocline">3.3.3 <a href="#S3.3.3" shape="rect">"desc" Element</a></li>
</ul>
</li>
</ul>
</li>

<li class="tocline">4. <a href="#S4" shape="rect">References</a></li>

<li class="tocline">5. <a href="#S5" shape="rect">Acknowledgments</a></li>

<li class="tocline">Appendix A. <a href="#AppA" shape="rect">Audio File Formats</a> (normative)</li>

<li class="tocline">Appendix B. <a href="#AppB" shape="rect">Internationalization</a> (normative)</li>

<li class="tocline">Appendix C. <a href="#AppC" shape="rect">Media Types and File Suffix</a> (normative)</li>

<li class="tocline">Appendix D. <a href="#AppD" shape="rect">Schema for the Speech Synthesis Markup Language</a> (normative)</li>

<li class="tocline">Appendix E. <a href="#AppE" shape="rect">Example SSML</a> (informative)</li>

<li class="tocline">Appendix F. <a href="#AppF" shape="rect">Changes since SSML 1.0</a> (informative)</li>

<li class="tocline">Appendix G. <a href="#AppG" shape="rect">Changes since last draft</a> (informative)</li>
</ul>

<h2><a id="S1" name="S1" shape="rect">1.</a> Introduction</h2>

<p>This W3C specification is known as the Speech Synthesis Markup Language specification (SSML) and is based upon the JSGF and/or JSML specifications, which are owned by Sun Microsystems, Inc., California, U.S.A. The JSML specification can be found at [<a href="#ref-jsml" shape="rect">JSML</a>].</p>

<p>SSML is part of a larger set of markup specifications for <a href="#term-voicebrowser" shape="rect">voice browsers</a> developed through the open processes of the W3C. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in Web and other applications. The essential role of the markup language is to give authors of synthesizable content a standard way to control aspects of speech output such as pronunciation, volume, pitch, rate, etc. across different synthesis-capable platforms. A related initiative to establish a standard system for marking up text input is SABLE [<a href="#ref-sable" shape="rect">SABLE</a>], which tried to integrate many different XML-based markups for <a href="#term-synthesis" shape="rect">speech synthesis</a> into a new one. The activity carried out in SABLE was also used as the main starting point for defining the Speech Synthesis Markup Requirements for Voice Markup Languages [<a href="#ref-reqs" shape="rect">REQS</a>]. Since then, SABLE itself has not undergone any further development.</p>

<p>The intended use of SSML is to improve the quality of synthesized content. Different markup elements impact different stages of the synthesis process (see <a href="#S1.2" shape="rect">Section 1.2</a>). The markup may be produced either automatically, for instance via XSLT or CSS3 from an XHTML document, or by human authoring. Markup may be present within a complete SSML document (see <a href="#S2.2.2" shape="rect">Section 2.2.2</a>) or as part of a fragment (see <a href="#S2.2.1" shape="rect">Section 2.2.1</a>) embedded in another language, although no interactions with other languages are specified as part of SSML itself. Most of the markup included in SSML is suitable for use by the majority of content developers; however, some advanced features like <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> and <a href="#edef_prosody" class="eref" shape="rect">prosody</a> (e.g. for speech contour design) may require specialized knowledge.</p>
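
<p>As an informal illustration only (not part of the specification), automatic generation of a small stand-alone document might be sketched in Python as follows. The function name <code>build_ssml</code> and its parameters are hypothetical. The sketch assumes only the Python standard library, the SSML namespace URI, and the <code>speak</code> and <code>prosody</code> elements with their <code>version</code>, <code>xml:lang</code> and <code>rate</code> attributes.</p>

```python
import xml.etree.ElementTree as ET

# Namespace URIs for SSML and for the reserved "xml" prefix.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(text: str, lang: str = "en-US", rate: str = "slow") -> str:
    """Return a minimal stand-alone SSML 1.1 document as a string.

    Hypothetical helper for illustration: the text is wrapped in a single
    prosody element that asks the processor for the given speaking rate.
    """
    # Serialize SSML elements without a prefix (as the default namespace).
    ET.register_namespace("", SSML_NS)

    # Root element: version and xml:lang are set on "speak".
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.1", f"{{{XML_NS}}}lang": lang},
    )

    # One prosody element carrying the author's rate hint.
    prosody = ET.SubElement(speak, f"{{{SSML_NS}}}prosody", {"rate": rate})
    prosody.text = text

    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(build_ssml("Hello, world."))
```

<p>Calling <code>build_ssml("Hello, world.")</code> returns a single <code>speak</code> element in the SSML namespace whose text is wrapped in a <code>prosody</code> element with <code>rate="slow"</code>. Whether and how the rate hint is honored remains up to the synthesis processor.</p>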

<h3><a id="S1.1" name="S1.1" shape="rect">1.1</a> Design Concepts</h3>

<p>The design and standardization process has followed from the Speech Synthesis Markup Requirements for Voice Markup Languages [<a href="#ref-reqs" shape="rect">REQS</a>].</p>

<p>The following items were the key design criteria.</p>

<ul>
<li><em>Consistency:</em> Provide predictable control of voice output across platforms and across <a href="#term-synthesis" shape="rect">speech synthesis</a> implementations.</li>

<li><em>Interoperability:</em> Support use along with other W3C specifications including (but not limited to) VoiceXML, aural Cascading Style Sheets and SMIL.</li>

<li><em>Generality:</em> Support speech output for a wide range of applications with varied speech content.</li>

<li><em>Internationalization:</em> Enable speech output in a large number of languages within or across documents.</li>

<li><em>Generation and Readability:</em> Support automatic generation and hand authoring of documents. The documents should be human-readable.</li>

<li><em>Implementable:</em> The specification should be implementable with existing, generally available technology, and the number of optional features should be minimal.</li>
</ul>

<h3><a id="S1.2" name="S1.2" shape="rect">1.2</a> Speech Synthesis Process Steps</h3>

<p>A <a href="#term-tts" shape="rect">Text-To-Speech</a> system (a <a href="#term-processor" shape="rect">synthesis processor</a>) that supports SSML will be responsible for rendering a document as spoken output and for using the information contained in the markup to render the document as intended by the author.</p>

<p><em>Document creation:</em> A text document provided as input to the <a href="#term-processor" shape="rect">synthesis processor</a> may be produced automatically, by human authoring, or through a combination of these forms. SSML defines the form of the document.</p>

<p><em>Document processing:</em> The following are the six major processing steps undertaken by a <a href="#term-processor" shape="rect">synthesis processor</a> to convert marked-up text input into automatically generated voice output. The markup language is designed to be sufficiently rich so as to allow control over each of the steps described below so that the document author (human or machine) can control the final voice output. Although each step below is divided into "markup support" and "non-markup behavior", actual behavior is usually a mix of the two and varies depending on the tag. The processor has the ultimate authority to ensure that what it produces is pronounceable (and ideally intelligible). In general the markup provides a way for the author to make prosodic and other information available to the processor, typically information the processor would be unable to acquire on its own. It is then up to the processor to determine whether and in what way to use the information.</p>

<ol>
<li>
<p><b>XML parse:</b> An XML parser is used to extract the document tree and content from the incoming text document. The structure, tags and attributes obtained in this step influence each of the following steps.</p>
</li>

<li>
<p><b>Structure analysis:</b> The structure of a document influences the way in which a document should be read. For example, there are common speaking patterns associated with paragraphs and sentences.</p>

<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_paragraph" class="eref" shape="rect">p</a> and <a href="#edef_sentence" class="eref" shape="rect">s</a> elements defined in SSML explicitly indicate document structures that affect the speech output.</p>
</li>

<li>
<p><em>Non-markup behavior:</em> In documents and parts of documents where these elements are not used, the <a href="#term-processor" shape="rect">synthesis processor</a> is responsible for inferring the structure by automated analysis of the text, often using punctuation and other language-specific data.</p>
</li>
</ul>
</li>

<li>
<p><a name="text_normalization" id="text_normalization" shape="rect"><b>Text normalization:</b></a> All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the <a href="#term-processor" shape="rect">synthesis processor</a> that performs this conversion. For example, for English, when "$200" appears in a document it may be spoken as "two hundred dollars". Similarly, "1/2" may be spoken as "half", "January second", "February first", "one of two" and so on. By the end of this step the text to be spoken has been converted completely into tokens. The exact details of what constitutes a token are language-specific. In English, tokens are usually separated by white space and are typically words. For languages with different tokenization behavior, the term "word" in this specification is intended to mean an appropriately comparable unit. Tokens in SSML cannot span markup tags except within the <a href="#edef_token" class="eref" shape="rect">token</a> and <a href="#edef_word" class="eref" shape="rect">w</a> elements. A simple English example is "cup&lt;break/&gt;board"; outside  the <a href="#edef_token" class="eref" shape="rect">token</a> and <a href="#edef_word" class="eref" shape="rect">w</a> elements, the <a href="#term-processor" shape="rect">synthesis processor</a> will treat this as the two tokens "cup" and "board" rather than as one token (word) with a pause in the middle. Breaking one token into multiple tokens this way will likely affect how the processor treats it.</p>

<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities. The set of constructs that can be marked has not yet been defined but might include dates, times, numbers, acronyms, currency amounts and more. Note that many acronyms and abbreviations can be handled by the author via direct text replacement or by use of the <a href="#edef_sub" class="eref" shape="rect">sub</a> element, e.g. "BBC" can be written as "B B C" and "AAA" can be written as "triple A". These replacement written forms will likely be pronounced as one would want the original acronyms to be pronounced. In the case of Japanese text, if you have a <a href="#term-processor" shape="rect">synthesis processor</a> that supports both Kanji and kana, you may be able to use the <a href="#edef_sub" class="eref" shape="rect">sub</a> element to identify whether 今日は should be spoken as きょうは ("kyou wa" = "today") or こんにちは ("konnichiwa" = "hello").</p>
</li>

<li>
<p><em>Non-markup behavior:</em> For text content that is not marked with the <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element the <a href="#term-processor" shape="rect">synthesis processor</a> is expected to make a reasonable effort to automatically locate and convert these constructs to a speakable form. Because of inherent ambiguities (such as the "1/2" example above) and because of the wide range of possible constructs in any language, this process may introduce errors in the speech output and may cause different processors to render the same document differently.</p>
</li>
</ul>
</li>

<li>
<p><b>Text-to-phoneme conversion:</b> Once the <a href="#term-processor" shape="rect">synthesis processor</a> has determined the set of  tokens to be spoken, it must derive pronunciations for each token. Pronunciations may be conveniently described as sequences of phonemes, which are units of sound in a language that serve to distinguish one word from another. Each language (and sometimes each national or dialect variant of a language) has a specific phoneme set: e.g., most US English dialects have around 45 phonemes, Hawai'ian has between 12 and 18 (depending on who you ask), and some languages have more than 100! This conversion is made complex by a number of issues. One issue is that there are differences between written and spoken forms of a language, and these differences can lead to indeterminacy or ambiguity in the pronunciation of written words. For example, compared with their spoken form, words in Hebrew and Arabic are usually written with no vowels, or only a few vowels specified. In many languages the same written word may have many spoken forms. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Both human speakers and synthesis processors can pronounce these words correctly in context but may have difficulty without context (see "Non-markup behavior" below). Another issue is the handling of words with non-standard spellings or pronunciations. For example, an English <a href="#term-processor" shape="rect">synthesis processor</a> will often have trouble determining how to speak some non-English-origin names, e.g. "Caius College" (pronounced "keys college") and President Tito (pronounced "sutto"), the president of the Republic of Kiribati (pronounced "kiribass").</p>

<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> element allows a phonemic sequence to be provided for any token or token sequence. This provides the content creator with explicit control over pronunciations. The <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element might also be used to indicate that text is a proper name, which may allow a <a href="#term-processor" shape="rect">synthesis processor</a> to apply special rules to determine a pronunciation. The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> and <a href="#edef_lookup" class="eref" shape="rect">lookup</a> elements can be used to reference external definitions of pronunciations. These elements can be particularly useful for acronyms and abbreviations that the processor is unable to resolve via its own <a href="#text_normalization" shape="rect">text normalization</a> and that are not addressable via direct text substitution or the <a href="#edef_sub" class="eref" shape="rect">sub</a> element (see step 3, above).</p>
</li>

<li>
<p><em>Non-markup behavior:</em> In the absence of a <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> element the <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> apply automated capabilities to determine pronunciations. This is typically achieved by looking up  tokens in a pronunciation dictionary (which may be language-dependent) and applying rules to determine other pronunciations. <a href="#term-processor" shape="rect">Synthesis processors</a> are designed to perform text-to-phoneme conversions so most words of most documents can be handled automatically. As an alternative to relying upon the processor, authors may choose to perform some conversions themselves prior to encoding in SSML. Written words with indeterminate or ambiguous pronunciations could be replaced by words with an unambiguous pronunciation; for example, in the case of "read", "I will reed the book". Authors should be aware, however, that the resulting SSML document may not be optimal for visual display.</p>
</li>
</ul>
</li>

<li>
<p><b>Prosody analysis:</b> Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words and many other features. Producing human-like prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language.</p>

<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a> element, <a href="#edef_break" class="eref" shape="rect">break</a> element and <a href="#edef_prosody" class="eref" shape="rect">prosody</a> element may all be used by document creators to guide the <a href="#term-processor" shape="rect">synthesis processor</a> in generating appropriate prosodic features in the speech output.</p>
</li>

<li>
<p><em>Non-markup behavior:</em> In the absence of these elements, <a href="#term-processor" shape="rect">synthesis processors</a> are expert (but not perfect) in automatically generating suitable prosody. This is achieved through analysis of the document structure, sentence syntax, and other information that can be inferred from the text input.</p>
</li>
</ul>

<p>While most of the elements of SSML can be considered high-level in that they provide either content to be spoken or logical descriptions of style, the <a href="#edef_break" class="eref" shape="rect">break</a> and <a href="#edef_prosody" class="eref" shape="rect">prosody</a> elements mentioned above operate at a later point in the process and thus must coexist both with uses of the <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a> element and with the processor's own determinations of prosodic behavior. Unless specified in the appropriate sections, details of the interactions between the processor's own determinations and those provided by the author at this level are processor-specific. Authors are encouraged not to casually or arbitrarily mix these two levels of control.</p>
</li>

<li>
<p><b>Waveform production:</b> The phonemes and prosodic information are used by the <a href="#term-processor" shape="rect">synthesis processor</a> in the production of the audio waveform. There are many approaches to this processing step so there may be considerable processor-specific variation.</p>

<ul>
<li>
<p><em>Markup support:</em> The <a href="#edef_voice" class="eref" shape="rect">voice</a> element allows the document creator to request a particular voice or specific voice qualities (e.g. a young male voice). The <a href="#edef_audio" class="eref" shape="rect">audio</a> element allows for insertion of recorded audio data into the output stream, with optional control over the duration, sound level and playback speed of the recording.  Rendering can be restricted to a subset of the document by using the trimming attributes on the <a href="#edef_speak" class="eref" shape="rect">speak</a> element.</p>
</li>
<li>
  <p><em>Non-markup behavior:</em> The default volume/sound level, speed, and pitch/frequency of both voices and recorded audio in the document are those of the unmodified waveforms, whether they be voices or recordings.</p>
</li>
</ul>
</li>
</ol>
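
<p>The markup support described for the steps above can be sketched in a single document. The following example is illustrative only: the <code class="att">interpret-as</code> value "currency" is an assumption (this specification does not define the set of values), the IPA string is one plausible transcription of "read" in the past tense, and the audio URI is hypothetical. As noted above, a processor decides whether and how to use each piece of information.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- Structure analysis: explicit paragraph and sentence boundaries --&gt;
  &lt;p&gt;
    &lt;!-- Text normalization: sub gives a spoken form; say-as marks a construct.
         The value "currency" is assumed and processor-specific. --&gt;
    &lt;s&gt;The &lt;sub alias="B B C"&gt;BBC&lt;/sub&gt; report costs
       &lt;say-as interpret-as="currency"&gt;$200&lt;/say-as&gt;.&lt;/s&gt;

    &lt;!-- Text-to-phoneme conversion: explicit pronunciation for an ambiguous word --&gt;
    &lt;s&gt;I have &lt;phoneme alphabet="ipa" ph="rɛd"&gt;read&lt;/phoneme&gt; the book.&lt;/s&gt;

    &lt;!-- Prosody analysis: emphasis, an explicit pause and a slower rate --&gt;
    &lt;s&gt;Please read it &lt;emphasis&gt;today&lt;/emphasis&gt;
       &lt;break time="500ms"/&gt;
       &lt;prosody rate="slow"&gt;and take your time&lt;/prosody&gt;.&lt;/s&gt;
  &lt;/p&gt;

  &lt;!-- Waveform production: request voice qualities and insert recorded audio.
       The URI is hypothetical; the element content is fallback text. --&gt;
  &lt;voice gender="female"&gt;
    &lt;s&gt;Thank you for listening.&lt;/s&gt;
  &lt;/voice&gt;
  &lt;audio src="http://www.example.com/chime.wav"&gt;chime&lt;/audio&gt;
&lt;/speak&gt;
</pre>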

<h3><a id="S1.3" name="S1.3" shape="rect">1.3</a> Document Generation, Applications and Contexts</h3>

<p>There are many classes of document creator that will produce marked-up documents to be spoken by a <a href="#term-processor" shape="rect">synthesis processor</a>. Not all document creators (including human and machine) have access to information that can be used in all of the elements or in each of the processing steps described in the <a href="#S1.2" shape="rect">previous section</a>. The following are some of the common cases.</p>

<ul>
<li>
<p>The document creator has no access to information to mark up the text. All processing steps in the <a href="#term-processor" shape="rect">synthesis processor</a> must be performed fully automatically on <em>raw text</em>. The document requires only the containing <a href="#edef_speak" class="eref" shape="rect">speak</a> element to indicate the content is to be spoken.</p>
</li>

<li>
<p>When marked text is generated programmatically the creator may have specific knowledge of the structure and/or special text constructs in some or all of the document. For example, an email reader can mark the location of the time and date of receipt of email. Such applications may use elements that affect structure, <a href="#text_normalization" shape="rect">text normalization</a>, prosody and possibly text-to-phoneme conversion.</p>
</li>

<li>
<p>Some document creators make considerable effort to mark as many details of the document as possible to ensure consistent speech quality across platforms and to more precisely specify output qualities. In these cases, the markup may use any or all of the available elements to tightly control the speech output. For example, prompts generated in telephony and <a href="#term-voicebrowser" shape="rect">voice browser</a> applications may be fine-tuned to maximize the effectiveness of the overall system.</p>
</li>

<li>
<p>The most advanced document creators may skip the higher-level markup (structure, <a href="#text_normalization" shape="rect">text normalization</a>, text-to-phoneme conversion, and prosody analysis) and produce low-level <a href="#term-synthesis" shape="rect">speech synthesis</a> markup for segments of documents or for entire documents. This typically requires tools to generate sequences of phonemes, plus pitch and timing information. For instance, tools that do "copy synthesis" or "prosody transplant" try to emulate human speech by copying properties from recordings.</p>
</li>
</ul>
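
<p>As a rough illustration of the first two cases above, the same message might be produced once as raw text and once with markup added by a program that knows where the time of receipt appears. The message text is invented, and the <code class="att">interpret-as</code> value "time" is an assumption rather than a value defined by this specification.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;!-- Raw text: only the containing speak element is supplied --&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  You have one new message. It arrived at 9:45 this morning.
&lt;/speak&gt;
</pre>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;!-- Programmatically marked text: structure and the time construct are explicit --&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;s&gt;You have one new message.&lt;/s&gt;
  &lt;s&gt;It arrived at &lt;say-as interpret-as="time"&gt;9:45&lt;/say-as&gt; this morning.&lt;/s&gt;
&lt;/speak&gt;
</pre>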

<p>The following are important instances of architectures or designs from which marked-up synthesis documents will be generated. The language design is intended to facilitate each of these approaches.</p>

<ul>
<li>
<p><em>Dialog language</em>: It is a requirement that it  <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> be possible to include documents marked with SSML in the dialog description document to be produced by the Voice Browser Working Group.</p>
</li>

<li>
<p><em>Interoperability with aural CSS (ACSS)</em>: Any HTML processor that is aural CSS-enabled can produce SSML. ACSS is covered in <a href="http://www.w3.org/TR/1998/REC-CSS2-19980512/aural.html" shape="rect">Section 19 of the Cascading Style Sheets, level 2 (CSS2) Specification</a> [<a href="#ref-css2" shape="rect">CSS2</a> §19]. This usage of <a href="#term-synthesis" shape="rect">speech synthesis</a> facilitates improved accessibility to existing HTML and XHTML content.</p>
</li>

<li>
<p><em>Application-specific style sheet processing</em>: As mentioned above, there are classes of applications that have knowledge of the text content to be spoken, and this knowledge can be incorporated into the <a href="#term-synthesis" shape="rect">speech synthesis</a> markup to enhance rendering of the document. In many cases, it is expected that the application will use style sheets to perform transformations of existing XML documents to SSML. This is equivalent to the use of ACSS with HTML and once again SSML is the resulting representation to be passed to the <a href="#term-processor" shape="rect">synthesis processor</a>. In this context, SSML may be viewed as a superset of <a href="http://www.w3.org/TR/1998/REC-CSS2-19980512/aural.html" shape="rect">ACSS</a> [<a href="#ref-css2" shape="rect">CSS2</a> §19] capabilities, excepting spatial audio.</p>
</li>
</ul>
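
<p>A style sheet transformation of this kind might look like the following XSLT 1.0 sketch. It is not part of SSML: the input vocabulary (a <code>message</code> element with <code>sender</code> and <code>body</code> children) is hypothetical, and a real application would choose its own mapping from source elements to SSML elements.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;!-- The default namespace places the literal result elements in the SSML namespace --&gt;
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/2001/10/synthesis"&gt;
  &lt;xsl:output method="xml" encoding="UTF-8"/&gt;

  &lt;!-- "message", "sender" and "body" are assumed names in the source document --&gt;
  &lt;xsl:template match="/message"&gt;
    &lt;speak version="1.1" xml:lang="en-US"&gt;
      &lt;p&gt;
        &lt;s&gt;Message from &lt;xsl:value-of select="sender"/&gt;.&lt;/s&gt;
        &lt;s&gt;&lt;xsl:value-of select="body"/&gt;&lt;/s&gt;
      &lt;/p&gt;
    &lt;/speak&gt;
  &lt;/xsl:template&gt;
&lt;/xsl:stylesheet&gt;
</pre>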

<h3><a id="S1.4" name="S1.4" shape="rect">1.4</a> Platform-Dependent Output Behavior of SSML Content</h3>

<p>SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of this document.</p>

<p>Unless otherwise specified, markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the <a href="#term-processor" shape="rect">synthesis processor</a> cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.</p>
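
<p>The duration case described above might arise from markup such as the following sketch, in which the inner segment requests more time than the segment that contains it. The text and values are invented for illustration; a processor that cannot reasonably honor both requests may adjust either duration.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- The outer request (2s) is shorter than the inner request (5s) --&gt;
  &lt;prosody duration="2s"&gt;
    The whole segment requests two seconds,
    &lt;prosody duration="5s"&gt;but this part alone requests five&lt;/prosody&gt;.
  &lt;/prosody&gt;
&lt;/speak&gt;
</pre>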

<h3><a id="S1.5" name="S1.5" shape="rect">1.5</a> Terminology</h3>

<dl>
<dt><br clear="none" />
 <b><em><a id="term-requirements" name="term-requirements" shape="rect">Requirements terms</a></em></b></dt>

<dd>The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [<a href="#ref-rfc2119" shape="rect">RFC2119</a>]. However, for readability, these words do not appear in all uppercase letters in this specification.</dd>

<dt><br clear="none" />
 <b><em><a id="term-useroption" name="term-useroption" shape="rect">At user option</a></em></b></dt>

<dd>A conforming <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> or  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> (depending on the modal verb in the sentence) behave as described; if it does, it  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> provide users a means to enable or disable the behavior described.</dd>

<dt><br clear="none" />
 <b><em><a id="term-error" name="term-error" shape="rect">Error</a></em></b></dt>

<dd>Results are undefined. A conforming <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> detect and report an error and  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> recover from it.</dd>

<dt><br clear="none" />
 <b><em><a id="term-media-type" name="term-media-type" shape="rect">Media Type</a></em></b></dt>

<dd>A <em>media type</em> (defined in [<a href="#ref-rfc2045" shape="rect">RFC2045</a>] and [<a href="#ref-rfc2046" shape="rect">RFC2046</a>]) specifies the nature of a linked resource. Media types are case-insensitive. A list of registered media types is available for download [<a href="#ref-mimetypes" shape="rect">TYPES</a>]. See <a href="#AppC" shape="rect">Appendix C</a> for information on media types for SSML.</dd>

<dt><br clear="none" />
 <b><em><a id="term-synthesis" name="term-synthesis" shape="rect">Speech Synthesis</a></em></b></dt>

<dd>The process of automatic generation of speech output from data input which may include plain text, marked up text or binary objects.</dd>

<dt><br clear="none" />
 <b><em><a id="term-processor" name="term-processor" shape="rect">Synthesis Processor</a></em></b></dt>

<dd>A <a href="#term-tts" shape="rect">Text-To-Speech</a> system that accepts SSML documents as input and renders them as spoken output.</dd>

<dt><br clear="none" />
 <b><em><a id="term-tts" name="term-tts" shape="rect">Text-To-Speech</a></em></b></dt>

<dd>The process of automatic generation of speech output from text or annotated text input.</dd>

<dt><br clear="none" />
 <b><em><a id="term-uri" name="term-uri" shape="rect">URI: Uniform Resource Identifier</a></em></b></dt>

<dd> A global identifier in the context of the World Wide Web [<a href="#ref-web-arch">WEB-ARCH</a>]. A URI is defined as any legal <code><a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#anyURI" shape="rect">anyURI</a></code> primitive as defined in XML Schema Part 2: Datatypes [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.2.17]. For informational purposes only, [<a href="#ref-rfc3986" shape="rect">RFC3986</a>] and [<a href="#ref-rfc2732" shape="rect">RFC2732</a>] may be useful in understanding the structure, format, and use of URIs. Note that IRIs (see [<a href="#ref-rfc3987" shape="rect">RFC3987</a>]) are permitted within the above definition of URI. Any relative URI reference  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be resolved according to the rules given in <a href="#S3.1.3.1" shape="rect">Section 3.1.3.1</a>. In this specification URIs are provided as attributes to elements, for example in the <a href="#edef_audio" class="eref" shape="rect">audio</a> and <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> elements.</dd>

<dt><br clear="none" />
 <b><em><a id="term-voicebrowser" name="term-voicebrowser" shape="rect">Voice Browser</a></em></b></dt>

<dd>A device which interprets a (voice) markup language and is capable of generating voice output and/or interpreting voice input, and possibly other input/output modalities.</dd>
</dl>

<h2 id="g28"><a id="S2" name="S2" shape="rect">2.</a> SSML Documents</h2>

<h3 id="g29"><a id="S2.1" name="S2.1" shape="rect">2.1</a> Document Form</h3>

<p>A legal stand-alone Speech Synthesis Markup Language document  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> have a legal XML Prolog [<a href="#ref-xml10" shape="rect">XML 1.0</a> or <a href="#ref-xml11" shape="rect">XML 1.1</a>, as appropriate, §2.8].</p>

<p>The XML prolog is followed by the root <a href="#edef_speak" class="eref" shape="rect">speak</a> element. See <a href="#S3.1.1" shape="rect">Section 3.1.1</a> for details on this element.</p>

<p>The <a href="#edef_speak" class="eref" shape="rect">speak</a> element  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> designate the SSML namespace. This can be achieved by declaring an <code class="att">xmlns</code> attribute or an attribute with an "xmlns" prefix. See [<a href="#ref-xmlns10" shape="rect">XMLNS 1.0</a> or <a href="#ref-xmlns11" shape="rect">XMLNS 1.1</a>, as appropriate, §2] for details. Note that when the <code class="att">xmlns</code> attribute is used alone, it sets the default namespace for the element on which it appears and for any child elements. The namespace for SSML is defined to be <a href="http://www.w3.org/2001/10/synthesis" shape="rect">http://www.w3.org/2001/10/synthesis</a>.</p>

<p>It is  <em title="RECOMMENDED in RFC 2119 context" class="RFC2119">RECOMMENDED</em> that the <a href="#edef_speak" class="eref" shape="rect">speak</a> element also indicate the location of the appropriate SSML schema (see <a href="#AppD" shape="rect">Appendix D</a>) via the <code class="att"><a href="http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/#xsi_schemaLocation" shape="rect">xsi:schemaLocation</a></code> attribute from [<a href="#ref-schema1" shape="rect">SCHEMA1</a> §2.6.3]. Although such indication is not required, this document encourages it by providing it in most of the examples. When this attribute is not given, the Core profile [<a href="#S2.2.5">Section 2.2.5</a>] <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be assumed.</p>

<p>The following are two examples of legal SSML headers:</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
</pre>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
</pre>
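
<p>Both headers above use the <code class="att">xmlns</code> attribute alone. The namespace can also be designated through an attribute with an "xmlns" prefix, as in the following sketch of a complete document. The prefix <code>ssml</code> is an arbitrary choice made for this illustration; every SSML element then carries that prefix.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;!-- The "ssml" prefix is bound to the SSML namespace; no default namespace is set --&gt;
&lt;ssml:speak version="1.1"
            xmlns:ssml="http://www.w3.org/2001/10/synthesis"
            xml:lang="en-US"&gt;
  &lt;ssml:s&gt;Hello, world.&lt;/ssml:s&gt;
&lt;/ssml:speak&gt;
</pre>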

<p>The <a href="#edef_meta" class="eref" shape="rect">meta</a>, <a href="#edef_metadata" class="eref" shape="rect">metadata</a> and <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> elements  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> occur before all other elements and text contained within the root <a href="#edef_speak" class="eref" shape="rect">speak</a> element. There are no other ordering constraints on the elements in this specification.</p>
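
<p>A document that respects this ordering constraint might be laid out as in the following sketch. The metadata and lexicon URIs and the identifier "names" are hypothetical.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- meta, metadata and lexicon come first; their relative order is not constrained --&gt;
  &lt;meta name="seeAlso" content="http://www.example.com/metadata.xml"/&gt;
  &lt;lexicon uri="http://www.example.com/lexicon.pls" xml:id="names"/&gt;

  &lt;!-- All other elements and text follow --&gt;
  &lt;p&gt;
    &lt;s&gt;Welcome to &lt;lookup ref="names"&gt;Caius College&lt;/lookup&gt;.&lt;/s&gt;
  &lt;/p&gt;
&lt;/speak&gt;
</pre>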

<h3 id="g34"><a id="S2.2" name="S2.2" shape="rect">2.2</a> Conformance</h3>

<h3 id="g2.2.1"><a id="S2.2.1" name="S2.2.1" shape="rect">2.2.1</a> Conforming Speech Synthesis Markup Language Fragments</h3>

<h3 id="g2.2.1.1"><a id="S2.2.1.1" name="S2.2.1.1" shape="rect">2.2.1.1</a> Conforming Core Speech Synthesis Markup Language Fragments</h3>

<p>A document fragment is a <em>Conforming Core Speech Synthesis Markup Language Fragment</em> if:</p>
<ul>
  <li>it conforms to the criteria for <a href="#S2.2.2.1" shape="rect">Conforming Stand-Alone Core Speech Synthesis Markup Language Documents</a> after:
    <ul>
        <li>with the exception of <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> and <a href="#adef_xmlbase" class="aref" shape="rect"><code class="att">xml:base</code></a>, all non-synthesis namespace elements and attributes and all <code class="att">xmlns</code> attributes which refer to non-synthesis namespace elements are removed from the document,</li>
      <li>and, if the <a href="#edef_speak" class="eref" shape="rect">speak</a> element does not already designate the synthesis namespace using the <code class="att">xmlns</code> attribute, then <code>xmlns="http://www.w3.org/2001/10/synthesis"</code> is added to the element.</li>
    </ul>
  </li>
</ul>
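
<p>The two adjustments can be sketched as follows. The first block is a fragment as it might appear inside a host language; the namespace bound to <code>ex</code> and the attribute <code>ex:prompt-id</code> are hypothetical. The second block is the result of removing the non-synthesis attribute and its namespace declaration and adding the synthesis namespace, which is the form checked against the stand-alone criteria.</p>

<pre class="example" xml:space="preserve">
&lt;!-- Fragment as embedded: carries a host-language attribute and no synthesis xmlns --&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns:ex="http://www.example.com/host-language"
       ex:prompt-id="greeting"&gt;
  Hello, world.
&lt;/speak&gt;
</pre>

<pre class="example" xml:space="preserve">
&lt;!-- After the adjustments: foreign attribute and declaration removed, xmlns added --&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"&gt;
  Hello, world.
&lt;/speak&gt;
</pre>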

<h3 id="g2.2.1.2"><a id="S2.2.1.2" name="S2.2.1.2" shape="rect">2.2.1.2</a> Conforming Extended Speech Synthesis Markup Language Fragments</h3>
<p>A document fragment is a <em>Conforming Extended Speech Synthesis Markup Language Fragment</em> if:</p>
<ul>
  <li>it conforms to the criteria for <a href="#S2.2.2.2" shape="rect">Conforming Stand-Alone Extended Speech Synthesis Markup Language Documents</a> after: 
    <ul>
  <li>with the exception of <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> and <a href="#adef_xmlbase" class="aref" shape="rect"><code class="att">xml:base</code></a>, all non-synthesis namespace elements and attributes and all <code class="att">xmlns</code> attributes which refer to non-synthesis namespace elements are removed from the document,</li>
  
<li>and, if the <a href="#edef_speak" class="eref" shape="rect">speak</a> element does not already designate the synthesis namespace using the <code class="att">xmlns</code> attribute, then <code>xmlns="http://www.w3.org/2001/10/synthesis"</code> is added to the element.</li>
  </ul>
  </li>
</ul>
<h3 id="g36"><a id="S2.2.2" name="S2.2.2" shape="rect">2.2.2</a> Conforming Stand-Alone Speech Synthesis Markup Language Documents</h3>

<h3 id="g2.2.2.1"><a id="S2.2.2.1" name="S2.2.2.1" shape="rect">2.2.2.1</a> Conforming Stand-Alone Core Speech Synthesis Markup Language Documents</h3>
<p>A document is a <em>Conforming Stand-Alone Core Speech Synthesis Markup Language Document</em> if it meets both the following conditions:</p>

<ul>
  <li>It is a well-formed XML document [<a href="#ref-xml10" shape="rect">XML 1.0</a> or <a href="#ref-xml11" shape="rect">XML 1.1</a> §2.1] conforming to Namespaces in XML (1.0 [<a href="#ref-xmlns10" shape="rect">XMLNS 1.0</a>] or 1.1 [<a href="#ref-xmlns11" shape="rect">XMLNS 1.1</a>], respectively).</li>
  <li>It is a valid XML document [<a href="#ref-xml10" shape="rect">XML 1.0</a> or <a href="#ref-xml11" shape="rect">XML 1.1</a> §2.8] which adheres to the specification described in this document (<a href="#S1" shape="rect">Speech Synthesis Markup Language Specification</a>) including the constraints expressed in the Core Schema (see <a href="#AppD" shape="rect">Appendix D</a>) and having an XML Prolog and <a href="#edef_speak" class="eref" shape="rect">speak</a> root element as specified in <a href="#S2.1" shape="rect">Section 2.1</a>.</li>
</ul>

<h3 id="g2.2.2.2"><a id="S2.2.2.2" name="S2.2.2.2" shape="rect">2.2.2.2</a> Conforming Stand-Alone Extended Speech Synthesis Markup Language Documents</h3>
<p>A document is a <em>Conforming Stand-Alone Extended Speech Synthesis Markup Language Document</em> if it meets both the following conditions:</p>
<ul>
  <li>It is a well-formed XML document [<a href="#ref-xml10" shape="rect">XML 1.0</a> or <a href="#ref-xml11" shape="rect">XML 1.1</a> §2.1] conforming to Namespaces in XML (1.0 [<a href="#ref-xmlns10" shape="rect">XMLNS 1.0</a>] or 1.1 [<a href="#ref-xmlns11" shape="rect">XMLNS 1.1</a>], respectively).</li>
  
<li>It is a valid XML document [<a href="#ref-xml10" shape="rect">XML 1.0</a> or <a href="#ref-xml11" shape="rect">XML 1.1</a> §2.8] which adheres to the specification described in this document (<a href="#S1" shape="rect">Speech Synthesis Markup Language Specification</a>) including the constraints expressed in the  Extended Schema (see <a href="#AppD" shape="rect">Appendix D</a>) and having an XML Prolog and <a href="#edef_speak" class="eref" shape="rect">speak</a> root element as specified in <a href="#S2.1" shape="rect">Section 2.1</a>.</li>
</ul>
<p>The SSML specification and these conformance criteria provide no designated size limits on any aspect of synthesis documents. There are no maximum values on the number of elements, the amount of character data, or the number of characters in attribute values.</p>

<h3><a id="S2.2.3" name="S2.2.3" shape="rect">2.2.3</a> Using SSML with other Namespaces</h3>

<p>The synthesis namespace  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be used with other XML namespaces as per the appropriate Namespaces in XML Recommendation (1.0 [<a href="#ref-xmlns10" shape="rect">XMLNS 1.0</a>] or 1.1 [<a href="#ref-xmlns11" shape="rect">XMLNS 1.1</a>], depending on the version of XML being used). Future work by W3C is expected to address ways to specify conformance for documents involving multiple namespaces. Language-specific (i.e. non-SSML) elements and attributes may be inserted into SSML using an appropriate namespace.  However,  such content would only be rendered by a <a href="#term-processor" shape="rect">synthesis processor</a> that supported the custom markup. Here is an example of how one might insert Ruby [<a href="#ref-ruby" shape="rect">RUBY</a>] elements into SSML: </p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xhtml="http://www.w3.org/1999/xhtml"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="ja"&gt;
  &lt;!-- It's 20 July today. --&gt;
  &lt;s&gt;今日は七月
    &lt;xhtml:ruby&gt;
      &lt;xhtml:rb&gt;二十日&lt;/xhtml:rb&gt;
      &lt;xhtml:rt role="alphabet:x-JEITA"&gt;ハツカ&lt;/xhtml:rt&gt;
    &lt;/xhtml:ruby&gt;
    です。
  &lt;/s&gt;

  &lt;!-- It's 20 July today. --&gt;
  &lt;s&gt;今日は七月
    &lt;xhtml:ruby&gt;
      &lt;xhtml:rb&gt;二十日&lt;/xhtml:rb&gt;
      &lt;xhtml:rt role="alphabet:x-JEITA"&gt;ニジューニチ&lt;/xhtml:rt&gt;
    &lt;/xhtml:ruby&gt;
    です。
  &lt;/s&gt;
&lt;/speak&gt;</pre>

<h3 id="g38"><a id="S2.2.4" name="S2.2.4" shape="rect">2.2.4</a> Conforming Speech Synthesis Markup Language Processors</h3>

<p>In a <em>Conforming Speech Synthesis Markup Language Processor</em>, the XML parser <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be able to parse and process all XML constructs defined by XML 1.0 [<a href="#ref-xml10" shape="rect">XML 1.0</a>] and XML 1.1 [<a href="#ref-xml11" shape="rect">XML 1.1</a>] and the corresponding versions of Namespaces in XML (1.0 [<a href="#ref-xmlns10" shape="rect">XMLNS 1.0</a>] and 1.1 [<a href="#ref-xmlns11" shape="rect">XMLNS 1.1</a>]). This XML parser is not required to perform validation of an SSML document as per its schema or DTD; this implies that during processing of an SSML document it is <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> to apply or expand external entity references defined in an external DTD.</p>
<p>A Conforming Speech Synthesis Markup Language Processor <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> meet the following requirements for handling of natural (human) languages:</p>
<ul>
  <li>A Conforming Speech Synthesis Markup Language Processor is <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> to parse all legal natural language declarations successfully.</li>
  <li>A Conforming Speech Synthesis Markup Language Processor may be able to apply the semantics of markup languages which refer to more than one natural language. When a processor is able to support each natural language in the set but is unable to handle them concurrently it <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> inform the hosting environment. When the set includes one or more natural languages that are not supported by the processor it <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> inform the hosting environment.</li>
  <li>A Conforming Speech Synthesis Markup Language Processor <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> implement natural languages by approximate substitutions according to a documented, processor-specific behavior. For example, a US English synthesis processor could process British English input.</li>
</ul>
<p>There is no conformance requirement with respect to performance characteristics of the Speech Synthesis Markup Language Processor. For instance, no statement is required regarding the accuracy, speed or other characteristics of speech produced by the processor. No statement is made regarding the size of input that a Speech Synthesis Markup Language Processor must support.</p>
<h3 id="g2.2.4.1"><a id="S2.2.4.1" name="S2.2.4.1" shape="rect">2.2.4.1</a> Conforming Core Speech Synthesis Markup Language Processors</h3>

<p>A Core Speech Synthesis Markup Language processor is a  Conforming Speech Synthesis Markup Language Processor that can parse and process <a href="#S2.2.2.1" shape="rect">Conforming Stand-Alone Core Speech Synthesis Markup Language documents</a>.</p>
<p>A Conforming Core Speech Synthesis Markup Language Processor <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> correctly understand and apply the semantics of  the elements and attributes of the <a href="#S2.2.5">Core profile</a> as described by this document.</p>
<p>When a Conforming Core Speech Synthesis Markup Language Processor encounters elements or attributes other than those included in the <a href="#S2.2.5">Core profile</a> it <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em>:</p>
<ul>
  <li>ignore the non-standard elements and/or attributes</li>
  <li>or, process the non-standard elements and/or attributes</li>
  <li>or, reject the document containing those elements and/or attributes</li>
</ul>

<h3 id="g2.2.4.2"><a id="S2.2.4.2" name="S2.2.4.2" shape="rect">2.2.4.2</a> Conforming Extended Speech Synthesis Markup Language Processors</h3>
<p>An Extended Speech Synthesis Markup Language processor is a  Conforming Speech Synthesis Markup Language Processor that can parse and process <a href="#S2.2.2.2" shape="rect">Conforming Stand-Alone Extended Speech Synthesis Markup Language documents</a>.</p>
<p>A Conforming Extended Speech Synthesis Markup Language Processor  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> correctly understand and apply the semantics of the elements and attributes of the <a href="#S2.2.5">Extended profile</a> as described by this document.</p>
<p>When a Conforming Extended Speech Synthesis Markup Language Processor encounters elements or attributes other than those included in the <a href="#S2.2.5">Extended profile</a> it  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em>:</p>
<ul>
  <li>ignore the non-standard elements and/or attributes</li>
  
<li>or, process the non-standard elements and/or attributes</li>
  
<li>or, reject the document containing those elements and/or attributes</li>
</ul>

<h3 id="g2.2.5"><a id="S2.2.5" name="S2.2.5" shape="rect">2.2.5</a> Profiles </h3>
<p>An SSML Profile is a collection of SSML elements and attributes. There are only two profiles defined in this document:</p>
<dl>
  <dt>Core profile</dt>
  <dd>The Core profile consists of all elements and attributes defined in this specification except for the <code class="att">clipBegin</code>, <code class="att">clipEnd</code>, <code class="att">repeatCount</code>, <code class="att">repeatDur</code>, <code class="att">soundLevel</code>, and <code class="att">speed</code> attributes on the <a href="#edef_audio" class="eref" shape="rect">audio</a> element. </dd>
  <dt>Extended profile</dt>
  <dd>The Extended profile consists of all elements and attributes defined in this specification.</dd>
</dl>
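
<p>As a rough illustration of the difference between the two profiles, the sketch below uses two of the <a href="#edef_audio" class="eref" shape="rect">audio</a> attributes that belong only to the Extended profile. The file name and the attribute values are hypothetical. A Conforming Extended processor is expected to apply the trimming, while a Conforming Core processor may ignore the extra attributes, process them, or reject the document (see <a href="#S2.2.4.1" shape="rect">Section 2.2.4.1</a>).</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- Within the Core profile: only the src attribute is used --&gt;
  &lt;audio src="jingle.wav"&gt;a short jingle&lt;/audio&gt;

  &lt;!-- Extended profile only: clipBegin and clipEnd (values assumed for illustration) --&gt;
  &lt;audio src="jingle.wav" clipBegin="2s" clipEnd="5s"&gt;a short jingle&lt;/audio&gt;
&lt;/speak&gt;
</pre>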
  <h3 id="g2.2.6"><a id="S2.2.6" name="S2.2.6" shape="rect">2.2.6</a> Conforming User Agent</h3>

<p>A <em>Conforming User Agent</em> is a <a href="#S2.2.4" shape="rect">Conforming Speech Synthesis Markup Language Processor</a> that is capable of accepting an SSML document as input and producing a spoken output by using the information contained in the markup to render the document as intended by the author. A Conforming User Agent  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> support at least one natural language.</p>

<p>Since the output cannot be guaranteed to be a correct representation of all the markup contained in the input, there is no conformance requirement regarding accuracy. A conformance test <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em>, however, require some examples of correct synthesis of a reference document to determine conformance.</p>

<h3 id="g30"><a id="S2.3" name="S2.3" shape="rect">2.3</a> Integration With Other Markup Languages</h3>

<h4 id="g31"><a id="S2.3.1" name="S2.3.1" shape="rect">2.3.1</a> SMIL</h4>

<p>The Synchronized Multimedia Integration Language (SMIL, pronounced "smile") [<a href="#ref-smil3" shape="rect">SMIL3</a>] enables simple authoring of interactive audiovisual presentations. SMIL is typically used for "rich media"/multimedia presentations which integrate streaming audio and video with images, text or any other media type. SMIL is an easy-to-learn HTML-like language, and many SMIL presentations are written using a simple text editor. See the SMIL/SSML integration examples in <a href="#AppE" shape="rect">Appendix E</a>.</p>

<h4 id="g32"><a id="S2.3.2" name="S2.3.2" shape="rect">2.3.2</a> ACSS</h4>

<p>Aural Cascading Style Sheets [<a href="#ref-css2" shape="rect">CSS2</a> §19] are employed to augment standard visual forms of documents (like HTML) with additional elements that assist in the synthesis of the text into audio. In comparison to SSML, ACSS-generated documents are capable of more complex specifications of the audio sequence, including the designation of 3D location of the audio source. Many of the other ACSS elements overlap SSML functionality, especially in the specification of voice type/quality. SSML may be viewed as a superset of ACSS capabilities, excepting spatial audio.</p>

<h4 id="g2.3.3t"><a id="S2.3.3" name="S2.3.3" shape="rect">2.3.3</a> VoiceXML</h4>

<p>The Voice Extensible Markup Language [<a href="#ref-vxml" shape="rect">VXML</a>] enables Web-based development and content-delivery for interactive voice response applications (see <a href="#term-voicebrowser" shape="rect"><em>voice browser</em></a>). VoiceXML supports <a href="#term-synthesis" shape="rect">speech synthesis</a>, recording and playback of digitized audio, speech recognition, DTMF input, telephony call control, and form-driven mixed initiative dialogs. VoiceXML 2.0 extends SSML for the markup of text to be synthesized. For an example of the integration between VoiceXML and SSML see <a href="#AppF" shape="rect">Appendix F</a>.</p>

<h3 id="g33"><a id="S2.4" name="S2.4" shape="rect">2.4</a> Fetching SSML Documents</h3>

<p>The fetching and caching behavior of SSML documents is defined by the environment in which the <a href="#term-processor" shape="rect">synthesis processor</a> operates. In a VoiceXML interpreter context for example, the caching policy is determined by the VoiceXML interpreter.</p>

<h2><a id="S3" name="S3" shape="rect">3.</a> Elements and Attributes</h2>

<p>The following elements and attributes are defined in this specification.</p>

<ul class="toc">
<li class="tocline">3.1 <a href="#S3.1" shape="rect">Document Structure, Text Processing and Pronunciation</a> 
<ul class="toc">
<li class="tocline">3.1.1 <a href="#S3.1.1" shape="rect">"speak" Root Element</a>
<ul class="toc">
<li class="tocline">3.1.1.1 <a href="#S3.1.1.1" shape="rect">Trimming Attributes</a></li>
</ul>
</li>

<li class="tocline">3.1.2 <a href="#S3.1.2" shape="rect">Language: "xml:lang" Attribute</a></li>

<li class="tocline">3.1.3 <a href="#S3.1.3" shape="rect">Base URI: "xml:base" Attribute</a>
<ul class="toc">
<li class="tocline">3.1.3.1 <a href="#S3.1.3.1" shape="rect">Resolving Relative URIs</a></li>
</ul>
</li>

<li class="tocline">3.1.4 <a href="#S3.1.4" shape="rect">Identifier: "xml:id" Attribute</a></li>

<li class="tocline">3.1.5 <a href="#S3.1.5" shape="rect"> Lexicon Documents</a>
<ul class="toc">
<li class="tocline">3.1.5.1 <a href="#S3.1.5.1" shape="rect">"lexicon" Element</a></li>
<li class="tocline">3.1.5.2 <a href="#S3.1.5.2" shape="rect">"lookup" Element</a></li>
</ul>
</li>

<li class="tocline">3.1.6 <a href="#S3.1.6" shape="rect">"meta" Element</a></li>

<li class="tocline">3.1.7 <a href="#S3.1.7" shape="rect">"metadata" Element</a></li>

<li class="tocline">3.1.8 <a href="#S3.1.8" shape="rect">Text Structure</a>
<ul class="toc">
<li class="tocline">3.1.8.1 <a href="#S3.1.8.1" shape="rect">"p" and "s" Elements</a></li>
<li class="tocline">3.1.8.2 <a href="#S3.1.8.2" shape="rect">"token" and "w" Elements</a></li>
</ul>
</li>

<li class="tocline">3.1.9 <a href="#S3.1.9" shape="rect">"say-as" Element</a></li>

<li class="tocline">3.1.10 <a href="#S3.1.10" shape="rect">"phoneme" Element</a>
<ul class="toc">
<li class="tocline">3.1.10.1 <a href="#S3.1.10.1" shape="rect">Pronunciation Alphabet Registry</a></li>
</ul>
</li>

<li class="tocline">3.1.11 <a href="#S3.1.11" shape="rect">"sub" Element</a></li>
<li class="tocline">3.1.12 <a href="#S3.1.12" shape="rect">"lang" Element</a></li>
<li class="tocline">3.1.13 <a href="#S3.1.13" shape="rect">Language Speaking Failure: "onlangfailure" Attribute</a></li>
</ul>
</li>

<li class="tocline">3.2 <a href="#S3.2" shape="rect">Prosody and Style</a> 
<ul class="toc">
<li class="tocline">3.2.1 <a href="#S3.2.1" shape="rect">"voice" Element</a></li>

<li class="tocline">3.2.2 <a href="#S3.2.2" shape="rect">"emphasis" Element</a></li>

<li class="tocline">3.2.3 <a href="#S3.2.3" shape="rect">"break" Element</a></li>

<li class="tocline">3.2.4 <a href="#S3.2.4" shape="rect">"prosody" Element</a></li>
</ul>
</li>

<li class="tocline">3.3 <a href="#S3.3" shape="rect">Other Elements</a> 
<ul class="toc">
<li class="tocline">3.3.1 <a href="#S3.3.1" shape="rect">"audio" Element</a>
<ul class="toc">
<li class="tocline">3.3.1.1 <a href="#S3.3.1.1" shape="rect">Trimming Attributes</a></li>
<li class="tocline">3.3.1.2 <a href="#S3.3.1.2" shape="rect">"soundLevel" Attribute</a></li>
<li class="tocline">3.3.1.3 <a href="#S3.3.1.3" shape="rect">"speed" Attribute</a></li>
</ul>
</li>

<li class="tocline">3.3.2 <a href="#S3.3.2" shape="rect">"mark" Element</a></li>

<li class="tocline">3.3.3 <a href="#S3.3.3" shape="rect">"desc" Element</a></li>
</ul>
</li>
</ul>

<h2 id="g3.1"><a id="S3.1" name="S3.1" shape="rect">3.1</a> Document Structure, Text Processing and Pronunciation</h2>

<h3 id="g3.1.1"><a id="S3.1.1" name="S3.1.1" shape="rect">3.1.1</a> <a name="edef_speak" id="edef_speak" class="edef" shape="rect">speak</a> Root Element</h3>

<p>The Speech Synthesis Markup Language is an XML application. The root element is <a href="#edef_speak" class="eref" shape="rect">speak</a>.</p>

<p> <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> is a <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> attribute specifying the language of the root document.</p>

<p><a href="#adef_xmlbase" class="aref" shape="rect"><code class="att">xml:base</code></a> is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute specifying the Base <a href="#term-uri" shape="rect">URI</a> of the root document.</p>

<p> <a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> is an <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute specifying the desired behavior upon language speaking failure.</p>

<p>The <code class="att">version</code> attribute is a <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> attribute that indicates the version of the specification to be used for the document and <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> have the value "1.1".</p>

<p>The trimming attributes are specified in a subsection, below. </p>

<p>Before the <a href="#edef_speak" class="eref" shape="rect">speak</a> element is executed, the <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> select a default voice. Note that a language speaking failure (see <a href="#S3.1.13" shape="rect"> Section 3.1.13</a>) will occur as soon as the first text is encountered if the language of the text is one that the default voice cannot speak. This assumes that the voice has not been changed before encountering the text, of course.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  ... the body ...
&lt;/speak&gt;
</pre>
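
<p>The optional attributes can be combined with the required ones on the same root element. The following sketch is illustrative only: the base URI is a made-up example address, and the <code class="att">onlangfailure</code> value assumes the <code>ignoretext</code> option described in <a href="#S3.1.13" shape="rect">Section 3.1.13</a>.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"
       xml:base="http://www.example.com/prompts/"
       onlangfailure="ignoretext"&gt;
  ... the body ...
&lt;/speak&gt;
</pre>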

<p>The <a href="#edef_speak" class="eref" shape="rect">speak</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_meta" class="eref" shape="rect">meta</a>, <a href="#edef_metadata" class="eref" shape="rect">metadata</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>.</p>


<h3 id="g3.1.1.1"><a id="S3.1.1.1" name="S3.1.1.1" shape="rect">3.1.1.1</a> Trimming Attributes</h3>
<p>Trimming attributes define the span of the document to be
rendered. Both the start and the end of the span within the <a href="#edef_speak" class="eref" shape="rect">speak</a>  content can be specified using  marks. </p>
<p>The following  trimming attributes are defined for <a href="#edef_speak" class="eref" shape="rect">speak</a>: </p>
<table border="1">
  <tbody>
    <tr>
      <th>Name</th>
      <th>Required</th>
      <th>Type</th>
      <th>Default Value</th>
      <th>Description</th>
    </tr>
    <tr>
      <td><code class="att">startmark</code></td>
      <td>false</td>
      <td>type <code><a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#token" shape="rect">xsd:token</a></code> [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.3.2]</td>
      <td>none</td>
      <td>The mark used to determine when rendering starts.</td>
    </tr>
    <tr>
      <td><code class="att">endmark</code></td>
      <td>false</td>
      <td>type <code><a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#token" shape="rect">xsd:token</a></code> [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.3.2]</td>
      <td>none</td>
      <td>The mark used to determine when rendering ends. </td>
    </tr>
  </tbody>
</table>
<p>The <code class="att">startmark</code> and <code class="att">endmark</code> attributes specify a name that references a marker as assigned by the <code class="att">name</code> attribute of the <a href="#edef_mark" class="eref" shape="rect">mark</a> element. Only markers defined once in the document, i.e. that are unique, are permitted as the value of either <code class="att">startmark</code> or <code class="att">endmark</code>. The span of the document rendered is determined as follows: </p>
<ul>
  <li>If the <code class="att">startmark</code> is specified, then rendering starts at the <code class="att">startmark</code>. If <code class="att">startmark</code> is not specified,
    rendering begins at  the beginning of the
  document. </li>
  <li>If the <code class="att">endmark</code> is specified, then rendering ends at the <code class="att">endmark</code>. If the <code class="att">endmark</code> is not specified, rendering ends
  at the document end. </li>
  <li>If the <code class="att">startmark</code> is after the <code class="att">endmark</code>, then no audio is generated. </li>
</ul>

<p>It is an <a href="#term-error">error</a> if the value given for either <code class="att">startmark</code> or <code class="att">endmark</code> is not a valid mark in the document.</p>
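
<p>A minimal sketch of this error condition, using a hypothetical mark name: the value of <code class="att">startmark</code> below does not match the <code class="att">name</code> of any <a href="#edef_mark" class="eref" shape="rect">mark</a> in the document, so the document is in error.</p>

<pre class="example" xml:space="preserve">
&lt;speak startmark="intro" version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
   &lt;audio src="first.wav"/&gt;
   &lt;!-- The only mark is named "mark1"; no mark is named "intro" --&gt;
   &lt;mark name="mark1"/&gt;
   &lt;audio src="middle.wav"/&gt;
&lt;/speak&gt;
</pre>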

<h4 id="g3.1.1.1.e">Examples</h4>
<p>If no  trimming attributes are specified, then the complete document
is rendered: </p>
<pre class="example" xml:space="preserve">
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
   &lt;audio src="first.wav"/&gt;
   &lt;mark name="mark1"/&gt;
   &lt;audio src="middle.wav"/&gt;
   &lt;mark name="mark2"/&gt;
   &lt;audio src="last.wav"/&gt;
&lt;/speak&gt;
</pre>

<p>Here "first.wav", "middle.wav" and "last.wav" are all rendered, and
  "mark2" is the last mark rendered.</p>
<p>The <code class="att">startmark</code> can be used to specify that rendering begins from a
specific mark: </p>
<pre class="example" xml:space="preserve">
&lt;speak startmark="mark1" version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
   &lt;audio src="first.wav"/&gt;
   &lt;mark name="mark1"/&gt;
   &lt;audio src="middle.wav"/&gt;
   &lt;mark name="mark2"/&gt;
   &lt;audio src="last.wav"/&gt;
&lt;/speak&gt;
</pre>

<p>"middle.wav" and "last.wav" are rendered, but not "first.wav" since it
occurs before the <code class="att">startmark</code> "mark1".</p>

<p>The end of rendering can be specified using the <code class="att">endmark</code>: </p>
<pre class="example" xml:space="preserve">
&lt;speak endmark="mark2" version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
   &lt;audio src="first.wav"/&gt;
   &lt;mark name="mark1"/&gt;
   &lt;audio src="middle.wav"/&gt;
   &lt;mark name="mark2"/&gt;
   &lt;audio src="last.wav"/&gt;
&lt;/speak&gt;
</pre>

<p>Here "first.wav" and "middle.wav" are rendered completely, but none of "last.wav" is rendered.</p>
  
<p>Finally, these  trimming attributes can be used to control both the
start and end of rendering: </p>
<pre class="example" xml:space="preserve">
&lt;speak startmark="mark1" endmark="mark2"
     version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
   &lt;audio src="first.wav"/&gt;
   &lt;mark name="mark1"/&gt;
   &lt;audio src="middle.wav"/&gt;
   &lt;mark name="mark2"/&gt;
   &lt;audio src="last.wav"/&gt;
&lt;/speak&gt;
</pre>

<p>Here only "middle.wav" is played.</p>
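
<p>The remaining rule, in which the <code class="att">startmark</code> comes after the <code class="att">endmark</code>, can be sketched by swapping the two values used in the previous example. Under that rule no audio is generated for this document:</p>

<pre class="example" xml:space="preserve">
&lt;speak startmark="mark2" endmark="mark1"
       version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
   &lt;audio src="first.wav"/&gt;
   &lt;!-- endmark: rendering would have to end here --&gt;
   &lt;mark name="mark1"/&gt;
   &lt;audio src="middle.wav"/&gt;
   &lt;!-- startmark: rendering would only start here, after the endmark --&gt;
   &lt;mark name="mark2"/&gt;
   &lt;audio src="last.wav"/&gt;
&lt;/speak&gt;
</pre>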

<h3 id="g3.1.2"><a id="S3.1.2" name="S3.1.2" shape="rect">3.1.2</a> Language: <a class="adef" id="adef_xmllang" name="adef_xmllang" shape="rect"><code>xml:lang</code></a> Attribute</h3>

<p>The <code class="att">xml:lang</code> attribute, as defined by XML [<a href="#ref-xml10" shape="rect">XML 1.0</a> or <a href="#ref-xml11" shape="rect">XML 1.1</a>, as appropriate, §2.12],   <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be used in SSML to indicate the natural language of the written content of the element on which it occurs.  BCP47 [<a href="#ref-bcp47" shape="rect">BCP47</a>] can help in understanding how to use this attribute.</p>

<p>Language information is inherited down the document hierarchy, so it needs to be given only once if the whole document is in one language. Language information also nests: an inner attribute overrides an outer one.</p>

<p><a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> is a defined attribute for the  <a href="#edef_speak" class="eref" shape="rect">speak</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_desc" class="eref" shape="rect">desc</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, and <a href="#edef_word" class="eref" shape="rect">w</a> elements.</p>

<p><a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> is permitted on <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, and <a href="#edef_word" class="eref" shape="rect">w</a> only because it is common to change the language at those levels.</p>
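
<p>The inheritance and nesting behavior can be sketched as follows. The sentences are invented for illustration; the comments note which language applies to each element.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- No xml:lang here: en-US is inherited from the root element --&gt;
  &lt;p&gt;This paragraph is in American English.&lt;/p&gt;

  &lt;p xml:lang="fr"&gt;
    &lt;!-- fr is inherited from the enclosing paragraph --&gt;
    &lt;s&gt;Cette phrase hérite de la langue du paragraphe.&lt;/s&gt;
    &lt;!-- The inner attribute takes precedence: this sentence is de --&gt;
    &lt;s xml:lang="de"&gt;Dieser Satz überschreibt sie.&lt;/s&gt;
    &lt;!-- Back to fr, the language of the paragraph --&gt;
    &lt;s&gt;Cette phrase est de nouveau en français.&lt;/s&gt;
  &lt;/p&gt;
&lt;/speak&gt;
</pre>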



<p> The <a href="#term-processor" shape="rect">synthesis processor</a> <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> use the value of the <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> attribute to assist it in determining the best way of rendering the content of the element on which it occurs. When the <a href="#term-processor" shape="rect">synthesis processor</a> comes across text it does not know how to speak, it is the responsibility of the processor to decide what to do (see the <a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> attribute). One of the sources of information it can draw upon to make this decision is the value of the <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> attribute. </p>

<p>The <a href="#term-processor" shape="rect">synthesis processor</a> may also use the value of the <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> attribute to help it to determine the language of the content, which may of course affect how the voice will speak the content. For example, "<code>The French word for cat is &lt;lang xml:lang="fr"&gt;chat&lt;/lang&gt;, not chat.</code>" If the document author requires a new voice that is better adapted to the new language, then the <a href="#term-processor" shape="rect">synthesis processor</a> can be explicitly requested to select a new voice by using the <a href="#edef_voice" class="eref" shape="rect">voice</a> element. Further information about voice selection appears in <a href="#S3.2.1" shape="rect">Section 3.2.1</a>.</p>
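
<p>A sketch of such an explicit request, assuming the <code class="att">languages</code> attribute of the <a href="#edef_voice" class="eref" shape="rect">voice</a> element described in <a href="#S3.2.1" shape="rect">Section 3.2.1</a>. Whether a voice that speaks French is actually available depends on the processor.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  The French word for cat is
  &lt;!-- Ask for a voice that can speak French, then mark the word as French --&gt;
  &lt;voice languages="fr"&gt;&lt;lang xml:lang="fr"&gt;chat&lt;/lang&gt;&lt;/voice&gt;,
  not chat.
&lt;/speak&gt;
</pre>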

<p>The <a href="#text_normalization" shape="rect">text normalization</a> processing step may be affected by the enclosing language. This is true for both markup support by the <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element and non-markup behavior. In the following example the same text "2/1/2000" may be read as "February first two thousand" in the first sentence, following American English pronunciation rules, but as "the second of January two thousand" in the second one, which follows Italian preprocessing rules.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;s&gt;Today, 2/1/2000.&lt;/s&gt;
  &lt;!-- Today, February first two thousand --&gt;
  &lt;s xml:lang="it"&gt;Un mese fà, 2/1/2000.&lt;/s&gt;
  &lt;!-- Un mese fà, il due gennaio duemila --&gt;
  &lt;!-- One month ago, the second of January two thousand --&gt;
&lt;/speak&gt;
</pre>

<h3 id="s10"><a id="S3.1.3" name="S3.1.3" shape="rect">3.1.3</a> Base URI: <a href="#adef_xmlbase" class="aref" shape="rect"><code class="att">xml:base</code></a> Attribute</h3>

<p>Relative <a href="#term-uri" shape="rect">URIs</a> are resolved according to a <em>base URI</em>, which may come from a variety of sources. The base URI declaration allows authors to specify a document's base URI explicitly. See <a href="#S3.1.3.1" shape="rect">Section 3.1.3.1</a> for details on the resolution of relative URIs.</p>

<p>The <a href="#xmlbase" shape="rect">base URI declaration</a> is permitted but  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em>. The two elements affected by it are</p>

<blockquote>
<dl>
<dt><a href="#edef_audio" class="eref" shape="rect">audio</a></dt>

<dd>The  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> <code class="att">src</code> attribute can specify a relative URI.</dd>

<dt><a href="#edef_lexicon" class="eref" shape="rect">lexicon</a></dt>

<dd>The <code class="att">uri</code> attribute can specify a relative URI.</dd>
</dl>
</blockquote>

<h4 id="id-S4.9-abnf"><a name="xmlbase" id="xmlbase" shape="rect" />The <a name="adef_xmlbase" id="adef_xmlbase" class="adef" shape="rect">xml:base</a> attribute</h4>

<p>The base <a href="#term-uri" shape="rect">URI</a> declaration follows [<a href="#ref-xml-base" shape="rect">XML-BASE</a>] and is indicated by an <a href="#adef_xmlbase" class="aref" shape="rect"><code class="att">xml:base</code></a> attribute on the root <a href="#edef_speak" class="eref" shape="rect">speak</a> element.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:base="http://www.example.com/base-file-path"&gt;
  ... the body ...
&lt;/speak&gt;
</pre>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:base="http://www.example.com/another-base-file-path"&gt;
  ... the body ...
&lt;/speak&gt;
</pre>

<h4 id="s25"><a id="S3.1.3.1" name="S3.1.3.1" shape="rect">3.1.3.1</a> Resolving Relative URIs</h4>

<p>User agents  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> calculate the base <a href="#term-uri" shape="rect">URI</a> for resolving relative URIs according to [<a href="#ref-rfc3986" shape="rect">RFC3986</a>]. The following describes how RFC3986 applies to synthesis documents.</p>

<p>User agents  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> calculate the base URI according to the following precedences (highest priority to lowest):</p>

<ol>
<li>The base URI is set by the <a href="#adef_xmlbase" class="aref" shape="rect"><code class="att">xml:base</code></a> attribute on the <a href="#edef_speak" class="eref" shape="rect">speak</a> element (see <a href="#S3.1.3" shape="rect">Section 3.1.3</a>).</li>

<li>The base URI is given by metadata discovered during a protocol interaction, such as an HTTP header (see [<a href="#ref-rfc2616" shape="rect">RFC2616</a>]).</li>

<li>By default, the base URI is that of the current document. Not all synthesis documents have a base URI (e.g., a valid synthesis document may appear in an email and may not be designated by a URI). It is an <a href="#term-error" shape="rect">error</a> if such documents contain relative URIs.</li>
</ol>
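
<p>Informative: As a sketch of the first rule above, the following document sets <a href="#adef_xmlbase" class="aref" shape="rect"><code class="att">xml:base</code></a> on the <a href="#edef_speak" class="eref" shape="rect">speak</a> element, so the relative URIs it contains resolve against that value under [<a href="#ref-rfc3986" shape="rect">RFC3986</a>]. The host name and file paths are illustrative assumptions.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:base="http://www.example.com/ssml/"&gt;

  &lt;!-- Resolves to http://www.example.com/ssml/lexicons/names.pls --&gt;
  &lt;lexicon uri="lexicons/names.pls" xml:id="names"/&gt;

  &lt;!-- Resolves to http://www.example.com/ssml/audio/welcome.wav --&gt;
  &lt;audio src="audio/welcome.wav"&gt;Welcome.&lt;/audio&gt;
&lt;/speak&gt;
</pre>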

<h3 id="g314"><a id="S3.1.4" name="S3.1.4" shape="rect">3.1.4</a> Identifier: <a name="adef_xmlid" id="adef_xmlid" class="adef" shape="rect"><code class="att">xml:id</code></a> Attribute</h3>

<p>The <a href="#adef_xmlid" class="aref" shape="rect"><code class="att">xml:id</code></a> attribute [<a href="#ref-xml-id" shape="rect">XML-ID</a>] <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be used in SSML to give an element an identifier that is unique to the document, allowing the element to be referenced from other documents.</p>

<p><a href="#adef_xmlid" class="aref" shape="rect"><code class="att">xml:id</code></a> is a defined attribute for the <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, and <a href="#edef_word" class="eref" shape="rect">w</a> elements. </p>
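
<p>Informative: This is a minimal sketch of <a href="#adef_xmlid" class="aref" shape="rect"><code class="att">xml:id</code></a> on a paragraph and a sentence. The identifier values are hypothetical; each value appears only once in the document.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"&gt;
  &lt;p xml:id="intro"&gt;
    &lt;s xml:id="intro-s1"&gt;Welcome to the tour.&lt;/s&gt;
    &lt;s xml:id="intro-s2"&gt;We begin in the main hall.&lt;/s&gt;
  &lt;/p&gt;
&lt;/speak&gt;
</pre>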
<h3 id="s12"><a id="S3.1.5" name="S3.1.5" shape="rect">3.1.5</a>  Lexicon Documents: <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> and <a href="#edef_lookup" class="eref" shape="rect">lookup</a> Elements</h3>

<p>An SSML document  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> reference one or more  lexicon documents. A lexicon document is located by a <a href="#term-uri" shape="rect">URI</a> with an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> <a href="#term-media-type" shape="rect">media type</a> and is assigned a name that is unique in the SSML document. </p>

<h4 id="g3151"><a id="S3.1.5.1" name="S3.1.5.1" shape="rect">3.1.5.1</a> <a name="edef_lexicon" id="edef_lexicon" class="edef" shape="rect">lexicon</a> Element</h4>

<p>Any number of <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> elements  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> occur as immediate children of the <a href="#edef_speak" class="eref" shape="rect">speak</a> element.</p>

<p> The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> have a <code class="att">uri</code> attribute specifying a <a href="#term-uri" shape="rect">URI</a> that identifies the location of the  lexicon document.</p>

<p>The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> have an <a href="#adef_xmlid" class="aref" shape="rect"><code class="att">xml:id</code></a> attribute  that assigns a name to the lexicon document.  The name  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be unique to the current SSML document. The scope of this name is the current SSML document.</p>

<p>The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> have a <code class="att">type</code> attribute that specifies the <a href="#term-media-type" shape="rect">media type</a> of the  lexicon document.  The default value of the <code class="att">type</code> attribute is <i><code>application/pls+xml</code></i>, the media type associated with Pronunciation Lexicon Specification [<a href="#ref-pls" shape="rect">PLS</a>] documents as defined in [<a href="#ref-rfc4267" shape="rect">RFC4267</a>].</p>
<p>The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> have a <code class="att">fetchtimeout</code> attribute that specifies the timeout for fetches. The value is a <a href="#def_time_designation">Time Designation</a>. The default value is processor-specific.</p>
<p>The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> have a <code class="att">maxage</code> attribute that indicates that the document is willing to use content whose age is no greater than the specified time (cf. 'max-age' in HTTP 1.1 [<a href="#ref-rfc2616">RFC2616</a>]). The value is an <a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#nonNegativeInteger" shape="rect"><strong>xsd:nonNegativeInteger</strong></a> [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.3.20]. The document is not willing to use stale content, unless <code class="att">maxstale</code> is also provided.</p>
<p>The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> have a <code class="att">maxstale</code> attribute that indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [<a href="#ref-rfc2616">RFC2616</a>]). The value is an <a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#nonNegativeInteger" shape="rect"><strong>xsd:nonNegativeInteger</strong></a> [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.3.20]. If <code class="att">maxstale</code> is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified amount of time.</p>
<p>The <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element is an empty element.</p>
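
<p>Informative: This is a sketch of a <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element that sets the fetching attributes described above. The URI and all attribute values are illustrative assumptions; the <code class="att">maxage</code> and <code class="att">maxstale</code> values are assumed here to be counted in seconds, by analogy with the HTTP directives they are modeled on.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"&gt;

  &lt;!-- Wait up to 5 seconds for the fetch; accept a cached copy
       no older than 3600; accept no content past its expiration --&gt;
  &lt;lexicon uri="http://www.example.com/lexicon.pls"
           xml:id="pls"
           type="application/pls+xml"
           fetchtimeout="5s"
           maxage="3600"
           maxstale="0"/&gt;
  ...
&lt;/speak&gt;
</pre>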

<p>If an error occurs in fetching or parsing a lexicon document,  the <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> notify the hosting environment that such an error has occurred. The processor <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> notify the hosting environment immediately with an asynchronous event, or the processor <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> make the error notification through its logging system. The processor <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> include information about the error where possible; for example, if the lexicon could not be fetched due to an HTTP 404 error, that error code could be included with the notification. After notification, the processor <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> continue processing as if it had loaded an empty valid lexicon.</p>
<h4 id="lexicon_type">Details of the type attribute</h4>

<p><i>Note: the description and table that follow use an imaginary vendor-specific lexicon type of <code>x-vnd.example.lexicon</code>. This is intended to represent whatever format is returned/available, as appropriate.</i></p>

<p>A lexicon resource indicated by a <a href="#term-uri" shape="rect">URI</a> reference may be available in one or more <a href="#term-media-type" shape="rect">media types</a>. The SSML author can specify the preferred media type via the <code class="att">type</code> attribute. When the content represented by a URI is available in many data formats, a <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> use the preferred type to influence which of the multiple formats is used. For instance, on a server implementing HTTP content negotiation, the processor may use the type to order the preferences in the negotiation.</p>

<p>Upon delivery, the resource indicated by a URI reference may be considered in terms of two types. The <i>declared media type</i> is the alleged value for the resource and the <i>actual media type</i> is the true format of its content. The actual type should be the same as the declared type, but this is not always the case (e.g. a misconfigured HTTP server might return <code>text/plain</code> for a document following the vendor-specific <code>x-vnd.example.lexicon</code> format). A specific URI scheme may require that the resource owner always, sometimes, or never return a media type. Whenever a type is returned, it is treated as authoritative. The declared media type is determined by the value returned by the resource owner or, if none is returned, by the preferred media type given in the SSML document.</p>

<p>Three special cases may arise. The declared type may not be supported by the processor; this is an <a href="#term-error" shape="rect">error</a>. The declared type may be supported but the actual type may not match; this is also an <a href="#term-error" shape="rect">error</a>. Finally, no media type may be declared; the behavior depends on the specific URI scheme and the capabilities of the <a href="#term-processor" shape="rect">synthesis processor</a>. For instance, HTTP 1.1 allows document introspection (see [<a href="#ref-rfc2616" shape="rect">RFC2616</a> §7.2.1]), the data scheme falls back to a default media type, and local file access defines no guidelines. The following table provides some informative examples:</p>

<table width="100%" border="1" cellpadding="3" summary="This table presents some informative examples of possible media type interpretations when the source document is of type x-vnd.example.lexicon.">
<caption>Media type examples</caption>

<tr>
<td width="20%" rowspan="1" colspan="1" />
<th colspan="2" scope="col" rowspan="1">
<div align="center"><b>HTTP 1.1 request</b></div></th>
<th colspan="2" scope="col" rowspan="1">
<div align="center"><b>Local file access</b></div></th>
</tr>

<tr>
<th width="20%" scope="row" rowspan="1" colspan="1">Media type returned by the resource owner</th>
<td width="20%" rowspan="1" colspan="1">text/plain</td>
<td width="20%" rowspan="1" colspan="1">x-vnd.example.lexicon</td>
<td width="20%" rowspan="1" colspan="1">
&lt;none&gt;</td>
<td rowspan="1" colspan="1">
&lt;none&gt;</td>
</tr>

<tr>
<th width="20%" scope="row" rowspan="1" colspan="1">Preferred media type from the SSML document</th>
    <td colspan="2" rowspan="1">Not applicable; the returned type is authoritative.</td>
<td width="20%" rowspan="1" colspan="1">x-vnd.example.lexicon</td>
<td rowspan="1" colspan="1"> application/pls+xml </td>
</tr>

<tr>
<th width="20%" scope="row" rowspan="1" colspan="1">Declared media type</th>
<td width="20%" rowspan="1" colspan="1">text/plain</td>
<td width="20%" rowspan="1" colspan="1">x-vnd.example.lexicon</td>
<td width="20%" rowspan="1" colspan="1">x-vnd.example.lexicon</td>
<td rowspan="1" colspan="1">
&lt;none&gt;</td>
</tr>

<tr>
<th width="20%" scope="row" rowspan="1" colspan="1">Behavior for an actual media type of x-vnd.example.lexicon</th>
<td width="20%" rowspan="1" colspan="1">This  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be processed as text/plain. This will generate an <a href="#term-error" shape="rect">error</a> if text/plain is not supported or if the document does not follow the expected format.</td>
    <td colspan="2" rowspan="1">The declared and actual types match; success if x-vnd.example.lexicon 
      is supported by the synthesis processor; otherwise an <a href="#term-error" shape="rect">error</a>.</td>
<td rowspan="1" colspan="1">Scheme specific; the synthesis processor might introspect the document to determine the type.</td>
</tr>
</table>
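
<p>Informative: As a sketch of the first local file access column of the table, the following document states a preferred media type for a lexicon reached through a hypothetical <code>file</code> URI. Because local file access returns no media type, the preferred type given in the <code class="att">type</code> attribute becomes the declared media type.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"&gt;

  &lt;!-- No type is returned by the resource owner, so the declared
       media type is taken from the type attribute --&gt;
  &lt;lexicon uri="file:///lexicons/strange-words.lex"
           xml:id="sw"
           type="x-vnd.example.lexicon"/&gt;
  ...
&lt;/speak&gt;
</pre>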

<h4 id="g3152"><a id="S3.1.5.2" name="S3.1.5.2" shape="rect">3.1.5.2</a> <a name="edef_lookup" id="edef_lookup" class="edef" shape="rect">lookup</a> Element</h4>

<p>The <a href="#edef_lookup" class="eref" shape="rect">lookup</a> element  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> have a <code class="att">ref</code> attribute. The <code class="att">ref</code> attribute specifies a name that references a lexicon document as assigned by the <a href="#adef_xmlid" class="aref" shape="rect"><code class="att">xml:id</code></a> attribute of the  <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> element.</p>

<p>The referenced lexicon document may contain information (e.g., pronunciation) for tokens that can appear in a text to be rendered. For PLS lexicon documents, the information contained within the PLS document  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be used by the <a href="#term-processor" shape="rect">synthesis processor</a> when rendering tokens that appear within the context of a lookup element.  For non-PLS lexicon documents,  the information contained within the lexicon document <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> be used by the <a href="#term-processor" shape="rect">synthesis processor</a> when rendering tokens that appear within the  content of a lookup element, although the processor  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> choose not to use the  information if it is deemed incompatible with the content of the SSML  document. For example, a vendor-specific lexicon may be used only for particular values of the <code class="att">interpret-as</code> attribute of the <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element, or for a particular set of voices. Vendors  <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> document the expected behavior of the <a href="#term-processor" shape="rect">synthesis processor</a> when SSML content refers to a non-PLS lexicon.</p>
<p>A <a href="#edef_lookup" class="eref" shape="rect">lookup</a> element  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> contain other <a href="#edef_lookup" class="eref" shape="rect">lookup</a> elements. When a <a href="#edef_lookup" class="eref" shape="rect">lookup</a> element contains other <a href="#edef_lookup" class="eref" shape="rect">lookup</a> elements, the child <a href="#edef_lookup" class="eref" shape="rect">lookup</a> elements have higher precedence.   Precedence means that a token is first looked up in the lexicon with highest precedence. Only if the token is not found in that lexicon is it then looked up in the lexicon with the next lower precedence, and so on until the token is successfully found or until all lexicons have been used for lookup. It is assumed that the <a href="#term-processor" shape="rect">synthesis processor</a> already has one or more built-in system lexicons which will be treated as having a lower precedence than those specified using the <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> and <a href="#edef_lookup" class="eref" shape="rect">lookup</a> elements. Note that if a token is not within the scope of at least  one <a href="#edef_lookup" class="eref" shape="rect">lookup</a> element, then the token can only be looked up in the built-in system lexicons.</p>
<p>The <a href="#edef_lookup" class="eref" shape="rect">lookup</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>.</p>


<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;

  &lt;lexicon uri="http://www.example.com/lexicon.pls"
           xml:id="pls"/&gt;
  &lt;lexicon uri="http://www.example.com/strange-words.file"
           xml:id="sw"
           type="media-type"/&gt;
  &lt;lookup ref="pls"&gt;
    tokens here are looked up in lexicon.pls
    &lt;lookup ref="sw"&gt;
      tokens here are looked up first in strange-words.file and then, if not found, in lexicon.pls
    &lt;/lookup&gt;
    tokens here are looked up in lexicon.pls
  &lt;/lookup&gt;
  tokens here are not looked up in lexicon documents
  ...
&lt;/speak&gt;
</pre>
<h3 id="g316"><a id="S3.1.6" name="S3.1.6" shape="rect">3.1.6</a> <a class="edef" id="edef_meta" name="edef_meta" shape="rect">meta</a> Element</h3>

<p>The <a href="#edef_metadata" class="eref" shape="rect">metadata</a> and <a href="#edef_meta" class="eref" shape="rect">meta</a> elements are containers in which information about the document can be placed. The <a href="#edef_metadata" class="eref" shape="rect">metadata</a> element provides more general and powerful treatment of metadata information than <a href="#edef_meta" class="eref" shape="rect">meta</a> by using a metadata schema.</p>

<p>A <a href="#edef_meta" class="eref" shape="rect">meta</a> declaration associates a string to a declared meta property or declares "http-equiv" content. Either a <code class="att">name</code> or <code class="att">http-equiv</code> attribute is  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em>. It is an <a href="#term-error" shape="rect">error</a> to provide both <code class="att">name</code> and <code class="att">http-equiv</code> attributes. A <code class="att">content</code> attribute is  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em>.</p>

<p>The <code class="att">seeAlso</code> property is the only defined <a href="#edef_meta" class="eref" shape="rect">meta</a> property name. It is used to specify a resource that might provide additional metadata information about the content. This property is modeled on the <a href="http://www.w3.org/TR/2004/REC-rdf-schema-20040210/#ch_seealso" shape="rect"><code class="att">seeAlso</code></a> property of Resource Description Framework (RDF) Schema Specification 1.0 [<a href="#ref-rdf-schema" shape="rect">RDF-SCHEMA</a> §5.4.1].</p>

<p>The <code class="att">http-equiv</code> attribute has a special significance when documents are retrieved via HTTP. Although the preferred method of providing HTTP header information is by using HTTP header fields, the "http-equiv" content  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be used in situations where the SSML document author is unable to configure HTTP header fields associated with their document on the origin server, for example, cache control information. Note that HTTP servers and caches are not required to introspect the contents of <a href="#edef_meta" class="eref" shape="rect">meta</a> in SSML documents and thereby override the header values they would send otherwise.</p>

<p>Informative: This is an example of how <a href="#edef_meta" class="eref" shape="rect">meta</a> elements can be included in an SSML document to specify a resource that provides additional metadata information and also indicate that the document must not be cached.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
      xmlns="http://www.w3.org/2001/10/synthesis"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
      xml:lang="en-US"&gt;

  &lt;meta name="seeAlso" content="http://example.com/my-ssml-metadata.xml"/&gt;
  &lt;meta http-equiv="Cache-Control" content="no-cache"/&gt;
&lt;/speak&gt;
</pre>

<p>The <a href="#edef_meta" class="eref" shape="rect">meta</a> element is an empty element.</p>

<h3 id="g317"><a id="S3.1.7" name="S3.1.7" shape="rect">3.1.7</a> <a name="edef_metadata" id="edef_metadata" class="edef" shape="rect">metadata</a> Element</h3>

<p>The <a href="#edef_metadata" class="eref" shape="rect">metadata</a> element is a container in which information about the document can be placed using a metadata schema. Although any metadata schema can be used with <a href="#edef_metadata" class="eref" shape="rect">metadata</a>, it is  <em title="RECOMMENDED in RFC 2119 context" class="RFC2119">RECOMMENDED</em> that the XML syntax of the Resource Description Framework (RDF) [<a href="#ref-rdf-xml" shape="rect">RDF-XMLSYNTAX</a>] be used in conjunction with the general metadata properties defined in the Dublin Core Metadata Initiative [<a href="#ref-dc" shape="rect">DC</a>].</p>

<p>The Resource Description Framework [<a href="#ref-rdf" shape="rect">RDF</a>] is a declarative language and provides a standard way for using XML to represent metadata in the form of statements about properties and relationships of items on the Web. Content creators should refer to W3C metadata Recommendations [<a href="#ref-rdf-xml" shape="rect">RDF-XMLSYNTAX</a>] and [<a href="#ref-rdf-schema" shape="rect">RDF-SCHEMA</a>] when deciding which metadata RDF schema to use in their documents. Content creators should also refer to the Dublin Core Metadata Initiative [<a href="#ref-dc" shape="rect">DC</a>], which is a set of generally applicable core metadata properties (e.g., Title, Creator, Subject, Description, Rights, etc.).</p>

<p>Document properties declared with the <a href="#edef_metadata" class="eref" shape="rect">metadata</a> element can use any metadata schema.</p>

<p>Informative: This is an example of how <a href="#edef_metadata" class="eref" shape="rect">metadata</a> can be included in an SSML document using the Dublin Core version 1.0 RDF schema [<a href="#ref-dc" shape="rect">DC</a>] describing general document information such as title, description, date, and so on:</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
    
  &lt;metadata&gt;
   &lt;rdf:RDF
       xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
       xmlns:rdfs = "http://www.w3.org/2000/01/rdf-schema#"
       xmlns:dc = "http://purl.org/dc/elements/1.1/"&gt;

   &lt;!-- Metadata about the synthesis document --&gt;
   &lt;rdf:Description rdf:about="http://www.example.com/meta.ssml"
       dc:title="Hamlet-like Soliloquy"
       dc:description="Aldine's Soliloquy in the style of Hamlet"
       dc:publisher="W3C"
       dc:language="en-US"
       dc:date="2002-11-29"
       dc:rights="Copyright 2002 Aldine Turnbet"
       dc:format="application/ssml+xml" &gt;                
       &lt;dc:creator&gt;
          &lt;rdf:Seq ID="CreatorsAlphabeticalBySurname"&gt;
             &lt;rdf:li&gt;William Shakespeare&lt;/rdf:li&gt;
             &lt;rdf:li&gt;Aldine Turnbet&lt;/rdf:li&gt;
          &lt;/rdf:Seq&gt;
       &lt;/dc:creator&gt;
   &lt;/rdf:Description&gt;
  &lt;/rdf:RDF&gt;
 &lt;/metadata&gt;

&lt;/speak&gt;
</pre>

<p>The <a href="#edef_metadata" class="eref" shape="rect">metadata</a> element can have arbitrary content, although none of the content will be rendered by the <a href="#term-processor" shape="rect">synthesis processor</a>.</p>

<h3 id="g318"><a id="S3.1.8" name="S3.1.8" shape="rect">3.1.8</a> Text Structure</h3>

<h4 id="g3181"><a id="S3.1.8.1" name="S3.1.8.1" shape="rect">3.1.8.1</a> <a class="edef" id="edef_paragraph" name="edef_paragraph" shape="rect">p</a> and <a class="edef" id="edef_sentence" name="edef_sentence" shape="rect">s</a> Elements</h4>

<p>A <a href="#edef_paragraph" class="eref" shape="rect">p</a> element represents a paragraph. An <a href="#edef_sentence" class="eref" shape="rect">s</a> element represents a sentence.</p>

<p><a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a>, <a href="#adef_xmlid" class="aref" shape="rect"><code class="att">xml:id</code></a>, and <a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> are defined attributes on the <a href="#edef_paragraph" class="eref" shape="rect">p</a> and <a href="#edef_sentence" class="eref" shape="rect">s</a> elements.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;p&gt;
    &lt;s&gt;This is the first sentence of the paragraph.&lt;/s&gt;
    &lt;s&gt;Here's another sentence.&lt;/s&gt;
  &lt;/p&gt;
&lt;/speak&gt;
</pre>

<p>The use of <a href="#edef_paragraph" class="eref" shape="rect">p</a> and <a href="#edef_sentence" class="eref" shape="rect">s</a> elements is  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em>. Where text occurs without an enclosing <a href="#edef_paragraph" class="eref" shape="rect">p</a> or <a href="#edef_sentence" class="eref" shape="rect">s</a> element the <a href="#term-processor" shape="rect">synthesis processor</a> <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> attempt to determine the structure using language-specific knowledge of the format of plain text.</p>

<p>The <a href="#edef_paragraph" class="eref" shape="rect">p</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>.</p>

<p>The <a href="#edef_sentence" class="eref" shape="rect">s</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>.</p>

<h4 id="g3182"><a id="S3.1.8.2" name="S3.1.8.2" shape="rect">3.1.8.2</a> <a name="edef_token" id="edef_token" class="edef" shape="rect">token</a> and <a name="edef_word" id="edef_word" class="edef" shape="rect">w</a> Elements</h4>

<p>The <a href="#edef_token" class="eref" shape="rect">token</a> element allows the author to indicate its content is a  token and to eliminate token (word) segmentation ambiguities of the synthesis processor.</p>

<p>The <a href="#edef_token" class="eref" shape="rect">token</a> element is necessary in order to render languages</p>
<ul>
  <li>that do not use white space as a token boundary identifier, such as Chinese, Thai, and Japanese</li>
  <li>that use white space for syllable segmentation, such as Vietnamese</li>
  <li>that use white space for other purposes, such as Urdu</li>
</ul>

<p>Use of this element can result in  improved cues for prosodic control (e.g., pause) and may assist the synthesis processor in selection of the correct pronunciation for homographs. Other elements such as <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, and <a href="#edef_prosody" class="eref" shape="rect">prosody</a> are permitted within <a href="#edef_token" class="eref" shape="rect">token</a> to allow annotation at a sub-token level (e.g., syllable, mora, or whatever units are reasonable for the current language). <a href="#term-processor">Synthesis processors</a> are  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> to parse these annotations and  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> render them as they are able.</p>

<p>The text contents of the token element and its subelements are together considered to be one token for lexical lookup purposes as follows:</p>
<ol>
  <li>All markup within the token element is removed (leaving the contents of the markup). </li>
  <li>All remaining text is concatenated together in the order in which it appears in the document.</li>
  <li>Leading and trailing spaces are removed from this single block of text.</li>
  <li>Multiple contiguous white space characters are converted into a single space. </li>
  <li>The result is treated as a single token for lexical lookup purposes. </li>
</ol>
<p>Thus, "&lt;token&gt;&lt;emphasis&gt;hap&lt;/emphasis&gt;py&lt;/token&gt;" and "&lt;token&gt;&lt;emphasis&gt; hap &lt;/emphasis&gt; py&lt;/token&gt;" would refer to the tokens "happy" and "hap&nbsp;py", respectively. Note that this is different from how text and markup outside a <a href="#edef_token" class="eref" shape="rect">token</a> element  are treated (see  "Text normalization" in <a href="#S1.2">Section 1.2</a>).</p>
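
<p>Informative: The following sketch applies the steps above to two hypothetical tokens. The text content and the <a href="#edef_break" class="eref" shape="rect">break</a> strength are illustrative assumptions; the comments state the token that would be used for lexical lookup.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis"&gt;

  &lt;!-- Markup is removed and the remaining text is concatenated:
       looked up as "today" --&gt;
  &lt;token&gt;to&lt;break strength="x-weak"/&gt;day&lt;/token&gt;

  &lt;!-- Leading and trailing spaces are removed and inner runs of
       white space become a single space: looked up as "New York" --&gt;
  &lt;token&gt;  New    York  &lt;/token&gt;
&lt;/speak&gt;
</pre>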
<p>The  use of <a href="#edef_token" class="eref" shape="rect">token</a> elements is <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em>. Where text occurs without an enclosing <a href="#edef_token" class="eref" shape="rect">token</a> element  the <a href="#term-processor" shape="rect">synthesis  processor</a> <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> attempt to determine the  token segmentation using  language-specific knowledge of the format of plain text.</p>
<p><a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> is a defined attribute on the <a href="#edef_token" class="eref" shape="rect">token</a> element to identify the written language of the content.</p>

<p><a href="#adef_xmlid" class="aref" shape="rect"><code class="att">xml:id</code></a> is a defined attribute on the <a href="#edef_token" class="eref" shape="rect">token</a> element.</p>

<p> <a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> is an <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute specifying the desired behavior upon language speaking failure.</p>

<p> <code class="att">role</code> is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> defined attribute on the <a href="#edef_token" class="eref" shape="rect">token</a> element. The <code class="att">role</code> attribute takes as its value one or more white space separated <a href="http://www.w3.org/TR/REC-xml-names/#NT-QName" shape="rect">QNames</a> (as defined in Section 4 of Namespaces in XML (1.0 [<a href="#ref-xmlns10" shape="rect">XMLNS 1.0</a>] or 1.1 [<a href="#ref-xmlns11" shape="rect">XMLNS 1.1</a>], depending on the version of XML being used)). A <a href="http://www.w3.org/TR/REC-xml-names/#NT-QName" shape="rect">QName</a>
  in the attribute content is expanded into an <a href="http://www.w3.org/TR/REC-xml-names/#dt-expname" shape="rect">expanded-name</a>
  using the namespace declarations in scope for the containing <a href="#edef_token" class="eref" shape="rect">token</a> element.
  Thus, each QName provides a reference to a specific item in the
  designated namespace. In the second example below, the QName within the
  <code class="att">role</code> attribute expands to the "VV0" item in the
  "http://www.example.com/claws7tags" namespace.
  
  This mechanism allows for referencing defined taxonomies of word
  classes, with the expectation that they are documented at the
specified namespace URI.</p>

<p>The <code class="att">role</code> attribute is intended to be of use in synchronizing with other specifications, for example to describe  additional information to help the selection of the most appropriate  pronunciation for the contained text inside an external lexicon (see <a href="#S3.1.5">lexicon documents</a>). </p>

<p>The <a href="#edef_token" class="eref" shape="rect">token</a>  element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>.</p>

<p>The <a href="#edef_token" class="eref" shape="rect">token</a>  element can only be contained in the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_speak" class="eref" shape="rect">speak</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>.</p>
<p>The <a href="#edef_word" class="eref" shape="rect">w</a> element is an alias for the <a href="#edef_token" class="eref" shape="rect">token</a> element. </p>
<p>Here is an example showing the use of the <a href="#edef_token" class="eref" shape="rect">token</a> element. </p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="zh-CN"&gt;

  &lt;!-- The Nanjing Changjiang River Bridge --&gt;
  &lt;token&gt;南京市&lt;/token&gt;&lt;token&gt;长江大桥&lt;/token&gt;
  &lt;!-- The mayor of Nanjing city, Jiang Daqiao --&gt;
  南京市长&lt;w&gt;江大桥&lt;/w&gt;
  &lt;!-- Shanghai is a metropolis --&gt;
  上海是个&lt;w&gt;大都会&lt;/w&gt;
  &lt;!-- Most Shanghainese will say something like that --&gt;
  上海人&lt;w&gt;大都&lt;/w&gt;会那么说
&lt;/speak&gt;</pre> 

<p>The next example shows the use of the <code class="att">role</code> attribute. The first document below is a sample lexicon (PLS) for the Chinese word "处". The second references this lexicon and shows how the role attribute may be used to select the appropriate pronunciation of the Chinese word "处" in the dialog.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         xmlns:claws="http://www.example.com/claws7tags"
         alphabet="x-myorganization-pinyin"
         xml:lang="zh-CN"&gt;
  &lt;lexeme role="claws:VV0"&gt;
    &lt;!-- base form of lexical verb --&gt;
    &lt;grapheme&gt;处&lt;/grapheme&gt;
    &lt;phoneme&gt;chu3&lt;/phoneme&gt;
    &lt;!-- pinyin string is: "chǔ" in 处罚 处置 --&gt;
  &lt;/lexeme&gt;
  &lt;lexeme role="claws:NN"&gt;
    &lt;!-- common noun, neutral for number --&gt;
    &lt;grapheme&gt;处&lt;/grapheme&gt;
    &lt;phoneme&gt;chu4&lt;/phoneme&gt;
    &lt;!-- pinyin string is: "chù" in 处所 妙处 --&gt;
  &lt;/lexeme&gt;
&lt;/lexicon&gt;</pre>

<p>&nbsp;</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xmlns:claws="http://www.example.com/claws7tags"
       xml:lang="zh-CN"&gt;
  &lt;lexicon uri="http://www.example.com/lexicon.pls"
           type="application/pls+xml"
           xml:id="mylex"/&gt;
  &lt;lookup ref="mylex"&gt;
    他这个人很不好相&lt;w role="claws:VV0"&gt;处&lt;/w&gt;。
    此&lt;w role="claws:NN"&gt;处&lt;/w&gt;不准照相。
  &lt;/lookup&gt;
&lt;/speak&gt;  
</pre>
<h3><a id="S3.1.9" name="S3.1.9" shape="rect">3.1.9</a> <a id="edef_say-as" name="edef_say-as" class="edef" shape="rect">say-as</a> Element</h3>

<p>The <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text.</p>

<p>Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The Working Group expects to produce a separate document that will define standard values and associated normative behavior for these values. Examples given here are only for illustrating the purpose of the element and the attributes.</p>

<p>The <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element has three attributes: <code class="att">interpret-as</code>, <code class="att">format</code>, and <code class="att">detail</code>. The <code class="att">interpret-as</code> attribute is always  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em>; the other two attributes are  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em>. The legal values for the <code class="att">format</code> attribute depend on the value of the <code class="att">interpret-as</code> attribute.</p>

<p>The <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element can only contain text to be rendered.</p>

<h4 id="g972">The <code class="att">interpret-as</code> and <code class="att">format</code> attributes</h4>

<p>The <code class="att">interpret-as</code> attribute indicates the content type of the contained text construct. Specifying the content type helps the <a href="#term-processor" shape="rect">synthesis processor</a> to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> <code class="att">format</code> attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats.</p>

<p>When specified, the <code class="att">interpret-as</code> and <code class="att">format</code> values are to be interpreted by the <a href="#term-processor" shape="rect">synthesis processor</a> as hints provided by the markup document author to aid <a href="#text_normalization" shape="rect">text normalization</a> and pronunciation.</p>

<p>In all cases, the text enclosed by any <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element is intended to be a standard, orthographic form of the language currently in context. A <a href="#term-processor" shape="rect">synthesis processor</a> <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> be able to support the common, orthographic forms of the specified language for every content type that it supports.</p>

<p>When the value for the <code class="att">interpret-as</code> attribute is unknown or unsupported by a processor, it  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> render the contained text as if no <code class="att">interpret-as</code> value were specified.</p>

<p>When the value for the <code class="att">format</code> attribute is unknown or unsupported by a processor, it  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> render the contained text as if no <code class="att">format</code> value were specified, and  <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> render it using the <code class="att">interpret-as</code> value that is specified.</p>

<p>When the content of the <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element contains additional text next to the content that is in the indicated <code class="att">format</code> and <code class="att">interpret-as</code> type, then this additional text  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be rendered. The processor  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> make the rendering of the additional text dependent on the <code class="att">interpret-as</code> type of the element in which it appears.<br clear="none" />
When the content of the <a href="#edef_say-as" class="eref" shape="rect">say-as</a> element contains no content in the indicated <code class="att">interpret-as</code> type or <code class="att">format</code>, the processor  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> render the content either as if the <code class="att">format</code> attribute were not present, or as if the <code class="att">interpret-as</code> attribute were not present, or as if neither the <code class="att">format</code> nor <code class="att">interpret-as</code> attributes were present. The processor  <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> also notify the environment of the mismatch.</p>

<p>Indicating the content type or format does not necessarily affect the way the information is pronounced. A <a href="#term-processor" shape="rect">synthesis processor</a> <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> pronounce the contained text in a manner in which such content is normally produced for the language.</p>
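
<p>The following informative sketch illustrates how the two attributes might be combined. The values "date" and "mdy" are assumptions chosen for illustration only; this specification does not define them, and a processor that does not support them renders the contained text as if the corresponding attribute were not specified.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- interpret-as names the content type; format disambiguates the
       written form (here: month, day, year). Both values are
       illustrative and are not defined by this specification. --&gt;
  Your appointment is on
  &lt;say-as interpret-as="date" format="mdy"&gt;03/04/2025&lt;/say-as&gt;.
&lt;/speak&gt;
</pre>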

<h4 id="g1000">The <code class="att">detail</code> attribute</h4>

<p>The <code class="att">detail</code> attribute is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute that indicates the level of detail to be read aloud or rendered. Every value of the <code class="att">detail</code> attribute  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> render all of the informational content in the contained text; however, specific values for the <code class="att">detail</code> attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes. For example, a <a href="#term-processor" shape="rect">synthesis processor</a> will usually render punctuation through appropriate changes in prosody. Setting a higher level of detail may be used to speak punctuation explicitly, e.g. for reading out coded part numbers or pieces of software code.</p>

<p>The <code class="att">detail</code> attribute can be used for all <code class="att">interpret-as</code> types.</p>

<p>If the <code class="att">detail</code> attribute is not specified, the level of detail that is produced by the <a href="#term-processor" shape="rect">synthesis processor</a> depends on the text content and the language.</p>

<p>When the value for the <code class="att">detail</code> attribute is unknown or unsupported by a processor, it  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> render the contained text as if no value were specified for the <code class="att">detail</code> attribute.</p>
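
<p>As an informative sketch, a coded part number might be marked up as shown below so that its punctuation is spoken explicitly. The values "characters" and "punctuation" are assumptions used only for illustration; this specification does not define them, and a processor that does not support the <code class="att">detail</code> value renders the text as if no value were specified.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- Without detail, the hyphen and slash would typically be
       rendered only through prosody. The attribute values here are
       illustrative and are not defined by this specification. --&gt;
  The part number is
  &lt;say-as interpret-as="characters" detail="punctuation"&gt;AB-1234/X&lt;/say-as&gt;.
&lt;/speak&gt;
</pre>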

<h3 id="g9"><a id="S3.1.10" name="S3.1.10" shape="rect">3.1.10</a> <a name="edef_phoneme" id="edef_phoneme" class="edef" shape="rect">phoneme</a> Element</h3>

<p>The <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> element provides a phonemic/phonetic pronunciation for the contained text. The <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> element  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be empty. However, it is  <em title="RECOMMENDED in RFC 2119 context" class="RFC2119">RECOMMENDED</em> that the element contain human-readable text that can be used for non-spoken rendering of the document. For example, the content may be displayed visually for users with hearing impairments.</p>

<p>The <code class="att">ph</code> attribute is a  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> attribute that specifies the phoneme/phone string.</p>

<p>This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo <a href="#text_normalization" shape="rect">text normalization</a> and is not treated as a token for lookup in the lexicon (see <a href="#S3.1.5" shape="rect">Section 3.1.5</a>), while values in <a href="#edef_say-as" class="eref" shape="rect">say-as</a> and <a href="#edef_sub" class="eref" shape="rect">sub</a> may undergo both. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. On the other hand, phonetic strings consist of phones, speech units that characterize the manner (puff of air, click, vocalized, etc.) and place (front, middle, back, etc.) of articulation within the human vocal tract and are thus independent of language; phones represent realized distinctions in human speech production.</p>

<p>The <code class="att">alphabet</code> attribute is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute that specifies the phonemic/phonetic pronunciation alphabet. A pronunciation  alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "<strong>ipa</strong>" (see the next paragraph), values defined in the  <a href="#S3.1.10.1">Pronunciation Alphabet Registry</a> and vendor-defined strings of the form "<strong>x-organization</strong>" or "<strong>x-organization-alphabet</strong>". For example, the Japan Electronics and Information Technology Industries Association [<a href="#ref-jeita" shape="rect">JEITA</a>] might wish to encourage the use of an   alphabet such as "x-JEITA" or "x-JEITA-IT-4002" for their phoneme alphabet [<a href="#ref-jeidaalphabet" shape="rect">JEIDAALPHABET</a>].</p>
<p><a href="#term-processor" shape="rect">Synthesis processors</a> <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> support a value for <code class="att">alphabet</code> of "<strong>ipa</strong>", corresponding to Unicode representations of the phonetic characters developed by the International Phonetic Association [<a href="#ref-ipa" shape="rect">IPA</a>]. In addition to an exhaustive set of vowel and consonant symbols, this character set supports a syllable delimiter, numerous diacritics, stress symbols, lexical tone symbols, intonational markers and more. For this alphabet, legal <code class="att">ph</code> values are strings of the values specified in Appendix 2 of [<a href="#ref-ipahndbk" shape="rect">IPAHNDBK</a>]; note that an IPA transcription may contain white space characters to assist readability, which have no implications for the pronunciation. Informative tables of the IPA-to-Unicode mappings can be found at [<a href="#ref-ipaunicode1" shape="rect">IPAUNICODE1</a>] and [<a href="#ref-ipaunicode2" shape="rect">IPAUNICODE2</a>]. Note that not all of the IPA characters are available in Unicode. For processors supporting this alphabet,</p>

<ul>
<li>The processor  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> syntactically accept all legal <code class="att">ph</code> values.</li>

<li>The processor  <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> produce output when given Unicode IPA codes that can reasonably be considered to belong to the current language.</li>

<li>The production of output when given other codes is entirely at processor discretion.</li>
</ul>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;phoneme alphabet="ipa" ph="t&amp;#x259;mei&amp;#x325;&amp;#x27E;ou&amp;#x325;"&gt; tomato &lt;/phoneme&gt;
  &lt;!-- This is an example of IPA using character entities --&gt;
  &lt;!-- Because many platform/browser/text editor combinations do not
       correctly cut and paste Unicode text, this example uses the entity
       escape versions of the IPA characters.  Normally, one would directly
       use the UTF-8 representation of these symbols: "təmei̥ɾou̥". --&gt;
&lt;/speak&gt;
</pre>

<p>It is an <a href="#term-error" shape="rect">error</a> if a value for <code class="att">alphabet</code> is specified that is not known or cannot be applied by a <a href="#term-processor" shape="rect">synthesis processor</a>. The default behavior when the <code class="att">alphabet</code> attribute is left unspecified is processor-specific.</p>
<p>The <code class="att">type</code> attribute is an optional attribute that indicates additional information about how the pronunciation information is to be interpreted. The only allowed values for this attribute  are "<strong>default</strong>", which has no implications, and "<strong>ruby</strong>", which indicates that the pronunciation information is from ruby text [<a href="#ref-ruby">RUBY</a>]. The default value of this attribute is "<strong>default</strong>". </p>
<p>The <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> element itself can only contain text (no elements).</p>
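
<p>The following informative sketch shows two further forms of the element: an empty <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> element, and one that uses a vendor-defined alphabet. The alphabet name "x-example-sampa" and its <code class="att">ph</code> notation are hypothetical and are used only for illustration; a processor that does not know the alphabet treats its use as an error.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-GB"&gt;
  &lt;!-- Empty element: permitted, but it leaves no human-readable text
       for non-spoken rendering of the document. --&gt;
  &lt;phoneme alphabet="ipa" ph="t&amp;#x259;m&amp;#x251;&amp;#x2D0;t&amp;#x259;&amp;#x28A;"/&gt;
  &lt;!-- Vendor-defined alphabet of the form "x-organization-alphabet".
       The name and the ph notation are hypothetical. The type
       attribute is shown with its default value. --&gt;
  &lt;phoneme alphabet="x-example-sampa" type="default" ph="t@mA:t@U"&gt;tomato&lt;/phoneme&gt;
&lt;/speak&gt;
</pre>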
<h4 id="g10"><a name="S3.1.10.1" id="S3.1.10.1" shape="rect">3.1.10.1</a> Pronunciation   Alphabet Registry</h4>

<p>Links to  the Pronunciation Alphabet Registry can be found on the SSML namespace page at <a href="http://www.w3.org/2001/10/synthesis">http://www.w3.org/2001/10/synthesis</a>. </p>

<h3 id="g11"><a id="S3.1.11" name="S3.1.11" shape="rect">3.1.11</a> <a name="edef_sub" id="edef_sub" class="edef" shape="rect">sub</a> Element</h3>

<p>The <a href="#edef_sub" class="eref" shape="rect">sub</a> element is employed to indicate that the text in the <code class="att">alias</code> attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> <code class="att">alias</code> attribute specifies the string to be spoken instead of the enclosed string. The processor  <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> apply <a href="#text_normalization" shape="rect">text normalization</a> to the <code class="att">alias</code> value.</p>

<p>The <a href="#edef_sub" class="eref" shape="rect">sub</a> element can only contain text (no elements).</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;sub alias="World Wide Web Consortium"&gt;W3C&lt;/sub&gt;
  &lt;!-- World Wide Web Consortium --&gt;
&lt;/speak&gt;
</pre>

<h3 id="g3.1.1.12"><a id="S3.1.12" name="S3.1.12" shape="rect">3.1.12</a> <a name="edef_lang" id="edef_lang" class="edef" shape="rect">lang</a> Element</h3>

<p>The <a href="#edef_lang" class="eref" shape="rect">lang</a> element is used to specify the natural language of the content.</p>

<p> <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> is a <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> attribute specifying the language of the content of the element.</p>

<p> <a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> is an <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute specifying the desired behavior upon language speaking failure.</p>

<p>This element  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be used when  there is a change in the natural language. There is no text structure associated with the language change indicated by the <a href="#edef_lang" class="eref" shape="rect">lang</a> element. It  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be used to specify the language of the content at a level other than a paragraph, sentence or word level. When language change is to be associated with text structure, it is  <em title="RECOMMENDED in RFC 2119 context" class="RFC2119">RECOMMENDED</em> to use the <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> attribute on the respective <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, or <a href="#edef_word" class="eref" shape="rect">w</a> element.</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  The French word for cat is &lt;w xml:lang="fr"&gt;chat&lt;/w&gt;.
  He prefers to eat pasta that is &lt;lang xml:lang="it"&gt;al dente&lt;/lang&gt;.
&lt;/speak&gt;
</pre>
<p>The <a href="#edef_lang" class="eref" shape="rect">lang</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>. </p>

<h3><a id="S3.1.13" name="S3.1.13" shape="rect">3.1.13</a> Language Speaking Failure: <a class="adef" id="adef_onlangfailure" name="adef_onlangfailure" shape="rect"><code>onlangfailure</code></a> Attribute</h3>

<p>The <code class="att">onlangfailure</code> attribute is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute that contains one value from the following enumerated list describing the desired behavior of the <a href="#term-processor" shape="rect">synthesis processor</a> upon language speaking failure.  A conforming <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> report a language speaking failure in addition to taking the action(s) below.</p>

<ul>
  <li><strong>changevoice</strong> - if a voice exists that can speak the language, the <a href="#term-processor" shape="rect">synthesis processor</a> will switch to that voice and speak the content. Otherwise, the processor chooses another behavior (either ignoretext or ignorelang). </li>
  <li><strong>ignoretext</strong> - the <a href="#term-processor" shape="rect">synthesis processor</a> will not attempt to render the text that is in the failed language. </li>
  <li><strong>ignorelang</strong> - the <a href="#term-processor" shape="rect">synthesis processor</a> will ignore the change in language and speak as if the content were in the previous language.</li>
  <li><strong>processorchoice</strong> - the <a href="#term-processor" shape="rect">synthesis processor</a> chooses the behavior (either changevoice, ignoretext, or ignorelang). </li>
</ul>

<p>A language speaking failure occurs whenever the <a href="#term-processor" shape="rect">synthesis processor</a> decides that the currently-selected voice (see <a href="#S3.2.1">Section 3.2.1</a>) cannot speak the declared language of the text. This can occur when the <a href="#term-processor" shape="rect">synthesis processor</a> encounters a new  <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> value or characters or character sequences that the voice does not know how to process.</p>

<p>The value of this attribute is inherited down the document hierarchy, i.e. it needs to be given only once if the desired behavior for the whole document is the same, and settings of this value nest, i.e. inner attributes override outer attributes. The top-level default value for this attribute is "processorchoice". Other languages which embed fragments of SSML (without a <a href="#edef_speak" class="eref" shape="rect">speak</a> element)  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> declare the top-level default value for this attribute. </p>

<p><a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> is permitted on all elements which can contain <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a>, so it is a defined attribute for the <a href="#edef_speak" class="eref" shape="rect">speak</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_desc" class="eref" shape="rect">desc</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, and <a href="#edef_word" class="eref" shape="rect">w</a> elements.</p>
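
<p>The inheritance and nesting rules above can be illustrated with the following informative sketch. The text content is an assumption chosen for illustration. The first <a href="#edef_lang" class="eref" shape="rect">lang</a> element sets its own value, while the second inherits the value given on the <a href="#edef_speak" class="eref" shape="rect">speak</a> element.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"
       onlangfailure="ignoretext"&gt;
  &lt;!-- Inner value takes precedence: on failure, the processor looks
       for a voice that can speak Japanese. --&gt;
  The greeting is
  &lt;lang xml:lang="ja" onlangfailure="changevoice"&gt;こんにちは&lt;/lang&gt;.
  &lt;!-- No value given here, so "ignoretext" is inherited from speak:
       on failure, the Japanese text is not rendered. --&gt;
  The farewell is &lt;lang xml:lang="ja"&gt;さようなら&lt;/lang&gt;.
&lt;/speak&gt;
</pre>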
  <h2 id="g12"><a id="S3.2" name="S3.2" shape="rect">3.2</a> Prosody and Style</h2>

<h3 id="g13"><a id="S3.2.1" name="S3.2.1" shape="rect">3.2.1</a> <a name="edef_voice" id="edef_voice" class="edef" shape="rect">voice</a> Element</h3>

<p>The <a href="#edef_voice" class="eref" shape="rect">voice</a> element is a production element that requests a change in speaking voice. There are two kinds of attributes for the voice element: those that indicate desired features of a voice and those that control behavior. The voice feature attributes are:</p>
<ul>
<li>
<p><code class="att">gender</code>:  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute indicating the preferred gender of the voice to speak the contained text. Enumerated values are: "<b>male</b>", "<b>female</b>", "<b>neutral</b>", or the empty string "".</p>
</li>

<li>
<p><code class="att">age</code>:  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute indicating the preferred age in years (since birth) of the voice to speak the contained text. Acceptable values are of type <a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#nonNegativeInteger" shape="rect"><strong>xsd:nonNegativeInteger</strong></a> [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.3.20] or the empty string "".</p>
</li>

<li>
<p><code class="att">variant</code>:  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute indicating a preferred variant of the other voice characteristics to speak the contained text (e.g. the second male child voice). Valid values of <code class="att">variant</code> are of type <a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#positiveInteger" shape="rect"><strong>xsd:positiveInteger</strong></a> [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.3.25] or the empty string "".</p>
</li>

<li>
<p><code class="att">name</code>:  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute indicating a processor-specific voice name to speak the contained text. The value  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be a space-separated list of names ordered from top preference down or the empty string "". As a result a name  <em title="MUST NOT in RFC 2119 context" class="RFC2119">MUST NOT</em>  contain any white space.</p>
</li>
<li>
  <p><code class="att">languages</code>:  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute indicating the list of languages the voice  is desired to speak. The value <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be either the empty string "" or a space-separated list of languages, with  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> accent indication per language. Each language/accent pair is of the form "<em>language</em>" or "<em>language</em>:<em>accent</em>", where both <em>language</em> and <em>accent</em> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be an Extended Language Range [<a href="#ref-bcp47" shape="rect">BCP47, Matching of Language Tags</a> §2.2], except that the values "und" and "zxx" are disallowed. A voice satisfies the <code class="att">languages</code> feature if, for each  language/accent pair in the list, </p>
  <ol>
    <li>the voice is documented (see <a href="#voice_descriptions">Voice descriptions</a>) as reading/speaking  a language that  matches the Extended Language Range given by <em>language</em> according to the Extended Filtering matching algorithm [<a href="#ref-bcp47" shape="rect">BCP47, Matching of Language Tags</a> §3.3.2], and</li>
    <li>if an <em>accent</em> is given, the voice is documented (see <a href="#voice_descriptions">Voice descriptions</a>) as reading/speaking  the language above with an accent that  matches the Extended Language Range given by <em>accent</em> according to the Extended Filtering matching algorithm [<a href="#ref-bcp47" shape="rect">BCP47, Matching of Language Tags</a> §3.3.2], except that the script and extension subtags of the <em>accent</em> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be ignored by the <a href="#term-processor" shape="rect">synthesis processor</a>. It is recommended that authors and voice providers do not use the script or extension subtags for accents because they are not relevant for speaking.</li>
    </ol>
  <p>For example, a <code class="att">languages</code> value of "en:pt fr:ja" can legally be matched by any voice that can both read English (speaking it with a Portuguese accent) and read French (speaking it with a Japanese accent). Thus, a voice that only supports "en-US" with a "pt-BR" accent and "fr-CA" with a "ja" accent would match.  As another example, if we have &lt;voice languages="fr:pt"&gt; and there is no voice that supports French with a Portuguese accent, then a voice selection failure will occur.  Note that if no accent indication is given for a language, then any voice that speaks the language is acceptable, regardless of accent.  Also, note that author control over language support during voice selection is independent of any value of <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> in the text. </p>
</li>
</ul>

<p>For the feature attributes above, an empty string value indicates that any voice will satisfy the feature. The top-level default value for all feature attributes is "", the empty string. </p>
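
<p>As a minimal sketch of the <code class="att">languages</code> syntax described above (the spoken text is invented for illustration, and whether any voice matches depends on the voices a given processor documents), the first request below names two language/accent pairs and the second omits the accent:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: English with a Portuguese accent and French with a Japanese accent --&gt;
  &lt;voice languages="en:pt fr:ja"&gt;
    Good evening.
    &lt;lang xml:lang="fr"&gt;Bonsoir.&lt;/lang&gt;
  &lt;/voice&gt;
  &lt;!-- illustrative: no accent given, so any voice documented as speaking French satisfies the feature --&gt;
  &lt;voice languages="fr"&gt;
    &lt;lang xml:lang="fr"&gt;Bonne nuit.&lt;/lang&gt;
  &lt;/voice&gt;
&lt;/speak&gt;
</pre>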

<p>The behavior control attributes of <a href="#edef_voice" class="eref" shape="rect">voice</a> are:</p>
<ul>
<li><code class="att">required</code>:  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute that specifies a set of features by their respective attribute names. This set of features is used by the voice selection algorithm described below.  Valid values of <code class="att">required</code> are a space-separated list composed of values from the list of feature names: "<strong>name</strong>", "<strong>languages</strong>", "<strong>gender</strong>", "<strong>age</strong>", "<strong>variant</strong>" or the empty string "". The default value for this attribute is  "languages". </li>
<li><code class="att">ordering</code>: <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute that specifies the priority ordering of features. Valid values of <code class="att">ordering</code> are a space-separated list composed of values from the list of feature names: "<strong>name</strong>", "<strong>languages</strong>", "<strong>gender</strong>", "<strong>age</strong>", "<strong>variant</strong>" or the empty string "", where features named earlier in the list have higher priority. The default value for this attribute is "languages". Features not listed in the <code class="att">ordering</code> list have equal priority to each other but lower than that of the last feature in the list. Note that if the <code class="att">ordering</code> attribute is set to the empty string then all features have the same priority.</li>
<li><code class="att">onvoicefailure</code>:  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute containing one value from the following enumerated list describing the desired behavior of the <a href="#term-processor" shape="rect">synthesis processor</a> upon voice selection failure.   The default value for this attribute is "priorityselect".
  <ul>
    <li><strong>priorityselect</strong> - the <a href="#term-processor" shape="rect">synthesis processor</a> uses the values of all voice feature attributes to select a voice by <a href="#feature-priority">feature priority</a>, where the starting <em>candidate set</em> is the set of all available voices. </li>
    <li><strong>keepexisting</strong> - the voice does not change. </li>
    <li><strong>processorchoice</strong> - the <a href="#term-processor" shape="rect">synthesis processor</a> chooses the behavior (either priorityselect or keepexisting). </li>
  </ul>
</li>
</ul>

<p>The following voice selection algorithm <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be used:</p>
<ol>
  <li>All available voices are identified for which the values of all voice feature   attributes listed in the <code class="att">required</code> attribute value are   matched. When the value of the <code class="att">required</code> attribute is the   empty string "", any and all voices are considered successful matches. If one or   more voices are identified, the selection is considered successful; otherwise   there is voice selection failure.</li>
  <li>If a successful selection identifies only one voice, the <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> use that voice.</li>
  <li>If a successful selection identifies more than one voice, the remaining   features (those not listed in the <code class="att">required</code> attribute   value) are used to choose a voice by <a href="#feature-priority" shape="rect">feature priority</a>, where the starting <em>candidate set</em> is the set of all voices identified.</li>
  <li>If there is voice selection failure, a conforming <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> report the voice selection   failure in addition to taking the action(s) expressed by the value of the <code class="att">onvoicefailure</code> attribute.</li>
  <li>To choose a voice by <a name="feature-priority" id="feature-priority">feature priority</a>, each feature is taken in turn starting   with the highest priority feature, as controlled by the <code class="att">ordering</code> attribute.
    <ul>
      <li>If at least one voice matches the value of the current voice feature attribute, then all voices not matching that value are removed from the <em>candidate set</em>. If a single voice remains in the <em>candidate set</em>, the <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> use it. If more than one voice remains in the <em>candidate set</em>, then the next priority feature is examined for the <em>candidate set</em>.</li>
      <li>If no voices match the value of the current voice feature attribute then the next priority feature is examined for the <em>candidate set</em>.</li>
    </ul>
  </li>
  <li>After examining all feature attributes on the <code class="att">ordering</code> list, if multiple voices remain in the <em>candidate set</em>, the <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> use any one of   them.</li>
</ol>
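<p>The steps above can be sketched with a single request. This is a hypothetical example: the attribute values, the spoken text and the assumed set of installed voices are not taken from this specification. Only <code class="att">languages</code> is required, and <code class="att">ordering</code> ranks gender above age among the remaining features:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: languages must match; gender is then preferred over age --&gt;
  &lt;voice languages="en-US" gender="female" age="30"
         required="languages" ordering="gender age"
         onvoicefailure="keepexisting"&gt;
    Your order has shipped.
  &lt;/voice&gt;
&lt;/speak&gt;
</pre>
<p>Under these assumptions, if no available voice is documented as speaking "en-US", the selection fails: the processor reports the failure and, because of "keepexisting", the voice does not change. If several "en-US" voices are available, gender is examined first. When at least one of them is female, the others are removed from the <em>candidate set</em>; when none is female, the <em>candidate set</em> is left unchanged and age is examined next.</p>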
<p>Although each attribute individually is optional, it is an <a href="#term-error">error</a> if no attributes are specified when the <a href="#edef_voice" class="eref" shape="rect">voice</a> element is used.</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;   
  &lt;voice gender="female" languages="en-US" required="languages gender variant"&gt;Mary had a little lamb,&lt;/voice&gt;
  &lt;!-- now request a different female voice (variant 2) --&gt;
  &lt;voice gender="female" variant="2"&gt;
  Its fleece was white as snow.
  &lt;/voice&gt;
  &lt;!-- processor-specific voice selection --&gt;
  &lt;voice name="Mike" required="name"&gt;I want to be like Mike.&lt;/voice&gt;
&lt;/speak&gt;
</pre>

<h4 id="voice_descriptions">Voice descriptions</h4>
<p>For every voice made available to a synthesis processor, the vendor of the voice must document the following:</p>
<ul>
  <li>a list of language tags [<a href="#ref-bcp47" shape="rect">BCP47, Tags for Identifying Languages</a>] representing the languages the voice can read.</li>
  <li>for each language, a language tag [<a href="#ref-bcp47" shape="rect">BCP47, Tags for Identifying Languages</a>] representing the accent the voice uses when reading the language.</li>
</ul>
<p>Although indication of language (using  <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a>) and selection of voice (using  <a href="#edef_voice" class="eref" shape="rect">voice</a>) are independent, there is no requirement that a  synthesis processor support every possible combination of values of the two. However, a synthesis processor  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> document expected rendering behavior for every possible combination. See the <a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> attribute for information on what happens when the processor encounters text content that the voice cannot speak. </p>
<p><a href="#edef_voice" class="eref" shape="rect">voice</a> attributes are inherited down the tree, including within elements that change the language. The defaults described for each attribute only apply at the top (document) level and are overridden by explicit author use of the <a href="#edef_voice" class="eref" shape="rect">voice</a> element. In addition, changes in voice are scoped and apply only to the content of the element in which the change occurred. When processing reaches the end of a <a href="#edef_voice" class="eref" shape="rect">voice</a> element's content, i.e. the closing &lt;/voice&gt; tag, the voice in effect before the beginning tag is restored.</p>

<p>Similarly, if a voice is changed by the processor as a result of a language speaking failure, the prior voice is restored when that voice is again able to speak the content. Note that there is always an active voice, since the <a href="#term-processor" shape="rect">synthesis processor</a> is required to select a default voice before beginning execution of the document (see <a href="#S3.1.1">section 3.1.1</a>).</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;voice gender="female" required="languages gender age" languages="en-US ja"&gt; 
    Any female voice here.
    &lt;voice age="6"&gt; 
      A female child voice here.
      &lt;lang xml:lang="ja"&gt; 
        &lt;!--  Same female child voice rendering Japanese text. --&gt;
      &lt;/lang&gt;
    &lt;/voice&gt;
  &lt;/voice&gt;
&lt;/speak&gt;
</pre>

<p>Relative changes in prosodic parameters  <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> be carried across voice changes. However, different voices have different natural defaults for pitch, speaking rate, etc. because they represent different personalities, so absolute values of the prosodic parameters may vary across changes in the voice.</p>

<p>The quality of the output audio or voice may suffer if a change in voice is requested within a sentence.</p>

<p>The <a href="#edef_voice" class="eref" shape="rect">voice</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>. </p>

<h3 id="g15"><a id="S3.2.2" name="S3.2.2" shape="rect">3.2.2</a> <a name="edef_emphasis" id="edef_emphasis" class="edef" shape="rect">emphasis</a> Element</h3>

<p>The <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a> element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The <a href="#term-processor" shape="rect">synthesis processor</a> determines how to render emphasis since the nature of emphasis differs between languages, dialects or even voices. The attributes are:</p>

<ul>
<li>
<p><code class="att">level</code>: the  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> <code class="att">level</code> attribute indicates the strength of emphasis to be applied. Defined values are <strong>"strong"</strong>, <strong>"moderate"</strong>, <strong>"none"</strong> and <strong>"reduced"</strong>. The default <code class="att">level</code> is <strong>"moderate"</strong>. The meaning of <strong>"strong"</strong> and <strong>"moderate"</strong> emphasis is interpreted according to the language being spoken (languages indicate emphasis using a possible combination of pitch change, timing changes, loudness and other acoustic differences). The <strong>"reduced"</strong> <code class="att">level</code> is effectively the opposite of emphasizing a word. For example, when the phrase "going to" is reduced it may be spoken as "gonna". The <strong>"none"</strong> <code class="att">level</code> is used to prevent the <a href="#term-processor" shape="rect">synthesis processor</a> from emphasizing words that it might typically emphasize. The values <strong>"none"</strong>, <strong>"moderate"</strong>, and <strong>"strong"</strong> are monotonically non-decreasing in strength.</p>
</li>
</ul>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  That is a &lt;emphasis&gt; big &lt;/emphasis&gt; car!
  That is a &lt;emphasis level="strong"&gt; huge &lt;/emphasis&gt;
  bank account!
&lt;/speak&gt;
</pre>
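
<p>As a further sketch, the <strong>"reduced"</strong> and <strong>"none"</strong> levels might be used as follows. The sentences are invented for illustration, and the actual rendering is left to the synthesis processor:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: ask for a reduced rendering of "going to" --&gt;
  I am &lt;emphasis level="reduced"&gt;going to&lt;/emphasis&gt; call you back.
  &lt;!-- illustrative: suppress emphasis the processor might otherwise add --&gt;
  It was &lt;emphasis level="none"&gt;not&lt;/emphasis&gt; my idea.
&lt;/speak&gt;
</pre>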

<p>The <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>.</p>

<h3 id="g16"><a id="S3.2.3" name="S3.2.3" shape="rect">3.2.3</a> <a name="edef_break" id="edef_break" class="edef" shape="rect">break</a> Element</h3>

<p>The <a href="#edef_break" class="eref" shape="rect">break</a> element is an empty element that controls the pausing or other prosodic boundaries between tokens. The use of the <a href="#edef_break" class="eref" shape="rect">break</a> element between any pair of tokens is  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em>. If the element is not present between tokens, the <a href="#term-processor" shape="rect">synthesis processor</a> is expected to automatically determine a break based on the linguistic context. In practice, the <a href="#edef_break" class="eref" shape="rect">break</a> element is most often used to override the typical automatic behavior of a synthesis processor. The attributes on this element are:</p>

<ul>
<li>
<p><code class="att">strength</code>: the <code class="att">strength</code> attribute is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute having one of the following values: <strong>"none"</strong>, <strong>"x-weak"</strong>, <strong>"weak"</strong>, <strong>"medium"</strong> (default value), <strong>"strong"</strong>, or <strong>"x-strong"</strong>. This attribute is used to indicate the strength of the prosodic break in the speech output. The value <strong>"none"</strong> indicates that no prosodic break boundary should be output, which can be used to prevent a prosodic break which the processor would otherwise produce. The other values indicate monotonically non-decreasing (conceptually increasing) break strength between tokens. The stronger boundaries are typically accompanied by pauses. "<strong>x-weak</strong>" and "<strong>x-strong</strong>" are mnemonics for "extra weak" and "extra strong", respectively.</p>
</li>

<li>
<p><code class="att">time</code>: the <code class="att">time</code> attribute is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attribute indicating the duration of a pause to be inserted in the output in seconds or milliseconds. It follows the time value format from the Cascading Style Sheets Level 2 Recommendation [<a href="#ref-css2" shape="rect">CSS2</a>], e.g. "250ms", "3s".</p>
</li>
</ul>

<p>The <code class="att">strength</code> attribute is used to indicate the prosodic strength of the break. For example, the breaks between paragraphs are typically much stronger than the breaks between words within a sentence. The <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> insert a pause as part of its implementation of the prosodic break. A pause of a specific length can also be inserted by using the <code class="att">time</code> attribute.</p>

<p>If a <a class="eref" href="#edef_break" shape="rect">break</a> element is used with neither <code class="att">strength</code> nor <code class="att">time</code> attributes, a break will be produced by the processor with a prosodic strength greater than that which the processor would otherwise have used if no <a class="eref" href="#edef_break" shape="rect">break</a> element was supplied.</p>

<p>If both <code class="att">strength</code> and <code class="att">time</code> attributes are supplied, the processor will insert a break with a duration as specified by the <code class="att">time</code> attribute, with other prosodic changes in the output based on the value of the <code class="att">strength</code> attribute.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  Take a deep breath &lt;break/&gt;
  then continue. 
  Press 1 or wait for the tone. &lt;break time="3s"/&gt;
  I didn't hear you! &lt;break strength="weak"/&gt; Please repeat.
&lt;/speak&gt;
</pre>
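
<p>The remaining cases described above, a break carrying both attributes and a break of strength <strong>"none"</strong>, can be sketched as follows. The sentences and values are invented for illustration:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: time sets the pause duration, strength shapes the other prosodic changes --&gt;
  Your flight leaves at nine &lt;break strength="strong" time="500ms"/&gt; from gate twelve.
  &lt;!-- illustrative: request that no prosodic break be produced between these tokens --&gt;
  Welcome to New &lt;break strength="none"/&gt; York.
&lt;/speak&gt;
</pre>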

<h3 id="g18"><a id="S3.2.4" name="S3.2.4" shape="rect">3.2.4</a> <a name="edef_prosody" id="edef_prosody" class="edef" shape="rect">prosody</a> Element</h3>

<p>The <a href="#edef_prosody" class="eref" shape="rect">prosody</a> element permits control of the pitch, speaking rate and volume of the speech output. The attributes, all  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em>, are:</p>

<ul>
<li>
<p><code class="att">pitch</code>: the baseline pitch for the contained text. Although the exact meaning of "baseline pitch" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the approximate pitch of the output. Legal values are: a <a href="#number_values" shape="rect">number</a> followed by "Hz", a <a href="#relative_values" shape="rect">relative change</a> or <strong>"x-low"</strong>, <strong>"low"</strong>, <strong>"medium"</strong>, <strong>"high"</strong>, <strong>"x-high"</strong>, or <strong>"default"</strong>. Labels <strong>"x-low"</strong> through <strong>"x-high"</strong> represent a sequence of monotonically non-decreasing pitch levels.</p>
</li>

<li>
<p><code class="att">contour</code>: sets the actual pitch contour for the contained text. The format is specified in <a href="#pitch_contour" shape="rect">Pitch contour</a> below.</p>
</li>

<li>
<p><code class="att">range</code>: the pitch range (variability) for the contained text. Although the exact meaning of "pitch range" will vary across synthesis processors, increasing/decreasing this value will typically increase/decrease the dynamic range of the output pitch. Legal values are: a <a href="#number_values" shape="rect">number</a> followed by "Hz", a <a href="#relative_values" shape="rect">relative change</a> or <strong>"x-low"</strong>, <strong>"low"</strong>, <strong>"medium"</strong>, <strong>"high"</strong>, <strong>"x-high"</strong>, or <strong>"default"</strong>. Labels <strong>"x-low"</strong> through <strong>"x-high"</strong> represent a sequence of monotonically non-decreasing pitch ranges.</p>
</li>

<li>
<p><code class="att">rate</code>: a change in the speaking rate for the contained text. Legal values are: a  <a href="#perc_values" shape="rect">non-negative percentage</a> or <strong>"x-slow"</strong>, <strong>"slow"</strong>, <strong>"medium"</strong>, <strong>"fast"</strong>, <strong>"x-fast"</strong>, or <strong>"default"</strong>. Labels <strong>"x-slow"</strong> through <strong>"x-fast"</strong> represent a sequence of monotonically non-decreasing speaking rates. When the value is a <a href="#perc_values" shape="rect">non-negative percentage</a> it acts as a multiplier of the default rate. For example, a value of 100% means no change in speaking rate, a value of 200% means a speaking rate twice the default rate, and a value of 50% means a speaking rate of half the default rate. The default rate for a voice depends on the language and dialect and on the personality of the voice. The default rate for a voice <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> be such that it is experienced as a normal speaking rate for the voice when reading aloud text. Since voices are processor-specific, the default rate will be as well.</p>
</li>

<li>
<p><code class="att">duration</code>: a value in seconds or milliseconds for the desired time to take to read the contained text. Follows the time value format from the Cascading Style Sheets Level 2 Recommendation [<a href="#ref-css2" shape="rect">CSS2</a>], e.g. "250ms", "3s".</p>
</li>

<li>
<p><code class="att">volume</code>: the volume for the contained text. Legal values are: a <a href="#number_values" shape="rect">number</a> preceded by "+" or "-" and immediately followed by "dB";  or <strong>"silent"</strong>, <strong>"x-soft"</strong>, <strong>"soft"</strong>, <strong>"medium"</strong>, <strong>"loud"</strong>, <strong>"x-loud"</strong>, or <strong>"default"</strong>.  The default is +0.0dB. Specifying a value of  <strong>"silent"</strong> amounts to specifying minus infinity decibels (dB). Labels <strong>"silent"</strong> through <strong>"x-loud"</strong> represent a sequence of monotonically non-decreasing volume levels. When the value is a signed <a href="#number_values" shape="rect">number</a> (dB), it specifies the ratio of the squares of the new signal amplitude (a<sub>1</sub>) and the current amplitude (a<sub>0</sub>), and is defined in terms of dB:</p>

<blockquote><code class="att">volume</code><sub>(dB)</sub> = 20 log<sub>10</sub> (a<sub>1</sub> / a<sub>0</sub>)</blockquote>
<p>Note that all numerical volume levels (in dB) are relative to the current level and that they are always signed (including zero). Also note that once the current volume level is set to <strong>"silent"</strong> all child relative changes also result in silence. A child <a class="eref" href="#edef_prosody" shape="rect">prosody</a> element <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> use the label <strong>"default"</strong> to reset the current volume level.</p>
<p>Thus, for a value of: </p>
  <ul>
    <li><strong>"silent"</strong>, the contained text is read silently; </li>
    <li>'-6.0dB', the contained text is read at approximately half the amplitude of the current signal amplitude;</li>
    <li>'-0dB', the contained text is read with no relative change in volume;</li> 
    <li>'+6.0dB', the contained text is read at approximately twice the amplitude of the current signal amplitude. </li>
  </ul>
</li>
</ul>
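
<p>As a worked check of the numerical <code class="att">volume</code> values listed above, the formula can be inverted to give the amplitude ratio for a given dB value:</p>
<blockquote>a<sub>1</sub> / a<sub>0</sub> = 10<sup>(<code class="att">volume</code><sub>(dB)</sub> / 20)</sup></blockquote>
<p>For -6.0dB this gives 10<sup>-0.3</sup>, approximately 0.501, and for +6.0dB it gives 10<sup>0.3</sup>, approximately 1.995. This is why these two values are described as approximately half and approximately twice the current signal amplitude.</p>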

<p>Note that the behavior of the <code class="att">volume</code> attribute for label values may differ from that of numerical values. Use of a numerical value causes direct modification of the waveform, while use of a label value may result in prosodic modifications that more accurately reflect how a human being would increase or decrease the perceived loudness of their speech, e.g., adjusting frequency and power differently for different sound units.</p>

<p>Although each attribute individually is optional, it is an <a href="#term-error" shape="rect">error</a> if no attributes are specified when the <a href="#edef_prosody" class="eref" shape="rect">prosody</a> element is used. The "<strong>x-<em>foo</em></strong>" attribute value names are intended to be mnemonics for "extra <em>foo</em>". All units ("Hz", "st") are case-sensitive. Note also that customary pitch levels and standard pitch ranges may vary significantly by language, as may the meanings of the labelled values for pitch targets and ranges.</p>

<p>Here is an example of how to use the <code class="att">volume</code> attribute: </p>

<pre class="example" xml:space="preserve">
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;

   &lt;s&gt;I am speaking this at the default volume for this voice.&lt;/s&gt;

   &lt;s&gt;&lt;prosody volume="+6dB"&gt;
       I am speaking this at approximately twice the original signal amplitude.
   &lt;/prosody&gt;&lt;/s&gt;

   &lt;s&gt;&lt;prosody volume="-6dB"&gt;
       I am speaking this at approximately half the original signal amplitude.
   &lt;/prosody&gt;&lt;/s&gt;
&lt;/speak&gt;</pre>
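
<p>The other attributes can be sketched in the same way. The values and sentences below are invented for illustration, and the exact rendering of labels such as <strong>"high"</strong> will vary across synthesis processors:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: label values for baseline pitch and pitch range --&gt;
  &lt;prosody pitch="high" range="x-low"&gt;A high and rather flat delivery.&lt;/prosody&gt;
  &lt;!-- illustrative: absolute pitch in Hz; 50% is half the default speaking rate --&gt;
  &lt;prosody pitch="150Hz" rate="50%"&gt;Spoken at half the default rate.&lt;/prosody&gt;
  &lt;!-- illustrative: desired reading time, in CSS2 time format --&gt;
  &lt;prosody duration="3s"&gt;This sentence is requested to take three seconds.&lt;/prosody&gt;
&lt;/speak&gt;
</pre>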

<h4 id="g499"><a name="number_values" id="number_values" shape="rect">Number</a></h4>

<p>A number is a simple positive floating point value without exponentials. Legal formats are "n", "n.", ".n" and "n.n" where "n" is a sequence of one or more digits.</p>

<h4 id="g4999"><a name="perc_values" id="perc_values" shape="rect">Non-negative percentage</a></h4>

<p>A non-negative percentage is  an unsigned <a href="#number_values">number</a> immediately followed by "%".</p>

<h4 id="g19"><a name="relative_values" id="relative_values" shape="rect">Relative values</a></h4>

<p>Relative changes for the attributes above can be specified</p>

<ul>
<li>as a percentage (a <a href="#number_values" shape="rect">number</a>  preceded by "+" or "-" and followed by "%"), e.g. "+15.2%", "-8.0%", or</li>

<li>as a relative number: 
<ul><li>For the <code class="att">pitch</code> and <code class="att">range</code> attributes, relative changes can be given in semitones (a <a href="#number_values" shape="rect">number</a> preceded by "+" or "-" and followed by "st") or in Hertz (a <a href="#number_values" shape="rect">number</a> preceded by "+" or "-" and followed by "Hz"): "+0.5st", "+5st", "-2st", "+10Hz", "-5.5Hz". A semitone is half of a tone (a half step) on the standard diatonic scale.</li>
</ul>
</li>
</ul>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  The price of XYZ is &lt;prosody rate="90%"&gt;$45&lt;/prosody&gt;
&lt;/speak&gt;
</pre>
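
<p>As a sketch of the relative forms listed above for <code class="att">pitch</code> and <code class="att">range</code> (the values and sentences are chosen for illustration only):</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: relative change in semitones --&gt;
  &lt;prosody pitch="+2st"&gt;Two semitones higher.&lt;/prosody&gt;
  &lt;!-- illustrative: relative change in Hertz, with a relative percentage for the range --&gt;
  &lt;prosody pitch="-10Hz" range="+15.2%"&gt;Slightly lower, with a wider range.&lt;/prosody&gt;
&lt;/speak&gt;
</pre>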

<h4 id="g20"><a name="pitch_contour" id="pitch_contour" shape="rect">Pitch contour</a></h4>

<p>The pitch contour is defined as a set of white space-separated targets at specified time positions in the speech output. The algorithm for interpolating between the targets is processor-specific. In each pair of the form <code>(time position,target)</code>, the first value is a percentage of the period of the contained text (a <a href="#number_values" shape="rect">number</a> followed by "%") and the second value is the value of the <code class="att">pitch</code> attribute (a <a href="#number_values" shape="rect">number</a> followed by "Hz", a <a href="#relative_values" shape="rect">relative change</a>, or a label value). Time position values outside 0% to 100% are ignored. If a pitch value is not defined for 0% or 100% then the nearest pitch target is copied. All relative values for the pitch are relative to the pitch value just before the contained text.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)"&gt;
    good morning
  &lt;/prosody&gt;
&lt;/speak&gt;
</pre>
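
<p>A further sketch, with hypothetical values, of the boundary rules described above. The target at 120% lies outside 0% to 100% and is ignored. No target is given for 0% or 100%, so the nearest targets (those at 20% and 60% respectively) are copied:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: a relative, an absolute and an out-of-range target --&gt;
  &lt;prosody contour="(20%,+10%) (60%,150Hz) (120%,high)"&gt;
    good afternoon
  &lt;/prosody&gt;
&lt;/speak&gt;
</pre>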

<p>The <code class="att">duration</code> attribute takes precedence over the <code class="att">rate</code> attribute. The <code class="att">contour</code> attribute takes precedence over the <code class="att">pitch</code> and <code class="att">range</code> attributes.</p>
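
<p>These precedence rules can be sketched as follows. The values are illustrative, and an author would not normally supply both attributes of a pair:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- illustrative: duration takes precedence over rate --&gt;
  &lt;prosody rate="x-fast" duration="4s"&gt;The requested duration governs here.&lt;/prosody&gt;
  &lt;!-- illustrative: contour takes precedence over pitch --&gt;
  &lt;prosody pitch="low" contour="(0%,+10Hz) (100%,-10Hz)"&gt;The contour governs here.&lt;/prosody&gt;
&lt;/speak&gt;
</pre>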

<p>The default value of all prosodic attributes is no change. For example, omitting the <code class="att">rate</code> attribute means that the rate is the same within the element as outside.</p>

<p>The <a href="#edef_prosody" class="eref" shape="rect">prosody</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>.</p>

<h4 id="g22">Limitations</h4>

<p>All prosodic attribute values are indicative. If a <a href="#term-processor" shape="rect">synthesis processor</a> is unable to accurately render a document as specified (e.g., trying to set the pitch to 1 MHz or the speaking rate to 1,000,000 words per minute), it  <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> make a best effort to continue processing by imposing a limit or a substitute for the specified, unsupported value and  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> inform the host environment when such limits are exceeded.</p>

<p>In some cases, <a href="#term-processor" shape="rect">synthesis processors</a> <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> elect to ignore a given prosodic markup if the processor determines, for example, that the indicated value is redundant, improper or in error. In particular, concatenative-type synthetic speech systems that employ large acoustic units  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> reject prosody-modifying markup elements if they are redundant with the prosody of a given acoustic unit(s) or would otherwise result in degraded speech quality.</p>

<h2 id="g23"><a id="S3.3" name="S3.3" shape="rect">3.3</a> Other Elements</h2>

<h3 id="g24"><a id="S3.3.1" name="S3.3.1" shape="rect">3.3.1</a> <a name="edef_audio" id="edef_audio" class="edef" shape="rect">audio</a> Element</h3>

<p>The <a href="#edef_audio" class="eref" shape="rect">audio</a> element supports the insertion of recorded audio files (see <a href="#AppA" shape="rect">Appendix A</a> for  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> formats) and the insertion of other audio formats in conjunction with synthesized speech output. The <a href="#edef_audio" class="eref" shape="rect">audio</a> element  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be empty. If the <a href="#edef_audio" class="eref" shape="rect">audio</a> element is not empty then the contents should be the marked-up text to be spoken if the audio document is not available. The alternate content  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> include text, speech markup, <a href="#edef_desc" class="eref" shape="rect">desc</a> elements, or other <a href="#edef_audio" class="eref" shape="rect">audio</a> elements. The alternate content  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> also be used when rendering the document to non-audible output and for accessibility (see the <a href="#edef_desc" class="eref" shape="rect">desc</a> element). In addition to the <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> attributes described in subsections below, <a href="#edef_audio" class="eref" shape="rect">audio</a> has the following attributes:</p>
<table border="1">
  <tbody>
    <tr>
      <th>Name</th>
      <th>Required</th>
      <th>Type</th>
      <th>Default Value</th>
      <th>Description</th>
    </tr>
    <tr>
      <td><code class="att">src</code></td>
      <td>false</td>
      <td><a href="#term-uri" shape="rect">URI</a></td>
      <td><em>None</em></td>
      <td>The URI of a document with an appropriate media type. If absent, the <a href="#edef_audio" class="eref" shape="rect">audio</a> element behaves as if <code class="att">src</code> were present with a legal URI but the document could not be fetched.</td>
    </tr>
    <tr>
      <td><code class="att">fetchtimeout</code></td>
      <td>false</td>
      <td><a href="#def_time_designation">Time Designation</a></td>
      <td><em>Processor-specific</em></td>
      <td>The timeout for fetches. </td>
    </tr>
    <tr>
      <td><code class="att">fetchhint</code></td>
      <td>false</td>
      <td>The value "prefetch" or the value "safe"</td>
      <td>prefetch</td>
      <td>Tells the <a href="#term-processor" shape="rect">synthesis processor</a> whether it can attempt to optimize rendering by pre-fetching audio. The value is either "safe", meaning that audio is fetched only when it is needed and never before, or "prefetch", which permits but does not require the processor to pre-fetch the audio.</td>
    </tr>
    <tr>
      <td><code class="att">maxage</code></td>
      <td>false</td>
      <td><a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#nonNegativeInteger" shape="rect"><strong>xsd:nonNegativeInteger</strong></a></td>
      <td><em>None</em></td>
      <td>Indicates that the document is willing to use content whose age is no greater than the specified time  (cf. 'max-age' in HTTP 1.1 [<a href="#ref-rfc2616">RFC2616</a>]). The document is not willing to use stale content, unless <code class="att">maxstale</code> is also provided.</td>
    </tr>
    <tr>
      <td><code class="att">maxstale</code></td>
      <td>false</td>
      <td><a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#nonNegativeInteger" shape="rect"><strong>xsd:nonNegativeInteger</strong></a></td>
      <td><em>None</em></td>
      <td>Indicates that the document is willing to use content that has exceeded its expiration time (cf. 'max-stale' in HTTP 1.1 [<a href="#ref-rfc2616">RFC2616</a>]). If <code class="att">maxstale</code> is assigned a value, then the document is willing to accept content that has exceeded its expiration time by no more than the specified amount of time.</td>
    </tr>
  </tbody>
</table>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
                 
  &lt;!-- Empty element --&gt;
  Please say your name after the tone.  &lt;audio src="beep.wav"/&gt;

  &lt;!-- Container element with alternative text --&gt;
  &lt;audio src="prompt.au"&gt;What city do you want to fly from?&lt;/audio&gt;
  &lt;audio src="welcome.wav"&gt;  
    &lt;emphasis&gt;Welcome&lt;/emphasis&gt; to the Voice Portal. 
  &lt;/audio&gt;

&lt;/speak&gt;
</pre>
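<p>As a further illustration, the fetching and caching attributes from the table above might be combined as in the following sketch. The file name, the attribute values and the alternate text are hypothetical, and the sketch assumes that <code class="att">maxage</code> and <code class="att">maxstale</code> are counted in seconds, by analogy with the HTTP 1.1 directives they are compared to.</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;

  &lt;!-- Hypothetical values: fetch only when needed, wait at most 5 seconds,
       accept content no older than 60 (assumed seconds), and tolerate content
       that exceeded its expiration time by at most 10 (assumed seconds) --&gt;
  &lt;audio src="headlines.wav" fetchhint="safe" fetchtimeout="5s"
         maxage="60" maxstale="10"&gt;
    The headlines are not available right now.
  &lt;/audio&gt;

&lt;/speak&gt;
</pre>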

<p>An <a href="#edef_audio" class="eref" shape="rect">audio</a> element is successfully rendered by:</p>

<ol>
  <li>Playing the referenced audio source successfully.</li>
  <li>If the referenced audio source fails to play, rendering the alternative content.</li>
  <li>Additionally, if the processor can detect that text-only output is required, it <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> render the alternative content.</li>
</ol>
<p>
  When attempting to play the audio source, a number of different issues
  may arise, such as mismatched media types or bad header information
  about the media. In general the <a href="#term-processor" shape="rect">synthesis processor</a> makes a best effort to play the referenced media and, when unsuccessful, the processor <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> play
  the alternative content. Note that the processor <em title="MUST NOT in RFC 2119 context" class="RFC2119">MUST NOT</em> render both all
  or part of the referenced media and all or part of the
  alternative content. If any of the referenced media is processed and
  rendered, the playback is considered successful within
  the context of this section. If an error occurs that causes the
  alternative content to be rendered instead of the referenced media, the
  processor <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> notify the hosting environment that such an error has
  occurred. The processor <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> notify the hosting environment
  immediately with an asynchronous event, or the processor <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> notify
  the hosting environment only at the end of playback when it signals to
  the hosting environment that it has completed rendering the request, or the processor <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> make the error notification through its logging
  system. The processor <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> include information about the error
  where possible; for example, if the media resource could not be fetched
  due to an HTTP 404 error, that error code could be included with the
  notification.</p>
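<p>Because the alternative content can itself contain <a href="#edef_audio" class="eref" shape="rect">audio</a> elements, fallbacks can be chained. In the following sketch, whose file names are hypothetical, a processor that cannot play "greeting-hq.wav" renders the alternative content, which is a second <a href="#edef_audio" class="eref" shape="rect">audio</a> element; only if "greeting.wav" also fails is the text spoken. Under the rules above, each failure that leads to alternative content being rendered would be reported to the hosting environment.</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;

  &lt;!-- First choice: a recording; second choice: another recording;
       last resort: synthesized text --&gt;
  &lt;audio src="greeting-hq.wav"&gt;
    &lt;audio src="greeting.wav"&gt;
      Welcome to the Voice Portal.
    &lt;/audio&gt;
  &lt;/audio&gt;

&lt;/speak&gt;
</pre>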
  
<p>The <a href="#edef_audio" class="eref" shape="rect">audio</a> element can only contain text to be rendered and the following elements: <a href="#edef_audio" class="eref" shape="rect">audio</a>, <a href="#edef_break" class="eref" shape="rect">break</a>, <a href="#edef_desc" class="eref" shape="rect">desc</a>, <a href="#edef_emphasis" class="eref" shape="rect">emphasis</a>, <a href="#edef_lang" class="eref" shape="rect">lang</a>, <a href="#edef_lookup" class="eref" shape="rect">lookup</a>, <a href="#edef_mark" class="eref" shape="rect">mark</a>, <a href="#edef_paragraph" class="eref" shape="rect">p</a>, <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a>, <a href="#edef_prosody" class="eref" shape="rect">prosody</a>, <a href="#edef_say-as" class="eref" shape="rect">say-as</a>, <a href="#edef_sub" class="eref" shape="rect">sub</a>, <a href="#edef_sentence" class="eref" shape="rect">s</a>, <a href="#edef_token" class="eref" shape="rect">token</a>, <a href="#edef_voice" class="eref" shape="rect">voice</a>, <a href="#edef_word" class="eref" shape="rect">w</a>.</p>

<h3 id="g3311"><a id="S3.3.1.1" name="S3.3.1.1" shape="rect">3.3.1.1</a> Trimming attributes</h3>
<p> Trimming attributes define the span of the audio to be
rendered. Both the start and the end of the span within the <a href="#edef_audio" class="eref" shape="rect">audio</a> content can be specified using  time offsets. The duration of the span, including repetitions, can also be specified with repeat attributes. <a href="#term-processor" shape="rect">Synthesis processor</a> support for these attributes is  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> in the <a href="#S2.2.5">Extended profile</a>.</p>
<p>The following  trimming attributes are defined for <a href="#edef_audio" class="eref" shape="rect">audio</a>: </p>
<table border="1">
  <tbody>
    <tr>
      <th>Name</th>
      <th>Required</th>
      <th>Type</th>
      <th>Default Value</th>
      <th>Description</th>
    </tr>
    <tr>
      <td><code class="att">clipBegin</code></td>
      <td>false</td>
      <td><a href="#def_time_designation">Time Designation</a> </td>
      <td>0s</td>
      <td>Offset from the start of the media at which rendering begins. This offset is
        measured in normal media playback time from the beginning of the
      media.</td>
    </tr>
    <tr>
      <td><code class="att">clipEnd</code></td>
      <td>false</td>
      <td><a href="#def_time_designation">Time Designation</a></td>
      <td>None</td>
      <td>Offset from the start of the media at which rendering ends. This offset is
        measured in normal media playback time from the beginning of the
      media.</td>
    </tr>
    <tr>
      <td><code class="att">repeatCount</code></td>
      <td>false</td>
      <td>A positive <a href="#def_real_number">Real Number</a></td>
      <td>1</td>
      <td>Number of iterations of the media to render. A fractional value describes a portion of the rendered media.</td>
    </tr>
    <tr>
      <td><code class="att">repeatDur</code></td>
      <td>false</td>
      <td><a href="#def_time_designation">Time Designation</a></td>
      <td>None</td>
      <td>Total duration for repeatedly rendering the media. This duration is measured in normal media playback time from the beginning of the media.</td>
    </tr>
  </tbody>
</table>
<p>Calculations of rendered durations and interaction with other timing
properties follow the SMIL rules for <a
href="http://www.w3.org/TR/SMIL/smil-timing.html#Timing-ComputingActiveDur">Computing the active duration</a> [<a href="#ref-smil3">SMIL3</a>], where:</p>
<ul>
  <li><a href="#edef_audio" class="eref" shape="rect">audio</a> is a time container  </li>
  <li><a href="#def_time_designation">Time Designation</a> values for <code class="att">clipBegin</code>, <code class="att">clipEnd</code>, and <code class="att">repeatDur</code> are a subset of SMIL Clock-value</li>
  <li>If the length of an audio clip is not known in advance, then it is treated as indefinite. Consequently, <code class="att">repeatCount</code> will have no effect.</li>
  <li>If <code class="att">clipEnd</code> is after the end of the audio, then  rendering ends at the audio end.</li>
  <li>If <code class="att">clipBegin</code> is after <code class="att">clipEnd</code>, no audio will be produced. </li>
  <li><code class="att">repeatDur</code> takes precedence over <code class="att">repeatCount</code> in determining the total time for rendering media.</li>
</ul>
<p>Note that not all SMIL Timing features are supported.</p>

<h4 id="def_real_number">Real Numbers</h4>
<p>Real numbers and integers are specified in decimal notation only.</p>
<p>An integer consists of one or more digits "0" to "9".</p>
<p>A real number may be an integer, or it may be zero or more digits followed by a dot (.) followed by one or more digits. Both integers and real numbers may be preceded by a "-" or "+" to indicate the sign.</p>
<h4 id="def_time_designation">Time Designation</h4>
<p>Time designations consist of a non-negative <a href="#def_real_number">real number</a> followed by a time unit identifier. The time unit identifiers are:</p>
<ul>
  <li>ms: milliseconds</li>
  <li>s: seconds  </li>
</ul>
<p>Examples include: "3s", "850ms", "0.7s", ".5s" and "+1.5s".</p>
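<p>The following sketch shows a few of these time designations in use on the trimming attributes, together with some values that the definitions above would appear to exclude. The file name and values are hypothetical.</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

  &lt;!-- A leading dot and the "ms" unit are both permitted --&gt;
  &lt;audio src="radio.wav" clipBegin=".5s" clipEnd="850ms" /&gt;

  &lt;!-- An explicit "+" sign is permitted --&gt;
  &lt;audio src="radio.wav" clipBegin="+1.5s" repeatDur="3s" /&gt;

  &lt;!-- Values that do not match the definitions above:
       "-2s"  : the number is negative
       "5"    : no time unit identifier
       "1.s"  : the dot is not followed by a digit
       "2min" : the unit is neither "ms" nor "s" --&gt;

&lt;/speak&gt;
</pre>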
<h4 id="g3311.e">Examples</h4>
<p>In the following example, rendering of the media begins 10 seconds  into the audio: </p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

  &lt;audio src="radio.wav" clipBegin="10s" /&gt;

&lt;/speak&gt;
</pre>

<p>Here the rendering of the media begins 10 seconds into the audio and ends at the 20-second point, so 10 seconds of audio are rendered:</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

  &lt;audio src="radio.wav" clipBegin="10s" clipEnd="20s" /&gt;

&lt;/speak&gt;
</pre>

<p>Note that if the duration of "radio.wav" is less than 20 seconds, the <code class="att">clipEnd</code> value is ignored, and the rendering end is set equal to the effective end of the media.</p>
<p>In the following example, the duration of the audio is constrained by  <code class="att">repeatCount</code>: </p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

  &lt;audio src="3second_sound.au" repeatCount="0.5" /&gt; 

&lt;/speak&gt;
</pre>

<p>Only the first half of the clip will play; the active duration will be 1.5 seconds.</p>
<p>In the following example, the audio will repeat for a total of 7  seconds. It will play fully two times, followed by a fractional part of  2 seconds. This is equivalent to a <code class="att">repeatCount</code> of 2.8. </p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

  &lt;audio src="2.5second_music.mp3" repeatDur="7s" /&gt;

&lt;/speak&gt;
</pre>
<p>In the following example, the active duration of the audio will be 4 seconds. Playback will start 1 second into the audio (as specified by the <code class="att">clipBegin</code> value) and then play for 1 second (since <code class="att">clipEnd</code> is specified as 2 seconds), and then this span will be repeated so that the total duration is 4 seconds (as specified by <code class="att">repeatDur</code>). Note that the value of <code class="att">repeatDur</code> takes precedence over the value of <code class="att">repeatCount</code>.</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

  &lt;audio src="2.5second_music.mp3" clipBegin="1s" clipEnd="2s"
         repeatCount="5" repeatDur="4s" /&gt;

&lt;/speak&gt;
</pre>

<p>These attributes can interact with the rendering specified by <a href="#edef_speak" class="eref" shape="rect">speak</a> trimming attributes: </p>
<pre class="example" xml:space="preserve">
&lt;speak version="1.1" startmark="mark1" endmark="mark2"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;
   &lt;audio src="first.wav"/&gt;
   &lt;mark name="mark1"/&gt;
   &lt;audio src="15second_music.mp3" clipBegin="2s" clipEnd="7s" /&gt;
   &lt;mark name="mark2"/&gt;
   &lt;audio src="last.wav"/&gt;
&lt;/speak&gt;
</pre>
<p>The <a href="#edef_speak" class="eref" shape="rect">speak</a> <code class="att">startmark</code> and <code class="att">endmark</code> attributes allow only the "15second_music.mp3" clip to be played. The actual duration of the audio is 5 seconds: the clip begins 2 seconds into the audio and ends at the 7-second point, hence a duration of 5
seconds.</p>
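<p>Two further sketches of the rules listed earlier in this section, using the same hypothetical 15-second clip. In the first, the 5-second span selected by <code class="att">clipBegin</code> and <code class="att">clipEnd</code> is repeated one and a half times, which would give an active duration of 5 &#215; 1.5 = 7.5 seconds. In the second, <code class="att">clipBegin</code> is after <code class="att">clipEnd</code>, so no audio is produced. The attribute values are chosen for illustration only.</p>
<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

  &lt;!-- 5-second span (2s to 7s) rendered 1.5 times: 7.5 seconds in total --&gt;
  &lt;audio src="15second_music.mp3" clipBegin="2s" clipEnd="7s"
         repeatCount="1.5" /&gt;

  &lt;!-- clipBegin is after clipEnd: no audio is produced --&gt;
  &lt;audio src="15second_music.mp3" clipBegin="8s" clipEnd="3s" /&gt;

&lt;/speak&gt;
</pre>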
<h3 id="g3.3.1.2"><a id="S3.3.1.2" name="S3.3.1.2" shape="rect">3.3.1.2</a> <code class="att">soundLevel</code> Attribute</h3>
<p>The <code class="att">soundLevel</code> attribute specifies the relative volume of the referenced audio. It is inspired by the similarly-named attribute in SMIL [<a href="#ref-smil3">SMIL3</a>]. <a href="#term-processor" shape="rect">Synthesis processor</a> support for this attribute is <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> in the <a href="#S2.2.5">Extended profile</a>.</p>

<table border="1">
  <tbody>
    <tr>
      <th>Name</th>
      <th>Required</th>
      <th>Type</th>
      <th>Default Value</th>
      <th>Description</th>
    </tr>
    <tr>
      <td><code class="att">soundLevel</code></td>
      <td>false</td>
      <td><p>signed ("+" or "-") <a href="http://www.w3.org/TR/CSS2/syndata.html#numbers" shape="rect">CSS2 numbers</a> immediately followed by "dB"</p></td>
      <td>The default value is +0.0dB. </td>
      <td>Decibel values are interpreted as the ratio of the squares of the new signal amplitude (a<sub>1</sub>) and the current amplitude (a<sub>0</sub>), expressed in dB: <code class="att">soundLevel</code><sub>(dB)</sub> = 20 log<sub>10</sub> (a<sub>1</sub> &#8725; a<sub>0</sub>). A large negative value effectively plays the media silently. A value of '-6.0dB' will play the media at approximately half its current signal amplitude. Similarly, a value of '+6.0dB' will play the media at approximately twice its current signal amplitude (subject to hardware limitations). The absolute sound level of the media as perceived is further subject to system volume settings, which cannot be controlled with this attribute.</td>
    </tr>
  </tbody>
</table>
<p>Here is an example of how to use the <code class="att">soundLevel</code> attribute:</p>
<pre class="example" xml:space="preserve">
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

   &lt;s&gt;This is the original, unmodified waveform:   
       &lt;audio src="message.wav"/&gt;
   &lt;/s&gt;
   &lt;s&gt;This is the same audio at approximately twice the signal amplitude:   
       &lt;audio soundLevel="+6dB" src="message.wav"/&gt;
   &lt;/s&gt;
   &lt;s&gt;This is the same audio at approximately half the original signal amplitude:   
       &lt;audio soundLevel="-6dB" src="message.wav"/&gt;
   &lt;/s&gt;
&lt;/speak&gt;</pre>
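<p>As a worked check of the "approximately half" and "approximately twice" figures, the formula in the table can be solved for the amplitude ratio: a<sub>1</sub> &#8725; a<sub>0</sub> = 10<sup>(<code class="att">soundLevel</code> &#8725; 20)</sup>. The values below are rounded and are given for illustration only.</p>
<ul>
  <li>"+6dB": 10<sup>0.3</sup> &#8776; 1.995, that is, about twice the amplitude.</li>
  <li>"-6dB": 10<sup>-0.3</sup> &#8776; 0.501, that is, about half the amplitude.</li>
  <li>"-20dB": 10<sup>-1</sup> = 0.1, that is, one tenth of the amplitude.</li>
  <li>Conversely, an exact doubling of amplitude corresponds to 20 log<sub>10</sub> (2) &#8776; 6.02dB.</li>
</ul>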

<h3 id="g3.3.1.3"><a id="S3.3.1.3" name="S3.3.1.3" shape="rect">3.3.1.3</a> <code class="att">speed</code> Attribute</h3>

<p>The <code class="att">speed</code> attribute controls the playback speed of the referenced audio, speeding up or slowing down the effective rate of play. The attribute value does not specify an absolute play speed; it is relative to the playback speed of the original waveform. <a href="#term-processor" shape="rect">Synthesis processor</a> support for this attribute is <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> in the <a href="#S2.2.5">Extended profile</a>.</p>
<table border="1">
  <tbody>
    <tr>
      <th>Name</th>
      <th>Required</th>
      <th>Type</th>
      <th>Default Value</th>
      <th>Description</th>
    </tr>
    <tr>
      <td><code class="att">speed</code></td>
      <td>false</td>
      <td>x%<br />
      (where x is a  positive real value)</td>
      <td>The default value is 100%, which corresponds to the speed of an unmodified audio waveform. </td>
      <td>The speed at which to play the referenced audio, relative to the original speed.<br />      The speed is set to  the requested percentage of the speed of the original waveform.</td>
    </tr>
  </tbody>
</table>
<p>A change in the value of the <code class="att">speed</code> attribute will change the rate at which recorded samples are played back. Note that this will affect the pitch.</p>
<p>Here is an example of how to use the <code class="att">speed</code> attribute:</p>
<pre class="example" xml:space="preserve">
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd"
       xml:lang="en-US"&gt;

   &lt;s&gt;This is the original, unmodified waveform:   
       &lt;audio src="message.wav"/&gt;
   &lt;/s&gt;
   &lt;s&gt;This is the same audio at twice the speed:   
       &lt;audio speed="200%" src="message.wav"/&gt;
   &lt;/s&gt;
   &lt;s&gt;This is the same audio at half the original speed:   
       &lt;audio speed="50%" src="message.wav"/&gt;
   &lt;/s&gt;
&lt;/speak&gt;</pre>
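<p>As a rough worked example, assume that playing the samples back at x% of the original rate scales the rendered duration by 100 &#8725; x. This is an assumption made for illustration, and it ignores any interaction with the trimming attributes. If the hypothetical "message.wav" were 8 seconds long, then:</p>
<ul>
  <li><code class="att">speed</code>="200%": 8 &#215; 100 &#8725; 200 = 4 seconds.</li>
  <li><code class="att">speed</code>="50%": 8 &#215; 100 &#8725; 50 = 16 seconds.</li>
</ul>
<p>Under the same assumption, frequencies in the recording scale by x &#8725; 100, so a value of 200% would raise the pitch by about one octave and a value of 50% would lower it by about one octave.</p>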
<h3 id="g26"><a id="S3.3.2" name="S3.3.2" shape="rect">3.3.2</a> <a name="edef_mark" id="edef_mark" class="edef" shape="rect">mark</a> Element</h3>

<p>A <a href="#edef_mark" class="eref" shape="rect">mark</a> element is an empty element that places a marker into the text/tag sequence. It has one  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> attribute, <code class="att">name</code>, which is of type <code><a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#token" shape="rect">xsd:token</a></code> [<a href="#ref-schema2" shape="rect">SCHEMA2</a> §3.3.2]. The <a href="#edef_mark" class="eref" shape="rect">mark</a> element can be used to reference a specific location in the text/tag sequence, and can additionally be used to insert a marker into an output stream for asynchronous notification. When processing a <a href="#edef_mark" class="eref" shape="rect">mark</a> element, a <a href="#term-processor" shape="rect">synthesis processor</a> <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> do one or both of the following:</p>

<ul>
<li>inform the hosting environment with the value of the <code class="att">name</code> attribute and with information allowing the platform to retrieve the corresponding position in the rendered output.</li>

<li>when audio output of the SSML document reaches the <a href="#edef_mark" class="eref" shape="rect">mark</a>, issue an event that includes the  <em title="REQUIRED in RFC 2119 context" class="RFC2119">REQUIRED</em> <code class="att">name</code> attribute of the element. The hosting environment defines the destination of the event.</li>
</ul>

<p>The <a href="#edef_mark" class="eref" shape="rect">mark</a> element does not affect the speech output process.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
                 
Go from &lt;mark name="here"/&gt; here, to &lt;mark name="there"/&gt; there!

&lt;/speak&gt;
</pre>

<h3 id="g233"><a id="S3.3.3" name="S3.3.3" shape="rect">3.3.3</a> <a name="edef_desc" id="edef_desc" class="edef" shape="rect">desc</a> Element</h3>

<p>The <a href="#edef_desc" class="eref" shape="rect">desc</a> element can only occur within the content of the <a href="#edef_audio" class="eref" shape="rect">audio</a> element. When the audio source referenced in <a href="#edef_audio" class="eref" shape="rect">audio</a> is not speech, e.g. audio wallpaper or sonicon punctuation, the <a href="#edef_audio" class="eref" shape="rect">audio</a> element should contain a <a href="#edef_desc" class="eref" shape="rect">desc</a> element whose textual content is a description of the audio source (e.g. "door slamming"). If text-only output is being produced by the <a href="#term-processor" shape="rect">synthesis processor</a>, the content of the <a href="#edef_desc" class="eref" shape="rect">desc</a> element(s) <em title="SHOULD in RFC 2119 context" class="RFC2119">SHOULD</em> be rendered instead of other alternative content in <a href="#edef_audio" class="eref" shape="rect">audio</a>. The <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> <a href="#adef_xmllang" class="aref" shape="rect">xml:lang</a> attribute can be used to indicate that the content of the element is in a different language from that of the content surrounding the element. The <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> <a href="#adef_onlangfailure" class="aref" shape="rect"><code class="att">onlangfailure</code></a> attribute can be used to specify the desired behavior upon language speaking failure.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
                 
  &lt;!-- Normal use of &lt;desc&gt; --&gt;
  Heads of State often make mistakes when speaking in a foreign language.
  One of the most well-known examples is that of John F. Kennedy:
  &lt;audio src="ichbineinberliner.wav"&gt;If you could hear it, this would be
  a recording of John F. Kennedy speaking in Berlin.
    &lt;desc&gt;Kennedy's famous German language gaffe&lt;/desc&gt;
  &lt;/audio&gt;

  &lt;!-- Suggesting the language of the recording --&gt;
  &lt;!-- Although there is no requirement that a recording be in the current language
       (since it might even be non-speech such as music), an author might wish to
       suggest the language of the recording by marking the entire &lt;audio&gt; element
       using &lt;lang&gt;.  In this case, the xml:lang attribute on &lt;desc&gt; can be used
       to put the description back into the original language. --&gt;
  Here's the same thing again but with a different fallback:
  &lt;lang xml:lang="de-DE"&gt;
    &lt;audio src="ichbineinberliner.wav"&gt;Ich bin ein Berliner.
      &lt;desc xml:lang="en-US"&gt;Kennedy's famous German language gaffe&lt;/desc&gt;
    &lt;/audio&gt;
  &lt;/lang&gt;
&lt;/speak&gt;
</pre>

<p>The <a href="#edef_desc" class="eref" shape="rect">desc</a> element can only contain descriptive text.</p>

<h2 id="g40"><a id="S4" name="S4" shape="rect">4.</a> References</h2>

<h3 id="g41"><a id="S4.1" name="S4.1" shape="rect">4.1</a> Normative References</h3>

<dl>
<dt><a id="ref-bcp47" name="ref-bcp47" shape="rect">[BCP47]</a></dt>

<dd><cite><a href="http://www.rfc-editor.org/bcp/bcp47.txt" shape="rect">Tags for Identifying Languages</a></cite> and <cite><a href="http://www.rfc-editor.org/bcp/bcp47.txt" shape="rect">Matching of Language Tags</a></cite>, A. Phillips and M. Davis, Editors. IETF, September 2009.  Available at http://www.rfc-editor.org/bcp/bcp47.txt.</dd>

<dt><a id="ref-css2" name="ref-css2" shape="rect">[CSS2]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/1998/REC-CSS2-19980512/" shape="rect">Cascading Style Sheets, level 2: CSS2 Specification</a></cite>, B. Bos, et al., Editors. World Wide Web Consortium, 12 May 1998. This version of the CSS2 Recommendation is http://www.w3.org/TR/1998/REC-CSS2-19980512/. The <a href="http://www.w3.org/TR/CSS2/" shape="rect">latest version of CSS2</a> is available at http://www.w3.org/TR/CSS2/.
Note this reference may be revised when the 
<a href="http://www.w3.org/TR/css3-speech/">CSS3 Speech Module</a>
becomes a W3C Recommendation.
</dd>

<dt><a id="ref-ipahndbk" name="ref-ipahndbk" shape="rect">[IPAHNDBK]</a></dt>

<dd><cite><a href="http://www.langsci.ucl.ac.uk/ipa/handbook.html" shape="rect">Handbook of the International Phonetic Association</a></cite>, International Phonetic Association, Editors. Cambridge University Press, July 1999. Information on the Handbook is available at http://www.langsci.ucl.ac.uk/ipa/handbook.html.</dd>

<dt><a id="ref-pls" name="ref-pls" shape="rect">[PLS]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/" shape="rect">Pronunciation Lexicon Specification (PLS) Version 1.0</a></cite>, P. Baggia, Editor. World Wide Web Consortium, 14 October 2008. This version of the PLS Recommendation is http://www.w3.org/TR/2008/REC-pronunciation-lexicon-20081014/. The <a href="http://www.w3.org/TR/pronunciation-lexicon/" shape="rect">latest version of PLS</a> is available at http://www.w3.org/TR/pronunciation-lexicon/.</dd>

<dt><a id="ref-rfc1521" name="ref-rfc1521" shape="rect">[RFC1521]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc1521.txt" shape="rect">MIME (Multipurpose Internet Mail Extensions) Part One: Mechanisms for Specifying and Describing the Format of Internet Message Bodies</a></cite>, N. Borenstein and N. Freed, Editors. IETF, September 1993. This RFC is available at http://www.ietf.org/rfc/rfc1521.txt.</dd>

<dt><a id="ref-rfc2045" name="ref-rfc2045" shape="rect">[RFC2045]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc2045.txt" shape="rect">Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies.</a></cite>, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2045.txt.</dd>

<dt><a id="ref-rfc2046" name="ref-rfc2046" shape="rect">[RFC2046]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc2046.txt" shape="rect">Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types</a></cite>, N. Freed and N. Borenstein, Editors. IETF, November 1996. This RFC is available at http://www.ietf.org/rfc/rfc2046.txt.</dd>

<dt><a id="ref-rfc2119" name="ref-rfc2119" shape="rect">[RFC2119]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc2119.txt" shape="rect">Key words for use in RFCs to Indicate Requirement Levels</a></cite>, S. Bradner, Editor. IETF, March 1997. This RFC is available at http://www.ietf.org/rfc/rfc2119.txt.</dd>

<dt><a id="ref-rfc3986" name="ref-rfc3986" shape="rect">[RFC3986]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc3986.txt" shape="rect">Uniform Resource Identifier (URI): Generic Syntax</a></cite>, T. Berners-Lee et al., Editors. IETF, January 2005. This RFC is available at http://www.ietf.org/rfc/rfc3986.txt.</dd>

<dt><a id="ref-rfc3987" name="ref-rfc3987" shape="rect">[RFC3987]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc3987.txt" shape="rect">Internationalized Resource Identifiers (IRIs)</a></cite>, M. Duerst and M. Suignard, Editors. IETF, January 2005. This RFC is available at http://www.ietf.org/rfc/rfc3987.txt.</dd>

<dt><a id="ref-rfc4267" name="ref-rfc4267" shape="rect">[RFC4267]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc4267.txt" shape="rect">The W3C Speech Interface Framework Media Types: application/voicexml+xml, application/ssml+xml, application/srgs, application/srgs+xml, application/ccxml+xml, and application/pls+xml</a></cite>, M. Froumentin, Editor. IETF, November 2005. This RFC is available at http://www.ietf.org/rfc/rfc4267.txt.</dd>

<dt><a id="ref-schema1" name="ref-schema1" shape="rect">[SCHEMA1]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/" shape="rect">XML Schema Part 1: Structures Second Edition</a></cite>, H. S. Thompson, et al., Editors. World Wide Web Consortium, 28 October 2004. This version of the XML Schema Part 1 Recommendation is http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/. The <a href="http://www.w3.org/TR/xmlschema-1/" shape="rect">latest version of XML Schema 1</a> is available at http://www.w3.org/TR/xmlschema-1/.</dd>

<dt><a id="ref-schema2" name="ref-schema2" shape="rect">[SCHEMA2]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/" shape="rect">XML Schema Part 2: Datatypes Second Edition</a></cite>, P.V. Biron and A. Malhotra, Editors. World Wide Web Consortium, 28 October 2004. This version of the XML Schema Part 2 Recommendation is http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/. The <a href="http://www.w3.org/TR/xmlschema-2/" shape="rect">latest version of XML Schema 2</a> is available at http://www.w3.org/TR/xmlschema-2/.</dd>

<dt><a id="ref-smil3" name="ref-smil3" shape="rect">[SMIL3]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2008/REC-SMIL3-20081201/" shape="rect">Synchronized Multimedia Integration Language (SMIL 3.0)</a></cite>, D. Bulterman, et al., Editors. World Wide Web Consortium, 1 December 2008. This version of the SMIL 3 Recommendation is http://www.w3.org/TR/2008/REC-SMIL3-20081201/. The <a href="http://www.w3.org/TR/SMIL3/" shape="rect">latest version of SMIL3</a> is available at http://www.w3.org/TR/SMIL3/.</dd>

<dt><a id="ref-mimetypes" name="ref-mimetypes" shape="rect">[TYPES]</a></dt>

<dd><cite><a href="http://www.iana.org/assignments/media-types/index.html" shape="rect">MIME Media types</a></cite>, IANA. This continually-updated list of media types registered with IANA is available at http://www.iana.org/assignments/media-types/index.html.</dd>

<dt><a id="ref-xml10" name="ref-xml10" shape="rect">[XML 1.0]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2008/REC-xml-20081126/" shape="rect">Extensible Markup Language (XML) 1.0 (Fifth Edition)</a></cite>, T. Bray et al., Editors. World Wide Web Consortium, 26 November 2008. This version of the XML 1.0 Recommendation is http://www.w3.org/TR/2008/REC-xml-20081126/. The <a href="http://www.w3.org/TR/xml/" shape="rect">latest version of XML 1.0</a> is available at http://www.w3.org/TR/xml/.</dd>

<dt><a id="ref-xml11" name="ref-xml11" shape="rect">[XML 1.1]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2006/REC-xml11-20060816/" shape="rect">Extensible Markup Language (XML) 1.1 (Second Edition)</a></cite>, T. Bray et al., Editors. World Wide Web Consortium, 16 August 2006. This version of the XML 1.1 Recommendation is http://www.w3.org/TR/2006/REC-xml11-20060816/. The <a href="http://www.w3.org/TR/xml11/" shape="rect">latest version of XML 1.1</a> is available at http://www.w3.org/TR/xml11/.</dd>

<dt><a id="ref-xml-base" name="ref-xml-base" shape="rect">[XML-BASE]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2009/REC-xmlbase-20090128/" shape="rect">XML Base (Second Edition)</a></cite>, J. Marsh and R. Tobin, Editors. World Wide Web Consortium, 28 January 2009. This version of the XML Base Recommendation is http://www.w3.org/TR/2009/REC-xmlbase-20090128/. The <a href="http://www.w3.org/TR/xmlbase/" shape="rect">latest version of XML Base</a> is available at http://www.w3.org/TR/xmlbase/.</dd>

<dt><a id="ref-xml-id" name="ref-xml-id" shape="rect">[XML-ID]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2005/REC-xml-id-20050909/" shape="rect">xml:id Version 1.0</a></cite>, J. Marsh et al., Editors. World Wide Web Consortium, 9 September 2005. This version of the xml:id Recommendation is http://www.w3.org/TR/2005/REC-xml-id-20050909/. The <a href="http://www.w3.org/TR/xml-id/" shape="rect">latest version of xml:id</a> is available at http://www.w3.org/TR/xml-id/.</dd>

<dt><a id="ref-xmlns10" name="ref-xmlns10" shape="rect">[XMLNS 1.0]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2009/REC-xml-names-20091208/" shape="rect">Namespaces in XML 1.0 (Third Edition)</a></cite>, T. Bray et al., Editors. World Wide Web Consortium, 8 December 2009. This version of the XML Namespaces 1.0 Recommendation is http://www.w3.org/TR/2009/REC-xml-names-20091208/. The <a href="http://www.w3.org/TR/REC-xml-names/" shape="rect">latest version of XML Namespaces 1.0</a> is available at http://www.w3.org/TR/REC-xml-names/.</dd>

<dt><a id="ref-xmlns11" name="ref-xmlns11" shape="rect">[XMLNS 1.1]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2006/REC-xml-names11-20060816/" shape="rect">Namespaces in XML 1.1 (Second Edition)</a></cite>, T. Bray et al., Editors. World Wide Web Consortium, 16 August 2006. This version of the XML Namespaces 1.1 Recommendation is http://www.w3.org/TR/2006/REC-xml-names11-20060816/. The <a href="http://www.w3.org/TR/xml-names11/" shape="rect">latest version of XML Namespaces 1.1</a> is available at http://www.w3.org/TR/xml-names11/.</dd>
</dl>

<h3 id="g43"><a id="S4.2" name="S4.2" shape="rect">4.2</a> Informative References</h3>

<dl><dt><a id="ref-dc" name="ref-dc" shape="rect">[DC]</a></dt>

<dd><cite>Dublin Core Metadata Initiative.</cite> See <a href="http://dublincore.org/" shape="rect">http://dublincore.org/</a>.</dd>

<dt><a id="ref-html" name="ref-html" shape="rect">[HTML]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/1999/REC-html401-19991224/" shape="rect">HTML 4.01 Specification</a></cite>, D. Raggett et al., Editors. World Wide Web Consortium, 24 December 1999. This version of the HTML 4 Recommendation is http://www.w3.org/TR/1999/REC-html401-19991224/. The <a href="http://www.w3.org/TR/html4/" shape="rect">latest version of HTML 4</a> is available at http://www.w3.org/TR/html4/.</dd>

<dt><a id="ref-ipa" name="ref-ipa" shape="rect">[IPA]</a></dt>

<dd><cite><a href="http://www.langsci.ucl.ac.uk/ipa/" shape="rect">International Phonetic Association</a></cite>. See http://www.langsci.ucl.ac.uk/ipa/ for the organization's website.</dd>

<dt><a id="ref-ipaunicode1" name="ref-ipaunicode1" shape="rect">[IPAUNICODE1]</a></dt>

<dd><cite><a href="http://web.uvic.ca/ling/resources/ipa/charts/unicode_ipa-chart.htm" shape="rect">The International Phonetic Alphabet</a></cite>, J. Esling. This table of IPA characters in Unicode is available at http://web.uvic.ca/ling/resources/ipa/charts/unicode_ipa-chart.htm.</dd>

<dt><a id="ref-ipaunicode2" name="ref-ipaunicode2" shape="rect">[IPAUNICODE2]</a></dt>

<dd><cite><a href="http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm" shape="rect">The International Phonetic Alphabet in Unicode</a></cite>, J. Wells. This table of Unicode values for IPA characters is available at http://www.phon.ucl.ac.uk/home/wells/ipa-unicode.htm.</dd>

<dt><a id="ref-jeidaalphabet" name="ref-jeidaalphabet" shape="rect">[JEIDAALPHABET]</a></dt>

<dd><cite><a href="http://it.jeita.or.jp/document/publica/standard/summary/JEIDA-62-2000.pdf" shape="rect">JEIDA-62-2000 Phoneme Alphabet</a></cite>. JEITA. An abstract of this document (in Japanese) is available at http://it.jeita.or.jp/document/publica/standard/summary/JEIDA-62-2000.pdf.</dd>

<dt><a id="ref-jeita" name="ref-jeita" shape="rect">[JEITA]</a></dt>

<dd><cite><a href="http://www.jeita.or.jp" shape="rect">Japan Electronics and Information Technology Industries Association</a></cite>. See http://www.jeita.or.jp/.</dd>

<dt><a id="ref-jsml" name="ref-jsml" shape="rect">[JSML]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2000/NOTE-jsml-20000605/" shape="rect">JSpeech Markup Language</a></cite>, A. Hunt, Editor. World Wide Web Consortium, 5 June 2000. Copyright ©2000 Sun Microsystems, Inc. This version of the JSML submission is http://www.w3.org/TR/2000/NOTE-jsml-20000605/. The <a href="http://www.w3.org/TR/jsml/" shape="rect">latest W3C Note of JSML</a> is available at http://www.w3.org/TR/jsml/.</dd>

<dt><a id="ref-lex" name="ref-lex" shape="rect">[LEX]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/WD-lexicon-reqs-20041029/" shape="rect">Pronunciation Lexicon Markup Requirements</a></cite>, P. Baggia and F. Scahill, Editors. World Wide Web Consortium, 29 October 2004. This document is a work in progress. This version of the Lexicon Requirements is http://www.w3.org/TR/2004/WD-lexicon-reqs-20041029/. The <a href="http://www.w3.org/TR/lexicon-reqs/" shape="rect">latest version of the Lexicon Requirements</a> is available at http://www.w3.org/TR/lexicon-reqs/.</dd>

<dt><a id="ref-rdf" name="ref-rdf" shape="rect">[RDF]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-rdf-primer-20040210/" shape="rect">RDF Primer</a></cite>, F. Manola and E. Miller, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Primer Recommendation is http://www.w3.org/TR/2004/REC-rdf-primer-20040210/. The <a href="http://www.w3.org/TR/rdf-primer/" shape="rect">latest version of the RDF Primer</a> is available at http://www.w3.org/TR/rdf-primer/.</dd>

<dt><a id="ref-rdf-xml" name="ref-rdf-xml" shape="rect">[RDF-XMLSYNTAX]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/" shape="rect">RDF/XML Syntax Specification</a></cite>, D. Beckett, Editor. World Wide Web Consortium, 10 February 2004. This version of the RDF/XML Syntax Recommendation is http://www.w3.org/TR/2004/REC-rdf-syntax-grammar-20040210/. The <a href="http://www.w3.org/TR/rdf-syntax-grammar/" shape="rect">latest version of the RDF XML Syntax</a> is available at http://www.w3.org/TR/rdf-syntax-grammar/.</dd>

<dt><a id="ref-rdf-schema" name="ref-rdf-schema" shape="rect">[RDF-SCHEMA]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-rdf-schema-20040210/" shape="rect">RDF Vocabulary Description Language 1.0: RDF Schema</a></cite>, D. Brickley and R. Guha, Editors. World Wide Web Consortium, 10 February 2004. This version of the RDF Schema Recommendation is http://www.w3.org/TR/2004/REC-rdf-schema-20040210/. The <a href="http://www.w3.org/TR/rdf-schema/" shape="rect">latest version of RDF Schema</a> is available at http://www.w3.org/TR/rdf-schema/.</dd>

<dt><a id="ref-reqs" name="ref-reqs" shape="rect">[REQS]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/1999/WD-voice-tts-reqs-19991223/" shape="rect">Speech Synthesis Markup Requirements for Voice Markup Languages</a></cite>, A. Hunt, Editor. World Wide Web Consortium, 23 December 1999. This document is a work in progress. This version of the Synthesis Requirements is http://www.w3.org/TR/1999/WD-voice-tts-reqs-19991223/. The <a href="http://www.w3.org/TR/voice-tts-reqs/" shape="rect">latest version of the Synthesis Requirements</a> is available at http://www.w3.org/TR/voice-tts-reqs/.</dd>

<dt><a id="ref-reqs11" name="ref-reqs11" shape="rect">[REQS11]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2007/WD-ssml11reqs-20070611/" shape="rect">Speech Synthesis Markup Language Version 1.1 Requirements</a></cite>, D. Burnett and Z. Shuang, Editors. World Wide Web Consortium, 11 June 2007. This document is a work in progress. This version of the SSML 1.1 Requirements is http://www.w3.org/TR/2007/WD-ssml11reqs-20070611/. The <a href="http://www.w3.org/TR/ssml11reqs/" shape="rect">latest version of the SSML 1.1 Requirements</a> is available at http://www.w3.org/TR/ssml11reqs/.</dd>

<dt><a id="ref-rfc2616" name="ref-rfc2616" shape="rect">[RFC2616]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc2616.txt" shape="rect">Hypertext Transfer Protocol -- HTTP/1.1</a></cite>, R. Fielding, et al., Editors. IETF, June 1999. This RFC is available at http://www.ietf.org/rfc/rfc2616.txt.</dd>

<dt><a id="ref-rfc2732" name="ref-rfc2732" shape="rect">[RFC2732]</a></dt>

<dd><cite><a href="http://www.ietf.org/rfc/rfc2732.txt" shape="rect">Format for Literal IPv6 Addresses in URL's</a></cite>, R. Hinden, et al., Editors. IETF, December 1999. This RFC is available at http://www.ietf.org/rfc/rfc2732.txt.</dd>

<dt><a id="ref-ruby" name="ref-ruby" shape="rect">[RUBY]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2001/REC-ruby-20010531/" shape="rect">Ruby Annotation</a></cite>,
 Marcin Sawicki, et al., Editors. World Wide Web Consortium, 31 May 2001. This version of the Ruby Recommendation is 
<a href="http://www.w3.org/TR/2001/REC-ruby-20010531/" shape="rect">http://www.w3.org/TR/2001/REC-ruby-20010531/</a>.
The latest version is available at <a href="http://www.w3.org/TR/ruby/" shape="rect">http://www.w3.org/TR/ruby/</a>.</dd>

<dt><a id="ref-sable" name="ref-sable" shape="rect">[SABLE]</a></dt>

<dd>"SABLE: A Standard for TTS Markup", Richard Sproat, et al. <cite>Proceedings of the International Conference on Spoken Language Processing</cite>, R. Mannell and J. Robert-Ribes, Editors. <a href="http://www.causalproductions.com/" shape="rect">Causal Productions Pty Ltd</a> (Adelaide), 1998. Vol. 5, pp. 1719-1722. Conference proceedings are available from the publisher at http://www.causalproductions.com/.</dd>

<dt><a id="ref-ssml" name="ref-ssml" shape="rect">[SSML]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/" shape="rect">Speech Synthesis Markup Language (SSML) Version 1.0</a></cite>,
 Daniel C. Burnett, et al., Editors. World Wide Web Consortium, 7 September 2004. This version of the SSML 1.0 Recommendation is 
<a href="http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/" shape="rect">http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/</a>.
The latest version is available at <a href="http://www.w3.org/TR/speech-synthesis/" shape="rect">http://www.w3.org/TR/speech-synthesis/</a>.</dd>

<dt><a id="ref-unicode" name="ref-unicode" shape="rect">[UNICODE]</a></dt>

<dd><cite><a href="http://www.unicode.org/standard/standard.html" shape="rect">The Unicode Standard</a></cite>. The Unicode Consortium. Information about the Unicode Standard and its versions can be found at http://www.unicode.org/standard/standard.html.</dd>

<dt><a id="ref-web-arch" name="ref-web-arch" shape="rect">[WEB-ARCH]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-webarch-20041215/" shape="rect">Architecture of the World Wide Web, Version One</a></cite>, I. Jacobs, N. Walsh, Editors. World Wide Web Consortium, 15 December 2004. This version of the WWW Architecture is http://www.w3.org/TR/2004/REC-webarch-20041215/. The <a href="http://www.w3.org/TR/webarch/" shape="rect">latest version of WWW Architecture</a> is available at http://www.w3.org/TR/webarch/.</dd>

<dt><a id="ref-vxml" name="ref-vxml" shape="rect">[VXML]</a></dt>

<dd><cite><a href="http://www.w3.org/TR/2004/REC-voicexml20-20040316/" shape="rect">Voice Extensible Markup Language (VoiceXML) Version 2.0</a></cite>, S. McGlashan, et al., Editors. World Wide Web Consortium, 16 March 2004. This version of the VoiceXML 2.0 Recommendation is http://www.w3.org/TR/2004/REC-voicexml20-20040316/. The <a href="http://www.w3.org/TR/voicexml20/" shape="rect">latest version of VoiceXML 2</a> is available at http://www.w3.org/TR/voicexml20/.</dd>

<dt><a id="ref-WS" name="ref-WS" shape="rect">[WS]</a></dt>

<dd><cite><a href="http://www.w3.org/2005/08/SSML/ssml-workshop-agenda.html" shape="rect">Minutes</a></cite>, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 2-3 November 2005. The agenda and minutes are available at  
<a href="http://www.w3.org/2005/08/SSML/ssml-workshop-agenda.html" shape="rect">http://www.w3.org/2005/08/SSML/ssml-workshop-agenda.html</a>.</dd>

<dt><a id="ref-WS2" name="ref-WS2" shape="rect">[WS2]</a></dt>

<dd><cite><a href="http://www.w3.org/2006/02/SSML/minutes.html" shape="rect">Minutes</a></cite>, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 30-31 May 2006. The agenda is available at <a href="http://www.w3.org/2006/02/SSML/agenda.html" shape="rect">http://www.w3.org/2006/02/SSML/agenda.html</a>. The minutes are available at 
<a href="http://www.w3.org/2006/02/SSML/minutes.html" shape="rect">http://www.w3.org/2006/02/SSML/minutes.html</a>.</dd>

<dt><a id="ref-WS3" name="ref-WS3" shape="rect">[WS3]</a></dt>

<dd><cite><a href="http://www.w3.org/2006/10/SSML/minutes.html" shape="rect">Minutes</a></cite>, W3C Workshop on Internationalizing the Speech Synthesis Markup Language, 13-14 January 2007. The agenda is available at <a href="http://www.w3.org/2006/10/SSML/agenda.html" shape="rect">http://www.w3.org/2006/10/SSML/agenda.html</a>. The minutes are available at  
<a href="http://www.w3.org/2006/10/SSML/minutes.html" shape="rect">http://www.w3.org/2006/10/SSML/minutes.html</a>.</dd>
</dl>

<h2 id="g44"><a id="S5" name="S5" shape="rect">5.</a> Acknowledgments</h2>

<p>This document was written with contributions from the following participants in the W3C Voice Browser Working Group and other W3C Working Groups <em>(listed in family name alphabetical order)</em>:</p>

<dl>
<dd>芦村 和幸 (Kazuyuki Ashimura), W3C<br/>
  Max Froumentin, W3C (at the time of participation)<br />
  黄力行 (Lixing Huang), Chinese Academy of Sciences (at the time of participation)<br />
  Andrew Hunt, Speechworks (at the time of participation)<br />
  今竹 渉 (Wataru Imatake), Invited Expert<br />
  Richard Ishida, W3C<br/>
  Jim Larson, Invited Expert (formerly of Intervoice)<br/>
  Wai-Kit Lo, Chinese University of Hong Kong (at the time of participation)<br/>
  Mark Walker, Intel (at the time of participation)<br/>
</dd>
</dl>

<p>The editors also wish to thank the members of the W3C Internationalization Working Group, who have provided significant review and contributions to SSML 1.0 and 1.1.</p>

<h2 id="g49"><a id="AppA" name="AppA" shape="rect">Appendix A</a>: Audio File Formats</h2>

<p><b>This appendix is normative.</b></p>

<p>SSML requires that a platform support the playing of the audio formats specified below.</p>

<table cellspacing="0" cellpadding="5" width="80%" border="1" summary="This table lists the audio formats, with associated media types, that synthesis processors are required to support.">
<caption>Required audio formats</caption>

<tr>
<th scope="col" rowspan="1" colspan="1">Audio Format</th>
<th scope="col" rowspan="1" colspan="1">Media Type</th>
</tr>

<tr>
<td rowspan="1" colspan="1">Raw (headerless) 8kHz 8-bit mono mu-law (PCM) single channel. (G.711)</td>
<td rowspan="1" colspan="1">audio/basic (from [<a href="#ref-rfc1521" shape="rect">RFC1521</a>])</td>
</tr>

<tr>
<td rowspan="1" colspan="1">Raw (headerless) 8kHz 8-bit mono A-law (PCM) single channel. (G.711)</td>
<td rowspan="1" colspan="1">audio/x-alaw-basic</td>
</tr>

<tr>
<td rowspan="1" colspan="1">WAV (RIFF header) 8kHz 8-bit mono mu-law (PCM) single channel.</td>
<td rowspan="1" colspan="1">audio/x-wav</td>
</tr>

<tr>
<td rowspan="1" colspan="1">WAV (RIFF header) 8kHz 8-bit mono A-law (PCM) single channel.</td>
<td rowspan="1" colspan="1">audio/x-wav</td>
</tr>
</table>

<p>The 'audio/basic'  media type is commonly used with the 'au' header format as well as the headerless 8-bit 8kHz mu-law format. If this  media type is specified for playing, the mu-law format <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> be used. For playback with the 'audio/basic'  media type, processors <em title="MUST in RFC 2119 context" class="RFC2119">MUST</em> support the mu-law format and  <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> support the 'au' format.</p>
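
<p>As an informative sketch, an author referencing a recording in one of the required formats might write the following. The URL is hypothetical, and the example assumes that the server delivers the file with the 'audio/x-wav' media type. The text inside the <a href="#edef_audio" class="eref" shape="rect">audio</a> element is alternate content that is rendered if the audio cannot be played.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- Hypothetical URL; assumed to be a WAV (RIFF header) 8kHz 8-bit mu-law file --&gt;
  &lt;audio src="http://www.example.com/prompts/welcome.wav"&gt;
    Welcome to the service.
  &lt;/audio&gt;
&lt;/speak&gt;
</pre>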

<h2 id="g51"><a id="AppB" name="AppB" shape="rect">Appendix B</a>: Internationalization</h2>

<p><b>This appendix is normative.</b></p>

<p>SSML is an application of XML [<a href="#ref-xml10" shape="rect">XML 1.0</a> or <a href="#ref-xml11" shape="rect">XML 1.1</a>] and thus supports [<a href="#ref-unicode" shape="rect">UNICODE</a>], which defines a standard universal character set.</p>

<p>SSML provides a mechanism for control of the spoken language via the use of the <a href="#adef_xmllang" class="aref" shape="rect"><code class="att">xml:lang</code></a> attribute. Language changes can occur as frequently as per token (word), although excessive language changes can diminish the output audio quality. SSML also permits finer control over output pronunciations via the <a href="#edef_lexicon" class="eref" shape="rect">lexicon</a> and <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> elements, features that can help to mitigate poor quality default lexicons for languages with only minimal commercial support today.</p>
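
<p>As an informative illustration of the per-token language control described above, the following sketch marks one word and one short phrase as being in a language other than that of the enclosing document. The sentences are illustrative only, and whether a given synthesis processor can render "fr" or "it" content well is an assumption.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- A single token (word) marked as French --&gt;
  The French word for cat is &lt;token xml:lang="fr"&gt;chat&lt;/token&gt;.
  &lt;!-- A phrase marked as Italian with the lang element --&gt;
  He prefers to eat pasta that is &lt;lang xml:lang="it"&gt;al dente&lt;/lang&gt;.
&lt;/speak&gt;
</pre>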

<h2 id="g50"><a id="AppC" name="AppC" shape="rect">Appendix C</a>:  Media Types and File Suffix</h2>

<p><b>This appendix is normative.</b></p>





<p>The media type associated with the Speech Synthesis Markup  Language specification is "application/ssml+xml" and the filename suffix is ".ssml" as defined in  [<a href="#ref-rfc4267" shape="rect">RFC4267</a>].</p>


<h2 id="g48"><a id="AppD" name="AppD" shape="rect">Appendix D</a>: Schema for the Speech Synthesis Markup Language</h2>

<p><b>This appendix is normative.</b></p>

<p>The synthesis schema for the Core profile (<a href="#S2.2.5">Sec. 2.2.5</a>) is  located at <a href="http://www.w3.org/TR/speech-synthesis11/synthesis.xsd" shape="rect">http://www.w3.org/TR/speech-synthesis11/synthesis.xsd</a>, and the schema for the Extended profile (<a href="#S2.2.5">Sec. 2.2.5</a>) is located at <a href="http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd" shape="rect">http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd</a>.</p>

<p>Note: the synthesis schemas  include no-namespace  schemas for the Core and Extended profiles, located respectively at <a href="http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd" shape="rect">http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd</a> and <a href="http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace-extended.xsd" shape="rect">http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace-extended.xsd</a>, which <em title="MAY in RFC 2119 context" class="RFC2119">MAY</em> be used as a basis for specifying Speech Synthesis Markup Language Fragments (<a href="#S2.2.1" shape="rect">Sec. 2.2.1</a>) embedded in non-synthesis namespace schemas.

Also, for stability, it is <em title="RECOMMENDED in RFC 2119 context" class="RFC2119">RECOMMENDED</em> that you use the following dated URIs for the above schema files:</p>
<ul>
<li><a href="http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd">
http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd</a>
  (for http://www.w3.org/TR/speech-synthesis11/synthesis.xsd)</li>

<li><a href="http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis-extended.xsd">
http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis-extended.xsd</a>
  (for http://www.w3.org/TR/speech-synthesis11/synthesis-extended.xsd)</li>

<li><a href="http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis-nonamespace.xsd">
http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis-nonamespace.xsd</a>
  (for http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace.xsd)</li>

<li><a href="http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis-nonamespace-extended.xsd">
http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis-nonamespace-extended.xsd</a>
  (for http://www.w3.org/TR/speech-synthesis11/synthesis-nonamespace-extended.xsd)</li>
</ul>
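
<p>For illustration, a document root that follows this recommendation for the Core profile might look like the sketch below. Only the schema URI in the xsi:schemaLocation attribute differs from the form used in the other examples in this document; the spoken text is a placeholder.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;!-- Placeholder content --&gt;
  Hello, world.
&lt;/speak&gt;
</pre>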

<h2 id="g47"><a id="AppE" name="AppE" shape="rect">Appendix E</a>: Example SSML </h2>

<p><b>This appendix is informative.</b></p>

<p>The following is an example of reading headers of email messages. The <a href="#edef_paragraph" class="eref" shape="rect">p</a> and <a href="#edef_sentence" class="eref" shape="rect">s</a> elements are used to mark the text structure. The <a href="#edef_break" class="eref" shape="rect">break</a> element is placed before the time and has the effect of marking the time as important information for the listener to pay attention to. The <a href="#edef_prosody" class="eref" shape="rect">prosody</a> element is used to slow the speaking rate of the email subject so that the user has extra time to listen and write down the details.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  &lt;p&gt;
    &lt;s&gt;You have 4 new messages.&lt;/s&gt;
    &lt;s&gt;The first is from Stephanie Williams and arrived at &lt;break/&gt; 3:45pm.
    &lt;/s&gt;
    &lt;s&gt;
      The subject is &lt;prosody rate="-20%"&gt;ski trip&lt;/prosody&gt;
    &lt;/s&gt;
  &lt;/p&gt;
&lt;/speak&gt;
</pre>

<p>The following example combines audio files and different spoken voices to provide information on a collection of music.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;

  &lt;p&gt;
    &lt;voice gender="male"&gt;
      &lt;s&gt;Today we preview the latest romantic music from Example.&lt;/s&gt;

      &lt;s&gt;Hear what the Software Reviews said about Example's newest hit.&lt;/s&gt;
    &lt;/voice&gt;
  &lt;/p&gt;

  &lt;p&gt;
    &lt;voice gender="female"&gt;
      He sings about issues that touch us all.
    &lt;/voice&gt;
  &lt;/p&gt;

  &lt;p&gt;
    &lt;voice gender="male"&gt;
      Here's a sample.  &lt;audio src="http://www.example.com/music.wav"/&gt;
      Would you like to buy it?
    &lt;/voice&gt;
  &lt;/p&gt;

&lt;/speak&gt;
</pre>

<p>It is often the case that an author wishes to include a bit of foreign text (say, a movie title) in an application without having to switch languages (for example via the  <a href="#edef_lang" class="eref" shape="rect">lang</a> element). A simple way to do this is shown here. In this example the synthesis processor would render the movie name using the pronunciation rules of the container language ("en-US" in this case), similar to how a reader who doesn't know the foreign language might try to read (and pronounce) it.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  
  The title of the movie is:
  "La vita è bella"
  (Life is beautiful),
  which is directed by Roberto Benigni.
&lt;/speak&gt;
</pre>

<p>With some additional work the output quality can be improved tremendously either by creating a custom pronunciation in an external lexicon (see <a href="#S3.1.5" shape="rect">Section 3.1.5</a>) or via the <a href="#edef_phoneme" class="eref" shape="rect">phoneme</a> element as shown in the next example.</p>

<p>It is worth noting that IPA alphabet support is an  <em title="OPTIONAL in RFC 2119 context" class="RFC2119">OPTIONAL</em> feature and that phonemes for an external language may be rendered with some approximation (see <a href="#S3.1.5" shape="rect">Section 3.1.5</a> for details). The following example only uses phonemes common to US English.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="ISO-8859-1"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;
  
  The title of the movie is: 
  &lt;phoneme alphabet="ipa"
    ph="&amp;#x2C8;l&amp;#x251; &amp;#x2C8;vi&amp;#x2D0;&amp;#x27E;&amp;#x259; &amp;#x2C8;&amp;#x294;e&amp;#x26A; &amp;#x2C8;b&amp;#x25B;l&amp;#x259;"&gt; 
  La vita è bella &lt;/phoneme&gt;
  &lt;!-- The IPA pronunciation is <span class="ipa">ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</span> --&gt;
  (Life is beautiful), 
  which is directed by 
  &lt;phoneme alphabet="ipa"
    ph="&amp;#x279;&amp;#x259;&amp;#x2C8;b&amp;#x25B;&amp;#x2D0;&amp;#x279;&amp;#x27E;o&amp;#x28A; b&amp;#x25B;&amp;#x2C8;ni&amp;#x2D0;nji"&gt; 
  Roberto Benigni &lt;/phoneme&gt;
  &lt;!-- The IPA pronunciation is <span class="ipa">ɹəˈbɛːɹɾoʊ bɛˈniːnji</span> --&gt;

  &lt;!-- Note that in actual practice an author might change the
     encoding to UTF-8 and directly use the Unicode characters in
     the document rather than using the escapes as shown.
     The escaped values are shown for ease of copying. --&gt;
&lt;/speak&gt;
</pre>

<h3 id="g46">SMIL Integration Example</h3>

<p>The SMIL language [<a href="#ref-smil3" shape="rect">SMIL3</a>] is an XML-based multimedia control language. It is especially well suited for describing dynamic media applications that include synthetic speech output.</p>

<p>File <b>'greetings.ssml'</b> contains the following:</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US"&gt;

  &lt;s&gt;
    &lt;mark name="greetings"/&gt;
    &lt;emphasis&gt;Greetings&lt;/emphasis&gt; from the &lt;sub alias="World Wide Web Consortium"&gt;W3C&lt;/sub&gt;!
  &lt;/s&gt;
&lt;/speak&gt;
</pre>

<p><em>SMIL Example 1:</em> W3C logo image appears, and then one second later, the speech sequence is rendered. File <b>'greetings.smil'</b> contains the following:</p>

<pre class="example" xml:space="preserve">
&lt;smil xmlns="http://www.w3.org/ns/SMIL" version="3.0" baseProfile="Language"&gt;
  &lt;head&gt;
    &lt;top-layout width="640" height="320"&gt;
      &lt;region id="whole" width="640" height="320"/&gt;
    &lt;/top-layout&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;par&gt;
      &lt;img src="http://www.w3.org/Icons/w3c_home" region="whole" begin="0s"/&gt;
      &lt;ref src="greetings.ssml" begin="1s"/&gt;
    &lt;/par&gt;
  &lt;/body&gt;
&lt;/smil&gt;
</pre>

<p><em>SMIL Example 2:</em> W3C logo image appears, then clicking on the image causes it to disappear and the speech sequence to be rendered. File <b>'greetings.smil'</b> contains the following:</p>

<pre class="example" xml:space="preserve">
&lt;smil xmlns="http://www.w3.org/ns/SMIL" version="3.0" baseProfile="Language"&gt;
  &lt;head&gt;
    &lt;top-layout width="640" height="320"&gt;
      &lt;region id="whole" width="640" height="320"/&gt;
    &lt;/top-layout&gt;
  &lt;/head&gt;
  &lt;body&gt;
    &lt;seq&gt;
      &lt;img id="logo" src="http://www.w3.org/Icons/w3c_home" region="whole" begin="0s" end="logo.activateEvent"/&gt;
      &lt;ref src="greetings.ssml"/&gt;
    &lt;/seq&gt;
  &lt;/body&gt;
&lt;/smil&gt;
</pre>

<h3 id="AppFVoiceXML">VoiceXML Integration Example</h3>
<p>The following is an example of SSML in VoiceXML (see <a href="#S2.3.3" shape="rect">Section 2.3.3</a>) for <a href="#term-voicebrowser" shape="rect">voice browser</a> applications. It is worth noting that the VoiceXML namespace includes the SSML namespace elements and attributes. See Appendix O of [<a href="#ref-vxml" shape="rect">VXML</a>] for details.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0" encoding="UTF-8"?&gt; 
&lt;vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
  xsi:schemaLocation="http://www.w3.org/2001/vxml 
   http://www.w3.org/TR/voicexml20/vxml.xsd"&gt;
   &lt;form&gt;
      &lt;block&gt;
         &lt;prompt&gt;
           &lt;emphasis&gt;Welcome&lt;/emphasis&gt; to the Bird Seed Emporium.
           &lt;audio src="rtsp://www.birdsounds.example.com/thrush.wav"/&gt;
           We have 250 kilogram drums of thistle seed for
           $299.95
           plus shipping and handling this month.
           &lt;audio src="http://www.birdsounds.example.com/mourningdove.wav"/&gt;
         &lt;/prompt&gt;
      &lt;/block&gt;
   &lt;/form&gt;
&lt;/vxml&gt;
</pre>
<h2 id="SAppF"><a id="AppF" name="AppF" shape="rect">Appendix F</a>: Changes since SSML 1.0</h2>

<p><b>This appendix is informative.</b></p>

<p>Authors who modify an SSML 1.0 conformant document for a
synthesis processor that supports only SSML 1.1 can use the following
compatibility notes:
</p>

<ul>

<li>SSML 1.0 conformant elements requiring no changes for SSML 1.1
are: &lt;break&gt;, &lt;emphasis&gt;, &lt;mark&gt;, &lt;metadata&gt;, &lt;meta&gt;, &lt;sub&gt;, &lt;say-as&gt;, &lt;desc&gt;, and partially &lt;prosody&gt; (excluding the rate and volume attributes).</li>

<li>Elements with attribute changes in SSML 1.1 are: &lt;s&gt;, &lt;p&gt;, &lt;phoneme&gt;, &lt;audio&gt;, and &lt;speak&gt;. Their SSML 1.0 syntax and semantics remain compatible with SSML 1.1.</li>

<li>Elements with changed syntax and/or semantics between SSML 1.0 and SSML 1.1 are: &lt;lexicon&gt;, &lt;voice&gt;, and partially &lt;prosody&gt; (rate and volume attributes only).</li>

<li>New elements in SSML 1.1 are: &lt;lang&gt;, &lt;lookup&gt;, and &lt;token&gt;/&lt;w&gt;.</li>

</ul>
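
<p>The categories above can be illustrated with a minimal sketch of an SSML 1.1 document. It combines an element that is unchanged from SSML 1.0 (&lt;emphasis&gt;), the changed rate and volume attributes of &lt;prosody&gt;, and the new &lt;lang&gt; and &lt;token&gt; elements. The text content is hypothetical and chosen only for illustration; because no xsi:schemaLocation is given, the Core profile is assumed.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- Unchanged from SSML 1.0 --&gt;
  &lt;emphasis&gt;Welcome&lt;/emphasis&gt; to the language lesson.

  &lt;!-- New in SSML 1.1: lang indicates the language of the contained text --&gt;
  The French word for cat is &lt;lang xml:lang="fr-FR"&gt;chat&lt;/lang&gt;.

  &lt;!-- New in SSML 1.1: token marks the boundaries of a single token --&gt;
  The next stop is &lt;token&gt;New York&lt;/token&gt;.

  &lt;!-- Changed in SSML 1.1: rate is a non-negative percentage
       and a numerical volume is a signed value in dB --&gt;
  &lt;prosody rate="80%" volume="+6dB"&gt;Please listen carefully.&lt;/prosody&gt;
&lt;/speak&gt;
</pre>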


<p>The following is a consolidated list of all changes since SSML 1.0.</p>
<ul>
  <li>In 3.1.7, added &lt;token&gt; element to mark the boundaries of a word. Added &lt;w&gt; as an alias for &lt;token&gt;. </li>
  <li>Added a role attribute to the &lt;token&gt; element to provide more explicit linkage with PLS lexicons.</li>
  <li>In 3.1.2, xml:lang is defined to describe the language of the text content and is added onto the new &lt;token&gt; element.</li>
  <li>Added xml:id attribute (new section after 3.1.3 xml:base) onto &lt;lexicon&gt;, &lt;p&gt;, &lt;s&gt;, &lt;token&gt;, and &lt;w&gt; elements to enhance external references into SSML content.</li>
  <li>In 3.1.5 lexicon, added &lt;lookup&gt; element to control which lexicons are currently in use.  &lt;lexicon&gt; now only defines which lexicons are used in the document.</li>
  <li>Removed general text describing how text may be mapped to entries in the lexicon. </li>
  <li>The default &lt;lexicon&gt; type is now "application/pls+xml", as defined by the PLS 1.0 specification.</li>
  <li>Introduced the notion of a Pronunciation Alphabet Registry that would maintain a list of registered values for the alphabet attribute of the &lt;phoneme&gt; element.</li>
  <li>Removed the xml:lang attribute from the &lt;voice&gt; element to reduce confusion.</li>
  <li>Added &lt;lang&gt; element to allow setting xml:lang for arbitrary text content.</li>
  <li>Clarified in &lt;voice&gt; description that indication of language and voice are independent, no synthesis processor is required to support all combinations thereof, and processors must document behavior for every combination thereof.</li>
  <li>3.1.5: Now mandates that if a referenced lexicon is a PLS document, then the information in it must be used by the processor. </li>
  <li>3.1.5.2: Clarified that the processor already has built-in system lexicons whose values are overridden by use of the &lt;lexicon&gt; and &lt;lookup&gt; elements.</li>
  <li>Updated entire document to allow for XML and XMLNS 1.1 in addition to 1.0. Clarified in definition of URI that IRIs are allowed and added an informative reference to RFC3987. </li>
  <li>Completely revamped how voice selection and language speaking control are done.<br /> 
  In 3.2.1, added "languages", "required", "ordering", and "onvoicefailure" attributes, introduced a new voice selection algorithm. Voice selection is now scoped.<br />
  Added new "onlangfailure" attribute (new section 3.1.13) on all elements that take the "xml:lang" attribute:  &lt;speak&gt;, &lt;lang&gt;, &lt;desc&gt;, &lt;p&gt;, &lt;s&gt;, &lt;token&gt;, and &lt;w&gt;.</li>
  <li>Added trimming attributes to &lt;speak&gt; (section 3.1.1) to accommodate expected VoiceXML 3 needs. </li>
  <li>In 3.3.1, added trimming, soundLevel, and speed attributes to &lt;audio&gt; to accommodate expected VoiceXML 3 needs. In 1.2 made reference to the new capabilities of the &lt;audio&gt; element.</li>
  <li>Revised error text in &lt;audio&gt; description to better explain appropriate notification behavior.</li>
  <li>In 2.2.3, added example of using Ruby within SSML. Added Informative reference to Ruby. </li>
  <li>Changed many instances of "word" to "token" to clarify when the specification is referring to tokens; this is particularly relevant in discussions of parsing and linkage with lexicons.</li>
  <li>Added text in status section explaining the motivation and background for SSML 1.1. Also added the requirements document and the workshops to the informative references section. </li>
  <li>Applied styling to relevant uses of RFC2119 keywords. </li>
  <li>Clarified in status section that we are referring to natural (human) languages. Clarified in 3.1.2 that xml:lang indicates the language of the <em>written</em> content.</li>
  <li>Moved a paragraph from 3.1.5 to 3.1.5.2 that explains when the information in a lexicon document must be used. </li>
  <li>Clarified in 1.2 that the tokens may not span markup elements except within the &lt;token&gt; and &lt;w&gt; elements. </li>
  <li>In 3.1.10, added a new optional "type" attribute with values of "default" (the default) and "ruby". </li>
  <li>Split into two profiles: Core and Extended<br />
  Added text in section 2.2.4 allowing for processor conformance variations as given in individual sections (intended to support profile variations).<br />
  In Appendix D there are now four schemas:  no-namespace and regular schemas for the Core and Extended profiles. Updated all examples accordingly. <br />
  Added new section 2.2.5 describing the two profiles. </li>
  <li>In 3.1.2, removed the list of elements that "should be rendered in a manner appropriate to the current language" because it is no longer necessary.</li>
  <li>In 1.2, added text to step 6 explaining the default volume/soundLevel, speed/rate, and pitch/frequency for voices and recordings in the document. </li>
  <li>Renumbered section 2.2.5 to be 2.2.6. Inserted new section 2.2.5 defining profiles. Added sub-sections to 2.2.1, 2.2.2, and 2.2.4 for the two profiles. Updated relevant text in 3.3.1.1, 3.3.1.2, and 3.3.1.4.</li>
  <li>In 3.2.4, adjusted &lt;prosody&gt; "volume" definition to align better with the new &lt;audio&gt; "soundLevel" attribute definition (now a logarithmic control), also noting potential difference between behavior when using labels and behavior when using numerical values. </li>
  <li>In 3.2.4, for the "duration" attribute, adjusted the wording about contained elements to match that of the other attributes ("contained text").</li>
  <li>In 3.2.4, changed definition of non-negative percentage to be an unsigned number followed by a '%'. </li>
  <li>In 3.2.4, updated "rate" attribute description to use "non-negative percentage" instead of "relative value".</li>
  <li>In 3.3.1.2, changed SMIL reference to be SMIL3 and added SMIL3 as a Normative reference in the references section. </li>
  <li>In 3.1.5.1 (lexicon) and 3.3.1 (audio), added maxage and maxstale attributes. </li>
  <li>In 2.1, clarified that when no xsi:schemaLocation is given the Core profile is to be assumed.</li>
  <li>In 3.2.1, added link to the "onlangfailure" attribute. Also  added "Voice description" section and links to it from the "languages" attribute description.</li>
  <li>Changed all references to RFC4647 to be "BCP 47, Matching of Language Tags". BCP47 is now a Normative Reference.</li>
  <li>Removed stale reference to RFC3066.</li>
  <li>In 3.1.2, added text explaining the relationship between xml:lang, onlangfailure, and how a voice will speak text.</li>
  <li>In 3.3.1, made "src" optional and added "fetchhint" and "fetchtimeout" attributes to the &lt;audio&gt; element.</li>
  <li>In 3.1.5.1, added "fetchtimeout" attribute to the &lt;lexicon&gt; element.  Also added text explaining what is expected when there is a failure fetching the lexicon.</li>
  <li>In 2.3, clarified that non-SSML attributes are permitted as well. The schemas have been updated to allow for non-SSML attributes on all SSML elements and non-SSML elements as children of all SSML elements.</li>
  <li>Removed the DTD (Appendix E), relabeled Appendix F (Example SSML) as Appendix E, and relabeled Appendix G (Changes since SSML 1.0) as Appendix F. </li>
  <li>Added an informative reference to "Architecture of the World Wide Web, Volume One". Updated the definition of URI in section 1.5 to refer to it.</li>
  <li>In 3.1.10, noted that white space characters may occur in IPA strings and that they have no effect on the pronunciation. </li>
  <li>Removed DOCTYPE from Section 2 and all examples because there is no longer a DTD to reference.</li>

</ul>
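
<p>Several of the changes listed above concern lexicon activation, voice selection, and language failure handling. The following is a minimal sketch of how they might appear together in a document, assuming a PLS lexicon exists at the hypothetical URI shown; the xml:id value "pets" and the text content are likewise illustrative only.</p>

<pre class="example" xml:space="preserve">
&lt;?xml version="1.0"?&gt;
&lt;speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  &lt;!-- lexicon declares a lexicon for the document and names it with xml:id --&gt;
  &lt;lexicon uri="http://www.example.com/lexicon.pls" xml:id="pets"/&gt;

  &lt;!-- lookup controls where the named lexicon is in use --&gt;
  &lt;lookup ref="pets"&gt;
    My &lt;token&gt;tortoise&lt;/token&gt; is forty years old.
  &lt;/lookup&gt;

  &lt;!-- voice: "languages" and "gender" must both match; onvoicefailure
       names the desired behavior when no available voice matches --&gt;
  &lt;voice languages="en-US" gender="female"
         required="languages gender" onvoicefailure="priorityselect"&gt;
    This sentence requests a female voice that speaks US English.
    &lt;!-- onlangfailure: with "ignoretext", text in a language that
         cannot be spoken is not rendered --&gt;
    &lt;s xml:lang="fr-FR" onlangfailure="ignoretext"&gt;Bonjour tout le monde.&lt;/s&gt;
  &lt;/voice&gt;
&lt;/speak&gt;
</pre>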


<h2 id="SAppG"><a id="AppG" name="AppG" shape="rect">Appendix G</a>: Changes since last draft</h2>

<p><b>This appendix is informative.</b></p>

<p>
Since the publication of the Proposed Recommendation of the SSML 1.1
specification, the following minor editorial changes have been
made to the draft in response to comments.
</p>

<ul>
<li>
In <a href="#S4">section 4</a>, updated URIs of IPA sites.
</li>

<li>
In <a href="#AppD">Appendix D</a>,
changed schemaLocation URI
from
"http://www.w3.org/TR/2010/PR-speech-synthesis11-20100223/"
to 
"http://www.w3.org/TR/2010/REC-speech-synthesis11-20100907/".
</li>

<li>
In <a href="#AppF">Appendix F</a>,
added a note on backwards compatibility with SSML 1.0.
</li>

<li>
Fixed errors in
the no-namespace schema
(<a href="synthesis-nonamespace.xsd">synthesis-nonamespace.xsd</a>).
</li>

</ul>


</body>
</html>