<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" />
<title>Use Cases for Possible Future EMMA Features</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
/*<![CDATA[*/
code           { font-family: monospace; }

div.constraint,
div.issue,
div.note,
div.notice     { margin-left: 2em; }

ol.enumar      { list-style-type: decimal; }
ol.enumla      { list-style-type: lower-alpha; }
ol.enumlr      { list-style-type: lower-roman; }
ol.enumua      { list-style-type: upper-alpha; }
ol.enumur      { list-style-type: upper-roman; }


div.exampleInner pre { margin-left: 1em;
                       margin-top: 0em; margin-bottom: 0em}
div.exampleOuter {border: 4px double gray;
                  margin: 0em; padding: 0em}
div.exampleInner { background-color: #d5dee3;
                   border-top-width: 4px;
                   border-top-style: double;
                   border-top-color: #d3d3d3;
                   border-bottom-width: 4px;
                   border-bottom-style: double;
                   border-bottom-color: #d3d3d3;
                   padding: 4px; margin: 0em }
div.exampleWrapper { margin: 4px }
div.exampleHeader { font-weight: bold;
                    margin: 4px}

table {
        width:80%;
                border:1px solid #000;
                border-collapse:collapse;
                font-size:90%;
        }

td,th{
                border:1px solid #000;
                border-collapse:collapse;
                padding:5px;
        }       


caption{
                background:#ccc;
                font-size:140%;
                border:1px solid #000;
                border-bottom:none;
                padding:5px;
                text-align:center;
        }

img.center {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
p.caption {
  text-align: center
}


.RFC2119 {
  text-transform: lowercase;
  font-style: italic;
}
/*]]>*/
</style>

<style type="text/css">
/*<![CDATA[*/
 p.c1 {font-weight: bold}
/*]]>*/
</style>

<link href="http://www.w3.org/StyleSheets/TR/W3C-WG-NOTE.css" type="text/css" rel="stylesheet" />
<meta content="MSHTML 6.00.6000.16762" name="GENERATOR" />
<style type="text/css">
/*<![CDATA[*/
 ol.c2 {list-style-type: lower-alpha}
 li.c1 {list-style: none}
/*]]>*/
</style>
</head>
<body xml:lang="en" lang="en">
<div class="head"><a href="http://www.w3.org/"><img alt="W3C" src=
"http://www.w3.org/Icons/w3c_home" width="72" height="48" /></a>
<h1 id="title">Use Cases for Possible Future EMMA Features</h1>
<h2 id="w3c-doctype">W3C Working Group Note <i>15</i> <i>December</i> <i>2009</i></h2>

<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215">http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215</a></dd>
<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/emma-usecases">http://www.w3.org/TR/emma-usecases</a></dd>
<dt>Previous version:</dt>
<dd><em>This is the first publication.</em></dd>

<dt>Editor:</dt>
<dd>Michael Johnston, AT&amp;T</dd>
<dt>Authors:</dt>
<dd>Deborah A. Dahl, Invited Expert</dd>
<dd>Ingmar Kliche, Deutsche Telekom AG</dd>
<dd>Paolo Baggia, Loquendo</dd>
<dd>Daniel C. Burnett, Voxeo</dd>
<dd>Felix Burkhardt, Deutsche Telekom AG</dd>
<dd>Kazuyuki Ashimura, W3C</dd>
</dl>
<p class="copyright"><a href=
"http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
© 2009 <a href="http://www.w3.org/"><acronym title=
"World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a href=
"http://www.csail.mit.edu/"><acronym title=
"Massachusetts Institute of Technology">MIT</acronym></a>, <a href=
"http://www.ercim.org/"><acronym title=
"European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved.
W3C <a href=
"http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href=
"http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>
and <a href=
"http://www.w3.org/Consortium/Legal/copyright-documents">document
use</a> rules apply.</p>
</div>
<!-- end of head div -->
<hr title="Separator for header" />
<h2 id="abstract">Abstract</h2>
<p>The EMMA: Extensible MultiModal Annotation specification defines
an XML markup language for capturing and providing metadata on the
interpretation of inputs to multimodal systems. Throughout the
implementation report process and discussion since EMMA 1.0 became
a W3C Recommendation, a number of new possible use cases for the
EMMA language have emerged. These include the use of EMMA to
represent multimodal output, biometrics, emotion, sensor data,
multi-stage dialogs, and interactions with multiple users. In this
document, we describe these use cases and illustrate how the EMMA
language could be extended to support them.</p>

<h2 id="status">Status of this Document</h2>

<p><em>This section describes the status of this document at the
time of its publication. Other documents may supersede this
document. A list of current W3C publications and the latest
revision of this technical report can be found in the <a href=
"http://www.w3.org/TR/">W3C technical reports index</a> at
http://www.w3.org/TR/.</em></p>

<p>This document is a W3C Working Group Note published on 15 December
2009. This is the first publication of this document and it represents
the views of the W3C Multimodal Interaction Working Group at the time
of publication. The document may be updated as new technologies emerge
or mature. Publication as a Working Group Note does not imply
endorsement by the W3C Membership. This is a draft document and may be
updated, replaced or obsoleted by other documents at any time. It is
inappropriate to cite this document as other than work in
progress.</p>

<p>This document is one of a series produced by the
<a href="http://www.w3.org/2002/mmi/">Multimodal Interaction WorkingGroup</a>,
part of the <a href="http://www.w3.org/2002/mmi/Activity">W3C Multimodal Interaction
Activity</a>.

Since <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
Recommendation, a number of new possible use cases for the EMMA language have
emerged, e.g., the use of EMMA to represent multimodal output, biometrics,
emotion, sensor data, multi-stage dialogs and interactions with multiple users.

The Working Group has therefore been working on a document capturing use cases
and issues for a series of possible extensions to EMMA.

The intention of publishing this Working Group Note is to seek feedback on
these use cases.
</p>

<p>Comments on this document can be sent to <a href=
"mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>, the
public forum for discussion of the W3C's work on Multimodal
Interaction. To subscribe, send an email to <a href=
"mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>
with the word subscribe in the subject line (include the word
unsubscribe if you want to unsubscribe). The <a href=
"http://lists.w3.org/Archives/Public/www-multimodal/">archive</a>
for the list is accessible online.</p>

<p> This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 February 2004 W3C Patent Policy</a>. W3C maintains a <a rel="disclosure" href="http://www.w3.org/2004/01/pp-impl/34607/status">public list of any patent disclosures</a> made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential Claim(s)</a> must disclose the information in accordance with <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section 6 of the W3C Patent Policy</a>. </p>

<h2 id="contents">Table of Contents</h2>
<ul>
<li>1. <a href="#s1">Introduction</a></li>
<li>2. <a href="#s2">EMMA use cases</a></li>
</ul>
<ul class="tocline">
<li>2.1 <a href="#s2.1">Incremental results for streaming
modalities such as haptics, ink, monologues, dictation</a></li>
<li>2.2 <a href="#s2.2">Representing biometric information</a></li>
<li>2.3 <a href="#s2.3">Representing emotion in EMMA</a></li>
<li>2.4 <a href="#s2.4">Richer semantic representations in
EMMA</a></li>
<li>2.5 <a href="#s2.5">Representing system output in EMMA</a></li>
<li class="c1">
<ul class="tocline">
<li>2.5.1 <a href="#s2.5.1">Abstracting output from specific
modalities</a></li>
<li>2.5.2 <a href="#s2.5.2">Coordination of outputs distributed
over multiple different modalities</a></li>
</ul>
</li>
<li>2.6 <a href="#s2.6">Representation of dialogs in EMMA</a></li>
<li>2.7 <a href="#s2.7">Logging, analysis, and annotation</a></li>
<li class="c1">
<ul class="tocline">
<li>2.7.1 <a href="#s2.7.1">Log analysis</a></li>
<li>2.7.2 <a href="#s2.7.2">Log annotation</a></li>
</ul>
</li>
<li>2.8 <a href="#s2.8">Multi-sentence inputs</a></li>
<li>2.9 <a href="#s2.9">Multi-participant interactions</a></li>
<li>2.10 <a href="#s2.10">Capturing sensor data such as GPS in
EMMA</a></li>
<li>2.11 <a href="#s2.11">Extending EMMA from NLU to also represent
search or database retrieval results</a></li>
<li>2.12 <a href="#s2.12">Supporting other semantic representation
forms in EMMA</a></li>
</ul>
<ul>
<li><a href="#references">General References</a></li>
</ul>
<hr title="Separator for introduction" />
<h2 id="s1">1. Introduction</h2>
<p>This document presents a set of use cases for possible new
features of the Extensible MultiModal Annotation (EMMA) markup
language. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was
designed primarily to be used as a data interchange format by
systems that provide semantic interpretations for a variety of
inputs, including but not necessarily limited to, speech, natural
language text, GUI and ink input. EMMA 1.0 provides a set of
elements for containing the various stages of processing of a
user's input and a set of elements and attributes for specifying
various kinds of metadata such as confidence scores and timestamps.
<a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
Recommendation on February 10, 2009.</p>
<p>A number of possible extensions to <a href=
"http://www.w3.org/TR/emma/">EMMA 1.0</a> have been identified
through discussions with other standards organizations,
implementers of EMMA, and internal discussions within the W3C
Multimodal Interaction Working Group. This document focuses on the
following use cases:</p>
<ol>
<li>Representing incremental results for streaming modalities such
as haptics, ink, monologues, dictation, where it is desirable to
have partial results available before the full input finishes.</li>
<li>Representing biometric results such as the results of speaker
verification or speaker identification (briefly covered in EMMA
1.0).</li>
<li>Representing emotion, for example, as conveyed by intonation
patterns, facial expression, or lexical choice.</li>
<li>Richer semantic representations, for example, integrating EMMA
application semantics with ontologies.</li>
<li>Representing system output in addition to user input, including
topics such as:</li>
<li class="c1">
<ol class="c2">
<li>Isolating presentation logic from dialog/interaction
management.</li>
<li>Coordination of outputs distributed over multiple different
modalities.</li>
</ol>
</li>
<li>Support for archival functions such as logging, human
annotation of inputs, and data analysis.</li>
<li>Representing full dialogs and multi-sentence inputs in addition
to single inputs.</li>
<li>Representing multi-participant interactions.</li>
<li>Representing sensor data such as GPS input.</li>
<li>Representing the results of database queries or search.</li>
<li>Support for forms of representation of application semantics
other than XML, such as JSON.</li>
</ol>
<p>It may be possible to achieve support for some of these features
without modifying the language, through the use of the
extensibility mechanisms of <a href=
"http://www.w3.org/TR/emma/">EMMA 1.0</a>, such as the
<code>&lt;emma:info&gt;</code> element and application-specific
semantics; however, this would significantly reduce
interoperability among EMMA implementations. If features are of
general value then it would be beneficial to define standard ways
of implementing them within the EMMA language. Additionally,
extensions may be needed to support additional new kinds of input
modalities such as multi-touch and accelerometer input.</p>
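<p>As a purely illustrative sketch of this extensibility route, the
document below carries an accelerometer reading using
application-specific semantics together with
<code>&lt;emma:info&gt;</code>. The child element names
(<code>tilt</code>, <code>device</code>, <code>samplingRate</code>)
are application-defined rather than proposed EMMA vocabulary, and the
<code>emma:mode</code> value is likewise only an example.</p>
<pre>
&lt;emma:emma
  version="1.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:medium="tactile"
    emma:mode="accelerometer"
    emma:verbal="false"&gt;
      &lt;tilt&gt;
        &lt;x&gt;0.12&lt;/x&gt;
        &lt;y&gt;-0.47&lt;/y&gt;
        &lt;z&gt;0.88&lt;/z&gt;
      &lt;/tilt&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:info&gt;
    &lt;device&gt;example-phone&lt;/device&gt;
    &lt;samplingRate&gt;50Hz&lt;/samplingRate&gt;
  &lt;/emma:info&gt;
&lt;/emma:emma&gt;
</pre>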
<p>The W3C Membership and other interested parties are invited to
review this document and send comments to the Working Group's
public mailing list www-multimodal@w3.org <a href=
"http://lists.w3.org/Archives/Public/www-multimodal/">(archive)</a>
.</p>
<h2 id="s2">2. EMMA use cases</h2>
<h3 id="s2.1">2.1 Incremental results for streaming modalities such
as haptic, ink, monologues, dictation</h3>
<p>In EMMA 1.0, EMMA documents were assumed to be created for
completed inputs within a given modality. However, there are
important use cases where it would be beneficial to represent some
level of interpretation of partial results before the input is
complete. For example, in a dictation application, where inputs can
be lengthy it is often desirable to show partial results to give
feedback to the user while they are speaking. In this case, each
new word is appended to the previous sequence of words. Another use
case would be incremental ASR, either for dictation or dialog
applications, where previous results might be replaced as more
evidence is collected. As more words are recognized and provide
more context, earlier word hypotheses may be updated. In this
scenario it may be necessary to replace the previous hypothesis
with a revised one.</p>
<p>In this section, we discuss how the EMMA standard could be
extended to support incremental or streaming results in the
processing of a single input. Some key considerations and areas for
discussion are:</p>
<ol>
<li>Do we need an identifier for a particular stream? Or is
<code>emma:source</code> sufficient? Subsequent messages (carrying
information for a particular stream) may need to have the same
identifier.</li>
<li>Do we need a sequence number to indicate order? Or are
timestamps sufficient (though optional)?</li>
<li>Do we need to mark "begin", "in progress" and "end" of a
stream? There are streams with a particular start and end, like a
dictation. Note that sensors may never explicitly end a
stream.</li>
<li>Do we always append information? Or do we also replace previous
data? A dictation application will probably append new text. But do
we consider sensor data (such as GPS position or device tilt) as
streaming or as "final" data?</li>
</ol>
<p>In the example below for dictation, we show how three new
attributes <code>emma:streamId</code>,
<code>emma:streamSeqNr</code>, and <code>emma:streamProgress</code>
could be used to annotate each result with metadata regarding its
position and status within a stream of input. In this example, the
<code>emma:streamId</code> is an identifier which can be used to
show that different <code>emma:interpretation</code> elements are
members of the same stream. The <code>emma:streamSeqNr</code>
attribute provides a numerical order to elements in the stream
while <code>emma:streamProgress</code> indicates the start of the
stream (and whether to expect more interpretations within the same
stream), and the end of the stream. This is an instance of the
'append' scenario for partial results in EMMA.</p>
<table width="120">
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">User</td>
<td>Hi Joe the meeting has moved</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:confidence="0.75" 
    emma:tokens="Hi Joe the meeting has moved" 
    emma:streamId="id1" 
    emma:streamSeqNr="0" 
    emma:streamProgress="begin"&gt;
      &lt;emma:literal&gt;
      Hi Joe the meeting has moved
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
<tr>
<td width="50">User</td>
<td>to friday at four</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int2"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:confidence="0.75" 
    emma:tokens="to friday at four" 
    emma:streamId="id1" 
    emma:streamSeqNr="1" 
    emma:streamProgress="end"&gt;
      &lt;emma:literal&gt;
      to friday at four
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
</table>
<p>In the example below, a speech recognition hypothesis for the
whole string is updated once more words have been recognized. This
is an instance of the 'replace' scenario for partial results in
EMMA. Note that the <code>emma:streamSeqNr</code> is the same for
each interpretation in this case.</p>
<table width="120">
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">User</td>
<td>Is there a Pisa</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="dialog"
    emma:confidence="0.7"
    emma:tokens="is there a pisa" 
    emma:streamId="id2" 
    emma:streamSeqNr="0" 
    emma:streamProgress="begin"&gt;
      &lt;emma:literal&gt;
      is there a pisa
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
<tr>
<td width="50">User</td>
<td>Is there a pizza restaurant</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int2" 
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="dialog"
    emma:confidence="0.9"
    emma:tokens="is there a pizza restaurant" 
    emma:streamId="id2" 
    emma:streamSeqNr="0" 
    emma:streamProgress="end"&gt; 
      &lt;emma:literal&gt;
      is there a pizza restaurant
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
</table>
<p>One issue for the 'replace' case of incremental results is how
to specify that a result replaces several of the previously
received results. For example, a system could receive partial
results consisting of each word of an utterance in turn, and then a
final result which is the final recognition for the whole sequence
of words. One approach to this problem would be to allow
<code>emma:streamSeqNr</code> to specify a range of inputs to be
replaced. For example, if the <code>emma:streamSeqNr</code> for
each of three single-word results was 1, 2, and 3, a final
revised result could be marked as
<code>emma:streamSeqNr="1-3"</code>, indicating that it is a revised
result for those three words.</p>
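<p>A sketch of such a revised result, using the attributes proposed
above with purely illustrative values, might look as follows:</p>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int4"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:tokens="find pizza restaurants"
    emma:streamId="id3"
    emma:streamSeqNr="1-3"
    emma:streamProgress="end"&gt;
      &lt;emma:literal&gt;
      find pizza restaurants
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre>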
<p>One issue is whether timestamps might be used to track ordering
instead of introducing new attributes. One problem is that
timestamp attributes are not required and may not always be
available. Also, as shown in the example, chunks of input in a
stream may not always arrive in sequential order. Even with timestamps
providing an order, some kind of 'begin' and 'end' flag
(like <code>emma:streamProgress</code>) is needed to indicate the
beginning and end of transmission of streamed input. Moreover,
timestamps do not provide sufficient information to detect whether
a message has been lost.</p>
<p>Another possibility to explore for representation of incremental
results would be to use an <code>&lt;emma:sequence&gt;</code>
element containing the interim results and a derived result which
contains the combination.</p>
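<p>One possible shape for such a document, shown here only as a
sketch, places the interim results inside an
<code>&lt;emma:sequence&gt;</code> within
<code>&lt;emma:derivation&gt;</code> and derives the combined result
from that sequence:</p>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="full1"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:tokens="hi joe the meeting has moved to friday at four"&gt;
      &lt;emma:literal&gt;
      hi joe the meeting has moved to friday at four
      &lt;/emma:literal&gt;
      &lt;emma:derived-from resource="#seq1" composite="false"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:sequence id="seq1"&gt;
      &lt;emma:interpretation id="part1"
        emma:tokens="hi joe the meeting has moved"&gt;
          &lt;emma:literal&gt;hi joe the meeting has moved&lt;/emma:literal&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:interpretation id="part2"
        emma:tokens="to friday at four"&gt;
          &lt;emma:literal&gt;to friday at four&lt;/emma:literal&gt;
      &lt;/emma:interpretation&gt;
    &lt;/emma:sequence&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre>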
<p>Another issue to explore is the relationship between incremental
results and the MMI lifecycle events within the <a href=
"http://www.w3.org/TR/mmi-arch/">MMI Architecture</a>.</p>
<h3 id="s2.2">2.2 Representing biometric information</h3>
<p>Biometric technologies include systems designed to identify
someone or verify a claim of identity based on their physical or
behavioral characteristics. These include speaker verification,
speaker identification, face recognition, and iris recognition,
among others. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
provided some capability for representing the results of biometric
analysis through values of the <code>emma:function</code> attribute
such as "verification". However, it did not discuss the specifics
of this use case in any detail. It may be worth exploring further
considerations and consequences of using EMMA to represent
biometric results. As one example, if different biometric results
are represented in EMMA, this would simplify the process of fusing
the outputs of multiple biometric technologies to obtain a more
reliable overall result. It should also make it easier to
take into account non-biometric claims of identity, such as a
statement like "this is Kazuyuki", represented in EMMA, along with
a speaker verification result based on the speaker's voice, which
would also be represented in EMMA. In the following example, we
have extended the set of values for <code>emma:function</code> to
include "identification" for an interpretation showing the results
of a biometric component that picks out an individual from a set of
possible individuals (who are they). This contrasts with
"verification" which is used for verification of a particular user
(are they who they say they are).</p>
<h4 id="biometric_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>an image of a face</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:confidence="0.75"
    emma:medium="visual" 
    emma:mode="photograph" 
    emma:verbal="false" 
    emma:function="identification"&gt; 
      &lt;person&gt;12345&lt;/person&gt;
      &lt;name&gt;Mary Smith&lt;/name&gt; 
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
</tbody>
</table>
<p>One direction to explore further is the relationship between
work on messaging protocols for biometrics within the OASIS
Biometric Identity Assurance Services (<a href="http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=bias">BIAS</a>)
standards committee and EMMA.</p>
<h3 id="s2.3">2.3 Representing emotion in EMMA</h3>
<p>In addition to speech recognition, and other tasks such as
speaker verification and identification, another kind of
interpretation of speech that is of increasing importance is
determination of the emotional state of the speaker, based on, for
example, their prosody, lexical choice, or other features. This
information can be used, for example, to make the dialog logic of
an interactive system sensitive to the user's emotional state.
Emotion detection can also use other modalities such as vision
(facial expression, posture) and physiological sensors such as skin
conductance measurement or blood pressure. Multimodal approaches
where evidence is combined from multiple different modalities are
also of significance for emotion classification.</p>
<p>The creation of a markup language for emotion has been a recent
focus of attention in W3C. Work that originated in the W3C Emotion
Markup Language Incubator Group (<a href=
"http://www.w3.org/2005/Incubator/emotion/XGR-emotionml-20081120/">EmotionML
XG</a>), has now transitioned to the <a href=
"http://www.w3.org/2002/mmi/">W3C Multimodal Working Group</a> and
the <a href="http://www.w3.org/TR/emotionml">EmotionML</a> language
has been published as a working draft. One of the major use cases
for that effort is: "Automatic recognition of emotions from
sensors, including physiological sensors, speech recordings, facial
expressions, etc., as well as from multi-modal combinations of
sensors."</p>
<p>Given the similarities to the technologies and annotations used
for other kinds of input processing (recognition, semantic
classification) which are now captured in EMMA, it makes sense to
explore the use of EMMA for capture of emotional classification of
inputs. However, just as EMMA does not standardize the application
markup for semantic results, it does not make sense to try to
standardize emotion markup within EMMA. One promising approach is
to combine the containers and metadata annotation of EMMA with the
<a href="http://www.w3.org/TR/emotionml">EmotionML</a> markup, as
shown in the following example.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">expression of boredom</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:interpretation id="emo1"
    emma:start="1241035886246"
    emma:end="1241035888246"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="false"
    emma:signal="http://example.com/input345.amr"
    emma:media-type="audio/amr; rate:8000;"
    emma:process="engine:type=emo_class&amp;vn=1.2”&gt;
      &lt;emo:emotion&gt;
        &lt;emo:intensity 
          value="0.1" 
          confidence="0.8"/&gt;
        &lt;emo:category 
          set="everydayEmotions" 
          name="boredom" 
          confidence="0.1"/&gt;
      &lt;/emo:emotion&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>In this example, we use the capabilities of EMMA for describing
the input signal, its temporal characteristics, modality, sampling
rate, audio codec etc. and EmotionML is used to provide the
specific representation of the emotion. Other EMMA container
elements also have strong use cases for emotion recognition. For
example, <code>&lt;emma:one-of&gt;</code> can be used to represent
N-best lists of competing classifications of emotion. The
<code>&lt;emma:group&gt;</code> element could be used to combine a
semantic interpretation of a user input with an emotional
classification, as illustrated in the following example. Note that
all of the general properties of the signal can be specified on the
<code>&lt;emma:group&gt;</code> element.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">spoken input "flights to boston tomorrow" to dialog
system in angry voice</td>
<td>
<pre>
&lt;emma:emma 
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:group id="result1"
    emma:start="1241035886246"
    emma:end="1241035888246"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="false"
    emma:signal="http://example.com/input345.amr"
    emma:media-type="audio/amr; rate:8000;"&gt;
    &lt;emma:interpretation id="asr1"
      emma:tokens="flights to boston tomorrow"
      emma:confidence="0.76"
      emma:process="engine:type=asr_nl&amp;vn=5.2”&gt;
        &lt;flight&gt;
          &lt;dest&gt;boston&lt;/dest&gt;
          &lt;date&gt;tomorrow&lt;/date&gt;
        &lt;/flight&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="emo1"
      emma:process="engine:type=emo_class&amp;vn=1.2”&gt;
      &lt;emo:emotion&gt;
        &lt;emo:intensity 
          value="0.3" 
          confidence="0.8"/&gt;
        &lt;emo:category 
          set="everydayEmotions" 
          name="anger" 
          confidence="0.8"/&gt;
      &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:group-info&gt;
    meaning_and_emotion
    &lt;/emma:group-info&gt;
  &lt;/emma:group&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>The element <code>&lt;emma:group&gt;</code> can also be used to
capture groups of emotion detection results from individual
modalities for combination by a multimodal fusion component or when
automatic recognition results are described together with manually
annotated data. This use case is inspired by <a href=
"http://www.w3.org/2005/Incubator/emotion/XGR-emotion/#AppendixUseCases">
Use case 2b (II)</a> of the Emotion Incubator Group Report. The
following example illustrates the grouping of three
interpretations, namely: a speech analysis emotion classifier, a
physiological emotion classifier measuring blood pressure, and a
human annotator viewing video, for two different media files (from
the same episode) that are synchronized via <code>emma:start</code>
and <code>emma:end</code> attributes. In this case, the
physiological reading is for a subinterval of the video and audio
recording.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">audio, video, and physiological sensor of a test
user acting with a new design.</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:group id="result1"&gt;
    &lt;emma:interpretation id="speechClassification1"      
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="false"
      emma:start="1241035884246"
      emma:end="1241035887246"
      emma:signal="http://example.com/video_345.mov"
      emma:process="engine:type=emo_voice_classifier”&gt;
        &lt;emo:emotion&gt;
          &lt;emo:category 
            set="everydayEmotions" 
            name="anger" 
            confidence="0.8"/&gt;
        &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="bloodPressure1"               
      emma:medium="tactile"
      emma:mode="blood_pressure"
      emma:verbal="false"
      emma:start="1241035885300"
      emma:end="1241035886900"
      emma:signal="http://example.com/bp_signal_345.cvs"
      emma:process="engine:type=emo_physiological_classifier”&gt;
        &lt;emo:emotion&gt;
          &lt;emo:category 
            set="everydayEmotions" 
            name="anger" 
            confidence="0.6"/&gt;
        &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="humanAnnotation1"               
      emma:medium="visual"
      emma:mode="video"
      emma:verbal="false"
      emma:start="1241035884246"
      emma:end="1241035887246"
      emma:signal="http://example.com/video_345.mov"
      emma:process="human:type=labeler&amp;id=1”&gt;
        &lt;emo:emotion&gt;
          &lt;emo:category 
            set="everydayEmotions" 
            name="fear" 
            confidence="0.6"/&gt;
        &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:group-info&gt;
    several_emotion_interpretations
    &lt;/emma:group-info&gt;
  &lt;/emma:group&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>A combination of <code>&lt;emma:group&gt;</code> and
<code>&lt;emma:derivation&gt;</code> could be used to represent a
combined emotional analysis resulting from analysis of multiple
different modalities of the user's behavior. The
<code>&lt;emma:derived-from&gt;</code> and
<code>&lt;emma:derivation&gt;</code> elements can be used to
capture both the fused result and combining inputs in a single EMMA
document. In the following example, visual analysis of user
activity and analysis of their speech have been combined by a
multimodal fusion component to provide a combined multimodal
classification of the user's emotional state. The specifics of the
multimodal fusion algorithm are not relevant here, or to EMMA in
general. Note though that in this case, the multimodal fusion
appears to have compensated for uncertainty in the visual analysis
which gave two results with equal confidence, one for fear and one
for anger. The <code>emma:one-of</code> element is used to capture
the N-best list of multiple competing results from the video
classifier.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">multimodal fusion of emotion classification of user
based on analysis of voice and video</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:interpretation id="multimodalClassification1" 
    emma:medium="acoustic,visual"
    emma:mode="voice,video"
    emma:verbal="false"
    emma:start="1241035884246"
    emma:end="1241035887246"
    emma:process="engine:type=multimodal_fusion”&gt;
      &lt;emo:emotion&gt;
        &lt;emo:category 
          set="everydayEmotions" 
          name="anger" 
          confidence="0.7"/&gt;
      &lt;/emo:emotion&gt;
    &lt;emma:derived-from ref="mmgroup1" composite="true"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:group id="mmgroup1"&gt;
      &lt;emma:interpretation id="speechClassification1"      
        emma:medium="acoustic"
        emma:mode="voice"
        emma:verbal="false"
        emma:start="1241035884246"
        emma:end="1241035887246"
        emma:signal="http://example.com/video_345.mov"
        emma:process="engine:type=emo_voice_classifier”&gt;
          &lt;emo:emotion&gt;
            &lt;emo:category 
              set="everydayEmotions" 
              name="anger" 
              confidence="0.8"/&gt;
          &lt;/emo:emotion&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:one-of id="video_nbest"               
        emma:medium="visual"
        emma:mode="video"
        emma:verbal="false"
        emma:start="1241035884246"
        emma:end="1241035887246"
        emma:signal="http://example.com/video_345.mov"
        emma:process="engine:type=video_classifier"&gt;
        &lt;emma:interpretation id="video_result1"
          &lt;emo:emotion&gt;
            &lt;emo:category 
              set="everydayEmotions" 
              name="anger" 
              confidence="0.5"/&gt;
          &lt;/emo:emotion&gt;
        &lt;/emma:interpretation&gt;
        &lt;emma:interpretation id="video_result2"
          &lt;emo:emotion&gt;
            &lt;emo:category 
              set="everydayEmotions" 
              name="fear" 
              confidence="0.5"/&gt;
          &lt;/emo:emotion&gt;
        &lt;/emma:interpretation&gt;
      &lt;/emma:one-of&gt;
      &lt;emma:group-info&gt;
      emotion_interpretations
      &lt;/emma:group-info&gt;
    &lt;/emma:group&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>One issue which needs to be addressed is the relationship between
EmotionML <code>confidence</code> attribute values and
<code>emma:confidence</code> values. Could the
<code>emma:confidence</code> value be used as an overall confidence
value for the emotion result, or should confidence values appear
only within the EmotionML markup since confidence is used for
different dimensions of the result? If a series of possible emotion
classifications are contained in <code>emma:one-of</code> should
they be ordered by the EmotionML confidence values?</p>
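<p>For illustration of this question, in the following sketch each
competing classification carries both an
<code>emma:confidence</code> value and an EmotionML
<code>confidence</code> value. Whether both annotations are
appropriate, and which of them should determine the ordering within
<code>&lt;emma:one-of&gt;</code>, is exactly the open issue raised
above; the values shown are illustrative only.</p>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:one-of id="emo_nbest"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="false"&gt;
    &lt;emma:interpretation id="emo1" emma:confidence="0.8"&gt;
      &lt;emo:emotion&gt;
        &lt;emo:category 
          set="everydayEmotions" 
          name="anger" 
          confidence="0.8"/&gt;
      &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="emo2" emma:confidence="0.2"&gt;
      &lt;emo:emotion&gt;
        &lt;emo:category 
          set="everydayEmotions" 
          name="boredom" 
          confidence="0.2"/&gt;
      &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
  &lt;/emma:one-of&gt;
&lt;/emma:emma&gt;
</pre>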
<h3 id="s2.4">2.4 Richer semantic representations in EMMA</h3>
<p>Enriching the semantic information represented in EMMA would be
helpful for certain use cases. For example, the concepts in an EMMA
application semantics representation might include references to
concepts in an ontology such as WordNet. In the following
example, inputs to a machine translation system are annotated in
the application semantics with specific WordNet senses, which are
used to distinguish among different senses of the words. A
translation system might make use of a sense disambiguator to
represent the probabilities of different senses of a word; for
example, "spicy" in the example has two possible WordNet
senses.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>I love to eat Mexican food because it is spicy</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns="http://example.com/universal_translator"&gt;
  &lt;emma:interpretation id="spanish"&gt;
    &lt;result xml:lang="es"&gt;
    Adoro alimento mejicano porque es picante.
    &lt;/result&gt;
    &lt;emma:derived-from resource="#english" composite="false"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:interpretation id="english"
      emma:tokens="I love to eat Mexican food
                   because it is spicy"&gt;
      &lt;assertion&gt;
        &lt;interaction
          wordnet="1828736"
          wordnet-desc="love, enjoy (get pleasure from)"
          token="love"&gt;
          &lt;experiencer
            reference="first" 
            token="I"&gt;
                &lt;attribute quantity="single"/&gt;
          &lt;/experiencer&gt;
          &lt;attribute time="present"/&gt;
          &lt;content&gt;
            &lt;interaction wordnet="1157345" 
              wordnet-desc="eat (take in solid food)"
              token="to eat"&gt;
              &lt;object id="obj1"
                wordnet="7555863"
                wordnet-desc="food, solid food (any solid 
                              substance (as opposed to 
                              liquid) that is used as a source
                              of nourishment)"
                        token="food"&gt;
                &lt;restriction 
                  wordnet="3026902"
                  wordnet-desc="Mexican (of or relating
                                to Mexico or its inhabitants)"
                                token="Mexican"/&gt;
              &lt;/object&gt;
            &lt;/interaction&gt;
          &lt;/content&gt;
          &lt;reason token="because"&gt;
            &lt;experiencer reference="third" 
              target="obj1" token="it"/&gt;
                &lt;attribute time="present"/&gt;
                &lt;one-of token="spicy"&gt;
                  &lt;modification wordnet="2397732"
                    wordnet-desc="hot, spicy (producing a 
                                  burning sensation on 
                                  the taste nerves)" 
                    confidence="0.8"/&gt;
                  &lt;modification wordnet="2398378"
                    wordnet-desc="piquant, savory, 
                                  savoury, spicy, zesty
                                  (having an agreeably
                                  pungent taste)"
                    confidence="0.4"/&gt;
                &lt;/one-of&gt;
           &lt;/reason&gt;
         &lt;/interaction&gt;
       &lt;/assertion&gt;
     &lt;/emma:interpretation&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>In addition to sense disambiguation it could also be useful to
relate concepts to superordinate concepts in some ontology. For
example, it could be useful to know that O'Hare is an airport and
Chicago is a city, even though they might be used interchangeably
in an application. For example, in an air travel application a user
might say "I want to fly to O'Hare" or "I want to fly to
Chicago".</p>
<h3 id="s2.5">2.5 Representing system output in EMMA</h3>
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was explicitly
limited in scope to representation of the interpretation of user
inputs. Most interactive systems also produce system output and one
of the major possible extensions of the EMMA language would be to
provide support for representation of the outputs made by the
system in addition to the user inputs. One advantage of having EMMA
representation for system output is that system logs can have
unified markup representation across input and output for viewing
and analyzing user/system interactions. In this section, we
consider two different use cases for addition of output
representation to EMMA.</p>
<h4 id="s2.5.1">2.5.1 Abstracting output from specific modality or
output language</h4>
<p>It is desirable for a multimodal dialog designer to be able to
isolate dialog flow (for example <a href=
"http://www.w3.org/TR/2009/WD-scxml-20091029/">SCXML</a> code) from
the details of specific utterances produced by a system. This can
be achieved by using a presentation or media planning component that
takes the abstract intent from the system and creates one or more
modality-specific presentations. In addition to isolating dialog
logic from specific modality choice this can also make it easier to
support different technologies for the same modality. In the
example below, the GUI technology is HTML, but abstracting
output would also support using a different GUI technology like
Flash, or <a href="http://www.w3.org/Graphics/SVG/">SVG</a>. If
EMMA is extended to support output, then EMMA documents could be
used for communication from the dialog manager to the presentation
planning component, and also potentially for the documents
generated by the presentation component, which could embed specific
markup such as HTML and <a href=
"http://www.w3.org/TR/speech-synthesis/">SSML</a>. Just as there
can be multiple different stages of processing of a user input,
there may be multiple stages of processing of an output, and the
mechanisms of EMMA can be used to capture and provide metadata on
these various stages of output processing.</p>
<p>Potential benefits for this approach include:</p>
<ol>
<li>Accessibility: it would be useful for an application to be able
to accommodate users who might have an assistive device or devices
without requiring special logic or even special applications.</li>
<li>Device independence: An application could separate the flow in
the IM from the details of the presentation. This might be
especially useful if there are a lot of target devices with
different types of screens, cameras, or possibilities for haptic
output.</li>
<li>Adapting to user preferences: An application could accommodate
different dynamic preferences, for example, switching to visual
presentation from speech in public places without disturbing the
application flow.</li>
</ol>
<p>In the following example, we consider the introduction of a new
EMMA element, <code>&lt;emma:presentation&gt;</code> which is the
output equivalent of the input element
<code>&lt;emma:interpretation&gt;</code>. Like
<code>&lt;emma:interpretation&gt;</code> this element can take
<code>emma:medium</code> and <code>emma:mode</code> attributes
classifying the specific modality. It could also potentially take
timestamp annotations indicating the time at which the output
should be produced. One issue is whether timestamps should be used
for the intended time of production or for the actual time of
production and how to capture both. Relative timestamps could be
used to anchor the planned time of presentation to another element
of system output. In this example we show how the
<code>emma:semantic-rep</code> attribute proposed in <a href=
"#s2.12">Section 2.12</a> could potentially be used to indicate the
markup language of the output.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Output</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">IM (step 1)</td>
<td>semantics of "what would you like for lunch?"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:presentation&gt;
    &lt;question&gt;
      &lt;topic&gt;lunch&lt;/topic&gt;
      &lt;experiencer&gt;second person&lt;/experiencer&gt;
      &lt;object&gt;questioned&lt;/object&gt;
    &lt;/question&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
    
</pre>
<p>or, more simply, without natural language generation:</p>
<pre>
&lt;emma:emma&gt; 
  &lt;emma:presentation&gt;
    &lt;text&gt;what would you like for lunch?&lt;/text&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
<tr>
<td width="50">presentation manager (voice output)</td>
<td>text "what would you like for lunch?"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation 
    emma:medium="acoustic" 
    emma:mode="voice"
    emma:verbal="true" 
    emma:function="dialog" 
    emma:semantic-rep="ssml"&gt;
      &lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US"&gt;
          what would you like for lunch&lt;/speak&gt; 
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
  
</pre></td>
</tr>
<tr>
<td width="50">presentation manager (GUI output)</td>
<td>text "what would you like for lunch?"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:presentation 
    emma:medium="visual" 
    emma:mode="graphics"
    emma:verbal="true" 
    emma:function="dialog"  
    emma:semantic-rep="html"&gt;
      &lt;html&gt;
        &lt;body&gt;
          &lt;p&gt;what would you like for lunch?"&lt;/p&gt;
          &lt;input name="" type="text"&gt;
          &lt;input type="submit" name="Submit" 
           value="Submit"&gt;
        &lt;/body&gt;
      &lt;/html&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
</tbody>
</table>
<h4 id="s2.5.2">2.5.2 Coordination of outputs distributed over
multiple different modalities</h4>
<p>A critical issue for effective multimodal
output is the synchronization of outputs in different output
media. For example, text-to-speech output or prompts may be
coordinated with graphical outputs such as highlighting of items in
an HTML table. EMMA markup could potentially be used to indicate
that elements in each medium should be coordinated in their
presentation. In the following example, a new attribute
<code>emma:sync</code> is used to indicate the relationship between
a <code>&lt;mark&gt;</code> in <a href=
"http://www.w3.org/TR/speech-synthesis/">SSML</a> and an element to
be highlighted in HTML content. The <code>emma:process</code>
attribute could be used to identify the presentation planning
component. Again <code>emma:semantic-rep</code> is used to indicate
the embedded markup language.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Output</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">system</td>
<td width="50">Coordinated presentation of table with TTS</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:group id=“gp1" 
    emma:medium="acoustic,visual" 
    emma:mode="voice,graphics" 
    emma:process="http://example.com/presentation_planner"&gt;
    &lt;emma:presentation id=“pres1"
      emma:medium="acoustic" 
      emma:mode="voice" 
      emma:verbal="true" 
      emma:function="dialog" 
      emma:semantic-rep="ssml"&gt; 
      &lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US"&gt; 
        Item 4 &lt;mark emma:sync="123"/&gt; costs fifteen dollars.
      &lt;/speak&gt; 
    &lt;/emma:presentation&gt; 
    &lt;emma:presentation id=“pres2"
      emma:medium="visual" 
      emma:mode="graphics" 
      emma:verbal="true" 
      emma:function="dialog" 
      emma:semantic-rep="html" 
      &lt;table xmlns="http://www.w3.org/1999/xhtml"&gt;
        &lt;tr&gt;
          &lt;td emma:sync="123"&gt;Item 4&lt;/td&gt;
          &lt;td&gt;15 dollars&lt;/td&gt;
        &lt;/tr&gt;
      &lt;/table&gt;
    &lt;/emma:presentation&gt; 
  &lt;/emma:group&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>One issue to be considered is the potential role of the
Synchronized Multimedia Integration Language (<a href=
"http://www.w3.org/TR/REC-smil/">SMIL</a>) for capturing multimodal
output synchronization. SMIL markup for multimedia presentation
could potentially be embedded within EMMA markup coming from an
interaction manager to a client for rendering.</p>
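<p>As a hedged sketch of that possibility, an
<code>emma:presentation</code> element might carry a SMIL fragment
as its payload, with <code>emma:semantic-rep="smil"</code> (the
attribute value and the media file names are illustrative
only):</p>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation
    emma:medium="acoustic,visual"
    emma:mode="voice,graphics"
    emma:semantic-rep="smil"&gt;
    &lt;smil xmlns="http://www.w3.org/ns/SMIL"&gt;
      &lt;body&gt;
        &lt;par&gt;
          &lt;audio src="item4_prompt.wav"/&gt;
          &lt;img src="item4_highlight.png" dur="5s"/&gt;
        &lt;/par&gt;
      &lt;/body&gt;
    &lt;/smil&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
</pre>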
<h3 id="s2.6">2.6 Representation of dialogs in EMMA</h3>
<p>The scope of <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
was explicitly limited to representation of single turns of user
input. For logging, analysis, and training purposes it could be
useful to be able to represent multi-stage dialogs in EMMA. The
following example shows a sequence of two EMMA documents where the
first is a request from the system and the second is the user
response. A new attribute <code>emma:in-response-to</code> is used
to relate the system output to the user input. EMMA already has an
attribute <code>emma:dialog-turn</code> used to provide an
indicator of the turn of interaction.</p>
<h4 id="dialog_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">system</td>
<td width="50">where would you like to go?</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:presentation id="pres1" 
    emma:dialog-turn="turn1" 
    emma:in-response-to="initial"&gt;
      &lt;prompt&gt;
      where would you like to go?
      &lt;/prompt&gt;
  &lt;/emma:presentation&gt; 
&lt;/emma:emma&gt; 
 
</pre></td>
</tr>
<tr>
<td width="50">user</td>
<td>New York</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:interpretation id="int1" 
    emma:dialog-turn="turn2"
    emma:tokens="new york"
    emma:in-response-to="pres1"&gt; 
      &lt;location&gt;
      New York 
      &lt;/location&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt; 
</pre></td>
</tr>
</tbody>
</table>
<p>In this case, each utterance is still a single EMMA document,
and markup is being used to encode the fact that the utterances are
part of an ongoing dialog. Another possibility would be to use EMMA
markup to contain a whole dialog within a single EMMA document. For
example, a flight query dialog could be represented as follows
using <code>&lt;emma:sequence&gt;</code>:</p>
<h4 id="sequence_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>flights to boston</td>
<td rowspan="5">
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:sequence&gt; 
    &lt;emma:interpretation id="user1" 
      emma:dialog-turn="turn1" 
      emma:in-response-to="initial"&gt;
      &lt;emma:literal&gt;
      flights to boston
      &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt; 
    &lt;emma:presentation id="sys1" 
      emma:dialog-turn="turn2" 
      emma:in-response-to="user1"&gt; 
      &lt;prompt&gt;
      traveling to boston, 
      which departure city
      &lt;/prompt&gt;
    &lt;/emma:presentation&gt;
    &lt;emma:interpretation id="user2" 
      emma:dialog-turn="turn3" 
      emma:in-response-to="sys1"&gt;
      &lt;emma:literal&gt;
      san francisco
      &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt; 
    &lt;emma:presentation id="sys2" 
      emma:dialog-turn="turn4" 
      emma:in-response-to="user2"&gt; 
      &lt;prompt&gt;
      departure date
      &lt;/prompt&gt;
    &lt;/emma:presentation&gt;
    &lt;emma:interpretation id="user3" 
      emma:dialog-turn="turn5" 
      emma:in-response-to="sys2"&gt;
      &lt;emma:literal&gt;
      next thursday
      &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt; 
  &lt;/emma:sequence&gt; 
&lt;/emma:emma&gt; 
      
</pre></td>
</tr>
<tr>
<td width="50">system</td>
<td>traveling to Boston, which departure city?</td>
</tr>
<tr>
<td width="50">user</td>
<td>San Francisco</td>
</tr>
<tr>
<td width="50">system</td>
<td>departure date</td>
</tr>
<tr>
<td width="50">user</td>
<td>next thursday</td>
</tr>
</tbody>
</table>
<p>Note that in this example with
<code>&lt;emma:sequence&gt;</code> the
<code>emma:in-response-to</code> attribute is still important since
there is no guarantee that an utterance in a dialog is a response
to the previous utterance. For example, a sequence of utterances
may all be from the user.</p>
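<p>As an illustrative sketch of this point, the following sequence
contains two consecutive user utterances that both respond to the
same system prompt rather than to each other:</p>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:sequence&gt;
    &lt;emma:presentation id="sys1"
      emma:dialog-turn="turn1"
      emma:in-response-to="initial"&gt;
      &lt;prompt&gt;where would you like to go?&lt;/prompt&gt;
    &lt;/emma:presentation&gt;
    &lt;emma:interpretation id="user1"
      emma:dialog-turn="turn2"
      emma:in-response-to="sys1"&gt;
      &lt;emma:literal&gt;new york&lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="user2"
      emma:dialog-turn="turn3"
      emma:in-response-to="sys1"&gt;
      &lt;emma:literal&gt;actually make that newark&lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
  &lt;/emma:sequence&gt;
&lt;/emma:emma&gt;
</pre>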
<p>One issue that arises with the representation of whole dialogs
is that the resulting EMMA documents with full sets of metadata may
become quite large. One possible extension that could help with
this would be to allow the value of <code>emma:in-response-to</code>
to be URI valued so it can refer to another EMMA document.</p>
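<p>A minimal sketch of that possible extension, assuming a
hypothetical log URI, might look as follows:</p>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:dialog-turn="turn2"
    emma:in-response-to="http://example.com/logs/session42/turn1.emma#pres1"&gt;
    &lt;location&gt;New York&lt;/location&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre>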
<h3 id="s2.7">2.7 Logging, analysis, and annotation</h3>
<p>EMMA was initially designed to facilitate communication among
components of an interactive system. It has become clear over time
that the language can also play an important role in logging of
user/system interactions. In this section, we consider possible
advantages of EMMA for log analysis and illustrate how elements
such as <code>&lt;emma:derived-from&gt;</code> could be used to
capture and provide metadata on annotations made by human
annotators.</p>
<h3 id="s2.7.1">2.7.1 Log analysis</h3>
<p>The proposal above for representing system output in EMMA would
support after-the-fact analysis of dialogs. For example, if both
the system's and the user's utterances are represented in EMMA, it
should be much easier to examine relationships between factors such
as how the wording of prompts might affect users' responses or even
the modality that users select for their responses. It would also
be easier to study timing relationships between the system prompt
and the user's responses. For example, prompts that are confusing
might consistently elicit longer times before the user starts
speaking. This would be useful even without a presentation manager
or fission component. In the following example, it might be useful
to look into the relationship between the end of the prompt and the
start of the user's response. We use here the
<code>emma:in-response-to</code> attribute suggested in <a href=
"#s2.6">Section 2.6</a> for the representation of dialogs in
EMMA.</p>
<h4 id="log_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">system</td>
<td>where would you like to go?</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation id="pres1" 
    emma:dialog-turn="turn1"
    emma:in-response-to="initial"
    emma:start="1241035886246"
    emma:end="1241035888306"&gt;
    &lt;prompt&gt;
    where would you like to go?
    &lt;/prompt&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
<tr>
<td width="50">user</td>
<td>New York</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1" 
    emma:dialog-turn="turn2"
    emma:in-response-to="pres1"
    emma:start="1241035891246"
    emma:end="1241035893000""&gt;
    &lt;destination&gt;
    New York
    &lt;/destination&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
</tbody>
</table>
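<p>Using the timestamps in this example, the response latency can be
read directly from the log: the prompt ends at
<code>emma:end="1241035888306"</code> and the user's input begins at
<code>emma:start="1241035891246"</code>, giving a latency of
1241035891246 - 1241035888306 = 2940 milliseconds, or roughly three
seconds.</p>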
<h3 id="s2.7.2">2.7.2 Log annotation</h3>
<p>EMMA is generally used to show the recognition, semantic
interpretation etc. assigned to inputs based on <em>machine</em>
processing of the user input. Another potential use case is to
provide a mechanism for showing the interpretation assigned to an
input by a human annotator and using
<code>&lt;emma:derived-from&gt;</code> to show the relationship
between the input received and the annotation. The
<code>&lt;emma:one-of&gt;</code> element can then be used to show
multiple competing annotations for an input. The
<code>&lt;emma:group&gt;</code> element could be used to contain
multiple different kinds of annotation on a single input. One
question here is whether <code>emma:process</code> can be used for
identification of the labeller, and whether there is a need for any
additional EMMA machinery to better support this use case. In
these examples, <code>&lt;emma:literal&gt;</code> contains mixed
content with text and elements. This is in keeping with the EMMA
1.0 schema.</p>
<p>One issue that arises concerns the meaning of an
<code>emma:confidence</code> value on an annotated interpretation.
It may be preferable to have another attribute for annotator
confidence rather than overloading the current
<code>emma:confidence</code>.</p>
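<p>One possible shape for such an attribute, shown here purely as a
sketch fragment of the annotation example below with a hypothetical
<code>emma:annotation-confidence</code>, would keep the machine
score on <code>emma:confidence</code> and record the annotator's
certainty separately:</p>
<pre>
&lt;!-- sketch only: emma:annotation-confidence is hypothetical --&gt;
&lt;emma:interpretation id="annotation1"
  emma:process="annotate:type=semantic&amp;annotator=michael"
  emma:confidence="0.80"
  emma:annotation-confidence="0.95"&gt;
    &lt;emma:literal&gt;
    flights from &lt;src&gt;san francisco&lt;/src&gt; to
    &lt;dest&gt;boston&lt;/dest&gt; on
    &lt;date&gt;the fourth of september&lt;/date&gt;
    &lt;/emma:literal&gt;
&lt;/emma:interpretation&gt;
</pre>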
<p>Another issue concerns the mixing of system results and human
annotation: should these be grouped, or is the annotation derived
from the system's interpretation? It would also be useful to
capture the time of the annotation. The current timestamps are used
for the time of the input itself, so where should annotation
timestamps be recorded?</p>
<p>It would also be useful to have a way to specify open-ended
information about the annotator such as their native language,
profession, experience, etc. One approach would be to have a new
attribute, e.g. <code>emma:annotator</code>, with a URI value
that could point to a description of the annotator.</p>
<p>For very common annotations it could also be useful to have, in
addition to <code>emma:tokens</code>, a dedicated element to
indicate the annotated transcription, for example,
<code>emma:annotated-tokens</code> or
<code>emma:transcription</code>.</p>
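<p>The following fragment sketches both ideas, using the
hypothetical <code>emma:annotator</code> attribute and a
hypothetical <code>emma:annotated-tokens</code> element as described
above:</p>
<pre>
&lt;!-- sketch only: emma:annotator and emma:annotated-tokens are hypothetical --&gt;
&lt;emma:interpretation id="annotation1"
  emma:annotator="http://example.com/annotators/michael"&gt;
    &lt;emma:annotated-tokens&gt;flights from san francisco to boston
    on the fourth of september&lt;/emma:annotated-tokens&gt;
    &lt;emma:literal&gt;
    flights from &lt;src&gt;san francisco&lt;/src&gt; to
    &lt;dest&gt;boston&lt;/dest&gt; on
    &lt;date&gt;the fourth of september&lt;/date&gt;
    &lt;/emma:literal&gt;
&lt;/emma:interpretation&gt;
</pre>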
<p>In the following example, we show how
<code>emma:interpretation</code> and <code>emma:derived-from</code>
could be used to capture the annotation of an input.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="614"><strong>Input</strong></td>
<td width="531"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="93">user</td>
<td>
<p>In this example the user has said:</p>
<p>"flights from boston to san francisco leaving on the fourth of
september"</p>
<p>and the semantic interpretation here is a semantic tagging of
the utterance done by a human annotator. <code>emma:process</code>
is used to provide details about the annotation.</p>
</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="annotation1"
    emma:process="annotate:type=semantic&amp;annotator=michael"
    emma:confidence="0.90"&gt;
      &lt;emma:literal&gt;
      flights from &lt;src&gt;san francisco&lt;/src&gt; to 
      &lt;dest&gt;boston&lt;/dest&gt; on 
      &lt;date&gt;the fourth of september&lt;/date&gt;
      &lt;/emma:literal&gt;
    &lt;emma:derived-from resource="#asr1"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:interpretation id="asr1"
      emma:medium="acoustic" 
      emma:mode="voice"
      emma:function="dialog" 
      emma:verbal="true"
      emma:lang="en-US" 
      emma:start="1241690021513" 
      emma:end="1241690023033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&amp;version=watson6"
      emma:confidence="0.80"&gt;
        &lt;emma:literal&gt;
        flights from san francisco 
        to boston on the fourth of september
        &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>Taking this example a step further,
<code>&lt;emma:group&gt;</code> could be used to group annotations
made by multiple different annotators of the same utterance:</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="614"><strong>Input</strong></td>
<td width="531"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="93">user</td>
<td>
<p>In this example the user has said:</p>
<p>"flights from boston to san francisco leaving on the fourth of
september"</p>
<p>and the semantic interpretation here is a semantic tagging of
the utterance done by two different human annotators.
<code>emma:process</code> is used to provide details about the
annotation.</p>
</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:group emma:confidence="1.0"&gt;
    &lt;emma:interpretation id="annotation1"
      emma:process="annotate:type=semantic&amp;annotator=michael"
      emma:confidence="0.90"&gt;
        &lt;emma:literal&gt;
        flights from &lt;src&gt;san francisco&lt;/src&gt; 
        to &lt;dest&gt;boston&lt;/dest&gt; 
        on &lt;date&gt;the fourth of september&lt;/date&gt;
        &lt;/emma:literal&gt;
      &lt;emma:derived-from resource="#asr1"/&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="annotation2"
      emma:process="annotate:type=semantic&amp;annotator=debbie"
      emma:confidence="0.90"&gt;
        &lt;emma:literal&gt;
        flights from &lt;src&gt;san francisco&lt;/src&gt; 
        to &lt;dest&gt;boston&lt;/dest&gt; on 
        &lt;date&gt;the fourth of september&lt;/date&gt;
        &lt;/emma:literal&gt;
      &lt;emma:derived-from resource="#asr1"/&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:group-info&gt;semantic_annotations&lt;/emma:group-info&gt;
  &lt;/emma:group&gt;
  &lt;emma:derivation&gt;
    &lt;emma:interpretation id="asr1"
      emma:medium="acoustic" 
      emma:mode="voice"
      emma:function="dialog" 
      emma:verbal="true"
      emma:lang="en-US" 
      emma:start="1241690021513" 
      emma:end="1241690023033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&amp;version=watson6"
      emma:confidence="0.80"&gt;
        &lt;emma:literal&gt;
        flights from san francisco to boston
        on the fourth of september
        &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<h3 id="s2.8">2.8 Multisentence Inputs</h3>
<p>For certain applications, it is useful to be able to represent
the semantics of multi-sentence inputs, which may be in one or more
modalities such as speech (e.g. voicemail), text (e.g. email), or
handwritten input. One application use case is for summarizing a
voicemail or email. We develop this example below.</p>
<p>There are at least two possible approaches to addressing this
use case.</p>
<ol>
<li>If there is no reason to distinguish the individual sentences
of the input or interpret them individually, the entire input could
be included as the value of the <code>emma:tokens</code> attribute
of an <code>&lt;emma:interpretation&gt;</code> or
<code>&lt;emma:one-of&gt;</code> element, where the semantics of
the input is represented as the value of an
<code>&lt;emma:interpretation&gt;</code>. Although in principle
there is no upper limit on the length of a <code>emma:tokens</code>
attribute, in practice, this approach might be cumbersome for
longer or more complicated texts.</li>
<li>If more structure is required, the interpretations of the
individual sentences in the input could be grouped as individual
<code>&lt;emma:interpretation&gt;</code> elements under an
<code>&lt;emma:sequence&gt;</code> element. A single unified
semantics representing the meaning of the entire input could then
be represented in a separate interpretation that references the
sequence through <code>&lt;emma:derived-from&gt;</code> (see the
sketch following the example table below).</li>
</ol>
<p>The example below illustrates the first approach.</p>
<h4 id="multisentence_example">Example</h4>
<table border="1">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="614"><strong>Input</strong></td>
<td width="531"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="93">user</td>
<td>
<p>Hi Group,</p>
<p>You are all invited to lunch tomorrow at Tony's Pizza at 12:00.
Please let me know if you're planning to come so that I can make
reservations. Also let me know if you have any dietary
restrictions. Tony's Pizza is at 1234 Main Street. We will be
discussing ways of using EMMA.</p>
<p>Debbie</p>
</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation 
    emma:tokens="Hi Group, You are all invited to 
    lunch tomorrow at Tony's Pizza at 12:00. 
    Please let me know if you're planning to
    come so that I can make reservations. 
    Also let me know if you have any dietary
    restrictions. Tony's Pizza is at 1234 
    Main Street. We will be discussing 
    ways of using EMMA." &gt;
      &lt;business-event&gt;lunch&lt;/business-event&gt;
      &lt;host&gt;debbie&lt;/host&gt;
      &lt;attendees&gt;group&lt;/attendees&gt;
      &lt;location&gt;
        &lt;name&gt;Tony's Pizza&lt;/name&gt;
        &lt;address&gt; 1234 Main Street&lt;/address&gt;
      &lt;/location&gt;
      &lt;date&gt; tuesday, March 24&lt;/date&gt;
      &lt;needs-rsvp&gt;true&lt;/needs-rsvp&gt;
      &lt;needs-restrictions&gt;true&lt;/needs-restrictions&gt;
      &lt;topic&gt;ways of using EMMA&lt;/topic&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

      
</pre></td>
</tr>
</tbody>
</table>
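<p>A sketch of the second approach (shortened relative to the full
email for readability) groups per-sentence interpretations under an
<code>&lt;emma:sequence&gt;</code> in the derivation and points to
it from a single summary interpretation:</p>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="summary1"&gt;
    &lt;business-event&gt;lunch&lt;/business-event&gt;
    &lt;location&gt;
      &lt;name&gt;Tony's Pizza&lt;/name&gt;
      &lt;address&gt;1234 Main Street&lt;/address&gt;
    &lt;/location&gt;
    &lt;needs-rsvp&gt;true&lt;/needs-rsvp&gt;
    &lt;emma:derived-from resource="#sentences1"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:sequence id="sentences1"&gt;
      &lt;emma:interpretation id="s1"
        emma:tokens="You are all invited to lunch tomorrow at
        Tony's Pizza at 12:00."&gt;
        &lt;business-event&gt;lunch&lt;/business-event&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:interpretation id="s2"
        emma:tokens="Please let me know if you're planning to come
        so that I can make reservations."&gt;
        &lt;needs-rsvp&gt;true&lt;/needs-rsvp&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:interpretation id="s3"
        emma:tokens="Tony's Pizza is at 1234 Main Street."&gt;
        &lt;location&gt;
          &lt;name&gt;Tony's Pizza&lt;/name&gt;
          &lt;address&gt;1234 Main Street&lt;/address&gt;
        &lt;/location&gt;
      &lt;/emma:interpretation&gt;
    &lt;/emma:sequence&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre>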
<h3 id="s2.9">2.9 Multi-participant interactions</h3>
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> primarily
focussed on the interpretation of inputs from a single user. Both
for annotation of human-human dialogs and for the emerging systems
which support dialog or multimodal interaction with multiple
participants (such as multimodal systems for meeting analysis), it
is important to support annotation of interactions involving
multiple different participants. The proposals above for capturing
dialog can play an important role. One possible further extension
would be to add specific markup for annotation of the user making a
particular contribution. In the following example, we use an
attribute <code>emma:participant</code> to identify the participant
contributing each response to the prompt.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="668"><strong>Input</strong></td>
<td width="480"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="90">system</td>
<td>Please tell me your lunch orders</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation id="pres1" 
    emma:dialog-turn="turn1"
    emma:in-response-to="initial"
    emma:start="1241035886246"
    emma:end="1241035888306"&gt;
      &lt;prompt&gt;please tell me your lunch orders&lt;/prompt&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
<tr>
<td width="90">user1</td>
<td>I'll have a mushroom pizza</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1" 
    emma:dialog-turn="turn2"
    emma:in-response-to="pres1"
    emma:participant="user1"
    emma:start="1241035891246"
    emma:end="1241035893000""&gt;
      &lt;pizza&gt;
        &lt;topping&gt;
        mushroom
        &lt;/topping&gt;
      &lt;/pizza&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
<tr>
<td width="90">user3</td>
<td>I'll have a pepperoni pizza.</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int2" 
    emma:dialog-turn="turn3"
    emma:in-response-to="pres1"
    emma:participant="user2"
    emma:start="1241035896246"
    emma:end="1241035899000""&gt;
      &lt;pizza&gt;
        &lt;topping&gt;
        pepperoni
        &lt;/topping&gt;
      &lt;/pizza&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
</tbody>
</table>
<h3 id="s2.10">2.10 Capturing sensor data such as GPS in EMMA</h3>
<p>The multimodal examples described in the <a href=
"http://www.w3.org/TR/emma/">EMMA 1.0</a> specification include the
combination of spoken input with a location specified by touch or
pen. With the increase in availability of GPS and other location
sensing technology such as cell tower triangulation in mobile
devices, it is desirable to provide a method for annotating inputs
with the device location and, in some cases, fusing the GPS
information with the spoken command in order to derive a complete
interpretation. GPS information could potentially be determined
using the <a href=
"http://www.w3.org/TR/2009/WD-geolocation-API-20090707/">Geolocation
API Specification</a> from the <a href=
"http://www.w3.org/2008/geolocation/">Geolocation working group</a>
and then encoded into an EMMA result sent to a server for
fusion.</p>
<p>One possibility using the current EMMA capabilities is to use
<code>&lt;emma:group&gt;</code> to associate GPS markup with the
semantics of a spoken command. For example, the user might say
"where is the nearest pizza place?" and the interpretation of the
spoken command is grouped with markup capturing the GPS sensor
data. This example uses the existing
<code>&lt;emma:group&gt;</code> element and extends the set of
values of <code>emma:medium</code> and <code>emma:mode</code> to
include <code>"sensor"</code> and <code>"gps"</code>
respectively.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">where is the nearest pizza place?</td>
<td rowspan="2">
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:group&gt;
    &lt;emma:interpretation 
      emma:tokens="where is the nearest pizza place" 
      emma:confidence="0.9" 
      emma:medium="acoustic" 
      emma:mode="voice"
      emma:start="1241035887111" 
      emma:end="1241035888200" 
      emma:process="reco:type=asr&amp;version=asr_eng2.4" 
      emma:media-type="audio/amr; rate=8000" 
      emma:lang="en-US"&gt;
        &lt;category&gt;pizza&lt;/category&gt;
    &lt;/emma:interpretation&gt; 
    &lt;emma:interpretation
      emma:medium="sensor" 
      emma:mode="gps" 
      emma:start="1241035886246" 
      emma:end="1241035886246"&gt;
        &lt;lat&gt;40.777463&lt;/lat&gt;
        &lt;lon&gt;-74.410500&lt;/lon&gt;
        &lt;alt&gt;0.2&lt;/alt&gt; 
    &lt;/emma:interpretation&gt; 
    &lt;emma:group-info&gt;geolocation&lt;/emma:group-info&gt;
  &lt;/emma:group&gt; 
&lt;/emma:emma&gt;
    
</pre></td>
</tr>
<tr>
<td width="50">GPS</td>
<td>(GPS coordinates)</td>
</tr>
</tbody>
</table>
<p>Another, more abbreviated, way to incorporate sensor information
would be to have spatial correlates of the timestamps and allow for
location stamping of user inputs, e.g. <code>emma:lat</code> and
<code>emma:lon</code> attributes that could appear on EMMA
container elements to indicate the location where the input was
produced.</p>
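<p>A minimal sketch of that alternative, using the hypothetical
<code>emma:lat</code> and <code>emma:lon</code> attributes, could
attach the location directly to the interpretation of the spoken
command:</p>
<pre>
&lt;!-- sketch only: emma:lat and emma:lon are hypothetical annotations --&gt;
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation
    emma:tokens="where is the nearest pizza place"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:start="1241035887111"
    emma:end="1241035888200"
    emma:lat="40.777463"
    emma:lon="-74.410500"&gt;
    &lt;category&gt;pizza&lt;/category&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre>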
<h3 id="s2.11">2.11 Extending EMMA from NLU to also represent
search or database retrieval results</h3>
<p>In many of the use cases considered so far, EMMA is used for
representation of the results of speech recognition and then for
the results of natural language understanding, and possibly
multimodal fusion. In systems used for voice search, the next step
is often to conduct search and extract a set of records or
documents. Strictly speaking, this stage of processing is out of
scope for EMMA. It is odd, though, to have the mechanisms of EMMA
such as <code>&lt;emma:one-of&gt;</code> for ambiguity all the way
up to NLU or multimodal fusion, but not to have access to the same
apparatus for representation of the next stage of processing which
can often be search or database lookup. Just as we can use
<code>&lt;emma:one-of&gt;</code> and <code>emma:confidence</code>
to represent N-best recognitions or semantic interpretations,
similarly we can use them to represent a series of search results
along with their relative confidence. One issue is whether we need
some measure other than confidence for relevance ranking, or
whether the same confidence attribute can be used.</p>
<p>One issue that arises is whether it would be useful to have some
recommended or standardized element to use for query results, e.g.
<code>&lt;result&gt;</code> as in the following example. Another
issue is how to annotate information about the database and the
query that was issued. The database could be indicated as part of
the <code>emma:process</code> value as in the following example.
For web search, the query URL could be annotated on the result, e.g.
<code>&lt;result url="http://cnn.com"/&gt;</code>. For database
queries, the query (SQL, for example) could be annotated on the
results or on the containing <code>&lt;emma:group&gt;</code>.</p>
<p>The following example shows the use of EMMA to represent the
results of database retrieval from an employee directory. The user
says "John Smith". After ASR, NLU, and then database look up, the
system returns the XML here which shows the N-best lists associated
with each of these three stages of processing. Here
<code>&lt;emma:derived-from&gt;</code> is used to indicate the
relations between each of the <code>&lt;emma:one-of&gt;</code>
elements. However, if you want to see which specific ASR result a
record is derived from, you would need to put
<code>&lt;emma:derived-from&gt;</code> on the individual
elements.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">User says "John Smith"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:one-of id="db_results1"
    emma:process="db:type=mysql&amp;database=personel_060109.db&gt;
    &lt;emma:interpretation id="db_nbest1"
      emma:confidence="0.80" emma:tokens="john smith"&gt;
        &lt;result&gt;
          &lt;name&gt;John Smith&lt;/name&gt;
          &lt;room&gt;dx513&lt;/room&gt;
          &lt;number&gt;123-456-7890&lt;/number&gt;
        &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="db_nbest2"
      emma:confidence="0.70" emma:tokens="john smith"&gt;
        &lt;result&gt;
          &lt;name&gt;John Smith&lt;/name&gt;
          &lt;room&gt;ef312&lt;/room&gt;
          &lt;number&gt;123-456-7891&lt;/number&gt;
        &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="db_nbest3"
      emma:confidence="0.50" emma:tokens="jon smith"&gt;
        &lt;result&gt;
          &lt;name&gt;Jon Smith&lt;/name&gt;
          &lt;room&gt;dv900&lt;/room&gt;
          &lt;number&gt;123-456-7892&lt;/number&gt;
       &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="db_nbest4"
      emma:confidence="0.40" emma:tokens="joan smithe"&gt;
        &lt;result&gt;
          &lt;name&gt;Joan Smithe&lt;/name&gt;
          &lt;room&gt;lt567&lt;/room&gt;
          &lt;number&gt;123-456-7893&lt;/number&gt;
        &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:derived-from resource="#nlu_results1/&gt;
  &lt;/emma:one-of&gt;
  &lt;emma:derivation&gt;
    &lt;emma:one-of id="nlu_results1"
      emma:process="smm:type=nlu&amp;version=parser"&gt;
      &lt;emma:interpretation id="nlu_nbest1"
        emma:confidence="0.99" emma:tokens="john smith"&gt;
          &lt;fn&gt;john&lt;/fn&gt;&lt;ln&gt;smith&lt;/ln&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:interpretation id="nlu_nbest2"
        emma:confidence="0.97" emma:tokens="jon smith"&gt;
          &lt;fn&gt;jon&lt;/fn&gt;&lt;ln&gt;smith&lt;/ln&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:interpretation id="nlu_nbest3"
        emma:confidence="0.93" emma:tokens="joan smithe"&gt;
          &lt;fn&gt;joan&lt;/fn&gt;&lt;ln&gt;smithe&lt;/ln&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:derived-from resource="#asr_results1/&gt;
    &lt;/emma:one-of&gt;
    &lt;emma:one-of id="asr_results1"
      emma:medium="acoustic" emma:mode="voice"
      emma:function="dialog" emma:verbal="true"
      emma:lang="en-US" emma:start="1241641821513" 
      emma:end="1241641823033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&amp;version=watson6"&gt;
        &lt;emma:interpretation id="asr_nbest1"
          emma:confidence="1.00"&gt;
            &lt;emma:literal&gt;john smith&lt;/emma:literal&gt;
        &lt;/emma:interpretation&gt;
        &lt;emma:interpretation id="asr_nbest2"
          emma:confidence="0.98"&gt;
            &lt;emma:literal&gt;jon smith&lt;/emma:literal&gt;
        &lt;/emma:interpretation&gt;
        &lt;emma:interpretation id="asr_nbest3"
          emma:confidence="0.89" &gt;
            &lt;emma:literal&gt;joan smithe&lt;/emma:literal&gt;
        &lt;/emma:interpretation&gt;
   &lt;/emma:one-of&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
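<p>As a further hedged sketch of how the query itself might be
recorded, the SQL issued for the lookup above could be annotated on
a containing <code>&lt;emma:group&gt;</code>; the
<code>query</code> attribute shown here is illustrative and not
defined by EMMA 1.0:</p>
<pre>
&lt;!-- sketch fragment: query annotation is hypothetical --&gt;
&lt;emma:group
  emma:process="db:type=mysql&amp;database=personel_060109.db"
  query="SELECT name, room, number FROM directory WHERE ln='smith'"&gt;
  &lt;emma:interpretation id="db_nbest1"
    emma:confidence="0.80" emma:tokens="john smith"&gt;
      &lt;result&gt;
        &lt;name&gt;John Smith&lt;/name&gt;
        &lt;room&gt;dx513&lt;/room&gt;
        &lt;number&gt;123-456-7890&lt;/number&gt;
      &lt;/result&gt;
  &lt;/emma:interpretation&gt;
  &lt;!-- further results as in the example above --&gt;
&lt;/emma:group&gt;
</pre>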
<h3 id="s2.12">2.12 Supporting other semantic representation forms
in EMMA</h3>
<p>In the <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
specification, the semantic representation of an input is
represented either in XML in some application namespace or as a
literal value using <code>emma:literal</code>. In some
circumstances it could be beneficial to allow for semantic
representation in other formats such as JSON. Serializations such
as JSON could potentially be contained within
<code>emma:literal</code> using CDATA, and a new EMMA annotation
e.g. <code>emma:semantic-rep</code> used to indicate the semantic
representation language being used.</p>
<h4 id="semantic_representation_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>semantics of spoken input</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt; 
  &lt;emma:interpretation id=“int1"
    emma:confidence=".75”
    emma:medium="acoustic" 
    emma:mode="voice" 
    emma:verbal="true"
    emma:function="dialog" 
    emma:semantic-rep="json" 
      &lt;emma:literal&gt; 
        &lt;![CDATA[
              {
           drink: {
              liquid:"coke",
              drinksize:"medium"},
           pizza: {
              number: "3",
              pizzasize: "large",
              topping: [ "pepperoni", "mushrooms" ]
           }
          } 
          ]]&gt;
      &lt;/emma:literal&gt; 
  &lt;/emma:interpretation&gt; 
&lt;/emma:emma&gt; 
</pre></td>
</tr>
</tbody>
</table>
<h2 id="references">General References</h2>
<p>EMMA 1.0 Requirements <a href=
"http://www.w3.org/TR/EMMAreqs/">http://www.w3.org/TR/EMMAreqs/</a></p>
<p>EMMA Recommendation <a href=
"http://www.w3.org/TR/emma/">http://www.w3.org/TR/emma/</a></p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>Thanks to Jim Larson (W3C Invited Expert) for his contribution
to the section on EMMA for multimodal output.</p>
</body>
</html>