<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1" />
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WD.css" />
<style type="text/css">
body { 
font-family: sans-serif;
margin-left: 10%; 
margin-right: 5%; 
color: black;
background-color: white;
background-attachment: fixed;
background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
background-position: top left;
background-repeat: no-repeat;
}
h1,h2,h3,h4,h5,h6 {
margin-left: -4%;
font-weight: normal;
color: rgb(0, 92, 160);
}
img { color: white; border: 0; }
h1 { margin-top: 2em; clear: both; }
div.navbar,div.head { margin-bottom: 1em; }
p.copyright { font-size: 70%; }
span.term { font-style: italic; color: rgb(0, 0, 192); }

code {
    color: green;
    font-family: monospace;
    font-weight: bold;
}

code.greenmono {
    color: green;
    font-family: monospace;
    font-weight: bold;
}
.good {
    border: solid green;
    border-width: 2px;
    color: green;
    font-weight: bold;
    margin-right: 5%;
    margin-left: 0;
    margin-top: 1em;
    margin-bottom: 1em;
}
.bad  {
    border: solid red;
    border-width: 2px;
    margin-left: 0;
    margin-right: 5%;
    margin-top: 1em;
    margin-bottom: 1em;
    color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
    background-color: rgb(204,204,255);
    padding: 0.5em;
    border: none;
    margin-right: 5%;
}
.tocline { list-style: none; }
table.exceptions { background-color: rgb(255,255,153); }
.diff-old-a {
  font-size: smaller;
  color: red;
}

.diff-old {
  color: red;
  text-decoration: line-through;
}

.diff-new {
        color: green;
        text-decoration: underline;
}
</style>

<style type="text/css">
 pre.c7 {color: #3333FF}
 p.c6 {color: #3333FF}
 span.c5 {color: #3333FF}
 p.c4 {color: #FF6600}
 b.c3 {font-size: larger}
 tt.c2 {font-size: larger}
 span.c1 {color: #FF6600}
</style>

<title>Multimodal requirements</title>
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/w3c_home" alt="W3C" /></a></p>

<h1 class="notoc">Multimodal Requirements<br />
for Voice Markup Languages</h1>

<h3 class="notoc">W3C Working Draft 10 July 2000</h3>

<dl>
<dt>This version:</dt>

<dd><a
href="http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710">
http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710</a></dd>

<dt>Latest version:</dt>

<dd><a href="http://www.w3.org/TR/multimodal-reqs">
http://www.w3.org/TR/multimodal-reqs</a></dd>

<dt>Editors:</dt>

<dd>Marianne Hickey, Hewlett Packard</dd>
</dl>

<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> &#169;2000 <a href="http://www.w3.org/"><abbr
title="World Wide Web Consortium">W3C</abbr></a><sup>&#174;</sup>
(<a href="http://www.lcs.mit.edu/"><abbr
title="Massachusetts Institute of Technology">MIT</abbr></a>, <a
href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et Automatique">
INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>), All
Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">
liability</a>, <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">
trademark</a>, <a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">
document use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">
software licensing</a> rules apply.</p>

<hr />
</div>

<h2 class="notoc">Abstract</h2>

<p>Multimodal browsers allow users to interact via a combination
of modalities, for instance, speech recognition and synthesis,
displays, keypads and pointing devices. The Voice Browser working
group is interested in adding multimodal capabilities to voice
browsers. This document sets out a prioritized list of
requirements for multimodal dialog interaction, which any
proposed markup language (or extension thereof) should
address.</p>

<h2>Status of this document</h2>

<p>This specification is a Working Draft of the Voice Browser
working group for review by W3C members and other interested
parties. This is the first public version of this document. It is
a draft document and may be updated, replaced, or obsoleted by
other documents at any time. It is inappropriate to use W3C
Working Drafts as reference material or to cite them as other
than "work in progress".</p>

<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by members of the Voice Browser Working
Group.</p>

<p>This document has been produced as part of the <a
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
but should not be taken as evidence of consensus in the Voice
Browser Working Group. The goals of the <a
href="http://www.w3.org/Voice/Group/">Voice Browser Working
Group</a> (<a href="http://cgi.w3.org/MemberAccess/">members
only</a>) are discussed in the <a
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">Voice
Browser Working Group charter</a> (<a
href="http://cgi.w3.org/MemberAccess/">members only</a>). This
document is for public review. Comments should be sent to the
public mailing list &lt;<a
href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt; (<a
href="http://lists.w3.org/Archives/Public/www-voice/">archive</a>).</p>

<p>A list of current W3C Recommendations and other technical
documents can be found at <a href="http://www.w3.org/TR/">
http://www.w3.org/TR</a>.</p>

<p class="comment">NOTE: Italicized green comments are merely
that - comments. They are for use during discussions but will be
removed as appropriate.</p>

<h3>Scope</h3>

<p>The document addresses multimodal dialog interaction.
Multimodal, as defined in this document, combines one or more of
the following speech modes:</p>

<ul>
<li>speech recognition,</li>

<li>speech synthesis,</li>

<li>prerecorded speech,</li>
</ul>

<p>together with one or more of the following modes:</p>

<ul>
<li>DTMF,</li>

<li>keyboard,</li>

<li>small screen,</li>

<li>pointing device (mouse, pen),</li>

<li>other input/output modes.</li>
</ul>

<p>The focus is on multimodal dialog where there is a small
screen and keypad (e.g. a cell phone) or a small screen, keypad
and pointing device (e.g. a palm computer with cellular
connection to the Web). This document is agnostic about where the
browser(s) and speech and language engines are running - e.g.
they could be running on the device itself, on a server or a
combination of the two.</p>

<p>The document addresses applications where both speech input
and speech output can be available. Note that this includes
applications where speech input and/or speech output may be
deselected due to environment/accessibility needs.</p>

<p>The document does not specifically address universal access,
i.e. the issue of rendering the same pages of markup to devices
with different capabilities (e.g. PC, phone or PDA). Rather, the
document addresses a markup language that allows an author to
write an application that uses spoken dialog interaction together
with other modalities (e.g. a visual interface).</p>

<h3>Interaction with Other Groups</h3>

<p>The activities of the Multimodal Requirements Subgroup will be
coordinated with the activities of other sub-groups within the
W3C Voice Browsing Working Group and other related W3C working
groups. Where possible, the specification will reuse standard
visual, multimedia and aural markup languages, see <a
href="#s4.1">Reuse of standard markup requirement (4.1)</a>.</p>

<h2>1. General Requirements</h2>

<h3>1.1 Scalable across end user devices (must address)</h3>

<p>The markup language will be scalable across devices with a
range of capabilities, in order to sufficiently meet the needs of
consumer and device control applications. This includes devices
capable of supporting:</p>

<ol>
<li>audio I/O plus keypad input - e.g. a plain phone with
speech plus DTMF, or an MP3 player with speech input and output
and a cellular connection to the Web;</li>

<li>audio, keypad and small screen - e.g. WAP phones, smart
phones with displays;</li>

<li>audio, soft keyboard, small screen and pointing - e.g.
palm-top personal organizers with cellular connection to the
Web.</li>

<li>audio, keyboard, full screen and pointing - e.g. desktop PC,
information kiosk.</li>
</ol>

<p>The server must be able to get access to client capabilities
and the user's personal preferences, see <a href="#s4.1">reuse of
standard markup requirement (4.1).</a></p>

<h3>1.2 Easy to implement (must address)</h3>

<p>The markup language should be easy for designers to understand
and author without special tools or knowledge of vendor
technology or protocols (multimodal dialog design knowledge is
still essential).</p>

<h3>1.3 <a id="s1.3" name="s1.3">Complementary use of
modalities</a></h3>

<p>A characteristic of speech input is that it can be very
efficient - for example, in a device with a small display and
keypad, speech can bypass multiple layers of menus. A
characteristic of speech output is its serial nature, which can
make it a long-winded way of presenting information that could be
quickly browsed on a display.</p>

<p>The markup will allow an author to use the different
characteristics of the modalities in the most appropriate way for
the application.</p>

<h4>1.3.1 <a id="s1.3.1" name="s1.3.1">Output media</a> (must
address)</h4>

<p>The markup language will allow speech output to have different
content to that of simultaneous output from other media. This
requirement is related to the <a href="#s3.3">simultaneous output
requirements</a> (3.3 and 3.4).</p>

<p>In a speech plus GUI system, the author will be able to choose
different text for simultaneous verbal and visual outputs. For
example, a list of options may be presented on screen and
simultaneous speech output does not necessarily repeat them
(which is long-winded) but can summarize them or present an
instruction or warning.</p>
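<p>As a purely illustrative sketch - the element names below are
invented for this example and are not drawn from any W3C
specification - a page might pair a visual list of options with a
shorter spoken summary:</p>

<pre>
&lt;!-- hypothetical multimodal markup: the visual and spoken
     channels carry different content --&gt;
&lt;output&gt;
  &lt;visual&gt;
    &lt;p&gt;Flights from London to Boston:&lt;/p&gt;
    &lt;ul&gt;
      &lt;li&gt;BA213 08:50&lt;/li&gt;
      &lt;li&gt;BA215 11:25&lt;/li&gt;
      &lt;li&gt;BA217 14:40&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/visual&gt;
  &lt;!-- the simultaneous speech output summarizes rather than
       repeats the list --&gt;
  &lt;speech&gt;Three flights are shown on screen. Select one, or say
  a departure time.&lt;/speech&gt;
&lt;/output&gt;
</pre>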

<h4>1.3.2 <a id="s1.3.2" name="s1.3.2">Input modalities</a> (must
address)</h4>

<p>The markup language will allow, in a given dialog state, the
set of actions that can be performed using speech input to be
different to the set of simultaneous actions that can be performed
with other input modalities. This requirement is related to the <a
href="#s2.3">simultaneous input requirements</a> (2.3 and
2.4).</p>

<p>Consider a speech plus GUI system, where speech and touch
screen input is available simultaneously. The application can be
authored such that, in a given dialog state, there are more
actions available via speech than via the touch screen. For
example, the screen displays a list of flights and the user can
bypass the options available on the display and say "show me
later flights".</p>

<h3>1.4 Seamless synchronization of the various modalities
(should address)</h3>

<p>The markup will be designed such that an author can write
applications where the synchronization of the various modalities
is seamless from the user's point of view. That is, an action in
one modality results in a synchronous change in another. For
example:</p>

<ol>
<li>an end-user selects something using voice and the visual
display changes to match;</li>

<li>an end-user specifies focus with a mouse and enters the data
with voice - the application knows which field the user is
talking to and therefore what it might expect;</li>
</ol>

<p>See <a href="#s4.7.1">minimally required synchronization
points (4.7.1)</a> and <a href="#s4.7.2">finer grained
synchronization points (4.7.2).</a></p>

<p>See also <a href="#s2.2">multimodal input requirements (2.2,
2.3, 2.4)</a> and <a href="#s3.2">multimodal output requirements
(3.2, 3.3, 3.4).</a></p>

<h3>1.5 Multilingual &amp; international rendering</h3>

<h4>1.5.1 One language per document (must address)</h4>

<p>The markup language will provide the ability to mark the
language of a document.</p>

<h4>1.5.2 Multiple languages in the same document (nice to
address)</h4>

<p>The markup language will support rendering of multi-lingual
documents - i.e. documents that mix languages. For example,
English and French speech output and/or input can appear
in the same document - a spoken system response can be "John read
the book entitled 'Viva La France'."</p>

<p><font color="#008000"><i>This is really a general requirement
for voice dialog, rather than a multimodal requirement. We may
move this to the dialog document.</i></font></p>

<h2>2. Input modality requirements</h2>

<h3>2.1 Audio Modality Input (must address)</h3>

<p>The markup language can specify which spoken user input is
interpreted by the voice browser.</p>

<h3>2.2 <a id="s2.2" name="s2.2">Sequential multi-modal Input</a>
(must address)</h3>

<p>The markup language specifies that speech and user input from
other modalities are to be interpreted by the browser. There is no
requirement that the input modalities are simultaneously active.
In a particular dialog state, there is only one input mode
available but in the whole interaction more than one input mode
is used. Inputs from different modalities are interpreted
separately. For example, a browser can interpret speech input in
one dialog state and keyboard input in another.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, only one mode of input will be available
at that time. See requirement <a href="#s4.7.1">4.7.1 - minimally
required synchronization points.</a></p>

<p>Examples:</p>

<ol>
<li>In a bank application accessed via a phone, the browser
renders the speech "Speak your name", the user must respond in
speech and says "Jack Jones", the browser renders the speech
"Using the keypad, enter your pin number", the user must enter
the number via the keypad.</li>

<li>In an insurance application accessed via a PDA, the browser
renders the speech "Please say your postcode", the user must
reply in speech and says "BS34 8QZ", the browser renders the
speech "I'm having trouble understanding you, please enter your
postcode using the soft keyboard." The user must respond using
the soft keyboard (i.e. not in speech).</li>
</ol>

<h3>2.3 <a id="s2.3" name="s2.3">Uncoordinated, Simultaneous,
Multi-modal Input</a> (must address)</h3>

<p>The markup language specifies that speech and user input from
other modalities are to be interpreted by the browser and that
input modalities are simultaneously active. There is no
requirement that interpretation of the input modalities is
coordinated (i.e. interpreted together). In a particular dialog
state, there is more than one input mode available but only input
from one of the modalities is interpreted (e.g. the first input -
see <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>). For example, a voice browser in a desktop
environment could accept either keyboard input or spoken input in
the same dialog state.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, it can be in one of several input modes -
only one mode of input will be accepted by the browser. See
requirement <a href="#s4.7.1">4.7.1 - minimally required
synchronization points.</a></p>

<p>Examples:</p>

<ol>
<li>In a bank application accessed via a phone, the browser
renders the speech "Enter your name", the user says "Jack Jones"
or enters his name via the keypad, the browser renders the speech
"Enter your account number", the user enters the number via the
keypad or speaks the account number.</li>

<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases, either using speech or by selecting a
button on screen. The browser renders a list of titles on screen.
The user selects by pointing to the title with the pen or by
speaking the title of the track.</li>
</ol>

<h3>2.4 <a id="s2.4" name="s2.4">Coordinated, Simultaneous
Multi-modal Input</a> (nice to address)</h3>

<p>The markup language specifies that speech and user input from
other modalities are allowed at the same time and that
interpretation of the inputs is coordinated. In a particular
dialog state, there is more than one input mode available and
input from multiple modalities is interpreted (e.g. within a
given time window). When the user takes some action it can be
composed of inputs from several modalities - for example, a voice
browser in a desktop environment could accept keyboard input and
spoken input together in the same dialog state.</p>

<p>Examples:</p>

<ol>
<li>In a telephony environment, the user can type <em>200</em> on
the keypad and say <em>transfer to checking account</em> and the
interpretations are coordinated so that they are understood as
<em>transfer 200 to checking account</em>.</li>

<li>In a route finding application, the user points at Bristol on
a map and says "Give me directions from London to here".</li>
</ol>

<p>See also <a href="#s2.11">2.11 Composite Meaning
requirement</a>, <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>.</p>

<h3>2.5 Input modes supported (must address)</h3>

<p>The markup language will support the following input modes, in
addition to speech:</p>

<ul>
<li>DTMF</li>

<li>keyboard</li>

<li>pointing device (e.g. mouse, touchscreen)</li>
</ul>

<p>DTMF will be supported using the dialog markup specified by
the W3C Voice Browsing Group's dialog requirements.</p>

<p>Character and pointing input will be supported using other
markup languages together with scripting (e.g. HTML with
JavaScript).</p>

<p>See <a href="#s4.1">reuse standard markup requirement
(4.1).</a></p>

<h3>2.6 Input modes supported (nice to address)</h3>

<p>The markup language will support other input modes,
including:</p>

<ul>
<li>hand-writing script</li>

<li>hand-writing gesture - e.g. to delete, to insert.</li>
</ul>

<h3>2.7 Extensible to new input media types (nice to
address)</h3>

<p>The model will be abstract enough that any new or exotic input
medium (e.g. gesture captured by video) could fit into it.</p>

<h3>2.8 <a id="s2.8" name="s2.8">Semantics of input generated by
UI components other than speech</a> (nice to address)</h3>

<p>The markup language should support semantic tokens that are
generated by UI components other than speech. These tokens can be
considered in a similar way to action tags and speech grammars.
For example, in a pizza application, if a topping can be selected
from an option list on the screen, the author can declare that
the semantic token 'topping' can be generated by a GUI
component.</p>
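<p>A sketch of how such a declaration might look, assuming a
hypothetical <code>token</code> attribute added to ordinary HTML
form markup (nothing below is defined by an existing
specification):</p>

<pre>
&lt;!-- hypothetical: the GUI control is declared to emit the same
     semantic token as the speech grammar for toppings --&gt;
&lt;select name="toppings" token="topping"&gt;
  &lt;option&gt;mushroom&lt;/option&gt;
  &lt;option&gt;pepperoni&lt;/option&gt;
  &lt;option&gt;extra cheese&lt;/option&gt;
&lt;/select&gt;
</pre>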

<h3>2.9 <a id="s2.9" name="s2.9">Modality-independent
representation of the meaning of user input</a> (nice to
address)</h3>

<p>The markup language should support a modality-independent
method of representing the meaning of user input. This should be
annotated with a record of the modality type. This is related to
the <a href="#s4.3">XForms requirement (4.3)</a> and to the work
on Natural Language within the <a
href="http://www.w3.org/Voice/">W3C Voice activity</a>.</p>

<p>The markup language supports the same semantic representation
of input from different modalities. For example, in a pizza
application, if a topping can be selected from an option list on
the screen or by speaking, the same semantic token, e.g.
'topping' can be used to represent the input.</p>

<h3>2.10 Coordinate speech grammar with grammar for other input
modalities (future revision)</h3>

<p>The markup language coordinates the grammars for modalities
other than speech with speech grammars to avoid duplication of
effort in authoring multimodal grammars.</p>

<h3>2.11 <a id="s2.11" name="s2.11">Composite meaning</a> (nice
to address)</h3>

<p>Multimodal input must be able to be combined to form a
composite meaning. This is related to the <a href="#s2.4">
Coordinated, Simultaneous Multi-modal Input (2.4)</a>. For
example, the user points at Bristol on a map and says "Give me
directions from London to here". The formal representation of the
meaning of each input needs to be combined to get a composite
meaning - "Give me directions from London to Bristol". See also
<a href="#s2.8">Semantics of input generated by UI components
other than speech (2.8)</a> and <a href="#s2.9">Modality
independent semantic representation (2.9)</a>.</p>
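<p>To illustrate the data involved - the frame notation here is
invented for the example, not a proposed format - the two inputs
from the route-finding example might be represented and merged as
follows:</p>

<pre>
speech   (t=0.0s): { action: directions, origin: "London",
                     destination: &lt;deictic "here"&gt; }
pointing (t=0.4s): { object: city, value: "Bristol" }

composite:         { action: directions, origin: "London",
                     destination: "Bristol" }
</pre>

<p>The time stamps matter: whether the pointing event is combined
with the utterance can depend on a time window of the kind
described in requirement 2.12 below.</p>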

<h3>2.12 Time window for coordinated multimodal input (nice to
address)</h3>

<p>The markup language supports specification of timing
information to determine whether input from multiple modalities
should combine to form an integrated semantic representation. See
<a href="#s2.4">coordinated multimodal input requirement
(2.4)</a>. This could, for example, take the form of a time
window which is specified in the markup, where input events from
different modalities that occur within this window are combined
into one semantic entity.</p>

<h3>2.13 <a id="s2.13" name="s2.13">Support for conflicting input
from different modalities</a> (must address)</h3>

<p>The markup language will support the detection of conflicting
input from several modalities. For example, in a speech + GUI
interface, there may be simultaneous but conflicting speech and
mouse inputs; the markup language should allow the conflict to be
detected so that an appropriate action can be taken. Consider a
music application: the user says "play Madonna" while entering
"Elvis" in an artist text box on screen; an application might
resolve this by asking "Did you mean Madonna or Elvis?". This is
related to <a href="#s2.3">2.3 uncoordinated simultaneous
multimodal input</a> and <a href="#s2.4">2.4 coordinated
simultaneous input requirement</a>.</p>

<h3>2.14 <a id="s2.14" name="s2.14">Context for recognizer</a>
(nice to address)</h3>

<p>The markup language should allow features of the display to
indicate a context for voice interaction. For example:</p>

<ul>
<li>the context for interpreting a spoken utterance might be
indicated by the form field that has focus on the display;</li>

<li>the speech grammar might be dependent on what is currently
being displayed (the page or just the area that's visible).</li>
</ul>

<h3>2.15 <a id="s2.15" name="s2.15">Resolve spoken reference to
display</a> (future revision)</h3>

<p>Interpretation of the input must provide enough information to
the natural language system to be able to resolve speech input
that refers to items in the visual context. For example: the
screen is displaying a list of possible flights that match a
user's requirements and the user says "I'll take the third
one".</p>

<h3>2.16 Time stamping (should address)</h3>

<p>All input events will be time-stamped, in addition to the time
stamping covered by the Dialog Requirements. This includes, for
example, time-stamping speech, key press and pointing events. For
finer grained synchronization, time stamping at the start and the
end of each word within speech may be needed.</p>

<h2>3. Output media requirements</h2>

<h3>3.1 Audio Media Output (must address)</h3>

<p>The markup language can specify the content rendered as spoken
output by the voice browser.</p>

<h3>3.2 <a id="s3.2" name="s3.2">Sequential multimedia output</a>
(must address)</h3>

<p>The markup language specifies that content is rendered in
speech and other media types. There is no requirement that the
output media are rendered simultaneously. For example, a browser
can output speech in one dialog state and graphics in
another.</p>

<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action - either spoken or by pointing, for
example - a response is rendered in one of the output media -
either visual or voice, for example. See requirement <a
href="#s4.7.1">4.7.1 - minimally required synchronization
points.</a></p>

<p>Examples:</p>

<ol>
<li>In a speech plus WML banking application, accessed via a WAP
phone, the user asks "What's my balance". The browser renders the
account balance on the display only. The user clicks OK, the
browser renders the response as speech only - "Would you like
another service?"...</li>

<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, together with the text instruction to select a title
to hear the track. The user selects a track by speaking the
number. The browser plays the selected track - the screen does
not change.</li>
</ol>

<h3>3.3 <a id="s3.3" name="s3.3">Uncoordinated, Simultaneous,
Multi-media Output</a> (must address)</h3>

<p>The markup language specifies that content is rendered in
speech and other media at the same time (i.e. in the same dialog
state). There is no requirement that the rendering of output
media is coordinated (i.e. synchronized) any further. Where
appropriate, synchronization of speech with other output media
should be supported with SMIL or a related standard.</p>

<p>The granularity of the synchronization for this requirement is
coarser than for the <a href="#s3.4">coordinated simultaneous
output requirement (3.4)</a>. The granularity is defined by
things like input events. When the user takes some action -
either spoken or by pointing, for example - something happens
with the visual and the voice channels but there is no further
synchronization at a finer granularity than that. I.e., a browser
can output speech and graphics in one dialog state, but the two
outputs are not synchronized in any other way. See requirement <a
href="#s4.7.1">4.7.1 - minimally required synchronization
points.</a></p>

<p>Examples:</p>

<ol>
<li>In a cinema-ticket application accessed via a WAP phone, the
user asks what films are showing. The browser renders the list of
films on the screen and renders an instruction in speech - "Here
are today's films. Select one to hear a full description".</li>

<li>A browser in a smart phone environment plays a prompt "Which
service do you require?", while displaying a list of options such
as "Do you want to: (a) transfer money; (b) get account info; (c)
quit."</li>

<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, and renders an instruction in speech "Here are the
five recommended new releases. Select one to hear a clip". The
user selects one by speaking the title. The browser renders the
audio clip and, at the same time, displays the price and
information about the band. When the track has finished, the user
selects a button on screen to return to the list of tracks.</li>
</ol>

<h3>3.4 <a id="s3.4" name="s3.4">Coordinated, Simultaneous
Multi-media Output</a> (nice to address)</h3>

<p>The markup language specifies that content is to be
simultaneously rendered in speech and other media and that output
rendering is further coordinated (i.e. synchronized). The
granularity is defined by things that happen within the response
to a given user input - see <a href="#s4.7.2">4.7.2 Finer grained
synchronization points.</a> Where appropriate, synchronization of
speech with other output media should be supported with SMIL or a
related standard.</p>

<p>Examples:</p>

<ol>
<li>In a news application, accessed via a PDA, a browser
highlights each paragraph of text (e.g. headline) as it renders
the corresponding speech.</li>

<li>In a learn-to-read application accessed via a PC, the lips of
an animated character are synchronized with speech output, the
words are highlighted on screen as they are spoken and pictures
are displayed as the corresponding words are spoken (e.g. a cat
is displayed as the word cat is spoken).</li>

<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, highlights the first and starts playing it. When the
first track has finished, the browser highlights the second title
on screen and starts playing the second track, and so on.</li>

<li>Display an image 5 seconds after a spoken prompt has
started (see the SMIL sketch after this list).</li>

<li>Display an image for 5 seconds then render a speech
prompt.</li>
</ol>
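<p>Examples 4 and 5 fall within what SMIL 1.0 can already
express. A minimal sketch of example 4, with illustrative file
names:</p>

<pre>
&lt;smil&gt;
  &lt;body&gt;
    &lt;par&gt;
      &lt;!-- spoken prompt starts at t=0 --&gt;
      &lt;audio src="prompt.wav"/&gt;
      &lt;!-- image appears 5 seconds after the prompt starts --&gt;
      &lt;img src="picture.png" begin="5s"/&gt;
    &lt;/par&gt;
  &lt;/body&gt;
&lt;/smil&gt;
</pre>

<p>Example 5 would instead place the image and the prompt in a
&lt;seq&gt; container, with <code>dur="5s"</code> on the
image.</p>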

<p>See also <a href="#s3.5">Synchronization of Multimedia with
voice input requirement (3.5)</a>.</p>

<h3>3.5 <a id="s3.5" name="s3.5">Synchronization of multimedia
with voice input</a> (nice to address)</h3>

<p>The markup language specifies that media output and voice
input are synchronized. The granularity is defined by: things
that happen within the response to a given user input, e.g. play
a video and 30 seconds after it has started activate a speech
grammar; things that happen within a speech input, e.g. detect
the start of a spoken input and 5 seconds later play a video.
Where appropriate, synchronization of speech with other output
media should be supported with SMIL or a related standard. See <a
href="#s3.4">Coordinated simultaneous multimedia output
requirement (3.4)</a>; <a href="#s4.7.2">4.7.2 Finer grained
synchronization points.</a></p>
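<p>No standard markup expresses grammar activation today. As a
purely hypothetical sketch, a SMIL-like container might schedule
it the way SMIL schedules media; the <code>voice:grammar</code>
element below is invented for illustration:</p>

<pre>
&lt;par&gt;
  &lt;video src="tour.mpg"/&gt;
  &lt;!-- hypothetical extension: activate a speech grammar
       30 seconds after the video starts --&gt;
  &lt;voice:grammar src="questions.gram" begin="30s"/&gt;
&lt;/par&gt;
</pre>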

<h3>3.6 Temporal semantics for synchronization of voice input and
output with multimedia (nice to address)</h3>

<p>The markup language will have clear temporal semantics so that
it can be integrated into the SMIL multimedia framework.
Multi-media frameworks are characterized by precise temporal
synchronization of output and input. For example, the SMIL
notation is based on timing primitives that allow the composition
of complex behaviors. See <a href="#s3.5">Synchronization of
multimedia with voice input requirement (3.5)</a> and <a
href="#s3.4">3.4 coordinated simultaneous multimodal output
requirement</a>.</p>

<h3>3.7 Visual output of text (must address)</h3>

<p>The markup language will support visual output of text, using
other markup languages such as HTML or WML (see <a href="#s4.1">
reuse of standard markup requirement, 4.1</a>). For example, the
following may be presented as text on the display:</p>

<ul>
<li>Contextual/history information (e.g. display partially filled
in form);</li>

<li>Prompts;</li>

<li>Menus;</li>

<li>Confirmation;</li>

<li>Error messages.</li>
</ul>

<p>Example 1:</p>

<ul>
<li>User says: "My name is Jack Jones",</li>

<li>System displays: "Jack Jones" in address field.</li>
</ul>

<p>Example 2:</p>

<ul>
<li>User says: "Transfer $200 from my savings account to my
checking account",</li>

<li>System displays: 

<ul>
<li>Operation: transfer</li>

<li>Source account: savings account</li>

<li>Destination account: checking account</li>

<li>Amount: $200</li>
</ul>
</li>
</ul>

<h3>3.8 Media supported by other Voice Browsing Requirements
(must address)</h3>

<p>The markup language supports output defined in other W3C Voice
Browsing Group specifications - for example, recorded audio
(Speech Synthesis Requirements). See <a href="#s4.1">reuse of
standard markup requirement (4.1).</a></p>

<h3>3.9 Media objects supported by SMIL (should address)</h3>

<p>The markup language supports output of media objects supported
by SMIL (animation, audio, img, video, text, textstream), using
other markup languages (see <a href="#s4.1">reuse of standard
markup requirement, 4.1</a>).</p>

<h3>3.10 Other output media (nice to address)</h3>

<p>The markup language supports output of the following media,
using other markup languages (see <a href="#s4.1">reuse of
standard markup requirement, 4.1</a>).</p>

<ul>
<li>media types supported by CSS2</li>

<li>synthesis of audio - MIDI</li>

<li>lip-synch face synthesis</li>
</ul>

<h3>3.11 Extensible to new media (nice to address)</h3>

<p>The markup language will be extensible to support new output
media types (e.g. 3D graphics).</p>

<h3>3.12 <a id="s3.12" name="s3.12"></a>Media-independent
representation of the meaning of output (future revision)</h3>

<p>The markup language should support a media-independent method
of representing the meaning of output. E.g. the output could be
represented in a frame format and rendered in speech or on the
display by the browser. This is related to <a href="#s4.3">XForms
requirement (4.3)</a>.</p>

<h3>3.13 <a id="s3.13" name="s3.13">Display size</a> (should
address)</h3>

<p>Visual output will be renderable on displays of different
sizes. This should be achieved by using standard visual markup
languages, e.g. HTML, CHTML, WML, where appropriate; see <a
href="#s4.1">reuse standard markup requirement</a> (4.1).</p>

<p>This requirement applies to two kinds of visual markup:</p>

<ul>
<li>markup that can be rendered flexibly as the display size
changes</li>

<li>markup that is pre-configured for a particular display
size.</li>
</ul>

<h3>3.14 <a id="s3.14" name="s3.14">Output to more than one
window</a> (future revision)</h3>

<p>The markup language supports the identification of the display
window. This is to support applications where there is more than
one window.</p>

<h3>3.15 <a id="s3.15" name="s3.15">Time stamping</a> (should
address)</h3>

<p>All output events will be time-stamped, in addition to the
time stamping covered by the Dialog
Requirements. This includes time-stamping the start and the end
of a speech event. For finer grained synchronization, time
stamping at the start and the end of each word within speech may
be needed.</p>

<h2>4. <a id="s4" name="s4">Architecture, Integration and
Synchronization points</a></h2>

<h3>4.1 <a id="s4.1" name="s4.1">Reuse standard markup
languages</a> (must address)</h3>

<p>Where possible, the specification must reuse standard visual,
multimedia and aural markup languages, including:</p>

<ul>
<li>other <a href="http://www.w3.org/Voice/">W3C Voice Browsing
working group</a> specifications for voice markup;</li>

<li>standard multimedia notations (SMIL or a related
standard);</li>

<li>standard visual markup languages e.g., HTML, CHTML, WML;</li>

<li>other relevant specifications, including ACSS;</li>
</ul>

<p>The specification should avoid unnecessary differences with
these markup languages.</p>

<p>In addition, the markup will be compatible with the W3C's work
on Client Capabilities and Personal Preferences (CC/PP).</p>

<h3>4.2 Mesh with modular architecture proposed for XHTML (nice
to address)</h3>

<p>The results of the work should mesh with the modular
architecture proposed for XHTML, where different markup modules
are expected to cohabit and inter-operate gracefully within an
overall XHTML container.</p>

<p>As part of this goal the design should be capable of
incorporating multiple visual and aural markup languages.</p>

<h3>4.3 <a id="s4.3" name="s4.3">Compatibility with W3C work on
XForms</a> (nice to address)</h3>

<p>The markup language should be compatible with the W3C's work
on XForms. In particular, the markup language should:</p>

<ol>
<li>Have an explicit data model for the back end (i.e. the data)
and map it to the front end.</li>

<li>Separate the data model from the presentation. The
presentation depends on the device modality.</li>

<li>Application data and logic should be modality
independent.</li>
</ol>

<p>Related to requirements: <a href="#s3.12">media-independent
representation of output (3.12)</a> and <a
href="#s2.9">modality-independent representation of input
(2.9)</a>.</p>

<h3>4.4 Detect that a given modality is available (must
address)</h3>

<p>The markup language will allow identification of the
modalities available. This will allow an author to identify that
a given modality is or is not present and, as a result, switch to
a different dialog - e.g. via a construct that an author can
query (a sketch follows the use cases below). This can be used to
provide for accessibility requirements and for environmental
factors (e.g. noise). The availability of input and output
modalities can be controlled by the user or by the system. The
extent to which the functionality is retained when modalities are
not available is the responsibility of the author.</p>

<p>The following is a list of use cases regarding a multimodal
document that specifies speech and GUI input and output. The
document could be designed such that:</p>

<ol>
<li>when the speech input error count is high, the user can make
equivalent selections via the GUI;</li>

<li>where a user has a speech impairment, speech input can be
deselected and the user controls the application via the
GUI;</li>

<li>when the user cannot hear a verbal prompt due to a noisy
environment (detected, for example, by no response), an
equivalent prompt is displayed on the screen;</li>

<li>where a user has a hearing impairment the speech output is
deselected and equivalent prompts are displayed.</li>
</ol>
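<p>A sketch of the construct mentioned above, applied to use case
2 - the <code>available()</code> test and the surrounding
elements are purely illustrative, not part of any
specification:</p>

<pre>
&lt;!-- hypothetical conditional: fall back to a GUI-only form
     when speech input has been deselected --&gt;
&lt;if cond="available('speech-input')"&gt;
  &lt;goto next="#speech-and-gui-form"/&gt;
&lt;else/&gt;
  &lt;goto next="#gui-only-form"/&gt;
&lt;/if&gt;
</pre>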

<h3>4.5 Means to act on a notification that a modality has become
available/unavailable (must address)</h3>

<p>Note that this is a requirement on the system and not on the
markup language. For example, when there is temporarily high
background noise, the application may disable speech input and
output but enable them again when the noise lessens. This is a
requirement for an event handling mechanism.</p>

<h3>4.6 Transformable documents</h3>

<h4>4.6.1 Loosely coupled documents (nice to address)</h4>

<p>The mark-up language should support loosely coupled documents,
where separate markup streams for each modality are synchronized
at well-defined points. For example, separate voice and visual
markup streams could be synchronized at the following points:
visiting a form, following a link.</p>

<h4>4.6.2 Tightly coupled documents (nice to address)</h4>

<p>The mark-up language should support tightly coupled documents.
Tightly coupled documents have document elements for each
interaction modality interspersed in the same document. I.e. a
tightly coupled document contains sub-documents from different
interaction modalities (e.g. HTML and voice markup) and has been
authored to achieve explicit synchrony across the interaction
streams.</p>

<p>Tightly coupled documents should be viewed as an optimization
of the loosely-coupled approach, and should be defined by
describing a reversible transformation from a tightly-coupled
document to multiple loosely-coupled documents. For example, a
tightly coupled document that includes HTML and voice markup
sub-documents should be transformable to a pair of documents,
where one is HTML only and the other is voice markup only - see
<a href="#s4.6.3">transformation requirement</a> (4.6.3).</p>

<h4>4.6.3 <a id="s4.6.3" name="s4.6.3">Transformation between
tightly and loosely coupled documents by standard tree
transformations as expressible in XSLT</a> (nice to address)</h4>

<p>The markup language should be designed such that tightly
coupled documents are <em>transformable</em> to documents for
specific interaction modalities by standard tree transformations
as expressible in XSLT. Conversely, tightly coupled documents
should be viewed as a simple transformation applied to the
individual sub-documents, with the transformation playing the
role of tightly coupling the sub-documents into a single
document.</p>
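<p>For example, producing the visual-only document could be an
XSLT identity transform that drops the voice sub-documents. The
voice namespace URI below is illustrative only:</p>

<pre>
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:voice="http://example.org/ns/voice-markup"&gt;

  &lt;!-- copy everything through unchanged ... --&gt;
  &lt;xsl:template match="@*|node()"&gt;
    &lt;xsl:copy&gt;
      &lt;xsl:apply-templates select="@*|node()"/&gt;
    &lt;/xsl:copy&gt;
  &lt;/xsl:template&gt;

  &lt;!-- ... except elements from the voice markup namespace --&gt;
  &lt;xsl:template match="voice:*"/&gt;
&lt;/xsl:stylesheet&gt;
</pre>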

<p>This requirement will ensure content re-use, keep
implementation of multimodal browsers manageable and provide for
accessibility requirements.</p>

<p>It is important to note that not all the interaction
information from the tightly coupled document may be preserved.
If, for example, you have a speech + GUI design, when you take
out the GUI, the application is not necessarily equivalently
usable. It is up to the author to decide whether the speech
document has all the information that the speech plus GUI
document has. Depending on how the author created the multimodal
document, the transformation could be entirely lossy, could
degrade gracefully by preserving some information from the GUI or
could preserve all information from the GUI. If the author's
intent is that the application should be usable in the presence
or absence of either modality, it is the author's responsibility
to design the application to achieve this.</p>

<h3>4.7 <a id="s4.7" name="s4.7">Synchronization points</a></h3>

<h4>4.7.1 <a id="s4.7.1" name="s4.7.1">Minimally required
synchronization points</a> (must address)</h4>

<p>The markup language should minimally enable synchronization
across different modalities at well known interaction points in
today's browsers, for example, entering and exiting specific
interaction widgets:</p>

<ul>
<li>Entry to a form</li>

<li>Entry to a menu</li>

<li>Completion of a form</li>

<li>Choosing a menu item (in a voice markup language) or a link
(HTML).</li>

<li>Filling of a field within a form.</li>
</ul>

<p>For example:</p>

<ul>
<li>The material displayed visually and the GUI input options can
be conditional on: the current voice dialog; the current state of
the voice dialog (e.g. the form, the menu).</li>

<li>The voice markup (i.e. the dialog/grammar/prompt) can be
conditional on: the HTML page being displayed; the text box in
focus; the option selected; the button that has been
clicked.</li>
</ul>

<p>See <a href="#s3.2">multimedia output requirements (3.2, 3.3
and 3.4)</a> and <a href="#s2.2">multimodal input
requirements</a> (2.2, 2.3 and 2.4).</p>

<h4>4.7.2 <a id="s4.7.2" name="s4.7.2">Finer-grained
synchronization points</a> (nice to address)</h4>

<p>The markup language should support finer-grained
synchronization. Where appropriate, synchronization of speech
with other output media should be supported with SMIL or a
related standard.</p>

<p>For example:</p>

<ul>
<li>to allow a display to synchronize with events in the auditory
output stream</li>

<li>to allow voice markup (i.e. the dialog/grammar/prompt) to
synchronize with scrolling events on the display</li>

<li>to allow voice markup to synchronize with temporal events in
output media.</li>
</ul>

<p>Synchronization points include:</p>

<ul>
<li>events in the auditory output stream e.g. start/finish voice
output events (word, line, paragraph, section)</li>

<li>fine-grained events on the display (e.g. scrolling)</li>

<li>temporal events in other output media.</li>
</ul>

<p>See <a href="#s3.4">3.4 coordinated simultaneous multimodal
output requirement</a>.</p>

<h4>4.7.3 Co-ordinate synchronization points with the DOM event
model (future study)</h4>

<ol>
<li>Synchronization points should be coordinated with the DOM
event model. I.e. one possible starting point for a list of such
synchronization points would be the event types defined by the
DOM, appropriately modified to be modality independent.</li>

<li>Event types defined for multimodal browsing should be
integrated into the DOM; as part of this effort, the Voice WG
might provide requirements as input to the next level of the DOM
specification.</li>
</ol>

<h4>4.7.4 Browser functions and synchronization points (future
study)</h4>

<p>The notion of synchronization points (or navigation sign
posts) is important; it should also be tied into a discussion
of what canonical browser functions like "back", "undo" and
"forward" mean, and what they mean to the global state of the
multimodal browser. The notion of 'back' is unclear in a voice
context.</p>

<h3>4.8 Interaction with External Components (must address)</h3>

<p>The markup language must support a generic component interface
to allow for the use of external components on the client and/or
server side. The interface provides a mechanism for transferring
data between the markup language's variables and the component.
Examples of such data are: semantic representations of user input
(such as attribute-value pairs); URL of markup for different
modalities (e.g. URL of an HTML page). The markup language also
supports the interaction with external components that is defined
by the <a
href="http://www.w3.org/TR/1999/WD-voice-dialog-reqs-19991223/">
W3C Voice Browsing Dialog Requirements (Requirement
2.10)</a>.</p>

<p>Examples of external components are components for interaction
modalities other than speech (e.g. an HTML browser) and server
scripts. Server scripts can be used to interact with remote
services, devices or databases.</p>
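<p>As a purely illustrative sketch of such an interface - the
element and attribute names are invented, not proposed - a dialog
might pass its variables to a server script and receive a result
back:</p>

<pre>
&lt;!-- hypothetical component call from a multimodal dialog --&gt;
&lt;component src="http://example.com/cgi-bin/flights"&gt;
  &lt;send var="origin"/&gt;
  &lt;send var="destination"/&gt;
  &lt;receive var="flightlist"/&gt;
&lt;/component&gt;
</pre>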

<h2>Acknowledgements</h2>

<p>The following people participated in the multimodal subgroup
of the Voice Browser working group and contributed to this
document.</p>

<ul>
<li>T. V. Raman (IBM)</li>

<li>Bruce Lucas (IBM)</li>

<li>Pekka Kapanen (Nokia)</li>

<li>Peter Boda (Nokia)</li>

<li>Laurence Prevosto (EDF)</li>

<li>Marianne Hickey (HP)</li>

<li>Nils Klarlund (AT&amp;T)</li>

<li>Carolina Di Cristo (Telecom Italia)</li>

<li>Charles T. Hemphill (Conversational Computing)</li>

<li>Alan Goldschen (MITRE)</li>

<li>Andreas Kellner (Philips)</li>

<li>Markku T. Hakkinen (The Productivity Works)</li>

<li>Kuansan Wang (Microsoft)</li>

<li>David Raggett (W3C/HP)</li>

<li>Jim Colson (IBM)</li>

<li>Scott McGlashan (Pipebeach)</li>

<li>Frank Scahill (BT)</li>
</ul>
</body>
</html>