<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1" />
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WD.css" />
<style type="text/css">
body {
font-family: sans-serif;
margin-left: 10%;
margin-right: 5%;
color: black;
background-color: white;
background-attachment: fixed;
background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
background-position: top left;
background-repeat: no-repeat;
}
h1,h2,h3,h4,h5,h6 {
margin-left: -4%;
font-weight: normal;
color: rgb(0, 92, 160);
}
img { color: white; border: 0; }
h1 { margin-top: 2em; clear: both; }
div.navbar,div.head { margin-bottom: 1em; }
p.copyright { font-size: 70%; }
span.term { font-style: italic; color: rgb(0, 0, 192); }
code {
color: green;
font-family: monospace;
font-weight: bold;
}
code.greenmono {
color: green;
font-family: monospace;
font-weight: bold;
}
.good {
border: solid green;
border-width: 2px;
color: green;
font-weight: bold;
margin-right: 5%;
margin-left: 0;
margin-top: 1em;
margin-bottom: 1em;
}
.bad {
border: solid red;
border-width: 2px;
margin-left: 0;
margin-right: 5%;
margin-top: 1em;
margin-bottom: 1em;
color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
background-color: rgb(204,204,255);
padding: 0.5em;
border: none;
margin-right: 5%;
}
.tocline { list-style: none; }
table.exceptions { background-color: rgb(255,255,153); }
.diff-old-a {
font-size: smaller;
color: red;
}
.diff-old {
color: red;
text-decoration: line-through;
}
.diff-new {
color: green;
text-decoration: underline;
}
</style>
<style type="text/css">
pre.c7 {color: #3333FF}
p.c6 {color: #3333FF}
span.c5 {color: #3333FF}
p.c4 {color: #FF6600}
b.c3 {font-size: larger}
tt.c2 {font-size: larger}
span.c1 {color: #FF6600}
</style>
<title>Multimodal requirements</title>
</head>
<body text="#FF0000" bgcolor="#00FFFF">
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/w3c_home" alt="W3C" /></a></p>
<h1 class="notoc">Multimodal Requirements<br />
for Voice Markup Languages</h1>
<h3 class="notoc">W3C Working Draft 10 July 2000</h3>
<dl>
<dt>This version:</dt>
<dd><a
href="http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710">
http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710</a></dd>
<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/multimodal-reqs">
http://www.w3.org/TR/multimodal-reqs</a></dd>
<dt>Editors:</dt>
<dd>Marianne Hickey, Hewlett Packard</dd>
</dl>
<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> ©2000 <a href="http://www.w3.org/"><abbr
title="World Wide Web Consortium">W3C</abbr></a><sup>®</sup>
(<a href="http://www.lcs.mit.edu/"><abbr
title="Massachusetts Institute of Technology">MIT</abbr></a>, <a
href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et Automatique">
INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>), All
Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">
liability</a>, <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">
trademark</a>, <a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">
document use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">
software licensing</a> rules apply.</p>
<hr />
</div>
<h2 class="notoc">Abstract</h2>
<p>Multimodal browsers allow users to interact via a combination
of modalities, for instance, speech recognition and synthesis,
displays, keypads and pointing devices. The Voice Browser working
group is interested in adding multimodal capabilities to voice
browsers. This document sets out a prioritized list of
requirements for multimodal dialog interaction, which any
proposed markup language (or extension thereof) should
address.</p>
<h2>Status of this document</h2>
<p>This specification is a Working Draft of the Voice Browser
working group for review by W3C members and other interested
parties. This is the first public version of this document. It is
a draft document and may be updated, replaced, or obsoleted by
other documents at any time. It is inappropriate to use W3C
Working Drafts as reference material or to cite them as other
than "work in progress".</p>
<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by members of the Voice Browser working
group.</p>
<p>This document has been produced as part of the <a
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
but should not be taken as evidence of consensus in the Voice
Browser Working Group. The goals of the <a
href="http://www.w3.org/Voice/Group/">Voice Browser Working
Group</a> (<a href="http://cgi.w3.org/MemberAccess/">members
only</a>) are discussed in the <a
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">Voice
Browser Working Group charter</a> (<a
href="http://cgi.w3.org/MemberAccess/">members only</a>). This
document is for public review. Comments should be sent to the
public mailing list &lt;<a
href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt; (<a
href="http://lists.w3.org/Archives/Public/www-voice/">archive</a>).</p>
<p>A list of current W3C Recommendations and other technical
documents can be found at <a href="http://www.w3.org/TR/">
http://www.w3.org/TR</a>.</p>
<p class="comment">NOTE: Italicized green comments are merely
that - comments. They are for use during discussions but will be
removed as appropriate.</p>
<h3>Scope</h3>
<p>The document addresses multimodal dialog interaction.
Multimodal, as defined in this document, means one or more of the
following speech modes:</p>
<ul>
<li>speech recognition,</li>
<li>speech synthesis,</li>
<li>prerecorded speech,</li>
</ul>
<p>together with one or more of the following modes:</p>
<ul>
<li>DTMF,</li>
<li>keyboard,</li>
<li>small screen,</li>
<li>pointing device (mouse, pen),</li>
<li>other input/output modes.</li>
</ul>
<p>The focus is on multimodal dialog where there is a small
screen and keypad (e.g. a cell phone) or a small screen, keypad
and pointing device (e.g. a palm computer with cellular
connection to the Web). This document is agnostic about where the
browser(s) and speech and language engines are running - e.g.
they could be running on the device itself, on a server or a
combination of the two.</p>
<p>The document addresses applications where both speech input
and speech output can be available. Note that this includes
applications where speech input and/or speech output may be
deselected due to environment/accessibility needs.</p>
<p>The document does not specifically address universal access,
i.e. the issue of rendering the same pages of markup to devices
with different capabilities (e.g. PC, phone or PDA). Rather, the
document addresses a markup language that allows an author to
write an application that uses spoken dialog interaction together
with other modalities (e.g. a visual interface).</p>
<h3>Interaction with Other Groups</h3>
<p>The activities of the Multimodal Requirements Subgroup will be
coordinated with the activities of other subgroups within the
W3C Voice Browser Working Group and other related W3C working
groups. Where possible, the specification will reuse standard
visual, multimedia and aural markup languages; see the <a
href="#s4.1">reuse of standard markup requirement (4.1)</a>.</p>
<h2>1. General Requirements</h2>
<h3>1.1 Scalable across end user devices (must address)</h3>
<p>The markup language will be scalable across devices with a
range of capabilities, in order to sufficiently meet the needs of
consumer and device control applications. This includes devices
capable of supporting:</p>
<ol>
<li>audio I/O plus keypad input - e.g. a plain phone with
speech plus DTMF, or an MP3 player with speech input and output
and with a cellular connection to the Web;</li>
<li>audio, keypad and small screen - e.g. WAP phones, smart
phones with displays;</li>
<li>audio, soft keyboard, small screen and pointing - e.g.
palm-top personal organizers with a cellular connection to the
Web;</li>
<li>audio, keyboard, full screen and pointing - e.g. desktop PC,
information kiosk.</li>
</ol>
<p>The server must be able to get access to client capabilities
and the user's personal preferences; see the <a href="#s4.1">reuse of
standard markup requirement (4.1)</a>.</p>
<h3>1.2 Easy to implement (must address)</h3>
<p>The markup language should be easy for designers to understand
and author without special tools or knowledge of vendor
technology or protocols (multimodal dialog design knowledge is
still essential).</p>
<h3>1.3 <a id="s1.3" name="s1.3">Complimentary use of
modalities</a></h3>
<p>A characteristic of speech input is that it can be very
efficient - for example, in a device with a small display and
keypad, speech can bypass multiple layers of menus. A
characteristic of speech output is its serial nature, which can
make it a long-winded way of presenting information that could be
quickly browsed on a display.</p>
<p>The markup will allow an author to use the different
characteristics of the modalities in the most appropriate way for
the application.</p>
<h4>1.3.1 <a id="s1.3.1" name="s1.3.1">Output media</a> (must
address)</h4>
<p>The markup language will allow speech output to have different
content to that of simultaneous output from other media. This
requirement is related to the <a href="#s3.3">simultaneous output
requirements</a> (3.3 and 3.4).</p>
<p>In a speech plus GUI system, the author will be able to choose
different text for simultaneous verbal and visual outputs. For
example, a list of options may be presented on screen and
simultaneous speech output does not necessarily repeat them
(which is long-winded) but can summarize them or present an
instruction or warning.</p>
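<p>A purely illustrative sketch of this requirement follows. The
<code>state</code>, <code>visual</code> and <code>speech</code>
elements are invented for this example and are not part of any
specification; the sketch simply shows a full option list on the
display with a simultaneous spoken summary.</p>
<pre>
&lt;!-- Hypothetical markup: the display shows the options in full,
     while the simultaneous speech output summarizes them. --&gt;
&lt;state id="choose-service"&gt;
  &lt;visual&gt;
    &lt;ul&gt;
      &lt;li&gt;Transfer money&lt;/li&gt;
      &lt;li&gt;Get account information&lt;/li&gt;
      &lt;li&gt;Quit&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/visual&gt;
  &lt;speech&gt;Which service do you require?&lt;/speech&gt;
&lt;/state&gt;
</pre>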
<h4>1.3.2 <a id="s1.3.2" name="s1.3.2">Input modalities</a> (must
address)</h4>
<p>The markup language will allow, in a given dialog state, the
set of actions that can be performed using speech input to be
different from the simultaneous actions that can be performed
with other input modalities. This requirement is related to the
<a href="#s2.3">simultaneous input requirements</a> (2.3 and
2.4).</p>
<p>Consider a speech plus GUI system, where speech and touch
screen input is available simultaneously. The application can be
authored such that, in a given dialog state, there are more
actions available via speech than via the touch screen. For
example, the screen displays a list of flights and the user can
bypass the options available on the display and say "show me
later flights".</p>
<h3>1.4 Seamless synchronization of the various modalities
(should address)</h3>
<p>The markup will be designed such that an author can write
applications where the synchronization of the various modalities
is seamless from the user's point of view. That is, an action in
one modality results in a synchronous change in another. For
example:</p>
<ol>
<li>an end-user selects something using voice and the visual
display changes to match;</li>
<li>an end-user specifies focus with a mouse and enters the data
with voice - the application knows which field the user is
talking to and therefore what it might expect;</li>
</ol>
<p>See <a href="#s4.7.1">minimally required synchronization
points (4.7.1)</a> and <a href="#s4.7.2">finer grained
synchronization points (4.7.2).</a></p>
<p>See also <a href="#s2.2">multimodal input requirements (2.2,
2.3, 2.4)</a> and <a href="#s3.2">multimodal output requirements
(3.2, 3.3, 3.4).</a></p>
<h3>1.5 Multilingual &amp; international rendering</h3>
<h4>1.5.1 One language per document (must address)</h4>
<p>The markup language will provide the ability to mark the
language of a document.</p>
<h4>1.5.2 Multiple languages in the same document (nice to
address)</h4>
<p>The markup language will support rendering of multilingual
documents, i.e. documents with mixed-language content. For
example, English and French speech output and/or input can appear
in the same document - a spoken system response can be "John read
the book entitled 'Vive la France'."</p>
<p><font color="#008000"><i>This is really a general requirement
for voice dialog, rather than a multimodal requirement. We may
move this to the dialog document.</i></font></p>
<h2>2. Input modality requirements</h2>
<h3>2.1 Audio Modality Input (must address)</h3>
<p>The markup language can specify which spoken user input is
interpreted by the voice browser.</p>
<h3>2.2 <a id="s2.2" name="s2.2">Sequential multi-modal Input</a>
(must address)</h3>
<p>The markup language specifies that speech and user input from
other modalities are to be interpreted by the browser. There is no
requirement that the input modalities are simultaneously active.
In a particular dialog state, there is only one input mode
available, but over the whole interaction more than one input mode
is used. Inputs from different modalities are interpreted
separately. For example, a browser can interpret speech input in
one dialog state and keyboard input in another.</p>
<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, only one mode of input will be available
at that time. See requirement <a href="#s4.7.1">4.7.1 - minimally
required synchronization points.</a></p>
<p>Examples:</p>
<ol>
<li>In a bank application accessed via a phone, the browser
renders the speech "Speak your name", the user must respond in
speech and says "Jack Jones", the browser renders the speech
"Using the keypad, enter your PIN", the user must enter
the number via the keypad.</li>
<li>In an insurance application accessed via a PDA, the browser
renders the speech "Please say your postcode", the user must
reply in speech and says "BS34 8QZ", the browser renders the
speech "I'm having trouble understanding you, please enter your
postcode using the soft keyboard." The user must respond using
the soft keyboard (i.e. not in speech).</li>
</ol>
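<p>The bank example might be written along the following lines.
This is only a sketch: the <code>state</code>, <code>prompt</code>,
<code>grammar</code> and <code>dtmf</code> elements and the
<code>input</code> attribute are invented for illustration, and the
grammar file name is hypothetical.</p>
<pre>
&lt;!-- Hypothetical markup: each dialog state activates exactly
     one input mode, so the modes are used sequentially. --&gt;
&lt;state id="get-name" input="speech"&gt;
  &lt;prompt&gt;Speak your name&lt;/prompt&gt;
  &lt;grammar src="names.gram"/&gt;
&lt;/state&gt;
&lt;state id="get-pin" input="dtmf"&gt;
  &lt;prompt&gt;Using the keypad, enter your PIN&lt;/prompt&gt;
  &lt;dtmf length="4"/&gt;
&lt;/state&gt;
</pre>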
<h3>2.3 <a id="s2.3" name="s2.3">Uncoordinated, Simultaneous,
Multi-modal Input</a> (must address)</h3>
<p>The markup language specifies that speech and user input from
other modalities are to be interpreted by the browser and that
input modalities are simultaneously active. There is no
requirement that interpretation of the input modalities is
coordinated (i.e. interpreted together). In a particular dialog
state, there is more than one input mode available but only input
from one of the modalities is interpreted (e.g. the first input -
see <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>). For example, a voice browser in a desktop
environment could accept either keyboard input or spoken input in
the same dialog state.</p>
<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, it can be in one of several input modes -
only one mode of input will be accepted by the browser. See
requirement <a href="#s4.7.1">4.7.1 - minimally required
synchronization points.</a></p>
<p>Examples:</p>
<ol>
<li>In a bank application accessed via a phone, the browser
renders the speech "Enter your name", the user says "Jack Jones"
or enters his name via the keypad, the browser renders the speech
"Enter your account number", the user enters the number via the
keypad or speaks the account number.</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases, either using speech or by selecting a
button on screen. The browser renders a list of titles on screen.
The user selects by pointing to the title with the pen or by
speaking the title of the track.</li>
</ol>
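<p>The music example might look as follows. Again, this is a
hypothetical sketch - the <code>state</code>, <code>speech</code>
and <code>visual</code> elements and the <code>input</code>
attribute are invented - showing both modes active in the same
dialog state, with only one input interpreted.</p>
<pre>
&lt;!-- Hypothetical markup: speech and GUI input are active at the
     same time; whichever arrives first is interpreted. --&gt;
&lt;state id="select-title" input="speech gui"&gt;
  &lt;speech&gt;
    &lt;grammar src="titles.gram"/&gt;
  &lt;/speech&gt;
  &lt;visual&gt;
    &lt;select name="title"&gt;
      &lt;option&gt;First new release&lt;/option&gt;
      &lt;option&gt;Second new release&lt;/option&gt;
    &lt;/select&gt;
  &lt;/visual&gt;
&lt;/state&gt;
</pre>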
<h3>2.4 <a id="s2.4" name="s2.4">Coordinated, Simultaneous
Multi-modal Input</a> (nice to address)</h3>
<p>The markup language specifies that speech and user input from
other modalities are allowed at the same time and that
interpretation of the inputs is coordinated. In a particular
dialog state, there is more than one input mode available and
input from multiple modalities is interpreted (e.g. within a
given time window). When the user takes some action it can be
composed of inputs from several modalities - for example, a voice
browser in a desktop environment could accept keyboard input and
spoken input together in the same dialog state.</p>
<p>Examples:</p>
<ol>
<li>In a telephony environment, the user can type <em>200</em> on
the keypad and say <em>transfer to checking account</em> and the
interpretations are coordinated so that they are understood as
<em>transfer 200 to checking account</em>.</li>
<li>In a route finding application, the user points at Bristol on
a map and says "Give me directions from London to here".</li>
</ol>
<p>See also <a href="#s2.11">2.11 Composite Meaning
requirement</a>, <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>.</p>
<h3>2.5 Input modes supported (must address)</h3>
<p>The markup language will support the following input modes, in
addition to speech:</p>
<ul>
<li>DTMF</li>
<li>keyboard</li>
<li>pointing device (e.g. mouse, touchscreen, etc)</li>
</ul>
<p>DTMF will be supported using the dialog markup specified by
the W3C Voice Browser Working Group's dialog requirements.</p>
<p>Character and pointing input will be supported using other
markup languages together with scripting (e.g. HTML with
JavaScript).</p>
<p>See <a href="#s4.1">reuse standard markup requirement
(4.1).</a></p>
<h3>2.6 Additional input modes supported (nice to address)</h3>
<p>The markup language will support other input modes,
including:</p>
<ul>
<li>handwriting script;</li>
<li>handwriting gesture - e.g. to delete, to insert.</li>
</ul>
<h3>2.7 Extensible to new input media types (nice to
address)</h3>
<p>The model will be abstract enough that any new or exotic input
medium (e.g. gesture captured by video) can fit into it.</p>
<h3>2.8 <a id="s2.8" name="s2.8">Semantics of input generated by
UI components other than speech</a> (nice to address)</h3>
<p>The markup language should support semantic tokens that are
generated by UI components other than speech. These tokens can be
considered in a similar way to action tags and speech grammars.
For example, in a pizza application, if a topping can be selected
from an option list on the screen, the author can declare that
the semantic token 'topping' can be generated by a GUI
component.</p>
<h3>2.9 <a id="s2.9" name="s2.9">Modality-independent
representation of the meaning of user input</a> (nice to
address)</h3>
<p>The markup language should support a modality-independent
method of representing the meaning of user input. This should be
annotated with a record of the modality type. This is related to
the <a href="#s4.3">XForms requirement (4.3)</a> and to the work
on Natural Language within the <a
href="http://www.w3.org/Voice/">W3C Voice activity</a>.</p>
<p>The markup language supports the same semantic representation
of input from different modalities. For example, in a pizza
application, if a topping can be selected from an option list on
the screen or by speaking, the same semantic token, e.g.
'topping' can be used to represent the input.</p>
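<p>For the pizza example, the markup might declare a single
semantic token that both modalities can generate. The sketch below
is purely illustrative: the <code>field</code> element, the
<code>token</code> attribute and the result line are invented, and
the grammar file name is hypothetical.</p>
<pre>
&lt;!-- Hypothetical markup: speech and GUI input both yield the
     semantic token 'topping'; the browser annotates the result
     with the modality that produced it. --&gt;
&lt;field token="topping"&gt;
  &lt;grammar src="toppings.gram"/&gt;   &lt;!-- speech input --&gt;
  &lt;select name="topping"&gt;           &lt;!-- GUI input --&gt;
    &lt;option&gt;ham&lt;/option&gt;
    &lt;option&gt;mushroom&lt;/option&gt;
  &lt;/select&gt;
&lt;/field&gt;
&lt;!-- Possible result: token="topping" value="mushroom" modality="gui" --&gt;
</pre>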
<h3>2.10 Coordinate speech grammar with grammar for other input
modalities (future revision)</h3>
<p>The markup language coordinates the grammars for modalities
other than speech with speech grammars to avoid duplication of
effort in authoring multimodal grammars.</p>
<h3>2.11 <a id="s2.11" name="s2.11">Composite meaning</a> (nice
to address)</h3>
<p>It must be possible to combine multimodal inputs to form a
composite meaning. This is related to the <a href="#s2.4">
coordinated, simultaneous multi-modal input requirement (2.4)</a>. For
example, the user points at Bristol on a map and says "Give me
directions from London to here". The formal representations of the
meaning of each input need to be combined to get a composite
meaning - "Give me directions from London to Bristol". See also
<a href="#s2.8">semantics of input generated by UI components
other than speech (2.8)</a> and <a href="#s2.9">modality-independent
semantic representation (2.9)</a>.</p>
<h3>2.12 Time window for coordinated multimodal input (nice to
address)</h3>
<p>The markup language supports specification of timing
information to determine whether input from multiple modalities
should combine to form an integrated semantic representation. See
<a href="#s2.4">coordinated multimodal input requirement
(2.4)</a>. This could, for example, take the form of a time
window which is specified in the markup, where input events from
different modalities that occur within this window are combined
into one semantic entity.</p>
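<p>Such a time window might be expressed as in the following
sketch; the <code>coordinate</code> element, the
<code>window</code> attribute and the child elements are all
invented for illustration.</p>
<pre>
&lt;!-- Hypothetical markup: pen and speech events arriving within
     750 ms of one another are combined into one semantic entity;
     events outside the window are interpreted separately. --&gt;
&lt;coordinate window="750ms"&gt;
  &lt;speech&gt;&lt;grammar src="directions.gram"/&gt;&lt;/speech&gt;
  &lt;pointing target="map"/&gt;
&lt;/coordinate&gt;
</pre>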
<h3>2.13 <a id="s2.13" name="s2.13">Support for conflicting input
from different modalities</a> (must address)</h3>
<p>The markup language will support the detection of conflicting
input from several modalities. For example, in a speech + GUI
interface, there may be simultaneous but conflicting speech and
mouse inputs; the markup language should allow the conflict to be
detected so that an appropriate action can be taken. Consider a
music application: the user says "play Madonna" while entering
"Elvis" in an artist text box on screen; an application might
resolve this by asking "Did you mean Madonna or Elvis?". This is
related to the <a href="#s2.3">2.3 uncoordinated simultaneous
multimodal input</a> and <a href="#s2.4">2.4 coordinated
simultaneous input</a> requirements.</p>
<h3>2.14 <a id="s2.14" name="s2.14">Context for recognizer</a>
(nice to address)</h3>
<p>The markup language should allow features of the display to
indicate a context for voice interaction. For example:</p>
<ul>
<li>the context for interpreting a spoken utterance might be
indicated by the form field that has focus on the display;</li>
<li>the speech grammar might be dependent on what is currently
being displayed (the page or just the area that's visible).</li>
</ul>
<h3>2.15 <a id="s2.15" name="s2.15">Resolve spoken reference to
display</a> (future revision)</h3>
<p>Interpretation of the input must provide enough information to
the natural language system to be able to resolve speech input
that refers to items in the visual context. For example: the
screen is displaying a list of possible flights that match a
user's requirements and the user says "I'll take the third
one".</p>
<h3>2.16 Time stamping (should address)</h3>
<p>All input events will be time-stamped, in addition to the time
stamping covered by the Dialog Requirements. This includes, for
example, time-stamping speech, key press and pointing events. For
finer grained synchronization, time stamping at the start and the
end of each word within speech may be needed.</p>
<h2>3. Output media requirements</h2>
<h3>3.1 Audio Media Output (must address)</h3>
<p>The markup language can specify the content rendered as spoken
output by the voice browser.</p>
<h3>3.2 <a id="s3.2" name="s3.2">Sequential multimedia output</a>
(must address)</h3>
<p>The markup language specifies that content is rendered in
speech and other media types. There is no requirement that the
output media are rendered simultaneously. For example, a browser
can output speech in one dialog state and graphics in
another.</p>
<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action - either spoken or by pointing, for
example - a response is rendered in one of the output media -
either visual or voice, for example. See requirement <a
href="#s4.7.1">4.7.1 - minimally required synchronization
points.</a></p>
<p>Examples:</p>
<ol>
<li>In a speech plus WML banking application, accessed via a WAP
phone, the user asks "What's my balance?". The browser renders the
account balance on the display only. The user clicks OK and the
browser renders the response as speech only - "Would you like
another service?"...</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, together with the text instruction to select a title
to hear the track. The user selects a track by speaking the
number. The browser plays the selected track - the screen does
not change.</li>
</ol>
<h3>3.3 <a id="s3.3" name="s3.3">Uncoordinated, Simultaneous,
Multi-media Output</a> (must address)</h3>
<p>The markup language specifies that content is rendered in
speech and other media at the same time (i.e. in the same dialog
state). There is no requirement that the rendering of the output
media is coordinated (i.e. synchronized) any further. Where
appropriate, synchronization of speech with other output media
should be supported with SMIL or a related standard.</p>
<p>The granularity of the synchronization for this requirement is
coarser than for the <a href="#s3.4">coordinated simultaneous
output requirement (3.4)</a>. The granularity is defined by
things like input events. When the user takes some action -
either spoken or by pointing, for example - something happens
with the visual and the voice channels but there is no further
synchronization at a finer granularity than that. I.e., a browser
can output speech and graphics in one dialog state, but the two
outputs are not synchronized in any other way. See requirement <a
href="#s4.7.1">4.7.1 - minimally required synchronization
points.</a></p>
<p>Examples:</p>
<ol>
<li>In a cinema-ticket application accessed via a WAP phone, the
user asks what films are showing. The browser renders the list of
films on the screen and renders an instruction in speech - "Here
are today's films. Select one to hear a full description".</li>
<li>A browser in a smart phone environment plays a prompt "Which
service do you require?", while displaying a list of options such
as "Do you want to: (a) transfer money; (b) get account info; (c)
quit."</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, and renders an instruction in speech "Here are the
five recommended new releases. Select one to hear a clip". The
user selects one by speaking the title. The browser renders the
audio clip and, at the same time, displays the price and
information about the band. When the track has finished, the user
selects a button on screen to return to the list of tracks.</li>
</ol>
<h3>3.4 <a id="s3.4" name="s3.4">Coordinated, Simultaneous
Multi-media Output</a> (nice to address)</h3>
<p>The markup language specifies that content is to be
simultaneously rendered in speech and other media and that output
rendering is further coordinated (i.e. synchronized). The
granularity is defined by things that happen within the response
to a given user input - see <a href="#s4.7.2">4.7.2 Finer grained
synchronization points.</a> Where appropriate, synchronization of
speech with other output media should be supported with SMIL or a
related standard.</p>
<p>Examples:</p>
<ol>
<li>In a news application, accessed via a PDA, a browser
highlights each paragraph of text (e.g. headline) as it renders
the corresponding speech.</li>
<li>In a learn-to-read application accessed via a PC, the lips of
an animated character are synchronized with speech output, the
words are highlighted on screen as they are spoken and pictures
are displayed as the corresponding words are spoken (e.g. a cat
is displayed as the word cat is spoken).</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, highlights the first and starts playing it. When the
first track has finished, the browser highlights the second title
on screen and starts playing the second track, and so on.</li>
<li>Display an image 5 seconds after a spoken prompt has
started.</li>
<li>Display an image for 5 seconds then render a speech
prompt.</li>
</ol>
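<p>Examples 4 and 5 can already be expressed in SMIL 1.0, as the
following sketch shows (the file names are invented). The
<code>par</code> element renders its children in parallel and the
<code>seq</code> element renders them in sequence.</p>
<pre>
&lt;smil&gt;
  &lt;body&gt;
    &lt;!-- Example 4: display an image 5 seconds after a spoken
         prompt has started. --&gt;
    &lt;par&gt;
      &lt;audio src="prompt.wav"/&gt;
      &lt;img src="picture.png" begin="5s" dur="10s"/&gt;
    &lt;/par&gt;
    &lt;!-- Example 5: display an image for 5 seconds, then render
         a speech prompt. --&gt;
    &lt;seq&gt;
      &lt;img src="picture.png" dur="5s"/&gt;
      &lt;audio src="prompt.wav"/&gt;
    &lt;/seq&gt;
  &lt;/body&gt;
&lt;/smil&gt;
</pre>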
<p>See also <a href="#s3.5">Synchronization of Multimedia with
voice input requirement (3.5)</a>.</p>
<h3>3.5 <a id="s3.5" name="s3.5">Synchronization of multimedia
with voice input</a> (nice to address)</h3>
<p>The markup language specifies that media output and voice
input are synchronized. The granularity is defined by: things
that happen within the response to a given user input, e.g. play
a video and 30 seconds after it has started activate a speech
grammar; things that happen within a speech input, e.g. detect
the start of a spoken input and 5 seconds later play a video.
Where appropriate, synchronization of speech with other output
media should be supported with SMIL or a related standard. See <a
href="#s3.4">Coordinated simultaneous multimedia output
requirement (3.4)</a>; <a href="#s4.7.2">4.7.2 Finer grained
synchronization points.</a></p>
<h3>3.6 Temporal semantics for synchronization of voice input and
output with multimedia (nice to address)</h3>
<p>The markup language will have clear temporal semantics so that
it can be integrated into the SMIL multimedia framework.
Multi-media frameworks are characterized by precise temporal
synchronization of output and input. For example, the SMIL
notation is based on timing primitives that allow the composition
of complex behaviors. See the <a href="#s3.5">synchronization of
multimedia with voice input requirement (3.5)</a> and the <a
href="#s3.4">coordinated simultaneous multimedia output
requirement (3.4)</a>.</p>
<h3>3.7 Visual output of text (must address)</h3>
<p>The markup language will support visual output of text, using
other markup languages such as HTML or WML (see <a href="#s4.1">
reuse of standard markup requirement, 4.1</a>). For example, the
following may be presented as text on the display:</p>
<ul>
<li>Contextual/history information (e.g. display partially filled
in form);</li>
<li>Prompts;</li>
<li>Menus;</li>
<li>Confirmation;</li>
<li>Error messages.</li>
</ul>
<p>Example 1:</p>
<ul>
<li>User says: "My name is Jack Jones",</li>
<li>System displays: "Jack Jones" in address field.</li>
</ul>
<p>Example 2:</p>
<ul>
<li>User says: "Transfer $200 from my savings account to my
checking account",</li>
<li>System displays:
<ul>
<li>Operation: transfer</li>
<li>Source account: savings account</li>
<li>Destination account: checking account</li>
<li>Amount: $200</li>
</ul>
</li>
</ul>
<h3>3.8 Media supported by other Voice Browsing Requirements
(must address)</h3>
<p>The markup language supports output defined in other W3C Voice
Browser Working Group specifications - for example, recorded audio
(Speech Synthesis Requirements). See <a href="#s4.1">reuse of
standard markup requirement (4.1).</a></p>
<h3>3.9 Media objects supported by SMIL (should address)</h3>
<p>The markup language supports output of media objects supported
by SMIL (animation, audio, img, video, text, textstream), using
other markup languages (see <a href="#s4.1">reuse of standard
markup requirement, 4.1</a>).</p>
<h3>3.10 Other output media (nice to address)</h3>
<p>The markup language supports output of the following media,
using other markup languages (see <a href="#s4.1">reuse of
standard markup requirement, 4.1</a>).</p>
<ul>
<li>media types supported by CSS2</li>
<li>synthesis of audio - MIDI</li>
<li>lip-synch face synthesis</li>
</ul>
<h3>3.11 Extensible to new media (nice to address)</h3>
<p>The markup language will be extensible to support new output
media types (e.g. 3D graphics).</p>
<h3>3.12 <a id="s3.12" name="s3.12"></a>Media-independent
representation of the meaning of output (future revision)</h3>
<p>The markup language should support a media-independent method
of representing the meaning of output. For example, the output could be
represented in a frame format and rendered in speech or on the
display by the browser. This is related to the <a href="#s4.3">XForms
requirement (4.3)</a>.</p>
<h3>3.13 <a id="s3.13" name="s3.13">Display size</a> (should
address)</h3>
<p>Visual output will be renderable on displays of different
sizes. This should be by using standard visual markup languages
e.g., HTML, CHTML, WML, where appropriate, see <a href="#s4.1">
reuse standard markup requirement</a> (4.1).</p>
<p>This requirement applies to two kinds of visual markup:</p>
<ul>
<li>markup that can be rendered flexibly as the display size
changes</li>
<li>markup that is pre-configured for a particular display
size.</li>
</ul>
<h3>3.14 <a id="s3.14" name="s3.14">Output to more than one
window</a> (future revision)</h3>
<p>The markup language supports the identification of the display
window. This is to support applications where there is more than
one window.</p>
<h3>3.15 <a id="s3.15" name="s3.15">Time stamping</a> (should
address)</h3>
<p>All output events will be time-stamped, in addition to the
time stamping covered by the Dialog
Requirements. This includes time-stamping the start and the end
of a speech event. For finer grained synchronization, time
stamping at the start and the end of each word within speech may
be needed.</p>
<h2>4. <a id="s4" name="s4">Architecture, Integration and
Synchronization points</a></h2>
<h3>4.1 <a id="s4.1" name="s4.1">Reuse standard markup
languages</a> (must address)</h3>
<p>Where possible, the specification must reuse standard visual,
multimedia and aural markup languages, including:</p>
<ul>
<li>other <a href="http://www.w3.org/Voice/">W3C Voice Browser
Working Group</a> specifications for voice markup;</li>
<li>standard multimedia notations (SMIL or a related
standard);</li>
<li>standard visual markup languages e.g., HTML, CHTML, WML;</li>
<li>other relevant specifications, including ACSS.</li>
</ul>
<p>The specification should avoid unnecessary differences with
these markup languages.</p>
<p>In addition, the markup will be compatible with the W3C's work
on Client Capabilities and Personal Preferences (CC/PP).</p>
<h3>4.2 Mesh with modular architecture proposed for XHTML (nice
to address)</h3>
<p>The results of the work should mesh with the modular
architecture proposed for XHTML, where different markup modules
are expected to cohabit and inter-operate gracefully within an
overall XHTML container.</p>
<p>As part of this goal the design should be capable of
incorporating multiple visual and aural markup languages.</p>
<h3>4.3 <a id="s4.3" name="s4.3">Compatibility with W3C work on
X-Forms</a> (nice to address)</h3>
<p>The markup language should be compatible with the W3C's work
on X-Forms.</p>
<ol>
<li>Have an explicit data model for the back end (i.e. the data)
and map it to the front end.</li>
<li>Separate the data model from the presentation. The
presentation depends on the device modality.</li>
<li>Application data and logic should be modality
independent.</li>
</ol>
<p>Related to requirements: <a href="#s3.12">media-independent
representation of output (3.12)</a> and <a href="#s2.9">modality-independent
representation of input (2.9)</a>.</p>
<h3>4.4 Detect that a given modality is available (must
address)</h3>
<p>The markup language will allow identification of the
modalities available. This will allow an author to identify that
a given modality is/is not present and as a result switch to a
different dialog. E.g. there is a visible construct that an
author can query. This can be used to provide for accessibility
requirements and for environmental factors (e.g. noise). The
availability of input and output modalities can be controlled by
the user or by the system. The extent to which the functionality
is retained when modalities are not available is the
responsibility of the author.</p>
<p>The following is a list of use cases regarding a multimodal
document that specifies speech and GUI input and output. The
document could be designed such that:</p>
<ol>
<li>when the speech input error count is high, the user can make
equivalent selections via the GUI;</li>
<li>where a user has a speech impairment, speech input can be
deselected and the user controls the application via the
GUI;</li>
<li>when the user cannot hear a verbal prompt due to a noisy
environment (detected, for example, by no response), an
equivalent prompt is displayed on the screen;</li>
<li>where a user has a hearing impairment the speech output is
deselected and equivalent prompts are displayed.</li>
</ol>
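<p>A sketch of how an author might query modality availability
follows; the <code>if</code>/<code>else</code> construct, the
<code>modality()</code> function and the <code>goto</code> element
are invented for this example.</p>
<pre>
&lt;!-- Hypothetical markup: switch to a GUI-only dialog when
     speech input is unavailable (e.g. deselected by the user
     or suppressed in a noisy environment). --&gt;
&lt;if cond="modality('speech-input').available"&gt;
  &lt;goto next="#speech-and-gui-dialog"/&gt;
&lt;else/&gt;
  &lt;goto next="#gui-only-dialog"/&gt;
&lt;/if&gt;
</pre>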
<h3>4.5 Means to act on a notification that a modality has become
available/unavailable (must address)</h3>
<p>Note that this is a requirement on the system and not on the
markup language. For example, when there is temporarily high
background noise, the application may disable speech input and
output but enable them again when the noise lessens. This is a
requirement for an event handling mechanism.</p>
<h3>4.6 Transformable documents</h3>
<h4>4.6.1 Loosely coupled documents (nice to address)</h4>
<p>The markup language should support loosely coupled documents,
where separate markup streams for each modality are synchronized
at well-defined points. For example, separate voice and visual
markup streams could be synchronized at the following points:
visiting a form, following a link.</p>
<h4>4.6.2 Tightly coupled documents (nice to address)</h4>
<p>The markup language should support tightly coupled documents.
Tightly coupled documents have document elements for each
interaction modality interspersed in the same document. I.e. a
tightly coupled document contains sub-documents from different
interaction modalities (e.g. HTML and voice markup) and has been
authored to achieve explicit synchrony across the interaction
streams.</p>
<p>Tightly coupled documents should be viewed as an optimization
of the loosely-coupled approach, and should be defined by
describing a reversible transformation from a tightly-coupled
document to multiple loosely-coupled documents. For example, a
tightly coupled document that includes HTML and voice markup
sub-documents should be transformable to a pair of documents,
where one is HTML only and the other is voice markup only - see
<a href="#s4.6.3">transformation requirement</a> (4.6.3).</p>
<h4>4.6.3 <a id="s4.6.3" name="s4.6.3">Transformation between
tightly and loosely coupled documents by standard tree
transformations as expressible in XSLT</a> (nice to address)</h4>
<p>The markup language should be designed such that tightly
coupled documents are <em>transformable</em> to documents for
specific interaction modalities by standard tree transformations
as expressible in XSLT. Conversely, tightly coupled documents
should be viewed as a simple transformation applied to the
individual sub-documents, with the transformation playing the
role of tightly coupling the sub-documents into a single
document.</p>
<p>This requirement will ensure content re-use, keep
implementation of multimodal browsers manageable and provide for
accessibility requirements.</p>
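<p>As a sketch of such a transformation, the following XSLT
stylesheet extracts the HTML-only document from a tightly coupled
document by copying everything except elements in the voice markup
namespace (the namespace URI here is invented).</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:v="http://example.org/2000/voice-markup"&gt;
  &lt;!-- Identity template: copy every node and attribute. --&gt;
  &lt;xsl:template match="@*|node()"&gt;
    &lt;xsl:copy&gt;
      &lt;xsl:apply-templates select="@*|node()"/&gt;
    &lt;/xsl:copy&gt;
  &lt;/xsl:template&gt;
  &lt;!-- Drop every element in the (hypothetical) voice namespace,
       leaving the HTML-only sub-document. --&gt;
  &lt;xsl:template match="v:*"/&gt;
&lt;/xsl:stylesheet&gt;
</pre>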
<p>It is important to note that not all the interaction information
from the tightly coupled document may be preserved. If, for
example, you have a speech + GUI design, when you take out the
GUI, the application is not necessarily equivalently usable. It
is up to the author to decide whether the speech document has all
the information that the speech plus GUI document has. Depending
on how the author created the multimodal document, the
transformation could be entirely lossy, could degrade gracefully
by preserving some information from the GUI or could preserve all
information from the GUI. If the author's intent is that the
application should be usable in the presence or absence of either
modality, it is the author's responsibility to design the
application to achieve this.</p>
<h3>4.7 <a id="s4.7" name="s4.7">Synchronization points</a></h3>
<h4>4.7.1 <a id="s4.7.1" name="s4.7.1">Minimally required
synchronization points</a> (must address)</h4>
<p>The markup language should minimally enable synchronization
across different modalities at well-known interaction points in
today's browsers, for example, entering and exiting specific
interaction widgets:</p>
<ul>
<li>Entry to a form;</li>
<li>Entry to a menu;</li>
<li>Completion of a form;</li>
<li>Choosing a menu item (in a voice markup language) or link
(HTML);</li>
<li>Filling of a field within a form.</li>
</ul>
<p>For example:</p>
<ul>
<li>The material displayed visually and the GUI input options can
be conditional on: the current voice dialog; the current state of
the voice dialog (e.g. the form, the menu).</li>
<li>The voice markup (i.e. the dialog/grammar/prompt) can be
conditional on: the HTML page being displayed; the text box in
focus; the option selected; the button that has been
clicked.</li>
</ul>
<p>See <a href="#s3.2">multimedia output requirements (3.2, 3.3
and 3.4)</a> and <a href="#s2.2">multimodal input
requirements</a> (2.2, 2.3 and 2.4).</p>
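<p>For example, making the active voice markup conditional on the
text box in focus might be written as in the following sketch; the
<code>sync</code> element and its attributes are invented for
illustration.</p>
<pre>
&lt;!-- Hypothetical markup: when the HTML field "city" gains
     focus, the corresponding prompt and grammar are activated. --&gt;
&lt;sync event="focus" field="city"&gt;
  &lt;prompt&gt;Which city?&lt;/prompt&gt;
  &lt;grammar src="cities.gram"/&gt;
&lt;/sync&gt;
</pre>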
<h4>4.7.2 <a id="s4.7.2" name="s4.7.2">Finer-grained
synchronization points</a> (nice to address)</h4>
<p>The markup language should support finer-grained
synchronization. Where appropriate, synchronization of speech
with other output media should be supported with SMIL or a
related standard.</p>
<p>For example:</p>
<ul>
<li>to allow a display to synchronize with events in the auditory
output stream</li>
<li>to allow voice markup (i.e. the dialog/grammar/prompt) to
synchronize with scrolling events on the display</li>
<li>to allow voice markup to synchronize with temporal events in
output media.</li>
</ul>
<p>Synchronization points include:</p>
<ul>
<li>events in the auditory output stream e.g. start/finish voice
output events (word, line, paragraph, section)</li>
<li>fine-grained events on the display (e.g. scrolling)</li>
<li>temporal events in other output media.</li>
</ul>
<p>See <a href="#s3.4">3.4 coordinated simultaneous multimodal
output requirement</a>.</p>
<h4>4.7.3 Coordinate synchronization points with the DOM event
model (future study)</h4>
<ol>
<li>Synchronization points should be coordinated with the DOM
event model. I.e. one possible starting point for a list of such
synchronization points would be the event types defined by the
DOM, appropriately modified to be modality independent.</li>
<li>Event types defined for multimodal browsing should be
integrated into the DOM; as part of this effort, the Voice WG
might provide requirements as input to the next level of the DOM
specification.</li>
</ol>
<h4>4.7.4 Browser functions and synchronization points (future
study)</h4>
<p>The notion of synchronization points (or navigation
signposts) is important; it should also be tied into a discussion
of what canonical browser functions like "back", "undo", and
"forward" mean, and what they mean to the global state of the
multimodal browser. The notion of 'back' is unclear in a voice
context.</p>
<h3>4.8 Interaction with External Components (must address)</h3>
<p>The markup language must support a generic component interface
to allow for the use of external components on the client and/or
server side. The interface provides a mechanism for transferring
data between the markup language's variables and the component.
Examples of such data are: semantic representations of user input
(such as attribute-value pairs); URL of markup for different
modalities (e.g. the URL of an HTML page). The markup language also
supports the interaction with external components specified
by the <a
href="http://www.w3.org/TR/1999/WD-voice-dialog-reqs-19991223/">
W3C Voice Browsing Dialog Requirements (Requirement
2.10)</a>.</p>
<p>Examples of external components are components for interaction
modalities other than speech (e.g. an HTML browser) and server
scripts. Server scripts can be used to interact with remote
services, devices or databases.</p>
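<p>A sketch of such an interface follows; the
<code>component</code>, <code>param</code> and <code>result</code>
elements and the URL are invented for this example.</p>
<pre>
&lt;!-- Hypothetical markup: invoke a server script, passing a
     markup variable in and receiving a result variable back. --&gt;
&lt;component src="http://www.example.com/scripts/balance"&gt;
  &lt;param name="account" expr="document.account"/&gt;
  &lt;result name="balance"/&gt;
&lt;/component&gt;
</pre>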
<h2>Acknowledgements</h2>
<p>The following people participated in the multimodal subgroup
of the Voice Browser working group and contributed to this
document:</p>
<ul>
<li>T. V. Raman (IBM)</li>
<li>Bruce Lucas (IBM)</li>
<li>Pekka Kapanen (Nokia)</li>
<li>Peter Boda (Nokia)</li>
<li>Laurence Prevosto (EDF)</li>
<li>Marianne Hickey (HP)</li>
<li>Nils Klarlund (AT&amp;T)</li>
<li>Carolina Di Cristo (Telecom Italia)</li>
<li>Charles T. Hemphill (Conversational Computing)</li>
<li>Alan Goldschen (MITRE)</li>
<li>Andreas Kellner (Philips)</li>
<li>Markku T. Hakkinen (The Productivity Works)</li>
<li>Kuansan Wang (Microsoft)</li>
<li>David Raggett (W3C/HP)</li>
<li>Jim Colson (IBM)</li>
<li>Scott McGlashan (Pipebeach)</li>
<li>Frank Scahill (BT)</li>
</ul>
</body>
</html>