<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1" />
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WD.css" />
<style type="text/css">
body {
font-family: sans-serif;
margin-left: 10%;
margin-right: 5%;
color: black;
background-color: white;
background-attachment: fixed;
background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
background-position: top left;
background-repeat: no-repeat;
}
h1,h2,h3,h4,h5,h6 {
margin-left: -4%;
font-weight: normal;
color: rgb(0, 92, 160);
}
img { color: white; border: 0; }
h1 { margin-top: 2em; clear: both; }
div.navbar,div.head { margin-bottom: 1em; }
p.copyright { font-size: 70%; }
span.term { font-style: italic; color: rgb(0, 0, 192); }
code {
color: green;
font-family: monospace;
font-weight: bold;
}
code.greenmono {
color: green;
font-family: monospace;
font-weight: bold;
}
.good {
border: solid green;
border-width: 2px;
color: green;
font-weight: bold;
margin-right: 5%;
margin-left: 0;
margin-top: 1em;
margin-bottom: 1em;
}
.bad {
border: solid red;
border-width: 2px;
margin-left: 0;
margin-right: 5%;
margin-top: 1em;
margin-bottom: 1em;
color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
background-color: rgb(204,204,255);
padding: 0.5em;
border: none;
margin-right: 5%;
}
.tocline { list-style: none; }
table.exceptions { background-color: rgb(255,255,153); }
.diff-old-a {
font-size: smaller;
color: red;
}
.diff-old {
color: red;
text-decoration: line-through;
}
.diff-new {
color: green;
text-decoration: underline;
}
</style>
<style type="text/css">
pre.c7 {color: #3333FF}
p.c6 {color: #3333FF}
span.c5 {color: #3333FF}
p.c4 {color: #FF6600}
b.c3 {font-size: larger}
tt.c2 {font-size: larger}
span.c1 {color: #FF6600}
</style>
<title>Multimodal requirements</title>
</head>
<body text="#FF0000" bgcolor="#00FFFF">
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/w3c_home" alt="W3C" /></a></p>
<h1 class="notoc">Multimodal Requirements<br />
for Voice Markup Languages</h1>
<h3 class="notoc">W3C Working Draft 10 July 2000</h3>
<dl>
<dt>This version:</dt>
<dd><a
href="http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710">
http://www.w3.org/TR/2000/WD-multimodal-reqs-20000710</a></dd>
<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/multimodal-reqs">
http://www.w3.org/TR/multimodal-reqs</a></dd>
<dt>Editors:</dt>
<dd>Marianne Hickey, Hewlett Packard</dd>
</dl>
<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> ©2000 <a href="http://www.w3.org/"><abbr
title="World Wide Web Consortium">W3C</abbr></a><sup>®</sup>
(<a href="http://www.lcs.mit.edu/"><abbr
title="Massachusetts Institute of Technology">MIT</abbr></a>, <a
href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et Automatique">
INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>), All
Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">
liability</a>, <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">
trademark</a>, <a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">
document use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">
software licensing</a> rules apply.</p>
<hr />
</div>
<h2 class="notoc">Abstract</h2>
<p>Multimodal browsers allow users to interact via a combination
of modalities, for instance, speech recognition and synthesis,
displays, keypads and pointing devices. The Voice Browser working
group is interested in adding multimodal capabilities to voice
browsers. This document sets out a prioritized list of
requirements for multimodal dialog interaction, which any
proposed markup language (or extension thereof) should
address.</p>
<h2>Status of this document</h2>
<p>This specification is a Working Draft of the Voice Browser
working group for review by W3C members and other interested
parties. This is the first public version of this document. It is
a draft document and may be updated, replaced, or obsoleted by
other documents at any time. It is inappropriate to use W3C
Working Drafts as reference material or to cite them as other
than "work in progress".</p>
<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by members of the Voice Browser working
group.</p>
<p>This document has been produced as part of the <a
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
but should not be taken as evidence of consensus in the Voice
Browser Working Group. The goals of the <a
href="http://www.w3.org/Voice/Group/">Voice Browser Working
Group</a> (<a href="http://cgi.w3.org/MemberAccess/">members
only</a>) are discussed in the <a
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">Voice
Browser Working Group charter</a> (<a
href="http://cgi.w3.org/MemberAccess/">members only</a>). This
document is for public review. Comments should be sent to the
public mailing list &lt;<a
href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt; (<a
href="http://lists.w3.org/Archives/Public/www-voice/">archive</a>).</p>
<p>A list of current W3C Recommendations and other technical
documents can be found at <a href="http://www.w3.org/TR/">
http://www.w3.org/TR</a>.</p>
<p class="comment">NOTE: Italicized green comments are merely
that - comments. They are for use during discussions but will be
removed as appropriate.</p>
<h3>Scope</h3>
<p>The document addresses multimodal dialog interaction.
Multimodal, as defined in this document, means one or more of the
following speech modes:</p>
<ul>
<li>speech recognition,</li>
<li>speech synthesis,</li>
<li>prerecorded speech,</li>
</ul>
<p>together with one or more of the following modes:</p>
<ul>
<li>DTMF,</li>
<li>keyboard,</li>
<li>small screen,</li>
<li>pointing device (mouse, pen),</li>
<li>other input/output modes.</li>
</ul>
<p>The focus is on multimodal dialog where there is a small
screen and keypad (e.g. a cell phone) or a small screen, keypad
and pointing device (e.g. a palm computer with cellular
connection to the Web). This document is agnostic about where the
browser(s) and speech and language engines are running - e.g.
they could be running on the device itself, on a server or a
combination of the two.</p>
<p>The document addresses applications where both speech input
and speech output can be available. Note that this includes
applications where speech input and/or speech output may be
deselected due to environment/accessibility needs.</p>
<p>The document does not specifically address universal access,
i.e. the issue of rendering the same pages of markup to devices
with different capabilities (e.g. PC, phone or PDA). Rather, the
document addresses a markup language that allows an author to
write an application that uses spoken dialog interaction together
with other modalities (e.g. a visual interface).</p>
<h3>Interaction with Other Groups</h3>
<p>The activities of the Multimodal Requirements Subgroup will be
coordinated with the activities of other subgroups within the
W3C Voice Browser Working Group and other related W3C working
groups. Where possible, the specification will reuse standard
visual, multimedia and aural markup languages; see the <a
href="#s4.1">reuse of standard markup requirement (4.1)</a>.</p>
<h2>1. General Requirements</h2>
<h3>1.1 Scalable across end user devices (must address)</h3>
<p>The markup language will be scalable across devices with a
range of capabilities, in order to sufficiently meet the needs of
consumer and device control applications. This includes devices
capable of supporting:</p>
<ol>
<li>audio I/O plus keypad input - e.g. a plain phone with
speech plus DTMF, or an MP3 player with speech input and output
and with a cellular connection to the Web;</li>
<li>audio, keypad and small screen - e.g. WAP phones, smart
phones with displays;</li>
<li>audio, soft keyboard, small screen and pointing - e.g.
palm-top personal organizers with a cellular connection to the
Web;</li>
<li>audio, keyboard, full screen and pointing - e.g. desktop PC,
information kiosk.</li>
</ol>
<p>The server must be able to get access to client capabilities
and the user's personal preferences; see the <a href="#s4.1">reuse of
standard markup requirement (4.1)</a>.</p>
<h3>1.2 Easy to implement (must address)</h3>
<p>The markup language should be easy for designers to understand
and author without special tools or knowledge of vendor
technology or protocols (multimodal dialog design knowledge is
still essential).</p>
<h3>1.3 <a id="s1.3" name="s1.3">Complimentary use of
modalities</a></h3>
<p>A characteristic of speech input is that it can be very
efficient - for example, in a device with a small display and
keypad, speech can bypass multiple layers of menus. A
characteristic of speech output is its serial nature, which can
make it a long-winded way of presenting information that could be
quickly browsed on a display.</p>
<p>The markup will allow an author to use the different
characteristics of the modalities in the most appropriate way for
the application.</p>
<h4>1.3.1 <a id="s1.3.1" name="s1.3.1">Output media</a> (must
address)</h4>
<p>The markup language will allow speech output to have different
content to that of simultaneous output from other media. This
requirement is related to the <a href="#s3.3">simultaneous output
requirements</a> (3.3 and 3.4).</p>
<p>In a speech plus GUI system, the author will be able to choose
different text for simultaneous verbal and visual outputs. For
example, a list of options may be presented on screen and
simultaneous speech output does not necessarily repeat them
(which is long-winded) but can summarize them or present an
instruction or warning.</p>
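<p>A purely illustrative sketch of this requirement follows. The
<code>state</code>, <code>visual</code> and <code>speech</code>
elements are invented for this example and are not part of any
specification; the sketch simply shows a full option list on the
display with a simultaneous spoken summary.</p>
<pre>
&lt;!-- Hypothetical markup: the display shows the options in full,
     while the simultaneous speech output summarizes them. --&gt;
&lt;state id="choose-service"&gt;
  &lt;visual&gt;
    &lt;ul&gt;
      &lt;li&gt;Transfer money&lt;/li&gt;
      &lt;li&gt;Get account information&lt;/li&gt;
      &lt;li&gt;Quit&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/visual&gt;
  &lt;speech&gt;Which service do you require?&lt;/speech&gt;
&lt;/state&gt;
</pre>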
<h4>1.3.2 <a id="s1.3.2" name="s1.3.2">Input modalities</a> (must
address)</h4>
<p>The markup language will allow, in a given dialog state, the
set of actions that can be performed using speech input to be
different from the simultaneous actions that can be performed
with other input modalities. This requirement is related to the
<a href="#s2.3">simultaneous input requirements</a> (2.3 and
2.4).</p>
<p>Consider a speech plus GUI system, where speech and touch
screen input is available simultaneously. The application can be
authored such that, in a given dialog state, there are more
actions available via speech than via the touch screen. For
example, the screen displays a list of flights and the user can
bypass the options available on the display and say "show me
later flights".</p>
<h3>1.4 Seamless synchronization of the various modalities
(should address)</h3>
<p>The markup will be designed such that an author can write
applications where the synchronization of the various modalities
is seamless from the user's point of view. That is, an action in
one modality results in a synchronous change in another. For
example:</p>
<ol>
<li>an end-user selects something using voice and the visual
display changes to match;</li>
<li>an end-user specifies focus with a mouse and enters the data
with voice - the application knows which field the user is
talking to and therefore what it might expect;</li>
</ol>
<p>See <a href="#s4.7.1">minimally required synchronization
points (4.7.1)</a> and <a href="#s4.7.2">finer grained
synchronization points (4.7.2).</a></p>
<p>See also <a href="#s2.2">multimodal input requirements (2.2,
2.3, 2.4)</a> and <a href="#s3.2">multimodal output requirements
(3.2, 3.3, 3.4).</a></p>
<h3>1.5 Multilingual &amp; international rendering</h3>
<h4>1.5.1 One language per document (must address)</h4>
<p>The markup language will provide the ability to mark the
language of a document.</p>
<h4>1.5.2 Multiple languages in the same document (nice to
address)</h4>
<p>The markup language will support rendering of multilingual
documents, i.e. documents with mixed-language content. For
example, English and French speech output and/or input can appear
in the same document - a spoken system response can be "John read
the book entitled 'Vive la France'."</p>
<p><font color="#008000"><i>This is really a general requirement
for voice dialog, rather than a multimodal requirement. We may
move this to the dialog document.</i></font></p>
<h2>2. Input modality requirements</h2>
<h3>2.1 Audio Modality Input (must address)</h3>
<p>The markup language can specify which spoken user input is
interpreted by the voice browser.</p>
<h3>2.2 <a id="s2.2" name="s2.2">Sequential multi-modal Input</a>
(must address)</h3>
<p>The markup language specifies that speech and user input from
other modalities are to be interpreted by the browser. There is no
requirement that the input modalities are simultaneously active.
In a particular dialog state, there is only one input mode
available, but over the whole interaction more than one input mode
is used. Inputs from different modalities are interpreted
separately. For example, a browser can interpret speech input in
one dialog state and keyboard input in another.</p>
<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, only one mode of input will be available
at that time. See requirement <a href="#s4.7.1">4.7.1 - minimally
required synchronization points.</a></p>
<p>Examples:</p>
<ol>
<li>In a bank application accessed via a phone, the browser
renders the speech "Speak your name", the user must respond in
speech and says "Jack Jones", the browser renders the speech
"Using the keypad, enter your PIN", the user must enter
the number via the keypad.</li>
<li>In an insurance application accessed via a PDA, the browser
renders the speech "Please say your postcode", the user must
reply in speech and says "BS34 8QZ", the browser renders the
speech "I'm having trouble understanding you, please enter your
postcode using the soft keyboard." The user must respond using
the soft keyboard (i.e. not in speech).</li>
</ol>
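<p>The bank example might be written along the following lines.
This is only a sketch: the <code>state</code>, <code>prompt</code>,
<code>grammar</code> and <code>dtmf</code> elements and the
<code>input</code> attribute are invented for illustration, and the
grammar file name is hypothetical.</p>
<pre>
&lt;!-- Hypothetical markup: each dialog state activates exactly
     one input mode, so the modes are used sequentially. --&gt;
&lt;state id="get-name" input="speech"&gt;
  &lt;prompt&gt;Speak your name&lt;/prompt&gt;
  &lt;grammar src="names.gram"/&gt;
&lt;/state&gt;
&lt;state id="get-pin" input="dtmf"&gt;
  &lt;prompt&gt;Using the keypad, enter your PIN&lt;/prompt&gt;
  &lt;dtmf length="4"/&gt;
&lt;/state&gt;
</pre>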
<h3>2.3 <a id="s2.3" name="s2.3">Uncoordinated, Simultaneous,
Multi-modal Input</a> (must address)</h3>
<p>The markup language specifies that speech and user input from
other modalities are to be interpreted by the browser and that
input modalities are simultaneously active. There is no
requirement that interpretation of the input modalities is
coordinated (i.e. interpreted together). In a particular dialog
state, there is more than one input mode available but only input
from one of the modalities is interpreted (e.g. the first input -
see <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>). For example, a voice browser in a desktop
environment could accept either keyboard input or spoken input in
the same dialog state.</p>
<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action, it can be in one of several input modes -
only one mode of input will be accepted by the browser. See
requirement <a href="#s4.7.1">4.7.1 - minimally required
synchronization points.</a></p>
<p>Examples:</p>
<ol>
<li>In a bank application accessed via a phone, the browser
renders the speech "Enter your name", the user says "Jack Jones"
or enters his name via the keypad, the browser renders the speech
"Enter your account number", the user enters the number via the
keypad or speaks the account number.</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases, either using speech or by selecting a
button on screen. The browser renders a list of titles on screen.
The user selects by pointing to the title with the pen or by
speaking the title of the track.</li>
</ol>
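<p>The music example might look as follows. Again, this is a
hypothetical sketch - the <code>state</code>, <code>speech</code>
and <code>visual</code> elements and the <code>input</code>
attribute are invented - showing both modes active in the same
dialog state, with only one input interpreted.</p>
<pre>
&lt;!-- Hypothetical markup: speech and GUI input are active at the
     same time; whichever arrives first is interpreted. --&gt;
&lt;state id="select-title" input="speech gui"&gt;
  &lt;speech&gt;
    &lt;grammar src="titles.gram"/&gt;
  &lt;/speech&gt;
  &lt;visual&gt;
    &lt;select name="title"&gt;
      &lt;option&gt;First new release&lt;/option&gt;
      &lt;option&gt;Second new release&lt;/option&gt;
    &lt;/select&gt;
  &lt;/visual&gt;
&lt;/state&gt;
</pre>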
<h3>2.4 <a id="s2.4" name="s2.4">Coordinated, Simultaneous
Multi-modal Input</a> (nice to address)</h3>
<p>The markup language specifies that speech and user input from
other modalities are allowed at the same time and that
interpretation of the inputs is coordinated. In a particular
dialog state, there is more than one input mode available and
input from multiple modalities is interpreted (e.g. within a
given time window). When the user takes some action it can be
composed of inputs from several modalities - for example, a voice
browser in a desktop environment could accept keyboard input and
spoken input together in the same dialog state.</p>
<p>Examples:</p>
<ol>
<li>In a telephony environment, the user can type <em>200</em> on
the keypad and say <em>transfer to checking account</em> and the
interpretations are coordinated so that they are understood as
<em>transfer 200 to checking account</em>.</li>
<li>In a route finding application, the user points at Bristol on
a map and says "Give me directions from London to here".</li>
</ol>
<p>See also <a href="#s2.11">2.11 Composite Meaning
requirement</a>, <a href="#s2.13">2.13 Resolve conflicting input
requirement</a>.</p>
<h3>2.5 Input modes supported (must address)</h3>
<p>The markup language will support the following input modes, in
addition to speech:</p>
<ul>
<li>DTMF</li>
<li>keyboard</li>
<li>pointing device (e.g. mouse, touchscreen, etc)</li>
</ul>
<p>DTMF will be supported using the dialog markup specified by
the W3C Voice Browser Working Group's dialog requirements.</p>
<p>Character and pointing input will be supported using other
markup languages together with scripting (e.g. HTML with
JavaScript).</p>
<p>See <a href="#s4.1">reuse standard markup requirement
(4.1).</a></p>
<h3>2.6 Additional input modes supported (nice to address)</h3>
<p>The markup language will support other input modes,
including:</p>
<ul>
<li>handwriting script;</li>
<li>handwriting gesture - e.g. to delete, to insert.</li>
</ul>
<h3>2.7 Extensible to new input media types (nice to
address)</h3>
<p>The model will be abstract enough that any new or exotic input
medium (e.g. gesture captured by video) can fit into it.</p>
<h3>2.8 <a id="s2.8" name="s2.8">Semantics of input generated by
UI components other than speech</a> (nice to address)</h3>
<p>The markup language should support semantic tokens that are
generated by UI components other than speech. These tokens can be
considered in a similar way to action tags and speech grammars.
For example, in a pizza application, if a topping can be selected
from an option list on the screen, the author can declare that
the semantic token 'topping' can be generated by a GUI
component.</p>
<h3>2.9 <a id="s2.9" name="s2.9">Modality-independent
representation of the meaning of user input</a> (nice to
address)</h3>
<p>The markup language should support a modality-independent
method of representing the meaning of user input. This should be
annotated with a record of the modality type. This is related to
the <a href="#s4.3">XForms requirement (4.3)</a> and to the work
on Natural Language within the <a
href="http://www.w3.org/Voice/">W3C Voice activity</a>.</p>
<p>The markup language supports the same semantic representation
of input from different modalities. For example, in a pizza
application, if a topping can be selected from an option list on
the screen or by speaking, the same semantic token, e.g.
'topping' can be used to represent the input.</p>
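<p>For the pizza example, the markup might declare a single
semantic token that both modalities can generate. The sketch below
is purely illustrative: the <code>field</code> element, the
<code>token</code> attribute and the result line are invented, and
the grammar file name is hypothetical.</p>
<pre>
&lt;!-- Hypothetical markup: speech and GUI input both yield the
     semantic token 'topping'; the browser annotates the result
     with the modality that produced it. --&gt;
&lt;field token="topping"&gt;
  &lt;grammar src="toppings.gram"/&gt;   &lt;!-- speech input --&gt;
  &lt;select name="topping"&gt;           &lt;!-- GUI input --&gt;
    &lt;option&gt;ham&lt;/option&gt;
    &lt;option&gt;mushroom&lt;/option&gt;
  &lt;/select&gt;
&lt;/field&gt;
&lt;!-- Possible result: token="topping" value="mushroom" modality="gui" --&gt;
</pre>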
<h3>2.10 Coordinate speech grammar with grammar for other input
modalities (future revision)</h3>
<p>The markup language coordinates the grammars for modalities
other than speech with speech grammars to avoid duplication of
effort in authoring multimodal grammars.</p>
<h3>2.11 <a id="s2.11" name="s2.11">Composite meaning</a> (nice
to address)</h3>
<p>It must be possible to combine multimodal inputs to form a
composite meaning. This is related to the <a href="#s2.4">
coordinated, simultaneous multi-modal input requirement (2.4)</a>. For
example, the user points at Bristol on a map and says "Give me
directions from London to here". The formal representations of the
meaning of each input need to be combined to get a composite
meaning - "Give me directions from London to Bristol". See also
<a href="#s2.8">semantics of input generated by UI components
other than speech (2.8)</a> and <a href="#s2.9">modality-independent
semantic representation (2.9)</a>.</p>
<h3>2.12 Time window for coordinated multimodal input (nice to
address)</h3>
<p>The markup language supports specification of timing
information to determine whether input from multiple modalities
should combine to form an integrated semantic representation. See
<a href="#s2.4">coordinated multimodal input requirement
(2.4)</a>. This could, for example, take the form of a time
window which is specified in the markup, where input events from
different modalities that occur within this window are combined
into one semantic entity.</p>
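<p>Such a time window might be expressed as in the following
sketch; the <code>coordinate</code> element, the
<code>window</code> attribute and the child elements are all
invented for illustration.</p>
<pre>
&lt;!-- Hypothetical markup: pen and speech events arriving within
     750 ms of one another are combined into one semantic entity;
     events outside the window are interpreted separately. --&gt;
&lt;coordinate window="750ms"&gt;
  &lt;speech&gt;&lt;grammar src="directions.gram"/&gt;&lt;/speech&gt;
  &lt;pointing target="map"/&gt;
&lt;/coordinate&gt;
</pre>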
<h3>2.13 <a id="s2.13" name="s2.13">Support for conflicting input
from different modalities</a> (must address)</h3>
<p>The markup language will support the detection of conflicting
input from several modalities. For example, in a speech + GUI
interface, there may be simultaneous but conflicting speech and
mouse inputs; the markup language should allow the conflict to be
detected so that an appropriate action can be taken. Consider a
music application: the user says "play Madonna" while entering
"Elvis" in an artist text box on screen; an application might
resolve this by asking "Did you mean Madonna or Elvis?". This is
related to the <a href="#s2.3">2.3 uncoordinated simultaneous
multimodal input</a> and <a href="#s2.4">2.4 coordinated
simultaneous input</a> requirements.</p>
<h3>2.14 <a id="s2.14" name="s2.14">Context for recognizer</a>
(nice to address)</h3>
<p>The markup language should allow features of the display to
indicate a context for voice interaction. For example:</p>
<ul>
<li>the context for interpreting a spoken utterance might be
indicated by the form field that has focus on the display;</li>
<li>the speech grammar might be dependent on what is currently
being displayed (the page or just the area that's visible).</li>
</ul>
<h3>2.15 <a id="s2.15" name="s2.15">Resolve spoken reference to
display</a> (future revision)</h3>
<p>Interpretation of the input must provide enough information to
the natural language system to be able to resolve speech input
that refers to items in the visual context. For example: the
screen is displaying a list of possible flights that match a
user's requirements and the user says "I'll take the third
one".</p>
<h3>2.16 Time stamping (should address)</h3>
<p>All input events will be time-stamped, in addition to the time
stamping covered by the Dialog Requirements. This includes, for
example, time-stamping speech, key press and pointing events. For
finer grained synchronization, time stamping at the start and the
end of each word within speech may be needed.</p>
<h2>3. Output media requirements</h2>
<h3>3.1 Audio Media Output (must address)</h3>
<p>The markup language can specify the content rendered as spoken
output by the voice browser.</p>
<h3>3.2 <a id="s3.2" name="s3.2">Sequential multimedia output</a>
(must address)</h3>
<p>The markup language specifies that content is rendered in
speech and other media types. There is no requirement that the
output media are rendered simultaneously. For example, a browser
can output speech in one dialog state and graphics in
another.</p>
<p>The granularity is defined by things like input events.
Synchronization does not occur at any finer granularity. When the
user takes some action - either spoken or by pointing, for
example - a response is rendered in one of the output media -
either visual or voice, for example. See requirement <a
href="#s4.7.1">4.7.1 - minimally required synchronization
points.</a></p>
<p>Examples:</p>
<ol>
<li>In a speech plus WML banking application, accessed via a WAP
phone, the user asks "What's my balance?". The browser renders the
account balance on the display only. The user clicks OK and the
browser renders the response as speech only - "Would you like
another service?"...</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, together with the text instruction to select a title
to hear the track. The user selects a track by speaking the
number. The browser plays the selected track - the screen does
not change.</li>
</ol>
<h3>3.3 <a id="s3.3" name="s3.3">Uncoordinated, Simultaneous,
Multi-media Output</a> (must address)</h3>
<p>The markup language specifies that content is rendered in
speech and other media at the same time (i.e. in the same dialog
state). There is no requirement that the rendering of the output
media is coordinated (i.e. synchronized) any further. Where
appropriate, synchronization of speech with other output media
should be supported with SMIL or a related standard.</p>
<p>The granularity of the synchronization for this requirement is
coarser than for the <a href="#s3.4">coordinated simultaneous
output requirement (3.4)</a>. The granularity is defined by
things like input events. When the user takes some action -
either spoken or by pointing, for example - something happens
with the visual and the voice channels but there is no further
synchronization at a finer granularity than that. I.e., a browser
can output speech and graphics in one dialog state, but the two
outputs are not synchronized in any other way. See requirement <a
href="#s4.7.1">4.7.1 - minimally required synchronization
points.</a></p>
<p>Examples:</p>
<ol>
<li>In a cinema-ticket application accessed via a WAP phone, the
user asks what films are showing. The browser renders the list of
films on the screen and renders an instruction in speech - "Here
are today's films. Select one to hear a full description".</li>
<li>A browser in a smart phone environment plays a prompt "Which
service do you require?", while displaying a list of options such
as "Do you want to: (a) transfer money; (b) get account info; (c)
quit."</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, and renders an instruction in speech "Here are the
five recommended new releases. Select one to hear a clip". The
user selects one by speaking the title. The browser renders the
audio clip and, at the same time, displays the price and
information about the band. When the track has finished, the user
selects a button on screen to return to the list of tracks.</li>
</ol>
<h3>3.4 <a id="s3.4" name="s3.4">Coordinated, Simultaneous
Multi-media Output</a> (nice to address)</h3>
<p>The markup language specifies that content is to be
simultaneously rendered in speech and other media and that output
rendering is further coordinated (i.e. synchronized). The
granularity is defined by things that happen within the response
to a given user input - see <a href="#s4.7.2">4.7.2 Finer grained
synchronization points.</a> Where appropriate, synchronization of
speech with other output media should be supported with SMIL or a
related standard.</p>
<p>Examples:</p>
<ol>
<li>In a news application, accessed via a PDA, a browser
highlights each paragraph of text (e.g. headline) as it renders
the corresponding speech.</li>
<li>In a learn-to-read application accessed via a PC, the lips of
an animated character are synchronized with speech output, the
words are highlighted on screen as they are spoken and pictures
are displayed as the corresponding words are spoken (e.g. a cat
is displayed as the word cat is spoken).</li>
<li>In a music application accessed via a PDA, the user asks to
hear clips of new releases. The browser renders a list of titles
on screen, highlights the first and starts playing it. When the
first track has finished, the browser highlights the second title
on screen and starts playing the second track, and so on.</li>
<li>Display an image 5 seconds after a spoken prompt has
started.</li>
<li>Display an image for 5 seconds then render a speech
prompt.</li>
</ol>
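<p>Examples 4 and 5 can already be expressed in SMIL 1.0, as the
following sketch shows (the file names are invented). The
<code>par</code> element renders its children in parallel and the
<code>seq</code> element renders them in sequence.</p>
<pre>
&lt;smil&gt;
  &lt;body&gt;
    &lt;!-- Example 4: display an image 5 seconds after a spoken
         prompt has started. --&gt;
    &lt;par&gt;
      &lt;audio src="prompt.wav"/&gt;
      &lt;img src="picture.png" begin="5s" dur="10s"/&gt;
    &lt;/par&gt;
    &lt;!-- Example 5: display an image for 5 seconds, then render
         a speech prompt. --&gt;
    &lt;seq&gt;
      &lt;img src="picture.png" dur="5s"/&gt;
      &lt;audio src="prompt.wav"/&gt;
    &lt;/seq&gt;
  &lt;/body&gt;
&lt;/smil&gt;
</pre>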
<p>See also <a href="#s3.5">Synchronization of Multimedia with
voice input requirement (3.5)</a>.</p>
<h3>3.5 <a id="s3.5" name="s3.5">Synchronization of multimedia
with voice input</a> (nice to address)</h3>
<p>The markup language specifies that media output and voice
input are synchronized. The granularity is defined by: things
that happen within the response to a given user input, e.g. play
a video and 30 seconds after it has started activate a speech
grammar; things that happen within a speech input, e.g. detect
the start of a spoken input and 5 seconds later play a video.
Where appropriate, synchronization of speech with other output
media should be supported with SMIL or a related standard. See <a
href="#s3.4">Coordinated simultaneous multimedia output
requirement (3.4)</a>; <a href="#s4.7.2">4.7.2 Finer grained
synchronization points.</a></p>
<h3>3.6 Temporal semantics for synchronization of voice input and
output with multimedia (nice to address)</h3>
<p>The markup language will have clear temporal semantics so that
it can be integrated into the SMIL multimedia framework.
Multi-media frameworks are characterized by precise temporal
synchronization of output and input. For example, the SMIL
notation is based on timing primitives that allow the composition
of complex behaviors. See the <a href="#s3.5">synchronization of
multimedia with voice input requirement (3.5)</a> and the <a
href="#s3.4">coordinated simultaneous multimedia output
requirement (3.4)</a>.</p>
<h3>3.7 Visual output of text (must address)</h3>
<p>The markup language will support visual output of text, using
other markup languages such as HTML or WML (see <a href="#s4.1">
reuse of standard markup requirement, 4.1</a>). For example, the
following may be presented as text on the display:</p>
<ul>
<li>Contextual/history information (e.g. display partially filled
in form);</li>
<li>Prompts;</li>
<li>Menus;</li>
<li>Confirmation;</li>
<li>Error messages.</li>
</ul>
<p>Example 1:</p>
<ul>
<li>User says: "My name is Jack Jones",</li>
<li>System displays: "Jack Jones" in address field.</li>
</ul>
<p>Example 2:</p>
<ul>
<li>User says: "Transfer $200 from my savings account to my
checking account",</li>
<li>System displays:
<ul>
<li>Operation: transfer</li>
<li>Source account: savings account</li>
<li>Destination account: checking account</li>
<li>Amount: $200</li>
</ul>
</li>
</ul>
<h3>3.8 Media supported by other Voice Browsing Requirements
(must address)</h3>
<p>The markup language supports output defined in other W3C Voice
Browser Working Group specifications - for example, recorded audio
(Speech Synthesis Requirements). See <a href="#s4.1">reuse of
standard markup requirement (4.1).</a></p>
<h3>3.9 Media objects supported by SMIL (should address)</h3>
<p>The markup language supports output of media objects supported
by SMIL (animation, audio, img, video, text, textstream), using
other markup languages (see <a href="#s4.1">reuse of standard
markup requirement, 4.1</a>).</p>
<h3>3.10 Other output media (nice to address)</h3>
<p>The markup language supports output of the following media,
using other markup languages (see <a href="#s4.1">reuse of
standard markup requirement, 4.1</a>).</p>
<ul>
<li>media types supported by CSS2</li>
<li>synthesis of audio - MIDI</li>
<li>lip-synch face synthesis</li>
</ul>
<h3>3.11 Extensible to new media (nice to address)</h3>
<p>The markup language will be extensible to support new output
media types (e.g. 3D graphics).</p>
<h3>3.12 <a id="s3.12" name="s3.12"></a>Media-independent
representation of the meaning of output (future revision)</h3>
<p>The markup language should support a media-independent method
of representing the meaning of output. For example, the output could be
represented in a frame format and rendered in speech or on the
display by the browser. This is related to the <a href="#s4.3">XForms
requirement (4.3)</a>.</p>
<h3>3.13 <a id="s3.13" name="s3.13">Display size</a> (should
address)</h3>
<p>Visual output will be renderable on displays of different
sizes. This should be by using standard visual markup languages
e.g., HTML, CHTML, WML, where appropriate, see <a href="#s4.1">
reuse standard markup requirement</a> (4.1).</p>
<p>This requirement applies to two kinds of visual markup:</p>
<ul>
<li>markup that can be rendered flexibly as the display size
changes</li>
<li>markup that is pre-configured for a particular display
size.</li>
</ul>
<h3>3.14 <a id="s3.14" name="s3.14">Output to more than one
window</a> (future revision)</h3>
<p>The markup language supports the identification of the display
window. This is to support applications where there is more than
one window.</p>
<h3>3.15 <a id="s3.15" name="s3.15">Time stamping</a> (should
address)</h3>
<p>All output events will be time-stamped, in addition to the
time stamping covered by the Dialog
Requirements. This includes time-stamping the start and the end
of a speech event. For finer grained synchronization, time
stamping at the start and the end of each word within speech may
be needed.</p>
<h2>4. <a id="s4" name="s4">Architecture, Integration and
Synchronization points</a></h2>
<h3>4.1 <a id="s4.1" name="s4.1">Reuse standard markup
languages</a> (must address)</h3>
<p>Where possible, the specification must reuse standard visual,
multimedia and aural markup languages, including:</p>
<ul>
<li>other <a href="http://www.w3.org/Voice/">W3C Voice Browser
Working Group</a> specifications for voice markup;</li>
<li>standard multimedia notations (SMIL or a related
standard);</li>
<li>standard visual markup languages e.g., HTML, CHTML, WML;</li>
<li>other relevant specifications, including ACSS.</li>
</ul>
<p>The specification should avoid unnecessary differences with
these markup languages.</p>
<p>In addition, the markup will be compatible with the W3C's work
on Client Capabilities and Personal Preferences (CC/PP).</p>
<h3>4.2 Mesh with modular architecture proposed for XHTML (nice
to address)</h3>
<p>The results of the work should mesh with the modular
architecture proposed for XHTML, where different markup modules
are expected to cohabit and inter-operate gracefully within an
overall XHTML container.</p>
<p>As part of this goal the design should be capable of
incorporating multiple visual and aural markup languages.</p>
<h3>4.3 <a id="s4.3" name="s4.3">Compatibility with W3C work on
X-Forms</a> (nice to address)</h3>
<p>The markup language should be compatible with the W3C's work
on X-Forms.</p>
<ol>
<li>Have an explicit data model for the back end (i.e. the data)
and map it to the front end.</li>
<li>Separate the data model from the presentation. The
presentation depends on the device modality.</li>
<li>Application data and logic should be modality
independent.</li>
</ol>
<p>Related to requirements: <a href="#s3.12">media-independent
representation of output (3.12)</a> and <a href="#s2.9">modality-independent
representation of input (2.9)</a>.</p>
<h3>4.4 Detect that a given modality is available (must
address)</h3>
<p>The markup language will allow identification of the
modalities available. This will allow an author to identify that
a given modality is/is not present and as a result switch to a
different dialog. E.g. there is a visible construct that an
author can query. This can be used to provide for accessibility
requirements and for environmental factors (e.g. noise). The
availability of input and output modalities can be controlled by
the user or by the system. The extent to which the functionality
is retained when modalities are not available is the
responsibility of the author.</p>
<p>The following is a list of use cases regarding a multimodal
document that specifies speech and GUI input and output. The
document could be designed such that:</p>
<ol>
<li>when the speech input error count is high, the user can make
equivalent selections via the GUI;</li>
<li>where a user has a speech impairment, speech input can be
deselected and the user controls the application via the
GUI;</li>
<li>when the user cannot hear a verbal prompt due to a noisy
environment (detected, for example, by no response), an
equivalent prompt is displayed on the screen;</li>
<li>where a user has a hearing impairment the speech output is
deselected and equivalent prompts are displayed.</li>
</ol>
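<p>A sketch of how an author might query modality availability
follows; the <code>if</code>/<code>else</code> construct, the
<code>modality()</code> function and the <code>goto</code> element
are invented for this example.</p>
<pre>
&lt;!-- Hypothetical markup: switch to a GUI-only dialog when
     speech input is unavailable (e.g. deselected by the user
     or suppressed in a noisy environment). --&gt;
&lt;if cond="modality('speech-input').available"&gt;
  &lt;goto next="#speech-and-gui-dialog"/&gt;
&lt;else/&gt;
  &lt;goto next="#gui-only-dialog"/&gt;
&lt;/if&gt;
</pre>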
<h3>4.5 Means to act on a notification that a modality has become
available/unavailable (must address)</h3>
<p>Note that this is a requirement on the system and not on the
markup language. For example, when there is temporarily high
background noise, the application may disable speech input and
output but enable them again when the noise lessens. This is a
requirement for an event handling mechanism.</p>
<h3>4.6 Transformable documents</h3>
<h4>4.6.1 Loosely coupled documents (nice to address)</h4>
<p>The markup language should support loosely coupled documents,
where separate markup streams for each modality are synchronized
at well-defined points. For example, separate voice and visual
markup streams could be synchronized at the following points:
visiting a form, following a link.</p>
<h4>4.6.2 Tightly coupled documents (nice to address)</h4>
<p>The markup language should support tightly coupled documents.
Tightly coupled documents have document elements for each
interaction modality interspersed in the same document. I.e. a
tightly coupled document contains sub-documents from different
interaction modalities (e.g. HTML and voice markup) and has been
authored to achieve explicit synchrony across the interaction
streams.</p>
<p>Tightly coupled documents should be viewed as an optimization
of the loosely-coupled approach, and should be defined by
describing a reversible transformation from a tightly-coupled
document to multiple loosely-coupled documents. For example, a
tightly coupled document that includes HTML and voice markup
sub-documents should be transformable to a pair of documents,
where one is HTML only and the other is voice markup only - see
<a href="#s4.6.3">transformation requirement</a> (4.6.3).</p>
<h4>4.6.3 <a id="s4.6.3" name="s4.6.3">Transformation between
tightly and loosely coupled documents by standard tree
transformations as expressible in XSLT</a> (nice to address)</h4>
<p>The markup language should be designed such that tightly
coupled documents are <em>transformable</em> to documents for
specific interaction modalities by standard tree transformations
as expressible in XSLT. Conversely, tightly coupled documents
should be viewed as a simple transformation applied to the
individual sub-documents, with the transformation playing the
role of tightly coupling the sub-documents into a single
document.</p>
<p>This requirement will ensure content re-use, keep
implementation of multimodal browsers manageable and provide for
accessibility requirements.</p>
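<p>As a sketch of such a transformation, the following XSLT
stylesheet extracts the HTML-only document from a tightly coupled
document by copying everything except elements in the voice markup
namespace (the namespace URI here is invented).</p>
<pre>
&lt;?xml version="1.0"?&gt;
&lt;xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:v="http://example.org/2000/voice-markup"&gt;
  &lt;!-- Identity template: copy every node and attribute. --&gt;
  &lt;xsl:template match="@*|node()"&gt;
    &lt;xsl:copy&gt;
      &lt;xsl:apply-templates select="@*|node()"/&gt;
    &lt;/xsl:copy&gt;
  &lt;/xsl:template&gt;
  &lt;!-- Drop every element in the (hypothetical) voice namespace,
       leaving the HTML-only sub-document. --&gt;
  &lt;xsl:template match="v:*"/&gt;
&lt;/xsl:stylesheet&gt;
</pre>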
<p>It is important to note that not all the interaction information
from the tightly coupled document may be preserved. If, for
example, you have a speech + GUI design, when you take out the
GUI, the application is not necessarily equivalently usable. It
is up to the author to decide whether the speech document has all
the information that the speech plus GUI document has. Depending
on how the author created the multimodal document, the
transformation could be entirely lossy, could degrade gracefully
by preserving some information from the GUI or could preserve all
information from the GUI. If the author's intent is that the
application should be usable in the presence or absence of either
modality, it is the author's responsibility to design the
application to achieve this.</p>
<h3>4.7 <a id="s4.7" name="s4.7">Synchronization points</a></h3>
<h4>4.7.1 <a id="s4.7.1" name="s4.7.1">Minimally required
synchronization points</a> (must address)</h4>
<p>The markup language should minimally enable synchronization
across different modalities at well-known interaction points in
today's browsers, for example, entering and exiting specific
interaction widgets:</p>
<ul>
<li>Entry to a form;</li>
<li>Entry to a menu;</li>
<li>Completion of a form;</li>
<li>Choosing a menu item (in a voice markup language) or link
(HTML);</li>
<li>Filling of a field within a form.</li>
</ul>
<p>For example:</p>
<ul>
<li>The material displayed visually and the GUI input options can
be conditional on: the current voice dialog; the current state of
the voice dialog (e.g. the form, the menu).</li>
<li>The voice markup (i.e. the dialog/grammar/prompt) can be
conditional on: the HTML page being displayed; the text box in
focus; the option selected; the button that has been
clicked.</li>
</ul>
<p>See <a href="#s3.2">multimedia output requirements (3.2, 3.3
and 3.4)</a> and <a href="#s2.2">multimodal input
requirements</a> (2.2, 2.3 and 2.4).</p>
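<p>For example, making the active voice markup conditional on the
text box in focus might be written as in the following sketch; the
<code>sync</code> element and its attributes are invented for
illustration.</p>
<pre>
&lt;!-- Hypothetical markup: when the HTML field "city" gains
     focus, the corresponding prompt and grammar are activated. --&gt;
&lt;sync event="focus" field="city"&gt;
  &lt;prompt&gt;Which city?&lt;/prompt&gt;
  &lt;grammar src="cities.gram"/&gt;
&lt;/sync&gt;
</pre>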
<h4>4.7.2 <a id="s4.7.2" name="s4.7.2">Finer-grained
synchronization points</a> (nice to address)</h4>
<p>The markup language should support finer-grained
synchronization. Where appropriate, synchronization of speech
with other output media should be supported with SMIL or a
related standard.</p>
<p>For example:</p>
<ul>
<li>to allow a display to synchronize with events in the auditory
output stream</li>
<li>to allow voice markup (i.e. the dialog/grammar/prompt) to
synchronize with scrolling events on the display</li>
<li>to allow voice markup to synchronize with temporal events in
output media.</li>
</ul>
<p>Synchronization points include:</p>
<ul>
<li>events in the auditory output stream e.g. start/finish voice
output events (word, line, paragraph, section)</li>
<li>fine-grained events on the display (e.g. scrolling)</li>
<li>temporal events in other output media.</li>
</ul>
<p>See <a href="#s3.4">3.4 coordinated simultaneous multimodal
output requirement</a>.</p>
<h4>4.7.3 Coordinate synchronization points with the DOM event
model (future study)</h4>
<ol>
<li>Synchronization points should be coordinated with the DOM
event model. I.e. one possible starting point for a list of such
synchronization points would be the event types defined by the
DOM, appropriately modified to be modality independent.</li>
<li>Event types defined for multimodal browsing should be
integrated into the DOM; as part of this effort, the Voice WG
might provide requirements as input to the next level of the DOM
specification.</li>
</ol>
<h4>4.7.4 Browser functions and synchronization points (future
study)</h4>
<p>The notion of synchronization points (or navigation
signposts) is important; it should also be tied into a discussion
of what canonical browser functions like "back", "undo", and
"forward" mean, and what they mean to the global state of the
multimodal browser. The notion of 'back' is unclear in a voice
context.</p>
<h3>4.8 Interaction with External Components (must address)</h3>
<p>The markup language must support a generic component interface
to allow for the use of external components on the client and/or
server side. The interface provides a mechanism for transferring
data between the markup language's variables and the component.
Examples of such data are: semantic representations of user input
(such as attribute-value pairs); URL of markup for different
modalities (e.g. the URL of an HTML page). The markup language also
supports the interaction with external components specified
by the <a
href="http://www.w3.org/TR/1999/WD-voice-dialog-reqs-19991223/">
W3C Voice Browsing Dialog Requirements (Requirement
2.10)</a>.</p>
<p>Examples of external components are components for interaction
modalities other than speech (e.g. an HTML browser) and server
scripts. Server scripts can be used to interact with remote
services, devices or databases.</p>
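<p>A sketch of such an interface follows; the
<code>component</code>, <code>param</code> and <code>result</code>
elements and the URL are invented for this example.</p>
<pre>
&lt;!-- Hypothetical markup: invoke a server script, passing a
     markup variable in and receiving a result variable back. --&gt;
&lt;component src="http://www.example.com/scripts/balance"&gt;
  &lt;param name="account" expr="document.account"/&gt;
  &lt;result name="balance"/&gt;
&lt;/component&gt;
</pre>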
<h2>Acknowledgements</h2>
<p>The following people participated in the multimodal subgroup
of the Voice Browser working group and contributed to this
document:</p>
<ul>
<li>T. V. Raman (IBM)</li>
<li>Bruce Lucas (IBM)</li>
<li>Pekka Kapanen (Nokia)</li>
<li>Peter Boda (Nokia)</li>
<li>Laurence Prevosto (EDF)</li>
<li>Marianne Hickey (HP)</li>
<li>Nils Klarlund (AT&amp;T)</li>
<li>Carolina Di Cristo (Telecom Italia)</li>
<li>Charles T. Hemphill (Conversational Computing)</li>
<li>Alan Goldschen (MITRE)</li>
<li>Andreas Kellner (Philips)</li>
<li>Markku T. Hakkinen (The Productivity Works)</li>
<li>Kuansan Wang (Microsoft)</li>
<li>David Raggett (W3C/HP)</li>
<li>Jim Colson (IBM)</li>
<li>Scott McGlashan (Pipebeach)</li>
<li>Frank Scahill (BT)</li>
</ul>
</body>
</html>