<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
  <title>Multimodal Application Developer Feedback</title>
  <style type="text/css">
code { font-family: monospace; margin-left: 2em }
.ref { font-size: 80% }
.quote { margin-left: 5%; margin-right: 10% }
.definition { margin-left: 5%; margin-right: 10%; font-style: italic }
.diagram { text-align: center; font-size: 80%; font-weight: bold }
.changed {background-color: rgb(255, 255, 224)}
.deleted {background-color: rgb(240, 240, 240); text-decoration: line-through }
.comment {background-color: rgb(0, 204, 204)}
.pending {background-color: rgb(255, 224, 224)}
ul.toc li { list-style-type: none }
  </style>
  <link href="http://www.w3.org/StyleSheets/TR/W3C-WG-NOTE.css"
  rel="stylesheet" type="text/css" />
</head>

<body xml:lang="en" lang="en">

<div class="head">
<a href="http://www.w3.org/"><img alt="W3C" height="48"
src="http://www.w3.org/Icons/w3c_home" width="72" /></a>

<h1>Multimodal Application Developer Feedback</h1>

<h2>W3C Working Group Note 14 April 2006</h2>

<dl>
  <dt>This version:</dt>
    <dd><a
      href="http://www.w3.org/TR/2006/NOTE-mmi-dev-feedback-20060414/">http://www.w3.org/TR/2006/NOTE-mmi-dev-feedback-20060414/</a></dd>
  <dt>Latest version:</dt>

    <dd><a
      href="http://www.w3.org/TR/mmi-dev-feedback/">http://www.w3.org/TR/mmi-dev-feedback/</a></dd>
  <dt>Previous version:</dt>
    <dd><em>This is the first publication.</em></dd>
  <dt>Editors:</dt>
    <dd>Andrew Wahbe, VoiceGenie Technologies</dd>
    <dd>Gerald McCobb, IBM</dd>
    <dd>Klaus Reifenrath, Nuance</dd>
    <dd>Raj Tumuluri, Openstream</dd>
    <dd>Sunil Kumar, V-Enable</dd>
</dl>

<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
&#169; 2006 <a href="http://www.w3.org/"><acronym
title="World Wide Web Consortium">W3C</acronym></a><sup>&#174;</sup> (<a
href="http://www.csail.mit.edu/"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,

<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>
and <a href="http://www.w3.org/Consortium/Legal/copyright-documents">document
use</a> rules apply.</p>
</div>

<!-- end of head div -->

<hr title="Separator for header" />

<h2 id="abstract">Abstract</h2>

<p>Several years of multimodal application development in
various business areas and on various device platforms have
given developers enough experience to offer detailed
feedback about what they like, what they dislike, and what
they want to see continued and improved.  That feedback is
provided here as input to the specifications under development
in the W3C
<a href="http://www.w3.org/2002/mmi/">Multimodal Interaction</a>
and <a href="http://www.w3.org/Voice/">Voice Browser</a>
Activities.</p>

<h2 id="status">Status of this Document</h2>

<p><em>This section describes the status of this document at
the time of its publication. Other documents may supersede this
document. A list of current W3C publications and the latest revision
of this technical report can be found in the
<a href="http://www.w3.org/TR/">W3C technical reports
index</a> at http://www.w3.org/TR/.</em></p>

<p>This document is a W3C Working Group Note. It represents
the views of the W3C Multimodal Interaction Working Group at
the time of publication. The document may be updated as new
technologies emerge or mature. Publication as a Working
Group Note does not imply endorsement by the W3C Membership.
This is a draft document and may be updated, replaced or
obsoleted by other documents at any time. It is inappropriate
to cite this document as other than work in progress.</p>

<p>This document is one of a series produced by the
<a href="http://www.w3.org/2002/mmi/Group/">Multimodal
Interaction Working Group</a> <em>(<a
 href="http://cgi.w3.org/MemberAccess/AccessRequest">Member
Only Link</a>)</em>, part of the <a
href="http://www.w3.org/2002/mmi/">W3C Multimodal
Interaction Activity</a>. The MMI activity statement can
be seen at
<a href="http://www.w3.org/2002/mmi/Activity">http://www.w3.org/2002/mmi/Activity</a>.</p>

<p>Comments on this document can be sent to <a
href="mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>,
the public forum for discussion of the W3C's work on
Multimodal Interaction. To subscribe, send an email to
<a href="mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>
with the word subscribe in the subject line (include the
word unsubscribe if you want to unsubscribe). The
<a href="http://lists.w3.org/Archives/Public/www-multimodal/">archive</a>
for the list is accessible online.</p>

<p>This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 February 2004 W3C Patent Policy</a>. This document is informative only. W3C maintains a <a rel="disclosure" href="http://www.w3.org/2004/01/pp-impl/34607/status">public list of any patent disclosures</a> made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential Claim(s)</a> must disclose the information in accordance with <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section 6 of the W3C Patent Policy</a>.</p>

<h2 id="contents">Table of Contents</h2>

<ul class="toc">
<li>1 <a href="#s1">Introduction</a></li>
<li>2 <a href="#s2">What developers liked</a>
 <ul>
  <li>2.1 <a href="#s2.1">Reusable and pluggable modality
    components</a></li>
  <li>2.2 <a href="#s2.2">Modular modality components</a></li>
  <li>2.3 <a href="#s2.3">Declarative synchronization between
    modalities</a></li>
  <li>2.4 <a href="#s2.4">Scripting and semantic interpretation
	</a></li>
  <li>2.5 <a href="#s2.5">Styling</a></li>
 </ul></li>
<li>3 <a href="#s3">What developers would like to see</a>
 <ul>
  <li>3.1 <a href="#s3.1">Global grammars</a></li>
  <li>3.2 <a href="#s3.2">Speech grammars for HTML links
    and controls</a></li>
  <li>3.3 <a href="#s3.3">Speech prompts for voice-enabled
    HTML links and controls</a></li>
  <li>3.4 <a href="#s3.4">Speech-enabled widgets</a></li>
  <li>3.5 <a href="#s3.5">Use speech to activate links
    and change focus</a></li>
  <li>3.6 <a href="#s3.6">Back functionality</a></li>
 </ul></li>
<li>4 <a href="#s4">What developers would like to see
    continue and improve</a>
 <ul>
  <li>4.1 <a href="#s4.1">Support for both off-line and
    on-line multimodal interaction</a></li>
  <li>4.2 <a href="#s4.2">Support for events distributed
    over the network</a></li>
  <li>4.3 <a href="#s4.3">Support for implicit events</a></li>
  <li>4.4 <a href="#s4.4">VoiceXML tag and feature support</a></li>
  <li>4.5 <a href="#s4.5">Support for both directed and
    user-initiated dialogs</a></li>
  <li>4.6 <a href="#s4.6">Mixed-initiative interaction</a></li>
  <li>4.7 <a href="#s4.7">Access to speech confidence scores
    and n-best list by the application</a></li>
  <li>4.8 <a href="#s4.8">Access to device details</a></li>
  <li>4.9 <a href="#s4.9">Choice of ASR</a></li>
  <li>4.10 <a href="#s4.10">Controlling N-Best choice of
    ASR</a></li>
 </ul></li>
</ul>

<hr />

<h2 id="s1">1 Introduction</h2>

<p>IBM, VoiceGenie Technologies, Nuance, V-Enable, and
Openstream customers have been developing multimodal
applications in a broad range of business areas, including
Field-Force Productivity, Health Care and Life Sciences,
Warehouse and Distribution, Industrial Plant Floor, Financial
and Information Services, Directory Assistance, and the
Mobile Web.  Customer device platforms have included PCs
(desktops, laptops, and tablets), PDAs, kiosks, appliances,
equipment consoles, and web browser-based smart phones.
The multimodal applications primarily extended the traditional
GUI mode of interaction with speech, with the location of the
speech services either local on the device or distributed on
a remote server.  Several XML markup languages were used to
develop these applications, including <a
href="http://www.voicexml.org/specs/multimodal/x+v/12/">XHTML+Voice
(X+V)</a> and <a href="http://www.nuance.com/xhmi/">xHMI</a>.</p>

<p>During the process of developing these applications,
developers found features they liked about the development
environment they were using and found features they thought
were lacking.  Their experiences were collected and are
summarized here as feedback for the W3C <a
href="http://www.w3.org/2002/mmi/">Multimodal Interaction</a>
and <a href="http://www.w3.org/Voice/">Voice Browser</a>
Working Groups to consider when specifying future multimodal
and voice authoring capabilities.  We also solicit comments
from the wider multimodal development community on the extent
to which these observations are consistent with their own
development experiences.</p>

<p>The developers surveyed were expert in various programming
languages and application environments.  Developers expert in
C/C++ and Java generally speech-enabled native applications on
small devices.  Device platforms included Windows Mobile, BREW,
embedded Linux, Symbian, and J2ME.  Developers expert in the
Web generally speech-enabled browser-based applications.  Web
browser platforms included Opera, Access' NetFront, Windows
Mobile Internet Explorer, and the Nokia Series 60 browser.  Web
developers understood the web programming model very well but
were generally new to speech.  They liked XHTML, XML namespaces,
XML Events, CSS, JavaScript, and VoiceXML with its ability to
hide platform details.  Developers expert in VoiceXML and
dictation had backgrounds in speech and telephony and generally
worked on adding a GUI to voice and dictation applications.</p>

<h2 id="s2">2 What developers liked</h2>

<h3 id="s2.1">2.1 Reusable and pluggable modality
components</h3>

<p>Developers preferred to develop modality components that
are reusable and pluggable.</p>

<h4 id="s2.1.1">Use Case:  VoiceXML modality component</h4>

<p>A VoiceXML modality component is reused without
modification in different multimodal applications.</p>

<h3 id="s2.2">2.2 Modular modality components</h3>

<p>Modular modality components are preferred because they
can be authored separately by the modality experts.</p>

<h4 id="s2.2.1">Use Case:  XHTML and VoiceXML modality
components</h4>

<p>A VoiceXML expert authors the voice modality component
and an XHTML expert authors the GUI component.  Modality
component coordination is handled independently, for example,
by X+V &lt;sync&gt; and &lt;cancel&gt; elements.</p>

<h3 id="s2.3">2.3 Declarative synchronization between
modalities</h3>

<p>Developers liked being able to declare the synchronization
between modalities in markup, as the following use case
illustrates.</p>

<h4 id="s2.3.1">Use Case:  X+V &lt;sync&gt; element</h4>

<p>The X+V &lt;sync&gt; element provides declarative
synchronization between XHTML form control elements and the
VoiceXML &lt;field&gt; element. The &lt;sync&gt; element
allows input in one modality, speech or visual, to set
the corresponding field in the other modality. Also, setting
the focus on an &lt;input&gt; element that is synchronized with
a VoiceXML field causes the VoiceXML Form Interpretation
Algorithm (FIA) to visit that field.</p>

<h3 id="s2.4">2.4 Scripting and semantic interpretation</h3>

<p>Developers liked support for modality component
integration via scripting and semantic interpretation.</p>

<h4 id="s2.4.1">Use Case:  Timed notifications of an
operating room medical procedure</h4>

<p>A timed notification changes dynamically as time
progresses. The notification depends on the current
state of the application as well as the notification
state. For a GUI+speech multimodal application, a
notification may consist of text-to-speech (TTS) output
and a new GUI page corresponding to the next step of an
operating room medical procedure.</p>

<h4 id="s2.4.2">Use Case:  Integrated pen and speech
interaction with a map</h4>

<p>The user says "zoom in here" while drawing an area on a map.
The application responds by enlarging the detail of the area
within the boundary drawn by the user.</p>

<h3 id="s2.5">2.5 Styling</h3>

<p>Developers liked CSS for styling each modality. For example,
the CSS3 module for styling speech based on SSML was useful
for styling the voice modality.</p>

<h4 id="s2.5.1">Use Case:  TTS rendering of a news article
on the web</h4>

<p>The news article is read by the computer in a realistic
voice, with different-sounding voices for headlines,
section headings, and text.  There are also pauses between
paragraphs and before article headlines.</p>

<h2 id="s3">3 What developers would like to see</h2>

<h3 id="s3.1">3.1 Global grammars</h3>

<p>Developers would like support for top-level ("global")
grammars that are active across multiple windows (e.g.,
HTML frames or portlets) of the application.</p>

<h4 id="s3.1.1">Use Case:  Top-level menus</h4>

<p>An application has top level menus "buy", "sell", and
"trade".  At any time while involved in the "buy" dialog,
a user can say "trade" and be switched to the "trade"
multimodal dialog.</p>

<h3 id="s3.2">3.2 Speech grammars for HTML links and
controls</h3>

<p>Developers would like support for explicitly adding
speech grammars to  activate HTML links and controls.
An automatically created speech grammar may not capture
everything the user may say.</p>

<h4 id="s3.2.1">Use Case:  Hotel booking application:
get list of hotels</h4>

<p>Before booking a hotel reservation the user looks up a list
of available hotels.  On the page, along with the reservation
form, is a link labeled "Available Hotels." The developer
anticipates that besides "available hotels", the user may say
"show me the available hotels" or ask "what hotels are
available", and adds these two phrases to the grammar for
activating the link.</p>

<h4 id="s3.2.2">Use Case:  Hotel booking application:
submit reservation</h4>

<p>The reservation form's submit button says "submit reservation",
but the developer anticipates that a user might say "submit booking"
instead, and adds "submit booking" to the grammar for activating
the button.</p>
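
<p>The two hotel booking use cases above amount to attaching
several spoken phrases to one link or control.  The following
Python sketch illustrates that idea only; the ActivationGrammar
class, its method names, and the element ids in the usage note
are hypothetical and are not taken from X+V, VoiceXML, or any
other specification.  It also assumes that the recognizer has
already turned the utterance into text.</p>

<pre>
```python
from typing import Dict, Optional


def normalize(phrase: str) -> str:
    """Lower-case a phrase and collapse runs of whitespace."""
    return " ".join(phrase.lower().split())


class ActivationGrammar:
    """Maps spoken phrases to the id of an HTML link or control (hypothetical)."""

    def __init__(self) -> None:
        self._targets: Dict[str, str] = {}

    def add(self, target_id: str, *phrases: str) -> None:
        """Register every phrase as a way of activating target_id."""
        for phrase in phrases:
            self._targets[normalize(phrase)] = target_id

    def match(self, utterance: str) -> Optional[str]:
        """Return the id activated by the utterance, or None if nothing matches."""
        return self._targets.get(normalize(utterance))
```
</pre>

<p>Under these assumptions, an author would register "available
hotels", "show me the available hotels", and "what hotels are
available" against the id of the link, and "submit reservation"
and "submit booking" against the id of the submit button.</p>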

<h3 id="s3.3">3.3 Speech prompts for voice-enabled HTML
links and controls</h3>

<p>Developers would like support for explicitly adding speech
prompts to voice-enabled HTML hyperlinks and controls.  The
prompts can provide more information than the visual labels
attached to the HTML hyperlinks and input fields.</p>

<h4 id="s3.3.1">Use Case:  Hotel booking application:
enter Hotel name</h4>

<p>The user is prompted to enter a hotel name with the
following TTS: "please enter a hotel name.  You can get a
list of available hotels by saying 'show me available
hotels.'"</p>

<h3 id="s3.4">3.4 Speech-enabled widgets</h3>

<p>Developers would like to see speech-enabled UI widgets
that contain a simple dialog flow (e.g., widgets that contain
confirmation or disambiguation steps). Such widgets would allow
an author to configure the dialog properties (prompts, grammars,
confirmation mode, confidence thresholds, etc.) of an HTML
control or hyperlink.</p>

<h4 id="s3.4.1">Use Case:  Hotel booking application:
confirm hotel</h4>

<p>The user says the name of one of the available hotels.
The application repeats the name of the hotel back to the
user and asks if it is correct.  If the user says 'yes' then
the application fills in the HTML field with the user's input.</p>

<h3 id="s3.5">3.5 Use speech to activate links and change
focus</h3>

<p>It should be easy to use speech to do more than fill in
HTML form controls. For example, there should be declarative
support for activating an HTML link or changing focus within
an HTML page.</p>

<h4 id="s3.5.1">Use Case:  Speech enabled bookmark page</h4>

<p>A page that displays the user's bookmarks is speech-enabled
such that each bookmark has an associated grammar for moving
the browser to the bookmarked page.</p>

<h3 id="s3.6">3.6 Back functionality</h3>

<p>Developers would like to see support for consistent and
intuitive "back" handling across modalities. Multimodal
browser "back" functionality should be built in and
should not require custom code.</p>

<h4 id="s3.6.1">Use Case:  Browser "back" button</h4>

<p>The user can either press the browser back button or
say "browser go back" to return to the previous multimodal
page.  All spoken commands which control the browser are
preceded by "browser" so there is no collision with an
application grammar.</p>
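
<p>The prefix convention in this use case can be illustrated
with a short Python sketch.  It is not taken from any browser
implementation: the route_utterance function and the
BROWSER_PREFIX constant are hypothetical names, and the
recognizer is assumed to have already produced a text string.</p>

<pre>
```python
from typing import Tuple

BROWSER_PREFIX = "browser"  # prefix word taken from the use case above


def route_utterance(utterance: str) -> Tuple[str, str]:
    """Split an utterance into (target, command).

    Utterances that start with the prefix word and carry a command go to
    the browser; everything else goes to the application grammar.
    """
    words = utterance.strip().lower().split()
    if len(words) > 1 and words[0] == BROWSER_PREFIX:
        return ("browser", " ".join(words[1:]))
    return ("application", " ".join(words))
```
</pre>

<p>With this rule, "browser go back" is routed to the browser as
the command "go back", while an application phrase "go back"
spoken without the prefix stays with the application.</p>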

<h2 id="s4">4 What developers would like to see continue
and improve</h2>

<h3 id="s4.1">4.1 Support for both off-line and on-line
multimodal interaction</h3>

<p>Multimodal interaction should be supported both for
applications that are on-line, that is, are connected to
the network, as well as for off-line applications.  If the
multimodal application goes from an on-line to an off-line
state, multimodal interaction should still be supported by
the modality components that run locally on the device.</p>

<h4 id="s4.1.1">Use Case:  Access of medical information
while walking down a hallway</h4>

<p>A doctor carrying a wireless tablet accesses patient
medical information while walking down a hallway. Loss of
wireless connectivity does not prevent the multimodal
application from interacting with the doctor or presenting
information it has stored on the doctor's tablet.</p>

<h4 id="s4.1.2">Use Case:  Multimodal application in hospital
operating room</h4>

<p>An off-line multimodal application in an operating room
delivers timely instructions to the doctor.</p>

<h3 id="s4.2">4.2 Support for events distributed over
the network</h3>

<p>Because a modality may be distributed on a remote server,
there must be support for distributed events between a
modality and the interaction manager.</p>

<h4 id="s4.2.1">Use Case:  Driving directions</h4>

<p>A user accesses a multimodal driving directions application
using a cell phone.  The application tells the user to turn
right at the next intersection.  An arrow pointing right pops
up over a map because the application has received an event
from the server telling it to display the arrow.</p>

<h3 id="s4.3">4.3 Support for implicit events</h3>

<p>Implicit event support includes both implicit event
generation and implicit event handling.  At different
stages in the operation of the modality component, there
will be either event generation or event handling by the
component itself.  For example, the VoiceXML modality
component could implicitly generate a focus event when
the FIA selects a new form input item.</p>

<h4 id="s4.3.1">Use Case:  Hotel booking application:
name, address, phone number</h4>

<p>A hotel booking application has a form with separate
HTML input fields for entering name, street address, city,
state and phone number.  When the user selects one of the
fields, the user hears a prompt for entering the correct
information into that field.  The visual input focus is
coordinated with the speech input focus.</p>

<h3 id="s4.4">4.4 VoiceXML tag and feature support</h3>

<p>VoiceXML support should include, for example, the
&lt;object&gt; and &lt;mark&gt; tags and the "record
while recognition is in progress" feature.</p>

<h4 id="s4.4.1">Use Case:  Windows program for calculating
stock purchase totals</h4>

<p>The &lt;object&gt; element can be used to load a
reusable platform-specific plug-in. For example, the
application would use the &lt;object&gt; element to load a
Windows program that calculates stock purchase totals.</p>

<h4 id="s4.4.2">Use Case:  Read part of an e-mail message</h4>

<p>The &lt;mark&gt; tag can be used to mark how much of
the text was actually read before the user left the page.
When the user returns to the page the rest of the text can
be read beginning where the user left off.</p>

<h4 id="s4.4.3">Use Case:  Unrecognized user input</h4>

<p>The recording of an unrecognized user input can be
logged by the speech recognizer.</p>

<h3 id="s4.5">4.5 Support for both directed and
user-initiated dialogs</h3>

<p>Speech access to the visual application must be possible
in an arbitrary, user-chosen order as well as in a
procedural, directed order.  A dialog mechanism used in
conjunction with a visual form should therefore
support user-initiated dialogs.  For example, the
user should be able to jump to arbitrary points in the
dialog by changing the visual focus (e.g., by clicking
on a text box).</p>

<h4 id="s4.5.1">Use Case:  Form filling for air travel
reservation</h4>

<p>The air travel reservation application takes the user
step by step through making a reservation, beginning with
the origin and destination of the flight.  After the user
has been given a selection of flights, the user clicks on
the visual departure date field to change the departure date.</p>

<h4 id="s4.5.2">Use Case:  Application with two HTML forms</h4>

<p>The user is taken step-by-step through filling out a
set of HTML fields in a form.  Before all the fields have
been filled, the user clicks on a field belonging to the
other form.</p>

<h3 id="s4.6">4.6 Mixed-initiative interaction</h3>

<p>Dialog mechanisms that combine speech and text input
must support mixed-initiative interaction.</p>

<h4 id="s4.6.1">Use Case:  Flight reservation application</h4>

<p>A flight reservation application has separate HTML
input fields for entering destination airport, date of
travel and seating class.  With a single utterance "I'd
like to go to San Francisco on April 20th, business class"
the user fills in all the fields at one time.</p>

<h3 id="s4.7">4.7 Access to speech confidence scores
and n-best list by the application</h3>

<p>Confidence scores and n-best lists are useful, for
example, to let the user pick from a set of results
supplied by an input recognizer.</p>

<h4 id="s4.7.1">Use Case:  Select a football player</h4>

<p>A user says the name of a favorite football player.
Several players match the user's input with the same
low confidence score.  Instead of asking the user to repeat
the name, the application displays a visual list of the player
names that were matched.  The user selects a name from the
list.</p>
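
<p>The decision made in this use case can be sketched in Python
as follows.  The sketch is illustrative only: the resolve
function, the result labels "accept", "choose", and "reprompt",
and the default acceptance threshold of 0.8 are assumptions
made for the example, not behavior defined by any recognizer
or specification.</p>

<pre>
```python
from typing import List, Tuple

# (recognized text, confidence assumed to lie between 0.0 and 1.0)
Hypothesis = Tuple[str, float]


def resolve(nbest: List[Hypothesis], accept_at: float = 0.8) -> tuple:
    """Decide what the application should do with an n-best list.

    Returns ("accept", text) when the best result is confident enough,
    ("choose", [texts]) when the user should pick from a visual list,
    and ("reprompt", None) when the recognizer returned nothing.
    """
    if not nbest:
        return ("reprompt", None)
    # Best hypothesis first; ties keep the recognizer's original order.
    ranked = sorted(nbest, key=lambda h: h[1], reverse=True)
    best_text, best_score = ranked[0]
    if best_score >= accept_at:
        return ("accept", best_text)
    # Low confidence: show the candidates instead of asking the user to repeat.
    return ("choose", [text for text, _ in ranked])
```
</pre>

<p>The "choose" branch is only possible when the application can
read both the n-best list and its confidence scores, which is
the access this section asks for.</p>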

<h3 id="s4.8">4.8 Access to device details</h3>

<p>The developer would like access to device information
such as the cell phone number, phone model,
and display screen size.  Typically in any mobile
application the content is very specific to the device and
at times personalized for the user. Access to device-specific
details such as the device model (e.g., Nokia 6680) helps
the application reduce the grammar size and render
device-specific content. Access to user information such
as the phone number allows the application to personalize
the content for the user.</p>

<h4 id="s4.8.1">Use Case:  Mobile appointment application</h4>

<p>When user 'George' accesses the appointment application
the application says "Welcome 'George'" and presents a list
of appointments for the day.  The user can select any of his
appointments by saying an appointment label shown on his phone.
Each label is short enough to fit entirely on George's display.</p>

<h3 id="s4.9">4.9 Choice of ASR</h3>

<p>The developer would like to have more control over
automatic speech recognition (ASR).  An example is the
capability of a multimodal application to choose between
local ASR and network-based ASR depending on the location
of the grammar.  The developer should be able to pick the
ASR according to the application logic.</p>

<h4 id="s4.9.1">Use Case:  Music search mobile application</h4>

<p>A music search mobile application uses
network-based ASR to search for a particular
artist or album, such as 'Green Day' or '50 Cent'.  For
network-based recognition the grammar changes
dynamically and is large. The same music application
may use local ASR for navigating through the
application with commands such as 'Home' and 'Next Page'.</p>
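
<p>One possible selection rule, choosing the recognizer from
the location of the grammar, is sketched below in Python.  The
rule itself is an assumption made for illustration: grammars
referenced by an http or https URI are treated as server-side,
and all other grammars as local to the device.  The choose_asr
function and the two constants are hypothetical names.</p>

<pre>
```python
from urllib.parse import urlparse

LOCAL_ASR = "local"
NETWORK_ASR = "network"


def choose_asr(grammar_uri: str) -> str:
    """Pick a recognizer from where the grammar lives (illustrative rule)."""
    scheme = urlparse(grammar_uri).scheme.lower()
    # Assumption: grammars fetched over HTTP(S) are large or dynamic and
    # stay on the server; anything else is bundled with the application.
    if scheme in ("http", "https"):
        return NETWORK_ASR
    return LOCAL_ASR
```
</pre>

<p>In the music search example, a dynamically generated artist
grammar served from the network would select network-based ASR,
while a small navigation grammar packaged with the application
would select local ASR.  An application could equally base the
choice on other logic, such as grammar size or connectivity.</p>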

<h3 id="s4.10">4.10 Controlling N-Best choice of ASR</h3>

<p>The application should be able to control the number of
results it wants from ASR, based on either a number N (for
example, return the top 5 matches) or a confidence score
threshold (for example, return only results that score
above 0.8).  The developer should be able to author this
N-Best list control.</p>

<h4 id="s4.10.1">Use Case:  Select a football player mobile
application</h4>

<p>As in the previous football player selection use case,
the list of players is displayed visually for the user to
select from.  The ASR may return more than 10 results as
part of its N-Best response mechanism. However, depending
on the screen size, the application may choose to display only the
top 5 entries. The application therefore requests only
the top 5 players in the N-Best result instead of receiving
10 results and then ignoring the last 5.</p>
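
<p>The two kinds of control described in this section, a
maximum count and a confidence threshold, can be sketched in
Python as follows.  This is an illustration under stated
assumptions: the Hypothesis type, the limit_nbest function, and
its max_results and min_confidence parameters are hypothetical
names, and a real platform would ideally apply these limits
inside the recognizer rather than filtering afterwards.</p>

<pre>
```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hypothesis:
    """One entry of a hypothetical N-Best list."""
    text: str
    confidence: float  # assumed to lie between 0.0 and 1.0


def limit_nbest(results: List[Hypothesis],
                max_results: Optional[int] = None,
                min_confidence: Optional[float] = None) -> List[Hypothesis]:
    """Return the best hypotheses, limited by count, by score, or by both."""
    # Best hypothesis first.
    ranked = sorted(results, key=lambda h: h.confidence, reverse=True)
    if min_confidence is not None:
        # Keep only hypotheses that score above the threshold.
        ranked = [h for h in ranked if h.confidence > min_confidence]
    if max_results is not None:
        # Keep at most the requested number of hypotheses.
        ranked = ranked[:max_results]
    return ranked
```
</pre>

<p>For the use case above, the application would ask for
max_results=5 on a small screen; an application that cares
about quality rather than count would pass min_confidence=0.8
instead.</p>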

<hr />
</body>
</html>