<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="HTML Tidy for Linux/x86 (vers 1st April 2003), see www.w3.org"
      name="generator" />
<title>W3C Multimodal Interaction Framework</title>
<style type="text/css">
 /*<![CDATA[*/
  body {
    margin-left: 8%;
    margin-right: 5%;
    background-color: white;
    font-family: Trebuchet, Arial, sans-serif
  }
  h1 { margin-left: -4%; color: rgb(0,92,160) }
  h2 { margin-left: -4%; color: rgb(0,92,160)}
  h3 { margin-left: 0% }
  p.fig {text-align: center}
  .c1 { display: none }
  .old { text-decoration: line-through }
  .new { font-style: italic; color: green }
  .note { font-style: italic; color: red }
  p.example { margin-left: 10% }
  /*]]>*/
</style>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-NOTE" />
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img height="48" alt="W3C"
src="http://www.w3.org/Icons/w3c_home" width="72" /></a></p>

<h1 class="notoc" id="name">W3C Multimodal Interaction
Framework</h1>

<h2>W3C NOTE 06 May 2003</h2>

<dl>
<dt>This version:</dt>

<dd>
<a href="http://www.w3.org/TR/2003/NOTE-mmi-framework-20030506/">http://www.w3.org/TR/2003/NOTE-mmi-framework-20030506/</a></dd>

<dt>Latest version:</dt>

<dd>
<a href="http://www.w3.org/TR/mmi-framework/">http://www.w3.org/TR/mmi-framework/</a></dd>

<dt>Previous version:</dt>

<dd>
<a href="http://www.w3.org/TR/2002/NOTE-mmi-framework-20021202/">http://www.w3.org/TR/2002/NOTE-mmi-framework-20021202/</a></dd>

<dt>Editors:</dt>

<dd>James A. Larson, Intel</dd>

<dd>T.V. Raman, IBM</dd>

<dd>Dave Raggett, W3C &amp; Canon</dd>

<dt>Contributors:</dt>

<dd>Michael Bodell, Tellme Networks</dd>

<dd>Michael Johnston, AT&amp;T</dd>

<dd>Sunil Kumar, V-Enable Inc.</dd>

<dd>Stephen Potter, Microsoft</dd>

<dd>Keith Waters, France Telecom</dd>
</dl>

<p class="copyright"><a href="http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> &#xa9; 2003 <a href="http://www.w3.org/"><acronym
title="World Wide Web Consortium">W3C</acronym></a><sup>&#xae;</sup> (<a
href="http://www.lcs.mit.edu/"><acronym
title="Massachusetts Institute of Technology">MIT</acronym></a>, <a
href="http://www.ercim.org/"><acronym
title="European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. W3C
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href="http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>,
<a href="http://www.w3.org/Consortium/Legal/copyright-documents">document use</a>
and <a href="http://www.w3.org/Consortium/Legal/copyright-software">software
licensing</a> rules apply.</p>

<hr title="Separator for header" />

</div>

<h2 class="notoc" id="Abstract">Abstract</h2>

<p>This document introduces the W3C Multimodal Interaction
Framework, and identifies the major components for multimodal
systems. Each component represents a set of related functions. The
framework identifies the markup languages used to describe
information required by components and for data flowing among
components. The W3C Multimodal Interaction Framework describes
input and output modes widely used today and can be extended to
include additional modes of user input and output as they become
available.</p>

<h2 id="Status">Status of this Document</h2>

<p><em>This section describes the status of this document at the
time of its publication. Other documents may supersede this
document. The latest status of this document series is maintained
at the <abbr title="World Wide Web Consortium">W3C</abbr>.</em></p>

<p>W3C's <a href="http://www.w3.org/2002/mmi/">Multimodal
Interaction Activity</a> is developing specifications for extending
the Web to support multiple modes of interaction. This document
introduces a functional framework for multimodal interaction and is
intended to provide a context for the specifications that comprise
the W3C Multimodal Interaction Framework.</p>

<p>This document has been produced as part of the
<a href="http://www.w3.org/2002/mmi/">W3C Multimodal Interaction
Activity</a>,<a class="c1" href="http://www.w3.org/2002/mmi/Activity.html"></a> following the procedures set out for the
<a href="http://www.w3.org/Consortium/Process/">W3C Process</a> .
The authors of this document are members of the
<a href="http://www.w3.org/2002/mmi/Group/">Multimodal Interaction
Working Group</a>
(<a href="http://cgi.w3.org/MemberAccess/AccessRequest">W3C Members
only</a>). This is a Royalty Free Working Group, as described in
W3C's <a href="/TR/2002/NOTE-patent-practice-20020124">Current
Patent Practice</a> NOTE. Working Group participants are required
to provide <a href="http://www.w3.org/2002/01/mmi-ipr.html">patent
disclosures</a>.</p>

<p>Please send comments about this document to the public mailing
list:
<a href="mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>
(<a href="http://lists.w3.org/Archives/Public/www-multimodal/">public
archives</a> ). To subscribe, send an email to
<a href="mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.
org</a> with the word <em>subscribe</em> in the subject line
(include the word <em>unsubscribe</em> if you want to
unsubscribe).</p>

<p>A list of current W3C Recommendations and other technical
documents including Working Drafts and Notes can be found at
<a href="http://www.w3.org/TR/">http://www.w3.org/TR/</a> .</p>

<h2 id="intro">1. Introduction</h2>

<p>The purpose of the W3C multimodal interaction framework is to
identify and relate markup languages for multimodal interaction
systems. The framework identifies the major components for every
multimodal system. Each component represents a set of related
functions. The framework identifies the markup languages used to
describe information required by components and for data flowing
among components.</p>

<p>The W3C Multimodal Interaction Framework describes input and
output modes widely used today and can be extended to include
additional modes of user input and output as they become
available.</p>

<p><em>The multimodal interaction framework is not an
architecture</em>. The multimodal interaction framework is a level
of abstraction above an architecture. An architecture indicates how
components are allocated to hardware devices and the communication
system enabling the hardware devices to communicate with each
other. The W3C Multimodal Interaction Framework does not describe
either how components are allocated to hardware devices or how the
communication system enables the hardware devices to communicate.
See Section 10 for descriptions of several example architectures
consistent with the W3C multimodal interaction framework.</p>

<h2 id="s2">2. Basic Components of the W3C Multimodal Interaction
Framework</h2>

<p>The Multimodal Interaction Framework is intended as a basis for
developing multimodal applications in terms of markup, scripting,
styling and other resources. The Framework will build upon a range
of existing W3C markup languages together with the
<a href="http://www.w3.org/DOM/">W3C Document Object Model</a>
(DOM). &nbsp;DOM defines interfaces whereby programs and scripts
can dynamically access and update the content, structure and style
of documents.</p>

<p>Figure 1 illustrates the basic components of the W3C multimodal
interaction framework.</p>

<p class="fig"><img src="fig1.png" width="493" height="300"
alt="I/O processors, dialog manager and application back end" /></p>

<p><em>Human user &mdash; A user</em> who enters input into the
system and observes and hears information presented by the system.
In this document, we will use the term "user" to refer to a human
user. However, an automated user may replace the human user for
testing purposes. For example, an automated "testing harness" may
replace human users for regression testing to verify that changes
to one component do not affect the user interface negatively.</p>

<p><em>Input</em> &mdash; An interactive multimodal implementation
will use multiple input modes such as audio, speech, handwriting,
keyboarding, and other input modes. The various modes of input
will be described in <a href="#s3">Section 3</a>.</p>

<p><em>Output</em> &mdash; An interactive multimodal implementation
will use one or more modes of output, such as speech, text,
graphics, audio files, and animation. The various modes of output
will be described in <a href="#s4">Section 4</a>.</p>

<p><em>Interaction manager</em> &mdash; The interaction manager is
the logical component that coordinates data and manages execution
flow from various input and output modality component interface
objects. The input and output modality components are as described
in <a href="#s5">Section 5</a>.</p>

<p>The interaction manager maintains the interaction state and
context of the application and responds to inputs from component
interface objects and changes in the system and environment. The
interaction manager then manages these changes and coordinates
input and output across component interface objects. The
Interaction manager is discussed in <a href="#s6">section
6</a>.</p>

<p>In some architectures the interaction manager may be implemented
as one single component. In other architectures the interaction
manager may be treated as a composition of lesser components.
Composition may be distributed across process and device
boundaries.</p>

<p><i>Session component</i> &mdash; The Session component
(discussed in <a href="#s7">Section 7</a>) provides an interface to
the interaction manager to support state management, and temporary
and persistent sessions for multimodal applications. This will be
useful in the following scenarios but is not limited to these:</p>

<ul>
<li>A user is interacting with an application which runs on
multiple devices.</li>

<li>The application is session-based, e.g. a multiplayer game,
multimodal chat, or meeting room.</li>

<li>The application provides multiple modes of providing input and
receiving output.</li>

<li>The application runs on a single device and provides
multimodality by switching between modes.</li>
</ul>

<p><em>System and Environment component</em> &mdash; This component
enables the interaction manager to find out about and respond to
changes in device capabilities, user preferences and environmental
conditions. For example, which of the available modes does the user
wish to use &mdash; has the user muted audio input? The
interaction manager may be interested in the width and height of
the display, whether it supports color, and other capability and
configuration information. For more information see
<a href="#s8">Section 8</a>.</p>

<h2 id="s3">3. Input Components</h2>

<p>Figure 2 illustrates the various types of components within the
input component.</p>

<p class="fig"><img src="fig2.png" width="539" height="449"
alt="recognition, interpretation and integation of inputs" /></p>

<ul>
<li>
<p><em>Recognition component</em> &mdash; Captures natural input
from the user and translates the input into a form useful for later
processing. The recognition component may use a grammar described
by a grammar markup language. Example recognition components
include:</p>

<ul>
<li><em>Speech</em> &mdash; Converts speech into text. The
automatic speech recognition component uses an acoustic model, a
language model, and a grammar specified using the W3C Speech
Recognition Grammar Specification or the Stochastic Language Model
(N-Gram) Specification to convert human speech into words specified
by the grammar.</li>

<li><em>Handwriting</em> &mdash; Converts handwritten symbols and
messages into text. The handwriting recognition component may use a
handwritten gesture model, a language model, and a grammar to
convert handwriting into words specified in a grammar.</li>

<li><em>Keyboarding</em> &mdash; Converts key presses into textual
characters.</li>

<li><em>Pointing device</em> &mdash; Converts button presses into
x-y positions on a two-dimensional surface.</li>
</ul>

<p>Other input recognition components may include vision, sign
language, DTMF, biometrics, tactile input, speaker verification,
handwritten identification, and other input modes yet to be
invented.</p>
</li>

<li>
<p><em>Interpretation component</em> &mdash; May further process
the results of recognition components. Each interpretation
component identifies the "meaning" or "semantics" intended by the
user. For example, many words that users utter such as "yes,"
"affirmative," "sure," and "I agree," could be represented as
"yes."</p>
</li>

<li>
<p><em>Integration component</em> &mdash; Combines the output from
several interpretation components.</p>
</li>
</ul>

<p>Some or all of the functionality of the integration component
could be implemented as part of the recognition, interpretation, or
interaction components. For example, audio-visual speech
recognition may integrate lip movement recognition and speech
recognition as part of a lip reading component, as part of the
speech recognition component, or within a separate integration
component. As another example, the two input modes of
speaking and pointing are used in</p>

<p class="example">"put that," (point to an object), "there,"
(point to a location)</p>

<p>and may be integrated within a separate integration component or
may be integrated within the interaction manager component.</p>

<p>Information generated by other system components may be
integrated with user input by the integration component. For
example, a GPS system generates the current location of the user,
or a banking application generates an overdraft notice that prevents
the user from making additional purchases.</p>

<p>The output of each interpretation component may be expressed
using EMMA, a language for representing the semantics or meaning of
data. &nbsp;Either the user or the system may create information
that may be routed directly to the interaction manager without
being encoded in EMMA. For example, audio is recorded for later
replay or a sequence of keystrokes is captured during the creation
of a macro.</p>
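
<p>As a purely illustrative sketch, the fragment below shows the
kind of annotation an interpretation component might emit for the
utterance "sure" interpreted as "yes". EMMA was still being defined
when this Note was written, so the element names, attribute names
and namespace here follow the general shape of the EMMA drafts and
are not normative; the application payload is invented.</p>

<pre>
&lt;!-- Illustrative only; not a normative EMMA example --&gt;
&lt;emma:emma xmlns:emma="http://www.w3.org/2003/04/emma"&gt;
  &lt;emma:interpretation id="int1" emma:mode="voice"&gt;
    &lt;!-- application-specific payload: the interpreted meaning --&gt;
    &lt;answer&gt;yes&lt;/answer&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre>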

<h2 id="s4">4. Output Components</h2>

<p>Figure 3 illustrates the components within the output
component.</p>

<p class="fig"><img src="fig3.png" width="534" height="391"
alt="Use of EMMA to drive output modes" /></p>

<ul>
<li>
<p><em>Generation component</em> &mdash; The generation component
determines which output mode or modes will be used for presenting
information from the interaction manager to the user. The
generation component may select a single output mode or it may
select complementary or supplementary modes. The "internal
representation" language used to describe the output from the
generation component is under discussion by the working group.</p>
</li>
</ul>

<p>Information from the interaction manager may be routed directly
to the appropriate rendering device without being encoded in an
internal representation. For example, recorded audio is sent
directly to the audio system.</p>

<ul>
<li>
<p><em>Styling component</em> &mdash; This component adds
information about how the information is "laid out." For example,
the styling component for a display specifies how graphical objects
are positioned on a canvas, while the styling component for audio
may insert pauses and voice inflections into text which will be
rendered by a speech synthesizer. Cascading Style Sheets (CSS)
could be used to modify voice output.</p>
</li>

<li>
<p><em>Rendering component</em> &mdash; The rendering component
converts the information from the styling component into a format
that is easily understood by the user. For example, a graphics
rendering component displays a vector of points as a curved line,
and a speech synthesis system converts text into
synthesized voice.</p>
</li>
</ul>

<p>Each of the output modes has both a styling and rendering
component.</p>

<p>The voice styling component constructs text strings containing
Speech Synthesis Markup Language tags describing how the words
should be pronounced. This is converted to voice by the voice
rendering component. The voice styling component may also select
prerecorded audio files for replay by the voice rendering
component.</p>
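
<p>For illustration, the fragment below sketches the kind of Speech
Synthesis Markup Language output a voice styling component might
hand to the voice rendering component; the text and prosody values
are invented.</p>

<pre>
&lt;!-- Minimal illustrative SSML fragment --&gt;
&lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US"&gt;
  The name of this place is &lt;break time="300ms"/&gt;
  &lt;prosody rate="slow"&gt;Lake Wobegon, Minnesota&lt;/prosody&gt;.
&lt;/speak&gt;
</pre>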

<p>The graphics styling component creates XHTML, XHTML Basic, or
<a href="http://www.w3.org/TR/SVG/">SVG</a> markup tags describing
how the graphics should be rendered. The graphics rendering
component converts the output from the graphics styling component
into graphics displayed to the user.</p>
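
<p>Similarly, a graphics styling component might produce an SVG
fragment along the following lines; the map image, coordinates and
label are invented for illustration.</p>

<pre>
&lt;!-- Illustrative SVG output from a graphics styling component --&gt;
&lt;svg xmlns="http://www.w3.org/2000/svg"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     width="240" height="160"&gt;
  &lt;image x="0" y="0" width="240" height="160" xlink:href="map.png"/&gt;
  &lt;text x="60" y="80" fill="navy"&gt;Lake Wobegon, Minnesota&lt;/text&gt;
&lt;/svg&gt;
</pre>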

<p>Other pairs of styling and rendering components are possible for
other output modes.
<a href="http://www.w3.org/AudioVideo/">SMIL</a> may be used for
coordinated multimedia output.</p>
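
<p>As a sketch of such coordination, a SMIL document could play a
spoken prompt and display a map at the same time; the file names
below are placeholders.</p>

<pre>
&lt;!-- Illustrative SMIL fragment: audio and graphics rendered in parallel --&gt;
&lt;smil xmlns="http://www.w3.org/2001/SMIL20/Language"&gt;
  &lt;body&gt;
    &lt;par&gt;
      &lt;audio src="place-name.wav"/&gt;
      &lt;img src="map.png" dur="5s"/&gt;
    &lt;/par&gt;
  &lt;/body&gt;
&lt;/smil&gt;
</pre>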

<h2 id="s5">5. Specification of input and output components</h2>

<p>This section describes how the input and output components of
sections 3 and 4 are specified. In brief, input and output
components of the user interface will be specified as DOM objects
that expose interfaces pertaining to that object's functionality.
&nbsp;This enables the modality objects to be accessed and
manipulated in the interaction management environments described in
section 6.</p>

<p>(The use of the term "object" in this section is intended in the
sense of "object" as used in the Document Object Model, and is not
intended to imply a particular class or object hierarchy.)</p>

<h3 id="s5.1">5.1 Encapsulated interfaces based on DOM</h3>

<p>User interface components make their functionality available to
interaction managers through a set of interfaces, and can be
considered as receiving values from and returning values to the
host environment. Here, values can be simple or complex types, and
components can specify the location for binding the received data,
perhaps using XPath, which is W3C's language for addressing parts
of an XML document, and was originally designed to be used by both
XSLT and XPointer. The set of interfaces will be built on DOM, and
thereby provide an object model for realizing the functionality of
a given modality.</p>

<p>The functionality of a user interface component can therefore
usefully be encapsulated in a programming-language-independent
manner into an <span>object</span> exposing the following kinds of
features:</p>

<ul>
<li>a set of <b>properties</b> (e.g. presentation parameters or
input constraints);</li>

<li>a set of <b>methods</b> (e.g. begin playback or recognition);
and</li>

<li>a set of <b>events</b> raised by the component (e.g. mouse
clicks, speech events).</li>
</ul>
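
<p>For illustration only (this Note does not define such an
element), a speech input object hosted in an XHTML page might
expose these three kinds of features roughly as sketched below; the
element, attribute and handler names are invented.</p>

<pre>
&lt;!-- Hypothetical markup; names are invented purely for illustration.
     property: the grammar attribute constrains what can be recognized
     events:   onresult / onerror handlers receive recognition events --&gt;
&lt;listen id="listenCity" grammar="city-names.grxml"
        onresult="fillCityField(event)" onerror="reportNoMatch(event)"/&gt;

&lt;!-- method: start() begins recognition when the user enters the field --&gt;
&lt;input type="text" id="city"
       onfocus="document.getElementById('listenCity').start()"/&gt;
</pre>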

<p>The DOM defines a platform-neutral and
programming-language-neutral interface to documents, their
structure and their content. The <span>user interface
objects</span> extend this model by adding modality-specific
interfaces. In this way, <span>user interface objects</span> can
define abstract interfaces which are usable across different host
environments.</p>

<p>In multimodal applications, multiple user interface components
are controlled and coordinated individually by the interaction
manager.</p>

<p>User interface <span>objects</span> should follow certain
guidelines to integrate into the multimodal framework:</p>

<ul>
<li>adhere to the principles of encapsulation, that is, the
features of a given modality should relate only to the modality in
question;</li>

<li>adopt common or recommended interfaces where possible;</li>

<li><span>In order to ensure that the framework is sufficiently
general to accommodate both local and distributed
architectures,</span> avoid blocking calls and threading
issues.&nbsp;</li>

<li><span>Consider what kinds of message exchange patterns are
needed, for instance, publish/subscribe, broadcast, and
specifically addressed messages. This is also an important
consideration for ensuring that the framework is neutral with
respect to local and distributed architectures.</span></li>
</ul>

<p>In general, the formalization of features into properties,
methods and events should not be taken to imply that the
manipulation of the interface can take place only in local DOM
architectures. It is the intention of this design that modality
interfaces should remain agnostic to component architectures where
possible. So the object feature definitions should be considered as
abstract indications of functionality, the uses of which will
probably differ according to architectural considerations (for
example property setting may take different forms, and
implementation mechanisms for event dispatch and handling are not
addressed here.)</p>

<h3 id="s5.2">5.2 Interface formalization</h3>

<p>Each <span>user interface object</span> will specify a set of
interfaces in terms of properties, events and methods, using a
formal interface definition language. Bindings into XML, ECMAScript
and other programming languages will also be defined.</p>

<p>In addition to formal definition of markup and DOM interfaces, a
description of the execution model of the <span>user interface
object</span> will be defined, that is, the behaviour of the
<span>object</span> when used. Further, a <span>user interface
object</span> should also describe how <span>it</span> is
controlled in different interaction management environments, for
example, those which support:</p>

<ul>
<li>limited environments without programmatic capabilities;</li>

<li>XHTML and its flavours, including scripting, DOM eventing,
XForms, etc.</li>

<li>SMIL</li>

<li>HTML</li>

<li>SVG</li>

<li>etc.</li>
</ul>

<p>As work proceeds on the definition of individual modality
interfaces, sufficient commonality of features may be found such
that it is desirable to standardize in some way those features
across different modalities. As such, the MMI group will
investigate the possibilities for establishing a set of common
interfaces that may be shared among all relevant modalities.</p>

<h2 id="s6">6. Specification of interaction management
component</h2>

<h3 id="s6.1">6.1 Host Environments for interaction management</h3>

<p>The interaction manager is a logical component. The interaction
manager is contained in the host environment that hosts interface
objects. Interface objects influence one another by interacting
with the Host Environment. A host environment provides data
management and flow control to its hosted interface objects. Some
languages that may be candidates as Host Environment languages
include <a href="http://www.w3.org/Graphics/SVG/">SVG</a>,
<a href="http://www.w3.org/MarkUp/">XHTML</a> (possibly
<a href="http://www.w3.org/MarkUp/">XHTML</a>+
<a href="http://www.w3.org/MarkUp/Forms/">XForms</a>), and
<a href="http://www.w3.org/AudioVideo/">SMIL</a>.</p>

<p>A Host Environment's hosted interface objects may range from the
simple to the complex. Authors will be able to specify the
interface object components through a mixture of markup, scripting,
style sheets, or any other resources supported by their Host
Environment's functionality. The Host Environment design makes
possible architectures where the interface objects may each have
their own thread of execution independent from context of the Host
Environment. The design also supports each component communicating
asynchronously with the Host Environment (however familiarity with
synchronization primitives such as mutexes will not be required to
successfully author multimodal documents).</p>

<p>In some architectures, it is possible to have a hierarchical
composition of Host Environments similar in spirit to Russian
nesting dolls. Different aspects of interaction management may be
handled at different levels of the hierarchy. For example,
"barge-in", where speech output is cut off on the basis of user
input, is an interaction management mechanism that may be handled by
one lower level Environment that just hosts the basic speech input
and speech output objects while a different higher level Host
Environment coordinates the multimodal application. Hierarchical
interaction management also enables the delegation of complex input
tasks to lower levels of the hierarchy. As an example, a date
dialog might encapsulate the necessary interaction management logic
needed to produce appropriate tapered prompts, error handling, and
other dialog constructs to eventually collect a valid date. This
form of nesting enables the creation of hierarchical interaction
management that reflects the task hierarchy within the overall
application.</p>

<h2 id="s7">7. Session Component</h2>

<p>An important goal of the W3C Multimodal Interaction Framework is
to provide a simplified approach for authoring multimodal
applications whether on a single system/user or distributed across
multiple systems/users. The framework is architecture neutral, and
abstractly relies on passing messages between the various framework
components. The session component provides a means to simplify the
author's view of how resources are identified in terms of source
and destination of such messages.&nbsp;The session component is
particularly important for distributed applications involving more
than one device and/or user. It hides the details of the resource
naming schemes and protocols used and provides a high-level
interface for requesting and releasing resources taking part in the
session.&nbsp;</p>

<h3 id="s7.1">7.1 Functions of Session Component</h3>

<h4 id="s7.1.1">7.1.1 Session as basis of state replication and
synchronization</h4>

<p>The session component can be used for replicating state across
devices, or across processes within the same device. Consider a
graphical interface running on a handheld device coupled to a voice
interface running in the network. The user can choose to navigate
or enter data using the device keypad or using speech. When filling
out a form, this gives two ways to update a field's value. The
session provides a scope for the replication mechanism and provides
a way to keep multiple modes in sync.</p>

<h4 id="s7.1.2">7.1.2 Temporary/Persistent Sessions</h4>

<p>For certain applications the session is short lived. In these
cases the same session may last for a single page or for several
pages as the user navigates through the application, for example
when visiting a web site. This makes it practical to retain state
information for the duration of the application. For applications
that involve persistent sessions, such as meeting rooms and
multiplayer games, there is a need for session management, and a
means to locate, join and leave such sessions.</p>

<h4 id="s7.1.3">7.1.3 Simplifying Applications</h4>

<p>In a distributed environment there are several ways to identify
a resource. The session component provides a means to query
descriptions of resources, including the type of the resource, what
properties the resource has, and what interfaces it supports.</p>

<h3 id="s7.2">7.2 Use Cases</h3>

<p>The following use cases provide the basis for defining the
session component:</p>

<ul>
<li><b>Mobile Devices with sequential capability</b></li>
</ul>

<p>Devices with limited capability provide a good example of the
importance of a session component. Sequential multimodality allows
the user to experience multiple modes, but only one mode at a time,
so the user has to switch between modes. Consider an application
where the user fills out a form using voice as the input mode,
since voice is the preferred or easier mode for providing input.
After the user has provided the input, the application saves the
form fields in a session object and switches the mode to visual. In
visual mode the application retrieves the values from the session
and uses the form fields for further processing. An example of such
an application is a driving directions application, where the user
provides the source and destination using voice mode and then
switches to visual mode to see the directions.</p>

<ul>
<li><b>Form filling</b></li>
</ul>

<p>Form filling presents another use case for a session component,
especially when part of the information is entered using the keypad
attached to the device and the rest is entered using speech
processed at a speech server in the network. For example, in an
airline reservation system the user can provide the date of travel
by clicking on the appropriate dates in a calendar and provide the
source and destination using speech, which is processed in the
network. A session component helps synchronize the input provided
in either mode and provides the filled form information back to the
application.</p>

<ul>
<li><b>Meeting Rooms</b></li>
</ul>

<p>The session in this case is persistent and users join and leave
the session while the application runs. A session component allows
a user to query the session environment, which consists of the
resources and the values of the attributes of those resources. In a
meeting room application the user can query (i) who else is in the
meeting room, and (ii) information about a particular member of the
meeting room, e.g. contact information or whether the member is
online. The resources that the application wants its users to share
are stored, and proper interfaces are provided to access the
attributes of each resource.</p>

<ul>
<li><b>Multiple Device Applications</b></li>
</ul>

<p>For multimodal applications running across multiple devices, the
session component can play an important role in the synchronization
of state across the devices. For example, a user may be running an
application while sitting in a car, using a device attached to the
car. The user gets out of the car, goes to his office, and wants to
continue the application on his laptop. The session component
provides interfaces for saving the state of the whole application
on one device and reinstating that state on another device. A few
examples of such applications are video conferencing, online
shopping, and airline reservations. For example, in an airline
reservation system, the user selects the itinerary while still in
the car, then gets out of the car and buys the ticket using the
laptop in his office.</p>

<h2 id="s8">8. System and Environment Component</h2>

<p>The <a href="http://www.w3.org/TR/mmi-reqs/">W3C Multimodal
Interaction Requirements</a> call for the ability for developers to
be able to create applications that
<a href="http://www.w3.org/TR/mmi-reqs/#Deliveryandcontext">dynamically
adapt</a> to changes in device capabilities, user preferences and
environmental conditions. The multimodal interaction framework must
allow the interaction manager to determine what information is
available, as this will be system dependent. In addition, the
framework must support stand-alone as well as distributed scenarios
involving multiple devices and multiple users (see
<a href="#s7">section 7</a> for more details).</p>

<p>It is expected that the system and environment component will
make use of the work of the
<a href="http://www.w3.org/2001/di/">W3C Device Independence
activity</a>, in particular the
<a href="http://www.w3.org/Mobile/CCPP/">CC/PP</a> language, whose
aim is to standardize ways of expressing device features and
settings, and to describe how they are transmitted between
components. Profiles regarding multimodal-specific properties, such
as those listed below, are expected to be defined in accordance with
the <a href="http://www.w3.org/TR/CCPP-struct-vocab/">CC/PP
Structure and Vocabularies specification</a>.</p>
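
<p>As a rough sketch of the kind of profile the System and
Environment component might exchange, the fragment below follows
the general RDF-based shape of CC/PP. The ex: vocabulary, property
names and values are invented for illustration; the normative
structure and vocabularies are those defined by the CC/PP
specifications.</p>

<pre>
&lt;!-- Illustrative CC/PP-style profile; the ex: vocabulary is invented --&gt;
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ccpp="http://www.w3.org/2002/11/08-ccpp-schema#"
         xmlns:ex="http://example.org/multimodal-vocabulary#"&gt;
  &lt;rdf:Description rdf:about="http://example.org/profile#Device"&gt;
    &lt;ccpp:component&gt;
      &lt;rdf:Description rdf:about="http://example.org/profile#Terminal"&gt;
        &lt;ex:displayWidth&gt;176&lt;/ex:displayWidth&gt;
        &lt;ex:displayHeight&gt;208&lt;/ex:displayHeight&gt;
        &lt;ex:colorCapable&gt;yes&lt;/ex:colorCapable&gt;
        &lt;ex:audioInputMuted&gt;no&lt;/ex:audioInputMuted&gt;
      &lt;/rdf:Description&gt;
    &lt;/ccpp:component&gt;
  &lt;/rdf:Description&gt;
&lt;/rdf:RDF&gt;
</pre>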

<h3 id="s8.1">8.1 User Case Scenarios</h3>

<p>To illustrate the component's functionality, it is worth
considering the following use case scenarios:</p>

<ul>
<li>
<p><b>Mobile</b> devices typically have limited capabilities and
resources, so that applications need to be tailored to the
specifics of the device. For example, many mobile phones have small
monochrome displays, while others have rich, fast color displays.
The following are typical characteristics of mobile devices that
can be provided to the Interaction Manager through the System and
Environment component:</p>

<ul>
<li>
<p><b>Location</b> information can be provided by an increasing
number of mobile devices. Typically this information is derived
from cell quadrant (cellular radio networks), GPS satellite data or
dead reckoning based on motion sensors. The Location
Interoperability Forum &mdash; now part of the
<a href="http://www.openmobilealliance.org/lif/">Open Mobile
Alliance</a> &mdash; has been responsible for much of the work on
this to date. Location-based services (LBS) provide time-stamped
location data of varying accuracy; in some circumstances this can
be to within a few meters. This information can be provided upon
request at sub-second intervals. Multimodal applications can use
such information to orient maps and to provide geographically
relevant information.</p>
</li>

<li>
<p><b>Signal strength</b> provides information on network
connectivity as well as the quality of service that can be
provided. As signal strength decreases a Multimodal application
could adapt accordingly. This could be as simple as switching to an
alternative low-bandwidth mode of communication.</p>
</li>

<li>
<p><b>Aural noise level</b> for mobile devices is an important
consideration because of the variety of situations where the device
can be used, for example, noise from passing vehicles, other people
talking nearby, or loud music. Speech recognition can be tailored
based on noise levels returned by the System and Environment
component.</p>
</li>

<li>
<p><b>Battery level</b> provides information on the remaining
operational time. Such a notification to the Interaction Manager is
particularly relevant to small untethered devices where power
consumption is critical.</p>
</li>
</ul>
</li>

<li>
<p><b>Automotive</b> &mdash; Multimodality is typically an on-board
capability that senses the local environment to determine what
services can be adapted to the driver's situation, for example:</p>

<ul>
<li>
<p><b>Aural noise level</b> within the car can be generated and
modified by numerous environmental factors, for example driving with
the windows down, radio volume, the AC/fan on/off or windscreen
wipers on/off. Environmental conditions of the vehicle, controlled
by the driver, can be notified via the System and Environment
component to the Interaction Manager to adapt the speech
recognition.</p>
</li>

<li>
<p><b>In gear</b> notifications could provide information on the
driver's ability to use a touch screen in a Multimodal application.
In addition there are legal ramifications associated with the
driver operating devices whilst the vehicle is in motion. Therefore
the general behavior of a Multimodal application may need to adapt
according to whether the vehicle is parked or "in-drive".</p>
</li>

<li>
<p><b>GPS</b> notifications are an important feature of an on-board
Multimodal navigation system. The update frequency and accuracy of
updates are higher than for typical mobile LBS services (see the
Mobile section above).</p>
</li>
</ul>
</li>

<li>
<p><b>Desktop</b> &mdash; Multimodal applications can be tailored
to the user's preferences. These choices can be dynamic or static
for example:</p>

<ul>
<li>
<p><b>Static user preferences</b> &mdash; the default volume
setting, the rate in words per minute for playing text to speech, a
general preference for using speech rather than a keyboard. People
with visual impairments may opt for easy-to-see large print text
and high contrast color themes.</p>
</li>

<li>
<p><b>Dynamic preferences</b> &mdash; the user may suddenly mute
audio output, or switch from speech to pen input, and expect the
application to adapt accordingly. The application itself may
monitor the user's progress, and react appropriately, for
example, prompting the user to use a pen after successive failures
with speech recognition.</p>
</li>
</ul>
</li>
</ul>

<h3 id="s8.2">8.2 System and Environment Component Categories</h3>

<p>The above examples give a general indication of the
functionality that the System and Environment component offers as a
means for enabling applications to be tailored to adapt to device
capabilities, user preferences and environmental conditions.</p>

<ul>
<li>
<p><b>Environmental</b> conditions can be monitored and reported to
the Interaction Manager. One way to look at these
characteristics is to inspect interference channels:</p>

<ul>
<li>
<p><b>Auditory</b></p>

<ul>
<li>
<p><b>Environment too noisy for listening</b> &mdash; the
application should adapt to this change to provide a better
experience.</p>
</li>

<li>
<p><b>A speaker system/headphone attached?</b> A speaker system
allows the user to see the screen as well as listen at the same
time.</p>
</li>

<li>
<p><b>Car environment factors</b> - radio on/off, radio volume,
AC/Fan on/off, windscreen wipers on/off, windows up/down.</p>
</li>
</ul>
</li>

<li>
<p><b>Visual</b></p>

<ul>
<li>
<p><b>Whether gesture recognition is possible.</b> The user should
be able to see the sensor for a gesture-based application.
Moreover, if the user cannot see the device then audio becomes the
predominant mode of communication and the application should adapt
to it.</p>
</li>
</ul>
</li>

<li>
<p><b>Tactile</b></p>

<ul>
<li>
<p><b>Pen</b> &mdash; a large or small pen, or a finger, being used
as a tactile input device.</p>
</li>
</ul>
</li>
</ul>
</li>

<li>
<p><b>System</b> notifications can be derived from numerous
environmental sources, particularly within mobile and automotive
applications. Notifications from the System and Environment
component to the Interaction Manager can range from GPS location
information to the fact that the laptop has been closed. Many of
these system notifications indicate that the application should
switch to an alternative mode of operation.</p>
</li>

<li>
<p><b>User preferences</b> help with tailoring the application to
the user. These characteristics are most apparent in rich
Multimodal scenarios such as the desktop where resources are less
of an issue (large screens and fast CPUs). Preferences can be
modified to best suit user choices. Furthermore, it is possible to
dynamically adapt to the user's preferences over time.</p>
</li>
</ul>

<h2 id="s9">9. Illustrative Use Case</h2>

<p>To illustrate the component markup languages of the W3C
Multimodal Interaction Framework, consider this simple use case.
The human user points to a position on a displayed map and speaks:
"What is the name of this place?" The multimodal interaction system
responds by speaking "Lake Wobegon, Minnesota" and displays the
text "Lake Wobegon, Minnesota" on the map. The following summarizes
the actions of the relevant components of the W3C Multimodal
Interaction Framework:</p>

<p><em>Human user</em> &mdash; Points to a position on a map and
says, "What is the name of this place?"</p>

<p><em>Speech recognition component</em> &mdash; Recognizes the
words "What is the name of this place?"</p>

<p><em>Mouse recognition component</em> &mdash; Recognizes the x-y
coordinates of the position to which the user pointed on a map.</p>

<p><em>Speech interpretation component</em> &mdash; Converts the
words "What is the name of this place?" into an internal
notation.</p>

<p><em>Pointing interpretation component</em> &mdash; Converts the
x-y coordinates of the position to which the user pointed into an
internal notation.</p>

<p><em>Integration component</em> &mdash; Integrates the internal
notation for the words "What is the name of this place?" with the
internal notation for the x-y coordinates.</p>
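
<p>Purely as a sketch (EMMA was still under definition at the time,
and the element names and application payload below are invented),
the integrated result passed to the interaction manager might look
something like this:</p>

<pre>
&lt;!-- Illustrative only: integrated speech and pointing interpretation --&gt;
&lt;emma:emma xmlns:emma="http://www.w3.org/2003/04/emma"&gt;
  &lt;emma:interpretation id="combined1"&gt;
    &lt;query type="placeName"&gt;
      &lt;point x="212" y="147"/&gt;
    &lt;/query&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre>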

<p><em>Interaction manager component</em> &mdash; Stores the
internal notation in the session object, converts the request to a
database request, and submits the request to a database management
system, which returns the value "Lake Wobegon, Minnesota". The
interaction manager adds the response to the internal notation in
the session object, converts the response into an internal
notation, and sends the response to the generation component.</p>

<p><em>Generation component</em> &mdash; Accesses the System and
Environment component to determine that voice and graphics modes
are available, and decides to present the result in two
complementary modes, voice and graphics. The generation component
sends internal notation representing "Lake Wobegon, Minnesota" to
the voice styling component, and sends internal notation
representing the location of Lake Wobegon, Minnesota on a map to
the graphics styling component.</p>

<p><em>Voice styling component</em> &mdash; Converts the internal
notation representing "Lake Wobegon, Minnesota" into SSML.</p>

<p><em>Graphics styling component</em> &mdash; Converts the
internal notation representing the "Lake Wobegon, Minnesota"
location on a map into HTML notation.</p>

<p><em>Voice rendering component</em>: Converts the SSML notation
into acoustic voice for the user to hear.</p>

<p><em>Graphics rendering component</em>: Converts the HTML notation
into visual graphics for the user to see.</p>

<h2 id="s10">10.&nbsp; Examples of Architectures Consistent with
the W3C Multimodal Interaction Framework.</h2>

<p>There are many possible multimodal architectures that are
consistent with the W3C multimodal interaction framework. These
multimodal architectures have the following properties:</p>

<p>Property 1. THE MULTIMODAL ARCHITECTURE CONTAINS A SUBSET OF THE
COMPONENTS OF THE W3C MULTIMODAL INTERACTION FRAMEWORK. A
<em>multimedia architecture</em> contains two or more output modes.
A <em>multimodal architecture</em> contains two or more input
modes.</p>

<p>Property 2. COMPONENTS MAY BE PARTITIONED AND COMBINED. The
functions within a component may be partitioned into several
modules within the architecture, and the functions within two or
more components may be combined into a single module within the
architecture.</p>

<p>Property 3. THE COMPONENTS ARE ALLOCATED TO HARDWARE DEVICES. If
all components are allocated to the same hardware device, the
architecture is said to be a <em>centralized architecture</em>. For
example, a PC containing all of the selected components has a
centralized architecture. A <em>client-server architecture</em>
consists of two types of devices: several client devices containing
many of the input and output components, and the server which
contains the remaining components. A <em>distributed
architecture</em> consists of multiple types of devices connected
by a communication system.</p>

<p>Property 4. THE COMMUNICATION SYSTEMS ARE SPECIFIED. Designers
specify the protocols for exchanging messages among hardware
devices.</p>

<p>Property 5. THE DIALOG MODEL IS SPECIFIED. Designers specify how
modules are invoked and terminated, and how they interpret input to
produce output.</p>

<p>The following examples illustrate architectures that conform to
the W3C multimodal interaction framework.</p>

<h3 id="s10.1">Example 1: Driving Example (Figure 4)</h3>

<p>In this example, the user wants to go to a specific address from
his current location and while driving wants to take a detour to a
local restaurant (the user knows neither the restaurant's name nor
its address). The user initiates the service via a button on his
steering wheel and interacts with the system via the touch screen
and speech.</p>

<p>Property 1. The driving architecture contains the components
illustrated in Figure 4: a graphical display, map database, voice
and touch input, speech output, local ASR, TTS Processing and
GPS.</p>

<p>Property 2. No components are partitioned or combined, with the
possible exception of the integration and interaction manager
components, and the generation and interaction components. There
are two possible configurations, depending upon whether the
integration component is stand alone or combined with the
interaction manager component:</p>

<ul>
<li>
<p>Information entered by the user may be encoded into EMMA
(Extensible MultiModal Annotation Markup Language, formerly known
as the Natural Language Semantic Markup Language) and combined by
an integration component (shown within the dotted rectangle in
Figure 4) which is separate from the interaction manager.</p>
</li>

<li>
<p>Information entered by the user may be recognized and
interpreted and then routed directly to the interaction manager,
which performs its own integration of user information.</p>
</li>
</ul>

<p>There are two possible configurations, depending upon whether
the generation component is stand alone or combined with the
interaction manager component:</p>

<ul>
<li>
<p>Information from the interaction manager may be routed to the
generation component, where multiple modes of output are generated
and the appropriate synchronization control created.</p>
</li>

<li>
<p>Information may be routed directly to the styling components
and then on to the rendering components. In this case, the
interaction manager does its own generation and
synchronization.</p>
</li>
</ul>

<p>Property 3. All components are allocated to a single client side
hardware device onboard the car. In Figure 4, the client is
illustrated by a pink box containing all of the components.</p>

<p>Property 4. No communication system is required in this
centralized architecture.</p>

<p>Property 5. Dialog Model: The user wants to go to a specific
address from his current location and while driving wants to take a
detour to a local restaurant. (The user does not know the
restaurant name or address.) The user initiates the service via a
button on his steering wheel and interacts with the system via the
touch screen and speech.</p>

<p class="fig"><img src="fig4.png" width="559" height="539"
alt="Figure 4: Driving Example" /></p>

<h3 id="s10.2">Example 2: Name dialing (Figure 5)</h3>

<p>The Name dialing example enables a user to initiate a call by
saying the name of the person to be contacted. Visual and spoken
dialogs are used to narrow the selection, and to allow an exchange
of multimedia messages if the called person is unavailable. Call
handling is determined by a script provided by the called person.
The example supports the use of a combination of local and remote
speech recognition.</p>

<p>Property 1: The architecture contains a subset of the components
of the W3C Multimodal Interaction Framework.</p>

<p>Property 2: No components have been partitioned or combined, with
the possible exception of the integration component and interaction
component, and the generation component and the interaction
component (as discussed in example 1).</p>

<p>Property 3. The components in pink are allocated to the client
and the components in green are allocated to the server. Note that
the speech recognition and interpretation components are on both
client and server. The local ASR recognizes basic control commands
based upon the ETSI DES/HF-00021 standardized command and control
vocabulary, and the remote ASR recognizes names of individuals the
user wishes to dial. (The vocabulary of names is too large to
maintain on the client, so it is maintained on the server.)</p>

<p>Property 4. The communications system is SIP.
<a href="http://www.ietf.org/html.charters/sip-charter.html">SIP</a>,
the Session Initiation Protocol, is a means for initiating
communication sessions involving multiple devices, and for control
signaling during such sessions.</p>

<p>Property 5. Navigational and control commands are recognized by
the ASR on the client. When the user says "call John Smith," the
ASR on the client recognizes the command "call" and transfers the
following information ("John Smith") to the server for recognition.
The application on the server then connects the user with John
Smith's telephone.</p>

<p class="fig"><img src="fig5.png" width="557" height="545"
alt="Figure 5: Name Dialing Example" /></p>

<h3 id="s10.3">Example 3: Form fill-in (Figure 6)</h3>

<p>In the Form fill-in example, the user wants to make a flight
reservation with his mobile device while he is on the way to work.
The user initiates the service by means of making a phone call to a
multimodal service (telephone metaphor) or by selecting an application
(portal environment metaphor). The dialogue between the user and
the application is driven by a form-filling paradigm where the user
provides input to fields such as "Travel Origin:", "Travel
Destination:", "Leaving on date", "Returning on date". As the user
selects each field in the application to enter information, the
corresponding input constraints are activated to drive the
recognition and interpretation of the user input.</p>

<p>Property 1: The architecture contains a subset of the components
of the W3C Multimodal Interaction Framework, including GPS and
Ink.</p>

<p>Property 2: The speech recognition component has been
partitioned into two components, one which will be placed on the
client and the other on the server. The integration component and
interaction component, and the generation component and the
interaction component may be combined or left separate (as
discussed in example 1).</p>

<p>Property 3. The components in pink are allocated to the client
and the components in green are allocated to the server. Speech
recognition is distributed between the client and the server, with
the feature extraction on the client and the remaining speech
recognition functions performed on the server.</p>

<p>Property 4. The communications system is SIP.
<a href="http://www.ietf.org/html.charters/sip-charter.html">SIP</a>,
the Session Initiation Protocol, is a means for initiating
communication sessions involving multiple devices, and for control
signaling during such sessions.</p>

<p>Property 5. Dialog Model: The user wants to make a flight
reservation with his mobile device while he is on the way to work.
The user initiates the service by means of making a phone call to
a multimodal service (telephone metaphor) or by selecting an
application (portal environment metaphor). The dialogue between the
user and the application is driven by a form-filling paradigm where
the user provides input to fields such as "Travel Origin:", "Travel
Destination:", "Leaving on date", "Returning on date". As the user
selects each field in the application to enter information, the
corresponding input constraints are activated to drive the
recognition and interpretation of the user input. The capability of
providing composite multimodal input is also examined, where input
from multiple modalities is combined for the interpretation of the
user's intent.</p>

<p class="fig"><img src="fig6.png" width="558" height="547"
alt="Figure 6: Form Fill-in Example" /></p>
</body>
</html>