<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content=
"text/html; charset=iso-8859-1">
<title>Model Architecture for Voice Browser Systems</title>
<style type="text/css">
body { 
margin-left: 10%; 
margin-right: 5%; 
color: black;
background-color: white;
background-attachment: fixed;
background-image: url(http://www.w3.org/StyleSheets/TR/WD.gif);
background-position: top left;
background-repeat: no-repeat;
font-family: Tahoma, Verdana, "Myriad Web", Syntax, sans-serif;
}
.unfinished {  font-style: normal; background-color: #FFFF33}
.dtd-code {  font-family: monospace;
 background-color: #dfdfdf; white-space: pre;
 border: 1px solid #000000; }
p.copyright {font-size: smaller}
h2,h3 {margin-top: 1em;}
.extra { font-style: italic; color: #338033 }
code {
    color: green;
    font-family: monospace;
    font-weight: bold;
}
.example {
    border: solid green;
    border-width: 2px;
    color: green;
    font-weight: bold;
    margin-right: 5%;
    margin-left: 0;
}
.bad  {
    border: solid red;
    border-width: 2px;
    margin-left: 0;
    margin-right: 5%;
    color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
    background-color: rgb(204,204,255);
    padding: 0.5em;
    border: none;
    margin-right: 5%;
}
table {
    margin-left: -4%;
    margin-right: 4%;
    font-family: sans-serif;
    background: white;
    border-width: 2px;
    border-color: white;
  }
th { font-family: sans-serif; background: rgb(204, 204, 153) }
td { font-family: sans-serif; background: rgb(255, 255, 153) }
.tocline { list-style: none; }
</style>
<link rel="stylesheet" type="text/css" href= 
"http://www.w3.org/StyleSheets/TR/W3C-WD.css">
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head" src= 
"http://www.w3.org/Icons/WWW/w3c_home.gif" alt="W3C"></a></p>

<h1 class="notoc">Model Architecture for<br>
Voice Browser Systems</h1>

<h3 class="notoc">W3C Working Draft <i>23 December 1999</i></h3>

<dl>
<dt>This version:</dt>

<dd><a href= 
"http://www.w3.org/TR/1999/WD-voice-architecture-19991223">
http://www.w3.org/TR/1999/WD-voice-architecture-19991223</a></dd>

<dt>Latest version:</dt>

<dd><a href=
"http://www.w3.org/TR/voice-architecture">
http://www.w3.org/TR/voice-architecture</a></dd>

<dt>Editors:</dt>

<dd>M. K. Brown, Bell Labs, Murray Hill, NJ<br>
D. A. Dahl, Unisys, Malvern, PA</dd>
</dl>

<p class="copyright"><a href= 
"http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">
Copyright</a> &#169; 1999 <a href="http://www.w3.org/">
W3C</a><sup>&#174;</sup> (<a href=
"http://www.lcs.mit.edu/">MIT</a>, <a href=
"http://www.inria.fr/">INRIA</a>, <a href=
"http://www.keio.ac.jp/">Keio</a>), All Rights Reserved. <abbr
title="World Wide Web Consortium">W3C</abbr> <a href= 
"http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">
liability</a>, <a href= 
"http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">
trademark</a>, <a href= 
"http://www.w3.org/Consortium/Legal/copyright-documents">document
use</a> and <a href= 
"http://www.w3.org/Consortium/Legal/copyright-software">software
licensing</a> rules apply.</p>

<hr>
</div>

<h2 class="notoc">Abstract</h2>

<p>The W3C Voice Browser working group aims to develop
specifications to enable access to the Web using spoken
interaction. This document is part of a set of requirements
studies for voice browsers, and provides a model architecture for
processing speech within voice browsers.</p>

<h2>Status of this document</h2>

<p>This document describes a model architecture for speech
processing in voice browsers as an aid to work on understanding
requirements. Related requirement drafts are linked from the <a
href="/TR/1999/WD-voice-intro-19991223">introduction</a>. The
requirements are being released as working drafts but are not
intended to become proposed recommendations.</p>

<p>This specification is a Working Draft of the Voice Browser working
group for review by W3C members and other interested parties.  This is
the first public version of this document. It is a draft document and
may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use W3C Working Drafts as reference
material or to cite them as other than "work in progress".</p>

<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by members of the Voice Browser Working
Group.</p>

<p>This document has been produced as part of the <a href= 
"http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
following the procedures set out for the <a href= 
"http://www.w3.org/Consortium/Process/">W3C Process</a>. The
authors of this document are members of the <a href= 
"http://www.w3.org/Voice/Group">Voice Browser Working Group</a>.
This document is for public review. Comments should be sent to
the public mailing list &lt;<a href=
"mailto:www-voice@w3.org">www-voice@w3.org</a>&gt; (<a href= 
"http://www.w3.org/Archives/Public/www-voice/">archive</a>) by
14th January 2000.</p>

<p>A list of current W3C Recommendations and other technical
documents can be found at <a href="http://www.w3.org/TR">
http://www.w3.org/TR</a>.</p>

<h2>0. Introduction</h2>

<p>To help clarify the scope of the charters of the several
subgroups of the W3C Voice Browser Working Group, a
representative, or model, architecture for a typical voice
browser application has been developed.&#160; This architecture
illustrates one possible arrangement of the main components of a
typical system, and should not be construed as a
recommendation.&#160; Other proposed architectures for spoken
language systems, such as the <a href=
"http://fofoca.mitre.org/index.html">DARPA Communicator
architecture</a>, are currently available and may also be
compatible with voice browsers.</p>

<p>Connections between components are shown explicitly to make
clear the flow of information among the processes (and thereby
the interaction of the W3C subgroups).&#160; Each of the
currently existing subgroups (Universal Access, Speech Synthesis,
Grammar Representation, Natural Language, and Dialog) is
represented in this architecture.&#160; New subgroups are
currently being formed and may contribute additional elements to
this architecture in future drafts.</p>

<p>The design is intended to be agnostic with respect to client,
proxy, or server implementation of the various components,
although in practice some components will naturally fall into
client or server roles in relation to other components (indeed, a
component can act as a client to some components and as a server
to others).&#160; An open-agent architecture, in which component
connections are implicit, could be used to implement such a
system, allowing components to migrate between client and server
roles as necessary to fulfill their duties.&#160; The model
architecture is designed to accommodate synchronized multi-modal
input and multi-media output.</p>

<h2>1. Model Architecture</h2>

<p>The model architecture is shown in Figure 1.&#160; Solid
(green) boxes indicate system components, peripheral solid
(yellow) boxes indicate points of usage for markup language, and
dotted peripheral boxes indicate information flows.</p>

<p align="center"><img src="new_arch2-crop.gif" width="643"
height="491" alt=
"diagram showing model architecture for speech processing"></p>

<p align="center"><b>Figure 1. System Architecture</b></p>

<p>Two types of clients are illustrated: telephony and data
networking.&#160; The fundamental telephony client is, of course,
the telephone, either wireline or wireless.&#160; The handset
telephone requires a PSTN (Public Switched Telephone Network)
interface, which can be tip/ring, T1, or higher level, and may
include hybrid echo cancellation to remove line echoes so that
ASR barge-in can operate over audio output.&#160; A speakerphone
will also require an acoustic echo canceller to remove room
echoes.&#160; The data network interface will require only
acoustic echo cancellation if used with an open microphone, since
there is no line echo on data networks.&#160; The IP interface is
shown for illustration only; other data transport mechanisms can
be used as well.</p>
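
<p>By way of illustration only, the sketch below shows a
normalized least-mean-squares (NLMS) adaptive filter of the kind
commonly used for line or acoustic echo cancellation.&#160; The
function name, filter length, and step size are assumptions made
for the example, not part of this architecture.</p>

<pre class="example">
import numpy as np

def nlms_echo_cancel(far_end, mic, taps=128, mu=0.5, eps=1e-6):
    """Subtract an adaptively estimated echo of the far-end
    (loudspeaker or line) signal from the microphone signal."""
    w = np.zeros(taps)            # adaptive filter coefficients
    out = np.zeros(len(mic))      # echo-cancelled output
    for n in range(taps, len(mic)):
        x = far_end[n - taps:n][::-1]   # recent far-end samples
        e = mic[n] - np.dot(w, x)       # residual = near-end speech
        w += mu * e * x / (np.dot(x, x) + eps)  # NLMS update
        out[n] = e
    return out
</pre>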

<p>Once data has passed through the client interface, it can be
processed in much the same manner regardless of the client
type.&#160; One minor difference may be speech endpointing.&#160;
For speech input arriving through the telephony interface,
endpointing will most likely be performed either in the telephony
interface itself or at the front end of the ASR processor.&#160;
For speech arriving via the IP interface, endpointing can be
performed at the client as well as at the ASR front end.&#160;
The choice of where endpointing occurs is coupled with the choice
of where echo cancellation is performed.</p>
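
<p>For illustration, a crude energy-based endpointer that could
run either at the client or in the ASR front end is sketched
below.&#160; The frame size and threshold are assumptions;
practical endpointers also apply hangover timers so that brief
pauses within an utterance are not mistaken for its end.</p>

<pre class="example">
import numpy as np

def endpoint(samples, rate=8000, frame_ms=20, threshold_db=-40.0):
    """Crude endpointer for a 1-D numpy array of float samples in
    [-1, 1]: return (start, end) sample indices of the region
    whose per-frame energy exceeds a threshold, or None."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    energies = [10 * np.log10(np.mean(samples[i*frame:(i+1)*frame] ** 2) + 1e-12)
                for i in range(n)]
    voiced = [e &gt; threshold_db for e in energies]
    if True not in voiced:
        return None                              # no speech detected
    start = voiced.index(True) * frame
    end = (n - voiced[::-1].index(True)) * frame
    return start, end
</pre>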

<p>It is currently not clear how non-speech data will be handled
at the telephony interface.&#160; Such data can include pointing
device input from a "smart phone," address books and other
client-resident file data, and eventually even data like
video.&#160; These smart telephone devices are now on the drawing
boards of many suppliers.&#160; Some of this traffic can be
handled by WAP/WML, but there are still open issues with regard
to multi-modality.&#160; Therefore, voice markup language
specifications should provide a means of extending the language
features.</p>

<p>Data from the ASR/DTMF (etc.) recognizer must be in a format
compatible with the NL (Natural Language) interpreter.&#160;
Typically this would be text, but might include non-textual
components for pointing device input, in which case pointing
coordinates can be associated with text and/or semantic
tags.&#160; If the recognizer has detected valid input while
output is still being presented, the recognizer can signal the
presentation component to stop output.&#160; Barge-in may not be
desirable for certain types of multi-media output, and should
primarily be considered important for interrupting speech
output.&#160; In some cases it may also be undesirable to
interrupt speech output, such as in the processing of commands to
change speaking volume or rate.</p>
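
<p>One possible shape for the data passed from the recognizer to
the NL interpreter, including a barge-in signal to the
presentation component, is sketched below.&#160; All names and
fields are hypothetical; the architecture does not prescribe a
format.</p>

<pre class="example">
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hypothesis:
    text: str                         # recognized text ("" for pure DTMF input)
    confidence: float                 # recognizer score in [0, 1]
    pointing: Optional[tuple] = None  # e.g. (x, y) from a pointing device

@dataclass
class RecognizerResult:
    nbest: List[Hypothesis]           # hypotheses, best first
    barge_in: bool = False            # valid input detected during output?

def on_result(result, presenter):
    # On barge-in, ask the presentation component to stop speech
    # output; `presenter` and its method are hypothetical.
    if result.barge_in:
        presenter.stop_speech_output()
</pre>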

<p>The recognizer can produce multiple outputs and associated
confidence scores.&#160; The NL interpreter can also produce
multiple interpretations.&#160; Interpreted NL output is
coordinated with other modes of input that may require
interpretation in the current NL context, or that may alter or
augment the interpretation of the NL input.&#160; It is the
responsibility of the multi-media integration module to produce
possibly multiple coordinated joint interpretations of the
multi-modal input and present these to the dialog manager.&#160;
Context information can also be shared with the dialog manager to
further refine the interpretation, including resolution of
anaphora and implied expressions.&#160; The dialog manager is
responsible for the final selection of the best
interpretation.</p>
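
<p>The sketch below shows one way joint interpretations might be
formed, scored, and handed to the dialog manager for final
selection.&#160; The scoring rule and the context interface are
invented for the example.</p>

<pre class="example">
def joint_interpretations(asr_nbest, nl_interpret):
    """Cross recognizer hypotheses with their NL readings and
    score each pairing (scoring rule is illustrative)."""
    joint = []
    for hyp in asr_nbest:
        for reading in nl_interpret(hyp.text):
            # Simple joint score: recognizer confidence times the
            # interpreter's confidence in this particular reading.
            score = hyp.confidence * reading.confidence
            joint.append((score, hyp, reading))
    return sorted(joint, key=lambda j: j[0], reverse=True)

def select_best(joint, context):
    # The dialog manager makes the final choice, using shared
    # context (a hypothetical interface) to discard implausible
    # readings, resolve anaphora, and so on.
    plausible = [j for j in joint if context.is_plausible(j)]
    return plausible[0] if plausible else None
</pre>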

<p>The dialog manager is also responsible for responding to the
input statement.&#160; This responsibility can include resolving
ambiguity, issuing instructions and/or queries to the task
manager, collecting output from the task manager, forming a
natural language expression or visual presentation of the task
manager output, and coordinating recognizer context.</p>
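
<p>A single turn of such a dialog manager might be organized as
in the sketch below; every component interface shown is an
assumption made for illustration.</p>

<pre class="example">
def dialog_turn(dm, joint):
    """One turn of a hypothetical dialog manager `dm`."""
    best = dm.select(joint)                  # final interpretation choice
    if best is None or dm.is_ambiguous(best):
        return dm.clarification_prompt(best) # resolve ambiguity with the user
    reply = dm.task_manager.execute(best)    # instructions/queries to task manager
    dm.recognizer.set_context(dm.expected_grammar(reply))  # coordinate ASR context
    return dm.render(reply)                  # NL or visual presentation
</pre>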

<p>The task manager is primarily an Application Program Interface
(API), but can also include pragmatic and application-specific
reasoning.&#160; The task manager can be an agent or a proxy, can
possess state, and can communicate with other agents or proxies
for services.&#160; The primary application interface for the
task manager is expected to be web servers, but other APIs can be
used as well.</p>
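
<p>The sketch below shows a hypothetical task manager acting as a
stateful proxy to a web server.&#160; The endpoint and payload
format are invented for the example.</p>

<pre class="example">
import json
import urllib.request

class TaskManager:
    """Illustrative task manager fronting a web server; it keeps
    state across turns (e.g. partially filled query slots)."""
    def __init__(self, endpoint):
        self.endpoint = endpoint
        self.state = {}

    def execute(self, interpretation):
        self.state.update(interpretation.slots)  # accumulate state
        req = urllib.request.Request(
            self.endpoint,
            data=json.dumps(self.state).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)               # output for the dialog manager
</pre>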

<p>Finally, the presentation manager, or output media "renderer,"
has responsibility for formatting multi-media output in a
coordinated manner.&#160; The presentation manager should be
aware of the client device capabilities.</p>
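
<p>For example, a presentation manager might consult a record of
client capabilities as in the following sketch (the capability
record and presenter methods are hypothetical):</p>

<pre class="example">
def render(presenter, content, device):
    """Choose output media according to client capabilities."""
    if device.has_display and content.visual is not None:
        presenter.show(content.visual)   # coordinated visual output
    if device.has_audio and content.spoken is not None:
        presenter.speak(content.spoken)  # synthesized speech output
</pre>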
</body>
</html>