NOTE-emma-usecases-20091215 79.1 KB

Raw Blame History Permalink

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta name="generator" content=
"HTML Tidy for Linux/x86 (vers 1 September 2005), see www.w3.org" />
<title>Use Cases for Possible Future EMMA Features</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<style type="text/css">
/*<![CDATA[*/
code           { font-family: monospace; }

div.constraint,
div.issue,
div.note,
div.notice     { margin-left: 2em; }

ol.enumar      { list-style-type: decimal; }
ol.enumla      { list-style-type: lower-alpha; }
ol.enumlr      { list-style-type: lower-roman; }
ol.enumua      { list-style-type: upper-alpha; }
ol.enumur      { list-style-type: upper-roman; }


div.exampleInner pre { margin-left: 1em;
                       margin-top: 0em; margin-bottom: 0em}
div.exampleOuter {border: 4px double gray;
                  margin: 0em; padding: 0em}
div.exampleInner { background-color: #d5dee3;
                   border-top-width: 4px;
                   border-top-style: double;
                   border-top-color: #d3d3d3;
                   border-bottom-width: 4px;
                   border-bottom-style: double;
                   border-bottom-color: #d3d3d3;
                   padding: 4px; margin: 0em }
div.exampleWrapper { margin: 4px }
div.exampleHeader { font-weight: bold;
                    margin: 4px}

table {
        width:80%;
                border:1px solid #000;
                border-collapse:collapse;
                font-size:90%;
        }

td,th{
                border:1px solid #000;
                border-collapse:collapse;
                padding:5px;
        }


caption{
                background:#ccc;
                font-size:140%;
                border:1px solid #000;
                border-bottom:none;
                padding:5px;
                text-align:center;
        }

img.center {
  display: block;
  margin-left: auto;
  margin-right: auto;
}
p.caption {
  text-align: center
}


.RFC2119 {
  text-transform: lowercase;
  font-style: italic;
}
/*]]>*/
</style>

<style type="text/css">
/*<![CDATA[*/
 p.c1 {font-weight: bold}
/*]]>*/
</style>

<link href="http://www.w3.org/StyleSheets/TR/W3C-WG-NOTE.css" type="text/css" rel="stylesheet" />
<meta content="MSHTML 6.00.6000.16762" name="GENERATOR" />
<style type="text/css">
/*<![CDATA[*/
 ol.c2 {list-style-type: lower-alpha}
 li.c1 {list-style: none}
/*]]>*/
</style>
</head>
<body xml:lang="en" lang="en">
<div class="head"><a href="http://www.w3.org/"><img alt="W3C" src=
"http://www.w3.org/Icons/w3c_home" width="72" height="48" /></a>
<h1 id="title">Use Cases for Possible Future EMMA Features</h1>
<h2 id="w3c-doctype">W3C Working Group Note <i>15</i> <i>December</i> <i>2009</i></h2>

<dl>
<dt>This version:</dt>
<dd><a href="http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215">http://www.w3.org/TR/2009/NOTE-emma-usecases-20091215</a></dd>
<dt>Latest version:</dt>
<dd><a href="http://www.w3.org/TR/emma-usecases">http://www.w3.org/TR/emma-usecases</a></dd>
<dt>Previous version:</dt>
<dd><em>This is the first publication.</em></dd>

<dt>Editor:</dt>
<dd>Michael Johnston, AT&amp;T</dd>
<dt>Authors:</dt>
<dd>Deborah A. Dahl, Invited Expert</dd>
<dd>Ingmar Kliche, Deutsche Telekom AG</dd>
<dd>Paolo Baggia, Loquendo</dd>
<dd>Daniel C. Burnett, Voxeo</dd>
<dd>Felix Burkhardt, Deutsche Telekom AG</dd>
<dd>Kazuyuki Ashimura, W3C</dd>
</dl>
<p class="copyright"><a href=
"http://www.w3.org/Consortium/Legal/ipr-notice#Copyright">Copyright</a>
© 2009 <a href="http://www.w3.org/"><acronym title=
"World Wide Web Consortium">W3C</acronym></a><sup>®</sup> (<a href=
"http://www.csail.mit.edu/"><acronym title=
"Massachusetts Institute of Technology">MIT</acronym></a>, <a href=
"http://www.ercim.org/"><acronym title=
"European Research Consortium for Informatics and Mathematics">ERCIM</acronym></a>,
<a href="http://www.keio.ac.jp/">Keio</a>), All Rights Reserved.
W3C <a href=
"http://www.w3.org/Consortium/Legal/ipr-notice#Legal_Disclaimer">liability</a>,
<a href=
"http://www.w3.org/Consortium/Legal/ipr-notice#W3C_Trademarks">trademark</a>
and <a href=
"http://www.w3.org/Consortium/Legal/copyright-documents">document
use</a> rules apply.</p>
</div>
<!-- end of head div -->
<hr title="Separator for header" />
<h2 id="abstract">Abstract</h2>
<p>The EMMA: Extensible MultiModal Annotation specification defines
an XML markup language for capturing and providing metadata on the
interpretation of inputs to multimodal systems. Throughout the
implementation report process and discussion since EMMA 1.0 became
a W3C Recommendation, a number of new possible use cases for the
EMMA language have emerged. These include the use of EMMA to
represent multimodal output, biometrics, emotion, sensor data,
multi-stage dialogs, and interactions with multiple users. In this
document, we describe these use cases and illustrate how the EMMA
language could be extended to support them.</p>

<h2 id="status">Status of this Document</h2>

<p><em>This section describes the status of this document at the
time of its publication. Other documents may supersede this
document. A list of current W3C publications and the latest
revision of this technical report can be found in the <a href=
"http://www.w3.org/TR/">W3C technical reports index</a> at
http://www.w3.org/TR/.</em></p>

<p>This document is a W3C Working Group Note published on 15 December
2009. This is the first publication of this document and it represents
the views of the W3C Multimodal Interaction Working Group at the time
of publication. The document may be updated as new technologies emerge
or mature. Publication as a Working Group Note does not imply
endorsement by the W3C Membership. This is a draft document and may be
updated, replaced or obsoleted by other documents at any time. It is
inappropriate to cite this document as other than work in
progress.</p>

<p>This document is one of a series produced by the
<a href="http://www.w3.org/2002/mmi/">Multimodal Interaction WorkingGroup</a>,
part of the <a href="http://www.w3.org/2002/mmi/Activity">W3C Multimodal Interaction
Activity</a>.

Since <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
Recommendation, a number of new possible use cases for the EMMA language have
emerged, e.g., the use of EMMA to represent multimodal output, biometrics,
emotion, sensor data, multi-stage dialogs and interactions with multiple users.

Therefore the Working Group have been working on a document capturing use cases
and issues for a series of possible extensions to EMMA.

The intention of publishing this Working Group Note is to seek feedback on the
various different use cases.
</p>

<p>Comments on this document can be sent to <a href=
"mailto:www-multimodal@w3.org">www-multimodal@w3.org</a>, the
public forum for discussion of the W3C's work on Multimodal
Interaction. To subscribe, send an email to <a href=
"mailto:www-multimodal-request@w3.org">www-multimodal-request@w3.org</a>
with the word subscribe in the subject line (include the word
unsubscribe if you want to unsubscribe). The <a href=
"http://lists.w3.org/Archives/Public/www-multimodal/">archive</a>
for the list is accessible online.</p>

<p> This document was produced by a group operating under the <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/">5 February 2004 W3C Patent Policy</a>. W3C maintains a <a rel="disclosure" href="http://www.w3.org/2004/01/pp-impl/34607/status">public list of any patent disclosures</a> made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#def-essential">Essential Claim(s)</a> must disclose the information in accordance with <a href="http://www.w3.org/Consortium/Patent-Policy-20040205/#sec-Disclosure">section 6 of the W3C Patent Policy</a>. </p>

<h2 id="contents">Table of Contents</h2>
<ul>
<li>1. <a href="#s1">Introduction</a></li>
<li>2. <a href="#s2">EMMA use cases</a></li>
</ul>
<ul class="tocline">
<li>2.1 <a href="#s2.1">Incremental results for streaming
modalities such as haptics, ink, monologues, dictation</a></li>
<li>2.2 <a href="#s2.2">Representing biometric information</a></li>
<li>2.3 <a href="#s2.3">Representing emotion in EMMA</a></li>
<li>2.4 <a href="#s2.4">Richer semantic representations in
EMMA</a></li>
<li>2.5 <a href="#s2.5">Representing system output in EMMA</a></li>
<li class="c1">
<ul class="tocline">
<li>2.5.1 <a href="#s2.5.1">Abstracting output from specific
modalities</a></li>
<li>2.5.2 <a href="#s2.5.2">Coordination of outputs distributed
over multiple different modalities</a></li>
</ul>
</li>
<li>2.6 <a href="#s2.6">Representation of dialogs in EMMA</a></li>
<li>2.7 <a href="#s2.7">Logging, analysis, and annotation</a></li>
<li class="c1">
<ul class="tocline">
<li>2.7.1 <a href="#s2.7.1">Log analysis</a></li>
<li>2.7.2 <a href="#s2.7.2">Log annotation</a></li>
</ul>
</li>
<li>2.8 <a href="#s2.8">Multi-sentence inputs</a></li>
<li>2.9 <a href="#s2.9">Multi-participant interactions</a></li>
<li>2.10 <a href="#s2.10">Capturing sensor data such as GPS in
EMMA</a></li>
<li>2.11 <a href="#s2.11">Extending EMMA from NLU to also represent
search or database retrieval results</a></li>
<li>2.12 <a href="#s2.12">Supporting other semantic representation
forms in EMMA</a></li>
</ul>
<ul>
<li><a href="#references">General References</a></li>
</ul>
<hr title="Separator for introduction" />
<h2 id="s1">1. Introduction</h2>
<p>This document presents a set of use cases for possible new
features of the Extensible MultiModal Annotation (EMMA) markup
language. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was
designed primarily to be used as a data interchange format by
systems that provide semantic interpretations for a variety of
inputs, including but not necessarily limited to, speech, natural
language text, GUI and ink input. EMMA 1.0 provides a set of
elements for containing the various stages of processing of a
user's input and a set of elements and attributes for specifying
various kinds of metadata such as confidence scores and timestamps.
<a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> became a W3C
Recommendation on February 10, 2009.</p>
<p>A number of possible extensions to <a href=
"http://www.w3.org/TR/emma/">EMMA 1.0</a> have been identified
through discussions with other standards organizations,
implementers of EMMA, and internal discussions within the W3C
Multimodal Interaction Working Group. This document focusses on the
following use cases:</p>
<ol>
<li>Representing incremental results for streaming modalities such
as haptics, ink, monologues, dictation, where it is desirable to
have partial results available before the full input finishes.</li>
<li>Representing biometric results such as the results of speaker
verification or speaker identification (briefly covered in EMMA
1.0).</li>
<li>Representing emotion, for example, as conveyed by intonation
patterns, facial expression, or lexical choice.</li>
<li>Richer semantic representations, for example, integrating EMMA
application semantics with ontologies.</li>
<li>Representing system output in addition to user input, including
topics such as:</li>
<li class="c1">
<ol class="c2">
<li>Isolating presentation logic from dialog/interaction
management.</li>
<li>Coordination of outputs distributed over multiple different
modalities.</li>
</ol>
</li>
<li>Support for archival functions such as logging, human
annotation of inputs, and data analysis.</li>
<li>Representing full dialogs and multi-sentence inputs in addition
to single inputs.</li>
<li>Representing multi-participant interactions.</li>
<li>Representing sensor data such as GPS input.</li>
<li>Representing the results of database queries or search.</li>
<li>Support for forms of representation of application semantics
other than XML, such as JSON.</li>
</ol>
<p>It may be possible to achieve support for some of these features
without modifying the language, through the use of the
extensibility mechanisms of <a href=
"http://www.w3.org/TR/emma/">EMMA 1.0</a>, such as the
<code>&lt;emma:info&gt;</code> element and application-specific
semantics; however, this would significantly reduce
interoperability among EMMA implementations. If features are of
general value then it would be beneficial to define standard ways
of implementing them within the EMMA language. Additionally,
extensions may be needed to support additional new kinds of input
modalities such as multi-touch and accelerometer input.</p>
<p>The W3C Membership and other interested parties are invited to
review this document and send comments to the Working Group's
public mailing list www-multimodal@w3.org <a href=
"http://lists.w3.org/Archives/Public/www-multimodal/">(archive)</a>
.</p>
<h2 id="s2">2. EMMA use cases</h2>
<h3 id="s2.1">2.1 Incremental results for streaming modalities such
as haptic, ink, monologues, dictation</h3>
<p>In EMMA 1.0, EMMA documents were assumed to be created for
completed inputs within a given modality. However, there are
important use cases where it would be beneficial to represent some
level of interpretation of partial results before the input is
complete. For example, in a dictation application, where inputs can
be lengthy it is often desirable to show partial results to give
feedback to the user while they are speaking. In this case, each
new word is appended to the previous sequence of words. Another use
case would be incremental ASR, either for dictation or dialog
applications, where previous results might be replaced as more
evidence is collected. As more words are recognized and provide
more context, earlier word hypotheses may be updated. In this
scenario it may be necessary to replace the previous hypothesis
with a revised one.</p>
<p>In this section, we discuss how the EMMA standard could be
extended to support incremental or streaming results in the
processing of a single input. Some key considerations and areas for
discussion are:</p>
<ol>
<li>Do we need an identifier for a particular stream? Or is
<code>emma:source</code> sufficient? Subsequent messages (carrying
information for a particular stream) may need to have the same
identifier.</li>
<li>Do we need a sequence number to indicate order? Or are
timestamps sufficient (though optional)?</li>
<li>Do we need to mark "begin", "in progress" and "end" of a
stream? There are streams with a particular start and end, like a
dictation. Note that sensors may never explicitly end a
stream.</li>
<li>Do we always append information? Or do we also replace previous
data? A dictation application will probably append new text. But do
we consider sensor data (such as GPS position or device tilt) as
streaming or as "final" data?</li>
</ol>
<p>In the example below for dictation, we show how three new
attributes <code>emma:streamId</code>,
<code>emma:streamSeqNr</code>, and <code>emma:streamProgress</code>
could be used to annotate each result with metadata regarding its
position and status within a stream of input. In this example, the
<code>emma:streamId</code> is a identifier which can be used to
show that different <code>emma:interpretation</code> elements are
members of the same stream. The <code>emma:streamSeqNr</code>
attribute provides a numerical order to elements in the stream
while <code>emma:streamProgress</code> indicates the start of the
stream (and whether to expect more interpretations within the same
stream), and the end of the stream. This is an instance of the
'append' scenario for partial results in EMMA.</p>
<table width="120">
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">User</td>
<td>Hi Joe the meeting has moved</td>
<td>
<pre>
&lt;emma:emma &gt;
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  &lt;emma:interpretation id="int1"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:confidence="0.75"
    emma:tokens="Hi Joe the meeting has moved"
    emma:streamId="id1"
    emma:streamSeqNr="0"
    emma:streamProgress="begin"&gt;
      &lt;emma:literal&gt;
      Hi Joe the meeting has moved
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="50">User</td>
<td>to friday at four</td>
<td>
<pre>
&lt;emma:emma &gt;
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  &lt;emma:interpretation id="int2"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="transcription"
    emma:confidence="0.75"
    emma:tokens="to friday at four"
    emma:streamId="id1"
    emma:streamSeqNr="1"
    emma:streamProgress="end"&gt;
      &lt;emma:literal&gt;
      to friday at four
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
</table>
<p>In the example below, a speech recognition hypothesis for the
whole string is updated once more words have been recognized. This
is an instance of the 'replace' scenario for partial results in
EMMA. Note that the <code>emma:streamSeqNr</code> is the same for
each interpretation in this case.</p>
<table width="120">
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">User</td>
<td>Is there a Pisa</td>
<td>
<pre>
&lt;emma:emma &gt;
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  &lt;emma:interpretation id="int1"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="dialog"
    emma:confidence="0.7"
    emma:tokens="is there a pisa"
    emma:streamId="id2"
    emma:streamSeqNr="0"
    emma:streamProgress="begin"&gt;
      &lt;emma:literal&gt;
      is there a pisa
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="50">User</td>
<td>Is there a pizza restaurant</td>
<td>
<pre>
&lt;emma:emma &gt;
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  &lt;emma:interpretation id="int2"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:function="dialog"
    emma:confidence="0.9"
    emma:tokens="is there a pizza restaurant"
    emma:streamId="id2"
    emma:streamSeqNr="0"
    emma:streamProgress="end"&gt;
      &lt;emma:literal&gt;
      is there a pizza restaurant
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
</table>
<p>One issue for the 'replace' case of incremental results, is how
to specify that a result replaces multiple of the previously
received results. For example, a system could receive partial
results consisting of each word in turn of an utterance, and then a
final result which is the final recognition for the whole sequence
of words. One approach to this problem would be to allow
<code>emma:streamSeqNr</code> to specify a range of inputs to be
replaced. For example, if the <code>emma:streamSeqNr</code> for
each of three single word results was 1, 2, and then 3. A final
revised result could be marked as
<code>emma:streamSeqNr="1-3"</code> indicating that it is a revised
result for those three words.</p>
<p>One issue is whether timestamps might be used to track ordering
instead of introducing new attributes. One problem is that
timestamp attributes are not required and may not always be
available. Also as shown in the example, chunks of input in a
stream may not always be in sequential order. Even with timestamps
providing an order some kind of 'begin' and 'end' flag is needed
(like <code>emma:streamProgress)</code> to indicate indicate the
beginning and end of transmission of streamed input. Moreover,
timestamps do not provide sufficient information to detect whether
a message has been lost.</p>
<p>Another possibility to explore for representation of incremental
results would be to use an <code>&lt;emma:sequence&gt;</code>
element containing the interim results and a derived result which
contains the combination.</p>
<p>Another issue to explore is the relationship between incremental
results and the MMI lifecyle events within the <a href=
"http://www.w3.org/TR/mmi-arch/">MMI Architecture</a>.</p>
<h3 id="s2.2">2.2 Representing biometric information</h3>
<p>Biometric technologies include systems designed to identify
someone or verify a claim of identity based on their physical or
behavioral characteristics. These include speaker verification,
speaker identification, face recognition, and iris recognition,
among others. <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
provided some capability for representing the results of biometric
analysis through values of the <code>emma:function</code> attribute
such as "verification". However, it did not discuss the specifics
of this use case in any detail. It may be worth exploring further
considerations and consequences of using EMMA to represent
biometric results. As one example, if different biometric results
are represented in EMMA, this would simplify the process of fusing
the outputs of multiple biometric technologies to obtain a more
reliable overall result. &nbsp;It should also make it easier to
take into account non-biometric claims of identity, such as a
statement like "this is Kazuyuki", represented in EMMA, along with
a speaker verification result based on the speaker's voice, which
would also be represented in EMMA. In the following example, we
have extended the set of values for <code>emma:function</code> to
include "identification" for an interpretation showing the results
of a biometric component that picks out an individual from a set of
possible individuals (who are they). This contrasts with
"verification" which is used for verification of a particular user
(are they who they say they are).</p>
<h4 id="biometric_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>an image of a face</td>
<td>
<pre>
&lt;emma:emma&gt;
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  &lt;emma:interpretation id=“int1"
    emma:confidence="0.75”
    emma:medium="visual"
    emma:mode="photograph"
    emma:verbal="false"
    emma:function="identification"&gt;
      &lt;person&gt;12345&lt;/person&gt;
      &lt;name&gt;Mary Smith&lt;/name&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
</tbody>
</table>
<p>One direction to explore further is the relationship between
work on messaging protocols for biometrics within the OASIS
Biometric Identity Assurance Services (<a href="http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=bias">BIAS</a>)
standards committee and EMMA.</p>
<h3 id="s2.3">2.3 Representing emotion in EMMA</h3>
<p>In addition to speech recognition, and other tasks such as
speaker verification and identification, another kind of
interpretation of speech that is of increasing importance is
determination of the emotional state of the speaker, based on, for
example, their prosody, lexical choice, or other features. This
information can be used, for example, to make the dialog logic of
an interactive system sensitive to the user's emotional state.
Emotion detection can also use other modalities such as vision
(facial expression, posture) and physiological sensors such as skin
conductance measurement or blood pressure. Multimodal approaches
where evidence is combined from multiple different modalities are
also of significance for emotion classification.</p>
<p>The creation of a markup language for emotion has been a recent
focus of attention in W3C. Work that initiated in the W3C Emotion
Markup Language Incubator Group (<a href=
"http://www.w3.org/2005/Incubator/emotion/XGR-emotionml-20081120/">EmotionML
XG</a>), has now transitioned to the <a href=
"http://www.w3.org/2002/mmi/">W3C Multimodal Working Group</a> and
the <a href="http://www.w3.org/TR/emotionml">EmotionML</a> language
has been published as a working draft. One of the major use cases
for that effort is: "Automatic recognition of emotions from
sensors, including physiological sensors, speech recordings, facial
expressions, etc., as well as from multi-modal combinations of
sensors."</p>
<p>Given the similarities to the technologies and annotations used
for other kinds of input processing (recognition, semantic
classification) which are now captured in EMMA, it makes sense to
explore the use of EMMA for capture of emotional classification of
inputs. Just as EMMA does not standardize the application markup
for semantic results, though, it does not make sense to try and
standardize emotion markup within EMMA. One promising approach is
to combine the containers and metadata annotation of EMMA with the
<a href="http://www.w3.org/TR/emotionml">EmotionML</a> markup, as
shown in the following example.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">expression of boredom</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:interpretation id="emo1"
    emma:start="1241035886246"
    emma:end="1241035888246"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="false"
    emma:signal="http://example.com/input345.amr"
    emma:media-type="audio/amr; rate:8000;"
    emma:process="engine:type=emo_class&amp;vn=1.2”&gt;
      &lt;emo:emotion&gt;
        &lt;emo:intensity
          value="0.1"
          confidence="0.8"/&gt;
        &lt;emo:category
          set="everydayEmotions"
          name="boredom"
          confidence="0.1"/&gt;
      &lt;/emo:emotion&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>In this example, we use the capabilities of EMMA for describing
the input signal, its temporal characteristics, modality, sampling
rate, audio codec etc. and EmotionML is used to provide the
specific representation of the emotion. Other EMMA container
elements also have strong use cases for emotion recognition. For
example, <code>&lt;emma:one-of&gt;</code> can be used to represent
N-best lists of competing classifications of emotion. The
<code>&lt;emma:group&gt;</code> element could be used to combine a
semantic interpretation of a user input with an emotional
classification, as illustrated in the following example. Note that
all of the general properties of the signal can be specified on the
<code>&lt;emma:group&gt;</code> element.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">spoken input "flights to boston tomorrow" to dialog
system in angry voice</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:group id="result1"
    emma:start="1241035886246"
    emma:end="1241035888246"
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="false"
    emma:signal="http://example.com/input345.amr"
    emma:media-type="audio/amr; rate:8000;"&gt;
    &lt;emma:interpretation id="asr1"
      emma:tokens="flights to boston tomorrow"
      emma:confidence="0.76"
      emma:process="engine:type=asr_nl&amp;vn=5.2”&gt;
        &lt;flight&gt;
          &lt;dest&gt;boston&lt;/dest&gt;
          &lt;date&gt;tomorrow&lt;/date&gt;
        &lt;/flight&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="emo1"
      emma:process="engine:type=emo_class&amp;vn=1.2”&gt;
      &lt;emo:emotion&gt;
        &lt;emo:intensity
          value="0.3"
          confidence="0.8"/&gt;
        &lt;emo:category
          set="everydayEmotions"
          name="anger"
          confidence="0.8"/&gt;
      &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:group-info&gt;
    meaning_and_emotion
    &lt;/emma:group-info&gt;
  &lt;/emma:group&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>The element <code>&lt;emma:group&gt;</code> can also be used to
capture groups of emotion detection results from individual
modalities for combination by a multimodal fusion component or when
automatic recognition results are described together with manually
annotated data. This use case is inspired by <a href=
"http://www.w3.org/2005/Incubator/emotion/XGR-emotion/#AppendixUseCases">
Use case 2b (II)</a> of the Emotion Incubator Group Report. The
following example illustrates the grouping of three
interpretations, namely: a speech analysis emotion classifier, a
physiological emotion classifier measuring blood pressure, and a
human annotator viewing video, for two different media files (from
the same episode) that are synchronized via <code>emma:start</code>
and <code>emma:end</code> attributes. In this case, the
physiological reading is for a subinterval of the video and audio
recording.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">audio, video, and physiological sensor of a test
user acting with a new design.</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:group id="result1"&gt;
    &lt;emma:interpretation id="speechClassification1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="false"
      emma:start="1241035884246"
      emma:end="1241035887246"
      emma:signal="http://example.com/video_345.mov"
      emma:process="engine:type=emo_voice_classifier”&gt;
        &lt;emo:emotion&gt;
          &lt;emo:category
            set="everydayEmotions"
            name="anger"
            confidence="0.8"/&gt;
        &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="bloodPressure1"
      emma:medium="tactile"
      emma:mode="blood_pressure"
      emma:verbal="false"
      emma:start="1241035885300"
      emma:end="1241035886900"
      emma:signal="http://example.com/bp_signal_345.cvs"
      emma:process="engine:type=emo_physiological_classifier”&gt;
        &lt;emo:emotion&gt;
          &lt;emo:category
            set="everydayEmotions"
            name="anger"
            confidence="0.6"/&gt;
        &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="humanAnnotation1"
      emma:medium="visual"
      emma:mode="video"
      emma:verbal="false"
      emma:start="1241035884246"
      emma:end="1241035887246"
      emma:signal="http://example.com/video_345.mov"
      emma:process="human:type=labeler&amp;id=1”&gt;
        &lt;emo:emotion&gt;
          &lt;emo:category
            set="everydayEmotions"
            name="fear"
            confidence="0.6"/&gt;
        &lt;/emo:emotion&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:group-info&gt;
    several_emotion_interpretations
    &lt;/emma:group-info&gt;
  &lt;/emma:group&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>A combination of <code>&lt;emma:group&gt;</code> and
<code>&lt;emma:derivation&gt;</code> could be used to represent a
combined emotional analysis resulting from analysis of multiple
different modalities of the user's behavior. The
<code>&lt;emma:derived-from&gt;</code> and
<code>&lt;emma:derivation&gt;</code> elements can be used to
capture both the fused result and combining inputs in a single EMMA
document. In the following example, visual analysis of user
activity and analysis of their speech have been combined by a
multimodal fusion component to provide an combined multimodal
classification of the user's emotional state. The specifics of the
multimodal fusion algorithm are not relevant here, or to EMMA in
general. Note though that in this case, the multimodal fusion
appears to have compensated for uncertainty in the visual analysis
which gave two results with equal confidence, one for fear and one
for anger. The <code>emma:one-of</code> element is used to capture
the N-best list of multiple competing results from the video
classifier.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">multimodal fusion of emotion classification of user
based on analysis of voice and video</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns:emo="http://www.w3.org/2009/10/emotionml"&gt;
  &lt;emma:interpretation id="multimodalClassification1"
    emma:medium="acoustic,visual"
    emma:mode="voice,video"
    emma:verbal="false"
    emma:start="1241035884246"
    emma:end="1241035887246"
    emma:process="engine:type=multimodal_fusion”&gt;
      &lt;emo:emotion&gt;
        &lt;emo:category
          set="everydayEmotions"
          name="anger"
          confidence="0.7"/&gt;
      &lt;/emo:emotion&gt;
    &lt;emma:derived-from ref="mmgroup1" composite="true"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:group id="mmgroup1"&gt;
      &lt;emma:interpretation id="speechClassification1"
        emma:medium="acoustic"
        emma:mode="voice"
        emma:verbal="false"
        emma:start="1241035884246"
        emma:end="1241035887246"
        emma:signal="http://example.com/video_345.mov"
        emma:process="engine:type=emo_voice_classifier”&gt;
          &lt;emo:emotion&gt;
            &lt;emo:category
              set="everydayEmotions"
              name="anger"
              confidence="0.8"/&gt;
          &lt;/emo:emotion&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:one-of id="video_nbest"
        emma:medium="visual"
        emma:mode="video"
        emma:verbal="false"
        emma:start="1241035884246"
        emma:end="1241035887246"
        emma:signal="http://example.com/video_345.mov"
        emma:process="engine:type=video_classifier"&gt;
        &lt;emma:interpretation id="video_result1"
          &lt;emo:emotion&gt;
            &lt;emo:category
              set="everydayEmotions"
              name="anger"
              confidence="0.5"/&gt;
          &lt;/emo:emotion&gt;
        &lt;/emma:interpretation&gt;
        &lt;emma:interpretation id="video_result2"
          &lt;emo:emotion&gt;
            &lt;emo:category
              set="everydayEmotions"
              name="fear"
              confidence="0.5"/&gt;
          &lt;/emo:emotion&gt;
        &lt;/emma:interpretation&gt;
      &lt;/emma:one-of&gt;
      &lt;emma:group-info&gt;
      emotion_interpretations
      &lt;/emma:group-info&gt;
    &lt;/emma:group&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>One issue which need to be addressed is the relationship between
EmotionML <code>confidence</code> attribute values and
<code>emma:confidence</code> values. Could the
<code>emma:confidence</code> value be used as an overall confidence
value for the emotion result, or should confidence values appear
only within the EmotionML markup since confidence is used for
different dimensions of the result? If a series of possible emotion
classifications are contained in <code>emma:one-of</code> should
they be ordered by the EmotionML confidence values?</p>
<h3 id="s2.4">2.4 Richer semantic representations in EMMA</h3>
<p>Enriching the semantic information represented in EMMA would be
helpful for certain use cases. For example, the concepts in an EMMA
application semantics representation might include references to
concepts in an ontology such as WordNet. Then, a translation system
might make use of a sense disambiguator to represent the
probabilities of different senses of a word, for example, "spicy"
in the example has two possible WordNet senses. In the following
example, inputs to a machine translation system are annotated in
the application semantics with specific WordNet senses which are
used to distinguish among different senses of the words. A
translation system might make use of a sense disambiguator to
represent the probabilities of different senses of a word, for
example, "spicy" in the example has two possible WordNet
senses.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>I love to eat Mexican food because it is spicy</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"
  xmlns="http://example.com/universal_translator"&gt;
  &lt;emma:interpretation id="spanish"&gt;
    &lt;result xml:lang="es"&gt;
    Adoro alimento mejicano porque es picante.
    &lt;/result&gt;
    &lt;emma:derived-from resource="#english" composite="false"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:interpretation id="english"
      emma:tokens="I love to eat Mexican food
                   because it is spicy"&gt;
      &lt;assertion&gt;
        &lt;interaction
          wordnet="1828736"
          wordnet-desc="love, enjoy (get pleasure from)"
          token="love"&gt;
          &lt;experiencer
            reference="first"
            token="I"&gt;
                &lt;attribute quantity="single"/&gt;
          &lt;/experiencer&gt;
          &lt;attribute time="present"/&gt;
          &lt;content&gt;
            &lt;interaction wordnet="1157345"
              wordnet-desc="eat (take in solid food)"
              token="to eat"&gt;
              &lt;object id="obj1"
                wordnet="7555863"
                wordnet-desc="food, solid food (any solid
                              substance (as opposed to
                              liquid) that is used as a source
                              of nourishment)"
                        token="food"&gt;
                &lt;restriction
                  wordnet="3026902"
                  wordnet-desc="Mexican (of or relating
                                to Mexico or its inhabitants)"
                                token="Mexican"/&gt;
              &lt;/object&gt;
            &lt;/interaction&gt;
          &lt;/content&gt;
          &lt;reason token="because"&gt;
            &lt;experiencer reference="third"
              target="obj1" token="it"/&gt;
                &lt;attribute time="present"/&gt;
                &lt;one-of token="spicy"&gt;
                  &lt;modification wordnet="2397732"
                    wordnet-desc="hot, spicy (producing a
                                  burning sensation on
                                  the taste nerves)"
                    confidence="0.8"/&gt;
                  &lt;modification wordnet="2398378"
                    wordnet-desc="piquant, savory,
                                  savoury, spicy, zesty
                                  (having an agreeably
                                  pungent taste)"
                    confidence="0.4"/&gt;
                &lt;/one-of&gt;
           &lt;/reason&gt;
         &lt;/interaction&gt;
       &lt;/assertion&gt;
     &lt;/emma:interpretation&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>In addition to sense disambiguation it could also be useful to
relate concepts to superordinate concepts in some ontology. For
example, it could be useful to know that O'Hare is an airport and
Chicago is a city, even though they might be used interchangeably
in an application. For example, in an air travel application a user
might say "I want to fly to O'Hare" or "I want to fly to
Chicago".</p>
<h3 id="s2.5">2.5 Representing system output in EMMA</h3>
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> was explicitly
limited in scope to representation of the interpretation of user
inputs. Most interactive systems also produce system output and one
of the major possible extensions of the EMMA language would be to
provide support for representation of the outputs made by the
system in addition to the user inputs. One advantage of having EMMA
representation for system output is that system logs can have
unified markup representation across input and output for viewing
and analyzing user/system interactions. In this section, we
consider two different use cases for addition of output
representation to EMMA.</p>
<h4 id="s2.5.1">2.5.1 Abstracting output from specific modality or
output language</h4>
<p>It is desirable for a multimodal dialog designer to be able to
isolate dialog flow (for example <a href=
"http://www.w3.org/TR/2009/WD-scxml-20091029/">SCXML</a> code) from
the details of specific utterances produced by a system. This can
achieved by using presentation or media planning component that
takes the abstract intent from the system and creates one or more
modality-specific presentations. In addition to isolating dialog
logic from specific modality choice this can also make it easier to
support different technologies for the same modality. For example,
in the example below, the GUI technology is HTML, but abstracting
output would also support using a different GUI technology like
Flash, or <a href="http://www.w3.org/Graphics/SVG/">SVG</a>. If
EMMA is extended to support output, then EMMA documents could be
used for communication from the dialog manager to the presentation
planning component, and also potentially for the documents
generated by the presentation component, which could embed specific
markup such as HTML and <a href=
"http://www.w3.org/TR/speech-synthesis/">SSML</a>. Just as there
can be multiple different stages of processing of a user input,
there may be multiple stages of processing of an output, and the
mechanisms of EMMA can be used to capture and provide metadata on
these various stages of output processing.</p>
<p>Potential benefits for this approach include:</p>
<ol>
<li>Accessibility: it would be useful for an application to be able
to accommodate users who might have an assistive device or devices
without requiring special logic or even special applications.</li>
<li>Device independence: An application could separate the flow in
the IM from the details of the presentation. This might be
especially useful if there are a lot of target devices with
different types of screens, cameras, or possibilities for haptic
output.</li>
<li>Adapting to user preferences: An application could accommodate
different dynamic preferences, for example, switching to visual
presentation from speech in public places without disturbing the
application flow.</li>
</ol>
<p>In the following example, we consider the introduction of a new
EMMA element, <code>&lt;emma:presentation&gt;</code> which is the
output equivalent of the input element
<code>&lt;emma:interpretation&gt;</code>. Like
<code>&lt;emma:interpretation&gt;</code> this element can take
<code>emma:medium</code> and <code>emma:mode</code> attributes
classifying the specific modality. It could also potentially take
timestamp annotations indicating the time at which the output
should be produced. One issue is whether timestamps should be used
for the intended time of production or for the actual time of
production and how to capture both. Relative timestamps could be
used to anchor the planned time of presentation to another element
of system output. In this example we show how the
<code>emma:semantic-rep</code> attribute proposed in <a href=
"#s2.12">Section 2.12</a> could potentially be used to indicate the
markup language of the output.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Output</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">IM (step 1)</td>
<td>semantics of "what would you like for lunch?"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation&gt;
    &lt;question&gt;
      &lt;topic&gt;lunch&lt;/topic&gt;
      &lt;experiencer&gt;second person&lt;/experiencer&gt;
      &lt;object&gt;questioned&lt;/object&gt;
    &lt;/question&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;

</pre>
<p>or, more simply, without natural language generation:</p>
<pre>
&lt;emma:emma&gt;
  &lt;emma:presentation&gt;
    &lt;text&gt;what would you like for lunch?&lt;/text&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="50">presentation manager (voice output)</td>
<td>text "what would you like for lunch?"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="true"
    emma:function="dialog"
    emma:semantic-rep="ssml"&gt;
      &lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US"&gt;
          what would you like for lunch&lt;/speak&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="50">presentation manager (GUI output)</td>
<td>text "what would you like for lunch?"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation
    emma:medium="visual"
    emma:mode="graphics"
    emma:verbal="true"
    emma:function="dialog"
    emma:semantic-rep="html"&gt;
      &lt;html&gt;
        &lt;body&gt;
          &lt;p&gt;what would you like for lunch?"&lt;/p&gt;
          &lt;input name="" type="text"&gt;
          &lt;input type="submit" name="Submit"
           value="Submit"&gt;
        &lt;/body&gt;
      &lt;/html&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
</tbody>
</table>
<h4 id="s2.5.2">2.5.2 Coordination of outputs distributed over
multiple different modalities</h4>
<p>A critical issue in the enablement of effective multimodal
output is to enable synchronization of outputs in different output
media. For example, text to speech output or prompts may be
coordinated with graphical outputs such as highlighting of items in
an HTML table. EMMA markup could potentially be used to indicate
that elements in each medium should be coordinated in their
presentation. In the following example, a new attribute
<code>emma:sync</code> is used to indicate the relationship between
a <code>&lt;mark&gt;</code> in <a href=
"http://www.w3.org/TR/speech-synthesis/">SSML</a> and an element to
be highlighted in HTML content. The <code>emma:process</code>
attribute could be used to identify the presentation planning
component. Again <code>emma:semantic-rep</code> is used to indicate
the embedded markup language.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Output</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">system</td>
<td width="50">Coordinated presentation of table with TTS</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:group id=“gp1"
    emma:medium="acoustic,visual"
    emma:mode="voice,graphics"
    emma:process="http://example.com/presentation_planner"&gt;
    &lt;emma:presentation id=“pres1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="true"
      emma:function="dialog"
      emma:semantic-rep="ssml"&gt;
      &lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
        http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
        xml:lang="en-US"&gt;
        Item 4 &lt;mark emma:sync="123"/&gt; costs fifteen dollars.
      &lt;/speak&gt;
    &lt;/emma:presentation&gt;
    &lt;emma:presentation id=“pres2"
      emma:medium="visual"
      emma:mode="graphics"
      emma:verbal="true"
      emma:function="dialog"
      emma:semantic-rep="html"
      &lt;table xmlns="http://www.w3.org/1999/xhtml"&gt;
        &lt;tr&gt;
          &lt;td emma:sync="123"&gt;Item 4&lt;/td&gt;
          &lt;td&gt;15 dollars&lt;/td&gt;
        &lt;/tr&gt;
      &lt;/table&gt;
    &lt;/emma:presentation&gt;
  &lt;/emma:group&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>One issue to be considered is the potential role of the
Synchronized Multimedia Integration Language (<a href=
"http://www.w3.org/TR/REC-smil/">SMIL</a>) for capturing multimodal
output synchronization. SMIL markup for multimedia presentation
could potentially be embedded within EMMA markup coming from an
interaction manager to a client for rendering.</p>
<h3 id="s2.6">2.6 Representation of dialogs in EMMA</h3>
<p>The scope of <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
was explicitly limited to representation of single turns of user
input. For logging, analysis, and training purposes it could be
useful to be able to represent multi-stage dialogs in EMMA. The
following example shows a sequence of two EMMA documents where the
the first is a request from the system and the second is the user
response. A new attribute <code>emma:in-response-to</code> is used
to relate the system output to the user input. EMMA already has an
attribute <code>emma:dialog-turn</code> used to provide an
indicator of the turn of interaction.</p>
<h4 id="dialog_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">system</td>
<td width="50">where would you like to go?</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation id="pres1"
    emma:dialog-turn="turn1"
    emma:in-response-to="initial"&gt;
      &lt;prompt&gt;
      where would you like to go?
      &lt;/prompt&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="50">user</td>
<td>New York</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:dialog-turn="turn2"
    emma:tokens="new york"
    emma:in-response-to="pres1"&gt;
      &lt;location&gt;
      New York
      &lt;/location&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>In this case, each utterance is still a single EMMA document,
and markup is being used to encode the fact that the utterance are
part of an ongoing dialog. Another possibility would be to use EMMA
markup to contain a whole dialog within a single EMMA document. For
example, a flight query dialog could be represented as follows
using <code>&lt;emma:sequence&gt;</code>:</p>
<h4 id="sequence_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>flights to boston</td>
<td rowspan="5">
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:sequence&gt;
    &lt;emma:interpretation id="user1"
      emma:dialog-turn="turn1"
      emma:in-response-to="initial"&gt;
      &lt;emma:literal&gt;
      flights to boston
      &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:presentation id="sys1"
      emma:dialog-turn="turn2"
      emma:in-response-to="user1"&gt;
      &lt;prompt&gt;
      traveling to boston,
      which departure city
      &lt;/prompt&gt;
    &lt;/emma:presentation&gt;
    &lt;emma:interpretation id="user2"
      emma:dialog-turn="turn3"
      emma:in-response-to="sys1"&gt;
      &lt;emma:literal&gt;
      san francisco
      &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:presentation id="sys2"
      emma:dialog-turn="turn4"
      emma:in-response-to="user2"&gt;
      &lt;prompt&gt;
      departure date
      &lt;/prompt&gt;
    &lt;/emma:presentation&gt;
    &lt;emma:interpretation id="user3"
      emma:dialog-turn="turn5"
      emma:in-response-to="sys2"&gt;
      &lt;emma:literal&gt;
      next thursday
      &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
  &lt;/emma:sequence&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="50">system</td>
<td>traveling to Boston, which departure city?</td>
</tr>
<tr>
<td width="50">user</td>
<td>San Francisco</td>
</tr>
<tr>
<td width="50">system</td>
<td>departure date</td>
</tr>
<tr>
<td width="50">user</td>
<td>next thursday</td>
</tr>
</tbody>
</table>
<p>Note that in this example with
<code>&lt;emma:sequence&gt;</code> the
<code>emma:in-response-to</code> attribute is still important since
there is no guarantee that an utterance in a dialog is a response
to the previous utterance. For example, a sequence of utterances
may all be from the user.</p>
<p>One issue that arises with the representation of whole dialogs
is that the resulting EMMA documents with full sets of metadata may
become quite large. One possible extension that could help with
this would be allow the value of <code>emma:in-response-to</code>
to be URI valued so it can refer to another EMMA document.</p>
<h3 id="s2.7">2.7 Logging, analysis, and annotation</h3>
<p>EMMA was initially designed to facilitate communication among
components of an interactive system. It has become clear over time
that the language can also play an important role in logging of
user/system interactions. In this section, we consider possible
advantages of EMMA for log analysis and illustrate how elements
such as <code>&lt;emma:derived-from&gt;</code> could be used to
capture and provide metadata on annotations made by human
annotators.</p>
<h3 id="s2.7.1">2.7.1 Log analysis</h3>
<p>The proposal above for representing system output in EMMA would
support after the fact analysis of dialogs. For example, if both
the system's and the user's utterance are represented in EMMA, it
should be much easier to examine relationships between factors such
as how the wording of prompts might affect user's responses or even
the modality that users select for their responses. It would also
be easier to study timing relationships between the system prompt
and the user's responses. For example, prompts that are confusing
might consistently elicit longer times before the user starts
speaking. This would be useful even without a presentation manager
or fission component. In the following example, it might be useful
to look into the relationship between the end of the prompt and the
start of the user's response. We use here the
<code>emma:in-response-to</code> attribute suggested in <a href=
"#s2.6">Section 2.6</a> for the representation of dialogs in
EMMA.</p>
<h4 id="log_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">system</td>
<td>where would you like to go?</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation id="pres1"
    emma:dialog-turn="turn1"
    emma:in-response-to="initial"
    emma:start="1241035886246"
    emma:end="1241035888306"&gt;
    &lt;prompt&gt;
    where would you like to go?
    &lt;/prompt&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
<tr>
<td width="50">user</td>
<td>New York</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:dialog-turn="turn2"
    emma:in-response-to="pres1"
    emma:start="1241035891246"
    emma:end="1241035893000""&gt;
    &lt;destination&gt;
    New York
    &lt;/destination&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
</tbody>
</table>
<h3 id="s2.7.2">2.7.2 Log annotation</h3>
<p>EMMA is generally used to show the recognition, semantic
interpretation etc. assigned to inputs based on <em>machine</em>
processing of the user input. Another potential use case is to
provide a mechanism for showing the interpretation assigned to an
input by a human annotator and using
<code>&lt;emma:derived-from&gt;</code> to show the relationship
between the input received the annotation. The
<code>&lt;emma:one-of&gt;</code> element can then be used to show
multiple competing annotations for an input. The
<code>&lt;emma:group&gt;</code> element could be used to contain
multiple different kinds of annotation on a single input. One
question here is whether <code>emma:process</code> can be used for
identification of the labeller, and whether there is a need for any
additional EMMA machinery to better support this this use case. In
these examples, <code>&lt;emma:literal&gt;</code> contains mixed
content with text and elements. This is in keeping with the EMMA
1.0 schema.</p>
<p>One issue that arises concerns the meaning of an
<code>emma:confidence</code> value on an annotated interpretation.
It may be preferable to have another attribute for annotator
confidence rather than overloading the current
<code>emma:confidence</code>.</p>
<p>Another issue concerns mixing of system results and human
annotation. Should these be grouped or is the annotation a derived
from the system's interpretation. Also it would be useful to
capture the time of the annotation. The current timestamps are used
for the time of the input itself. Where should annotation
timestamps be recorded?</p>
<p>It would also be useful to have a way to specify open ended
information about the annotator such as their native language,
profession, experience etc. One approach would be to be to have a
new attribute e.g. <code>emma:annotator</code> with a URI value
that could point to a description of the annotator.</p>
<p>It could be useful for very common annotations to have in
addition to <code>emma:tokens</code> another dedicated element to
indicate the annotated transcription, for example,
<code>emma:annotated-tokens</code> or
<code>emma:transcription</code>.</p>
<p>In the following example, we show how
<code>emma:interpretation</code> and <code>emma:derived-from</code>
could be used to capture the annotation of an input.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="614"><strong>Input</strong></td>
<td width="531"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="93">user</td>
<td>
<p>In this example the user has said:</p>
<p>"flights from boston to san francisco leaving on the fourth of
september"</p>
<p>and the semantic interpretation here is a semantic tagging of
the utterance done by a human annotator. emma:process is used to
provide details about the annotation</p>
</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="annotation1"
    emma:process="annotate:type=semantic&amp;annotator=michael"
    emma:confidence="0.90"&gt;
      &lt;emma:literal&gt;
      flights from &lt;src&gt;san francisco&lt;/src&gt; to
      &lt;dest&gt;boston&lt;/dest&gt; on
      &lt;date&gt;the fourth of september&lt;/date&gt;
      &lt;/emma:literal&gt;
    &lt;emma:derived-from resource="#asr1"/&gt;
  &lt;/emma:interpretation&gt;
  &lt;emma:derivation&gt;
    &lt;emma:interpretation id="asr1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:function="dialog"
      emma:verbal="true"
      emma:lang="en-US"
      emma:start="1241690021513"
      emma:end="1241690023033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&amp;version=watson6"
      emma:confidence="0.80"&gt;
        &lt;emma:literal&gt;
        flights from san francisco
        to boston on the fourth of september
        &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<p>Taking this example a step further,
<code>&lt;emma:group&gt;</code> could be used to group annotations
made by multiple different annotators of the same utterance:</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="614"><strong>Input</strong></td>
<td width="531"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="93">user</td>
<td>
<p>In this example the user has said:</p>
<p>"flights from boston to san francisco leaving on the fourth of
september"</p>
<p>and the semantic interpretation here is a semantic tagging of
the utterance done by two different human annotators.
<code>emma:process</code> is used to provide details about the
annotation</p>
</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:group emma:confidence="1.0"&gt;
    &lt;emma:interpretation id="annotation1"
      emma:process="annotate:type=semantic&amp;annotator=michael"
      emma:confidence="0.90"&gt;
        &lt;emma:literal&gt;
        flights from &lt;src&gt;san francisco&lt;/src&gt;
        to &lt;dest&gt;boston&lt;/dest&gt;
        on &lt;date&gt;the fourth of september&lt;/date&gt;
        &lt;/emma:literal&gt;
      &lt;emma:derived-from resource="#asr1"/&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="annotation2"
      emma:process="annotate:type=semantic&amp;annotator=debbie"
      emma:confidence="0.90"&gt;
        &lt;emma:literal&gt;
        flights from &lt;src&gt;san francisco&lt;/src&gt;
        to &lt;dest&gt;boston&lt;/dest&gt; on
        &lt;date&gt;the fourth of september&lt;/date&gt;
        &lt;/emma:literal&gt;
      &lt;emma:derived-from resource="#asr1"/&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:group-info&gt;semantic_annotations&lt;/emma:group-info&gt;
  &lt;/emma:group&gt;
  &lt;emma:derivation&gt;
    &lt;emma:interpretation id="asr1"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:function="dialog"
      emma:verbal="true"
      emma:lang="en-US"
      emma:start="1241690021513"
      emma:end="1241690023033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&amp;version=watson6"
      emma:confidence="0.80"&gt;
        &lt;emma:literal&gt;
        flights from san francisco to boston
        on the fourth of september
        &lt;/emma:literal&gt;
    &lt;/emma:interpretation&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<h3 id="s2.8">2.8 Multisentence Inputs</h3>
<p>For certain applications, it is useful to be able to represent
the semantics of multi-sentence inputs, which may be in one of more
modalities such as speech (e.g. voicemail), text (e.g. email), or
handwritten input. One application use case is for summarizing a
voicemail or email. We develop this example below.</p>
<p>There are at least two possible approaches to addressing this
use case.</p>
<ol>
<li>If there is no reason to distinguish the individual sentences
of the input or interpret them individually, the entire input could
be included as the value of the <code>emma:tokens</code> attribute
of an <code>&lt;emma:interpretation&gt;</code> or
<code>&lt;emma:one-of&gt;</code> element, where the semantics of
the input is represented as the value of an
<code>&lt;emma:interpretation&gt;</code>. Although in principle
there is no upper limit on the length of a <code>emma:tokens</code>
attribute, in practice, this approach might be cumbersome for
longer or more complicated texts.</li>
<li>If more structure is required, the interpretations of the
individual sentences in the input could be grouped as individual
<code>&lt;emma:interpretation&gt;</code> elements under an
<code>&lt;emma:sequence&gt;</code> element. A single unified
semantics representing the meaning of the entire input could then
be represented with the sequence as the value of
<code>&lt;emma:derived-from&gt;</code>.</li>
</ol>
The example below illustrates the first approach.
<h4 id="multisentence_example">Example</h4>
<table border="1">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="614"><strong>Input</strong></td>
<td width="531"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="93">user</td>
<td>
<p>Hi Group,</p>
<p>You are all invited to lunch tomorrow at Tony's Pizza at 12:00.
Please let me know if you're planning to come so that I can make
reservations. Also let me know if you have any dietary
restrictions. Tony's Pizza is at 1234 Main Street. We will be
discussing ways of using EMMA.</p>
<p>Debbie</p>
</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation
    emma:tokens="Hi Group, You are all invited to
    lunch tomorrow at Tony's Pizza at 12:00.
    Please let me know if you're planning to
    come so that I can make reservations.
    Also let me know if you have any dietary
    restrictions. Tony's Pizza is at 1234
    Main Street. We will be discussing
    ways of using EMMA." &gt;
      &lt;business-event&gt;lunch&lt;/business-event&gt;
      &lt;host&gt;debbie&lt;/host&gt;
      &lt;attendees&gt;group&lt;/attendees&gt;
      &lt;location&gt;
        &lt;name&gt;Tony's Pizza&lt;/name&gt;
        &lt;address&gt; 1234 Main Street&lt;/address&gt;
      &lt;/location&gt;
      &lt;date&gt; tuesday, March 24&lt;/date&gt;
      &lt;needs-rsvp&gt;true&lt;/needs-rsvp&gt;
      &lt;needs-restrictions&gt;true&lt;/need-restrictions&gt;
      &lt;topic&gt;ways of using EMMA&lt;/topic&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;


</pre></td>
</tr>
</tbody>
</table>
<h3 id="s2.9">2.9 Multi-participant interactions</h3>
<p><a href="http://www.w3.org/TR/emma/">EMMA 1.0</a> primarily
focussed on the interpretation of inputs from a single user. Both
for annotation of human-human dialogs and for the emerging systems
which support dialog or multimodal interaction with multiple
participants (such as multimodal systems for meeting analysis), it
is important to support annotation of interactions involving
multiple different participants. The proposals above for capturing
dialog can play an important role. One possible further extension
would be to add specific markup for annotation of the user making a
particular contribution. In the following example, we use an
attribute <code>emma:participant</code> to identify the participant
contributing each response to the prompt.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td width="668"><strong>Input</strong></td>
<td width="480"><strong>EMMA</strong></td>
</tr>
<tr>
<td width="90">system</td>
<td>Please tell me your lunch orders</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:presentation id="pres1"
    emma:dialog-turn="turn1"
    emma:in-response-to="initial"
    emma:start="1241035886246"
    emma:end="1241035888306"&gt;
      &lt;prompt&gt;please tell me your lunch orders&lt;/prompt&gt;
  &lt;/emma:presentation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
<tr>
<td width="90">user1</td>
<td>I'll have a mushroom pizza</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int1"
    emma:dialog-turn="turn2"
    emma:in-response-to="pres1"
    emma:participant="user1"
    emma:start="1241035891246"
    emma:end="1241035893000""&gt;
      &lt;pizza&gt;
        &lt;topping&gt;
        mushroom
        &lt;/topping&gt;
      &lt;/pizza&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="90">user3</td>
<td>I'll have a pepperoni pizza.</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id="int2"
    emma:dialog-turn="turn3"
    emma:in-response-to="pres1"
    emma:participant="user2"
    emma:start="1241035896246"
    emma:end="1241035899000""&gt;
      &lt;pizza&gt;
        &lt;topping&gt;
        pepperoni
        &lt;/topping&gt;
      &lt;/pizza&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
</tbody>
</table>
<h3 id="s2.10">2.10 Capturing sensor data such as GPS in EMMA</h3>
<p>The multimodal examples described in the <a href=
"http://www.w3.org/TR/emma/">EMMA 1.0</a> specification, include
combination of spoken input with a location specified by touch or
pen. With the increase in availability of GPS and other location
sensing technology such as cell tower triangulation in mobile
devices, it is desirable to provide a method for annotating inputs
with the device location and, in some cases fusing the GPS
information with the spoken command in order to derive a complete
interpretation. GPS information could potentially be determined
using the <a href=
"http://www.w3.org/TR/2009/WD-geolocation-API-20090707/">Geolocation
API Specification</a> from the <a href=
"http://www.w3.org/2008/geolocation/">Geolocation working group</a>
and then encoded into a EMMA result sent to a server for
fusion.</p>
<p>One possibility using the current EMMA capabilities is to use
<code>&lt;emma:group&gt;</code> to associate GPS markup with the
semantics of a spoken command. For example, the user might say
"where is the nearest pizza place?" and the interpretation of the
spoken command is grouped with markup capturing the GPS sensor
data. This example uses the existing
<code>&lt;emma:group&gt;</code> element and extends the set of
values of <code>emma:medium</code> and <code>emma:mode</code> to
include <code>"sensor"</code> and <code>"gps"</code>
respectively.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">where is the nearest pizza place?</td>
<td rowspan="2">
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:group&gt;
    &lt;emma:interpretation
      emma:tokens="where is the nearest pizza place"
      emma:confidence="0.9"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:start="1241035887111"
      emma:end="1241035888200"
      emma:process="reco:type=asr&amp;version=asr_eng2.4"
      emma:media-type="audio/amr; rate=8000"
      emma:lang="en-US"&gt;
        &lt;category&gt;pizza&lt;/category&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation
      emma:medium="sensor"
      emma:mode="gps"
      emma:start="1241035886246"
      emma:end="1241035886246"&gt;
        &lt;lat&gt;40.777463&lt;/lat&gt;
        &lt;lon&gt;-74.410500&lt;/lon&gt;
        &lt;alt&gt;0.2&lt;/alt&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:group-info&gt;geolocation&lt;/emma:group-info&gt;
  &lt;/emma:group&gt;
&lt;/emma:emma&gt;

</pre></td>
</tr>
<tr>
<td width="50">GPS</td>
<td>(GPS coordinates)</td>
</tr>
</tbody>
</table>
<p>Another, more abbreviated, way to incorporate sensor information
would be to have spatial correlates of the timestamps and allow for
location stamping of user inputs, e.g. <code>emma:lat</code> and
<code>emma:lon</code> attributes that could appear on EMMA
container elements to indicate the location where the input was
produced.</p>
<h3 id="s2.11">2.11 Extending EMMA from NLU to also represent
search or database retrieval results</h3>
<p>In many of the use cases considered so far, EMMA is used for
representation of the results of speech recognition and then for
the results of natural language understanding, and possibly
multimodal fusion. In systems used for voice search, the next step
is often to conduct search and extract a set of records or
documents. Strictly speaking, this stage of processing is out of
scope for EMMA. It is odd though to have the mechanisms of EMMA
such as <code>&lt;emma:one-of&gt;</code> for ambiguity all the way
up to NLU or multimodal fusion, but not to have access to the same
apparatus for representation of the next stage of processing which
can often be search or database lookup. Just as we can use
<code>&lt;emma:one-of&gt;</code> and <code>emma:confidence</code>
to represent N-best recognitions or semantic interpretations,
similarly we can use them to represent a series of search results
along with their relative confidence. One issue is whether we need
some measure other than confidence for relevance ranking, or is the
same confidence attribute can be used.</p>
<p>One issue that arises is whether it would be useful to have some
recommended or standardized element to use for query results e.g
<code>&lt;result&gt;</code> as in the following example. Another
issue is how to annotate information about the database and the
query that was issued. The database could be indicate as part of
the <code>emma:process</code> value as in the following example.
For web search the query URL could be annotated on the result e.g.
<code>&lt;result url="http://cnn.com"/&gt;</code>. For database
queries, the query, SQL for example could be annotated on the
results or on the containing <code>&lt;emma:group&gt;</code>.</p>
<p>The following example shows the use of EMMA to represent the
results of database retrieval from an employee directory. The user
says "John Smith". After ASR, NLU, and then database look up, the
system returns the XML here which shows the N-best lists associated
with each of these three stages of processing. Here
<code>&lt;emma:derived-from&amp;gr;</code> is used to indicate the
relations between each of the <code>&lt;emma:one-of&gt;</code>
elements. However, if you want to see which specific ASR result a
record is derived from, you would need to put
<code>&lt;emma:derived-from&gt;</code> on the individual
elements.</p>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td width="50">User says "John Smith"</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:one-of id="db_results1"
    emma:process="db:type=mysql&amp;database=personel_060109.db&gt;
    &lt;emma:interpretation id="db_nbest1"
      emma:confidence="0.80" emma:tokens="john smith"&gt;
        &lt;result&gt;
          &lt;name&gt;John Smith&lt;/name&gt;
          &lt;room&gt;dx513&lt;/room&gt;
          &lt;number&gt;123-456-7890&gt;/number&gt;
        &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="db_nbest2"
      emma:confidence="0.70" emma:tokens="john smith"&gt;
        &lt;result&gt;
          &lt;name&gt;John Smith&lt;/name&gt;
          &lt;room&gt;ef312&lt;/room&gt;
          &lt;number&gt;123-456-7891&gt;/number&gt;
        &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="db_nbest3"
      emma:confidence="0.50" emma:tokens="jon smith"&gt;
        &lt;result&gt;
          &lt;name&gt;Jon Smith&lt;/name&gt;
          &lt;room&gt;dv900&lt;/room&gt;
          &lt;number&gt;123-456-7892&gt;/number&gt;
       &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:interpretation id="db_nbest4"
      emma:confidence="0.40" emma:tokens="joan smithe"&gt;
        &lt;result&gt;
          &lt;name&gt;Joan Smithe&lt;/name&gt;
          &lt;room&gt;lt567&lt;/room&gt;
          &lt;number&gt;123-456-7893&gt;/number&gt;
        &lt;/result&gt;
    &lt;/emma:interpretation&gt;
    &lt;emma:derived-from resource="#nlu_results1/&gt;
  &lt;/emma:one-of&gt;
  &lt;emma:derivation&gt;
    &lt;emma:one-of id="nlu_results1"
      emma:process="smm:type=nlu&amp;version=parser"&gt;
      &lt;emma:interpretation id="nlu_nbest1"
        emma:confidence="0.99" emma:tokens="john smith"&gt;
          &lt;fn&gt;john&lt;/fn&gt;&lt;ln&gt;smith&lt;/ln&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:interpretation id="nlu_nbest2"
        emma:confidence="0.97" emma:tokens="jon smith"&gt;
          &lt;fn&gt;jon&lt;/fn&gt;&lt;ln&gt;smith&lt;/ln&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:interpretation id="nlu_nbest3"
        emma:confidence="0.93" emma:tokens="joan smithe"&gt;
          &lt;fn&gt;joan&lt;/fn&gt;&lt;ln&gt;smithe&lt;/ln&gt;
      &lt;/emma:interpretation&gt;
      &lt;emma:derived-from resource="#asr_results1/&gt;
    &lt;/emma:one-of&gt;
    &lt;emma:one-of id="asr_results1"
      emma:medium="acoustic" emma:mode="voice"
      emma:function="dialog" emma:verbal="true"
      emma:lang="en-US" emma:start="1241641821513"
      emma:end="1241641823033"
      emma:media-type="audio/amr; rate=8000"
      emma:process="smm:type=asr&amp;version=watson6"&gt;
        &lt;emma:interpretation id="asr_nbest1"
          emma:confidence="1.00"&gt;
            &lt;emma:literal&gt;john smith&lt;/emma:literal&gt;
        &lt;/emma:interpretation&gt;
        &lt;emma:interpretation id="asr_nbest2"
          emma:confidence="0.98"&gt;
            &lt;emma:literal&gt;jon smith&lt;/emma:literal&gt;
        &lt;/emma:interpretation&gt;
        &lt;emma:interpretation id="asr_nbest3"
          emma:confidence="0.89" &gt;
            &lt;emma:literal&gt;joan smithe&lt;/emma:literal&gt;
        &lt;/emma:interpretation&gt;
   &lt;/emma:one-of&gt;
  &lt;/emma:derivation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<h3 id="s2.12">2.12 Supporting other semantic representation forms
in EMMA</h3>
<p>In the <a href="http://www.w3.org/TR/emma/">EMMA 1.0</a>
specification, the semantic representation of an input is
represented either in XML in some application namespace or as a
literal value using <code>emma:literal</code>. In some
circumstances it could be beneficial to allow for semantic
representation in other formats such as JSON. Serializations such
as JSON could potentially be contained within
<code>emma:literal</code> using CDATA, and a new EMMA annotation
e.g. <code>emma:semantic-rep</code> used to indicate the semantic
representation language being used.</p>
<h4 id="semantic_representation_example">Example</h4>
<table width="120">
<tbody>
<tr>
<td><strong>Participant</strong></td>
<td><strong>Input</strong></td>
<td><strong>EMMA</strong></td>
</tr>
<tr>
<td width="50">user</td>
<td>semantics of spoken input</td>
<td>
<pre>
&lt;emma:emma
  version="2.0"
  xmlns:emma="http://www.w3.org/2003/04/emma"
  xmlns="http://www.example.com/example"&gt;
  &lt;emma:interpretation id=“int1"
    emma:confidence=".75”
    emma:medium="acoustic"
    emma:mode="voice"
    emma:verbal="true"
    emma:function="dialog"
    emma:semantic-rep="json"
      &lt;emma:literal&gt;
        &lt;![CDATA[
              {
           drink: {
              liquid:"coke",
              drinksize:"medium"},
           pizza: {
              number: "3",
              pizzasize: "large",
              topping: [ "pepperoni", "mushrooms" ]
           }
          }
          ]]&gt;
      &lt;/emma:literal&gt;
  &lt;/emma:interpretation&gt;
&lt;/emma:emma&gt;
</pre></td>
</tr>
</tbody>
</table>
<h2 id="references">General References</h2>
<p>EMMA 1.0 Requirements <a href=
"http://www.w3.org/TR/EMMAreqs/">http://www.w3.org/TR/EMMAreqs/</a></p>
<p>EMMA Recommendation <a href=
"http://www.w3.org/TR/emma/">http://www.w3.org/TR/emma/</a></p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>Thanks to Jim Larson (W3C Invited Expert) for his contribution
to the section on EMMA for multimodal output.</p>
</body>
</html>