<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Introduction and Overview of W3C Speech Interface
Framework</title>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type" />
<meta content="Microsoft FrontPage 4.0" name="GENERATOR" />
<style type="text/css">
body { 
margin-left: 10%; 
margin-right: 5%; 
color: black;
background-color: white;
background-attachment: fixed;
background-image: url(http://www.w3.org/StyleSheets/TR/WD);
background-position: top left;
background-repeat: no-repeat;
font-family: Tahoma, Verdana, "Myriad Web", Syntax, sans-serif;
}
.unfinished {  font-style: normal; background-color: #FFFF33}
.dtd-code {  font-family: monospace;
 background-color: #dfdfdf; white-space: pre;
 border: #000000; border-style: solid;
 border-top-width: 1px; border-right-width: 1px;
 border-bottom-width: 1px; border-left-width: 1px; }
p.copyright {font-size: smaller}
h2,h3 {margin-top: 1em;}
ul.toc li {list-style: none}
ul.toc a {text-decoration: none }
code {
    color: green;
    font-family: monospace;
    font-weight: bold;
}
.example {
    border: solid green;
    border-width: 2px;
    color: green;
    font-weight: bold;
    margin-right: 5%;
    margin-left: 0;
}
.bad  {
    border: solid red;
    border-width: 2px;
    margin-left: 0;
    margin-right: 5%;
    color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
    background-color: rgb(204,204,255);
    padding: 0.5em;
    border: none;
    margin-right: 5%;
}
table {
    margin-left: 0;
    margin-right: 0;
    font-family: sans-serif;
    background: white;
    border-width: 2px;
    border-color: white;
  }
th { font-family: sans-serif; background: rgb(204, 204, 153) }
td { font-family: sans-serif; background: rgb(255, 255, 153) }
.tocline { list-style: none; }
</style>

<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WD" />
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/WWW/w3c_home" alt="W3C" width="72" height="48"/></a></p>

<h1 class="head">Introduction and Overview of W3C Speech
Interface Framework</h1>

<h2 class="notoc">W3C Working Draft 4 December 2000</h2>

<dl>
<dt>This version:</dt>

<dd><a
href="http://www.w3.org/TR/2000/WD-voice-intro-20001204/">
http://www.w3.org/TR/2000/WD-voice-intro-20001204</a></dd>

<dt>Latest version:</dt>

<dd><a
href="http://www.w3.org/TR/voice-intro">http://www.w3.org/TR/voice-intro</a></dd>

<dt>Previous version:</dt>

<dd><a
href="http://www.w3.org/TR/1999/WD-voice-intro-19991223">http://www.w3.org/TR/1999/WD-voice-intro-19991223</a></dd>

<dt>Editor:</dt>

<dd>Jim A. Larson, Intel Architecture Labs</dd>
</dl>
<p class="copyright"><a
            href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Copyright">
            Copyright</a> &copy;2000 <a href="http://www.w3.org/"><abbr title="World
            Wide Web Consortium">W3C</abbr></a><sup>&reg;</sup> (<a
            href="http://www.lcs.mit.edu/"><abbr title="Massachusetts Institute of
            Technology">MIT</abbr></a>, <a href="http://www.inria.fr/"><abbr lang="fr"
            title="Institut National de Recherche en Informatique et
            Automatique">INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>),
            All Rights Reserved. W3C <a
            href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer">liability</a>,
            <a
            href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks">trademark</a>,
            <a
            href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">document
            use</a> and <a
            href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">software
            licensing</a> rules apply.</p>

<hr />
</div>

<h2 class="notoc"><a id="abstract"
name="abstract">Abstract</a></h2>

<p>The World Wide Web Consortium's Voice Browser Working Group is
defining several markup languages for applications supporting
speech input and output. These markup languages will enable
speech applications across a range of hardware and software
platforms. Specifically, the Working Group is designing markup
languages for dialog, speech recognition grammar, speech
synthesis, natural language semantics, and a collection of
reusable dialog components. These markup languages make up the
W3C Speech Interface Framework. The speech community is invited
to review and comment on the working draft requirement and
specification documents.</p>

<h2><a id="status" name="status">Status of This Document</a></h2>

<p>This document describes a model architecture for speech
processing in voice browsers. It also briefly describes markup
languages for dialog, speech recognition grammar, speech
synthesis, natural language semantics, and a collection of
reusable dialog components. This document is being released as a
working draft, but is not intended to become a proposed
recommendation.</p>

<p>This specification is a Working Draft of the Voice Browser
Working Group for review by W3C members and other interested
parties. It is a draft document and may be updated, replaced, or
obsoleted by other documents at any time. It is inappropriate to
use W3C Working Drafts as reference material or to cite them as
other than "work in progress".</p>

<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by members of the Voice Browser Working
Group.</p>

<p>This document has been produced as part of the <a
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
following the procedures set out for the <a
href="http://www.w3.org/Consortium/Process/">W3C Process</a>. The
authors of this document are members of the <a
href="http://www.w3.org/Voice/Group">Voice Browser Working
Group</a>. This document is for public review. Comments should be
sent to the public mailing list &lt;<a
href="mailto:www-voice@w3.org">www-voice@w3.org</a>&gt; (<a
href="http://www.w3.org/Archives/Public/www-voice/">archive</a>).</p>

<p>A list of current W3C Recommendations and other technical
documents can be found at <a
href="http://www.w3.org/TR">http://www.w3.org/TR</a>.</p>

<h2>1. <a id="group" name="group">Voice Browser Working
Group</a></h2>

<p>The Voice Browser Working Group was <a
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">chartered</a>
by the World Wide Web Consortium (W3C) within the User Interface
Activity in May 1999 to prepare and review markup languages that
enable voice browsers. Members meet weekly via telephone and
quarterly in face-to-face meetings.</p>

<p>The <a href="http://www.w3.org/Voice/">W3C Voice Browser
Working Group</a> is open to any member of the W3C Consortium.
The Voice Browser Working Group has also invited experts whose
affiliations are not members of the W3C Consortium. The four
founding members of the VoiceXML Forum, as well as telelphony
applications venders, speech recognition and text to speech
engine venders, web portals, hardware venders, software venders,
telcos and appliance manufactures have representatives who
participate in the Voice Browser Working Group. Current members
include AskJeves, AT&amp;T, Avaya, BT, Canon, Cisco, France
Telecon, General Magic, Hitachi, HP, IBM, isSound, Intel, Locus
Dialogue, Lucent, Microsoft, Mitre, Motorola, Nokia, Nortel,
Nuance, Phillips, PipeBeach, Speech Works, Sun, Telecon Italia,
TellMe.com, and Unisys, in addition to several invited
experts.</p>

<h2 class="notoc">Table of Contents</h2>

<ul class="toc">
<li><a href="#abstract">Abstract</a></li>

<li><a href="#status">Status of this Document</a></li>

<li>1. <a href="#group">The Voice Browser Working Group</a></li>

<li>2. <a href="#browsers">Voice Browsers</a></li>

<li>3. <a href="#benefits">Voice Browser Benefits</a></li>

<li>4. <a href="#spif">W3C Speech Interface Framework</a></li>

<li>5. <a href="#other">Other Uses for Markup Languages</a></li>

<li>6. <a href="#specs">Individual Markup Languages Overview</a> 

<ul>
<li>6.1. <a href="#gram">Speech Recognition Grammar
Specification</a></li>

<li>6.2. <a href="#synth">Speech Synthesis</a></li>

<li>6.3. <a href="#dialog">Dialog</a></li>

<li>6.4. <a href="#nl">Natural Language Semantics</a></li>

<li>6.5 <a href="#reuse">Reusable Dialog Components</a></li>
</ul>
</li>

<li>7. <a href="#examples">Example Markup Language Use</a></li>

<li>8. <a href="#submissions">Submissions</a></li>

<li>9. <a href="#reading">Further Reading Material</a></li>

<li>10. <a href="#summary">Summary</a></li>
</ul>

<h2>2. <a id="browsers" name="browsers">Voice Browsers</a></h2>

<p>A <em>voice browser</em> is a device (hardware and software)
that interprets voice markup languages to generate voice output,
interpret voice input, and possibly accept and produce other
modalities of input and output.</p>

<p>Currently, the major deployments of voice browsers enable
users to speak and listen, using a telephone or cell phone, to
access information available on the World Wide Web. These voice
browsers accept DTMF and spoken words as input, and produce
synthesized speech or replay prerecorded speech as output. The
voice markup documents that voice browsers interpret are
themselves frequently retrieved from the World Wide Web. However,
many other deployments of voice browsers are possible.</p>

<p>Hardware devices may include telephones or cell phones,
hand-held computers, palm-sized computers, laptop PCs, and
desktop PCs. Voice browser hardware processors may be embedded
into appliances such as TVs, radios, VCRs, remote controls,
ovens, refrigerators, coffeepots, doorbells, and practically any
other electronic or electrical device.</p>

<p>Possible software applications include:</p>

<ul>
<li>Accessing business information, including the corporate
"front desk" asking callers who or what they want, automated
telephone ordering services, support desks, order tracking,
airline arrival and departure information, cinema and theater
booking services, and home banking services</li>

<li>Accessing public information, including community information
such as weather, traffic conditions, school closures, directions
and events; local, national and international news; national and
international stock market information; and business and
e-commerce transactions</li>

<li>Accessing personal information, including calendars, address
and telephone lists, to-do lists, shopping lists, and calorie
counters</li>

<li>Assisting the user to communicate with other people by
sending and receiving voice-mail messages</li>
</ul>

<p>Our definition of a voice browser does not include a voice
interface to HTML pages. A voice browser processes scripts
written in voice markup languages; HTML is not among the
languages a voice browser can interpret. Some vendors are
creating voice-enabled HTML browsers that produce voice instead
of displaying text on a screen. A voice-enabled HTML browser must
determine the sequence in which to present text to the user as
voice, and possibly how to verbally present non-text data such as
tables, illustrations, and animations. A voice browser, on the
other hand, interprets a script which specifies exactly what to
present verbally to the user, as well as when to present each
piece of information.</p>

<h2>3. <a id="benefits" name="benefits">Voice Browser
Benefits</a></h2>

<p>Voice is a <em>very natural</em> user interface because it
enables the user to speak and listen using skills learned during
childhood. Currently, users interact with voice browsers by
speaking and listening through telephones and cell phones that
have no display. Some voice browsers may have small screens, such
as those found on cell phones and palm computers. In the future,
voice browsers may also support other modes and media, such as
pen, video, and sensor input, and graphics, animation, and
actuator controls as output. For example, voice and pen input
would be appropriate for Asian users whose languages do not lend
themselves to entry with traditional QWERTY keyboards.</p>

<p>Some voice browsers are <em>portable</em>. They can be used
anywhere&#8212;at home, at work, and on the road. Information
will be <em>available</em> to a greater audience, especially to
people who have access to handsets, either telephones or cell
phones, but not to networked computers.</p>

<p>Voice browsers present a <em>pragmatic</em> interface for
functionally blind users or users needing Web access while
keeping their hands and eyes free for other things. Voice
browsers present an invisible user interface to the user, while
freeing workspace previously occupied by keyboards and mice.</p>

<h2>4. <a id="spif" name="spif">W3C Speech Interface
Framework</a></h2>

<p>The Voice Browser Working Group has defined the <i>W3C Speech
Interface Framework</i>, shown in Figure 1. The white boxes
represent typical components of a speech-enabled web application.
The black arrows represent data flowing among these components.
The blue ovals indicate data specified using markup languages,
which guide the components in accomplishing their respective
tasks. To review the latest requirement and specification
documents for each of the markup languages, see the section
entitled Requirements and Language Specification Documents on the
<a href="http://www.w3.org/Voice/">W3C Voice Browser home
page</a>.</p>

<p align="center"><img src="voice-intro-fig1.gif" width="559"
height="392"
alt="block diagram for speech interface framework" /></p>

<p>Components of the W3C Speech Interface Framework include the
following:</p>

<p><i>Automatic Speech Recognizer (ASR)</i>&#8212;accepts speech
from the user and produces text. The ASR uses a grammar to
recognize words in the user's speech. Some ASRs use grammars
specified by a developer using the <b>Speech Grammar Markup
Language</b>. Other ASRs use statistical grammars generated from
large corpora of speech data. These grammars are represented
using the <b>N-gram Stochastic Grammar Markup Language.</b></p>

<p><i>DTMF Tone Recognizer</i>&#8212;accepts touch-tones produced
by a telephone when the user presses the keys on the telephone's
keypad. Telephone users may use touch-tones to enter digits or
make menu selections.</p>

<p><i>Language Understanding Component</i>&#8212;extracts
semantics from a text string by using a prespecified grammar. The
text string may be produced by an ASR or be entered directly by a
user via a keyboard. The Language Understanding Component may
also use grammars specified using the <b>Speech Grammar Markup
Language</b> or the <b>N-gram Stochastic Grammar Markup
Language.</b> The output of the Language Understanding Component
is expressed using the <b>Natural Language Semantics Markup
Language.</b></p>

<p><i>Context Interpreter</i>&#8212;enhances the semantics from
the Language Understanding Component by obtaining context
information from a dialog history (not shown in Figure 1). For
example, the Context Interpreter may replace a pronoun with the
noun to which it refers. The input and output of the Context
Interpreter are expressed using the <b>Natural Language Semantics
Markup Language.</b></p>

<p><i>Dialog Manager</i>&#8212;prompts the user for input, makes
sense of the input, and determines what to do next according to
instructions in a dialog script specified using VoiceXML 2.0,
which is modeled after VoiceXML 1.0. Depending upon the input
received, the Dialog Manager may invoke application services,
download another dialog script from the web, or cause information
to be presented to the user. The Dialog Manager accepts input
specified using the <b>Natural Language Semantics Markup
Language.</b> Dialog scripts may refer to <b>Reusable Dialog
Components</b>, portions of another dialog script which can be
reused across multiple applications.</p>

<p><i>Media Planner</i>&#8212;determines whether output from the
dialog manager should be presented to the user as synthetic
speech or prerecorded audio.</p>

<p><i>Recorded Audio Player</i>&#8212;replays prerecorded audio
files to the user, either in conjunction with, or in place of,
synthesized speech.</p>

<p><i>Language Generator</i>&#8212;accepts text from the Media
Planner and prepares it for presentation to the user as spoken
voice via a text-to-speech synthesizer (TTS). The text may
contain markup tags expressed using the <b>Speech Synthesis
Markup Language</b>, which provides hints and suggestions for how
acoustic sounds should be produced. These tags may be produced
automatically by the Language Generator or manually inserted by a
developer.</p>

<p><i>Text-to-Speech Synthesizer (TTS)</i>&#8212;accepts text
from the Language Generator and produces acoustic signals, which
the user hears as a human-like voice, according to hints
specified using the <b>Speech Synthesis Markup Language</b>.</p>

<p>The components of any specific voice browser may differ
significantly from the components shown in Figure 1. For example,
the Context Interpreter, Language Generator, and Media Planner
components may be incorporated into the Dialog Manager, or the
tone recognizer may be incorporated into the Context Interpreter.
However, most voice browser implementations will still be able to
make use of the various markup languages defined in the W3C
Speech Interface Framework.</p>

<p>The Voice Browser Working Group is not defining the components
in the W3C Speech Interface Framework. It is defining markup
languages for representing data in each of the blue ovals in
Figure 1. Specifically, the Voice Browser Working Group is
defining the following markup languages:</p>

<ul>
<li>
<p>Speech Recognition Grammar Specification</p>
</li>

<li>
<p>N-gram Grammar Markup Language</p>
</li>

<li>
<p>Speech Synthesis Markup Language</p>
</li>

<li>
<p>Dialog Markup Language</p>
</li>
</ul>

<p>The Voice Browser Working Group is also defining packaged
dialogs which we call <b>Reusable Components</b>. As their name
suggests, reusable components can be reused in other dialog
scripts, decreasing the implementation effort and increasing user
interface consistency. The Working Group may also define a
collection of reusable components, such as components that
solicit the user's credit card number and expiration date, or
that solicit the user's address.</p>

<p>Just as HTML formats data for screen-based interactions over
the Internet, an XML-based language is needed to format data for
voice-based interactions over the Internet. All markup languages
recommended by the Working Group will be XML-based, so XML
language processors can process any of the W3C Speech Interface
Framework markup languages.</p>

<h2>5. <a id="other" name="other">Other Uses of the Markup
Languages</a></h2>

<p>Figure 2 illustrates the W3C Speech Interface Framework
extended to support multiple modes of input and output. We
anticipate that another Working Group will be established to
specify the <b>Multimodal Dialog Language</b>, an extension of
the Dialog Language, and to take over our current work in this
area.</p>

<p align="center"><img src="voice-intro-fig2.gif" width="556"
height="402"
alt="block diagram for multimodal interface framework" /></p>

<p>Markup languages also may be used in applications not usually
associated with voice browsers. The following applications also
may benefit from the use of voice browser markup languages:</p>

<ul>
<li><em>Text-based Information Storage and
Retrieval</em>&#8212;Accepts text from a keyboard and presents
text on a display. It uses neither ASR nor TTS, but makes heavy
use of the language understanding component and the Natural
Language Semantics Markup Language.</li>

<li><em>Robot Command and Control</em>&#8212;Users speak commands
that control a mechanical robot. This application may use both
the Speech Recognition Grammar Specification and the Dialog
Markup Language.</li>

<li><em>Medical Transcription</em>&#8212;A complex, specialized
speech recognition grammar is used to extract medical information
from text produced by the ASR. A human editor corrects the
resulting text before printing.</li>

<li><em>Newsreader</em>&#8212;A language generator produces
marked-up text for presenting voice to the user. This application
uses a special language generator to mark up text from news wire
services for verbal presentation.</li>
</ul>

<h2>6. <a id="specs" name="specs">Individual Markup Language
Overviews</a></h2>

<p>To review the latest requirement and specification documents
for each of the following languages, see the section entitled
Requirements and Language Specification Documents on the <a
href="http://www.w3.org/Voice/">W3C Voice Browser home
page</a>.</p>

<h3><a id="gram" name="gram">6.1. Speech Recognition Grammar
Specification</a></h3>

<p>The Speech Recognition Grammar Specification supports the
definition of Context-Free Grammars (CFG) and, by subsumption,
Finite-State Grammars (FSG). The specification defines an XML
Grammar Markup Language and an optional Augmented Backus-Naur
Format (ABNF) Markup Language. Automatic transformations between
the two formats are possible, for example, by using XSLT to
convert the XML format to ABNF. We anticipate that development
tools will be constructed that provide the familiar ABNF format
to developers, and that enable XML software to manipulate the XML
grammar format. The ABNF and XML languages are modeled after
Sun's <a
href="http://www.w3.org/Submission/2000/06/">JSpeech Grammar
Format</a>. Some of the interesting features of the draft
specification include:</p>

<ul>
<li>
<p>Ability to cross-reference grammars by URI and to use this
ability to define libraries of useful grammars.</p>
</li>

<li>
<p>Internationalized.</p>
</li>

<li>
<p>Semantic tagging mechanism for interpretation of spoken input
(under development).</p>
</li>

<li>
<p>Applicable to non-speech input modalities, e.g. DTMF input or
parsing and interpretation of typed input.</p>
</li>
</ul>

<p>A complementary speech recognition grammar language
specification is defined for N-Gram language models.</p>
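
<p>As a rough illustration only (the element names below follow
the menu fragment in section 7 of this document and may differ
from the evolving specification), a small grammar that accepts a
yes-or-no answer might be written in the XML format as:</p>

<pre>
&lt;!-- Illustrative sketch; element names follow the section 7 example --&gt;
&lt;grammar&gt;
  &lt;choice&gt;
    &lt;item&gt; yes &lt;/item&gt;
    &lt;item&gt; yeah &lt;/item&gt;
    &lt;item&gt; no &lt;/item&gt;
    &lt;item&gt; nope &lt;/item&gt;
  &lt;/choice&gt;
&lt;/grammar&gt;
</pre>

<p>The ABNF format would express the same set of alternatives as
a single rule with alternation. As noted above, the two formats
are intended to be automatically convertible, so a developer
comfortable with BNF-style notation can author grammars textually
while XML software manipulates the XML form.</p>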

<p>Terms used in the Speech Grammar Markup Language requirements
and specification documents include:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="24%">CFG</th>
<td width="76%">Context-Free Grammar. A formal computer science
term for a language that permits embedded recursion.</td>
</tr>

<tr>
<th width="24%">BNF</th>
<td width="76%">Backus-Naur Format. A language used widely in
computer science for textural representations of CFGs.</td>
</tr>

<tr>
<th width="24%">ABNF</th>
<td width="76%">Augmented Backus-Naur Format. The language
defined in the grammar specification that extends a conventional
BNF representation with regular grammar capabilities, syntax for
cross-referencing between grammars and other useful syntactic
features</td>
</tr>

<tr>
<th width="24%">Grammar</th>
<td width="76%">The representation of constraints defining the
set of allowable sentences in a language. E.g. a grammar for
describing a set of sentences for ordering a pizza.</td>
</tr>

<tr>
<th width="24%">Language</th>
<td width="76%">A formal computer science term for the collection
of set of sentences associated with a particular domain. Language
may refer to natural or program language.</td>
</tr>
</tbody>
</table>

<h3><a id="synth" name="synth">6.2. Speech Synthesis</a></h3>

<p>A text document may be produced automatically, authored by
people, or a combination of both. The Speech Synthesis Markup
Language supports high-level specifications, including the
selection of voice characteristics (name, gender, and age) and
the speed, volume, and emphasis of individual words. The language
may also describe whether an acronym is pronounced as a word,
such as "Nasa" for NASA, or spelled out, such as "N, double A, C,
P" for NAACP. At a lower level, designers may specify prosodic
control, which includes pitch, timing, pausing, and speaking
rate. The Speech Synthesis Markup Language is modeled on Sun's <a
href="http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html">
<b>Java Speech Markup Language</b></a>.</p>
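
<p>As a hedged sketch of what such markup might look like (only
the &lt;emphasis&gt; element appears elsewhere in this document;
the &lt;voice&gt; and &lt;break&gt; element names and their
attributes are illustrative placeholders, not the specified
vocabulary):</p>

<pre>
&lt;!-- Illustrative sketch; voice and break are placeholder names --&gt;
&lt;voice gender="female"&gt;
  Your flight departs at &lt;emphasis&gt;nine thirty&lt;/emphasis&gt;.
  &lt;break/&gt;
  Please arrive at the gate thirty minutes early.
&lt;/voice&gt;
</pre>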

<p>There is some variance in the use of terminology in the speech
synthesis community. The following definitions establish a common
understanding:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th>Prosody</th>
<td width="76%">Features of speech such as pitch, pitch range,
speaking rate and volume.</td>
</tr>

<tr>
<th width="24%">Speech Synthesis</th>
<td width="76%">The process of automatic generation of speech
output from data input which may include plain text, <span
class="diff">formatted text or binary objects</span>.</td>
</tr>

<tr>
<th width="24%">Text-To-Speech</th>
<td width="76%">The process of automatic generation of speech
output from text or annotated text input.</td>
</tr>
</tbody>
</table>

<h3><a id="dialog" name="dialog">6.3. VoiceXML 2.0</a></h3>

<p>The VoiceXML 2.0 markup supports four I/O modes: speech
recognition and DTMF as input, with synthesized speech and
prerecorded speech as output. VoiceXML 2.0 supports
system-directed speech dialogs, in which the system prompts the
user for responses, makes sense of the input, and determines what
to do next. VoiceXML 2.0 also supports mixed initiative speech
dialogs. In addition, VoiceXML supports task switching and the
handling of events, such as recognition errors, incomplete
information entered by the user, timeouts, barge-in, and
developer-defined events. Barge-in allows users to speak while
the browser is speaking. VoiceXML 2.0 is modeled after <a
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0</a>,
designed by the <a href="http://www.voicexml.org/">VoiceXML
Forum</a>, whose founding members are AT&amp;T, IBM, Lucent, and
Motorola.</p>
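
<p>As a brief sketch in the style of VoiceXML 1.0, on which
VoiceXML 2.0 is modeled (the exact 2.0 syntax is still under
development), a system-directed field with simple event handlers
might look like:</p>

<pre>
&lt;!-- Sketch in VoiceXML 1.0 style; VoiceXML 2.0 syntax may differ --&gt;
&lt;vxml version="1.0"&gt;
  &lt;form id="destination"&gt;
    &lt;field name="city"&gt;
      &lt;prompt&gt;Which city do you want to fly to?&lt;/prompt&gt;
      &lt;grammar&gt; ... &lt;/grammar&gt;
      &lt;!-- event handlers for silence and unrecognized speech --&gt;
      &lt;noinput&gt;Sorry, I did not hear you.&lt;/noinput&gt;
      &lt;nomatch&gt;Sorry, I did not understand you.&lt;/nomatch&gt;
    &lt;/field&gt;
  &lt;/form&gt;
&lt;/vxml&gt;
</pre>

<p>The &lt;noinput&gt; and &lt;nomatch&gt; elements illustrate
the event handling described above; the grammar content is
elided.</p>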

<p>Terms used in the Dialog Markup Language requirements and
specification documents include:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th>Dialog Markup Language</th>
<td>a language in which voice dialog behavior is specified. The
language may include reference to scripting elements which can
also determine dialog behavior.</td>
</tr>

<tr>
<th>Voice Browser</th>
<td>a software device which interprets a voice markup language
and generates a dialog with voice output and possibly other
output modalities and/or voice input and possibly other
modalities.</td>
</tr>

<tr>
<th>Dialog</th>
<td>a model of interactive behavior underlying the interpretation
of the markup language. The model consists of states, variables,
events, event handlers, inputs and outputs.</td>
</tr>

<tr>
<th>Utterance</th>
<td>Used in this document generally to refer to a meaningful user
input in any modality supported by the platform, not limited to
spoken inputs. For example, speech, DTMF, pointing, handwriting,
text and OCR.</td>
</tr>

<tr>
<th>Mixed initiative dialog</th>
<td>A type of dialog in which either the system or the user can
take the initiative at any point in the dialog by failing to
respond directly to the previous utterance. For example, the user
can make corrections, volunteer additional information, etc.
Systems support mixed initiative dialog to various degrees.
Compare to "directed dialog."</td>
</tr>

<tr>
<th>Directed dialog</th>
<td>Also referred to as "system initiative" or "system led." A
type of dialog in which the user is permitted only direct literal
responses to the system's prompts.</td>
</tr>

<tr>
<th>State</th>
<td>the basic interactional unit defined in the markup language.
A state can specify variables, event handlers, outputs and
inputs. A state may describe output content to be presented to
the user, input which the user can enter, and event handlers
describing, for example, which variables to bind and which state
to transition to when an event occurs.</td>

<tr>
<th>Events</th>
<td>generated when a state is executed by the voice browser; for
example, when outputs or inputs in a state are rendered or
interpreted. Events are typed and may include information; for
example, an input event generated when an utterance is recognized
may include the string recognized, an interpretation, confidence
score, and so on.</td>
</tr>

<tr>
<th>Event Handlers</th>
<td>are specified in the voice markup language and describe how
events generated by the voice browser are to be handled.
Interpretation of events may bind variables, or map the current
state into another state (possibly itself).</td>
</tr>

<tr>
<th>Output</th>
<td>content specified in an element of the markup language for
presentation to the user. The content is rendered by the voice
browser; for example, audio files or text rendered by a TTS.
Output can also contain parameters for the output device; for
example, volume of audio file playback, language for TTS, etc.
Events are generated when, for example, the audio file has been
played.</td>
</tr>

<tr>
<th>Input</th>
<td>content (and its interpretation) specified in an element of
the markup language which can be given as input by a user; for
example, a grammar for DTMF and speech input. Events are
generated by the voice browser when, for example, the user has
spoken an utterance and variables may be bound to information
contained in the event. Input can also specify parameters for the
input device; for example, timeout parameters, etc.</td>
</tr>
</tbody>
</table>

<h3><a id="nl" name="nl">6.4. Natural Language Semantics</a></h3>

<p>The Natural Language Semantics Markup Language supports XML
semantic representations. For application-specific information,
it is based on the W3C <a
href="http://www.w3.org/TR/2000/WD-xforms-datamodel-20000406/">XForms</a>
data model. The Natural Language Semantics Markup Language also
includes application-independent elements defined by the W3C
Voice Browser Working Group. This application-independent
information includes confidences, the grammar matched by the
interpretation, speech recognizer input, and timestamps. The
Natural Language Semantics Markup Language combines elements from
the XForms, natural language semantics, and application-specific
namespaces. For example, the text, "I want to fly from New York
to Boston, and, then, to Washington, DC", could be represented
as:</p>

<pre>
&lt;result xmlns:xf="http://www.w3.org/2000/xforms" 
x-model="http://flight-model"
grammar="http://flight-grammar"&gt;
  &lt;interpretation confidence="100"&gt;
    &lt;xf:instance&gt;
       &lt;flight:trip&gt;
         &lt;leg1&gt; 
           &lt;from&gt;New York&lt;/from&gt; 
           &lt;to&gt;Boston&lt;/to&gt; 
         &lt;/leg1&gt;
         &lt;leg2&gt; 
           &lt;from&gt;Boston&lt;/from&gt; 
           &lt;to&gt;DC&lt;/to&gt; 
         &lt;/leg2&gt;
       &lt;/flight:trip&gt;
    &lt;/xf:instance&gt;
    &lt;input mode="speech"&gt;
      I want to fly from New York to Boston, and, 
      then, to Washington, DC
    &lt;/input&gt;
  &lt;/interpretation&gt;
&lt;/result&gt;
</pre>

<p>Terms used in the Natural Language Semantics Markup Language
requirements and specification documents include:</p>

<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="23%">Natural language interpreter</th>
<td width="77%">A device which produces a representation of the
meaning of a natural language expression.</td>
</tr>

<tr>
<th width="23%">Natural language expression</th>
<td width="77%">An unformatted spoken or written utterance in a
human language such as English, French, Japanese, etc.</td>
</tr>
</tbody>
</table>

<h3><a id="reuse" name="reuse">6.5 Reusable Dialog
Components</a></h3>

<p>Reusable Dialog Components are dialog components (chunks of
dialog script, or platform-specific objects, that pose frequently
asked questions and can be invoked from any dialog script) that
are reusable (they can be used multiple times within an
application or by multiple applications) and that meet specific
interface requirements for configuration parameters and return
value formats. The purpose of reusable components is to reduce
the effort needed to implement a dialog by reusing encapsulations
of common dialog tasks, and to promote consistency across
applications. The W3C Voice Browser Working Group is defining the
interface for Reusable Dialog Components. Future specifications
will define standard reusable dialog components for designated
tasks that are portable across platforms.</p>
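
<p>As a purely hypothetical illustration of one plausible
invocation style (the component interface is not yet specified;
the URL, parameter name, and return value names below are
invented for this sketch, which borrows the &lt;subdialog&gt;
mechanism of VoiceXML 1.0):</p>

<pre>
&lt;!-- Hypothetical reusable component call in VoiceXML 1.0 style --&gt;
&lt;subdialog name="card" src="http://example.com/get_credit_card.vxml"&gt;
  &lt;!-- a configuration parameter passed to the component --&gt;
  &lt;param name="retries" expr="'3'"/&gt;
  &lt;filled&gt;
    &lt;!-- card.number and card.expiration stand for the component's
         return values --&gt;
    &lt;assign name="document.ccnum" expr="card.number"/&gt;
  &lt;/filled&gt;
&lt;/subdialog&gt;
</pre>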

<h2>7. <a id="examples" name="examples">Example of Markup
Language Use</a></h2>

<p>The following speech dialog fragment illustrates the use of
the Speech Synthesis Markup Language, the Speech Recognition
Grammar Specification, and the Dialog Markup Language:</p>

<pre>
&lt;menu&gt;
  &lt;!-- This is an example of a menu which presents the user --&gt;
  &lt;!-- with a prompt and listens for the user to utter a choice --&gt;
  &lt;prompt&gt;
    &lt;!-- This text is presented to the user as synthetic speech --&gt;
    &lt;!-- The emphasis element adds emphasis to its content --&gt;
    Welcome to Ajax Travel. Do you want to fly to
    &lt;emphasis&gt;New York, Boston&lt;/emphasis&gt; or
    &lt;emphasis&gt;Washington DC&lt;/emphasis&gt;?
  &lt;/prompt&gt;
  &lt;!-- When the user speaks an utterance that matches the grammar --&gt;
  &lt;!-- control is transferred to the "next" VoiceXML document --&gt;
  &lt;choice next="http://www.NY..."&gt; 
    &lt;!-- The Grammar element indicates the words which --&gt;
    &lt;!-- the user may utter to select this choice --&gt;                      
    &lt;grammar&gt;
      &lt;choice&gt; 
        &lt;item&gt; New York &lt;/item&gt; 
        &lt;item&gt; The Big Apple &lt;/item&gt; 
      &lt;/choice&gt;
    &lt;/grammar&gt; 
  &lt;/choice&gt;
  &lt;choice next="http://www.Boston..."&gt;
    &lt;grammar&gt;
      &lt;choice&gt; 
        &lt;item&gt; Boston &lt;/item&gt; 
        &lt;item&gt; Beantown &lt;/item&gt; 
      &lt;/choice&gt;
    &lt;/grammar&gt; 
  &lt;/choice&gt;
  &lt;choice next="http://www.Wash...."&gt;
    &lt;grammar&gt;
      &lt;choice&gt; 
        &lt;item&gt; Washington D.C. &lt;/item&gt; 
        &lt;item&gt; Washington &lt;/item&gt;
        &lt;item&gt; The U.S. Capital &lt;/item&gt; 
      &lt;/choice&gt;
    &lt;/grammar&gt; 
 &lt;/choice&gt;
&lt;/menu&gt;
</pre>

<p>In the example above, the Dialog Markup Language describes a
voice menu which contains a prompt to be presented to the user.
The user may respond by saying any of several choices. When the
user's speech matches a particular grammar, control is
transferred to the dialog fragment at the "next" location.</p>

<p>The Speech Synthesis Markup Language describes how text is
rendered to the user. It includes the &lt;emphasis&gt; element.
When rendered to the user, the city names will be emphasized, and
the end of the sentence will rise in pitch to indicate a
question.</p>

<p>The Speech Recognition Grammar Specification describes the
words that the user must say when making a choice. The
&lt;grammar&gt; element is shown within the &lt;choice&gt;
element. The language understanding module will recognize "New
York" or "The Big Apple" to mean New York, "Boston" or "Beantown"
to mean Boston, and "Washington, D.C.," "Washington," or "The
U.S. Capital" to mean Washington.</p>

<p>An example user-computer dialog resulting from interpreting
the above dialog script is:</p>

<pre>
Computer: <i>Welcome to Ajax Travel. Do you want to fly 
          to New York, Boston, or Washington DC?</i>

User:     Beantown

Computer: <i>(transfers to dialog script associated with Boston)</i>
</pre>

<h2>8. <a id="submissions"
name="submissions">Submissions</a></h2>

<p>W3C has acknowledged the <a
href="http://www.w3.org/Submission/2000/06/">JSGF and JSML
submission</a> from <a href="http://www.sun.com/">Sun
Microsystems</a>. The W3C Voice Browser Working Group plans to
develop specifications for its Speech Synthesis Markup Language
and Speech Grammar Specification using JSGF and JSML as
models.</p>

<p>W3C has acknowledged the <a
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0
submission</a> from the <a
href="http://www.voicexml.org/">VoiceXML Forum</a>. The W3C <a
href="http://www.w3.org/Voice/Group/">Voice Browser Working
Group</a> plans to adopt VoiceXML 1.0 as the basis for developing
a Dialog Markup Language for interactive voice response
applications. See <a
href="http://www.zdnet.com/eweek/stories/general/0,11011,2574350,00.html">
ZDNet's article</a> covering the announcement.</p>

<h2>9. <a id="reading" name="reading">Further Reading
Material</a></h2>

<p>The following resources are related to the efforts of the
Voice Browser working group.</p>

<dl>
<dt><a href="http://www.w3.org/TR/REC-CSS2/aural.html">Aural
CSS</a></dt>

<dd>The aural rendering of a document, already commonly used by
the blind and print-impaired communities, combines speech
synthesis and "auditory icons." Often such aural presentation
occurs by converting the document to plain text and feeding this
to a screen reader -- software or hardware that simply reads all
the characters on the screen. This results in less effective
presentation than would be the case if the document structure
were retained. Style sheet properties for aural presentation may
be used together with visual properties (mixed media) or as an
aural alternative to visual presentation.</dd>

<dt><br />
<a href="http://www.etsi.org/">The European Telecommunications
Standards Institute (ETSI)</a></dt>

<dd>The European Telecommunications Standards Institute (ETSI) is
a non-profit organization whose mission is "to determine and
produce the telecommunications standards that will be used for
decades to come". ETSI's work is complementary to W3C's. The ETSI
STQ Aurora DSR Working Group standardizes algorithms for
Distributed Speech Recognition (DSR). The idea is to preprocess
speech signals before transmission to a server connected to a
speech recognition engine. Navigate to http://www.etsi.org/stq/
for more details.</dd>

<dt><br />
<a
href="http://www.java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html">
Java Speech Grammar Format</a></dt>

<dd>The Java&#8482; Speech Grammar Format is used for defining
context-free grammars for speech recognition. JSGF adopts the
style and conventions of the Java programming language in
addition to the use of traditional grammar notations.<br />
</dd>

<dt><a href="http://www.microsoft.com/IIT/">Microsoft Speech
Site</a></dt>

<dd class="c5">This site describes the Microsoft speech API, and
contains a recognizer and synthesizer that can be
downloaded.</dd>

<dt><br />
<a href="http://www.w3.org/TR/NOTE-voice">NOTE-voice</a></dt>

<dd>This note describes features needed for effective interaction
with Web browsers that are based upon voice input and output.
Some extensions are proposed to HTML 4.0 and CSS2 to support
voice browsing, and some work is proposed in the area of speech
recognition and synthesis to make voice browsers more
effective.</dd>

<dt><br />
<a
href="http://www.bell-labs.com/project/tts/sable.html">SABLE</a></dt>

<dd>SABLE is a markup language for controlling text to speech
engines. It has evolved out of work on combining three existing
text to speech languages: SSML, STML and JSML.</dd>

<dt><br />
<a href="http://www.alphaworks.ibm.com/tech">SpeechML</a></dt>

<dd><i>(IBM's server precludes a simple URL for this, but you can
reach the SpeechML site by following the link for Speech
Recognition in the left frame)</i> SpeechML plays a similar role
to VoxML, defining a markup language written in XML for IVR
systems. SpeechML features close integration with Java.</dd>

<dt><br />
<a href="http://www.w3.org/Voice/TalkML">TalkML</a></dt>

<dd>This is an experimental markup language from HP Labs, written
in XML, and aimed at describing spoken dialogs in terms of
prompts, speech grammars and production rules for acting on
responses. It is being used to explore ideas for object-oriented
dialog structures, and for next generation aural style
sheets.</dd>

<dt><br />
<a href="http://www.w3.org/Voice/WWW8/slide1.html">Voice Browsers
and Style Sheets</a></dt>

<dd>Presentation by Dave Raggett on May 13th 1999 as part of the
Style stack of Developer's Day in <a
href="http://www8.org/">WWW8</a>. The presentation makes
suggestions for extensions to <a
href="http://www.w3.org/TR/REC-CSS2/aural.html">ACSS</a>.</dd>

<dt><br />
<a href="http://www.vxml.org/">VoiceXML site</a></dt>

<dd>The VoiceXML Forum was formed by AT&amp;T, IBM, Lucent and
Motorola to pool their experience. The Forum has published an
early version of the VoiceXML specification. This builds on
earlier work on PML, VoxML and SpeechML.</dd>
</dl>

<h2>10. <a id="summary" name="summary">Summary</a></h2>

<p>The W3C Voice Browser Working Group is defining markup
languages for speech recognition grammars, speech dialog, natural
language semantics, multimodal dialogs, and speech synthesis, as
well as a collection of reusable dialog components. In addition
to voice browsers, these languages can also support a wide range
of applications including information storage and retrieval,
robot command and control, medical transcription, and newsreader
applications. The speech community is invited to review and
comment on working draft requirement and specification
documents.</p>
</body>
</html>