<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="generator" content="HTML Tidy, see www.w3.org" />
<title>Introduction and Overview of W3C Speech Interface
Framework</title>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type" />
<meta content="Microsoft FrontPage 4.0" name="GENERATOR" />
<style type="text/css">
body {
margin-left: 10%;
margin-right: 5%;
color: black;
background-color: white;
background-attachment: fixed;
background-image: url(http://www.w3.org/StyleSheets/TR/WD);
background-position: top left;
background-repeat: no-repeat;
font-family: Tahoma, Verdana, "Myriad Web", Syntax, sans-serif;
}
.unfinished { font-style: normal; background-color: #FFFF33}
.dtd-code { font-family: monospace;
background-color: #dfdfdf; white-space: pre;
border: #000000; border-style: solid;
border-top-width: 1px; border-right-width: 1px;
border-bottom-width: 1px; border-left-width: 1px; }
p.copyright {font-size: smaller}
h2,h3 {margin-top: 1em;}
ul.toc li {list-style: none}
ul.toc a {text-decoration: none }
code {
color: green;
font-family: monospace;
font-weight: bold;
}
.example {
border: solid green;
border-width: 2px;
color: green;
font-weight: bold;
margin-right: 5%;
margin-left: 0;
}
.bad {
border: solid red;
border-width: 2px;
margin-left: 0;
margin-right: 5%;
color: rgb(192, 101, 101);
}
div.navbar { text-align: center; }
div.contents {
background-color: rgb(204,204,255);
padding: 0.5em;
border: none;
margin-right: 5%;
}
table {
margin-left: 0;
margin-right: 0;
font-family: sans-serif;
background: white;
border-width: 2px;
border-color: white;
}
th { font-family: sans-serif; background: rgb(204, 204, 153) }
td { font-family: sans-serif; background: rgb(255, 255, 153) }
.tocline { list-style: none; }
</style>
<link rel="stylesheet" type="text/css"
href="http://www.w3.org/StyleSheets/TR/W3C-WD" />
</head>
<body>
<div class="head">
<p><a href="http://www.w3.org/"><img class="head"
src="http://www.w3.org/Icons/WWW/w3c_home" alt="W3C" width="72" height="48"/></a></p>
<h1 class="head">Introduction and Overview of W3C Speech
Interface Framework</h1>
<h2 class="notoc">W3C Working Draft 4 December 2000</h2>
<dl>
<dt>This version:</dt>
<dd><a
href="http://www.w3.org/TR/2000/WD-voice-intro-20001204/">
http://www.w3.org/TR/2000/WD-voice-intro-20001204</a></dd>
<dt>Latest version:</dt>
<dd><a
href="http://www.w3.org/TR/voice-intro">http://www.w3.org/TR/voice-intro</a></dd>
<dt>Previous version:</dt>
<dd><a
href="http://www.w3.org/TR/1999/WD-voice-intro-19991223">http://www.w3.org/TR/1999/WD-voice-intro-19991223</a></dd>
<dt>Editor:</dt>
<dd>Jim A. Larson, Intel Architecture Labs</dd>
</dl>
<p class="copyright"><a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Copyright">
Copyright</a> ©2000 <a href="http://www.w3.org/"><abbr title="World
Wide Web Consortium">W3C</abbr></a><sup>®</sup> (<a
href="http://www.lcs.mit.edu/"><abbr title="Massachusetts Institute of
Technology">MIT</abbr></a>, <a href="http://www.inria.fr/"><abbr lang="fr"
title="Institut National de Recherche en Informatique et
Automatique">INRIA</abbr></a>, <a href="http://www.keio.ac.jp/">Keio</a>),
All Rights Reserved. W3C <a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#Legal_Disclaimer">liability</a>,
<a
href="http://www.w3.org/Consortium/Legal/ipr-notice-20000612#W3C_Trademarks">trademark</a>,
<a
href="http://www.w3.org/Consortium/Legal/copyright-documents-19990405">document
use</a> and <a
href="http://www.w3.org/Consortium/Legal/copyright-software-19980720">software
licensing</a> rules apply.</p>
<hr />
</div>
<h2 class="notoc"><a id="abstract"
name="abstract">Abstract</a></h2>
<p>The World Wide Web Consortium's Voice Browser Working Group is
defining several markup languages for applications supporting
speech input and output. These markup languages will enable
speech applications across a range of hardware and software
platforms. Specifically, the Working Group is designing markup
languages for dialog, speech recognition grammar, speech
synthesis, natural language semantics, and a collection of
reusable dialog components. These markup languages make up the
W3C Speech Interface Framework. The speech community is invited
to review and comment on the working draft requirement and
specification documents.</p>
<h2><a id="status" name="status">Status of This Document</a></h2>
<p>This document describes a model architecture for speech
processing in voice browsers. It also briefly describes markup
languages for dialog, speech recognition grammar, speech
synthesis, natural language semantics, and a collection of
reusable dialog components. This document is being released as a
working draft, but is not intended to become a proposed
recommendation.</p>
<p>This specification is a Working Draft of the Voice Browser
working group for review by W3C members and other interested
parties. It is a draft document and may be updated, replaced, or
obsoleted by other documents at any time. It is inappropriate to
use W3C Working Drafts as reference material or to cite them as
other than "work in progress".</p>
<p>Publication as a Working Draft does not imply endorsement by
the W3C membership, nor by members of the Voice Browser Working
Group.</p>
<p>This document has been produced as part of the <a
href="http://www.w3.org/Voice/">W3C Voice Browser Activity</a>,
following the procedures set out for the <a
href="http://www.w3.org/Consortium/Process/">W3C Process</a>. The
authors of this document are members of the <a
href="http://www.w3.org/Voice/Group">Voice Browser Working
Group</a>. This document is for public review. Comments should be
sent to the public mailing list <<a
href="mailto:www-voice@w3.org">www-voice@w3.org</a>> (<a
href="http://www.w3.org/Archives/Public/www-voice/">archive</a>).</p>
<p>A list of current W3C Recommendations and other technical
documents can be found at <a
href="http://www.w3.org/TR">http://www.w3.org/TR</a>.</p>
<h2>1. <a id="group" name="group">Voice Browser Working
Group</a></h2>
<p>The Voice Browser Working Group was <a
href="http://www.w3.org/Voice/1999/voice-wg-charter.html">chartered</a>
by the World Wide Web Consortium (W3C) within the User Interface
Activity in May 1999 to prepare and review markup languages that
enable voice browsers. Members meet weekly via telephone and
quarterly in face-to-face meetings.</p>
<p>The <a href="http://www.w3.org/Voice/">W3C Voice Browser
Working Group</a> is open to any member of the W3C Consortium.
The Voice Browser Working Group has also invited experts whose
affiliations are not members of the W3C Consortium. The four
founding members of the VoiceXML Forum, as well as telephony
applications vendors, speech recognition and text-to-speech
engine vendors, web portals, hardware vendors, software vendors,
telcos and appliance manufacturers have representatives who
participate in the Voice Browser Working Group. Current members
include Ask Jeeves, AT&T, Avaya, BT, Canon, Cisco, France
Telecom, General Magic, Hitachi, HP, IBM, isSound, Intel, Locus
Dialogue, Lucent, Microsoft, Mitre, Motorola, Nokia, Nortel,
Nuance, Philips, PipeBeach, SpeechWorks, Sun, Telecom Italia,
TellMe.com, and Unisys, in addition to several invited
experts.</p>
<h2 class="notoc">Table of Contents</h2>
<ul class="toc">
<li><a href="#abstract">Abstract</a></li>
<li><a href="#status">Status of this Document</a></li>
<li>1. <a href="#group">The Voice Browser Working Group</a></li>
<li>2. <a href="#browsers">Voice Browsers</a></li>
<li>3. <a href="#benefits">Voice Browser Benefits</a></li>
<li>4. <a href="#spif">W3C Speech Interface Framework</a></li>
<li>5. <a href="#other">Other Uses for Markup Languages</a></li>
<li>6. <a href="#specs">Individual Markup Languages Overview</a>
<ul>
<li>6.1. <a href="#gram">Speech Recognition Grammar
Specification</a></li>
<li>6.2. <a href="#synth">Speech Synthesis</a></li>
<li>6.3. <a href="#dialog">Dialog (VoiceXML 2.0)</a></li>
<li>6.4. <a href="#nl">Natural Language Semantics</a></li>
<li>6.5 <a href="#reuse">Reusable Dialog Components</a></li>
</ul>
</li>
<li>7. <a href="#examples">Example Markup Language Use</a></li>
<li>8. <a href="#submissions">Submissions</a></li>
<li>9. <a href="#reading">Further Reading Material</a></li>
<li>10. <a href="#summary">Summary</a></li>
</ul>
<h2>2. <a id="browsers" name="browsers">Voice Browsers</a></h2>
<p>A <em>voice browser</em> is a device (hardware and software)
that interprets voice markup languages to generate voice output,
interpret voice input, and possibly accept and produce other
modalities of input and output.</p>
<p>Currently, the major deployment of voice browsers enables users
to speak and listen using a telephone or cell phone to access
information available on the World Wide Web. These voice browsers
accept DTMF and spoken words as input, and produce synthesized
speech or replay prerecorded speech as output. The voice markup
languages interpreted by voice browsers are also frequently
available on the World Wide Web. However, many other deployments
of voice browsers are possible.</p>
<p>Hardware devices may include telephones or cell phones,
hand-held computers, palm-sized computers, laptop PCs, and
desktop PCs. Voice browser hardware processors may be embedded
into appliances such as TVs, radios, VCRs, remote controls,
ovens, refrigerators, coffeepots, doorbells, and practically any
other electronic or electrical device.</p>
<p>Possible software applications include:</p>
<ul>
<li>Accessing business information, including the corporate
"front desk" asking callers who or what they want, automated
telephone ordering services, support desks, order tracking,
airline arrival and departure information, cinema and theater
booking services, and home banking services</li>
<li>Accessing public information, including community information
such as weather, traffic conditions, school closures, directions
and events; local, national and international news; national and
international stock market information; and business and
e-commerce transactions</li>
<li>Accessing personal information, including calendars, address
and telephone lists, to-do lists, shopping lists, and calorie
counters</li>
<li>Assisting the user to communicate with other people sending
and receiving voice-mail messages</li>
</ul>
<p>Our definition of a voice browser does not include a voice
interface to HTML pages. A voice browser processes scripts
written in voice markup languages; HTML is not among the
languages that a voice browser interprets. Some
vendors are creating voice-enabled HTML browsers that produce
voice instead of displaying text on a screen. A
voice-enabled HTML browser must determine the sequence of text to
present to the user as voice, and possibly how to verbally
present non-text data such as tables, illustrations, and
animations. A voice browser, on the other hand, interprets a
script which specifies exactly what to verbally present to the
user, as well as when to present each piece of information.</p>
<h2>3. <a id="benefits" name="benefits">Voice Browser
Benefits</a></h2>
<p>Voice is a <em>very natural</em> user interface because it
enables the user to speak and listen using skills learned during
childhood. Currently users speak and listen to telephones and
cell phones with no display to interact with voice browsers. Some
voice browsers may have small screens, such as those found on
cell phones and palm computers. In the future, voice browsers may
also support other modes and media such as pen, video, and sensor
input and graphics animation and actuator controls as output. For
example, voice and pen input would be appropriate for Asian users
whose spoken language does not lend itself to entry with
traditional QWERTY keyboards.</p>
<p>Some voice browsers are <em>portable</em>. They can be used
anywhere—at home, at work, and on the road. Information
will be <em>available</em> to a greater audience, especially to
people who have access to handsets, either telephones or cell
phones, but not to networked computers.</p>
<p>Voice browsers present a <em>pragmatic</em> interface for
functionally blind users or users needing Web access while
keeping their hands and eyes free for other things. Voice
browsers present an invisible user interface to the user, while
freeing workspace previously occupied by keyboards and mice.</p>
<h2>4. <a id="spif" name="spif">W3C Speech Interface
Framework</a></h2>
<p>The Voice Browser Working Group has defined the <i>W3C Speech
Interface Framework</i>, shown in Figure 1. The white boxes
represent typical components of a speech-enabled web application.
The black arrows represent data flowing among these components.
The blue ovals indicate data specified using markup languages
used to guide components to accomplish their respective tasks. To
review the latest requirement and specification documents for
each of the markup languages, see the section entitled
Requirements and Language Specification Documents on our <a
href="http://www.w3.org/Voice/">W3C Voice Browser home web
site</a>.</p>
<p align="center"><img src="voice-intro-fig1.gif" width="559"
height="392"
alt="block diagram for speech interface framework" /></p>
<p>Components of the W3C Speech Interface Framework include the
following:</p>
<p><i>Automatic Speech Recognizer (ASR)</i>—accepts speech
from the user and produces text. The ASR uses a grammar to
recognize words from the user's speech. Some ASRs use
grammars specified by a developer using the <b>Speech Grammar
Markup Language</b>. Other ASRs use statistical grammars
generated from large corpora of speech data. These grammars are
represented using the <b>N-gram Stochastic Grammar Markup
Language.</b></p>
<p><i>DTMF Tone Recognizer</i>—accepts touch-tones produced
by a telephone when the user presses the keys on the telephone's
keypad. Telephone users may use touch-tones to enter digits or
make menu selections.</p>
<p><i>Language Understanding Component</i>—extracts
semantics from a text string by using a prespecified grammar. The
text string may be produced by an ASR or be entered directly by a
user via a keyboard. The Language Understanding Component may
also use grammars specified using the <b>Speech Grammar Markup
Language</b> or the <b>N-gram Stochastic Grammar Markup
Language.</b> The output of the Language Understanding Component
is expressed using the <b>Natural Language Semantics Markup
Language.</b></p>
<p><i>Context Interpreter</i>—enhances the semantics from
the Language Understanding Module by obtaining context
information from a dialog history (not shown in Figure 1). For
example, the Context Interpreter may replace a pronoun by a noun
to which the pronoun referred. The input and output from the
Context Interpreter is expressed using the <b>Natural Language
Semantics Markup Language.</b></p>
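<p>As a sketch (reusing the element names from the Natural
Language Semantics example in section 6.4, which are illustrative
rather than normative), resolving a referring expression such as
"there" against the dialog history might transform:</p>
<pre>
<!-- before context interpretation: the destination is unresolved -->
<leg2>
  <from>Boston</from>
  <to>there</to>
</leg2>

<!-- after: the referent is filled in from the dialog history -->
<leg2>
  <from>Boston</from>
  <to>Washington DC</to>
</leg2>
</pre>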
<p><i>Dialog Manager</i>—prompts the user for input, makes
sense of the input, and determines what to do next according to
instructions in a dialog script specified using VoiceXML 2.0,
which is modeled after VoiceXML 1.0. Depending upon the input
received, the dialog manager may invoke application services,
download another dialog script from the web, or cause information to be
presented to the user. The Dialog Manager accepts input specified
using the <b>Natural Language Semantics Markup Language.</b>
Dialog scripts may refer to <b>Reusable Dialog Components</b>,
portions of another dialog script which can be reused across
multiple applications.</p>
<p><i>Media Planner</i>—determines whether output from the
dialog manager should be presented to the user as synthetic
speech or prerecorded audio.</p>
<p><i>Recorded audio player</i>—replays prerecorded audio
files to the user, either in conjunction with, or in place of
synthesized voices.</p>
<p><i>Language Generator</i>—Accepts text from the media
planner and prepares it for presentation to the user as spoken
voice via a text-to-speech synthesizer (TTS). The text may
contain markup tags expressed using the <b>Speech Synthesis
Markup Language</b> which provides hints and suggestions for how
acoustic sounds should be produced. These tags may be produced
automatically by the Language Generator or manually inserted by a
developer.</p>
<p><i>Text-to-Speech Synthesizer (TTS)</i>—Accepts text
from the Language Generator and produces acoustic signals which
the user hears as a human-like voice according to hints specified
using the <b>Speech Synthesis Markup Language</b>.</p>
<p>The components of any specific voice browser may differ
significantly from the components shown in Figure 1. For example,
the Context Interpretation, Language Generation and Media
Planning components may be incorporated into the Dialog Manager,
or the tone recognizer may be incorporated into the Context
Interpretation component. However, most voice browser implementations will
still be able to make use of the various markup languages defined in
the W3C Speech Interface Framework.</p>
<p>The Voice Browser Working Group is not defining the components
in the W3C Speech Interface Framework. It is defining markup
languages for representing data in each of the blue ovals in
Figure 1. Specifically, the Voice Browser Working Group is
defining the following markup languages:</p>
<ul>
<li>
<p>Speech Recognition Grammar Specification</p>
</li>
<li>
<p>N-gram Grammar Markup Language</p>
</li>
<li>
<p>Speech Synthesis Markup Language</p>
</li>
<li>
<p>Dialog Markup Language</p>
</li>
</ul>
<p>The Voice Browser Working Group is also defining packaged
dialogs which we call <b>Reusable Components</b>. As their name
suggests, reusable components can be reused in other dialog
scripts, decreasing the implementation effort and increasing user
interface consistency. The Working Group may also define a
collection of reusable components, such as components that
solicit the user's credit card number and expiration date, or
that solicit the user's address.</p>
<p>Just as HTML formats data for screen-based interactions over
the Internet, an XML-based language is needed to format data for
voice-based interactions over the Internet. All markup languages
recommended by the Working Group will be XML-based, so XML
language processors can process any of the W3C Speech Interface
Framework markup languages.</p>
<h2>5. <a id="other" name="other">Other Uses of the Markup
Languages</a></h2>
<p>Figure 2 illustrates the W3C Speech Interface Framework
extended to support multiple modes of input and output. We
anticipate that another working group will be formed to specify
the <b>Multimodal Dialog Language</b>, an extension of the Dialog
Language, and to take over our current work in defining it.</p>
<p align="center"><img src="voice-intro-fig2.gif" width="556"
height="402"
alt="block diagram for multimodal interface framework" /></p>
<p>Markup languages also may be used in applications not usually
associated with voice browsers. The following applications also
may benefit from the use of voice browser markup languages:</p>
<ul>
<li><em>Text-based Information Storage and
Retrieval</em>—Accepts text from a keyboard and
presents text on a display. This application uses neither ASR nor
TTS, but makes heavy use of the language understanding module and
the Natural Language Semantics Markup Language.</li>
<li><em>Robot Command and Control</em>—Users speak commands
that control a mechanical robot. This application may use both
Speech Recognition Grammar Specification and dialog markup
languages.</li>
<li><em>Medical Transcription</em>—A complex, specialized
speech recognition grammar is used to extract medical information
from text produced by the ASR. A human editor corrects the
resulting text before printing.</li>
<li><em>Newsreader</em>—A language generator produces
marked-up text for presenting voice to the user. This application
uses a special language generator to mark up text from news wire
services for verbal presentation.</li>
</ul>
<h2>6. <a id="specs" name="specs">Individual Markup Language
Overviews</a></h2>
<p>To review the latest requirement and specification documents
for each of the following languages, see the section titled
Requirements and Language Specification Documents on our <a
href="http://www.w3.org/Voice/">W3C Voice Browser home web
site</a>.</p>
<h3><a id="gram" name="gram">6.1. Speech Recognition Grammar
Specification</a></h3>
<p>The Speech Recognition Grammar Specification supports the
definition of Context-Free Grammars (CFG) and, by subsumption,
Finite-State Grammars (FSG). The specification defines an XML
Grammar Markup Language, and an optional Augmented Backus-Naur
Format (ABNF) Markup Language. Automatic transformation between
the two formats is possible, for example, by using XSLT to convert the
XML format to ABNF. We anticipate that development tools will be
constructed that provide the familiar ABNF format to developers,
and enable XML software to manipulate the XML grammar format. The
ABNF and XML languages are modeled after Sun's <a
href="http://www.w3.org/Submission/2000/06/">JSpeech Grammar
Format</a>. Some of the interesting features of the draft
specification:</p>
<ul>
<li>
<p>Ability to cross-reference grammars by URI and to use this
ability to define libraries of useful grammars.</p>
</li>
<li>
<p>Internationalized.</p>
</li>
<li>
<p>Semantic tagging mechanism for interpretation of spoken input
(under development).</p>
</li>
<li>
<p>Applicable to non-speech input modalities, e.g. DTMF input or
parsing and interpretation of typed input.</p>
</li>
</ul>
<p>A complementary speech recognition grammar language
specification is defined for N-Gram language models.</p>
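<p>As an illustrative sketch, the same small grammar is shown
below in both the XML form and the ABNF form. The element and
rule names here are assumptions based on the working draft and on
the JSpeech Grammar Format; consult the specification documents
for the exact syntax. Both forms accept the utterances "Boston"
and "New York":</p>
<pre>
<!-- XML form: a single rule offering two alternatives -->
<grammar root="city">
  <rule id="city">
    <one-of>
      <item> Boston </item>
      <item> New York </item>
    </one-of>
  </rule>
</grammar>

// ABNF form: the same rule in the augmented BNF notation
$city = Boston | New York;
</pre>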
<p>Terms used in the Speech Grammar Markup Language requirements
and specification documents include:</p>
<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="24%">CFG</th>
<td width="76%">Context-Free Grammar. A formal computer science
term for a language that permits embedded recursion.</td>
</tr>
<tr>
<th width="24%">BNF</th>
<td width="76%">Backus-Naur Format. A language used widely in
computer science for textual representations of CFGs.</td>
</tr>
<tr>
<th width="24%">ABNF</th>
<td width="76%">Augmented Backus-Naur Format. The language
defined in the grammar specification that extends a conventional
BNF representation with regular grammar capabilities, syntax for
cross-referencing between grammars and other useful syntactic
features.</td>
</tr>
<tr>
<th width="24%">Grammar</th>
<td width="76%">The representation of constraints defining the
set of allowable sentences in a language. E.g. a grammar for
describing a set of sentences for ordering a pizza.</td>
</tr>
<tr>
<th width="24%">Language</th>
<td width="76%">A formal computer science term for the set of
sentences associated with a particular domain. "Language" may
refer to a natural or a programming language.</td>
</tr>
</tbody>
</table>
<h3><a id="synth" name="synth">6.2. Speech Synthesis</a></h3>
<p>A text document may be produced automatically, authored by
people, or a combination of both. The Speech Synthesis Markup
Language supports high-level specifications, including the
selection of voice characteristics (name, gender, and age) and
the speed, volume, and emphasis of individual words. The language
may also describe whether an acronym is pronounced as a word, such
as "Nasa" for NASA, or spelled out, such as "N, double A, C, P" for
NAACP. At a
lower level, designers may specify prosodic control, which
includes pitch, timing, pausing, and speaking rate. The Speech
Synthesis Markup Language is modeled on Sun's <a
href="http://java.sun.com/products/java-media/speech/forDevelopers/JSML/index.html">
<b>Java Speech Markup Language</b></a>.</p>
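<p>As a sketch of the kinds of hints a developer can supply, the
fragment below selects a voice, emphasizes a phrase, and slows the
speaking rate for a confirmation code. The element and attribute
names are assumptions modeled on the Java Speech Markup Language
and the working draft; the final specification may differ:</p>
<pre>
<speak>
  <voice gender="female">
    Welcome to Ajax Travel.
    Your flight leaves at <emphasis> nine thirty </emphasis>
    tomorrow morning.
    <!-- slow down so the user can write the code down -->
    Your confirmation code is
    <prosody rate="slow"> X 7 4 2 </prosody>.
  </voice>
</speak>
</pre>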
<p>There is some variance in the use of terminology in the speech
synthesis community. The following definitions establish a common
understanding:</p>
<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th>Prosody</th>
<td width="76%">Features of speech such as pitch, pitch range,
speaking rate and volume.</td>
</tr>
<tr>
<th width="24%">Speech Synthesis</th>
<td width="76%">The process of automatic generation of speech
output from data input which may include plain text, <span
class="diff">formatted text or binary objects</span>.</td>
</tr>
<tr>
<th width="24%">Text-To-Speech</th>
<td width="76%">The process of automatic generation of speech
output from text or annotated text input.</td>
</tr>
</tbody>
</table>
<h3><a id="dialog" name="dialog">6.3. VoiceXML 2.0</a></h3>
<p>VoiceXML 2.0 Markup supports four I/O modes: speech
recognition and DTMF as input with synthesized speech and
prerecorded speech as output. VoiceXML 2.0 supports
system-directed speech dialogs where the system prompts the user
for responses, makes sense of the input, and determines what to
do next. VoiceXML 2.0 also supports mixed initiative speech
dialogs. In addition, VoiceXML also supports task switching and
the handling of events, such as recognition errors, incomplete
information entered by the user, timeouts, barge-in, and
developer-defined events. Barge-in allows users to speak while
the browser is speaking. VoiceXML 2.0 is modeled after <a
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0</a>
designed by the <a href="http://www.voicexml.org/">VoiceXML
Forum</a>, whose founding members are AT&T, IBM, Lucent, and
Motorola.</p>
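<p>The fragment below is a sketch of a simple system-directed
dialog in the style of VoiceXML 1.0, on which VoiceXML 2.0 is
modeled. It prompts the user for a city, supplies event handlers
for silence and unrecognized input, and transfers control once the
field is filled. The URLs and grammar file are hypothetical:</p>
<pre>
<form id="destination">
  <field name="city">
    <prompt> Which city do you want to fly to? </prompt>
    <grammar src="http://www.example.com/cities.gram"/>
    <!-- event handlers for timeouts and recognition errors -->
    <noinput> Sorry, I did not hear you. </noinput>
    <nomatch> Sorry, I did not understand you. </nomatch>
    <filled>
      <!-- download and transfer to another dialog script -->
      <goto next="http://www.example.com/booking.vxml"/>
    </filled>
  </field>
</form>
</pre>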
<p>Terms used in the Dialog Markup Language requirements and
specification documents include:</p>
<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th>Dialog Markup Language</th>
<td>a language in which voice dialog behavior is specified. The
language may include reference to scripting elements which can
also determine dialog behavior.</td>
</tr>
<tr>
<th>Voice Browser</th>
<td>a software device which interprets a voice markup language
and generates a dialog with voice output and possibly other
output modalities and/or voice input and possibly other
modalities.</td>
</tr>
<tr>
<th>Dialog</th>
<td>a model of interactive behavior underlying the interpretation
of the markup language. The model consists of states, variables,
events, event handlers, inputs and outputs.</td>
</tr>
<tr>
<th>Utterance</th>
<td>Used in this document generally to refer to a meaningful user
input in any modality supported by the platform, not limited to
spoken inputs. For example, speech, DTMF, pointing, handwriting,
text and OCR.</td>
</tr>
<tr>
<th>Mixed initiative dialog</th>
<td>A type of dialog in which either the system or the user can
take the initiative at any point in the dialog by failing to
respond directly to the previous utterance. For example, the user
can make corrections, volunteer additional information, etc.
Systems support mixed initiative dialog to various degrees.
Compare to "directed dialog."</td>
</tr>
<tr>
<th>Directed dialog</th>
<td>Also referred to as "system initiative" or "system led." A
type of dialog in which the user is permitted only direct literal
responses to the system's prompts.</td>
</tr>
<tr>
<th>State</th>
<td>the basic interactional unit defined in the markup language.
A state can specify variables, event handlers, outputs and
inputs. A state may describe output content to be presented to
the user, input which the user can enter, event handlers
describing, for example, which variables to bind and which state
to transition to when an event occurs.</td>
</tr>
<tr>
<th>Events</th>
<td>generated when a state is executed by the voice browser; for
example, when outputs or inputs in a state are rendered or
interpreted. Events are typed and may include information; for
example, an input event generated when an utterance is recognized
may include the string recognized, an interpretation, confidence
score, and so on.</td>
</tr>
<tr>
<th>Event Handlers</th>
<td>are specified in the voice markup language and describe how
events generated by the voice browser are to be handled.
Interpretation of events may bind variables, or map the current
state into another state (possibly itself).</td>
</tr>
<tr>
<th>Output</th>
<td>content specified in an element of the markup language for
presentation to the user. The content is rendered by the voice
browser; for example, audio files or text rendered by a TTS.
Output can also contain parameters for the output device; for
example, volume of audio file playback, language for TTS, etc.
Events are generated when, for example, the audio file has been
played.</td>
</tr>
<tr>
<th>Input</th>
<td>content (and its interpretation) specified in an element of
the markup language which can be given as input by a user; for
example, a grammar for DTMF and speech input. Events are
generated by the voice browser when, for example, the user has
spoken an utterance and variables may be bound to information
contained in the event. Input can also specify parameters for the
input device; for example, timeout parameters, etc.</td>
</tr>
</tbody>
</table>
<h3><a id="nl" name="nl">6.4. Natural Language Semantics</a></h3>
<p>The Natural Language Semantics Markup Language supports XML
semantic representations. For application-specific information,
it is based on the W3C <a
href="http://www.w3.org/TR/2000/WD-xforms-datamodel-20000406/">XForms.</a>
The Natural Language Semantics Markup Language also includes
application-independent elements defined by the W3C Voice Browser
group. This application-independent information includes
confidences, the grammar matched by the interpretation, speech
recognizer input, and timestamps. The Natural Language Semantics
Markup Language combines elements from the XForms, natural
language semantics, and application-specific namespaces. For
example, the text, "I want to fly from New York to Boston, and,
then, to Washington, DC", could be represented as:</p>
<pre>
<result xmlns:xf="http://www.w3.org/2000/xforms"
        xmlns:flight="http://flight-model"
        x-model="http://flight-model"
        grammar="http://flight-grammar">
<interpretation confidence="100">
<xf:instance>
<flight:trip>
<leg1>
<from>New York</from>
<to>Boston</to>
</leg1>
<leg2>
<from>Boston</from>
<to>DC</to>
</leg2>
</flight:trip>
</xf:instance>
<input mode="speech">
I want to fly from New York to Boston, and,
then, to Washington, DC
</input>
</interpretation>
</result>
</pre>
<p>Terms used in the Natural Language Semantics Markup Language
requirements and specification documents include:</p>
<table border="1" cellpadding="6" cellspacing="1" width="85%"
summary="term in first column, explanation in second">
<tbody>
<tr>
<th width="23%">Natural language interpreter</th>
<td width="77%">A device which produces a representation of the
meaning of a natural language expression.</td>
</tr>
<tr>
<th width="23%">Natural language expression</th>
<td width="77%">An unformatted spoken or written utterance in a
human language such as English, French, Japanese, etc.</td>
</tr>
</tbody>
</table>
<h3><a id="reuse" name="reuse">6.5 Reusable Dialog
Components</a></h3>
<p>Reusable Dialog Components are dialog components that are
reusable and that meet specific interface requirements. A dialog
component is a chunk of dialog script, or a platform-specific
object, that poses frequently asked questions and can be invoked
from any dialog script. A component is reusable when it can be
used multiple times within an application or by multiple
applications, and its interface specifies its configuration
parameters and the format of its return value. The purpose of
reusable components is to reduce the effort to implement a dialog
by reusing encapsulations of common dialog tasks, and to promote
consistency across applications. The W3C Voice Browser Working
Group is defining the interface for Reusable Dialog Components.
Future specifications will define standard reusable dialog
components for designated tasks that are portable across
platforms.</p>
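<p>As a sketch of how a dialog script might invoke a reusable
component, the fragment below uses the VoiceXML subdialog
mechanism to call a hypothetical date-collection component,
passing one configuration parameter and binding the returned
value. The component URL, parameter name, and return field are
assumptions, not part of any specification:</p>
<pre>
<!-- invoke a reusable component that collects a travel date -->
<subdialog name="travelDate"
    src="http://www.example.com/components/getdate.vxml">
  <!-- configuration parameter: the prompt the component speaks -->
  <param name="promptText" expr="'When do you want to travel?'"/>
  <filled>
    <!-- bind the component's return value; "date" is assumed
         to be declared elsewhere in the document -->
    <assign name="date" expr="travelDate.date"/>
  </filled>
</subdialog>
</pre>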
<h2>7. <a id="examples" name="examples">Example of Markup
Language Use</a></h2>
<p>The following speech dialog fragment illustrates the use of
the speech synthesis, Speech Recognition Grammar Specification,
and speech dialog markup languages:</p>
<pre>
<menu>
<!-- This is an example of a menu which presents the user -->
<!-- with a prompt and listens for the user to utter a choice -->
<prompt>
<!-- This text is presented to the user as synthetic speech -->
<!-- The emphasis element adds emphasis to its content -->
Welcome to Ajax Travel. Do you want to fly to
<emphasis>New York, Boston</emphasis> or
<emphasis>Washington DC</emphasis>?
</prompt>
<!-- When the user speaks an utterance that matches the grammar -->
<!-- control is transferred to the "next" VoiceXML document -->
<choice next="http://www.NY...">
<!-- The Grammar element indicates the words which -->
<!-- the user may utter to select this choice -->
<grammar>
<choice>
<item> New York </item>
<item> The Big Apple </item>
</choice>
</grammar>
</choice>
<choice next="http://www.Boston...">
<grammar>
<choice>
<item> Boston </item>
<item> Beantown </item>
</choice>
</grammar>
</choice>
<choice next="http://www.Wash....">
<grammar>
<choice>
<item> Washington D.C. </item>
<item> Washington </item>
<item> The U.S. Capital </item>
</choice>
</grammar>
</choice>
</menu>
</pre>
<p>In the example above, the Dialog Markup Language describes
a voice menu which contains a prompt to be presented to the
user. The user may respond by saying any of several choices. When
the user's speech matches a particular grammar, control is
transferred to the dialog fragment at the "next" location.</p>
<p>The Speech Synthesis Markup Language describes how text is
rendered to the user. The Speech Synthesis Markup Language
includes the <emphasis> element. When rendered to the user,
the city names will be emphasized, and the end of the sentence
will rise in pitch to indicate a question.</p>
<p>The Speech Recognition Grammar Specification describes the
words that the user must say when making a choice. The
<grammar> element is shown within the <choice>
element. The language understanding module will recognize "New
York" or "The Big Apple" to mean New York, "Boston" or "Beantown"
to mean Boston, and "Washington, D.C.," "Washington," or "The
U.S. Capital" to mean Washington.</p>
<p>An example user-computer dialog resulting from interpreting
the above dialog script is:</p>
<pre>
Computer: <i>Welcome to Ajax Travel. Do you want to fly
to New York, Boston, or Washington DC?</i>
User: Beantown
Computer: <i>(transfers to dialog script associated with Boston)</i>
</pre>
<h2>8. <a id="submissions"
name="submissions">Submissions</a></h2>
<p>W3C has acknowledged the <a
href="http://www.w3.org/Submission/2000/06/">JSGF and JSML
submission</a> from <a href="http://www.sun.com/">Sun
Microsystems</a>. The W3C Voice Browser Working Group plans to
develop specifications for its Speech Synthesis Markup Language
and Speech Grammar Specification using JSGF and JSML as a
model.</p>
<p>W3C has acknowledged the <a
href="http://www.w3.org/Submission/2000/04/">VoiceXML 1.0
submission</a> from the <a
href="http://www.voicexml.org/">VoiceXML Forum</a>. The W3C <a
href="http://www.w3.org/Voice/Group/">Voice Browser Working
Group</a> plans to adopt VoiceXML 1.0 as the basis for developing
a Dialog Markup Language for interactive voice response
applications. See <a
href="http://www.zdnet.com/eweek/stories/general/0,11011,2574350,00.html">
ZDNet's article</a> covering the announcement.</p>
<h2>9. <a id="reading" name="reading">Further Reading
Material</a></h2>
<p>The following resources are related to the efforts of the
Voice Browser working group.</p>
<dl>
<dt><a href="http://www.w3.org/TR/REC-CSS2/aural.html">Aural
CSS</a></dt>
<dd>The aural rendering of a document, already commonly used by
the blind and print-impaired communities, combines speech
synthesis and "auditory icons." Often such aural presentation
occurs by converting the document to plain text and feeding this
to a screen reader -- software or hardware that simply reads all
the characters on the screen. This results in less effective
presentation than would be the case if the document structure
were retained. Style sheet properties for aural presentation may
be used together with visual properties (mixed media) or as an
aural alternative to visual presentation.</dd>
<dt><br />
<a href="http://www.etsi.org/">The European Telecommunications
Standards Institute (ETSI)</a></dt>
<dd>ETSI is a non-profit organization whose mission is "to determine
and produce the telecommunications standards that will be used
for decades to come". ETSI's work is complementary to W3C's. The
ETSI STQ Aurora DSR Working Group standardizes algorithms for
Distributed Speech Recognition (DSR). The idea is to preprocess
speech signals before transmission to a server connected to a
speech recognition engine. Navigate to http://www.etsi.org/stq/
for more details.</dd>
<dt><br />
<a
href="http://www.java.sun.com/products/java-media/speech/forDevelopers/JSGF/index.html">
Java Speech Grammar Format</a></dt>
<dd>The Java™ Speech Grammar Format is used for defining
context-free grammars for speech recognition. JSGF adopts the
style and conventions of the Java programming language in
addition to traditional grammar notations.<br />
</dd>
<dt><a href="http://www.microsoft.com/IIT/">Microsoft Speech
Site</a></dt>
<dd class="c5">This site describes the Microsoft speech API, and
contains a recognizer and synthesizer that can be
downloaded.</dd>
<dt><br />
<a href="http://www.w3.org/TR/NOTE-voice">NOTE-voice</a></dt>
<dd>This note describes features needed for effective interaction
with Web browsers that are based upon voice input and output.
Some extensions are proposed to HTML 4.0 and CSS2 to support
voice browsing, and some work is proposed in the area of speech
recognition and synthesis to make voice browsers more
effective.</dd>
<dt><br />
<a
href="http://www.bell-labs.com/project/tts/sable.html">SABLE</a></dt>
<dd>SABLE is a markup language for controlling text to speech
engines. It has evolved out of work on combining three existing
text to speech languages: SSML, STML and JSML.</dd>
<dt><br />
<a href="http://www.alphaworks.ibm.com/tech">SpeechML</a></dt>
<dd><i>(IBM's server precludes a simple URL for this, but you can
reach the SpeechML site by following the link for Speech
Recognition in the left frame)</i> SpeechML plays a similar role
to VoxML, defining a markup language written in XML for IVR
systems. SpeechML features close integration with Java.</dd>
<dt><br />
<a href="http://www.w3.org/Voice/TalkML">TalkML</a></dt>
<dd>This is an experimental markup language from HP Labs, written
in XML, and aimed at describing spoken dialogs in terms of
prompts, speech grammars and production rules for acting on
responses. It is being used to explore ideas for object-oriented
dialog structures, and for next generation aural style
sheets.</dd>
<dt><br />
<a href="http://www.w3.org/Voice/WWW8/slide1.html">Voice Browsers
and Style Sheets</a></dt>
<dd>Presentation by Dave Raggett on May 13th 1999 as part of the
Style stack of Developer's Day in <a
href="http://www8.org/">WWW8</a>. The presentation makes
suggestions for extensions to <a
href="http://www.w3.org/TR/REC-CSS2/aural.html">ACSS</a>.</dd>
<dt><br />
<a href="http://www.vxml.org/">VoiceXML site</a></dt>
<dd>The VoiceXML Forum was formed by AT&T, IBM, Lucent and
Motorola to pool their experience. The Forum has published an
early version of the VoiceXML specification. This builds on
earlier work on PML, VoxML and SpeechML.</dd>
</dl>
<h2>10. <a id="summary" name="summary">Summary</a></h2>
<p>The W3C Voice Browser Working Group is defining markup
languages for speech recognition grammars, speech dialog, natural
language semantics, multimodal dialogs, and speech synthesis, as
well as a collection of reusable dialog components. In addition
to voice browsers, these languages can also support a wide range
of applications including information storage and retrieval,
robot command and control, medical transcription, and newsreader
applications. The speech community is invited to review and
comment on working draft requirement and specification
documents.</p>
</body>
</html>