<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <title>Speech and the Future</title>
  <style type="text/css">

.soundbyte {text-align: center}
.new {color: #FF0000; background-color: #FFFF00}</style>
  <link rel="stylesheet" type="text/css" title="W3C Talk"
  href="../../../Tools/w3ctalk-summary.css" />
  <link href="em.css" rel="stylesheet" type="text/css" />
  <link xmlns:xlink="http://www.w3.org/1999/xlink"
  href="../../../People/Berners-Lee/general.css" rel="stylesheet"
  type="text/css" />
</head>

<body xml:lang="en" lang="en">
<h1>Speech and the Future</h1>

<p><code>http://www.w3.org/2004/Talks/0914-tbl-speech/text</code></p>

<p><a
href="http://www.w3.org/People/Berners-Lee/">Tim Berners-Lee</a></p>

<p>Director, World Wide Web Consortium</p>

<p>SpeechTek New York</p>

<p>2004-09-14</p>


<h3 id="Introducti">Introduction</h3>

<p>Good morning, welcome, and thank you for inviting me to speak today. I'm
going to use speech today but without much technology. I won't be using
slides, you'll just have an audio channel. So even though I'm not an expert
on speech technology -- you all probably know more about it than I do -- I am
putting my faith in speech itself as a medium for the next few minutes.</p>

<p>So, as I'm not a researcher at the forefront on speech technology, I'm not
going to be telling you about the latest and greatest advances. Instead I
come to you, I suppose, with four different roles. One, as someone who spent
a lot of effort getting one new technology, the Web, from idea into general
deployment, I'm interested in how we as a technical community get from where
we are now to where we'd like to be. Two, as director of the World Wide Web
Consortium, I try to get an overall view of where the new waves of Web
technology are heading, and hopefully how they will fit together.</p>

<p>With my third hat on I'm a researcher at MIT's Computer Science and
Artificial Intelligence Laboratory (CSAIL). MIT, along with the ERCIM
organization in Europe, and Keio University in Japan, plays host to the
Consortium, and I get an office in the really nifty new CSAIL building, the
Stata Center. I like it for lots of reasons, one of which is the people you
get to talk to. I have chatted to some of my colleagues who actually are
engaged in leading edge research about the future.</p>

<p>And fourth, I come as a random user who is going to be affected by this
technology and who wants it to work well. It is perhaps the role I'm most
comfortable in, because I can talk about what I would like. I don't normally
try to predict the future -- that's too hard -- but talking about what we would
like to see is the first step to getting it, so I do that a lot.</p>

<p>When you step back and look at what's happening, then one thing becomes
clearer and clearer -- that things are very interconnected. If you are a fan
of Douglas Adams and/or Ted Nelson, you'll know that all things are
hopelessly intertwingled, and in the new technologies that is certainly the
case. So I'm going to discuss speech first and then some of the things it
connects with.</p>

<h3><a name="Language" id="Language">Language</a></h3>

<p>Speech is a form of language. Language is what it's all about, in fact.
Languages of different sorts. Human languages and computer languages. This
conference is, in a way, about the difference between them.</p>

<p>Let's think about natural language first. Human language is an amazing
thing. Anyone who is a technologist has to be constantly in awe of the human
being. When you look at the brain and what it is capable of, and you look at
what people are capable of (especially when they actually put their brains
into use), it is pretty impressive. And in fact I'm most impressed by what
people can do when they get together. And when you think about that, when you
look at how people communicate, you find this phenomenon of Natural
Language -- this crazy evolving way words and symbols splash between
different people. While no one can really pin down what any word means,
and while so many of the utterances don't even parse grammatically, still the
end effect is a medium of great power. And of course among the challenges for
speech technology, is that Natural Language varies from place to place and
person to person, and, particularly, evolves all the time. That is speech.</p>

<h3>..Tek</h3>

<p>Now what is technology? Computer technology is mostly made up of
languages, different sorts of language. HTML, URIs and HTTP make the Web
work; all the technology which we develop at the World Wide Web Consortium,
not to mention speech technology, involves sets of languages of a different
kind. Computer languages.</p>

<p>I wrote the original Web code in 1990, along with the first simple specs of
URLs (then UDIs), HTML and HTTP. By 1993 the Web was exploding very rapidly, and
the Information Technology sector had got wind of it and was planning how to
best use this huge new opportunity. Now, people realized that the reason the
Web was spreading so fast was that there was no central control and no
royalty fee. Anyone could start playing with it -- browsing, running a
server, writing software, without commitment, without ending up in the
control of or owing money to any central company. And they knew that it all
worked because the standards HTML, URIs and HTTP were common standards. Now
I'd written those specs originally and they worked OK, but there was a huge
number of things which we all wanted to do which were even more exciting.
So there was a need for a place for people, companies, organizations to come
together and build a new evolving set of standards. And still it was
important to keep that openness.</p>

<h3>W3C</h3>

<p>The answer was the World Wide Web Consortium, W3C, and all you have to do
to join is go to the web site and fill in some forms, pay some money to keep
it going, and find some people who can be involved in developing or steering
new technology. You'll need engineers, because we build things here, and
you'll need communicators because you need to let the community know what
your needs are, and you need to make sure your company understands what's
happening in W3C, and how it will affect them at every level. The Consortium
has around 350 members, and we work in a lot of interconnected areas, from
things like HTML and graphics, to mobile systems, privacy, program integration
which we call Web Services, and data integration which we call the Semantic Web,
...too many things to name -- go to the web site w3.org for details -- just
look at the list of areas in which Web technology is evolving. Speech
technology -- recognition and synthesis -- is one of these areas.</p>

<p>So the business we're in is making open common infrastructure which will
make the base of a new wave of technology, new markets, and whole new types
of business in the future. We all are or should be in that business, and
whether we do it well will determine how big a pie the companies here will be
sharing in the future.</p>

<p>Computer languages are hard, unbending languages with well-defined
grammars. Yes, the technical
terms in something like VoiceXML are defined in English, typically, which is
a natural language -- but English which has been iterated over so much that
effectively, for practical purposes, the technical term -- each tag in
VoiceXML, say -- becomes different from a word. While the meaning of English
words flows with time, the technical term is an anchor point. The meanings of
the terms have been defined by working groups, labored over, established as a
consensus and described beyond all reasonable possibility of practical
ambiguity in documents we call standards -- or at W3C, Recommendations.</p>

<p>Last Tuesday, we added a new one to that set. After many months of hard
work by the <a href="http://www.w3.org/Voice/">Voice Browser Working
Group</a>, the <a
href="http://www.w3.org/TR/2004/REC-speech-synthesis-20040907/">Speech
Synthesis Markup Language</a>, SSML, became a W3C Recommendation. So now two
machines can exchange bits in SSML, and by that can communicate how to
synthesize speech. Now speech synthesis systems can be built out of
components from different manufacturers because there is a standard bus by
which they can be connected. Now you can invest in speech synthesis, in SSML
data, and your own in-house applications which produce SSML knowing that the
data will retain its value; that it won't commit you to a single technology
supplier. This is the sort of thing which builds a market. SSML joins the
<a href="http://www.w3.org/TR/2004/REC-voicexml20-20040316/">VoiceXML 2.0</a>
spec and the <a
href="http://www.w3.org/TR/2004/REC-speech-grammar-20040316/">Speech
Recognition Grammar Specification</a>, which became Recommendations in March.
Coming up, we have Semantic Interpretation ML, and Call Control ML from the
Voice Browser working group, and from the MultiModal Working Group, InkML for
pen-written information, and the Extended MultiModal Annotation language. So
a lot is happening, and it is an exciting time. </p>
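<p>Just to give a flavour of what one of these standards looks like on the
wire, here is a small SSML fragment of my own invention -- the prompt text and
prosody values are purely illustrative, not taken from the spec or from any
real application:</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- A tiny synthesis request: one paragraph, a pause, some emphasis -->
&lt;speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  &lt;p>
    Welcome to the appliance service line.
    &lt;break time="300ms"/>
    &lt;emphasis>Please&lt;/emphasis> say the name of the product, for example
    &lt;prosody rate="slow">refrigerator&lt;/prosody> or furnace.
  &lt;/p>
&lt;/speak>
</pre>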

<p>I know and you know that the standards picture in this area isn't all that
rosy. In the area of integration with HTML, the fact that SALT and HTML+Voice
are competing and are not being developed openly in common is one of the
major concerns which I hear from all sides -- (except perhaps from those who
are betting on taking control of part of the space by controlling a
proprietary specification!)</p>

<p>This sort of tension is the rule for standards. There is always much to be
gained by a company that can take control of a space using proprietary
languages, and then change them slightly every year. There is always a lot
to be gained by all in terms of a larger market by having open standards. I
note that in yesterday's announcement by IBM that some of its speech software
will be going open source, Steven Mills says he wants to "spur the industry
around open standards". He talks about the need to "get the ecosystem going.
If that happens, it will bring more business to IBM". In fact, among the many
areas of W3C work, speech has until now had standards but little open source
support. It will be interesting to see how the IBM contribution affects the
take-off of the whole area.</p>

<p>All I'll say about the SALT/HTML+Voice situation now is that a conference
like this is a good time to think strategically, to weigh the importance of a
solid common foundation for a potentially huge new market area, against short
term benefits there might be from developing your own standards, if you are a
supplier, or of purchasing non-standard technology, if you are a user.</p>

<p>The infrastructure for the connected technology is made up from such
standards, and these standards are written in computer languages, and those
are very different from natural language. The difference between natural
language and computer languages are the chasm which speech technology is
starting to bridge. Speech technology takes on that really difficult task of
making computers communicate with people using human speech, trying to enter
the world of fuzziness and ambiguity. It is really difficult, because
understanding speech is something which human brains can only just do -- in
fact you and I learn to talk just slow enough and just plain enough to be
just understood well enough by a person. When we are understood very
reliably, we tend to speed up or make new shortcuts. So the computer is
chasing the human brain, and that is a challenge at the moment.</p>

<p>I'd end this comparison of the two types of language by noting that
computer languages do also evolve, though in a different way from natural
languages. One of the design goals for the semantic web for data integration
is to allow evolution of data systems, so that new terms can be
introduced which are related to but different from the old terms, and to get
the maximum interoperability between old and new data and old and new
systems.  This is one of the uses of the web ontology language, OWL.</p>
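<p>As a sketch of what that evolution support looks like: OWL lets you say
that a newly coined term means the same as an older one, so that data written
with either term can be merged. The example.org vocabularies here are made up
for illustration:</p>

<pre>
&lt;?xml version="1.0"?>
&lt;!-- Hypothetical vocabularies: version 2 introduces familyName,
     version 1 called the same property surname -->
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  &lt;owl:DatatypeProperty rdf:about="http://example.org/v2#familyName">
    &lt;owl:equivalentProperty rdf:resource="http://example.org/v1#surname"/>
  &lt;/owl:DatatypeProperty>
&lt;/rdf:RDF>
</pre>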

<h3>Speech dialog</h3>

<p>So that you know where I am as a user: my last conversation with a machine was
with a home appliance repair center and was something like the following:</p>

<blockquote>
  <dl>
    <dt>It</dt>
      <dd>What would you like to do? You can make, change or cancel an
        appointment, order a part ...</dd>
    <dt>me</dt>
      <dd>[interrupting] make an appointment</dd>
    <dt>It</dt>
      <dd>You want to make an appointment, right?</dd>
    <dt>me</dt>
      <dd>Right.</dd>
    <dt>It</dt>
      <dd>(pause) I'm sorry. Please say "yes" or "no"</dd>
    <dt>me</dt>
      <dd>Yes.</dd>
    <dt>It</dt>
      <dd>Ok, what sort of a product needs the service? For example, say
        "refrigerator", or "furnace"</dd>
    <dt>me</dt>
      <dd>Washer</dd>
    <dt>It</dt>
      <dd>Ok, so you want to make an appointment to service a washer,
      right?</dd>
    <dt>me</dt>
      <dd>Yes</dd>
    <dt>It</dt>
      <dd>I'm sorry, I didn't get that.</dd>
    <dt>me</dt>
      <dd>Yes!</dd>
    <dt>It</dt>
      <dd>Please say yes or no. You want to make an appointment to service a
        washer, right?</dd>
    <dt>me</dt>
      <dd>Yes!!</dd>
    <dt>It</dt>
      <dd>I'm sorry. Thank you for calling ____ Customer Service. Have a nice
        day.</dd>
  </dl>
</blockquote>

<p>The good news is, I called back, learned to say <em>yeup</em>, and got
through. (The bad news is my washer still isn't working!)</p>
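<p>A recognition grammar which allowed for a few colloquial ways of agreeing
would have saved me that second call. Here is a rough sketch in the Speech
Recognition Grammar Specification format; the particular alternatives are just
my guesses at what callers actually say:</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;!-- Accept a handful of everyday forms of yes and no -->
&lt;grammar version="1.0" mode="voice" root="yesno" xml:lang="en-US"
         xmlns="http://www.w3.org/2001/06/grammar">
  &lt;rule id="yesno" scope="public">
    &lt;one-of>
      &lt;item>yes&lt;/item>
      &lt;item>yeah&lt;/item>
      &lt;item>yeup&lt;/item>
      &lt;item>right&lt;/item>
      &lt;item>no&lt;/item>
      &lt;item>nope&lt;/item>
    &lt;/one-of>
  &lt;/rule>
&lt;/grammar>
</pre>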

<p>(It beat a comparable experience I had with DTMF tones trying to trace an
order for some computer equipment. I called the 1-800 number, went through
a DTMF tree -- if you want to do this press 1, ... and so on ... if you want to
track an order press 9 (9), if it was for a computer press 1 (1), if you want
to talk to somebody about it press 1 (1) -- and talked to somebody about the
problem for 25 minutes, after which she decided to transfer me to someone
else. Thoughtfully, she gave me a number to call if I was disconnected.
Inevitably, I got disconnected almost immediately. I realized the number she
had given me was just the same 1-800 number, so I hit redial. The redial
didn't seem to send enough digits, so I had to hang up and dial again. I
found my way painfully through the tree to the place I should have been, and
talked for another 40 minutes about how to convert my order from something
they could not deliver to something that they could deliver. And by the end
of the process when I was almost exhausted, and just giving the last element
of personal information so they could credit check the new order, my wife
came in, "Tim, the police are here", and sure enough in come the local
police. They'd had a 911 call, and hadn't been able to call back the line,
and so presumed it must be an emergency. Yes, when I had hit <em>redial</em>,
my phone had forgotten the 1-800 number, but remembered the DTMF tones from
the phone tree; 9-1-1. An interesting system design flaw.)</p>

<h3>Speech: long way to go</h3>

<p>Now I've talked to a few people before coming here to give this talk. I've
chatted with people like Hewlett-Packard's Scott McGlashan, very involved in
speech at W3C, and I've also talked to researchers like Stephanie Seneff and
Victor Zue at the Spoken Language Systems (SLS) research group at MIT's
Computer Science and Artificial Intelligence Laboratory, CSAIL, just along
the corridor from my office.</p>

<p>And when I talked to these people, a few things emerged clearly. One is
that speech technology itself has a very long way to go. Another is
that the most important thing may turn out to be not the speech
technology itself, but the way in which speech technology connects to all the
other technologies. I'll go into both those points.</p>

<p>Yes, what we have today is exciting, but it is also very much simpler than
the sorts of things we would really like to be able to do.</p>

<p>Don't get me wrong. VXML and SSML and company are great, and you should be
using them. I much prefer to be able to use English on the phone to a call
center than to have to type in touch-tones. However, I notice that the form
of communication I'm involved in cannot be called a conversation. It is more
of an interrogation. The data I am giving has the rigidity of a form to be
filled in, with the extra constraint that I have to go through it in the
order defined by the speech dialog. Now, I know that VoiceXML has facilities
for me to interrupt, and to jump out from one dialog into another, but the
mode in general still tends to be one of a set of scripts to which I must
conform. This is no wonder. The job is to get data from a person. Data is
computer-language stuff, not natural language stuff. The way we make machine
data from human thoughts has for years been to make the person talk in a
structured, computer-like way. It's not just speech: "wizards" which help you
install things on your computer are similar: straitjackets which make you
think in the computer's way, with the computer's terms, in the computer's
order.</p>
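<p>For concreteness, the kind of script I mean looks roughly like this in
VoiceXML 2.0 -- a form with a field for each piece of data, collected in the
order the dialog defines. The grammar file and the submit URL here are
placeholders of mine, not part of any real service:</p>

<pre>
&lt;?xml version="1.0" encoding="UTF-8"?>
&lt;vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  &lt;form id="appointment">
    &lt;field name="product">
      &lt;prompt>What sort of product needs the service? For example, say
        refrigerator, or furnace.&lt;/prompt>
      &lt;!-- products.grxml would hold an SRGS grammar of product names -->
      &lt;grammar src="products.grxml" type="application/srgs+xml"/>
      &lt;noinput>&lt;prompt>I'm sorry, I didn't get that.&lt;/prompt>&lt;/noinput>
    &lt;/field>
    &lt;filled>
      &lt;submit next="http://example.org/book-appointment"/>
    &lt;/filled>
  &lt;/form>
&lt;/vxml>
</pre>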

<h3>Context feedback</h3>

<p>The systems in research right now, like the SLS group's Jupiter system
which you can ask about the weather, and its Mercury system which can arrange
a trip for you, are much more sophisticated. They keep track of the
context, of which times and places a user is thinking of. They seem to be
happy either gently leading the caller with questions, or being interrogated
themselves.</p>

<p>Here is one example recorded with a random untrained caller who had been
given an access code. The things to watch for include the machine keeping
track of context, and falling back when one tack fails.</p>

<p>[<a href="mercury.wav">speech audio example of Mercury</a>]</p>

<p>Now I understand that when a machine tries to understand a fragment of
speech, or a partly formed sentence, or mumbled words, that the actual
decision it makes about which word must have been said is affected by the
context of the conversation. This is normal also for people: it is just
impossible to extract the information from the noise without some clues. A
person sometimes misunderstands a word if he or she is thinking about the
wrong subject -- sometimes with amusing consequences, like a Freudian slip in
reverse. So this means that speech systems become complex many-layered things
in which the higher layers of abstraction feed down information about what
the person is likely to be saying. (Or maybe what we would like them to be
saying?). So I understand that this is the way speech recognition has to
work. But this architecture prevents the speech system from being separated
into two layers, a layer of speech to text and a layer of natural language
processing.  It means that the simple speech architecture, in which
understanding is a one-way street from audio to syllables to words to
sentences to semantics to actions breaks down.</p>

<p><img src="http://www.w3.org/TR/voice-intro/voice-intro-fig1.gif"
width="559" height="392" alt="block diagram for speech interface framework"
/></p>

<p><em>Figure 1. Speech Interface Framework, from <a
href="http://www.w3.org/TR/voice-intro/">Introduction to and Overview of W3C
Speech Interface Framework</a>. The one-way flow in the top half ignores
context information sent back to ASR, which complicates the
architecture.</em></p>


<p>One of the interesting parts of context feedback is when it is taken all
the way back to the user by an avatar. Human understanding of speech is very
much a two-way street: not only does a person ask questions of clarification,
as good speech dialog systems do today.  A human also gives low-level
feedback with the wrinkling of the forehead, inclining or nodding of the
head, to indicate how well the understanding process is going. </p>

<p>What are the effects of having this context feedback in the architecture? 
One effect is that when a call is passed to a subsystem which deals with a
particular aspect of the transaction, or for that matter to a human being, it
is useful to pass the whole context.  Instead of, "Please get this person's
car plate", it is more like "Please take the car plate of a southern male,
who likes to spell out letters in international radio alphabet, and is
involved in trying to pay his car tax on this vehicle for 2005, and is still
fairly patient. The plate numbers of two cars he has registered before are
in this window, and it's probably the top one."</p>

<p>Why? Well, because a speech system is not an island.</p>

<p>In fact, these systems also have keyboards. They also have pens.</p>

<h2>Multimodal</h2>

<p>The big drive toward speech at the moment, it seems, is the cellphone
market. On mobile phones, speech is the dominant mode of communication. While
they have buttons and screens, they are rather small, and also people tend to
use phones when it would be even more dangerous to be looking at the
screen and using the buttons. However, a phone is in fact a device which
supports a lot more than voice: you can type, it has a camera, and it has a
screen. Meanwhile, the boundaries of the concepts of "phone" and "computer" are
being pushed and challenged all the time by new forms of PDA. The BlackBerry
and the Sidekick are somewhere between computer and phone. The PDA market is
playing with
all kinds of shapes. Computer LCDs are getting large enough to make a
separate TV screen redundant -- and they can be easier to use and program,
and accept many more formats, than typical DVD players. PCs are coming out
which look more like TVs. France Telecom now <a
href="http://www.rd.francetelecom.com/en/technologies/ddm200311/techfiche4.php">proposes</a>
TV over an ADSL (originally phone, now internet) line. The television would
be delivered by IP. The Internet model is indeed that everything runs over
IP, and IP runs over everything. The result is that a platform which embraces
IP becomes open to a very rapid spread of new technologies. This is very powerful.
On my phone, for example, I didn't have an MP3 player -- so I downloaded a
shareware one written by someone in Romania.</p>

<p>So in the future, we can expect phones, like TVs, to become
indistinguishable from small personal computers, and for there to be a very
wide range of different combinations of device to suit all tastes and
situations.</p>

<h3>Device Independence</h3>

<p>In fact the ability to view the same information on different devices was
one of the earliest design principles of the web: Device Independence.
Whereas the initial cellphone architectures such as the first WAP tended to
be vertical stacks, and tended to give the phone carrier and the phone
supplier a monopoly channel of communication, the web architecture is that
any device should be able to access any resource. The first gives great
control and short-term profits to a single company; the second creates a
whole new world. This layering is essential to the independent strong markets
for devices, for communication and for content.</p>

<p>From the beginning, this device independence was a high priority -- you
may remember early web sites would foolishly announce that they were only
viewable by those with 800x600 pixel screens. The right thing to do was to
achieve device independence by separating the actual content of the data from
the form in which it happened to be presented. On screens this is done with
style sheets. Style sheets allow information to be authored once and
presented appropriately whatever size screen you have. Web sites which use
style sheets in this way would find that they were more accessible to people
using the new devices. Also, they would find that they were more accessible
to people with disabilities. W3C has a series of guidelines on how to make
your web site as accessible as possible to people who for one reason or another
don't use eyes or ears or hands in the same way that you might to access your
web site.
So the principle of separation of the form and content, and that of device
independence, are very important for the new world in which we have such a
diversity of gadgets.</p>
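<p>In markup terms the separation is simple: one document can point at
different style sheets for different kinds of device. A minimal sketch, with
hypothetical file names:</p>

<pre>
&lt;!-- In the document head: the same content, styled per medium -->
&lt;link rel="stylesheet" type="text/css" media="screen"   href="desktop.css" />
&lt;link rel="stylesheet" type="text/css" media="handheld" href="phone.css" />
&lt;link rel="stylesheet" type="text/css" media="print"    href="paper.css" />
&lt;link rel="stylesheet" type="text/css" media="aural"    href="spoken.css" />
</pre>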

<p>However, this only allows for differences in size of screen. Yes, a blind
person can have a screen reader read a window - but that isn't a good speech
interface.</p>


<h3>GUI vs Conversation</h3>


<p>There is a much more fundamental difference between a conversational
interface and a window-based one. It was actually the conversational one
which came first for computers. For years, the standard way to communicate
with a computer was to type at a command prompt, for the computer to respond
in text, and wait for you to type again. As you typed, you could list and
change the contents of various directories and files on your system. You'd
only see one at a time, and you'd build a mental image of the whole system as
a result of the conversation.</p>

<p>When the Xerox PARC machines and the Apple Lisa came out with a screen of
"folders", what was revolutionary was that you could see the state of the
things you were manipulating. The shared context -- the nested structure of
folders and files, or the document you are editing -- was displayed by the
computer and seen at each point by the user, so it was a shared
information space with a mutually agreed state. This was so much more
relaxing to use because you didn't have to remember where everything was, you
could see it at each point. That "wysiwyg" feature is something which became
essential for any usable computer system. (In fact I was amazed in 1990 that
people would edit HTML in the raw source without wysiwyg editors.)</p>

<p>Now, with speech, we are in the conversational model again. There is no
shared display of where we are. The person has to remember what it is that
the computer is thinking. The computer has to remember what it thought the
person was thinking. The work at SLS and the clip we heard seem to deal with
the conversational system quite effectively. So what's the problem?</p>

<p>The challenge in fact is that people won't be choosing one mode of
communication, they will be doing all at once. As we've seen, a device will
have many modes, and we have many devices. Already my laptop and phone are
becoming more aware of each other, and starting to use each other -- but only
a little. They are connected by bluetooth - but why can't I use the camera on
my phone for a video chat on my PC? Why can't I use my PC's email as a
voicemail server and check my email from my phone while I drive in, just as I
check my voicemail? To get the most out of computer-human communications, the
system will use everything at once. If I call about the weather and a screen
is nearby, a map should come up. If I want to zoom in on the map, I can say
"Zoom in on Cambridge", or I can point at the map, or I can use a gesture
with a pen on the surface -- or I can type "Cambridge", I can use the
direction keys, or click with a mouse. Suddenly the pure conversational
model, which we can do quite well, is broken, and so is the pure wysiwyg
model. Impinging on the computer are spoken and typed words, commands,
gestures, handwriting, and so on. These may refer to things discussed in the
past, to things being displayed. The context is partly visible, partly not.
The vocabulary is partly well-known clickstream, partly English which we are
learning to handle, and partly gestures for which we really don't have a
vocabulary, let alone a grammar. The speech recognition system will be
biasing its understanding of words as a function of where the user's hands
are, and what his stance is.</p>

<p>System integration is typically the hairiest part of a software
engineering project: glueing it all together. To glue together a multimedia
system which can deal with all the modes of communication at once will need
some kind of framework in which the very different types of system can
exchange state. Some of the state is hard (the time of the departing plane --
well the flight number at least!), some soft and fuzzy (the sort of time the
user was thinking of leaving, the fact that we are talking travel rather than
accommodation at the moment). So speech technology will not be in a vacuum.
It will not only have to make great strides to work at all -- it will have to
integrate in real time with a host of other very different technologies.</p>

<h3>Back end</h3>

<p>I understand that there are a number of people here involved in call
center phone tree systems. I will not hold you personally responsible for
all the time I spend with these systems -- in fact, I know that speech
technology will actually shorten the amount of time I spend on the phone. I
won't even demand you fix my washing machine.</p>

<p>But while we are here, let me give you one peeve. I speak, I suspect, for
millions when I say this. I am prepared to type in my account number, or
even sometimes my social security number. I am happy, probably happier, to
speak it carefully out loud. However, once I have told your company what my
account number is, I never ever on the same call want to have to tell you
again. This may seem peevish, but sometimes the user experience may have been
optimized within a small single region, but as a whole, on the large scale,
is a complete mess. Sometimes it is little things. Track who I am as you pass
me between departments. Don't authenticate me with credit card number and
zipcode before telling me your office is closed at weekends. Try to keep your
left hand aware of what the right hand is doing.</p>

<p>Actually, I know that this is a difficult problem. When I applied to have
my green card extended, I first filed the application electronically, then I
went to the office to be photographed, fingerprinted again, and I noticed
that not only did each of the three people I talked to type in my application
number, but they also typed in all my personal details. Why? Because they
were different systems. When I talk to CIOs across pretty much any industry,
I keep hearing the same problem - the stovepipe problem. Different parts of
the company, the organization, the agency, have related data in different
systems. You can't integrate them all, but you need to be able to connect
them. The problem is one of integrating data between systems which have been
designed quite independently in the past, and are maintained by different
groups which don't necessarily trust or understand each other. I mention this
because this is the problem the semantic web addresses. The semantic web
standards, RDF and OWL, also W3C Recommendations, are all about describing
your data, exporting it into a common format, and then explaining to the
world of machines how the different datasets are actually interconnected in
what they are about, even if they were not physically interconnected. The
Semantic Web, when you take it from an enterprise tool to a global system,
actually becomes a really powerful global system, a sort of global
interconnection bus for data. Why do I talk about this? Because the semantic
web is something people are trying to understand nowadays. Because it
provides a unified view of the data side in your organization, it is
important when we think about how speech ties in with the rest of the system.
And that tying in is very important.</p>

<h3>Semantic Web explanation</h3>

<p>When you use speech grammars and VoiceXML, you are describing possible
speech conversations. When you use XML Schema, you are describing documents.
RDF is different. When you use RDF and OWL, you are talking about real
things. Not a conversation about a car, or a car licence plate renewal form,
but a car.</p>

<p>The fact that a form has one value for a plate number will pass with the
form. The fact that a car has one unique plate number is very useful to know
- it constrains the form, and the speech grammars. It allows a machine to
know that two cars in different databases are the same car.</p>

<p>Because this information is about real things, it is much more reusable.
Speech apps will be replaced. Application forms will be revised, much more
often than a car changes its nature. The general properties of a car, or a
product of your company, of real things, change rarely. They are useful to
many applications. This background information is called the
<em>ontology</em>, and OWL is the language it is written in.</p>

<p>And data written in RDF labels fields not just with tag names, but with
URIs. This means that each concept can be allocated without clashing with
someone else's. It also means that when you get some semantic web data,
anyone or anything can go look up the terms on the web, and get information
about them. Car is a subclass of vehicle.</p>
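<p>A small sketch of that sort of background knowledge, written in OWL; the
example.org vehicle vocabulary is invented for illustration:</p>

<pre>
&lt;?xml version="1.0"?>
&lt;rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">

  &lt;!-- Car is a subclass of Vehicle -->
  &lt;owl:Class rdf:about="http://example.org/vehicles#Car">
    &lt;rdfs:subClassOf rdf:resource="http://example.org/vehicles#Vehicle"/>
  &lt;/owl:Class>

  &lt;!-- Each car has at most one plate number; this is a fact about cars,
       not about any particular form or dialog -->
  &lt;owl:DatatypeProperty rdf:about="http://example.org/vehicles#plateNumber">
    &lt;rdf:type rdf:resource="http://www.w3.org/2002/07/owl#FunctionalProperty"/>
    &lt;rdfs:domain rdf:resource="http://example.org/vehicles#Car"/>
  &lt;/owl:DatatypeProperty>
&lt;/rdf:RDF>
</pre>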

<p>It is no use having a wonderful conversation with a computer about the
sort of vacation you would like to have, if at the end of the day you don't
have a very well-defined dataset with precise details of the flights, hotels,
cars and shows which that would involve. Data which can be treated and
understood by all the different programs which will be involved in bringing
that vacation into existence. There is a working draft <a
href="http://www.w3.org/TR/semantic-interpretation/">Semantic Interpretation
for Speech Recognition</a> which is in this area, although it does not
ground the data in the semantic web.</p>

<h3>Closing the loop</h3>

<p>At the moment speech technology is concentrated in business-to-consumer
(B2C) applications, where it seems the only job is to get the data to the
back-end. But I'd like to raise the bar higher. When I as a consumer have
finished a conversation and committed to buying something, I'd like my own
computer to
get a document it can process with all the details. My computer ought to be
able to connect it with the credit card transaction, and tax forms, expense
returns and so on. This means we need a common standard for the data. The
semantic web technology gives us RDF as a base language for this, and I hope
that each industry will convert or develop the terms which are useful for
describing products in their own area.</p>

<p>In fact, the development of ontologies could be a great help in
developing speech applications. The ontology is the modeling of the real
objects in question -- rental cars, flights and so on, and their properties --
number of seats, departure times and so on. This structure is the base of
understanding of speech about these subjects. It needs a lot of added
information about the colloquial ways of talking about such things. So far
I've discussed the run-time problem -- how a computer can interact with a
person. But in fact a limiting factor can also be the problems designers have
creating all the dialogs and scripts and so on which it takes to put together
a new application. In fact the amount of effort which goes into a good speech
system is very great. So technology which makes life easier for application
designers can also be a gating factor on deployment.</p>

<h3>Conclusion</h3>

<p>The picture I end up with when I try to think of the speech system of the
future is a web. Well, maybe I think of everything as a web. In this case, I
think of a web of concepts, connected to words and phrases, connected to
pronunciation, connected to phrases and dialog fragments. I see also icons
and style sheets for physical display, and I see the sensors that the
computer has trained on the person connected to layers of recognition systems
which, while feeding data from the person, are immersed in a reverse stream
of context which directs them as to what they should be looking for.</p>

<p>Speech communication by computers has always been one of those things
which turn out to be more difficult than they seemed at first -- and that
through five decades now.</p>

<p>It happens that as I was tidying the house the other day I just came
across a bunch of Isaac Asimov books, and got distracted by a couple of
stories from <em>Earth is Room Enough</em>. In most Asimov stories,
computers either communicate very obscurely using teletypes, or they have
flawless speech. He obviously thought that speech would happen, but I haven't
found any stories about the transition time we are in now. The short story
<em>Someday</em> is one of the ones set in the post-speech era. At one point
the young Paul is telling his friend Niccolo how he discovered all kinds of
ancient computers -- and these squiggly things (characters) which people had
to use to communicate with them.</p>

<blockquote>
  <p>"Each different squiggle stood for a different number. For 'one', you
  made a kind of mark, for 'two' you make another kind of mark, for 'three'
  another one and so on."</p>

  <p>"What for?"</p>

  <p>"So you could compute"</p>

  <p>"What <em>for?</em> You just tell the computer---"</p>

  <p>"Jiminy", cried Paul, his face twisting in anger, "can't you get it
  though your head? These slide rules and things didn't talk<em>.</em>"</p>
</blockquote>

<p>So Asimov certainly imagined we'd get computers chatting seamlessly, and
the goal seems, while a long way off now, attainable in the long run.
Meanwhile, we have sound technology for voice dialogs which has been developed
past prototypes to the level of standards. The important thing for users is to
realize what is possible and what isn't, as it is easy to expect the world
and be disappointed, but also a mistake not to realize that here is a very
usable technology which will save a lot of time and money. And please remember
that when you think about saving time, it's not just your call center staff's
time, it is the user's time. It may not show up directly on your spreadsheet,
but it will show up indirectly if frustration levels cause users to switch. So
use this conference to find out what's happening, and remember to check about
standards conformance.</p>

<p>In the future, integration of speech with other media, and with the semantic
web for the data, will be a major challenge, but will be necessary before the
technology can be used to its utmost.</p>
<hr />
</body>
</html>