Speech user interfaces: An interview with James Lewis

Published: June 2011

James Lewis talks to Gerry Gaffney about speech recognition and voice user interface design.

Gerry Gaffney:

This is Gerry Gaffney with the User Experience podcast.

My guest today has been a human factors engineer with IBM for thirty years. He has a lengthy list of research papers to his name; he’s been designated IBM Master Inventor three times and now holds over seventy patents.

The reason I invited him to the User Experience podcast was because of his excellent and highly readable new book, Practical Speech User Interface Design. Dr James Lewis, welcome to the User Experience podcast.

James Lewis:

Thank you very much, and feel free to call me Jim.

Gerry:

Okay, will do. Why is it that you became interested in speech user interfaces in the first place?

Jim:

Well one answer is, who wouldn’t be interested? It’s been there in all the science fiction films, it’s actually there… more and more in our day-to-day lives. So just from that it’s interesting. But for me it was partly being in the right place at the right time. It was around 1993 that I had an opportunity to pursue a PhD courtesy of IBM and I had to decide what area to pursue.

There was a local university where I could have looked into vision research but they also had a very good psycholinguistics program and it was at around 1993/1994 that IBM was looking into the possibilities of commercializing their speech technologies. So that was the path that I went ahead and chose.

Gerry:

Good speech recognition is one of those things that’s always just around the corner. It seems it’s been just around the corner for certainly as long as I can remember. Has that situation changed now? Is it really just around the corner?

Jim:

Well it turned the corner about ten years ago in terms of having good enough speech recognition to work over the telephone. There are so many aspects to speech. When I first started working on speech I worked on dictation systems and there if you get a word wrong then you have to go through the trouble of correcting it and even if you have a very high recognition rate, even in the, say if it’s 95 percent, that still means one out of every 20 words is wrong which can be some pretty extensive proof reading and correction.

When it comes to understanding what someone has said well enough to match it against a smaller set of alternatives than a dictation system, for example the kinds of choices you usually get in an interactive voice response system or an IVR over the telephone, then the demands of recognition are actually substantially reduced.
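To make the dictation arithmetic concrete, here is a minimal Python sketch. It assumes every word is recognized independently with the same accuracy, which is a simplification, and the function name and the 500-word letter are illustrative rather than anything from the interview.

```python
def expected_errors(word_count: int, accuracy: float) -> float:
    """Expected number of misrecognized words in a dictated text.

    Assumes each word is recognized independently with the same accuracy.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return word_count * (1.0 - accuracy)


# A hypothetical 500-word letter dictated at 95 percent word accuracy:
# roughly one word in twenty needs to be found and corrected.
errors_in_letter = expected_errors(500, 0.95)
```

Under these assumptions the letter would contain about 25 words to proofread and correct, which is why a smaller set of alternatives is so much easier on the recognizer.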

Gerry:

I guess that brings us reasonably neatly into a question about, without getting very technical obviously, but can you tell us a little bit about finite state grammars and statistical language models?

Jim:

Well, sure. Those are the two major approaches in current operation to speech recognition over the telephone, and a finite state grammar is more associated with what you might call directed dialog. [Jim used the term "command and control" in the interview, but corrected it in a subsequent email.] They tend to be hand crafted. It’s possible to build them automatically from examples but that’s not the typical practice.

More often if you have a set of five, six or seven alternatives, they’re going to be available to somebody at a given point in time, then someone will hand craft that particular grammar, putting in place what the options generally would be, perhaps some synonyms for those options, maybe a little bit of filler before and after to allow it to be slightly more conversational. But the key word is finite.

Gerry:

Can you give us an example Jim?

Jim:

Well, sure. Suppose it’s a point in an interaction where I’m going to ask you what type of account you want to talk to someone about, and so the prompt could be something like, “Are you interested in your checking, savings or money market account?” At a minimum the grammar would have to have “checking”, “savings” and “money market”, and in the grammar you can assign those to have what’s called a semantic interpretation… to have its meaning, and in the case of an IVR that typically would be either information that gets screen-popped to the agent who takes the call or it would direct the call further down the appropriate path for whatever questions might follow.
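The kind of hand-crafted grammar Jim describes can be sketched in a few lines of Python. This is only an illustration: it matches text that a recognizer has already produced rather than audio, and the synonyms, filler phrases and semantic labels such as MONEY_MARKET are assumptions, not taken from any real system.

```python
import re

# Hypothetical grammar: each semantic interpretation maps to the spoken
# options (and synonyms) that should produce it.
ACCOUNT_GRAMMAR = {
    "CHECKING": ["checking", "checking account"],
    "SAVINGS": ["savings", "savings account"],
    "MONEY_MARKET": ["money market", "money market account"],
}

# Optional filler before and after the option, to allow slightly more
# conversational responses. The empty string means "no filler".
PRE_FILLER = ["", "my", "i want my", "i'd like to work with my"]
POST_FILLER = ["", "please"]


def build_phrases(grammar, pre, post):
    """Enumerate every accepted utterance; the set is finite by construction."""
    phrases = {}
    for meaning, options in grammar.items():
        for option in options:
            for before in pre:
                for after in post:
                    parts = (before, option, after)
                    utterance = " ".join(part for part in parts if part)
                    phrases[utterance] = meaning
    return phrases


def normalize(text):
    """Lower-case, straighten apostrophes and strip punctuation."""
    text = text.lower().replace("’", "'")
    text = re.sub(r"[^a-z' ]", " ", text)
    return " ".join(text.split())


def interpret(utterance, phrases):
    """Return the semantic interpretation, or None if out of grammar."""
    return phrases.get(normalize(utterance))


PHRASES = build_phrases(ACCOUNT_GRAMMAR, PRE_FILLER, POST_FILLER)
```

With this sketch, `interpret("I'd like to work with my money market.", PHRASES)` yields the money market label, while anything outside the enumerated phrases, such as "mortgage", returns `None`. The returned label stands in for the value that would be screen-popped to an agent or used to route the call.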

Gerry:

So when we’re talking finite there are just a limited number of phrases or words that the thing will recognise?

Jim:

Right. Yes and here’s where you get kind of an interplay between the prompt and the grammar. So if I say something like, “Would you like to work with your checking, savings or money market account?”, then someone may well respond, “I’d like to work with my money market.”

So you also need to be prepared for that. If you’re not going to be prepared to make that kind of investment in the grammar then you had better make sure that the prompt doesn’t encourage someone to respond in that way.

Gerry:

Because in fact your prompt is also ambiguous there because, “Would you like to…?” I could just answer “Yes” to that in fact.

Jim:

Yes you could but, if you did that you’d be a bad conversational partner. Like a person when you say, do you know what time it is? And they say, yes.

Gerry:

I would never do that.

Jim:

Exactly and it actually is interesting that when you write a prompt like that, a lot of times when it goes through an initial review with a client they’ll say well what if someone just says yes? Or they just say no? In my experience that never happens. You can code for it. You might have in your grammar “yes” and “no” and if someone responds to the prompt with “yes” then you can re-prompt it in a more direct way. You know, please say “checking, savings or money market.”
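The re-prompting Jim mentions might look something like the following Python sketch. The function name, the action labels and the routing values are hypothetical, and the choice to use the same direct re-prompt for any unrecognized response is an assumption made here for brevity, not something stated in the interview.

```python
# Hypothetical mapping from recognized options to routing labels.
OPTIONS = {
    "checking": "CHECKING",
    "savings": "SAVINGS",
    "money market": "MONEY_MARKET",
}

# Responses that answer the literal question rather than naming an account.
YES_NO = {"yes", "no"}

INITIAL_PROMPT = (
    "Would you like to work with your checking, savings "
    "or money market account?"
)
DIRECT_PROMPT = "Please say checking, savings or money market."


def next_step(recognized):
    """Decide what the dialog does after one recognized response.

    Returns a pair: ("route", label) or ("prompt", text to play next).
    """
    text = recognized.strip().lower()
    if text in OPTIONS:
        return ("route", OPTIONS[text])
    if text in YES_NO:
        # The caller answered the literal question; ask more directly.
        return ("prompt", DIRECT_PROMPT)
    # Assumption: treat anything else the same way as a yes/no answer.
    return ("prompt", DIRECT_PROMPT)
```

A caller who says "yes" to the initial prompt gets the more direct prompt next, and a caller who says "savings" is routed straight away.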

Gerry:

So presumably then statistical language models are more complex than these finite state grammars?

Jim:

Yes they are. They actually come out of dictation systems. So the idea is that you wouldn’t necessarily have for the kinds of systems we’re talking about that have a well known domain… For example, suppose it’s a banking application; you wouldn’t have a fully fledged dictation system but you might well have a subset of a full dictation system which then is prepared to translate the speech into text quite freely. It’s not a finite state system, because this other approach, which uses what are called statistical language models, typically has a dictionary. It has all the words that it can recognize, and it can’t recognise anything outside of the dictionary, but in terms of word order it has a statistical model which will bias the recognition of the word depending upon the context in which it’s spoken.

So that’s how they, for example, handle homonyms: “I want to go to the store to buy two cartons of eggs too.” Those have the three different versions of two: TO, TWO and TOO, and they would be decoded properly because of the surrounding context.
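A toy version of this context bias can be written with bigram counts in Python. Everything below is a sketch under stated assumptions: the counts are invented for illustration, the `*` token stands in for the ambiguous sound, and a real statistical language model would be trained on a large corpus and combined with acoustic scores.

```python
from collections import Counter

HOMOPHONES = ("to", "too", "two")
AMBIGUOUS = "*"   # placeholder for a sound that could be any homophone
START, END = "<s>", "</s>"

# Hypothetical bigram counts (previous word, next word), invented here.
BIGRAMS = Counter({
    ("want", "to"): 5, ("to", "go"): 5,
    ("go", "to"): 5, ("to", "the"): 5,
    ("store", "to"): 2, ("to", "buy"): 5,
    ("buy", "two"): 4, ("two", "cartons"): 4,
    ("eggs", "too"): 3, ("too", END): 3,
})


def score(prev, word, nxt):
    """Add-one smoothed product of the left and right bigram counts."""
    return (BIGRAMS[(prev, word)] + 1) * (BIGRAMS[(word, nxt)] + 1)


def decode(tokens):
    """Replace each ambiguous token with the best-scoring homophone.

    Assumes no two ambiguous tokens are adjacent.
    """
    padded = [START] + list(tokens) + [END]
    result = []
    for i in range(1, len(padded) - 1):
        token = padded[i]
        if token == AMBIGUOUS:
            prev, nxt = padded[i - 1], padded[i + 1]
            token = max(HOMOPHONES, key=lambda w: score(prev, w, nxt))
        result.append(token)
    return result


heard = "i want * go * the store * buy * cartons of eggs *".split()
decoded = " ".join(decode(heard))
```

With these invented counts, `decoded` comes out as "i want to go to the store to buy two cartons of eggs too", because each candidate spelling is scored against the words on either side of it.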

Gerry:

It’s interesting you talk about words because one of the things I guess people don’t realise is how difficult it is to actually sort out words from a string of speech because of this phenomenon of coarticulation. Do you want to tell us a little bit about that?

Jim:

You can look at coarticulation in a couple of ways. You could say that it refers to the laziness of the human speech production mechanism because when we speak the exact phonemes that we produce are influenced by the phonemes that have just occurred and the phonemes that we’re about to produce.

So that actually makes recognition of the speech signal kind of tricky. Now on the other hand inside of our own brains we’re actually wired to decode all of that quite effortlessly – and – in – fact – if – I – produced – each – word – precisely – then that would lead to a loss of efficiency in communication.

Gerry:

It sounds very unnatural as well.

Jim:

Right, sounds unnatural and it’s inefficient.

Gerry:

Now most people would be familiar with recorded announcements. I mean we’ve been talking about how machines recognise speech, I guess. But most people are familiar with recorded announcements that are made by humans and often spliced together, you know, railway announcements and the like, airport announcements.

Increasingly though machines themselves are talking to us. Is it easy for machines to generate speech?

Jim:

Well it’s easy for them to do it and the difficult aspect is the way they do it is sometimes jarring to our expectations about prosody, fluidity; it’s just basically the elements that make speech sound natural. So there are three major ways to produce speech; one is with recorded segments, as you mentioned, and that is the most common that you will find in commercial interactive voice response systems.

For artificial speech production there are two techniques, one is called formant and the other is called concatenative. Now formant is the older technology and it is reasonably intelligible and has been for at least a decade and a half, but it also tends to be quite unnatural sounding; it kind of has that robotic sound to it.

Gerry:

This is the Stephen Hawking speech synthesizer type of …

Jim:

Yes, although I think they may have hooked him up with a… I think that he is probably a real target for people who would like to get better and better speech production out there. But if you think about old Stephen Hawking especially then I do believe those were being produced with formant text-to-speech, so kind of what we have in our heads for speech are formant.

Concatenative is a type of artificial speech which is based on the decomposition and analysis of a human’s speech. So normally you’ll get a professional voice talent to record a wide range of sentences and from those sentences snippets of the speech will be extracted, typically what are called diphones, two co-occurring sounds. Then in the final aspect of the technology, when a string is sent to the text-to-speech system to produce speech, it’ll analyse the text and try to find the largest segments that it has available to produce, and then engage in some very high powered computation to work out how to splice everything together to make it sound as natural as it can.
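The "find the largest segments available" step can be sketched as a greedy longest-match search in Python. This is a deliberate simplification: it works on whole words rather than diphones, the inventory of recorded units is hypothetical, and it leaves out the signal processing that real systems use to splice segments smoothly.

```python
def select_units(target, inventory):
    """Cover the target text with the longest recorded units available.

    target: the text to be spoken.
    inventory: a set of word sequences that exist as recorded segments.
    Returns the chosen units in playback order.
    """
    words = target.lower().split()
    units = []
    i = 0
    while i < len(words):
        # Try the longest span starting at word i first, then shrink.
        for j in range(len(words), i, -1):
            candidate = " ".join(words[i:j])
            if candidate in inventory:
                units.append(candidate)
                i = j
                break
        else:
            raise KeyError(f"no recorded unit covers {words[i]!r}")
    return units


# Hypothetical inventory extracted from a voice talent's recordings.
RECORDED = {
    "your", "balance", "is", "your balance is",
    "forty", "two", "dollars", "forty two",
}

plan = select_units("Your balance is forty two dollars", RECORDED)
```

Here `plan` uses the three-word segment "your balance is" and the two-word segment "forty two" in preference to single words, leaving fewer joins to smooth over.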

Gerry:

And it’s interesting you talk about a voice talent. I must admit when I first worked on a speech project a few years ago I thought oh I could just record it [Laughs] but I think it really is essential, isn’t it, to have a professional voice talent working for you?

Jim:

It is and it’s important to have a voice talent who is used to doing the kinds of things that you need done for an interactive voice response system. There are voice talents out there who specialise in things like radio spots and advertisements and that’s a different skill from producing with – how can I put it? – producing with top quality consistency snippets of speech that are going to be played one after the other in different orders.

Gerry:

Yes it’s certainly quite impressive when you get the results of a couple of days’ work with a voice talent and you realise how much they’ve actually covered off and how consistent they actually are.

Jim:

Yes and especially when you have to go back and make a change and then they produce something that fits effortlessly in with the audio they’ve already produced.

Gerry:

Now, Jim, most people, you know, are familiar with other people saying, I just want to talk to a human. I’m sure we’ve all had that reaction when we get onto an organisation that we might not be particularly happy with at that point in time. You’ve quoted some very interesting research, and I think done some research yourself, in the course of preparation for the book about the extent to which people are accepting of talking to machines. Can you tell us what the current state of play is in that regard?

Jim:

Well, you know, IVRs serve two primary purposes. One is to basically find out what a caller’s need is and to route them appropriately, that’s something that happens at the beginning of the IVR. Once you’re routing a person then there are certain kinds of tasks which can be easily… provided as self-service. For example, what are the last five cheques that cleared my bank and there are other types of things that to manage them with any sort of automated self-service is decades, if not more, off in the future.

So one of the things in designing an IVR is to provide self-service where it will be useful and not to get in the way of letting people talk to someone if that’s what they actually need to do, or if that’s just what they want to do. But if you look at the statistics you can see that you can classify tasks from those that have a high preference in most populations to do with self-service just because you get a more consistent result and you don’t have to talk to someone, you can just get the information you want quickly and then move on, to those kinds of tasks that again as I mentioned, simply are not currently feasible with self-service.

Gerry:

Sure, and I guess talking about, you know speaking to IVRs one of the other areas that’s becoming interesting recently is the use of voice biometrics. Is that a reliable technology?

Jim:

Well it depends on what you want to do with it and it is probably not currently reliable enough to stand alone in a single-token authentication but in a multi-token authentication setting it certainly can play a role, and does for a number of enterprises. For example Bell Canada uses it.

Gerry:

So they identify the caller but they also have a secondary check, a secondary token to ensure the security. Is that right?

Jim:

Yes, they may actually ask the person for some piece of identifying information. For a cellular telephone company just knowing that the phone that’s calling you is a phone that’s in your system can provide a certain level of authentication. That coupled with voice biometrics makes a good two-token authentication system.

Gerry:

One of the things that you’ve touched on briefly but I’d be interested in I guess a little bit from you about conversational maxims and human discourse and what the implications are in terms of trying to design machines that blend in with these requirements, if you like.

Jim:

Sure. In coming to speech from what had formerly been a more traditional human factors background, there are certain aspects of linguistics that I touched on in the pursuit of my PhD but, you know, linguistics is an enormously broad area and even psycholinguistics, which is a narrow slice of that, is still very broad.

But when it comes to trying to understand speech in use then you get into an area that’s usually referred to as pragmatics and that can be a very important aspect of communication to understand. So when you think about things like discourse markers then those are aspects of communication that wouldn’t make much sense in isolation but they do make sense in a conversation. Things like when is it alright to say okay? When is it appropriate or inappropriate to say sorry? Or one thing that comes up sometimes is if you know that you’re about to have to shift the person, shift the caller away from their spoken intention of what they want to do, then an expression like “By the way” is something that can be used to ease the transition.

So the study of discourse markers is part of pragmatics and then another aspect that has been enormously influential are what are called the Gricean maxims and they date to the 1960s from a philosopher who was trying to articulate the aspects of language that cause people to have to draw implications from what they’ve heard.

So for an example he had, suppose you hear this in conversation: A and B are discussing a mutual friend C who has taken a new job at a bank. A asks B how C is doing and B says, “Oh, quite well. I think he likes his colleagues and he hasn’t been to prison yet.” Well, that last piece, that’s the spice of conversation when we’re talking to one another to say things that are not just simple expressions of fact but instead touch on literature, culture, force your conversational partner to try to keep up with you a little bit and that’s where a lot of the joy of conversation comes from.

Well Grice was trying to understand that so in order to understand how we try to force people to understand implication, he laid out his maxims of normal discourse which do not require those kinds of, that kind of extra cognition when you’re communicating with someone, and he labeled those, he had four maxims [Jim used "steps" during the interview but corrected it in a subsequent email]: quality, quantity, relation and manner.

Now the quantity maxim is basically be as informative as required because if you’re not then communication breaks down. But the flipside is don’t be more informative than is required. If you are more informative than is required that puts people in the position of trying to understand what your implication was when you added the unnecessary text. For example, “he hasn’t been to prison yet.” Well that actually does mean something. It plays a beautiful role in human to human speech, it would not be appropriate in an IVR.

Now going to the elements of manner, I’ll just go and list those: avoid obscure expressions, avoid ambiguity, be brief and be orderly. And I don’t know how many IVRs you’ve encountered that are just excessively wordy. It is a real challenge to try to script an IVR so that it conveys all the information it needs to but does so in as brief and orderly a way as possible.

Gerry:

Another area that’s interesting, Jim, is there’s been a lot of research into the phenomenon of people treating machines as if they’re humans and I’m thinking of Nass and Reeves’ The Media Equation book for example and some of the research they quote there. Is that a factor when we’re designing speech user interfaces?

Jim:

Well it is a factor not so much in the Nass and Reeves, or I believe that Nass and Brave also had a book out that was more specifically on speech. A critical part of the scripting and design of an interactive voice response system is to maintain an appropriate service provider to customer social role, and that is an area that I had to do a lot of additional research in because it’s not something that is typically part of a normal basic human factors engineering curriculum. And indeed these aspects of social roles and language and speech were not part of the PhD program that I took either. But fortunately the other person who does this in my department, a woman named Melanie Polkosky, is a real expert in that area and so through conversations with her and her leading me to the appropriate areas of both market research and social psychology that cover this important aspect of maintaining an appropriate tone if I’m a service provider with a customer. That’s been very valuable.

Gerry:

Now you do caution in the book, you suggest that people shouldn’t sweat too much the production of a persona.

Jim:

Yes, that’s true. It’s important to get the social role right for service provider to customer, but from a service provider role many of the elements of what you have to do to get that right are already well known and indeed were expressed quite clearly in a book that I encountered, I believe it was around 1999 or maybe it was 2000, I think it was a little bit before that, by Bruce Balentine and in there they had… the characteristics of the expert call centre agent. And if you take those to heart when you’re scripting, it can really help you avoid making certain kinds of scripting blunders.

Gerry:

To get to something very specific, one of the things that surprised me rather in the book is that you make the case for broad and at times even very lengthy menus. I think most designers would have as a matter of faith that lengthy lists are to be avoided as they tax the memory too much.

Jim:

Well that was certainly something I believed for a long time and was something that I did carry into my own design practice when I first started doing interactive voice response because it was something that basically everyone knew.

But there were a couple of things that led me away from believing that as a matter of course. One is that in the 1980s I knew people who were involved in some of the arguments that were going back and forth about how many menus to put in the, how many options to put in a visual menu, and there was a time when people even looked at visual menus and you could find sets of guidelines that would say, well you know, 7 plus or minus 2, George Miller.

Gerry:

Yeah, the magic number.

Jim:

Right. And yet can you imagine web pages restricted to 7 plus or minus 2 links?

Gerry:

Not only can I imagine them but I’ve seen them.

Jim:

You may have seen them. Were they a joy to go through?

Gerry:

Definitely not.

Jim:

Right, and also one of the people that I studied with when I was getting my Master’s degree in engineering psychology was Kenneth Paap and he wrote the chapter on menu design for the, I believe it was the 1997 edition, the most recent edition of the Handbook of Human-Computer Interaction, and he had this really great quote in there.

So most of this chapter has to do with the design of visual menus but in a section on auditory menus he has, this is in reference to auditory menus: “Users don’t actively search for a target, rather they must wait and try to ambush it.”

And so if you think of at least the interaction with a traditional speech menu where an option is spoken and you know there should be at least a short pause, next option, next option. Do you memorise the options?

We rapidly came to realise that, as we were… trying to build these things, that people are not memorising all of the options. You enter an IVR with a goal in mind. So that’s one thing you have to hold in working memory – I’ve got a goal. Now feed me options. If I like the option but it’s not, it isn’t quite close enough to get me to commit to it, so I have to hold that one in place, that’s my best option so far. As I review the others, and it takes some working memory to review those, well I hold these two other pieces of information in place, but that’s pretty much it. I’m not trying to hold a bunch of them in place, just what my goal is, the best I’ve heard so far and whatever the next one is up until the point that I hear one that’s close enough that I commit to it. So you could conceivably listen to a hundred choices like that; wouldn’t be pleasant but by no means are you trying to memorise them.
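The listening strategy Jim describes, holding only the goal, the best option so far and the option currently being heard, can be sketched in Python. The word-overlap scoring, the 0.8 commitment threshold and the example menu are all assumptions made for illustration; they are not a model of human memory.

```python
def word_overlap(goal, option):
    """Fraction of the goal's words that appear in the spoken option."""
    goal_words = set(goal.lower().split())
    option_words = set(option.lower().split())
    return len(goal_words & option_words) / len(goal_words)


def listen_to_menu(goal, options, commit_at=0.8):
    """Simulate a caller waiting to 'ambush' a good-enough option.

    Only three things are held at any time: the goal, the best option
    heard so far, and the option currently being spoken.
    Returns (chosen option, number of options heard).
    """
    best_option, best_score = None, 0.0
    heard = 0
    for option in options:
        heard += 1
        current_score = word_overlap(goal, option)
        if current_score >= commit_at:
            return option, heard          # close enough: commit now
        if current_score > best_score:
            best_option, best_score = option, current_score
    return best_option, heard             # fall back to the best so far


MENU = [
    "account balance",
    "recent transactions",
    "card services",
    "report a lost card",
    "speak to an agent",
]

choice, heard_count = listen_to_menu("lost card", MENU)
```

In this run the caller holds "card services" as the best candidate until "report a lost card" is spoken, then commits after hearing four options. The menu could be far longer without adding to what has to be held in memory at any moment.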

Gerry:

Yes and I know in the book you did have one example, and I can’t put my finger on it now of course, but it was very lengthy, it was I don’t know it was some dozens of words.

Jim:

And that was Susan Hura’s menu from a small piece she wrote called My Big Fat Menu.

Gerry:

Right, and it worked.

Jim:

Yes and she reported that it worked. It wasn’t their first choice. I believe it was a fall back position but you see the other piece of information here is that one of the other folks in the department I worked in, Patrick Commarford, came into our department as an intern, we liked his work very much and he got a permanent position. That was before he’d completed his PhD, still had to do his dissertation, and the dissertation topic that he settled on was to basically conduct an experiment to see if a broad menu with many options but very little depth would work better, worse or equal to an auditory menu structure that had fewer options per level but more levels. And the thing that he did that turned out to be especially valuable was he measured the memory span of all his participants.

The outcome was that people with a high memory span were able to work with both menu structures about equally. The surprising result was that the people who had a lower memory span performed much more poorly with the deep menu structure than with the broad one. In other words by following the design practice that was intended to help people who might have difficulty with their memory span, it had exactly the opposite result because the additional cognitive demands of navigation were far more damaging to those users than simply waiting to hear one more option.

Gerry:

Very interesting. One of the things that impressed me about the book, and I must say I did find the book to be extremely readable, but I liked not just the depth of knowledge that you displayed in it but also you cast a very critical eye over the research that you do quote and use in the book. Is that something that comes naturally to you?

Jim:

I don’t think it would come naturally to anyone. Maybe I’m speaking out of turn there, maybe it would, not to me. But when I was in the experimental psychology program at New Mexico State University the professor who taught the statistics classes, James Bradley, also taught a class in critical evaluation of research, and that was a real eye opener with regard to some of the things that he would show us and then guide us in the different levels at which we should be skeptical of what we’re reading.

For example, one that I still remember – if I said Harry Harlow, would that mean anything to you?

Gerry:

No.

Jim:

Terry cloth monkeys and wire monkeys.

Gerry:

No, ah yes now I know what you’re talking about, yeah.

Jim:

Okay, so this is classic social psychology and he had the terry cloth monkeys that the little monkeys could get milk from or the wire monkeys that monkeys could get milk from and the finding was they preferred the nice soft warm maternal feeling of surrogate mothers.

But when you look at the pictures of these two, the surrogate mothers, the more inviting and softer cloth mothers had very rounded faces and rounded facial features. The wire surrogate mothers had very squared off features, very unnatural. So that raises then the problem of a confounded variable. Was it the cloth or was it the face or some combination of both? And because that wasn’t controlled in the experiment you simply don’t know.

Gerry:

Well it certainly comes through in reading the book that you’ve been quite thorough in reviewing the various papers that you’ve used as inputs and sources for the book, and at times you’ve been quite critical of some of them but you also said, I think in the afterword that you had changed some of your practices as a result of the reading you did in writing the book which I thought was, you know, a nice position to be in.

Jim:

Well yes, this is the first book that I’ve written and it was a very interesting experience, and in researching those areas in which I wasn’t as strong as in some other areas I learned some very interesting things. And one of the other things that I should point out is in the acknowledgements there is a section where I talk a little bit about reading research critically, and I really do want to make it clear that when you read research critically it’s the research you’re criticizing, really not the researcher.

It simply isn’t possible to run any one experiment that can be absolutely free of criticism and I’ve had my share of criticism. It’s all part of what it takes but if people didn’t take the trouble to conduct the research then that would leave designers floundering in many cases. I know it’s not easy to do and whether I’m critical or not, I really appreciate it.

Of course the other thing is that as a designer, you can never count on research to be there to answer all the questions and of course that’s part of what makes design challenging too.

Gerry:

Now I know we’re running close to the end of our window. It’s very likely that people listening to this program or reading the transcripts will be currently working on web or mobile applications or the like and will in coming years become involved with IVR or speech systems. Other than reading your book do you have any general advice for such people?

Jim:

Well there are books out there that are a little older than mine. They’re still well worth the investment and one of them actually goes back to one I mentioned, Bruce Balentine, before. He works for a company called EIG, Enterprise Integration Group, and he has a book that’s been out since the late ’90s, was revised in the early 2000s, called How to Build a Speech Recognition Application: A Style Guide for Telephony Dialogues and that’s still worth having on your bookshelf.

Probably the other one that would be really important to have is a book by Michael Cohen, James Giangola and Jennifer Balogh. I believe they wrote this when they worked for Nuance called Voice User Interface Design and again it’s very valuable. I’ve still got these on my bookshelf. I still find myself pulling them out and consulting them from time to time.

Gerry:

Well, I think your book is a worthy addition to that duo and the book is Practical Speech User Interface Design by James R Lewis and I heartily recommend it to anybody who’s got any sort of interest in this area.

Jim Lewis, thanks so much for joining me today on the User Experience podcast.

Jim:

Okay, I really enjoyed it. Thank you.

Postscript

Jim provided the following comments by email after our conversation:

1. Regarding other sources of information/training — here are the books (other than mine) I’d recommend for voice user interface designers to have in their personal libraries:

Balentine, B., & Morgan, D. P. (2001). How to build a speech recognition application: A style guide for telephony dialogues, 2nd edition. San Ramon, CA: EIG Press.

Balentine, B. (2007). It’s better to be a good machine than a bad person. Annapolis, MD: ICMI Press.

Cohen, M. H., Giangola, J. P., & Balogh, J. (2004). Voice user interface design. Boston, MA: Addison-Wesley.

There are also short-course training opportunities. Enterprise Integration Group (EIG — where Bruce Balentine works and teaches) offers an excellent introduction to IVR UI design – see http://eiginc.com/cms/en/calendar/calendar.

I occasionally give half- or full-day courses at human factors and HCI conferences (and am available for hire through my department at IBM). This year, I’ll be teaching a half-day course at HCII 2011 in Orlando, Florida, 8:30 am to noon Eastern (see http://www.hcii2011.org/index.php?module=webpage&id=57).

2. Regarding auditory menu length:

The key characteristics of any menu structure are its breadth (number of options per menu) and depth (number of menus to deal with to accomplish a navigation task). An early guideline from the 1980s (based on the “magic number 7 +/- 2”) was to limit the number of options in IVR menus to four or five per menu. Research since 1997 has consistently shown this to be misguided. The mental demands of navigating a menu structure are greater than the demands of selecting an option from a list. The greatest demands of navigation, therefore, are on those with low working memory capacity. For IVR designers, this means (a) don’t pack more options into a menu than will logically fit, but (b) do not use arguments based on “the magic number 7 +/- 2” to artificially limit the number of options you put in a menu. To as great an extent as is reasonable, prefer broad over deep auditory menu structures.
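As a rough illustration of how breadth and depth trade off, the Python sketch below computes how many menu levels a caller must navigate to reach one of a given number of destinations. It assumes an idealized, evenly balanced menu tree, and the function name and the 16-destination example are hypothetical.

```python
def menu_depth(destinations, breadth):
    """Number of menu levels needed to reach every destination.

    Assumes a balanced tree in which every menu offers up to `breadth`
    options. Uses integer arithmetic to avoid rounding issues.
    """
    if destinations < 1:
        raise ValueError("destinations must be at least 1")
    if breadth < 2:
        raise ValueError("breadth must be at least 2")
    depth, reachable = 0, 1
    while reachable < destinations:
        reachable *= breadth
        depth += 1
    return depth


# Sixteen destinations arranged three different ways.
broad = menu_depth(16, 16)    # one menu with sixteen options
medium = menu_depth(16, 4)    # two levels of four options each
deep = menu_depth(16, 2)      # four levels of two options each
```

For sixteen destinations the broad design needs one navigation step, the four-option design needs two and the two-option design needs four. Each extra level is a navigation decision of the kind the research found most demanding for callers with low working memory capacity.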
