Voice Content and Usability - problems and opportunities. With Preston So

Published: 29 August 2021

Gerry Gaffney

This is Gerry Gaffney with the User Experience podcast. My guest today is based in New York City. He’s a senior director of product strategy and editor at A List Apart, a columnist at CMSWire, and a contributor to Smashing Magazine. He’s an expert in omni-channel content strategy.

He’s author of Decoupled Drupal in Practice and Gatsby: The Definitive Guide. That’s Gatsby the front-end framework and not F. Scott Fitzgerald’s Jay Gatsby, by the way.

His most recent book and the reason I’ve invited him here today is Voice Content and Usability.

Preston So, welcome to the User Experience podcast.

Preston So

Hey Gerry, thanks so much for having me on the show today. It’s a real pleasure to be here.

Gerry

I’ll remind listeners that as always a transcript of this episode is available at uxpod.com. Now, Preston, you write that ‘our long-standing focus on the websites and applications we own, all largely visual and bounded to devices with screens, will need to adapt to embrace other means of accessing content.’ Why is a voice interface so different?

Preston

That’s a great question. And I’ll start with making one kind of important comparison between the ways in which we use voice interfaces, the ways in which we use conversational interfaces and the ways in which we use visual or physical interfaces that we’re more used to today. One of the most interesting conundrums that we face as user experience professionals in the industry right now is in many ways we’ve gone away from a lot of what makes the kind of interactions between humans mediated through the written medium or the spoken medium. Those have been washed away in many cases in favour of a more artificial means of interacting with user interfaces. One example of this is computer mice, keyboards, the smartphones that we use, the touch screens that we use. All of those interfaces are things that are learned, things that are artificial.

They’re not necessarily things that we acquired when we were toddlers, for example, or babies, whereas human language and the ways in which we deploy human speech in particular, which is the most primordial form of interaction that we can have between one another is the very interesting example of how it’s no longer humans that have to play on the playing field of machines or of keyboards or of these devices that are artificial. It’s now our devices and our technologies that need to play on the same playing field as us with human speech.

So that’s number one. I think voice interfaces are very crucial for obviously that next big step forward in terms of human computer interaction. But I think they also, presage a very interesting motion back, a little bit of that pendulum swing back in the other direction where we’re saying, well, hey, we’ve spent decades now on the web. And the web has been primarily mediated through browsers, laptop screens, smartphone screens, and a lot of the tropes, a lot of the motifs of the web and the print medium as well, rely on these very strong conditions for how user experiences are presented. For example, when you flip through a magazine, you flip through a tabloid, there are certain broadsheet borders or limitations to the print medium, to the written medium. When you look at a browser screen, for example, we’ve got links and calls to action and breadcrumbs and all these things that have become very familiar to us as users, but really have no analogue within the voice interface realm. How do you present links? How do you present arbitrary things like text wrapping or columns or things like buttons and calls to action in a voice interface. All of these kind of concerns become almost unmapped territory for so many of us as designers and practitioners, because we have no clue what a lot of these equivalents can truly be, due to the nature of the oral and verbal medium that voice operates in.

Gerry

Do you think voice will supplant text and the screen?

Preston

I think in some ways it already has. You know, there’s certain areas. I think where we see, certainly there’s been a lot of steam, a lot of momentum behind the ways in which voice has taken over. Not only let’s say automated interactions or human computer interaction, but also human output in general. I was just on a podcast a few weeks ago speaking about how the nature of podcasting itself has also reinvented a lot of the transmission of important information and content, the ways in which we hear voice announcements in transit systems or, you know, within the realm of a lot of the public spaces that we operate in is a very important consideration. However, I don’t think that there’s going to be so much as supplanting or so much of outmoding of written and text media, as I see a bit of a, sort of more of this multimodal evolution happening where text and print just as we’ve seen already, right, I think the web, you know, it’s this very interesting trend where we’ve seen the web completely outmode the text world and the, and the publishing world. And, you know, everyone thought back in the late nineties, the early nineties, the web is going to actually, you know, kind of be the death knell of print media, but, you know, nothing could be further from the truth, we see the web extending and providing even, let’s say another axis along, which we can operate as, as designers and practitioners. 
And I think voice is just one of those other mechanisms that will begin to kind of add to our palette and add to our kaleidoscope of different user experiences that we’ll now have to contend with in a, in a very interesting way, which, you know, obviously is very promising for us in the user experiences industry given that we’re now seeing the field expand in very interesting dimensions, but also because I believe very strongly, and I wrote about this recently in A List Apart for my article Immersive Content Strategy, I think voice, web, mobile, all of these different realms are now just parts and single panes within a very, very large kind of, you know, atmosphere where we have to deal with not just voice, but also immersive experiences like augmented virtual reality. How do we mediate a lot of these interfaces that are now becoming not just physical and visual, not just oral and verbal, but also spatial and locational and immersive?

Gerry

Interesting new world all right, isn’t it? Now you, you write about the reuse or repurposing of pre-existing content, but you also write that, and I quote, ‘voice content needs to be free-flowing and organic, context-less and concise - everything written content isn’t,’ and that ‘any content we already have isn’t in any way ready for this new habitat.’ How do you reconcile this conflict of saying that we need to reuse our content, but it’s not suitable for reuse, essentially?

Preston

It’s a constant tug of war between different stakeholders. As people who work in user experience I think we understand the very different predilections, the very different prejudices, the very different priorities that many of us have when it comes to not only the nature of how we interact as, you know, engineers and designers, but also product managers and designers and all of these sorts of interactions that we have to think about. One of the challenges I think is it’s a constant tug of war between the notion of manageability, maintainability of content. This is a very, very important consideration for those in content strategy, content architecture as well, and of course, information architecture, and this notion of, well, we need to build the best experience for the interface or for the particular medium that it’s, that this information is going to end up on.

And so I think this really becomes a very interesting struggle, a very interesting paradox that I highlight in my book, which is voice content is fundamentally about delivering content in those bite-sized, microcosmic chunks of content, microcontent, obviously being a coinage of technologist Anil Dash and not what I call macro content, which is this more, you know, these Russian novels of content or these pages of content. And if you look at a typical webpage, you know, and let’s take, for example, you know, the Victoria government and, you know, certain COVID-19, you know, guidance that they might issue. The big issue that I think a lot of written content has is the very privilege and luxury that it has of being in and for a written medium, which is that all of the written words on the page refer back to other areas of context that were already settled previously on the page. They might have links and references to other places on the page or other locations on the same website, but those are things that are much harder to realize, and much more difficult to interpret within the context of a voice interface.

One example of this is let’s say that you’ve got a page that has frequently asked questions, and some of the questions assume you’ve already read earlier frequently asked questions on the page. Well, in a voice interface setting where you might be landed on one of these out-of-context pieces of content without necessarily having that settled foundation in concrete of what it is that you were originally looking for in terms of your topical or subject area, it becomes a very challenging concern. How do we actually get to that content without introducing problems of ambiguity, problems with discoverability, which are of course, much more relevant in a voice setting due to the fact that we’re not, we’re no longer reading and skimming words on a page, we’re actually hearing words, and there’s no way to skim or speed-read a voice interface utterance.

That being said now, I think there’s a lot to be wary of when it comes to a lot of organizations, when they first got into voice interfaces and conversation design, they started down this path, this very interesting trajectory where they started building parallel or side-along, almost sidecar attractions that had their own content, their own information and their own user experience, their own user interaction flows, their own journeys.

Now, the problem with that of course, is when we start to look at this multimodal UX kind of world that we’re about to enter into after the pandemic or whenever of course you know, our, our trials in the pandemic are over. It’s going to be very interesting to hear from users themselves about how exactly their perception of a website or their perception of a mobile app matches that organization’s voice interface or that organization’s AR/VR interface. How will those expectations that users have set with regard to how they traverse these new experiences become things that carry over or things that don’t, and in the content strategy world to put a bookend on this point, it’s a very, very challenging thing because here in the United States, for example, as you know very well state and local governments are very cash strapped. Their budgets are very limited. And one of the prerogatives that our client came to us with Digital Services Georgia, which is of course the digital arm of the state of Georgia is they said we have only one editorial team. We have only one editorial staff, our prerogative, our jurisdiction, our edict is over the website. We can’t add an additional editorial team or additional staff just to handle the voice content or the voice manifestation of that content.

So I think there’s this really interesting kind of balance that will have to be struck between the dream obviously, which is to have content that’s really scintillating and really pinpointed at the nature of voice and the way that voice operates with, of course, the reality economically of the fact that many of us don’t have the luxury of managing 15 different types of content for 15 different types of devices. It’s simply not realistic. And at some point you have to draw the line. And my book really toils, you know, sort of in chapter 2, really toils with this idea of where exactly do you draw that line in the sand that forges that equilibrium that allows for content to live well enough on both the web and in a voice setting.

Gerry

Yeah, it’s funny because if you go back to you know, I guess, SGML, and the dream of you know, single source content that we would have something in one place and we would publish it, you know, on a, on a mobile screen and we’d publish on the web, we publish it in a book, you know, all these multiple formats. And, you know, that dream was really never realized and, and, you know, going multimodal or multichannel really does drive a nail into, into that coffin. Doesn’t it?

Preston

Absolutely. Absolutely. And I think it raises another concern, which you just briefly alluded to there, which is that, you know, I think one of the big reasons why SGML, VRML, some of these really interesting standards that were obviously rooted in XML, why they failed is partially because of the centralization and the oligopoly that a lot of these new realms of user experience have fallen victim to. If you look at the web, you know, HTML, XML, CSS, all these standards are not under the control of let’s say single corporations or single entities. But if you look at now, the way that the mobile world has evolved, where you have these two warring worlds, so to speak, between Android and iOS, there’s no real way to build a single kind of application necessarily very easily that can operate along both of those dimensions, and voice is the same problem. Voice has the same issue where I think a lot of us expected that well with VoiceXML, for example, with a lot of these standard technologies that are out there, we should be able to write HTML or write, you know, XML as easily as we do right now for user experiences that are not in these realms.

But today, if you look at the ways that you build voice interfaces on Amazon or Google or Oracle, or you know, or any of these other platforms, they’re worlds away from each other. And I think that really suggests that there’s a little bit of a problem here that, you know, we have a bit of a privilege on the web of having access to these baseline technologies that are, that are so great for the things that we want to do and also percolate out to these multimodal experiences potentially. That was the, the ultimate sort of undergirding rationale for these standards. But now that the specifications have not become adopted by these large corporations, we’re seeing a little bit more of that fragmentation happening. And I, and I do worry significantly that we are entering into even more of that fragmentation when it comes to other domains as well.

Gerry

You don’t look like a worried man, Preston, you look pretty chilled.

Preston

Ah well, you know, it’s not necessarily for me to worry about all day [laughs.]. I mean, you know, I think it’s you know, we, we, as practitioners, we have to deal with the, with the, with the hand that we’re dealt, so to speak. Right. And a lot of the promise I see today in voice interfaces is actually that just like the mobile revolution that we’re seeing, where a lot of new technologies are emerging, allowing developers to build one single application that becomes an Android app or an iOS app, we’re seeing a little bit of that same Renaissance happening right now in the voice world, which is why I’m a little bit more on the optimistic side than the pessimistic side right now, because of companies like for example, Bot Society, which basically promised this idea of build once, and you can have your chat bot or your voice bot manifest in Slack or Facebook Messenger or WhatsApp, or what have you. But of course these platforms really paper over, I think, the very important distinction that we’ve been talking about since the beginning of our time together, which is really about, well, how do you reconcile the written and spoken form of information, which really couldn’t be in many regards, more different from each other.

Gerry

Let’s put that aside for the moment then. You talk about voice interactions as falling into one or more of three categories. Can you tell us about those three categories perhaps briefly?

Preston

Sure. We all know what a conversation is. You know, a conversation is basically an arbitrary chat, length of time that really doesn’t necessarily have a beginning or an end. Doesn’t really necessarily have a certain narrative progression or, you know, many of us can sit for example, on the side of the road and have a conversation with somebody that’s sitting there waiting for their bus. And that’s not really a conversation that fulfills a certain goal. I would say that a voice interaction is very different from what we consider to be human conversation. And that’s because voice interactions generally are about fulfilling some, some form of a mission and voice interactions or conversational interactions to be more broad, really focus on three different categories.

And two of those are what Amir Shevat in his book Designing Bots calls task led, or what I call transactional and topic led or what I call informational.

And if you look at every single voice interface, they generally operate along one of these two modes, or they might enable both in the case of voice bots like Siri or Google Assistant, they might enable both. But fundamentally when you actually reduce all of the interactions you could have with a conversational interface down to the most irreducible atomic component of what we would consider a conversation, that really comes down to an informational interaction or a transactional interaction. And a transactional interaction can be something like ordering a pizza or booking a hotel room, or let’s say booking a COVID-19 test, since it’s all on our mind today, or it can be an informational interaction: finding out about symptoms of COVID-19, for example, about the current lockdown regulations right now in Melbourne, or looking at, for example, what hotel rooms are available or what sorts of ingredients there are in a pizza.

And as you can see, there’s very different kind of considerations and questions that you have to answer when it comes to a voice interaction that’s more informational.

Transactional voice interfaces have been around for a time. You call up Qantas, you call up Delta Airlines, you’re going to get an interactive voice response system or a form of an interface that really is just about handling those tasks for you or doing certain actions on your behalf. But an informational voice interface has to be much better at almost a more idle form of conversation. A more, let’s say encyclopaedic form of conversation that requires you to kind of delve into this investigative journey to find out information that the user might be looking for. And then of course, pass that information on.

Now there’s a third form of interaction that I think is in some ways the most human of the three. And also of course, the one that voice interfaces and chat bots are the worst at, which is what I call a pro-social conversation, what we would consider to be small talk, or just kind of checking in on each other, asking how someone’s day was, asking about, let’s say how the week has been going. A lot of these interactions tend to be very challenging for these synthesized speech interfaces, because there is no way for a chat bot or an Amazon Alexa or an Apple Siri to really want to know how we’re doing or how our day’s been going. And that’s a really interesting challenge that I think a lot of corporations have papered over when it comes to some of these voice interfaces that are intended to be almost like human-like concierges. And, and it really portends some really interesting, let’s say next steps for a lot of conversation designers and makers out there to explore in the future.

Gerry

You write that ‘every voice interface and thus every dialogue deals with four primary dialogue elements.’ Can you tell us about those four elements?

Preston

Sure. Four elements is definitely a bit of a misnomer. I mean, I would definitely call it three. The fourth one is really probably an element that appears only once within a single dialogue, but…

Gerry

OK. Tell us about these three elements. [laughs.]

Preston

Well I’ll start with the fourth one too, because, because it is important. It is, it is core. It is crucial and that is of course onboarding. And I think a lot of people oftentimes kind of are a bit loosey goosey about onboarding, but onboarding is, is, as we know, as user experience professionals, one of the most important and essential aspects of any user interface and in a voice interface in particular, it’s very, very crucial because when it comes to onboarding, you don’t want to alienate those users who have already spent enough time with the interface to understand how it works, but at the same time, you don’t want to exclude or push away those who are using a voice interface for the first time. With AskGeorgia Gov, which was the first voice interface for residents of the state of Georgia here in the United States, and we built out one of the first content-driven Alexa interfaces for this purpose, what we discovered is that, you know, a lot of people who were using this particular voice interface were elderly Georgians, disabled Georgians, who didn’t necessarily have a lot of experience or a lot of exposure to voice interfaces. And so a lot of them were using an Alexa for the very first time to acquire information and you have to adhere to certain limitations in that regard. Obviously we’re still in a kind of chunk of time, even though 35% of Americans now have a smart speaker or voice assistant at home, there is still a very, very significant majority of people that have never used a voice interface before. And so having an onboarding process that’s short, to the point, brief, but also spirits the user straight away into that initial interlocution or that initial interaction that must happen is a very important concern that establishes trust, that establishes credibility, authority.

But of course, every onboarding step, which is essentially in a voice setting, just a greeting and just an explanation of how the interface works is of course, an initial prompt. And a prompt, in the case of a voice interface, the closest analogue that I would say to what we see in the user interface world in the visual realm is something like a dialog box or something like a request for the user to do something and to put in something, it could be a form for example, on a website. And the prompt is very crucial because when it comes to a transactional or an informational voice interface, you need to have a capability to acquire certain information, whether it’s to filter out certain results in a search, or whether it’s to filter out certain options in a form fill, you want to be able to make sure that the user provides you some information, provides that input.

Now, the prompt is what also constitutes what’s called an intent. Now, this is a very interesting part of user interface design when it comes to voice interfaces, because we hear about intent quite often in the user experience realm on the web or in mobile, but intent is a term that’s imbued with a lot more weight and a lot more meaning in voice because for voice interfaces identifying the user’s intent and what they want to accomplish within the voice interface is extremely challenging. And that can take place through multiple prompts. It can take place through a variety of different approaches that the voice interface might use. And of course, the final one, which is really about delivering the correct response to the user, giving the user feedback, maybe the intent was not clear, maybe the response to the prompt that the user issued was not clear, maybe there was an out of domain error or some sort of a non-understanding or no match error within the voice interface.

Giving feedback is of course, one of the most important functions of user experiences and having a voice interface that issues very good responses is something that’s very, very critical. Now, of course, responses also fall under the umbrella of content issuances as well, content delivery. And if you’re, for example, answering questions about COVID-19 or in the case of AskGeorgia Gov in the state of Georgia answering questions about how to register to vote or how to actually enrol your child in pre-kindergarten, those are all things that are part of this whole realm of a response. And so one of the things that I think is really crucial to note here is that these elements are very much things that are like puzzle pieces that you put together and fit together into a flow journey diagram, just like we see in the web world, but there’s very crucial differences.

And that is that we generally, as designers that operate in visual and physical realms, we generally put together a sketch or some sort of a prototype of a visual interface or a visual layout of how we want these, these elements to work together in a visual interface.

In a voice interface however, there is no visual component, especially when it comes to pure voice interfaces that have no screen, that have no visual feedback delivery mechanism whatsoever. For that reason, designing for voice interfaces is not so much opening up Figma or opening up Sketch and putting pieces in place. It’s really about opening up a Google document or opening up even a screenwriting program like Celtx and writing out a dialogue. And these dialogues are very, very crucial because those are the building blocks that will make up a very effective dialogue that either succeeds or fails in actually fulfilling the user’s task. And these sorts of dialogues are then what you use to create the information flow diagrams, the flow journey diagrams that we often see today in voice interfaces, and of course in web and mobile settings as well. But I think one of the things that’s most crucial to recognize is that just because we’re moving into the voice realm doesn’t mean that our skills with writing and especially good UX writing, those are things that never go away.
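As a rough illustration of the dialogue elements Preston describes (onboarding, prompt, intent and response), the sketch below models a scripted dialogue in Python and checks it for a few structural problems before it is turned into a flow diagram. The class names, the checks and the sample wording are all assumptions made for this illustration; none of them come from the book or from any voice platform.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ElementKind(Enum):
    ONBOARDING = "onboarding"  # greeting plus a short explanation, once per dialogue
    PROMPT = "prompt"          # the interface asks the user for input
    INTENT = "intent"          # what the user says they want
    RESPONSE = "response"      # content or feedback delivered back to the user


@dataclass
class DialogueElement:
    kind: ElementKind
    speaker: str  # "interface" or "user"
    text: str


@dataclass
class Dialogue:
    elements: List[DialogueElement] = field(default_factory=list)

    def add(self, kind: ElementKind, speaker: str, text: str) -> "Dialogue":
        self.elements.append(DialogueElement(kind, speaker, text))
        return self

    def problems(self) -> List[str]:
        """Return structural problems found in the scripted dialogue."""
        issues = []
        kinds = [e.kind for e in self.elements]
        if kinds.count(ElementKind.ONBOARDING) != 1:
            issues.append("onboarding should appear exactly once")
        elif kinds[0] is not ElementKind.ONBOARDING:
            issues.append("onboarding should come first")
        if ElementKind.PROMPT not in kinds:
            issues.append("at least one prompt is needed")
        if not kinds or kinds[-1] is not ElementKind.RESPONSE:
            issues.append("dialogue should end with a response")
        return issues

    def script(self) -> str:
        """Render the dialogue as a plain script, one line per utterance."""
        return "\n".join(f"{e.speaker.upper()}: {e.text}" for e in self.elements)


def build_sample_dialogue() -> Dialogue:
    # Hypothetical wording, written only for this example.
    return (
        Dialogue()
        .add(ElementKind.ONBOARDING, "interface",
             "Welcome. You can ask me about state services.")
        .add(ElementKind.PROMPT, "interface", "What would you like to know about?")
        .add(ElementKind.INTENT, "user", "How do I register to vote?")
        .add(ElementKind.RESPONSE, "interface",
             "Here is what I found about registering to vote.")
    )


if __name__ == "__main__":
    dialogue = build_sample_dialogue()
    print(dialogue.script())
    print(dialogue.problems())
```

Keeping the dialogue as plain data like this is one way to write the script first, as you would in a document or screenwriting tool, and only afterwards derive flow diagrams or platform-specific configuration from it.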

Gerry

Indeed. You do talk in the book too about decision flow diagrams, dialogue flow diagrams, and you’ve got examples of them. And you know, they seem to be very useful techniques. One of the things that you’ve mentioned a couple of times is the Georgia government service. And you use that throughout the book as a kind of framework to describe the processes of auditing your current content to check its suitability for a conversion to a voice interface. You know, things, how to strip the, what you call the vestigial web content from those elements and so on. But one of the things that amused me about it was you know the need to monitor these conversations. And you talked about there was a story in the book about Lawson’s. Do you want to tell us what that was?

Preston

Yeah, it’s, it’s really one of the funniest stories in the book, I think, and this is really an example of how much voice and conversation design and conversational, you know, user experience remains very much in its infancy. One of the really interesting things that we did for AskGeorgia Gov, and I think every conversation, you know, every single conversational interface creator should do is to introduce logging and analytics and, and make sure that you have a sort of reporting mechanism to understand how well the voice interface is performing, not just for the voice interface implementation itself, but also for the purpose that you’re trying to fulfill which in our case, of course, was delivering content that reflected the user search back to the user. And we had basically a really interesting mechanism within the Drupal content management system which of course is a very popular solution for the public sector.

And one of the most important things that we did was to introduce a parallel reporting mechanism where if someone was searching for a term on the website, they would also have that logged. But if someone was searching for a term on the voice interface, asking about driver’s licenses or asking about you know, certain small business loans or things of that nature, they could acquire that information through the voice interface as well. Of course, we introduced a log that was actually literally sat next to and alongside the web analytics so that people could see, okay, well, this was the main, the, the most searched for topic on the website, but this was the most searched for topic on the voice interface. And the results were totally different. It was very interesting. We also, with the voice interface, given that we wanted to also detect, you know, okay, how’s our performance doing with the voice interface? How accurately are we actually portraying and presenting the search result or the search query over to the content management system? We actually discovered that a lot of times we were finding people who would say certain things and have those things registered within the logging mechanism that we really had to scratch our heads with. And there was one instance in particular that I think is, is really instructive and illustrative of some of the challenges that we still face in the industry, which is number one, we found that there was a strange result that kept on popping up 404 errors, no match errors. And it was happening maybe dozens of times. I mean, this person was very, very adamant that they got to their result, but it kept on being recorded within the analytic system and the logging mechanism as Lawson’s - as in L A W S O N apostrophe S.

And we just thought, okay, well, who’s searching for this. I mean, you know, Lawson’s is a proper noun. It’s somebody’s name. It’s also a convenience store chain in Japan. Who, who could be searching for this on AskGeorgia Gov, what is the point of this? And so we actually sat and had a meeting where we said, okay, we gotta figure this out because this could be a really big flaw in our interface. And we sat around scratching our heads and eventually one of the native Georgians in the meeting room, she perked up and she said, you know, I think this might be somebody from Georgia who’s trying to say license. And sure enough, when we thought about it, it was very clear to us that it was somebody who was trying to say license as a driver’s license in her native, Southern drawl. And wasn’t able to make Amazon Alexa understand what she was saying. And I think this is a really good example of how, the ways in which our interfaces today are meant to mediate a lot of the human error that we have, sometimes our humanity is in itself so rich and so diverse and so complicated that the machines are the ones that have the issues. And if you think about, for example the richness of dialects that we have in English from American, you know, very rich array of American English to Australian English, to Kiwi English, to of course all of the kinds of dialects that we have in the south here in the United States, it becomes a very, very challenging consideration, but it also gave us a little bit of that kind of internal chuckle, because we said, you know, this isn’t actually our mistake. This is not our bad, because frankly it should be Amazon. It should be the Alexa device that we’re relying on that should have absolutely no trouble understanding somebody who comes from Macon, Georgia, who could, or who comes from Athens, Georgia, or who’s speaking using any sort of dialect in English that really should be accounted for in any voice interface. 
And I think this really highlights this kind of interesting paradox right now with voice interfaces, which is that, you know, we as designers and as practitioners, we try as hard as we can to bring our user experiences within an inch of perfection. But sometimes it’s the underlying technology, that foundational element or that foundational setting that actually presents the problems rather than our own work as designers and implementation experts.

And so this is where I see this really interesting kind of realm that we’re in right now, where voice interfaces are very much still not ready for prime time in many regards, because they can’t understand somebody simply saying the word license in rural Georgia.

Gerry

And of course, I mean, I find that an amusing story, but it’s very disenfranchising, isn’t it? And disempowering, discounting somebody’s speech because it doesn’t match what the machine wants to say. And I know there has been some research that said that, you know, I mean, I know that for example, in the States, people of colour have found that they’re not understood by a lot of these devices and they have to change to be more like white people in order to get the devices to understand them. And you do talk about that in the book. You’ve got towards the end of the book, you’ve got some discussion on representation and in describing Siri, Alexa and Cortana, you write that ‘we treat such executive assistants as cis-gender white women, despite the misogyny and racism inherent in such characterizations.’ And also that ‘it’s no surprise that most voice assistants are to the user’s ear straight cisgender white women who speak in a general American dialect.’ I mean, you know, obviously we haven’t got time to completely, you know, disinter everything that’s buried in that, but maybe you could talk briefly to it.

Preston

Absolutely. I think this is, of course, one of the hardest problems that we face in technology today. Also, of course, as people who work in user experience, it’s a very, very challenging problem, which is that fundamentally voice interfaces don’t really give two hoots about the racism or oppression that many people in our society face today. And whether it’s people who speak with bilingual cadence, who potentially switch between different languages in mid-sentence, or those who use indigenous languages to speak with one another but can’t find that representation within a voice interface. I think one of the big challenges that we face today is a lot of this breathless excitement that’s coming up about what Mark Curtis calls the conversational singularity. And I think it’s a very, very lofty and really exciting notion of this point in the future potentially where a conversation that we have with Alexa or with Siri, or with Cortana will eventually be indistinguishable from the kind of conversation you and I are having right now, Gerry.

But one of the really big concerns that I have is, well, that’s going to mean a conversational singularity for whom and those conversations will become indistinguishable for whom? Here in the United States I think we’ve had a lot of interesting examples of linguistic oppression, where you see, for example, not only the forms of automated racism that happened with some of the soap dispensers or some of the, you know, machine vision that we see making mistakes with people of colour, but also the fact that a lot of times these interfaces were predominantly built by those who are in positions of privilege within our society. They are generally white heterosexual, cisgender men who also speak American English, and who also were brought up in a broadcast world where everyone had to speak with the middle American or general American dialect. I think one of the things a lot of people, especially here in the United States, forget is that even in the early 20th century, there was a huge effort to standardize the ways in which anchormen and anchorwomen, and also radio hosts across the country spoke English in order for us to paper over a lot of these differences.

Now, of course you can kind of argue about some of the advantages or disadvantages thereof, but there’s a significant problem here, which is that there is fundamentally a lack of representation, not only in the upper echelons of conversational technology, but also in the conversational technology itself. Because if you think about it, when was the last time that you heard a voice interface like Alexa or Cortana speak with anything other than a white Australian or white British or white American accent? There is no room for Indian English. There is no room for Singlish or Hong Kong English, or for a lot of the ways in which we know that people speak, especially when it comes to the ways in which people of colour speak, especially when it comes to African-American Vernacular English (AAVE), or the code switching that often happens in bilingual communities. And this is what worries me the most.

Gerry

Preston, you had a story in the book about the Lyft driver. Tell us that.

Preston

You know, this is what I hope to see, and I think there’s definitely a few interesting steps in this regard. Amazon Polly, for example, is a service that allows you to put in a piece of text and that’s going to be read out by different voices with different accents. But I think there’s a really interesting notion of customization that has to happen here. And I was in a Lyft at one point humming down an expressway. And I remember that, you know, my driver had these recordings of his daughter reading out some of those instructions you hear on Waze like, you know, ‘accident reported ahead’ or ‘keep left’, ‘keep right’. And the way he phrased it to me is, you know, I was wanting to have her with me on every ride. I wanted to have her voice.

And I think this really kind of represents a lot of the challenges that people of colour and oppressed communities face in general in society, which is, you know, oftentimes you don’t hear yourself in that way. You don’t hear your, you know, your own daughter, and your own daughter will not hear her own voice in some of these Lyfts or some of these voice interfaces that we interact with on a daily basis. And what does that do in terms of cultivating a potentially damaging and potentially insidious monoculture of user experience that we’ve already seen develop over the course of the last several years with regard to not only the echo chambers of misinformation on social media, but also of course, some of the automated racism that we see with something as simple as a soap dispenser? And I think this really portends a very interesting era in our kind of realm of how we operate in the technology industry, because first and foremost, when you think about voice interfaces, a lot of these organizations, a lot of these large multinationals, they’re building voice interfaces in a sense to substitute and replace and take the jobs of the call centre staffers, the frontline customer service workers, those who are predominantly from the global south, those who are predominantly from the Philippines or Indonesia or India or Pakistan, who are now losing their jobs and losing their jobs in favour of what? In favour of a monoculture white voice that is mechanical and doesn’t reflect the same humanity that we’ve already begun to lose a little bit of over the course of the past few decades. So my question to a lot of the people who control voice technology and speech technology is how will we still account for, of course, the richness and the menagerie of experiences that we have here in the United States, that we have in American cities, that we have across Australia and that we have across Europe and Asia?
How are we going to make sure that people of colour are represented and also continue to hear themselves in the same way that we do today? Because one of the things that I think is very worrisome about how voice technology is progressing is that we’re seeing a large-scale erasure and silencing not only of individual languages that are endangered. I mean, you know, for example, there is still no real way to speak an Australian Aboriginal language with a voice interface. There are, for example, now voice interfaces that have emerged to speak Welsh with, or speak Irish Gaelic with, but where are the ones for black indigenous, for black and brown communities, for those communities to be able to hear themselves and actually hear their own dialects and their own languages represented and not only represented, but also honoured and valued? It’s a very important consideration, I think, for us in the voice industry.

Gerry

Now I’m very conscious of time, Preston, but one thing I did want to hark back to, when we talked about the four dialogue elements, intents and responses, you know, and we hear engineering people talk about intent-response pairs, and it just occurred to me that, I know you discussed in the book, the feeling that voice is an engineering problem.

Preston

Yeah. It’s, you know, where to draw the line, I think is a really interesting concern and a really interesting question. It’s a big quandary, I think, for a lot of designers who have been hoping to get into voice interfaces and into conversation design.

Let’s start with a little bit of history and then I’ll kind of move into sort of the idea that voice is, you know, solely the engineering dominion so to speak. Back in the early nineties, when the first, let’s say, true voice interfaces developed, when these interactive voice response or IVR systems emerged, for the most part these implementations were really rooted in engineering considerations, in very low level hardware, very low level synthesis of speech, very low level natural language processing. And in many regards, the ones who built the first IVR systems, the ones who built the first, let’s say, true computers capable of human speech needed to have a PhD in computational linguistics or a PhD in computer science to be able to even begin to build some of these things.

What we’re seeing now in the last few years is a democratization, in the same way that we’ve seen the democratization of web design and web development, towards a new state of affairs where anyone and their dog can build a voice interface by using a low-code or a no-code platform, just like you could with a lot of the old WYSIWYG programs that we saw before, like Dreamweaver and FrontPage. But of course, I think one of the really important considerations that we have to consider as well is that a lot of the platforms don’t necessarily work well with each other, there are certain foibles, there are certain nuances, and it’s still an area within user experience where I think there is a lot of leaning still on engineering. For example, for AskGeorgia Gov with the state of Georgia, we decided to focus just on Alexa, because there wasn’t really this cross-platform, low-code tooling available for us to be able to manifest this voice interface as multiple kinds of voice interfaces.

At the same time, however, there are some of these lower level aspects that you and I just spoke to, to dive into, for example, how we can make sure that a voice interface can understand things like ‘Lawson’s’ or how we can make sure that a voice interface can actually deploy some of the voices that we want to make sure represent the users that we serve, who might be oppressed or marginalized. That’s something that’s still very much in the hands of engineering teams and architecture teams and those who really have control over the technology. Now, this is what I see as a long-term trend. And one of the really positive things I see about voice interface design and conversation design in general is as the years progress, I think as we’ve seen in the web world, the mobile world, we’ve seen this democratization and this citizen developer or citizen designer approach take hold where now more and more people are able to put together a website or put together a user experience that is compelling without even tapping on the shoulder of a single developer or a single IT person. And I’m very, very confident that we’ll get there soon with voice, but the question is how low-level will that get and how fundamental can that get in terms of the ways in which we can configure or gauge or twist some of the nuts and bolts around for how the voices actually perform, how we actually hear some of the same kinds of dialectal switches or code switches that we engage in on a daily basis in our conversations with each other as humans.

Gerry

One thing I liked about the book was the extent to which you referenced what others have done before. For example, Jim Lewis and Amir Shevat, who you mentioned a few minutes ago, and both of whom have been previous guests on UXpod. But besides reading your book and dipping into some of those other sources, what can the average web or app designer do to engage with potential future work in voice content design or delivery?

Preston

That’s a great question. And I think this really reflects, first of all, that voice content is obviously a totally new discipline. There isn’t really a whole lot of writing that’s been done on voice content itself. And I think this is a very exciting time because for all intents and purposes, I think the world of content strategy, content design is going to become a very, very important part of the voice industry at large in the near future. But there’s many ways to get involved. I mean, you know, first and foremost conversation design today is a very vast realm that’s very quickly growing. There are many different meetups and many different organizations that are out there. There are also many conferences and many different events that are out there. Actually, later this [northern] fall, I’ll be speaking at two events. The first is An Event Apart where I’ll be giving the only voice session there about how you can take your talents on web to voice. So it’s actually specifically oriented towards those who are web designers or those who are used to the web UX paradigms who want to move into understanding how voice works from that standpoint. Also I’ll be giving a talk about the realm and the whole field of voice content design and voice content strategy at Button coming up here in October as well. So super exciting times. And one of the things that I will say is for anyone looking to get their hands dirty, there are many events out there, you know, Voice Lunch is one of those, you know, tons of meetups out there. And there are also a lot of interesting books that are coming out.

One that I’ll recommend is from two very, you know, wonderful authors, called Conversations with Things. That’s one that just came out recently, just at the same time as my book as well. And of course, with some of these tools that are out there, you don’t really need to have any sort of underlying knowledge about, let’s say, the mechanics of conversation or the underlying nuances of natural language generation, natural language processing because of these low-code tools that are out there. So if you try out something like Bot Society or VoiceBot.ai, you know, I think is another one of those. You can actually go and try to build a voice interface right now. But of course my biggest kind of, I think, recommendation is a lot of us still have not as much exposure to voice interfaces as we’d like to think. And so my biggest advice also is go out there, try some smart speakers out, go out there and try some of these voice applications out. AskGeorgia Gov actually now has a chat bot as well that you can use at askgeorgia.gov to kind of explore that written conversational realm. And I think it’s a very exciting time for us as designers to begin to work on not just that job security, but also scratch that itch that might’ve been sitting there for a while.

Gerry

Yeah. And indeed, I’ve very recently had the opportunity, you know, once again to listen in to a lot of actual conversations. And I think if you can listen to the conversations between machines and humans, you learn so much about, you know, how poor that whole environment is. But yeah, it’s an interesting area.

I’ll remind listeners again that Preston’s book is called Voice Content and Usability, and I’d recommend it for anyone interested in getting into this area.

Preston So, thanks for joining me today on the User Experience podcast.

Preston

Hey Gerry, thank you so much for having me today.

If you wanted to get a hold of me, you can do so at preston.so.

Check out my writing of course, there, as well as at A List Apart. And once again, my book is on sale - Voice Content and Usability at abookapart.com and I’ll soon have a couple of other books coming out, Gatsby: The Definitive Guide from O’Reilly and the sequel to this book, which I hope you’ll have me back on the User Experience podcast for, Immersive Content and Usability coming out with A Book Apart next year.

Gerry

Okay. Nice plug there. Well done.

Preston

Thank you so much.