Beyond the Usability Lab: An interview with Bill Albert

Audio (mp3: 5.19MB, 30:14)

Published: August 2010

Gerry Gaffney interviews Bill Albert, co-author of “Beyond the Usability Lab: Conducting Large-Scale User Experience Studies.”

Gerry Gaffney:

This is Gerry Gaffney with Episode 55 of the User Experience podcast.

My guest today is the Director of the Design and Usability Center at Bentley University in the USA. Previously he was Director of User Experience at Fidelity Investments.

He has a BA and an MA in Geographic Information Systems, as well as a PhD from Boston University in Spatial Cognition.

He is co-author, with Tom Tullis, of Measuring the User Experience. His most recent book, co-authored with Tullis and Donna Tedesco, is Beyond the Usability Lab: Conducting Large-scale Online User Experience Studies.

Bill Albert, welcome to the User Experience podcast.

Bill Albert:

Thanks for having me.

Gerry:

Before we get to talking about online or remote usability studies, I’d like to get a picture of what the Usability Center that you work at is.

Bill:

Essentially we’re a consulting organisation within the university, and we consult out to a lot of different types of clients worldwide. As part of what we do, we also train graduate students in the field. It’s kind of a med-school model, where they are getting practical training in our centre, as well as learning some of the theory in the Masters program in Human Factors and Information Design at Bentley.

Gerry:

To move on to the topic of your most recent book, “Beyond the Usability Lab”, which is why I wanted to talk to you initially. I guess many of the listeners will be familiar with traditional usability tests, in which generally we’d be talking about six to eight participants in one-on-one moderated sessions that attempt to uncover critical issues. How does that differ from what you describe in the book?

Bill:

What we talk about in “Beyond the Usability Lab” is what we call un-moderated usability testing. It’s doing large-scale testing with hundreds or even thousands of people over a very short period of time. Essentially setting up the usability study and, through the internet, just launching it out to all your target users and collecting data from them – a lot of different types of usability metrics, as well as comments that people write, a lot of qualitative data, behavioural data – everything that helps us understand what’s going on with the entire user experience, from the point of view of having a large sample size.

Gerry:

You’re talking specifically in the book about un-moderated remote activities, but there is a set of activities referred to as moderated user research or moderated usability testing.

Are there similarities between the moderated and the un-moderated?

Bill:

Yeah, certainly at a high level, in that the overall goals might be similar, in that you’re trying to understand the user experience of a particular product or design, and that you may want to identify the most significant usability issues in using those.

But there are a lot of differences as well.

Gerry:

I think this whole area of remote user experience is becoming extremely important and extremely popular. I noticed Gerry McGovern, a former interviewee on the podcast in fact, in one of his recent newsletters said: “What we will find is an explosion in testing and observation. It will become central to how teams work. The day will be planned based on the results of yesterday’s test results.” He talks about the web being “a forum for constant interaction with customers.” Do you feel that that’s a valid vision?

Bill:

Absolutely. And we’re seeing it more and more. A lot of organisations are coming to us, really in desperate need of different types of usability testing or user research. It’s really becoming indispensable for any medium-sized, and certainly any large, organisation. It’s no longer a nice-to-have, it’s absolutely critical. When I was at Fidelity it was very much engrained in the design and development process. We had a seat at the table where we could basically stop development if we knew that there was a big problem with the usability.

So it’s certainly mainstream and widely accepted. I think the whole conversation about how you sell usability is becoming antiquated.

Gerry:

Which is nice to hear…

Bill:

Yeah.

Gerry:

Can you describe to us the difference between “within subjects” and “between subjects” research? Because I guess this really only comes into play once you have a reasonable number of participants. So many people who are conducting traditional usability testing may not be familiar with the practicalities of those two different types of research.

Bill:

Sure. “Between subjects”, at a very high level, is when you have two different groups. You’re comparing across two different groups. For example, you might want to compare them on Design A versus Design B, and you want to see are there any differences.

“Within subjects” is when you’re using the same participants or subjects to look at both Design A and Design B, and you want to see how they compare within themselves on those two different designs.

Gerry:

And so when people talk about A/B testing, what are they typically talking about?

Bill:

That case would be a flavour of between subjects. So you’re taking somebody and randomly putting them into one of two different groups, and then you’re comparing the two different groups.

That decision is very important as a researcher because there are trade-offs in each of those approaches. For example, in “between subjects” you don’t have to deal with people learning one design and then the other. There are no order effects. But you have to make sure that the two groups are very comparable to one another.

In “within subjects”, people can learn something over time, in which case there’s an order effect. But you don’t have to worry as much about differences between the participants, because everybody is being compared against themselves.
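[Editor’s note: a minimal sketch of the distinction in Python, using made-up participant IDs. In a between-subjects design each person is randomly assigned to one design only; in a within-subjects design each person sees both, usually in a counterbalanced order to limit order effects.]

import random

participants = ["p01", "p02", "p03", "p04", "p05", "p06"]  # hypothetical IDs
random.shuffle(participants)

# Between subjects: each person sees only one design.
half = len(participants) // 2
groups = {"Design A": participants[:half], "Design B": participants[half:]}

# Within subjects: each person sees both designs, with the order
# counterbalanced across participants to reduce order effects.
orders = {p: ("Design A", "Design B") if i % 2 == 0 else ("Design B", "Design A")
          for i, p in enumerate(participants)}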

Gerry:

I was going to ask you which is better, but I guess you’ve answered that in that description. It depends!

Bill:

Yeah. It’s sort of a trade-off. If you have an unlimited, very, very large sample size, a cleaner way to do it is between subjects, as long as you feel very confident that you’ve used a very thorough recruitment strategy and that there are really no differences between the two populations.

Gerry:

One of the things you talk about in the book as an advantage of remote un-moderated research is the ability to explore true intent. Can you tell us what true intent is?

Bill:

True intent is a phrase that… I don’t know if it came from marketing or not, but the basic idea is that there’s different goals for doing un-moderated testing, and one of them is true intent, where you’re just simply trying to understand: “What are people doing on my website?”

The basic idea is that you intercept somebody when they come to a website or when they’re exiting and you say: “Hey, we just want to understand what you’re trying to do”. There are no tasks involved, it’s really kind of a user discovery effort to understand why did they come there, what are they trying to do? Did they accomplish what they came to do? And you’re just trying to understand what were their major pain points and what are the things that they like and don’t like.

It’s very open-ended, usually very qualitative. There are very few metrics that are associated with true intent studies, but they’re quite popular because they really are ground truth in a way – what’s going on with the end user.

Gerry:

I guess that to some extent is related to another question I’ve got here. I was thinking… when one conducts any sort of user research, the quality of the participants is very important. I guess also the quality of the recruitment, the whole logistics of doing this sort of research.

How do you recommend that people should recruit for the types of research that you’re describing?

Bill:

Well, there are different approaches. Some people have access to customer lists with email addresses. Some people can do kind of a “friends and family” approach, which I don’t recommend.

Probably the most common way is utilising different online panels. There’s a lot of market research companies out there that basically offer panels of literally hundreds of thousands of people from all over the world, and they know a lot of details about them, and they can essentially do the recruit for you, handle all the incentives, and it’s a pretty easy, convenient way to get a lot of people very quickly.

It does cost some money, obviously, but it’s… one of the more popular ways. And some of the un-moderated testing tools have relationships with different panel companies, so they can even act as a middle-man for you. You can just say: “I need 500 of these types of people”, and they’ll go out and get those people, and you’re charged by the number of completed surveys for that.

Gerry:

How does one judge the quality of the panels that those organisations are going to maintain?

Bill:

Well that’s a really good question. I think it’s hard to do that, and the quality of those panels can vary quite a bit. Those companies try to do a very good job of weeding out people who misrepresent themselves, or who do something called “flat-lining”, where they’re essentially going through surveys as quickly as they can in order to get the incentive.

So they do, I think, a decent job of making sure that they’re giving you good quality participants, but then there are things you can do at your end to make sure that the people who participated in your study represented themselves correctly and took the study in earnest.

Gerry:

And in fact if anyone’s interested in that particular aspect, I… well I recommend the book in any case, but if they’re interested in that particular aspect I highly recommend that they read “Beyond the Usability Lab” because it goes into that in a reasonable amount of detail. As somebody who hasn’t conducted much in the way of extensive remote testing I found that it was a fascinating discussion, very interesting stuff.

Bill:

One little bit about that is we found roughly anywhere from 5% to 10% of your participants are going through a study in a non-legitimate way. They’re going through it as fast as they can to get to the end. And in the book we do highlight different little tips and techniques for identifying and removing those types of participants.

It does happen, but you can certainly mitigate that negative impact.
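[Editor’s note: a minimal sketch, in Python, of one way to flag possible “flat-liners” from an un-moderated study export. The data and the cutoff rule are purely illustrative, not the specific techniques from the book.]

import statistics

# Hypothetical total completion times, in seconds, per participant.
times = {"p01": 812, "p02": 640, "p03": 95, "p04": 710, "p05": 88, "p06": 655}

median_time = statistics.median(times.values())
cutoff = median_time / 3  # illustrative rule of thumb: implausibly fast responses

flatliners = [pid for pid, t in times.items() if t < cutoff]
cleaned = {pid: t for pid, t in times.items() if pid not in flatliners}
print("Removed:", flatliners)  # ['p03', 'p05'] with this sample data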

Gerry:

I guess the key is to identify them and remove their data from the data set, yeah?

Bill:

Absolutely.

Gerry:

Here’s a difficult question for you. How many participants?

Bill:

It really depends on the goal of the study.

If the goal is really just to identify usability issues, I think I’d fall in line with a lot of other people, saying six to eight users is plenty to identify the significant usability issues with a particular design.

We’ve seen it over and over again, and there’s been a lot of research and a lot of data on that.

If the goal is to understand the magnitude of those issues, how big or small are they, that’s a whole different type of metric. That requires a much larger sample size. That requires 30, 40, 50, even 100 people, to get something reliable. It really goes to how much error are you as a researcher willing to accept? How reliable do you need to be in your estimates? And if you need to be really reliable, then you’re going to need to have a larger sample size. You might need to have more than 100 people.

If you don’t need to be that reliable but you need to gather some metrics that have some meaning, then maybe 30, 40, 50 people is enough. But that number of participants is a difficult question because it’s so context-dependent.
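[Editor’s note: a small worked example of why the required sample size depends on how much error you can accept. Assuming a hypothetical 70% task success rate and the standard normal approximation, the 95% margin of error shrinks roughly with the square root of the sample size.]

import math

def margin_of_error(p, n, z=1.96):  # z = 1.96 for 95% confidence
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical 70% task success rate at different sample sizes.
for n in (10, 30, 100):
    print(n, round(margin_of_error(0.7, n), 2))
# n=10 gives roughly +/-0.28, n=30 roughly +/-0.16, n=100 roughly +/-0.09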

Gerry:

One of the things, Bill, that I really enjoyed about the book was that the information on statistics was very, very clearly presented and explained, particularly for people like me for whom statistics is kind of a scary area, I guess [laughs], or at least not a key strength. Can you attempt, and you were touching on it there with some of the things you were talking about, can you attempt do you think to explain both statistical significance and confidence intervals in this perhaps not ideal medium?

Bill:

Sure. Just to take a step back, when Tom Tullis and I were writing “Measuring the User Experience” that was something that we really talked about in the planning of that writing, was to make it very approachable, not to give a lot of theory behind it, but really show people, here are some of the basic statistical tests that you can do. Here’s how to do them and here’s how to interpret them. We wanted to make it very useful, because we found that to be a very, very important part of our research.

Now to answer your question about statistical significance and confidence intervals.

Statistical significance is really a concept that looks at whether a particular result was obtained randomly or due to some other factors. So, to give an example, perhaps you want to know whether there’s a difference in how people are rating the ease of use of a particular design between two different groups, between the novices and the experts, for example.

You might see a small difference in their average ratings, but through a significance test, like doing a t-test or analysis of variance or what have you, you’re able to say, you know what, there is something going on here. It wasn’t just that this result was obtained randomly. There’s something going on that the novices are rating it much more difficult than the experts, for example. And for that significance test, there are a lot of different types of tests that you can do, and statistical significance is usually represented by something called a p-value. Hopefully I’m not going into too much detail here, but that’s really just the probability that a [result] was obtained randomly or not. It’s one of these nice things because you can sort of hang your hat on it and say, yeah, there’s a statistical significance here, or no there’s not.
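[Editor’s note: a minimal sketch of the kind of significance test Bill describes, assuming the SciPy library is available and using made-up ease-of-use ratings for the two groups.]

from scipy import stats

# Hypothetical ease-of-use ratings (1 = very difficult, 7 = very easy).
novices = [3, 4, 2, 3, 5, 3, 4, 2]
experts = [5, 6, 5, 7, 6, 5, 6, 6]

t, p = stats.ttest_ind(novices, experts)
print(f"t = {t:.2f}, p = {p:.4f}")
# A p-value below the conventional 0.05 threshold suggests the difference
# between the groups is unlikely to have arisen by chance alone.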

Gerry:

And in fact in the book you guys are very – instructional, I guess is the word – in saying “do exactly these things”. Excel is the tool you talk about, and you give people exactly the steps to go through to measure these things.

Bill:

Yeah, and to answer your other question about confidence intervals. Confidence intervals are a really nice thing, and we’ve found the students we teach at Bentley really appreciate learning about them, because they’re kind of a nice way to visualise the reliability of a particular estimate.

So if you have a time to completion and you want to know… say the average was 90 seconds, but if you plot a confidence interval around that you can say, I’m 90% certain that the true estimate, the true time to complete that task, is somewhere between 45 and 135 seconds. So you can see what the reliability is of that particular average, for example.

And what’s nice is that you can compare: when you start looking at confidence intervals you can visualise where there are statistical differences.
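[Editor’s note: a minimal sketch of computing the kind of 90% confidence interval Bill describes around a mean task time, assuming SciPy is available and using hypothetical timing data.]

import math
import statistics
from scipy import stats

# Hypothetical task completion times in seconds.
times = [62, 95, 104, 71, 88, 130, 77, 110, 84, 99]

n = len(times)
mean = statistics.mean(times)
sem = statistics.stdev(times) / math.sqrt(n)
t_crit = stats.t.ppf(0.95, df=n - 1)  # two-sided 90% interval
lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"mean = {mean:.0f}s, 90% CI = {lower:.0f}s to {upper:.0f}s")
# Plotting these bounds as error bars makes it easy to see where two
# designs' intervals overlap, and where they don't.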

Gerry:

And in fact in the charts in the book you show this very clearly where you have the upper and lower bounds and you can point out whether they overlap and so on.

Bill:

This is something that we hear all the time on polls on the radio where it says Candidate X has an approval rating of this plus or minus. That plus or minus is a confidence interval.

Gerry:

Obviously when you are conducting remote un-moderated testing there are various technological things to think about. For example, participants may have to install plug-ins, and I know from some of my experience that this can be quite problematic if you’re, for example, trying to work with people who are not very au fait with, well, the term “plug-in” for starters, I guess.

There are also browser compatibility issues and so on. What’s the current state of play in this regard?

Bill:

Good question. I think it’s changing all the time. There’s a bunch of different tools out there, some much more robust than others.

My understanding, and this is not my area of expertise, but my understanding is that those tools that want to capture a lot of the click-stream data may in fact require some type of plug-in or download.

But you can also run those studies if you don’t gather all the click-stream data, without requiring the participant to download something on their end.

And I think there may be tools that don’t have any requirement at all and still capture some click-stream data.

So it’s something that’s certainly worth looking at depending on what tool you’re thinking of using.

Gerry:

And of course there are many vendors of tools designed to support this sort of remote user experience research. How does one go about choosing the best tool, do you think?

Bill:

Oh gosh, that’s such a great question. I’m just in the middle of writing a paper on this topic right now, so it’s right at the top of my mind. Every day it seems like there’s some new tool out there… The main decision that a researcher needs to make is: Am I going to go large-scale, where I want a large sample size and to collect a lot of data? Or do I want to do something that’s more qualitative in nature and just have a smaller sample size going through a product, and capture something, whether it’s a video file or transcripts or something of a more qualitative nature?

If it’s on the quantitative end there are companies like UserZoom, which I think is very good, Loop11, also Keynote’s Web Effective. There’s a company called Imperium also. There’s something called UTE – Usability Testing Environment, I believe.

There’s a handful that really specialise in large-scale un-moderated testing.

Gerry:

So I guess it’s up to the people who are interested in conducting this sort of research to survey the field out there…

Bill:

Yeah. On our Measuring UX website we have a link to a lot of the different vendors out there, so you can at least take a look, because they are very different in terms of their pricing structure, some of the functionality, things like requiring downloads.

Gerry:

In terms of cost, how does remote testing compare to in-lab testing?

Bill:

I guess there’s remote and then there’s un-moderated. So in un-moderated, at large scale, it can be quite inexpensive or very, very expensive. Just to give you some ballpark figures, you can run a study for a few hundred dollars, easily up to $25,000, depending on how much assistance you need, and what types of metrics you need to capture. It can vary quite a bit, and in fact in the book Tom Tullis wrote a chapter on discount un-moderated testing, where you essentially can do it for free if you know just a tiny bit of HTML and JavaScript.

To be honest, I don’t think cost should be an excuse for not getting large sample sizes.

Gerry:

… Is there even a rule of thumb? For example, if I’m doing un-moderated remote testing, is the cost per participant going to be half of the cost of moderated in-lab testing, or a tenth, or what, per person?

Bill:

There’s the cost of accessing or licensing the technology, but how much you pay per participant… I think it actually can be quite a bit less expensive than pulling people into the lab. At least in the US, you might give a participant $100 for an hour to an hour-and-a-half. Doing un-moderated testing, the tests are going to be much shorter, they might be only 15 or 20 minutes on average, and you might give somebody $5 or $10, or you might just give people a chance to win something…

Gerry:

I guess we won’t try and pin you down any further than that because…

Bill:

No, because it’s… I’ve run many un-moderated studies that were a fraction of the cost of doing something in the lab. And I’ve run un-moderated studies that were really quite expensive as well.

Gerry:

To get to something very specific. There was a heading in the book. I can’t remember the exact wording, but it said something like “Pivot tables are your friend.” [Laughs.] What did you mean by that?

Bill:

Well, it was Tom Tullis. [Laughs.] He loves pivot tables, and pivot tables are just kind of a handy trick in Excel to be able to play with and visualise your data in different ways. So in the analysis section we talk about different ways of analysing the data you get from an un-moderated study, and pivot tables essentially allow you to drag and drop the different variables you have, creating different types of tables to see how the younger novices do against the older experts, or how different types of groups compare, breaking up different variables in different ways.

You’ve got to read the book to find out, it’s hard for me to go into detail on it. But it’s a nice little Excel trick that people find very helpful in the course of the data analysis stage.
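[Editor’s note: outside Excel, the same pivot-table idea is available in, for example, the pandas library; a minimal sketch with hypothetical study data.]

import pandas as pd

# Hypothetical per-participant results from an un-moderated study.
df = pd.DataFrame({
    "age_group": ["younger", "younger", "older", "older", "older", "younger"],
    "expertise": ["novice", "expert", "novice", "expert", "novice", "novice"],
    "task_time": [95, 60, 120, 75, 140, 100],
})

# Average task time for each combination of age group and expertise.
print(pd.pivot_table(df, values="task_time",
                     index="age_group", columns="expertise", aggfunc="mean"))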

Gerry:

Something that people will possibly be quite familiar with already as an issue is that when you have a large data set it can be quite difficult to organise what it all means, and you talk about this in the book in relation to open-ended questions. If you have questions that people can answer beyond just yes or no, giving opinions and commentary and so on, you can quite easily end up with a lot of undifferentiated text. And you had some very interesting suggestions in the book for dealing with this. Can you tell us about that?

Bill:

As you said, there’s a lot of different open-ended comments and you don’t know what to do with them. One of the nice things to do, for example, is to create word clouds. There are different websites out there where you can essentially copy and paste all the different comments, and it looks at the frequency of occurrence of different terms, and then you can see that formed into a word cloud. It’s a really beautiful visualisation. You get a sense of whether terms like “navigation” really pop out, or “terminology”, or “slow”… it’s a really handy high-level visualisation.

[Gerry's note: An example is Wordle]
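[Editor’s note: a minimal sketch of the underlying idea, counting term frequencies in open-ended comments with standard-library Python; the comments and stop-word list are made up. The resulting frequencies could then be fed to Wordle or a word-cloud library to produce the visualisation Bill describes.]

import re
from collections import Counter

# Hypothetical open-ended comments pasted from a study export.
comments = [
    "The navigation was confusing and slow",
    "Couldn't find the search box, navigation unclear",
    "Terminology on the pricing page made no sense",
]

stopwords = {"the", "and", "was", "on", "no", "a", "an", "made", "find", "couldn't"}
words = re.findall(r"[a-z']+", " ".join(comments).lower())
freq = Counter(w for w in words if w not in stopwords)
print(freq.most_common(10))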

There are other techniques you can use too, like alphabetising all the comments per person and starting to look for different themes, and categorising those issues based on some high-level types of comments, like things around the navigation or content, what have you, to get a good sense of what’s going on.

And then there’s a lot of different statistical software that attempts to parse out many of the verbatims that participants provide, usually in a little bit more sophisticated way, but I still think it’s helpful to explore that data, and certainly to use the comments as quotes to bring to life some of the metrics that you might be presenting to your management.

Gerry:

Bill, do you have any advice for people whose interest may have been piqued by this topic or this discussion and they’re thinking: “Oh, it would be interesting and useful for us to get out and do un-moderated research with more than a handful of people.”

How do they go about – besides, obviously, buying and reading the book, which I would certainly recommend – how do they go about putting a toe in the water, as it were?

Bill:

I don’t think it should be very scary. I think that one of the best ways is obviously to find a sponsor, somebody who’s interested in getting some type of metrics around the use of a critical product or design in their organisation. Once they’ve done that, and they know what they want to test, and there’s somebody who’s very interested in the result, I would just go out to one of these more self-service tools like a UserZoom or a Loop11 and start to craft together a study, make sure that folks in the organisation know what you’re doing, and that they agree on what the tasks are and what metrics you’ll be using, and figure out a way to find the right people, get enough people so whatever you present is statistically reliable.

And then launch the study. I think the mechanics of doing these studies are not very difficult for anyone who has used something like SurveyMonkey; it’s probably only a tad more challenging, and maybe not even that.

So doing the studies is not the problem. Probably the more challenging piece will be in analysing the data and presenting it in a way that’s very meaningful, and making sure that you’ve cleaned the data up properly to make sure that you’re measuring what you hoped to be measuring.

Gerry:

I don’t normally do plugs on the User Experience podcast, but I do want to mention two forthcoming conferences. The first is UX Australia in Melbourne in August, at which Daniel Szuc and I will be doing a workshop. But also Daniel and his company are organising User Experience Hong Kong in February of next year, and I think if people want to go to a really interesting conference, UX Hong Kong looks to me to be one – and if nothing else it gets you to Hong Kong for a good, interesting experience!

Bill:

Can I add one more plug to the conference list?

Gerry:

By all means, yeah.

Bill:

There’s something called the UX Masterclass in Montreal on September 20, and it’s organised by the UX Alliance.

Gerry:

And you’re going to be presenting at that, Bill?

Bill:

I will, on a very similar topic.

Gerry:

So whether you’re in Canada, Australia or Hong Kong, there’s something coming up in the near future that should be of interest.

…Bill Albert, thanks so much for joining me today on the User Experience podcast.

Bill:

Thanks so much for having me, it was a real pleasure talking to you.
