Sunday, September 6, 2009

Z is for Z Test

Today we discuss the Wonderful World of Z Tests. I started hearing from co-workers last year about testing means with Z-tests, and I wondered why one would use a Z-test when one has perfectly good t-tests. Z-tests can also be used to test proportions, such as people who prefer Coke vs. Pepsi. Why not use a Chi-square in this case?

The main reason to use Z-tests is when you have a population's mean or proportion of responses. Thus, you are learning about a particular sample by testing against a known universe of scores. Afterwards, you are completely done: you are not going to try to apply what you learned about this sample to other situations. When we use t-tests and chi-squares, we're doing inferential testing so that we can apply our test results to future samples. Think drug tests, where we want to know whether a drug is truly effective for a trial group so that it can be used on other patients.

Example A: At eAcmeWidget.com, they ask customers about their annual income when purchasing a widget. They want to know whether their customers are richer than, poorer than, or about the same as the national average. To do this, they have to compare their customers' mean income against the national mean income, which they know from the last national census. To learn more about this type of Z-test, see this tutorial video about the Z-test for the mean.
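To see the arithmetic, here is a minimal Python sketch of that comparison. The numbers are made up, and it assumes the census supplies both the national mean and the national standard deviation of income, which is what a textbook Z-test for the mean requires:

    # Hypothetical eAcmeWidget numbers -- not real census data
    from math import sqrt
    from scipy import stats

    national_mean = 52_000   # census mean income (assumed known)
    national_sd = 18_000     # census standard deviation (assumed known)
    sample_mean = 58_000     # mean income of our widget customers
    n = 400                  # number of customers we asked

    # z = (sample mean - population mean) / (population SD / sqrt(n))
    z = (sample_mean - national_mean) / (national_sd / sqrt(n))
    p = 2 * stats.norm.sf(abs(z))  # two-tailed p-value
    print(f"z = {z:.2f}, p = {p:.4f}")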

Usually, however, we don't know population means for the things we study in user experience. We might have time scores for users installing a program on their computer, but we don't have a national average to compare with.

Example B: Suppose an online multiplayer game called Funhouse lets the user play as either Bozo or JP Patches. A spin-off game, Funhouse 2: The Revenge, has the same two character choices but with a different story. The company wants to know whether JP Patches is as popular in Funhouse 2 as in the original Funhouse, based on which character users choose to play as.
Out of the 1,000 Funhouse players, 750 choose to play as JP Patches.
Out of the 500 Funhouse 2 players, 300 choose to play as JP Patches.
A Z-test of proportions shows that JP Patches is significantly less popular in Funhouse 2 than in the original Funhouse. To calculate this kind of test, see the Z-test for two proportions calculator.
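If you would rather run it in code than in a calculator, here is a sketch of Example B using the two-proportion z-test in statsmodels (one of several packages that implement it):

    # The Funhouse numbers from Example B
    from statsmodels.stats.proportion import proportions_ztest

    counts = [750, 300]  # players choosing JP Patches in Funhouse, Funhouse 2
    nobs = [1000, 500]   # total players of each game

    z, p = proportions_ztest(counts, nobs)
    print(f"z = {z:.2f}, p = {p:.4f}")  # p < .05, so the proportions differ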

In user experience, we would have more opportunities to run this type of Z-test on our various Yes/No and preference questions. Anything with two answer choices where you do not need to extrapolate to other samples would be fair game.

Thursday, August 6, 2009

Possibly useful books

I have been poking around for possible stats textbooks and discovered two that might be good for user experience researchers based on the sneak peeks I got through Amazon. I ordered copies for review.
But if you want to continue your online learning, I found the Social Science Statistics Blog published by Harvard's Institute for Quantitative Social Science. The blog is on the geeky side—it's from Harvard after all—but is still accessible. The posts seem to be a mix of announcements and critiques of studies and news reports containing statistical data.

Wednesday, August 5, 2009

Post hoc tests

In the last blog entry, we learned about comparing three groups. If we know there is a significant difference between the groups, then we get to run post-hoc tests to see what the exact differences are. They're "post hoc" because it's not legal to run them until after you find a significant effect in the "omnibus" test.

Several kinds of post-hoc tests exist, including Tukey, Scheffe, Bonferroni, and LSD (which makes me smile to this day). The post-hoc tests differ in their strictness: some tests find significant effects more easily, but the trade-off is that they are prone to experiment-wise error, i.e., you goof and say there's a significant effect when there isn't.

The post-hoc tests also differ in their method of comparison. Tukey and LSD are pairwise tests. Scheffe compares every possible combination of groups. Bonferroni lets you pick and choose specific groups to compare.

I often use Tukey, which is fairly liberal, letting me see significant effects more often. Since no one's life depends on my UX research, I'm willing to accept a higher risk of falsely finding statistically significant results.

Post-hoc tests are easy to run. In SPSS, the ANOVA test gives you options to pick post-hoc tests, and the ANOVA and post hocs are run simultaneously. Keep in mind it's not legal to look at the post hoc unless the ANOVA is significant.

Our post-hoc results might look like this:
We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p = .03. A post-hoc Tukey's test revealed significantly shorter account sign-up times for both Getting Started Guides 3 and 4 than for any of Getting Started Guides 1, 2, or 5. Therefore, we recommend implementing either Getting Started Guide 3 or 4 in the new product.
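If you are not using SPSS, here is a rough Python sketch of the same workflow with invented sign-up times for the five Getting Started Guides: run the omnibus ANOVA first, and only look at the Tukey comparisons if it is significant.

    # Made-up data: 20 users per Getting Started Guide
    import numpy as np
    import pandas as pd
    from scipy import stats
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(0)
    data = pd.DataFrame({
        "gsg": np.repeat(["GSG1", "GSG2", "GSG3", "GSG4", "GSG5"], 20),
        "seconds": np.concatenate([
            rng.normal(300, 40, 20),  # GSG 1
            rng.normal(310, 40, 20),  # GSG 2
            rng.normal(240, 40, 20),  # GSG 3 (faster)
            rng.normal(235, 40, 20),  # GSG 4 (faster)
            rng.normal(305, 40, 20),  # GSG 5
        ]),
    })

    # Omnibus one-way ANOVA first...
    groups = [g["seconds"].values for _, g in data.groupby("gsg")]
    f, p = stats.f_oneway(*groups)
    print(f"F({len(groups) - 1}, {len(data) - len(groups)}) = {f:.2f}, p = {p:.4f}")

    # ...and only if it is significant do we look at the pairwise Tukey results.
    if p < 0.05:
        print(pairwise_tukeyhsd(data["seconds"], data["gsg"]))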

Sunday, August 2, 2009

Comparing three or more groups

It is possible to compare as many groups or variables as we like. A common situation is testing the effectiveness of several UX treatments. For example, we may wish to test five kinds of Getting Started Guides to find the one that helps the user complete account sign-up most quickly.

When we compared Design A and Design B on sales, we compared the mean per-person sales for Design A and Design B with a t-test. One way (the not-smart way) to compare three or more groups would be to run a pairwise test for each pair of Getting Started Guides on average time to sign-up, e.g., GSG 1 vs. GSG 2, GSG 1 vs. 3, GSG 2 vs. 3, etc. We don't want to use this method because it's tedious and hard to summarize (we would have to run 10 separate tests), and our chance of falsely believing we have a significant effect increases with each test we run. Most researchers try to minimize the number of statistical tests they run to reduce the chance of error.

For this situation, we can run an ANOVA (ANalysis Of VAriance) or F-test which compares all groups at once. Back when we ran an independent samples t-test, Design A and B significantly differed only if their pots of scores did not overlap too much. An ANOVA also compares pots of scores by considering the variability within each pot and between all the pots. If the scores of one pot are pretty much the same as the scores of all the other pots, then there cannot be a significant effect. Only if at least one group differs from the others would there be a significant effect.

Example: We test five versions of our web site with a different Getting Started Guide for each. Using log files, we are able to see how long each user took from opening the GSG to completing account sign-up. We have 20 people in each GSG treatment for a total of 100 people. Using SPSS, we run a one-way ANOVA that gives us an F score that says there was a significant difference somewhere between the groups. We would have to do post-hoc tests to figure out exactly where the differences were: between GSG 1 and 2? or GSG 2 and 3? or GSG 1, 2, and 3? etc.
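A minimal sketch of that one-way ANOVA in Python rather than SPSS, with invented times since we do not have the real log files:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    # 20 hypothetical sign-up times (seconds) for each of the five GSGs
    gsg_times = [rng.normal(mean, 45, 20) for mean in (300, 310, 240, 250, 305)]

    f, p = stats.f_oneway(*gsg_times)
    print(f"F(4, 95) = {f:.2f}, p = {p:.4f}")  # df = groups-1, scores-groups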

The exact write-up of results would have the means and standard deviations for each individual group. The F score would have two types of degrees of freedom (the number of observed scores or groups minus the number of estimates) in parentheses: the number of groups minus 1, and the number of scores minus the number of groups. As usual, the p-level would also be listed so we know the probability that we goofed and falsely believed there was a significant effect. For the purpose of this entry, I won't include post-hoc tests that describe the exact differences in the data set.

See Table 1 for the means and standard deviations of each Getting Started Guide group's time to account sign-up. We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p = .03. A post-hoc Tukey's test revealed....


Friday, July 31, 2009

103 degrees (of freedom)

In honor of the record 103 degree heat in Seattle this week, I write about degrees of freedom, referenced in the results of every inferential stat. Degrees of freedom are the number of actual observed data points in an inferential calculation minus the number of estimated data points. Suppose we compared 50 users of original Excel with 50 users of redesigned Excel on the time required to create and save a new spreadsheet. We would have 100 observed time scores and 2 estimated means (1 for the group of original Excel users and 1 for the redesigned users). Thus we have 98 degrees of freedom. The results of an independent samples t-test would look like t (98) = 2.53, p < 0.05, where the number in parentheses is the degrees of freedom.

Degrees of freedom address the effect of calculating one statistical estimate (the 5% or less probability of a false positive) based on other estimates (the mean score of each group). The more degrees of freedom, the more statistical power you have to find a significant effect. To see this power, consider comparing original Excel vs. redesigned Excel time on task with 2 users in each group. The degrees of freedom would be 4 observed scores minus 2 means = 2. With so few degrees of freedom, the redesigned and original Excel times on task would have to be hugely different to find statistical significance. With more degrees of freedom, the difference between means would not have to be so exaggerated in order to find statistical significance.
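A quick way to see that power for yourself is to simulate the Excel comparison with the same underlying difference but different group sizes; with only 2 users per group the test will rarely reach significance, while with 50 per group it usually will. (All numbers here are invented.)

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    for n in (2, 50):
        original = rng.normal(300, 60, n)  # original Excel, ~300 s per task
        redesign = rng.normal(260, 60, n)  # redesigned Excel, ~260 s per task
        t, p = stats.ttest_ind(original, redesign)
        print(f"n per group = {n:2d}, df = {2 * n - 2:3d}, t = {t:.2f}, p = {p:.3f}")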

Thursday, July 30, 2009

Comparing two variables

Sometimes we want to know how the scores in Variable 1 compare to Variable 2. A common situation is comparing your study participants' pre-test and post-test scores. Or you might compare your participants' level of satisfaction with two things such as photos manipulated in Pretty Photo Premier versus Super Snapshot Suite. For these situations, we would use Mr. T-Test again, this time a paired samples t-test.

In the last blog entry, we walked through the example of deciding whether a picture improved sales per visitor on the MowBee site. Using an independent samples t-test, we compared the means and variability of the purchases by Pic vs. No Pic visitors. The less the two pots of scores overlap, the more confident we can be that the Pic condition did statistically increase sales over the No Pic condition. When we use a paired samples t-test, we are also comparing means and variability but of two variables for one group rather than one variable for two groups.

Example: We ran a survey asking participants to rate satisfaction on a scale of 1 to 7 with the results produced in Pretty Photo Premier versus Super Snapshot Suite. We have a data file of 100 cases (participants) and three variables: participant id number, PPP satisfaction, and SSS satisfaction. We want to know whether the higher ratings for SSS are statistically significant. If they are, we would write the results with the t-test score (t) and the level of probability (p) that we falsely found a significant difference between the two products. The means and standard deviations for each variable (PPP and SSS) would also be included so you could see which group was high and which was low. Our results could look like this:
Participants were significantly more satisfied, t (99) = 4.85, p < 0.01, with the results of Super Snapshot Suite (M=5.60, SD=1.90) than Pretty Photo Premier (M=3.50, SD = 1.43).
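Here is what that paired test might look like in Python, using simulated 1-to-7 ratings for the 100 participants (the real analysis would of course use your actual survey file):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    # Each participant rated both products, so the two arrays are paired by row
    ppp = np.clip(np.round(rng.normal(3.5, 1.4, 100)), 1, 7)  # Pretty Photo Premier
    sss = np.clip(np.round(rng.normal(5.6, 1.3, 100)), 1, 7)  # Super Snapshot Suite

    t, p = stats.ttest_rel(sss, ppp)
    print(f"t({len(ppp) - 1}) = {t:.2f}, p = {p:.4f}")
    print(f"PPP: M={ppp.mean():.2f}, SD={ppp.std(ddof=1):.2f}")
    print(f"SSS: M={sss.mean():.2f}, SD={sss.std(ddof=1):.2f}")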

Wednesday, July 29, 2009

Comparing two test groups

In the UX world, we often compare how people perform in Design A vs. Design B on certain measures such as pageviews, bounce rate, clickthrough, or sales. To decide whether Design A or Design B is better, we need to compare the means and variability of the users' scores in these two groups.

Sometimes you can tell which design is better by eyeballing the data. Suppose we split our site traffic between versions A and B of our web site. Design A used neon-colored blinking links, and Design B used standard links. After looking at our analytics reports, we might see that A visitors spent an average of 1 minute on the site, while B visitors spent 3 minutes. Both groups had similarly low variability. We can probably say that Design B was better than Design A for getting users to stay on the site because the two groups' scores are so different and there are no extreme outliers skewing the results.

It is when the two groups look much more alike that we really need a statistical test. When comparing two groups, we can run an independent samples t-test. (Fun fact: The t-test was invented at the Guinness brewery.)

A t-test works by comparing the means and variability of two groups of interest. It essentially considers how much the scores for each of the two groups overlap. If they overlap completely, then the two groups are not different from one another. The less they overlap, the more likely the two groups are statistically different.

Example: We currently sell our flagship product, The MowBee, on a web page with no pictures. We want to know if a picture will increase sales. We split traffic equally between picture and no picture versions of the site. After a week, we have funneled 100 people to the Picture page and 100 people to the No Picture page. Each person's purchase (if any) is recorded. Our data file at the end has 200 cases and three variables: participant id number, condition (pic, no pic), and dollar amount of sales. If we ran this data through SPSS' independent samples t-test and found statistical significance, the results would have a t-test score (t) and the level of probability (p) that we wrongly found a statistical difference between the two groups. The means and standard deviation for each group would also be reported so we can see which was higher. We might have a result like this:
We made significantly more money per visitor, t (198) = 2.47, p < 0.05, with the Picture page (M=$85, SD=44.41) than the No Picture page (M=$47.50, SD=18.45).
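If SPSS is not handy, the same test is a few lines of Python; the sales figures below are simulated stand-ins for the real data file:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    pic = rng.normal(85, 44, 100)       # per-visitor sales, Picture page
    no_pic = rng.normal(47.5, 18, 100)  # per-visitor sales, No Picture page

    t, p = stats.ttest_ind(pic, no_pic)  # df = 100 + 100 - 2 = 198
    print(f"t(198) = {t:.2f}, p = {p:.4f}")
    print(f"Picture: M=${pic.mean():.2f}, No Picture: M=${no_pic.mean():.2f}")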

Your friend, the data file

Before you can use a program like SPSS, you need a spreadsheet of study data. Here are some basics to help you prepare a data file. Let's use the example that we ran a usability study with 20 participants on a new personal finance web site.

Each row of your spreadsheet represents one case (or participant) in the study. For this file, you would have 20 cases. Each column represents one variable. Your variables would be questions or data that you collected for each person such as participant id number, age, computer ownership, time on tasks, satisfaction ratings, and comprehension test scores.

Since some statistical packages accept only numerical data, a best practice is to code everything as numbers. Suppose you asked participants for their age group: 18-24, 25-34, 35-44, etc. You can recode their responses as 1 (=18-24), 2 (=25-34), 3 (=35-44), etc.

Maintaining a consistent numbering scheme is also a good practice. I like to code No and Yes answers as 0 and 1 and use 0 for any kind of No answer such as "I don't own X" or "I don't use Y."

I like to code Likert scale answers starting at 1, where 1 is the most negative option and n is the most positive. If a question is worded negatively, you may have to recode its answers so that they run in the same negative-to-positive order. Using a consistent scale makes it easier to analyze related questions.

Example A: Imagine you asked: What is your level of agreement with these statements? where 1 = I definitely disagree, 3 = I feel neutral, and 5 = I definitely agree.
  1. I balance my checkbook.
  2. I never keep my pay check stubs.

To make these answers semantically parallel, you would code question 1 exactly as the participants answered it: 1=1, 2=2, etc. You would flip the scale (1=5, 2=4, etc.) for question 2 because the statement is negative: when participants say they disagree, they mean they do keep their pay stubs. Thus, if you test for a statistical correlation, you in essence look for a positive correlation between balancing a checkbook and keeping pay stubs - much simpler to think about than negative correlation with not keeping pay stubs.

If you have missing answers from your participants, keep those spreadsheet cells blank so you do not accidentally run calculations on them. Any answer you want to exclude from calculations should also be treated as blanks.

Example B: Suppose you ask, "How satisfied are you with your bank?" where 1 = Very unsatisfied, 3 = Neutral, and 5 = Very satisfied. You also offer an "I don't know" option. Any "I don't know's" should be coded as blank. If you ran a stat on satisfaction level, you would want to exclude people who don't know because they cannot speak to this question.
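If you do your coding in pandas rather than by hand, a small sketch of these practices (with hypothetical column names and answers) might look like this:

    import pandas as pd

    raw = pd.DataFrame({
        "participant": [1, 2, 3],
        "age_group": ["18-24", "35-44", "25-34"],
        "q2_never_keep_stubs": [5, 1, 2],  # negatively worded Likert item
        "bank_satisfaction": [4, "I don't know", 2],
    })

    # Code the age groups as numbers
    raw["age_group"] = raw["age_group"].map({"18-24": 1, "25-34": 2, "35-44": 3})

    # Flip the 1-5 scale so that higher always means "keeps pay stubs more"
    raw["q2_keeps_stubs"] = 6 - raw["q2_never_keep_stubs"]

    # Treat "I don't know" as missing so it drops out of calculations
    raw["bank_satisfaction"] = pd.to_numeric(raw["bank_satisfaction"], errors="coerce")

    print(raw)
    print("Mean satisfaction (missing answers excluded):", raw["bank_satisfaction"].mean())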

Monday, July 27, 2009

Making predictions

We can describe a data set's mean and standard deviation with perfect certainty. However, sometimes we want to draw conclusions that stretch beyond the data set, such as when we use a sample to represent a whole population. How do we know we didn't draw a sample of people who like X or perform Y purely by chance? For these cases, we use inferential stats, which include ANOVA, regression, correlations, chi-square, MANOVA, and more. The general premise of inferential stats is to say, with a stated level of confidence, whether a phenomenon occurred because of a relationship between our variables or purely by chance.

Let's say we asked 250 people on the street whether they preferred Search Engine A or B. We would expect the answers to split about 50/50 if the search engines did not differ from each other. Using an inferential statistic, we could test whether our result of 150 people preferring Search Engine A and 100 preferring Search Engine B was due to chance or to an actual preference for Search Engine A.

As part of this test we would state our tolerance for error or the level that we require for statistical significance. In social sciences, the alpha level, or bar for statistical significance, is often 0.05, that is, we are willing to accept a 5% chance that we incorrectly believe that people have a preference for Search Engine A. The tolerance for error will change depending on the test context. A clinical trial of a new drug will probably have lower tolerance for error than a taste test of chicken nuggets.

In this example, we run a one-way chi-square, which tells us whether the preference for Search Engine A was statistically significant. If significant, the p value (probability value) would be less than what we set as the alpha level. The results of a chi-square, like any other inferential stat, would have the statistical test score (in this case a χ2), the degrees of freedom and/or sample size in parentheses, and the level of probability (p). If appropriate, the means and standard deviations of the comparison groups would be reported. Below, the sample report shows a χ2 value of 10 based on 1 degree of freedom and a sample of 250 scores, with less than a 1 percent probability that the results were due to chance.

Our street survey showed a statistically significant preference for Search Engine A, χ2 (1, N=250) = 10, p < 0.01. Future research can investigate the design characteristics that make Search Engine A more preferable than B.
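For the curious, the chi-square itself is a one-liner in Python; scipy compares the observed 150/100 split against the even 125/125 split we would expect by chance:

    from scipy import stats

    observed = [150, 100]  # prefer Search Engine A, prefer Search Engine B
    chi2, p = stats.chisquare(observed)  # expected counts default to an even split
    print(f"chi2(1, N=250) = {chi2:.1f}, p = {p:.4f}")  # chi2 = 10.0, p < .01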

Sunday, July 26, 2009

Describing your data set

Sometimes you want to be able to talk about what your data set or group of users looked like as a whole. For example, you might want to talk about how technically knowledgeable the users in your last usability study were. Use descriptive stats for this purpose. Some common descriptive stats include frequency, central tendency, and variability.

==========
Frequency
==========
One way to describe your data set is to report the size of each group that you care about. These groups might be based on gender, visual appeal rating, or time on task. Frequency (or count) is the number of times these various items appear in your data set.

Example A: Suppose you took a survey of 50 people asking them what word processing package they used. 30 people said Microsoft Word, 10 WordPerfect, 5 Google Docs, 3 Notepad, and 2 Wordpad. You could report the frequencies as Ns (30 Microsoft Word, 10 WordPerfect, etc.) or as percents (60% of respondents used Microsoft Word).

The number of people, or N, in your sample is also an important frequency to report because it usually matters whether your conclusions are based on a sample of 2 or 2,000 people.

A sample frequency report:
We tested 25 participants. They were technically savvy. Almost all had used computers before (n=24). In addition, all participants scored at least 60 on Dr. Wei's Computer Competency Test, and 15 participants scored 90 or higher.

================
Central Tendency
================
Sometimes we want to describe our data in a more shorthand way than frequency. In these cases, we talk about the average or central tendency of a group. Oftentimes, we use the mean to describe the average of a group, but the mode or the median could also be averages. Let's look closer at when and why we use each.

  • We calculate the mean (M) when we have interval data such as time, number of visits, comprehension test scores, or satisfaction ratings. The mean = the sum of all scores divided by # of scores.

    Example B:
    Suppose we have 5 participants in a usability test, and they took 10, 12, 40, 20, and 15 seconds, respectively, to complete a task. Their mean time was (10+12+40+20+15)/5 = 19.4 seconds.
  • We can use the median to describe the central point of a set of ordinal data such as rankings. The median = the midpoint of a set of scores.

    Example C:
    Suppose we have our 5 participants rate the importance of Feature X on a scale of 1 to 7 where 1 = "Not at all important" and 7 = "Extremely important." This scale is ordinal rather than interval because only the two end points are labeled, so each participant's interpretation of the values of 2-6 varies. Your participants' scores are 1, 6, 6, 7, and 7. The median is 6, which is the midpoint of the dataset when it is in numerical order. If there had been a 6th participant who gave a rating of 7, the median would be the mean of the middle two numbers, or (6+7)/2 = 6.5.

    You can also use the median to describe interval data that are extremely skewed by spikes or dips. Note: Using a mean and standard deviation to describe interval data is probably more common than using the median; the skewness or variability of the data shows up in the standard deviation (keep reading this post).

    Example D:
    Suppose in Example B, the time scores had been 10, 12, 180, 20, and 15 seconds. (Participant 3 had stopped in the middle of the task to ask questions and share an anecdote.) The mean would have been 47.4 seconds, suggesting the task was very time-consuming to complete. The median score of 15 seconds is a much more representative description of how long it takes to complete that task.
  • We can use the mode to describe the average of a set of nominal data such as the types of computers that people own. The mode = the most frequently occurring score.

    Example E:
    Suppose we gave a survey to 100 people asking them which computers they own. In our spreadsheet for responses, we record responses as follows: PC=1, Mac=2, Linux=3 (other computer types did not matter). We found that the participants owned 90 PCs, 30 Macs, and 2 Linux boxes. The mode is 1 or PC, the most frequently occurring response.
===========
Variability
===========
When we report the mean, we also report standard deviation (SD) which reflects the variability of the data set. The standard deviation tells us how spread apart a data set is. Example D had a huge outlier, so M=47.4 seconds, SD = 74.22. The standard deviation tells us that the data set had a tremendous amount of variability. If all the time scores had been exactly 15 seconds, the standard deviation would have been 0. Standard deviation is annoying to calculate by hand, but it is worth looking at an example calculation to understand how variance works.
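To see how the outlier inflates both the mean and the standard deviation, here are the Example B and Example D numbers run through numpy (note how little the median moves):

    import numpy as np

    times = np.array([10, 12, 40, 20, 15])           # Example B
    times_outlier = np.array([10, 12, 180, 20, 15])  # Example D

    for label, t in [("Example B", times), ("Example D", times_outlier)]:
        print(f"{label}: M={t.mean():.1f}, median={np.median(t):.1f}, "
              f"SD={t.std(ddof=1):.2f}")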

A sample description of central tendency and variability:
The 55 survey respondents on average rated Widget W's usefulness positively: M=5.34, SD=1.51 (on a scale of 1 to 7 where 1 = "Extremely unuseful," 7 = "Extremely useful," and all other points are labeled similarly to the visual appeal question).

Know your quantitative data

Preparing data for statistical analysis involves creating a spreadsheet of numerical values. A classic spreadsheet would be one that is completely coded in numbers including things like participant ID, gender, satisfaction scores, or time on task.

In the UX world, quantitative data can come in three flavors: interval, ordinal, and nominal. It is important to recognize each type of data because it determines the stats that you can run.

  • Interval data are composed of equally distributed units. Weight is an example of interval data, where everyone weighs a certain number of pounds. Each pound is uniform, so 149 pounds is smaller than 150 pounds in exactly the same way that 150 pounds is smaller than 151 pounds. Common UX interval data include the time that a user requires to complete a task, the number of visitors that a web site has, or a satisfaction scale with every point carefully labeled.

  • Ordinal data are rank ordered but may not be in equally spaced units. Class rank is a common example of an ordinal scale, e.g., every school has students who are ranked 1st, 2nd, 3rd, etc. in their class. However, in School A, students 1, 2, and 3 may have 4.0, 3.9, and 3.4 GPAs, while in School B, the top 3 students might have 3.8, 3.0, and 2.9 GPAs. In the UX world, ordinal data may include user rankings of the five most important features in a product. Satisfaction scales with only a few labeled points are ordinal as well.

  • Nominal data measure discrete categories. The genders of your study participants are nominal data. In a spreadsheet you could code gender as a number, where 0 = male and 1 = female. You can't do any math directly on nominal data because it wouldn't make sense. For example, you can't calculate the average of gender even if you coded it as 0s and 1s because there's no sense in reporting that "On average, the study participants had a gender of 0.6."