Sunday, September 6, 2009

Z is for Z Test

Today we discuss the Wonderful World of Z Tests. I started hearing from co-workers last year about testing means with Z-tests, and I wondered why one would use a Z-test when one has perfectly good t-tests. Z-tests can also be used to test proportions, such as the proportion of people who prefer Coke vs. Pepsi. Why not use a Chi-square in that case?

The main reason to use a Z-test is when you already know the population's mean or proportion of responses. Thus, you are learning about a particular sample by testing it against a known universe of scores. Afterwards, you are completely done: you are not going to try to apply what you learned about this sample to other situations. When we use t-tests and chi-squares, we're doing inferential testing so that we can apply our test results to future samples. Think drug trials, where we want to know whether a drug is truly effective for a trial group so that it can be used on other patients.

Example A: At eAcmeWidget.com, they ask customers about their annual income when purchasing a widget. They want to know whether their customers are the same as, richer than, or poorer than the national average. To do this, they compare their customers' mean income against the national mean income, which they know from the last national census. To learn more about this type of Z-test, see this tutorial video about the Z-test for the mean.
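If you'd rather compute this than watch a video, here's a minimal sketch in Python of a one-sample Z-test for the mean. The incomes and census figures below are made up for illustration; the key assumption of a true Z-test is that the population mean and standard deviation are known (here, from the census).

import numpy as np
from scipy import stats

# Hypothetical census figures (treated as known population values)
pop_mean = 50_000   # national mean annual income
pop_sd = 15_000     # national standard deviation of income

# Simulated stand-in for eAcmeWidget customers' self-reported incomes
rng = np.random.default_rng(42)
customer_incomes = rng.normal(56_000, 15_000, size=200)

# One-sample Z-test: compare the sample mean against the known population mean
n = len(customer_incomes)
z = (customer_incomes.mean() - pop_mean) / (pop_sd / np.sqrt(n))
p = 2 * stats.norm.sf(abs(z))   # two-tailed p-value

print(f"z = {z:.2f}, p = {p:.4f}")   # small p -> customers differ from the national average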

Usually, however, we don't know population means for the things we study in user experience. We might have time scores for users installing a program on their computer, but we don't have a national average to compare with.

Example B: Suppose an online multiplayer game called Funhouse lets the user play as either Bozo or JP Patches. A spin-off game, Funhouse 2: The Revenge, has the same two character choices but with a different story. The company wants to know whether JP Patches is as popular in Funhouse 2 as in the original Funhouse, based on which character users choose to play as.
Out of the 1,000 Funhouse players, 750 chose to play as JP Patches.
Out of the 500 Funhouse 2 players, 300 chose to play as JP Patches.
A Z-test of proportions shows that JP Patches is significantly less popular in Funhouse 2 than in the original Funhouse. To calculate this kind of test, see the Z-test for two proportions calculator.
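If you'd rather run the numbers yourself than use an online calculator, here's a minimal sketch of the same test with statsmodels' proportions_ztest, using the counts from the example (the choice of library is mine, not the original calculator):

from statsmodels.stats.proportion import proportions_ztest

# JP Patches picks and total players for each game
counts = [750, 300]   # Funhouse, Funhouse 2
nobs = [1000, 500]

z, p = proportions_ztest(counts, nobs)
print(f"z = {z:.2f}, p = {p:.4f}")
# 75% vs. 60% gives a large z and a tiny p, so JP Patches is
# significantly less popular in Funhouse 2 than in the original Funhouse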

In user experience, we have more opportunities to run this type of Z-test on our various Yes/No and preference questions. Anything with two answer choices where you do not need to extrapolate to other samples is fair game.

Thursday, August 6, 2009

Possibly useful books

I have been poking around for possible stats textbooks and discovered two that might be good for user experience researchers based on the sneak peeks I got through Amazon. I ordered copies for review.
But if you want to continue your online learning, I found the Social Science Statistics Blog published by Harvard's Institute for Quantitative Social Science. The blog is on the geeky side—it's from Harvard after all—but is still accessible. The posts seem to be a mix of announcements and critiques of studies and news reports containing statistical data.

Wednesday, August 5, 2009

Post hoc tests

In the last blog entry, we learned about comparing three groups. If we know there is a significant difference between the groups, then we get to run post-hoc tests to see what the exact differences are. They're "post hoc" because it's not legal to run them until after you find a significant effect in the "omnibus" test.

Several kinds of post-hoc tests exist, including Tukey, Scheffe, Bonferroni, and LSD (which makes me smile to this day). The post-hoc tests differ in their strictness: some find significant effects more easily, but the trade-off is a higher experiment-wise error rate, i.e., the chance that you goof and say there's a significant effect when there isn't.

The post-hoc tests also differ in their method of comparison. Tukey and LSD are pairwise tests. Scheffe compares every possible combination of groups. Bonferroni lets you pick and choose specific groups to compare.

I often use Tukey, which is fairly liberal, letting me see significant effects more often. Since no one's life depends on my UX research, I'm willing to accept a higher risk of falsely finding statistically significant results.

Post-hoc tests are easy to run. In SPSS, the ANOVA test gives you options to pick post-hoc tests, and the ANOVA and post hocs are run simultaneously. Keep in mind it's not legal to look at the post-hoc results unless the ANOVA is significant.

Our post-hoc results might look like this:
We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p = .03. A post-hoc Tukey's test revealed significantly shorter account sign-up times for both Getting Started Guides 3 and 4 than for any of Getting Started Guides 1, 2, or 5. Therefore, we recommend implementing either Getting Started Guide 3 or 4 in the new product.
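Outside SPSS, the same ANOVA-then-Tukey workflow can be sketched in Python with statsmodels. The sign-up times below are simulated just to show the mechanics; they are not the data behind the write-up above.

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated sign-up times (seconds) for 5 Getting Started Guides, 20 users each
rng = np.random.default_rng(0)
group_means = [300, 290, 220, 225, 310]   # GSGs 3 and 4 are faster by design
times = np.concatenate([rng.normal(m, 30, size=20) for m in group_means])
guides = np.repeat(["GSG1", "GSG2", "GSG3", "GSG4", "GSG5"], 20)

# Omnibus one-way ANOVA first; the post hoc is only fair game if this is significant
F, p = stats.f_oneway(*[times[guides == g] for g in np.unique(guides)])
print(f"F = {F:.2f}, p = {p:.4f}")

if p < 0.05:
    print(pairwise_tukeyhsd(times, guides))   # all pairwise comparisons, Tukey's HSD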

Sunday, August 2, 2009

Comparing three or more groups

It is possible to compare as many groups or variables as we like. A common situation is testing the effectiveness of several UX treatments. For example, we may wish to test five kinds of Getting Started Guides to find the one that helps the user complete account sign-up most quickly.

When we compared Design A and Design B on sales, we compared the mean per-person sales for Design A and Design B with a t-test. One way (the not-smart way) to compare three or more groups would be to run a separate pairwise test on average time to sign-up for every pair of Getting Started Guides, e.g., GSG 1 vs. GSG 2, GSG 1 vs. 3, GSG 2 vs. 3, etc. We don't want to use this method because it's tedious and hard to summarize (we would have to run 10 separate tests), and our chance of falsely believing we have a significant effect increases with each test we run. Most researchers try to minimize the number of statistical tests they run to reduce the chance of error.

For this situation, we can run an ANOVA (ANalysis Of VAriance) or F-test which compares all groups at once. Back when we ran an independent samples t-test, Design A and B significantly differed only if their pots of scores did not overlap too much. An ANOVA also compares pots of scores by considering the variability within each pot and between all the pots. If the scores of one pot are pretty much the same as the scores of all the other pots, then there cannot be a significant effect. Only if at least one group differs from the others would there be a significant effect.

Example: We test five versions of our web site with a different Getting Started Guide for each. Using log files, we can see how long each user took from opening the GSG to completing account sign-up. We have 20 people in each GSG treatment for a total of 100 people. Using SPSS, we run a one-way ANOVA that gives us an F score telling us there was a significant difference somewhere among the groups. We would have to do post-hoc tests to figure out exactly where the differences were: between GSG 1 and 2? or GSG 2 and 3? or GSG 1, 2, and 3? etc.
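Outside SPSS, a one-way ANOVA can also be run with scipy. Here is a minimal sketch under the same setup, using simulated sign-up times instead of real log files:

import numpy as np
from scipy import stats

# Simulated time-to-sign-up (seconds) for 5 Getting Started Guides, 20 users each
rng = np.random.default_rng(1)
gsg_times = [rng.normal(mean, 30, size=20) for mean in (300, 290, 220, 225, 310)]

# One-way ANOVA: is there a significant difference somewhere among the groups?
F, p = stats.f_oneway(*gsg_times)
df_between = len(gsg_times) - 1                              # groups minus 1
df_within = sum(len(g) for g in gsg_times) - len(gsg_times)  # scores minus groups
print(f"F({df_between}, {df_within}) = {F:.2f}, p = {p:.4f}")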

The exact write-up of results would have the means and standard deviations for each individual group. The F-test score would have two degrees of freedom values (the number of observed scores/groups minus the number of estimates) in parentheses: the # of groups minus 1, and the # of scores minus the # of groups. As usual, the p-level would also be listed so we know the probability that we goofed and falsely believed there was a significant effect. For the purpose of this entry, I won't include the post-hoc tests that describe the exact differences in the data set.

See Table 1 for the means and standard deviations of each Getting Started Guide group's time to account sign-up. We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p = .03. A post-hoc Tukey's test revealed....


Friday, July 31, 2009

103 degrees (of freedom)

In honor of the record 103-degree heat in Seattle this week, I write about degrees of freedom, which are referenced in the results of every inferential stat. Degrees of freedom are the number of actual observed data points in an inferential calculation minus the number of estimated data points. Suppose we compared 50 users of original Excel with 50 users of redesigned Excel on the time required to create and save a new spreadsheet. We would have 100 observed time scores and 2 estimated means (1 for the group of original Excel users and 1 for the redesigned Excel users). Thus we have 98 degrees of freedom. The results of an independent samples t-test would look like t (98) = 2.53, p < 0.05, where the number in parentheses is the degrees of freedom.

Degrees of freedom address the effect of calculating a statistical estimate (the 5% or less probability of a false positive) based on another estimate (the mean score of a group). The more degrees of freedom, the more statistical power you have to find a significant effect. To see this power, consider comparing original Excel vs. redesigned Excel time on task with 2 users in each group. The degrees of freedom would be 4 observed scores minus 2 means = 2. With so few degrees of freedom, the redesigned and original Excel times on task would have to be hugely different to reach statistical significance. With more degrees of freedom, the difference between means would not have to be so exaggerated to reach statistical significance.
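To put a number on that power difference, here is a small illustration (mine, not from any textbook example) of the critical t value a two-tailed test at p < .05 has to clear with 2 versus 98 degrees of freedom:

from scipy import stats

# Critical |t| for a two-tailed test at p < .05, at two different degrees of freedom
for df in (2, 98):
    t_crit = stats.t.ppf(0.975, df)
    print(f"df = {df:3d} -> |t| must exceed {t_crit:.2f}")

# df =   2 -> |t| must exceed 4.30  (means must be hugely different)
# df =  98 -> |t| must exceed 1.98  (a smaller difference can reach significance)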

Thursday, July 30, 2009

Comparing two variables

Sometimes we want to know how the scores in Variable 1 compare to the scores in Variable 2. A common situation is comparing your study participants' pre-test and post-test scores. Or you might compare your participants' level of satisfaction with two things, such as photos manipulated in Pretty Photo Premier versus Super Snapshot Suite. For these situations, we would use Mr. T-Test again, this time as a paired samples t-test.

In the last blog entry, we walked through the example of deciding whether a picture improved sales per visitor on the MowBee site. Using an independent samples t-test, we compared the means and variability of the purchases by Pic vs. No Pic visitors. The less the two pots of scores overlap, the more confident we can be that the Pic condition did statistically increase sales over the No Pic condition. When we use a paired samples t-test, we are also comparing means and variability but of two variables for one group rather than one variable for two groups.

Example: We ran a survey asking participants to rate their satisfaction, on a scale of 1 to 7, with the results produced in Pretty Photo Premier versus Super Snapshot Suite. We have a data file of 100 cases (participants) and three variables: participant id number, PPP satisfaction, and SSS satisfaction. We want to know whether the higher ratings for SSS are statistically significant. If they were, we would write the results with the t-test score (t) and the level of probability (p) that we falsely found a significant difference between the two products. The means and standard deviations for each variable (PPP and SSS) would also be included so you could see which product was rated higher and which lower. Our results could look like this:
Participants were significantly more satisfied, t (99) = 4.85, p < 0.01, with the results of Super Snapshot Suite (M=5.60, SD=1.90) than Pretty Photo Premier (M=3.50, SD = 1.43).
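The same analysis can be sketched with scipy's paired samples t-test; the ratings below are simulated stand-ins for the survey data, not the numbers behind the write-up above.

import numpy as np
from scipy import stats

# Simulated 1-7 satisfaction ratings from the same 100 participants for each product
rng = np.random.default_rng(7)
ppp = np.clip(rng.normal(3.5, 1.4, size=100).round(), 1, 7)   # Pretty Photo Premier
sss = np.clip(rng.normal(5.6, 1.9, size=100).round(), 1, 7)   # Super Snapshot Suite

# Paired samples t-test: two variables measured on one group of people
t, p = stats.ttest_rel(sss, ppp)
print(f"t({len(ppp) - 1}) = {t:.2f}, p = {p:.4f}")
print(f"SSS: M = {sss.mean():.2f}, SD = {sss.std(ddof=1):.2f}")
print(f"PPP: M = {ppp.mean():.2f}, SD = {ppp.std(ddof=1):.2f}")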

Wednesday, July 29, 2009

Comparing two test groups

In the UX world, we often compare how people perform in Design A vs. Design B on certain measures such as pageviews, bounce rate, clickthrough, or sales. To decide whether Design A or Design B is better, we need to compare the means and variability of the users' scores in these two groups.

Sometimes you can tell which design is better by eyeballing the data. Suppose we split our site traffic between versions A and B of our web site. Design A used neon-colored blinking links, and Design B used standard links. After looking at our analytics reports, we might see that A visitors spent an average of 1 minute on the site, while B visitors spent 3 minutes. Both groups had similarly low variability. We can probably say that Design B was better than Design A at getting users to stay on the site because the two groups' scores are so different and there are no extreme outliers skewing the results.

It is when the two groups look much more alike that we really need a statistical test. When comparing two groups, we can run an independent samples t-test. (Fun fact: The t-test was invented at the Guinness brewery.)

A t-test works by comparing the means and variability of two groups of interest. It essentially considers how much the scores for the two groups overlap. If they overlap completely, then the two groups are not different from one another. The less they overlap, the more likely the two groups are statistically different.

Example: We currently sell our flagship product, The MowBee, on a web page with no pictures. We want to know if a picture will increase sales. We split traffic equally between picture and no-picture versions of the site. After a week, we funneled 100 people to the Picture page and 100 people to the No Picture page. Each person's purchase (if any) is recorded. Our data file at the end has 200 cases and three variables: participant id number, condition (pic, no pic), and dollar amount of sales. If we ran this data through SPSS' independent samples t-test and found statistical significance, the results would have a t-test score (t) and the level of probability (p) that we wrongly found a statistical difference between the two groups. The means and standard deviations for each group would also be reported so we can see which was higher. We might have a result like this:
We made significantly more money per visitor, t (198) = 2.47, p < 0.05, with the Picture page (M=$85, SD=44.41) than the No Picture page (M=$47.50, SD=18.45).
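For anyone without SPSS, here is a minimal sketch of the same independent samples t-test in scipy; the purchase amounts are simulated, not real MowBee data.

import numpy as np
from scipy import stats

# Simulated per-visitor purchase amounts (dollars) for each version of the page
rng = np.random.default_rng(3)
pic = rng.normal(85, 44, size=100)        # Picture page visitors
no_pic = rng.normal(47.5, 18, size=100)   # No Picture page visitors

# Independent samples t-test: one variable measured on two separate groups
t, p = stats.ttest_ind(pic, no_pic)
print(f"t({len(pic) + len(no_pic) - 2}) = {t:.2f}, p = {p:.4f}")
print(f"Picture:    M = ${pic.mean():.2f}, SD = {pic.std(ddof=1):.2f}")
print(f"No Picture: M = ${no_pic.mean():.2f}, SD = {no_pic.std(ddof=1):.2f}")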