Thursday, August 6, 2009

Possibly useful books

I have been poking around for possible stats textbooks and discovered two that might be good for user experience researchers based on the sneak peeks I got through Amazon. I ordered copies for review.
But if you want to continue your online learning in the meantime, check out the Social Science Statistics Blog published by Harvard's Institute for Quantitative Social Science. The blog is on the geeky side—it's from Harvard after all—but is still accessible. The posts seem to be a mix of announcements and critiques of studies and news reports containing statistical data.

Wednesday, August 5, 2009

Post hoc tests

In the last blog entry, we learned about comparing three or more groups. If we know there is a significant difference between the groups, then we get to run post-hoc tests to see what the exact differences are. They're "post hoc" because it's not legal to run them until after you find a significant effect in the "omnibus" test.

Several kinds of post-hoc tests exist, including Tukey, Scheffe, Bonferroni, and LSD (which makes me smile to this day). The post-hoc tests differ in their strictness: some tests find significant effects more easily, but the trade-off is that they are more prone to experiment-wise error, i.e., you goof and say there's a significant effect when there isn't.

The post-hoc tests also differ in their method of comparison. Tukey and LSD are pairwise tests. Scheffe compares every possible combination of groups. Bonferroni lets you pick and choose specific groups to compare.

I often use Tukey, which is fairly liberal, letting me see significant effects more often. Since no one's life depends on my UX research, I'm willing to accept a higher risk of falsely finding statistically significant results.

Post-hoc tests are easy to run. In SPSS, the ANOVA test gives you options to pick post-hoc tests, and the ANOVA and post-hoc tests are run simultaneously. Keep in mind it's not legal to look at the post-hoc results unless the ANOVA is significant.
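If you're curious what this looks like outside SPSS, here's a minimal sketch in Python (assuming the scipy and statsmodels packages are installed; the sign-up times and group labels are made up for illustration). Same rule applies: run the omnibus ANOVA first, and only peek at the Tukey results if it comes back significant.

# A minimal sketch: omnibus one-way ANOVA first, Tukey post hoc only if significant.
# The sign-up times (in seconds) are invented for illustration.
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

gsg1 = [320, 310, 295, 330, 305]
gsg2 = [250, 240, 260, 245, 255]
gsg3 = [318, 322, 301, 315, 327]

# Omnibus test
f_stat, p_value = stats.f_oneway(gsg1, gsg2, gsg3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")

# Post hoc only if the omnibus ANOVA is significant
if p_value < .05:
    times = gsg1 + gsg2 + gsg3
    groups = ["GSG1"] * 5 + ["GSG2"] * 5 + ["GSG3"] * 5
    print(pairwise_tukeyhsd(times, groups, alpha=0.05).summary())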

Our post-hoc results might look like this:
We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p < .001. A post-hoc Tukey's test revealed significantly shorter account sign-up times for both Getting Started Guides 3 and 4 than for any of Getting Started Guides 1, 2, or 5. Therefore, we recommend implementing either Getting Started Guide 3 or 4 in the new product.

Sunday, August 2, 2009

Comparing three or more groups

It is possible to compare as many groups or variables as we like. A common situation is testing the effectiveness of several UX treatments. For example, we may wish to test five kinds of Getting Started Guides to find the one that helps the user complete account sign-up most quickly.

When we compared Design A and Design B on sales, we compared the mean per-person sales for Design A and Design B with a t-test. One way (the not smart way) to compare three or more groups would be to run a pairwise test for every pair of Getting Started Guides and their average times to sign-up, e.g., GSG 1 vs. GSG 2, GSG 1 vs. 3, GSG 2 vs. 3, etc. We don't want to do it this way because it's tedious and hard to summarize (we would have to run 10 separate tests), and our chance of falsely believing we have a significant effect increases with each test we run. Most researchers try to minimize the number of statistical tests they run to reduce the chance of error.
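To put a rough number on that last point, here's a back-of-the-envelope sketch in Python (it assumes the 10 tests are independent, which isn't strictly true, but it shows how fast the risk grows):

# Experiment-wise error, roughly: with 10 pairwise tests each run at the
# usual .05 level, the chance of at least one false positive balloons.
# Assumes independent tests, so treat this as a ballpark figure.
alpha = 0.05
num_tests = 10  # 5 Getting Started Guides -> 10 possible pairs
chance_of_false_positive = 1 - (1 - alpha) ** num_tests
print(round(chance_of_false_positive, 2))  # about 0.40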

For this situation, we can run an ANOVA (ANalysis Of VAriance) or F-test, which compares all groups at once. Back when we ran an independent samples t-test, Design A and B significantly differed only if their pots of scores did not overlap too much. An ANOVA also compares pots of scores by considering the variability within each pot and between all the pots. If the scores in every pot are pretty much the same as the scores in all the other pots, then there cannot be a significant effect. Only if at least one group differs from the others will there be a significant effect.
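For anyone who wants to see the machinery, here is a small Python sketch that spells out the within-pot versus between-pot idea with made-up numbers (your stats package does all of this for you, so this is purely for intuition):

# F is the variability between the pots divided by the variability
# within the pots. The scores below are made up.
groups = [
    [300, 310, 290, 305],   # pot A
    [250, 255, 245, 260],   # pot B
    [295, 305, 300, 310],   # pot C
]

all_scores = [score for pot in groups for score in pot]
grand_mean = sum(all_scores) / len(all_scores)

# Between-pots: how far each pot's mean sits from the grand mean
ss_between = sum(len(pot) * (sum(pot) / len(pot) - grand_mean) ** 2 for pot in groups)
# Within-pots: how spread out the scores are inside each pot
ss_within = sum((score - sum(pot) / len(pot)) ** 2 for pot in groups for score in pot)

df_between = len(groups) - 1                   # number of pots minus 1
df_within = len(all_scores) - len(groups)      # number of scores minus number of pots
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")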

Example: We test five versions of our web site with a different Getting Started Guide for each. Using log files, we are able to see how long each user took from opening the GSG to completing account sign-up. We have 20 people in each GSG treatment for a total of 100 people. Using SPSS, we run a one-way ANOVA that gives us an F score that says there was a significant difference somewhere among the groups. We would have to do post-hoc tests to figure out exactly where the differences were: between GSG 1 and 2? or GSG 2 and 3? or GSG 1, 2, and 3? etc.
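Just for fun, here's what that example could look like in Python instead of SPSS (a sketch only: the sign-up times are randomly generated stand-ins for the real log-file data, so the numbers won't match the write-up below):

# 5 Getting Started Guides, 20 users each, one-way ANOVA on (simulated)
# time to account sign-up. Random data stand in for the log files.
import random
from scipy import stats

random.seed(1)
guides = {}
for i, typical_time in enumerate([300, 295, 240, 250, 310], start=1):
    guides[f"GSG{i}"] = [random.gauss(typical_time, 30) for _ in range(20)]

f_stat, p_value = stats.f_oneway(*guides.values())
print(f"F(4, 95) = {f_stat:.2f}, p = {p_value:.4f}")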

The exact write-up of results would have the means and standard deviations for each individual group. The F-test score would be reported with two kinds of degrees of freedom (roughly, the number of observations or groups minus the number of estimates) in parentheses: the number of groups minus 1, and the number of scores minus the number of groups. As usual, the p-level would also be listed so we know the probability that we goofed and falsely believed there was a significant effect. For the purpose of this entry, I won't include post-hoc tests that describe the exact differences in the data set.
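To make the degrees of freedom concrete for this example: 5 Getting Started Guides and 100 users gives 5 - 1 = 4 between groups and 100 - 5 = 95 within groups, which is where the (4, 95) in the write-up below comes from.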

See Table 1 for the means and standard deviations of each Getting Started Guide group's time to account sign-up. We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p < .001. A post-hoc Tukey's test revealed....