Sunday, August 2, 2009

Comparing three or more groups

It is possible to compare as many groups or variables as we like. A common situation is testing the effectiveness of several UX treatments. For example, we may wish to test five kinds of Getting Started Guides to find the one that helps the user complete account sign-up most quickly.

When we compared Design A and Design B on sales, we used a t-test on the mean per-person sales for each design. One way (the not smart way) to compare three or more groups would be to run a separate pair-wise test for every pair of Getting Started Guides on average time to sign-up: GSG 1 vs. GSG 2, GSG 1 vs. GSG 3, GSG 2 vs. GSG 3, and so on. We don't want to use this method because it's tedious and hard to summarize (with five guides, we would have to run 10 separate tests), and our chance of falsely believing we have a significant effect grows with every test we run (see the quick calculation below). Most researchers try to minimize the number of statistical tests they run to reduce the chance of error.
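To see how fast that error risk piles up, here is a quick back-of-the-envelope calculation (a minimal sketch in Python; the final figure assumes the 10 tests are independent, which is a simplification):

from math import comb

groups = 5    # five Getting Started Guides
alpha = 0.05  # significance level used for each individual test

pairs = comb(groups, 2)                # pair-wise t-tests needed: 10
familywise = 1 - (1 - alpha) ** pairs  # chance of at least one false positive
print(pairs, round(familywise, 2))     # prints: 10 0.4

In other words, running all 10 pair-wise tests at the usual .05 level gives us roughly a 40% chance of at least one false alarm somewhere in the batch.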

For this situation, we can run an ANOVA (ANalysis Of VAriance), or F-test, which compares all the groups at once. Back when we ran an independent-samples t-test, Design A and B significantly differed only if their pots of scores did not overlap too much. An ANOVA also compares pots of scores, by considering the variability within each pot and the variability between the pots. If the scores in every pot are pretty much the same as the scores in all the other pots, then there cannot be a significant effect. Only if at least one group differs from the others will there be a significant effect.
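In symbols, the F score is just the ratio of those two kinds of variability (this is the standard one-way ANOVA formula, not anything specific to one statistics package):

F = MS_between / MS_within = (SS_between / (k − 1)) / (SS_within / (N − k))

where k is the number of groups and N is the total number of scores. The bigger the differences between the pots relative to the spread of scores inside each pot, the bigger F gets, and the less likely the result is due to chance.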

Example: We test five versions of our web site, each with a different Getting Started Guide. Using log files, we can see how long each user took from opening the GSG to completing account sign-up. We have 20 people in each GSG treatment, for a total of 100 people. Using SPSS, we run a one-way ANOVA that gives us an F score telling us there was a significant difference somewhere between the groups. We would then have to run post-hoc tests to figure out exactly where the differences were: between GSG 1 and 2? GSG 2 and 3? GSG 1, 2, and 3? And so on.
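I use SPSS here, but the same one-way ANOVA is easy to run elsewhere. Below is a minimal sketch in Python with SciPy; the sign-up times are simulated stand-ins (the group means and spread are made up for illustration), not real log-file data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-in data: 20 sign-up times (in seconds) per Getting Started Guide.
# Real data would come from the log files described above.
gsg_times = [rng.normal(loc=mean, scale=30, size=20)
             for mean in (300, 310, 295, 350, 305)]

f_score, p_value = stats.f_oneway(*gsg_times)  # one-way ANOVA across all five groups
df_between = len(gsg_times) - 1                              # groups minus 1 = 4
df_within = sum(len(g) for g in gsg_times) - len(gsg_times)  # scores minus groups = 95
print(f"F({df_between}, {df_within}) = {f_score:.2f}, p = {p_value:.4f}")

A significant p here tells us only that at least one guide differs from the others; just like in SPSS, we would still need post-hoc tests to find out which ones.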

The exact write-up of results would include the means and standard deviations for each individual group. The F score would be followed by two types of degrees of freedom in parentheses: the between-groups df (the number of groups minus 1) and the within-groups df (the total number of scores minus the number of groups). In our example, that is 5 − 1 = 4 and 100 − 5 = 95. As usual, the p-level would also be listed so we know the probability that we goofed and falsely believed there was a significant effect. For the purpose of this entry, I won't include the post-hoc tests that describe the exact differences in the data set.

See Table 1 for the means and standard deviations of each Getting Started Guide group's time to account sign-up. We found the Getting Started Guide had a significant effect on time to account sign-up, F(4, 95) = 10.27, p < .001. A post-hoc Tukey's test revealed....

