Thursday, August 6, 2009

Possibly useful books

I have been poking around for possible stats textbooks and discovered two that might be good for user experience researchers based on the sneak peeks I got through Amazon. I ordered copies for review.
But if you want to continue your online learning, I found the Social Science Statistics Blog published by Harvard's Institute for Quantitative Social Science. The blog is on the geeky side—it's from Harvard after all—but is still accessible. The posts seem to be a mix of announcements and critiques of studies and news reports containing statistical data.

Wednesday, August 5, 2009

Post hoc tests

In the last blog entry, we learned about comparing three or more groups. If we know there is a significant difference somewhere among the groups, then we get to run post-hoc tests to see what the exact differences are. They're "post hoc" because it's not legal to run them until after you find a significant effect in the "omnibus" test.

Several kinds of post-hoc tests exist, including Tukey, Scheffe, Bonferroni, and LSD (which makes me smile to this day). The post-hoc tests differ in their strictness: some tests find significant effects more easily, but the trade-off is that they are more prone to experiment-wise error, i.e., you goof and say there's a significant effect when there isn't.

The post-hoc tests also differ in their method of comparison. Tukey and LSD are pairwise tests. Scheffe compares every possible combination of groups. Bonferroni lets you pick and choose specific groups to compare.

I often use Tukey, which is fairly liberal, letting me see significant effects more often. Since no one's life depends on my UX research, I'm willing to accept a higher risk of falsely finding statistically significant results.

Post-hoc tests are easy to run. In SPSS, the ANOVA test gives you options to pick post-hoc tests, and the ANOVA and post hocs are run simultaneously. Keep in mind it's not legal to look at the post hoc unless the ANOVA is significant.

Our post-hoc results might look like this:
We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p = .03. A post-hoc Tukey's test revealed significantly shorter account sign-up times for Getting Started Guides 3 and 4 than for Getting Started Guides 1, 2, and 5. Therefore, we recommend implementing either Getting Started Guide 3 or 4 in the new product.
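If you're not working in SPSS, here's a rough sketch of the same workflow in Python using scipy and statsmodels. The sign-up times and group labels below are invented just to show the mechanics, not results from a real study:

```python
# Rough sketch, not real study data: omnibus one-way ANOVA followed by a
# Tukey post hoc test in Python.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Invented sign-up times (seconds) for five Getting Started Guides, 20 users each
times = np.concatenate([
    rng.normal(300, 40, 20),  # GSG 1
    rng.normal(310, 40, 20),  # GSG 2
    rng.normal(240, 40, 20),  # GSG 3
    rng.normal(235, 40, 20),  # GSG 4
    rng.normal(305, 40, 20),  # GSG 5
])
groups = np.repeat(["GSG1", "GSG2", "GSG3", "GSG4", "GSG5"], 20)

# Run the omnibus ANOVA first; only look at post hocs if it is significant
f, p = stats.f_oneway(*(times[groups == g] for g in np.unique(groups)))
print(f"F = {f:.2f}, p = {p:.4f}")

# Tukey's HSD then compares every pair of groups
print(pairwise_tukeyhsd(times, groups))
```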

Sunday, August 2, 2009

Comparing three or more groups

It is possible to compare as many groups or variables as we like. A common situation is testing the effectiveness of several UX treatments. For example, we may wish to test five kinds of Getting Started Guides to find the one that helps the user complete account sign-up most quickly.

When we compared Design A and Design B on sales, we compared the mean per-person sales for Design A and Design B with a t-test. One way (the not-smart way) to compare three or more groups would be to run a separate pairwise test on average time to sign-up for every pair of Getting Started Guides, e.g., GSG 1 vs. GSG 2, GSG 1 vs. 3, GSG 2 vs. 3, etc. We don't want to use this method because it's tedious and hard to summarize (with five guides, we would have to run 10 tests), and our chance of falsely believing we have a significant effect increases with each test we run. Most researchers try to minimize the number of statistical tests they run to reduce the chance of error.

For this situation, we can run an ANOVA (ANalysis Of VAriance), or F-test, which compares all the groups at once. Back when we ran an independent samples t-test, Design A and B significantly differed only if their pots of scores did not overlap too much. An ANOVA also compares pots of scores by considering the variability within each pot and between all the pots. If the scores in one pot are pretty much the same as the scores in all the other pots, then there cannot be a significant effect. Only if at least one group differs enough from the others would there be a significant effect.

Example: We test five versions of our web site with a different Getting Started Guide for each. Using log files, we can see how long each user took from opening the GSG to completing account sign-up. We have 20 people in each GSG treatment for a total of 100 people. Using SPSS, we run a one-way ANOVA that gives us an F score that says there was a significant difference somewhere among the groups. We would have to do post-hoc tests to figure out exactly where the differences were: between GSG 1 and 2? or GSG 2 and 3? or GSG 1, 2, and 3? etc.
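If you'd like to see the mechanics outside of SPSS, here's a minimal sketch of a one-way ANOVA in Python with scipy. The sign-up times are made up, and I've used only five scores per group to keep the sketch short (the study above would have 20 per group):

```python
# Minimal sketch with invented data: a one-way ANOVA across five groups.
from scipy import stats

# Hypothetical sign-up times (seconds) for each Getting Started Guide
gsg = {
    "GSG1": [310, 295, 330, 305, 320],
    "GSG2": [315, 300, 325, 310, 318],
    "GSG3": [240, 255, 230, 248, 236],
    "GSG4": [238, 250, 228, 245, 233],
    "GSG5": [308, 298, 322, 312, 316],
}

f, p = stats.f_oneway(*gsg.values())

# Degrees of freedom: (# of groups - 1) and (# of scores - # of groups)
n_groups = len(gsg)
n_scores = sum(len(scores) for scores in gsg.values())
df_between, df_within = n_groups - 1, n_scores - n_groups

print(f"F({df_between}, {df_within}) = {f:.2f}, p = {p:.4f}")
```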

The exact write-up of results would have the means and standard deviations for each individual group. The F-test score would be reported with two kinds of degrees of freedom (the number of observed scores or groups minus the number of estimates) in parentheses: the # of groups minus 1, and the # of scores minus the # of groups. As usual, the p-level would also be listed so we know the probability that we goofed and falsely believed there was a significant effect. For the purpose of this entry, I won't include post-hoc tests that describe the exact differences in the data set.

See Table 1 for the means and standard deviations of each Getting Started Guide group's time to account sign-up. We found Getting Started Guides had a significant effect on time to account sign-up, F (4, 95) = 10.27, p = .03. A post-hoc Tukey's test revealed....


Friday, July 31, 2009

103 degrees (of freedom)

In honor of the record 103 degree heat in Seattle this week, I write about degrees of freedom, referenced in the results of every inferential stat. Degrees of freedom are the number of actual observed data points in an inferential calculation minus the number of values estimated from them (such as group means). Suppose we compared 50 users of original Excel with 50 users of redesigned Excel on the time required to create and save a new spreadsheet. We would have 100 observed time scores and 2 estimated means (1 for the group of original Excel users and 1 for the redesigned users). Thus we have 98 degrees of freedom. The results of an independent samples t-test would look like t (98) = 2.53, p < 0.05, where the number in parentheses is the degrees of freedom.

Degrees of freedom account for the fact that an inferential statistic (and its 5%-or-less probability of a false positive) is calculated from other estimates (the mean score of each group). The more degrees of freedom, the more statistical power you have to find a significant effect. To see this power, consider if you were comparing original Excel vs. redesigned Excel time on task with 2 users in each group. The degrees of freedom would be 4 observed scores minus 2 means = 2. With so few degrees of freedom, the redesigned and original Excel times on task would have to be hugely different to find statistical significance. With more degrees of freedom, the difference between means would not have to be so exaggerated in order to find statistical significance.
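If you want to see that power in numbers, here's a tiny Python sketch using scipy: the critical t value you have to beat for p < .05 (two-tailed) shrinks as the degrees of freedom grow.

```python
# Tiny sketch: how the two-tailed critical t value at alpha = .05 shrinks
# as degrees of freedom increase.
from scipy import stats

for df in (2, 10, 30, 98):
    critical_t = stats.t.ppf(0.975, df)  # two-tailed cutoff for p < .05
    print(f"df = {df:3d}: |t| must exceed {critical_t:.2f}")
```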

Thursday, July 30, 2009

Comparing two variables

Sometimes we want to know how the scores in Variable 1 compare to Variable 2. A common situation is comparing your study participants' pre-test and post-test scores. Or you might compare your participants' level of satisfaction with two things such as photos manipulated in Pretty Photo Premier versus Super Snapshot Suite. For these situations, we would use Mr. T-Test again, this time a paired samples t-test.

In the last blog entry, we walked through the example of deciding whether a picture improved sales per visitor on the MowBee site. Using an independent samples t-test, we compared the means and variability of the purchases by Pic vs. No Pic visitors. The less the two pots of scores overlap, the more confident we can be that the Pic condition did statistically increase sales over the No Pic condition. When we use a paired samples t-test, we are also comparing means and variability but of two variables for one group rather than one variable for two groups.

Example: We ran a survey asking participants to rate satisfaction on a scale of 1 to 7 with the results produced in Pretty Photo Premier versus Super Snapshot Suite. We have a data file of 100 cases (participants) and three variables: participant id number, PPP satisfaction, and SSS satisfaction. We want to know whether the higher ratings for SSS are statistically significant. If they are, we would write the results with the t-test score (t) and the level of probability (p) that we falsely found a significant difference between the two products. The means and standard deviations for each variable (PPP and SSS) would also be included so you could see which product was rated higher and which lower. Our results could look like this:
Participants were significantly more satisfied, t (99) = 4.85, p < 0.01, with the results of Super Snapshot Suite (M=5.60, SD=1.90) than Pretty Photo Premier (M=3.50, SD = 1.43).
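For anyone working outside SPSS, here's a rough sketch of a paired samples t-test in Python with scipy. The ratings are randomly generated stand-ins, not our survey's actual data:

```python
# Rough sketch with made-up ratings: a paired samples t-test comparing two
# variables measured on the same 100 participants.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical 1-7 satisfaction ratings from the same participants
ppp = np.clip(np.round(rng.normal(3.5, 1.4, 100)), 1, 7)  # Pretty Photo Premier
sss = np.clip(np.round(rng.normal(5.6, 1.9, 100)), 1, 7)  # Super Snapshot Suite

t, p = stats.ttest_rel(sss, ppp)  # paired samples t-test

print(f"t({len(ppp) - 1}) = {t:.2f}, p = {p:.4f}")
print(f"PPP: M = {ppp.mean():.2f}, SD = {ppp.std(ddof=1):.2f}")
print(f"SSS: M = {sss.mean():.2f}, SD = {sss.std(ddof=1):.2f}")
```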

Wednesday, July 29, 2009

Comparing two test groups

In the UX world, we often compare how people perform in Design A vs. Design B on certain measures such as pageviews, bounce rate, clickthrough, or sales. To decide whether Design A or Design B is better, we need to compare the means and variability of the users' scores in these two groups.

Sometimes you can tell which design is better by eyeballing the data. Suppose we split our site traffic between versions A and B of our web site. Design A used neon-colored blinking links, and Design B used standard links. After looking at our analytics reports, we might see that A visitors spent an average of 1 minute on the site, while B visitors spent 3 minutes. Both groups had similarly low variability. We can probably say that Design B was better than Design A for getting users to stay on the site because the scores for the two groups are so different and there are no extreme outliers skewing the results.

It is when the two groups look much more alike that we really need a statistical test. When comparing two groups, we can run an independent samples t-test. (Fun fact: The t-test was invented at the Guinness brewery.)

A t-test works by comparing the means and variability of two groups of interest. It essentially considers how much the scores for each of the two groups overlap. If they overlap completely, then the two groups are not different from one another. The less they overlap, the more likely the two groups are statistically different.

Example: We currently sell our flagship product, The MowBee, on a web page with no pictures. We want to know if a picture will increase sales. We split traffic equally between picture and no picture versions of the site. After a week, we have funneled 100 people to the Picture page and 100 people to the No Picture page. Each person's purchase (if any) is recorded. Our data file at the end has 200 cases and three variables: participant id number, condition (pic, no pic), and dollar amount of sales. If we ran this data through SPSS' independent samples t-test and found statistical significance, the results would have a t-test score (t) and the level of probability (p) that we wrongly found a statistical difference between the two groups. The means and standard deviations for each group would also be reported so we can see which was higher. We might have a result like this:
We made significantly more money per visitor, t (198) = 2.47, p < 0.05, with the Picture page (M=$85, SD=44.41) than the No Picture page (M=$47.50, SD=18.45).
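If you don't have SPSS handy, here's a rough sketch of the same comparison in Python with scipy. The purchase amounts are randomly generated for illustration, not real MowBee sales:

```python
# Rough sketch with invented sales data: an independent samples t-test
# comparing per-visitor sales for the Picture and No Picture pages.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical per-visitor sales in dollars, 100 visitors per condition
pic = rng.normal(85, 44, 100)
no_pic = rng.normal(47.5, 18, 100)

t, p = stats.ttest_ind(pic, no_pic)  # independent samples t-test
df = len(pic) + len(no_pic) - 2      # 200 scores minus 2 estimated means

print(f"t({df}) = {t:.2f}, p = {p:.4f}")
print(f"Picture:    M = ${pic.mean():.2f}, SD = {pic.std(ddof=1):.2f}")
print(f"No Picture: M = ${no_pic.mean():.2f}, SD = {no_pic.std(ddof=1):.2f}")
```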

Your friend, the data file

Before you can use a program like SPSS, you need a spreadsheet of study data. Here are some basics to help you prepare a data file. Let's use the example that we ran a usability study with 20 participants on a new personal finance web site.

Each row of your spreadsheet represents one case (or participant) in the study. For this file, you would have 20 cases. Each column represents one variable. Your variables would be questions or data that you collected for each person such as participant id number, age, computer ownership, time on tasks, satisfaction ratings, and comprehension test scores.

Since some statistical packages accept only numerical data, a best practice is to code everything as numbers. Suppose you asked participants for their age group: 18-24, 25-34, 35-44, etc. You can recode their responses as 1 (=18-24), 2 (=25-34), 3 (=35-44), etc.

Maintaining a consistent numbering scheme is also a good practice. I like to code No and Yes answers as 0 and 1 and use 0 for any kind of No answer such as "I don't own X" or "I don't use Y."

I like to code Likert scale answers starting at 1, where 1 is the most negative option and n is the most positive. If a question is worded negatively, you may have to reverse the coding so the answers still run from semantically negative to positive. Using a consistent scale makes it easier to analyze related questions.

Example A: Imagine you asked: What is your level of agreement with these statements? where 1 = I definitely disagree, 3 = I feel neutral, and 5 = I definitely agree.
  1. I balance my checkbook.
  2. I never keep my pay check stubs.

To make these answers semantically parallel, you would code question 1 exactly as the participants answered it: 1=1, 2=2, etc. You would flip the scale (1=5, 2=4, etc.) for question 2 because the statement is negative: when participants say they disagree, they mean they do keep their pay stubs. Thus, if you test for a statistical correlation, you are in essence looking for a positive correlation between balancing a checkbook and keeping pay stubs, which is much simpler to think about than a negative correlation with not keeping pay stubs.
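If you do your clean-up in Python rather than in a spreadsheet, here's a small sketch of these coding conventions with pandas. The column names and values are hypothetical:

```python
# Small sketch of the coding conventions above; all names and values are made up.
import pandas as pd

df = pd.DataFrame({
    "participant_id": [1, 2, 3],
    "age_group": ["18-24", "35-44", "25-34"],
    "owns_computer": ["Yes", "No", "Yes"],
    "q1_balance_checkbook": [4, 2, 5],  # positively worded, keep as answered
    "q2_never_keep_stubs": [2, 4, 1],   # negatively worded, needs flipping
})

# Code categories as numbers
df["age_group"] = df["age_group"].map({"18-24": 1, "25-34": 2, "35-44": 3})
df["owns_computer"] = df["owns_computer"].map({"No": 0, "Yes": 1})

# Flip the negatively worded 1-5 item so that a high score means "keeps pay stubs"
df["q2_keeps_stubs"] = 6 - df["q2_never_keep_stubs"]

print(df)
```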

If you have missing answers from your participants, keep those spreadsheet cells blank so you do not accidentally run calculations on them. Any answer you want to exclude from calculations should also be left blank.

Example B: Suppose you ask, "How satisfied are you with your bank?" where 1 = Very unsatisfied, 3 = Neutral, and 5 = Very satisfied. You also offer an "I don't know" option. Any "I don't know's" should be coded as blank. If you ran a stat on satisfaction level, you would want to exclude people who don't know because they cannot speak to this question.
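Here's a tiny pandas sketch of keeping "I don't know" out of the math; the ratings are hypothetical:

```python
# Tiny sketch with hypothetical ratings: treat "I don't know" as blank (NaN)
# so it drops out of calculations.
import numpy as np
import pandas as pd

satisfaction = pd.Series([5, 3, "I don't know", 4, 1])

satisfaction = satisfaction.replace("I don't know", np.nan).astype(float)

print(satisfaction.mean())   # NaN is skipped: (5 + 3 + 4 + 1) / 4 = 3.25
print(satisfaction.count())  # 4 answers actually counted
```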