Friday, July 31, 2009

103 degrees (of freedom)

In honor of the record 103 degree heat in Seattle this week, I write about degrees of freedom, referenced in the results of every inferential stat. Degrees of freedom are the number of actual observed data points in an inferential calculation minus the number of estimates (such as group means) calculated from them. Suppose we compared 50 users of original Excel with 50 users of redesigned Excel on the time required to create and save a new spreadsheet. We would have 100 observed time scores and 2 estimated means (1 for the group of original Excel users and 1 for the redesigned users). Thus we have 98 degrees of freedom. The results of an independent samples t-test would look like t (98) = 2.53, p < 0.05, where the number in parentheses is the degrees of freedom.

Degrees of freedom account for the fact that one statistical estimate (the 5% or less probability of a false positive) is being calculated on top of other estimates (the mean scores of the groups). The more degrees of freedom, the more statistical power you have to find a significant effect. To see this power, imagine comparing original Excel vs. redesigned Excel time on task with 2 users in each group. The degrees of freedom would be 4 observed scores minus 2 means = 2. With so few degrees of freedom, the redesigned and original Excel times on task would have to be hugely different to find statistical significance. With more degrees of freedom, the difference between means would not have to be so exaggerated in order to find statistical significance.
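You can watch the bar drop as degrees of freedom grow. Here is a minimal sketch in Python (assuming scipy is installed; this is my illustration, not SPSS output) of the critical t value that a two-tailed test at the 0.05 alpha level must exceed:

    # How the bar for statistical significance drops as degrees of freedom grow:
    # the critical t value for a two-tailed test at alpha = 0.05.
    from scipy import stats

    for df in (2, 10, 30, 98):
        crit = stats.t.ppf(0.975, df)  # two-tailed: 2.5% in each tail
        print(f"df = {df:3d}: |t| must exceed {crit:.2f}")
    # df =   2: |t| must exceed 4.30
    # df =  10: |t| must exceed 2.23
    # df =  30: |t| must exceed 2.04
    # df =  98: |t| must exceed 1.98

With only 2 degrees of freedom, the t score has to be more than twice as large to reach significance as it does with 98.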

Thursday, July 30, 2009

Comparing two variables

Sometimes we want to know how the scores in Variable 1 compare to Variable 2. A common situation is comparing your study participants' pre-test and post-test scores. Or you might compare your participants' level of satisfaction with two things such as photos manipulated in Pretty Photo Premier versus Super Snapshot Suite. For these situations, we would use Mr. T-Test again, this time a paired samples t-test.

In the last blog entry, we walked through the example of deciding whether a picture improved sales per visitor on the MowBee site. Using an independent samples t-test, we compared the means and variability of the purchases by Pic vs. No Pic visitors. The less the two pots of scores overlap, the more confident we can be that the Pic condition did statistically increase sales over the No Pic condition. When we use a paired samples t-test, we are also comparing means and variability but of two variables for one group rather than one variable for two groups.

Example: We ran a survey asking participants to rate satisfaction on a scale of 1 to 7 with the results produced in Pretty Photo Premier versus Super Snapshot Suite. We have a data file of 100 cases (participants) and three variables: participant id number, PPP satisfaction, and SSS satisfaction. We want to know whether the higher ratings for SSS are statistically significant. If they are, we would write the results with the t-test score (t) and the level of probability (p) that we falsely found a significant difference between the two products. The means and standard deviations for each variable (PPP and SSS) would also be included so you could see which product was rated higher and which lower. Our results could look like this:
Participants were significantly more satisfied, t (99) = 4.85, p < 0.01, with the results of Super Snapshot Suite (M=5.60, SD=1.90) than Pretty Photo Premier (M=3.50, SD = 1.43).
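If you wanted to run the same test outside of SPSS, here is a minimal sketch in Python (assuming scipy and numpy are installed; the ratings are invented to mimic the example, not real survey data):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    ppp = rng.integers(1, 8, size=100).astype(float)         # PPP satisfaction, 1-7
    sss = np.clip(ppp + rng.integers(0, 4, size=100), 1, 7)  # SSS rated somewhat higher

    t, p = stats.ttest_rel(sss, ppp)  # paired: both ratings come from the same person
    df = len(ppp) - 1                 # 100 pairs - 1 = 99 degrees of freedom
    print(f"t({df}) = {t:.2f}, p = {p:.4f}")
    print(f"SSS: M={sss.mean():.2f}, SD={sss.std(ddof=1):.2f}")
    print(f"PPP: M={ppp.mean():.2f}, SD={ppp.std(ddof=1):.2f}")

The key difference from the independent samples version is that each row of data belongs to one participant, so the test works on the per-person differences between the two ratings.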

Wednesday, July 29, 2009

Comparing two test groups

In the UX world, we often compare how people perform in Design A vs. Design B on certain measures such as pageviews, bounce rate, clickthrough, or sales. To decide whether Design A or Design B is better, we need to compare the means and variability of the users' scores in these two groups.

Sometimes you can tell which design is better by eyeballing the data. Suppose we split our site traffic between versions A and B of our web site. Design A used neon-colored blinking links, and Design B used standard links. After looking at our analytics reports, we might see that A visitors spent an average of 1 minute on the site, while B visitors spent 3 minutes. Both groups had similarly low variability. We can probably say that Design B was better than Design A for getting users to stay on the site because the two groups' scores are so different and no extreme outliers are skewing the results.

It is when the two groups look much more alike that we really need a statistical test. When comparing two groups, we can run an independent samples t-test. (Fun fact: The t-test was invented at the Guinness brewery.)

A t-test works by comparing the means and variability of two groups of interest. It essentially considers how much the scores for each of the two groups overlap. If they overlap completely, then the two groups are not different from one another. The less they overlap, the more likely the two groups are statistically different.

Example: We currently sell our flagship product, The MowBee, on a web page with no pictures. We want to know if a picture will increase sales. We split traffic equally between picture and no picture versions of the site. After a week, we funneled 100 people to the Picture page and 100 people to the No Picture page. Each person's purchase (if any) is recorded. Our data file at the end has 200 cases and three variables: participant id number, condition (pic, no pic), and dollar amount of sales. If we ran this data through SPSS' independent samples t-test and found statistical significance, the results would have a t-test score (t) and the level of probability (p) that we wrongly found a statistical difference between the two groups. The means and standard deviations for each group would also be reported so we can see which was higher. We might have a result like this:
We made significantly more money per visitor, t (198) = 2.47, p < 0.05, with the Picture page (M=$85, SD=44.41) than the No Picture page (M=$47.50, SD=18.45).
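For readers without SPSS, here is a minimal sketch in Python (assuming scipy and numpy; the sales figures are made up to resemble the example) that splits the data file by condition and runs the same test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    # Hypothetical data file: condition (1 = Pic, 0 = No Pic) and sales per visitor.
    condition = np.repeat([1, 0], 100)
    sales = np.where(condition == 1,
                     rng.gamma(4.0, 21.0, size=200),  # Pic visitors tend to spend more
                     rng.gamma(4.0, 12.0, size=200))

    pic, no_pic = sales[condition == 1], sales[condition == 0]
    t, p = stats.ttest_ind(pic, no_pic)  # independent samples t-test
    df = len(pic) + len(no_pic) - 2      # 200 scores - 2 means = 198
    print(f"t({df}) = {t:.2f}, p = {p:.4f}")
    print(f"Pic: M=${pic.mean():.2f}, SD={pic.std(ddof=1):.2f}")
    print(f"No Pic: M=${no_pic.mean():.2f}, SD={no_pic.std(ddof=1):.2f}")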

Your friend, the data file

Before you can use a program like SPSS, you need a spreadsheet of study data. Here are some basics to help you prepare a data file. Let's use the example that we ran a usability study with 20 participants on a new personal finance web site.

Each row of your spreadsheet represents one case (or participant) in the study. For this file, you would have 20 cases. Each column represents one variable. Your variables would be questions or data that you collected for each person such as participant id number, age, computer ownership, time on tasks, satisfaction ratings, and comprehension test scores.

Since some statistical packages accept only numerical data, a best practice is to code everything as numbers. Suppose you asked participants for their age group: 18-24, 25-34, 35-44, etc. You can recode their responses as 1 (=18-24), 2 (=25-34), 3 (=35-44), etc.

Maintaining a consistent numbering scheme is also a good practice. I like to code No and Yes answers as 0 and 1 and use 0 for any kind of No answer such as "I don't own X" or "I don't use Y."
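As a toy sketch in Python (the labels and responses here are made up), recoding comes down to a simple lookup:

    # Hypothetical recoding: map raw survey answers to consistent numeric codes.
    age_codes = {"18-24": 1, "25-34": 2, "35-44": 3}
    yes_no_codes = {"Yes": 1, "No": 0, "I don't own X": 0}  # 0 for any kind of No

    raw_ages = ["25-34", "18-24", "35-44"]
    coded_ages = [age_codes[a] for a in raw_ages]
    print(coded_ages)  # -> [2, 1, 3]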

I like to code Likert scale answers starting at 1, where 1 is the most negative option and the highest number is the most positive. If a question is worded negatively, you may have to recode its answers so that they run in the same negative-to-positive order. Using a consistent scale makes it easier to analyze related questions.

Example A: Imagine you asked: What is your level of agreement with these statements? where 1 = I definitely disagree, 3 = I feel neutral, and 5 = I definitely agree.
  1. I balance my checkbook.
  2. I never keep my paycheck stubs.

To make these answers semantically parallel, you would code question 1 exactly as the participants answered it: 1=1, 2=2, etc. You would flip the scale (1=5, 2=4, etc.) for question 2 because the statement is negative: when participants say they disagree, they mean they do keep their paycheck stubs. Thus, if you test for a statistical correlation, you in essence look for a positive correlation between balancing a checkbook and keeping paycheck stubs - much simpler to think about than a negative correlation with not keeping them.
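In code, flipping a 5-point scale is one line. A minimal Python sketch, with made-up answers to question 2:

    # Reverse-coding a negatively worded 5-point Likert item so that
    # higher always means "more positive": 1<->5, 2<->4, 3 stays 3.
    q2_raw = [1, 2, 5, 4, 3]              # answers to "I never keep my paycheck stubs"
    q2_recoded = [6 - x for x in q2_raw]  # (scale max + 1) - answer
    print(q2_recoded)                     # -> [5, 4, 1, 2, 3]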

If you have missing answers from your participants, keep those spreadsheet cells blank so you do not accidentally run calculations on them. Any answer you want to exclude from calculations should also be left blank.

Example B: Suppose you ask, "How satisfied are you with your bank?" where 1 = Very unsatisfied, 3 = Neutral, and 5 = Very satisfied. You also offer an "I don't know" option. Any "I don't know's" should be coded as blank. If you ran a stat on satisfaction level, you would want to exclude people who don't know because they cannot speak to this question.
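In a tool like Python's numpy (assuming you code missing answers as NaN rather than, say, 0), the blanks then drop out of the calculation automatically:

    import numpy as np

    # NaN stands in for blanks / "I don't know" answers (illustration data).
    satisfaction = np.array([5, 4, np.nan, 3, 5, np.nan, 2])
    print(np.nanmean(satisfaction))  # mean of the 5 real answers only -> 3.8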

Monday, July 27, 2009

Making predictions

We can describe a data set's mean and standard deviation with perfect certainty. However, sometimes we want to draw conclusions that stretch beyond the data set, such as when we use a sample to represent a whole population. How do we know we didn't draw a sample of people who like X or perform Y purely by chance? For these cases, we use inferential stats, which include ANOVA, regression, correlations, chi-square, MANOVA, and more. The general premise of inferential stats is to say with a level of confidence whether a phenomenon occurred because of a real relationship between our variables or by pure chance.

Let's say we asked 250 people on the street whether they preferred Search Engine A or B. We would expect the answers to split about 50/50 if the search engines did not differ from each other. Using an inferential statistic, we could test whether our result of 150 people preferring Search Engine A and 100 preferring Search Engine B was due to chance or to an actual preference for Search Engine A.

As part of this test we would state our tolerance for error or the level that we require for statistical significance. In social sciences, the alpha level, or bar for statistical significance, is often 0.05, that is, we are willing to accept a 5% chance that we incorrectly believe that people have a preference for Search Engine A. The tolerance for error will change depending on the test context. A clinical trial of a new drug will probably have lower tolerance for error than a taste test of chicken nuggets.

In this example, we would run a one-way chi-square, which tells us whether the preference for Search Engine A was statistically significant. If significant, the p value (probability value) would be less than what we set as the alpha level. The results of a chi-square and any other inferential stat would have the statistical test score, in this case a χ2, degrees of freedom and/or sample size in parentheses, and the level of probability (p). If appropriate, the means and standard deviations of the comparison groups would be reported. Below, the sample report shows the χ2 value as 10 based on 1 degree of freedom and a sample of 250 scores, with less than 1 percent probability that the results were due to chance.

Our street survey showed a statistically significant preference for Search Engine A, χ2 (1, N=250) = 10, p < 0.01. Future research can investigate the design characteristics that make Search Engine A more preferable than B.
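Here is a minimal sketch of that one-way chi-square in Python (assuming scipy; only the two observed counts from the example are needed):

    from scipy import stats

    observed = [150, 100]                # prefer A, prefer B
    chi2, p = stats.chisquare(observed)  # expected defaults to an even 125/125 split
    df = len(observed) - 1
    print(f"chi2({df}, N={sum(observed)}) = {chi2:.1f}, p = {p:.4f}")
    # -> chi2(1, N=250) = 10.0, p = 0.0016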

Sunday, July 26, 2009

Describing your data set

Sometimes you want to be able to talk about what your data set or group of users looked like as a whole. For example, you might want to talk about how technically knowledgeable the users in your last usability study were. Use descriptive stats for this purpose. Some common descriptive stats include frequency, central tendency, and variability.

==========
Frequency
==========
One way to describe your data set is to report the size of each group that you care about. These groups might be based on gender, visual appeal rating, or time on task. Frequency (or count) is the number of times these various items appear in your data set.

Example A: Suppose you took a survey of 50 people asking them what word processing package they used. 30 people said Microsoft Word, 10 WordPerfect, 5 Google Docs, 3 Notepad, and 2 Wordpad. You could report the frequencies as Ns (30 Microsoft Word, 10 WordPerfect, etc.) or as percentages (60% of respondents used Microsoft Word).

The number of people, or N, in your sample is also an important frequency to report because it usually matters whether your conclusions are based on a sample of 2 or 2,000 people.

A sample frequency report:
We tested 25 participants. They were technically savvy. Almost all had used computers before (n=24). In addition, all participants scored at least 60 on Dr. Wei's Computer Competency Test, and 15 participants scored 90 or higher.
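Frequencies need nothing fancier than counting. A minimal Python sketch, using the word processor survey from Example A:

    from collections import Counter

    responses = (["Microsoft Word"] * 30 + ["WordPerfect"] * 10 +
                 ["Google Docs"] * 5 + ["Notepad"] * 3 + ["Wordpad"] * 2)
    counts = Counter(responses)
    n = len(responses)
    for package, count in counts.most_common():
        print(f"{package}: n={count} ({100 * count / n:.0f}%)")
    # prints one line per package, e.g., Microsoft Word: n=30 (60%)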

================
Central Tendency
================
Sometimes we want to describe our data in a more shorthand way than frequency. In these cases, we talk about the average or central tendency of a group. Oftentimes, we use the mean to describe the average of a group, but the median or the mode can also serve as the average. Let's look closer at when and why we use each; a short code sketch after this list pulls all three together.

  • We calculate the mean (M) when we have interval data such as time, number of visits, comprehension test scores, or satisfaction ratings. The mean = the sum of all scores divided by # of scores.

    Example B:
    Suppose we have 5 participants in a usability test, and they took 10, 12, 40, 20, and 15 seconds, respectively, to complete a task. Their mean time was (10+12+40+20+15)/5 = 19.4 seconds.
  • We can use the median to describe the central point of a set of ordinal data such as rankings. The median = the midpoint of a set of scores.

    Example C:
    Suppose we have our 5 participants rate the importance of Feature X on a scale of 1 to 7 where 1 = "Not at all important" and 7 = "Extremely important." This scale is ordinal rather than interval because only the two end points are labeled, so each participant's interpretation of the values of 2-6 varies. Your participants' scores are 1, 6, 6, 7, and 7. The median is 6, which is the midpoint of the dataset when it is in numerical order. If there had been a 6th participant who gave a rating of 7, the median would be the mean of the middle two numbers, or (6+7)/2 = 6.5.

    You can also use the median to describe interval data if they are heavily skewed by extreme spikes or dips. Note: Using a mean and standard deviation to describe interval data is probably more common than using the median; the skewness or variability of the data shows up in the standard deviation (keep reading this post).

    Example D:
    Suppose in Example B, the time scores had been 10, 12, 180, 20, and 15 seconds. (Participant 3 had stopped in the middle of the task to ask questions and share an anecdote.) The mean would have been 47.4 seconds, suggesting the task was very time-consuming to complete. The median score of 15 seconds is a much more representative description of how long it takes to complete that task.
  • We can use the mode to describe the average of a set of nominal data such as the types of computers that people own. The mode = the most frequently occurring score.

    Example E:
    Suppose we gave a survey to 100 people asking them which computers they own. In our spreadsheet for responses, we record responses as follows: PC=1, Mac=2, Linux=3 (other computer types did not matter). We found that the participants owned 90 PCs, 30 Macs, and 2 Linux boxes. The mode is 1 or PC, the most frequently occurring response.
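Pulling Examples B through E together, here is a minimal sketch using Python's standard statistics module (no stats package needed):

    import statistics

    times = [10, 12, 40, 20, 15]               # Example B (seconds)
    print(statistics.mean(times))              # 19.4

    ratings = [1, 6, 6, 7, 7]                  # Example C (1-7 importance)
    print(statistics.median(ratings))          # 6
    print(statistics.median(ratings + [7]))    # 6.5 with a 6th participant's 7

    skewed = [10, 12, 180, 20, 15]             # Example D (one extreme score)
    print(statistics.mean(skewed))             # 47.4
    print(statistics.median(skewed))           # 15, the more representative number

    computers = [1] * 90 + [2] * 30 + [3] * 2  # Example E (1=PC, 2=Mac, 3=Linux)
    print(statistics.mode(computers))          # 1, i.e., PC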
===========
Variability
===========
When we report the mean, we also report standard deviation (SD) which reflects the variability of the data set. The standard deviation tells us how spread apart a data set is. Example D had a huge outlier, so M=47.4 seconds, SD = 74.22. The standard deviation tells us that the data set had a tremendous amount of variability. If all the time scores had been exactly 15 seconds, the standard deviation would have been 0. Standard deviation is annoying to calculate by hand, but it is worth looking at an example calculation to understand how variance works.
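Here is that example calculation in Python (standard library only), walking through Example D step by step:

    # Sample standard deviation for Example D, by hand.
    times = [10, 12, 180, 20, 15]
    n = len(times)
    mean = sum(times) / n                            # 47.4
    squared_devs = [(x - mean) ** 2 for x in times]  # each score's squared distance from the mean
    variance = sum(squared_devs) / (n - 1)           # sample variance divides by n-1, not n
    sd = variance ** 0.5                             # SD is the square root of variance
    print(f"M = {mean}, SD = {sd:.2f}")              # M = 47.4, SD = 74.22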

A sample description of central tendency and variability:
The 55 survey respondents on average rated Widget W's usefulness positively: M=5.34, SD=1.51 (on a scale of 1 to 7 where 1 = "Extremely unuseful," 7 = "Extremely useful," and all other points are labeled similarly to the visual appeal question).

Know your quantitative data

Preparing data for statistical analysis involves creating a spreadsheet of numerical values. A classic spreadsheet would be one that is completely coded in numbers including things like participant ID, gender, satisfaction scores, or time on task.

In the UX world, quantitative data can come in three flavors: interval, ordinal, and nominal. It is important to recognize each type of data because the type determines which stats you can run.

  • Interval data are composed of equally distributed units. Weight is an example of interval data, where everyone weighs a certain number of pounds. Each pound is uniform, so 149 pounds is smaller than 150 pounds in exactly the same way that 150 pounds is smaller than 151 pounds. Common UX interval data include the time that a user requires to complete a task, the number of visitors that a web site has, or a satisfaction scale with every point carefully labeled.

  • Ordinal data are rank ordered but may not be in equally spaced units. Class rank is a common example of an ordinal scale, e.g., every school has students who are ranked 1st, 2nd, 3rd, etc. in their class. However, in School A, students 1, 2, and 3 may have 4.0, 3.9, and 3.4 GPAs, while in School B, the top 3 students might have 3.8, 3.0, and 2.9 GPAs. In the UX world, ordinal data may include user rankings of the five most important features in a product. Satisfaction scales with only a few labeled points are ordinal as well.

  • Nominal data measure discrete categories. The genders of your study participants are nominal data. In a spreadsheet you could code gender as a number, where 0 = male, and 1 = female. You can't do any math directly on nominal data because it wouldn't make sense. For example, you can't calculate the average of gender even if you coded it as 0s and 1s because there's no sense in reporting that, "On average, the study participants had a gender of 0.6."