Comparing Means
When we compare results from two or more groups, we often would like to know if one group “outperformed” the other(s).
When we say “outperformed,” we’re talking at the group level. Very rarely do we get results where every single member of group A did “better” (whatever that means) than every single member of group B. Instead we look to see if there was a tendency for group A to do better than group B.
This tendency can often be captured by looking at the mean result of each group. If we think the group means represent typical members of each group, then it would make sense that if group A has a tendency to outperform group B, group A’s mean would indicate a “better” performance than group B’s mean.
Comparing Means Graphically
This is where an illustration can really be helpful. Let’s look at an example:
Example: Stickgold, James, and Hobson (2000) investigated whether a person could “make up” for a lack of sleep by getting extra sleep on subsequent nights[1]. In their study, 21 volunteer participants completed a visual discrimination task. Afterward the subjects were randomly assigned to one of two groups: a treatment group (where the subjects were sleep-deprived for one night, followed by two nights of unrestricted sleep) and a control group (where the subjects were permitted unrestricted sleep on all three nights). The subjects were then re-tested on the visual discrimination task and the difference of scores (post-pre) was calculated to measure change over the three days.
Let’s look at the scores in a dotplot:
We can see that while not every person in the unrestricted group outperformed every person in the sleep-deprived group, there was a tendency for them to perform better.
To summarize the data, we might want to calculate the mean improvement for each group. We can then graph the means using a single dot for each. (Since the improvement scale is continuous we’ll use dots instead of bars. See the Graphing Data module for further discussion.)
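If you'd like to follow along in software, here is a minimal sketch in Python. The scores below are illustrative values (not a copy of the study's data file), chosen so that the two group means come out near the values discussed below:

    import numpy as np

    # Illustrative improvement scores (post - pre) for each group;
    # these are placeholder values, not the published data file.
    unrestricted = np.array([-7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6])
    deprived = np.array([-14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8])

    print(unrestricted.mean())   # mean improvement, unrestricted (control) group
    print(deprived.mean())       # mean improvement, sleep-deprived (treatment) group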
The mean of the unrestricted group was higher, conveying the tendency. However, we lost information in the process of reducing each group to a single mean. So in addition to graphing the means, we should also include some information about the variation within each group. We can do so by including “deviation bars,” which span from one standard deviation (abbreviated s.d. or SD) below the mean to one s.d. above it:
This conveys the variation within the groups.
This works well, but typically when we compare means we want to convey the variation of the group means themselves. This is measured by the standard error (see the Thinking about Variation module). Plots of intervals from one standard error (abbreviated s.e. or SEM) below the mean to one standard error above it are called error bars:
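Here is one possible sketch of how both kinds of bars could be drawn with matplotlib, reusing the illustrative scores from the earlier sketch (the layout and labels are just one choice):

    import numpy as np
    import matplotlib.pyplot as plt

    # Illustrative improvement scores (see the earlier sketch); not the published data file.
    unrestricted = np.array([-7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6])
    deprived = np.array([-14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8])

    groups = [unrestricted, deprived]
    means = [g.mean() for g in groups]
    sds = [g.std(ddof=1) for g in groups]                      # sample standard deviations
    sems = [g.std(ddof=1) / np.sqrt(len(g)) for g in groups]   # standard errors of the means

    fig, (ax_sd, ax_sem) = plt.subplots(1, 2, sharey=True)
    ax_sd.errorbar([0, 1], means, yerr=sds, fmt='o', capsize=5)     # mean +/- 1 SD
    ax_sd.set_title('Deviation bars (1 SD)')
    ax_sem.errorbar([0, 1], means, yerr=sems, fmt='o', capsize=5)   # mean +/- 1 SEM
    ax_sem.set_title('Error bars (1 SEM)')
    for ax in (ax_sd, ax_sem):
        ax.set_xticks([0, 1])
        ax.set_xticklabels(['Unrestricted', 'Sleep-deprived'])
    ax_sd.set_ylabel('Improvement (post - pre)')
    plt.show()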
Comparing Means Statistically
So how can we tell if two group means are significantly different? The difference in the two group means above (19.82 – 3.90 = 15.92) seems large, but it’s possible that a difference this large arose just from the chance assignment of subjects to the two groups. In our example, suppose the people in both groups were going to perform the same regardless of whether or not they were sleep deprived.
This is a scenario we can simulate ourselves. Take a look at our original data:
Now shuffle up the subjects, randomly re-assigning them to the sleep-deprived treatment and the unrestricted control while keeping the original group sizes. Again, assuming the treatment had no effect (and the subjects were going to get the same scores regardless of treatment), we keep the improvement scores the same:
Now let’s look at the means:
These two randomized groups have a mean difference of -10.73, which is smaller in magnitude than our original difference of 15.92.
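A single shuffle like this could be simulated as follows, again with the illustrative scores from the earlier sketch (your result will differ from run to run because the shuffle is random):

    import numpy as np

    rng = np.random.default_rng()

    # Illustrative improvement scores (see the earlier sketch); not the published data file.
    unrestricted = np.array([-7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6])
    deprived = np.array([-14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8])

    # Pool all 21 scores, shuffle them, and deal them back out into two
    # random groups of the original sizes (simulating "no treatment effect").
    pooled = np.concatenate([unrestricted, deprived])
    shuffled = rng.permutation(pooled)
    new_unrestricted = shuffled[:len(unrestricted)]
    new_deprived = shuffled[len(unrestricted):]

    # One shuffled difference in group means; it changes from run to run
    print(new_unrestricted.mean() - new_deprived.mean())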
We can keep doing this process over and over again; shuffle the groups (which simulates there being no treatment effect) and record the difference in group means. Doing this 1,000 times gives the following results:
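Here is a sketch of that full simulation with the illustrative scores. The histogram mirrors the randomization distribution described above, and the last line computes the proportion of shuffles that produced a difference at least as large as the observed one, which is the quantity interpreted next:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng()

    # Illustrative improvement scores (see the earlier sketch); not the published data file.
    unrestricted = np.array([-7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6])
    deprived = np.array([-14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8])

    observed_diff = unrestricted.mean() - deprived.mean()
    pooled = np.concatenate([unrestricted, deprived])
    n1 = len(unrestricted)

    # Shuffle the pooled scores 1,000 times, each time re-splitting them into two
    # groups of the original sizes and recording the difference in group means.
    diffs = []
    for _ in range(1000):
        shuffled = rng.permutation(pooled)
        diffs.append(shuffled[:n1].mean() - shuffled[n1:].mean())
    diffs = np.array(diffs)

    # Histogram of the 1,000 shuffled differences (the randomization distribution)
    plt.hist(diffs, bins=30)
    plt.xlabel('Difference in group means under shuffling')
    plt.show()

    # Proportion of shuffles with a difference at least as large as the observed one
    print(np.mean(diffs >= observed_diff))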
So if the experimental results really were just the product of the random assignment, we would see a difference in means as large as 15.92 less than 1% of the time. Since we actually did get 15.92 for our mean difference, we can conclude that the results probably weren’t due to chance and that our results are significant. Or, in other words, there was an effect of sleep deprivation on the visual discrimination task outcomes.
The t-Test
Randomization results are cool (no really, they are) but they are inconsistent: each of us could run our own randomizations and get slightly different results (since they’re random). Ideally our conclusions should be consistent – we should each be able to take the same data and get the same result.
To do this we’ll need to use a probability model. A probability model is a mathematical description of what (in this case) our randomization results would look like if we could do them an infinite number of times. In this case, our results would look like the randomization distribution above: mound-shaped and centered at 0. A normal distribution model (the usual bell-curve) might fit the bill here, but a better choice is the t-distribution model:
We can use the area under the t-distribution curve to approximate the proportion of times we would expect to get a certain result. In this case, the area under the t-curve to the right of our observed difference of 15.92 is about 0.0077, which means that if our experimental results were random (and there was no treatment effect), we would expect to get a mean difference of 15.92 or larger only 0.77% of the time. Since this is pretty unlikely, we can conclude that our results probably weren’t the result of random assignment alone, so there is most likely a treatment effect (an effect of the sleep-deprivation independent variable on the visual discrimination task response variable).
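To see where an area like that comes from, here is a sketch of the standard pooled two-sample calculation. The module doesn't show exactly which variant of the t-procedure it used, so treat this as an illustration, again with the illustrative scores from earlier:

    import numpy as np
    from scipy import stats

    # Illustrative improvement scores (see the earlier sketch); not the published data file.
    unrestricted = np.array([-7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6])
    deprived = np.array([-14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8])

    n1, n2 = len(unrestricted), len(deprived)
    diff = unrestricted.mean() - deprived.mean()

    # Pooled estimate of the common variance, then the standard error of the difference
    sp2 = ((n1 - 1) * unrestricted.var(ddof=1) + (n2 - 1) * deprived.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))

    t_stat = diff / se                   # how many standard errors the difference is from 0
    df = n1 + n2 - 2                     # degrees of freedom for the pooled procedure
    p_value = stats.t.sf(t_stat, df)     # area under the t-curve to the right of t_stat

    print(t_stat, p_value)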
The proportion we just calculated is called a p-value. The p-value represents the proportion of times we would expect to see our experimental results (or even stronger results) if there were no treatment effect. So if the p-value is “small” we see this as strong evidence against the no treatment effect hypothesis (and hence, strong evidence for a treatment effect). If the p-value is “not small” then the results could have plausibly happened by chance. Calculating a p-value using a t-distribution model is called a t-test.
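In practice the t-test is usually run with a library call rather than by hand. One possible sketch using scipy, again with the illustrative scores; alternative='greater' asks for the one-sided p-value, the chance of seeing a difference at least this large if there were no treatment effect:

    import numpy as np
    from scipy import stats

    # Illustrative improvement scores (see the earlier sketch); not the published data file.
    unrestricted = np.array([-7.0, 11.6, 12.1, 12.6, 14.5, 18.6, 25.2, 30.5, 34.5, 45.6])
    deprived = np.array([-14.7, -10.7, -10.7, 2.2, 2.4, 4.5, 7.2, 9.6, 10.0, 21.3, 21.8])

    # Two-sample t-test (scipy pools the variances by default, matching the sketch above)
    result = stats.ttest_ind(unrestricted, deprived, alternative='greater')
    print(result.statistic)   # the t-statistic
    print(result.pvalue)      # the p-value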
References
1. Chance BL and Rossman AJ (2005). Investigating Statistical Concepts, Applications, and Methods, 1st edition. Duxbury Press. ISBN 0495050644.