Chapter 11 Statistics and Effect Size
11.1 Recap: Inferential Statistics
Throughout this course, we have looked at how researchers design their studies (their research methods). Along the way, we have considered how statistics are used as a tool to draw conclusions from study data. Essentially, we use statistics to draw conclusions about populations (one example of a construct in our study) from data observed from samples (participants are one example of a study operation). Using statistics to make conclusions about populations based on samples is called inferential statistics.
11.2 Choosing the Right Statistic
We have emphasized two different statistical techniques in this course, the t-test and correlation. We use t-tests when we have two separate groups. Put another way, t-tests require a discrete IV with two levels. We use correlation when our IV is either dichotomous or continuous. Sometimes, we could use either technique. If we ran a simple experiment with a treatment group and a control group, and we measured a continuous DV, we could use either a t-test or a correlation.
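To see that "we could use either" claim in action, here is a minimal sketch (using `scipy` and made-up scores; the variable names are our own) showing that a two-group t-test and a correlation between a dichotomous IV and a continuous DV give the same p-value:

```python
import numpy as np
from scipy import stats

# Hypothetical DV scores for a control group (coded 0) and a treatment group (coded 1)
control = np.array([4.1, 5.0, 3.8, 4.6, 5.2, 4.4])
treatment = np.array([5.9, 6.3, 5.1, 6.7, 5.5, 6.0])

# Option 1: independent-samples t-test comparing the two group means
t_stat, p_t = stats.ttest_ind(control, treatment)

# Option 2: correlate the dichotomous IV (the 0/1 group codes) with the DV
iv = np.concatenate([np.zeros(len(control)), np.ones(len(treatment))])
dv = np.concatenate([control, treatment])
r, p_r = stats.pearsonr(iv, dv)

# Both techniques test the same effect, so the p-values agree
print(p_t, p_r)
```

This equivalence is why the choice between the two comes down to your research question (mean differences vs. prediction) rather than to the data alone.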
Here’s the big idea: The kinds of variables (the level of measurement and whether variables are continuous or discrete) are the primary consideration in the selection of the statistical technique. Our statistics do not care if we are running an experiment, a quasi-experiment, or a non-experiment. When we sit down to do statistics, the first question we should ask is, “What kinds of variables do I have?” The second question should be, “What is my research question?” t-tests are designed to help us discover mean differences; is there evidence that our two conditions were different from each other? Correlation is designed to help us discover whether two variables have a linear relationship; if we know the value of one variable, can we use it to predict the value of the other variable?
T-tests and correlation are not the only stat techniques you will encounter. However, they are good representatives of the two kinds of tests. We have tests to compare mean differences (including t-tests, one-way ANOVA, and factorial ANOVA) and tests to look for linear relationships (including correlation, simple regression, and multiple regression). ANOVA is a more complex version of a t-test because it allows us to have two, three, or more conditions in our study. Multiple regression is a more complex version of correlation because it allows us to include more than one IV in our study. Multiple regression lets us see if multiple variables, together, can be used to predict another.
Bottom line: Choose the best statistic based on your research question and the types of data that you have.
11.3 p-Values, Effect Size, and Sample Size
When we do inferential statistics (no matter if it’s a t-test or a correlation), we commonly use null hypothesis significance testing (NHST). NHST sets up two opposite hypotheses. In NHST, you are trying to show that one of the two hypotheses, the null hypothesis, is unlikely.
To do this, we calculate our statistic (e.g., t or r) and then find the value p. The interpretation of p is as follows:
Assuming there is no effect, the probability of obtaining a statistic at least as large as the one observed is p.
When we see a p-value below alpha (alpha is conventionally set at .05), we reject the null hypothesis and conclude that an effect is present. Another name for rejecting the null is statistical significance. When we see a p-value greater than alpha (.05), we retain the null hypothesis and conclude nothing. Another way of saying we retained the null is that the results were not significant (don’t use the term “insignificant,” which means something else).
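The decision rule above is mechanical enough to express as code. This is a minimal sketch; the function name `nhst_decision` is our own, and alpha is the conventional .05:

```python
# Conventional significance threshold (alpha)
ALPHA = 0.05

def nhst_decision(p_value):
    """Apply the NHST decision rule to a p-value."""
    # p below alpha -> reject the null: the result is statistically significant
    if p_value < ALPHA:
        return "reject the null (statistically significant)"
    # p at or above alpha -> retain the null: the result is not significant
    return "retain the null (not significant)"

print(nhst_decision(0.001))  # reject
print(nhst_decision(0.19))   # retain
```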
The phrase “assuming there is no effect” is how we assume the null hypothesis is true. We imagine a scenario in which there is no effect. Effect is a general name for either mean differences (e.g., how different were our conditions) or a linear relationship. When doing a t-test, the bigger the mean differences, the larger our effect. When doing a correlation, the stronger the relationship (r closer to +1 or -1), the larger our effect. In fact, r is a measure of effect size. An effect size of r = .9 is a larger effect size than r = .7.
Effect size is the same thing as the strength of the correlation. A strong correlation means that the data points of the scatterplot lie close to the line. A weak correlation means that the data points of the scatterplot are scattered far from the line. When you report a correlation, you should give the proper interpretation of its effect size (small, medium, or large).
11.4 Interpretation of \(r\) (Cohen, 1988; Note: This is not \(r^2\))
These are reference points, not firm cutoffs. For example, .28 is a medium effect size.
| Effect Size | Interpretation |
|---|---|
| \(r = \pm.10\) | Small effect |
| \(r = \pm.30\) | Medium effect |
| \(r = \pm.50\) | Large effect |
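Because these benchmarks are reference points rather than firm cutoffs, one reasonable way to interpret an observed r is to report the nearest benchmark. This is a sketch of that idea (the helper name `interpret_r` is our own invention):

```python
def interpret_r(r):
    """Interpret an observed correlation using Cohen's (1988) reference points.

    Benchmarks are treated as rough guides, not cutoffs, so we report
    whichever reference point the absolute value of r is closest to.
    """
    benchmarks = {0.10: "small", 0.30: "medium", 0.50: "large"}
    nearest = min(benchmarks, key=lambda b: abs(abs(r) - b))
    return benchmarks[nearest]

print(interpret_r(0.28))   # "medium" -- closest to the .30 reference point
print(interpret_r(-0.55))  # "large" -- the sign does not affect effect size
```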
Did you know that p-values, effect size, and sample size are all related? Let’s use some examples to illustrate (r is effect size, N is sample size, and p is the p-value we use to determine statistical significance):
- \(r\) = .30, \(N\) = 20, \(p\) = .19 (retain, not significant)
- \(r\) = .80, \(N\) = 20, \(p\) < .001 (reject, significant)
- \(r\) = .30, \(N\) = 70, \(p\) = .01 (reject, significant)
What patterns do you notice? Examples 1 and 2 have different p-values. Why is example 2’s p-value lower? It is because it has a larger effect size. That’s the first way these things are related. Everything else being equal, a larger effect size will give a lower p-value.
Second, notice the difference between Examples 1 and 3. They have the same effect size but different p-values. In fact, the first one is not significant, but the third one is. Why? The difference is the sample size. Everything else being equal, a larger sample size will give a lower p-value.
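You can verify both patterns yourself. A correlation's p-value comes from converting r and N into a t statistic with \(N-2\) degrees of freedom, a standard conversion. This sketch (using `scipy`; the function name `p_from_r` is ours) reproduces the three examples above:

```python
import math
from scipy import stats

def p_from_r(r, n):
    """Two-tailed p-value for a correlation r observed in a sample of size n."""
    # Standard conversion: t = r * sqrt(n - 2) / sqrt(1 - r^2), df = n - 2
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
    return 2 * stats.t.sf(abs(t), df=n - 2)

print(p_from_r(.30, 20))  # roughly .19 -- Example 1: retain
print(p_from_r(.80, 20))  # below .001 -- Example 2: reject (bigger effect)
print(p_from_r(.30, 70))  # roughly .01 -- Example 3: reject (bigger sample)
```

Holding N at 20, raising r from .30 to .80 drops p below .001; holding r at .30, raising N from 20 to 70 drops p below .05. Both effect size and sample size pull the p-value down.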
In all, this means that p-values are affected by both sample size and effect size. The larger the sample size, the more reliable our observation. If you did a study to measure stress in college students, which is more believable: Evidence from a sample of 10 students showing high stress on average or evidence from a sample of 100 students showing high stress on average?
And, the larger the effect size, the more evidence there is of a relationship. If college students report extremely high levels of stress on average, that provides more evidence that they are experiencing stress than if the survey showed small levels of stress.
As you read Chapter 10 in the text, think about how sample size and effect size are used in probabilistic reasoning. When we reject the null hypothesis, we conclude an effect exists. But, we are never 100% confident that we are right. We are using the data to make a conclusion, but all conclusions are probabilistic. If we’re right with our statistical conclusions, then we can say we have statistical conclusion validity. But in the real world of doing research, we have no guarantees we are correct (otherwise, we wouldn’t need to do the research!). Our best tool is always to collect more evidence and integrate all past studies to form a better understanding of the phenomena that we study.