Tuesday, April 5, 2016

How big can a dyslexic deficit be?

TL;DR: Thinking about effect size can help to plan experiments and evaluate existing studies. Bigger is not always better!

In grant applications, requests for ethics permission, article introductions or pre-registered reports, it is often expected that the researcher provides an estimate of the expected effect size. This should then be used to determine the sample size that the researcher aims to recruit. At first glance, it may seem counter-intuitive that one has to specify an effect size before one collects the data – after all, the aim of (most) experiments is to establish what this effect size could be, and/or whether it is different from zero. At second glance, specifying an expected effect size may be one of the most important issues to consider in the planning of an experiment. Underpowered studies reduce the researcher’s chance of finding an effect, even if the effect exists in the population – running an experiment with 30% power gives one a worse chance of finding the effect than tossing a coin to determine the experiment’s outcome. Significant effects in underpowered studies overestimate the true effect size, because the z-score cut-off needed to reach significance is larger than the population effect (Gelman & Carlin, 2014; Gelman & Weakliem, 2009; Schmidt, 1992).
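To make the link between assumed effect size and sample size concrete, here is a minimal sketch (in Python rather than R; the function name is my own, and it uses the standard normal approximation for a two-sided two-sample comparison, so it slightly underestimates the exact t-test requirement):

```python
from math import ceil
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample
    comparison: n = 2 * ((z_alpha + z_power) / d) ** 2."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)

# A large effect (d = 1.2) needs only about a dozen participants per group;
# a more typical d = 0.3 needs a couple of hundred.
print(n_per_group(1.2))  # 11
print(n_per_group(0.3))  # 175
```

The quadratic dependence on d is what makes optimistic effect size guesses so costly: halving the assumed effect quadruples the required sample.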

So, given that it is important to have an idea of the effect before running the study, what is the best way to come up with a reasonable a priori effect size estimate? One possibility is to run a pilot test; another is to specify the smallest effect size that would be of (theoretical and/or practical) significance; the third is to consider everything we know about effects similar to the one we are interested in, and to use that to guide our expectations. All three have benefits and drawbacks, but my aim is not to discuss these here. Instead, I will focus on the third possibility: knowing how big the most stable effects in your area of research are can constrain your expectations – it is unlikely that your “new” effect, a high-hanging fruit, is bigger than the low-hanging fruit that was plucked decades ago. Specifically, I will try to provide an effect size estimate which could be used by researchers seeking to find potential deficits underlying developmental dyslexia, and discuss the limitations of this approach in this particular instance.

Case study: Developmental dyslexia
A large proportion of the literature on developmental dyslexia focuses on finding potentially causal deficits. Such studies generally test dyslexic and control participants on a task which is not directly related to reading, such as visual attention, implicit learning, balance, etc. If dyslexic participants perform significantly worse than control participants, it is suggested that the task might tap an underlying deficit which causes dyslexia, could serve as an early marker of dyslexia, and could be targeted by treatment to improve reading skills. The abundance of such studies has led to an abundance of theories of developmental dyslexia, which has led to – well – a bit of a mess. I will not go into details here; the interested reader can scan the titles of papers from any reading- or dyslexia-related journal for some examples.

A problem arises now that we know that many published findings are Type-I errors (significant group differences in a sample that are absent in the population; see Ioannidis, 2005; Open Science Collaboration, 2015). Sifting through the vast and often contradictory literature to find out whether there is a sound case for a dyslexic deficit on each of the tasks that have been studied to date would be a mammoth task. Yet, this would be an important step towards an integrative theory of dyslexia. Making sense of the existing literature would not only constrain theory, but also prevent researchers from wasting resources on treatment studies which are unlikely to yield fruitful results.

Both to evaluate the existing literature, and to plan future studies, it would be helpful to have an idea of how big an effect can be reasonably expected. To this end, I decided to calculate an effect size estimate of the phonological awareness deficit. The phonological awareness deficit relates to the consistent finding that participants with dyslexia perform more poorly than controls on tasks involving the manipulation of phonemes and other sublexical spoken units (Melby-Lervåg, Lyster, & Hulme, 2012; Snowling, 2000; for a discussion about potential causality see Castles & Coltheart, 2004).

Studies which are not about phonological awareness often include phonological awareness tasks in their test battery when comparing dyslexic participants to controls. In these instances, phonological awareness is not the variable of interest, so there is little reason to suspect any p-hacking (e.g., including covariates, removing outliers, or adding participants until the group difference becomes significant). For this reason, the effect size estimate is based only on studies where phonological awareness was not the critical experimental task. Specifically, I looked through all papers lying around on my desk on the subject of statistical learning and dyslexia, and on magnocellular/dorsal functioning in dyslexia. I included in the analysis all papers which provided a table with group means and standard deviations on tasks of phonological awareness. This resulted in 15 papers, 6 testing children and 9 testing adults, from 5 different languages. Mostly, the phonological awareness measures were spoonerism tasks. A full list of studies with the tasks, participant characteristics, and full references to the papers can be found in the OSF repository linked below.
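From such a table of group means and standard deviations, the standardized mean difference and its sampling variance can be computed directly. A small Python sketch of the standard pooled-SD formulas (function names and the example numbers are mine, purely illustrative):

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d: difference in group means divided by the pooled SD."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var)

def var_d(d, n1, n2):
    """Approximate sampling variance of d, as used to weight studies
    in a meta-analysis."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

# Hypothetical control vs. dyslexic group on a spoonerism task:
d = cohens_d(25.0, 4.0, 30, 20.0, 4.0, 30)
print(d)  # 1.25
```

Each study's d and its variance are the ingredients that the meta-analytic model then pools.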

I generated a forest plot with the metafor package for R (Viechtbauer, 2010). I used a random-effects model with the Sidik-Jonkman method (Sidik & Jonkman, 2005; see Edit below). The R analysis script and the data file (including full references to all papers that were included) can be found here: osf.io/69a2p
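For readers curious about the mechanics of pooling, here is an illustrative Python sketch of a random-effects estimate. Note the caveats: it uses the classic DerSimonian-Laird tau-squared estimator rather than the Sidik-Jonkman or REML estimators used in the actual analysis, and the input values below are made up, not the real study data:

```python
import numpy as np

def dersimonian_laird(effects, variances):
    """Random-effects pooled estimate using the DerSimonian-Laird
    between-study variance (tau^2) estimator."""
    y = np.asarray(effects, dtype=float)
    v = np.asarray(variances, dtype=float)
    w = 1.0 / v                                   # fixed-effect weights
    fixed = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - fixed) ** 2)              # heterogeneity statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)       # between-study variance
    w_star = 1.0 / (v + tau2)                     # random-effects weights
    est = np.sum(w_star * y) / np.sum(w_star)
    se = np.sqrt(1.0 / np.sum(w_star))
    return est, tau2, (est - 1.96 * se, est + 1.96 * se)

# Three made-up study-level effect sizes with their sampling variances:
est, tau2, ci = dersimonian_laird([1.0, 1.2, 1.4], [0.04, 0.04, 0.04])
```

The random-effects weights shrink toward equality as tau-squared grows, which is why heterogeneous literatures give less weight to large studies than a fixed-effect model would.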

The results are shown in the figure below. All studies show a negative effect (i.e., worse performance on phonological awareness tasks for dyslexic compared to control groups). The effect size estimate is d = 1.24, with a relatively narrow confidence interval, 95% CI [0.95, 1.54].

Conclusions and limitations
When comparing dyslexic participants and controls on phonological awareness tasks, one can expect a large effect size of around d = 1.2. The phonological awareness deficit is the most established phenomenon in dyslexia research; given that all other effects (as far as I know) are contentious, having been replicated by some labs but not by others, it is likely that a group difference on any other task (which is not directly related to reading) will be smaller. For researchers, peer reviewers, and editors, the obtained effect size of a new experiment can be used as a red flag: if a study examines whether or not there is a group difference in, say, chewing speed between dyslexic and control groups*, and obtains an effect size of d > 1, one should raise an eyebrow. This would be a clear indicator of a magnitude (Type M) error (Gelman & Carlin, 2014). Obtaining several effect sizes of this magnitude in a multi-experiment study could even be an indicator of p-hacking or worse (Schimmack, 2012).

There are two limitations to the current blog post. The first is methodological, the second theoretical. First, I deliberately chose to include studies where phonological awareness was not of primary interest. This means that the original authors would have had no incentive to make this effect look bigger than it actually is. However, there is a potential alternative source of bias: it is possible that (at least some) researchers use phonological awareness as a manipulation check: given that this effect is well-established, a failure to find it in one’s sample could be taken to suggest that the research assistant who collected the data screwed up. This could lead to a file-drawer problem, where researchers discard all datasets where the group difference in phonological awareness was not significant. It is also plausible that researchers would simply not report having tested this variable if it did not yield a significant difference, as the critical comparison in all papers was on other cognitive tasks. If other studies have obtained non-significant differences in phonological awareness but did not report them, the current analysis presents a hugely inflated effect size estimate.

The second issue relates less to the validity of the effect size, and more to its utility: the effect size estimate is really, really big. A researcher paying attention to effect sizes will be sceptical, anyway, if they see an effect size this big. Thus, in this case, the effect size of the most stable effect in the area cannot be used to constrain our expectation of potential effect sizes of other experiments. To guide the expectation of an effect size for future studies, one might therefore want to turn to alternative approaches. Pilot testing could be useful, but its limitation is that it is often difficult to recruit enough participants to get a meaningful pilot study plus a well-powered full experiment. At this stage, it would also be difficult to define a minimum effect size that would be of interest. This could change, however, if we develop models of reading and dyslexia that make quantitative predictions. (This is unlikely to happen anytime soon, though.) Currently, the most useful approach to determining whether or not there is an effect, given limited resources and no effect size estimate, seems to be optional stopping. This can be done if it is planned in advance; in the frequentist framework, alpha-levels need to be adjusted a priori (Lakens, 2014; for Bayesian approaches see Rouder, 2014; Schönbrodt, 2015).  
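To see why the a priori alpha adjustment matters, the following Python simulation (names, seed, and parameters are my own, purely illustrative) estimates the false-positive rate when two identical groups are compared after every batch of participants and data collection stops at the first significant result:

```python
import numpy as np
from scipy import stats

def peeking_false_positive_rate(n_sims=2000, n_max=100, looks=5,
                                alpha=0.05, seed=1):
    """Simulate two groups with NO true difference, running a t-test
    after each batch and stopping at the first p < alpha."""
    rng = np.random.default_rng(seed)
    batch = n_max // looks
    false_positives = 0
    for _ in range(n_sims):
        a = rng.normal(size=n_max)
        b = rng.normal(size=n_max)
        for k in range(1, looks + 1):
            n = k * batch
            if stats.ttest_ind(a[:n], b[:n]).pvalue < alpha:
                false_positives += 1
                break
    return false_positives / n_sims

rate = peeking_false_positive_rate()
print(rate)  # well above the nominal 0.05
```

With five unadjusted looks, the realized Type-I error rate roughly doubles or triples relative to the nominal 5%, which is exactly the inflation that sequential-analysis corrections (or Bayesian stopping rules) are designed to control.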

Despite the limitations of the current analysis, I hope the readers of this blog post will be encouraged to consider issues in estimating effect sizes, and will find some useful references below.  

Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.
Gelman, A., & Carlin, J. (2014). Beyond Power Calculations Assessing Type S (Sign) and Type M (Magnitude) Errors. Perspectives on Psychological Science, 9(6), 641-651.
Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist, 97(4), 310-316.
Hartung, J., & Makambi, K. H. (2003). Reducing the number of unjustified significant results in meta-analysis. Communications in Statistics-Simulation and Computation, 32(4), 1179-1190.
IntHout, J., Ioannidis, J. P., & Borm, G. F. (2014). The Hartung-Knapp-Sidik-Jonkman method for random effects meta-analysis is straightforward and considerably outperforms the standard DerSimonian-Laird method. BMC medical research methodology, 14(1), 1.
Ioannidis, J. P. A. (2005). Why most published research findings are false. Plos Medicine, 2(8), 696-701. doi:10.1371/journal.pmed.0020124
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses. European Journal of Social Psychology, 44(7), 701-710.
Melby-Lervåg, M., Lyster, S.-A. H., & Hulme, C. (2012). Phonological skills and their role in learning to read: a meta-analytic review. Psychological Bulletin, 138(2), 322.
Rouder, J. N. (2014). Optional stopping: No problem for Bayesians. Psychonomic Bulletin & Review, 21(2), 301-308.
Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.
Schönbrodt, F. D. (2015). Sequential Hypothesis Testing with Bayes Factors: Efficiently Testing Mean Differences. Available at SSRN 2604513.
Sidik, K., & Jonkman, J. N. (2005). Simple heterogeneity variance estimation for meta-analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(2), 367-384.
Snowling, M. J. (2000). Dyslexia. Malden, MA: Blackwell Publishers.
Viechtbauer, W. (2010). Conducting meta-analyses in R with the metafor package. Journal of Statistical Software, 36(3), 1-48.  Retrieved from http://www.jstatsoft.org/v36/i03/.


* It’s not easy coming up with a counterintuitive-sounding “dyslexic deficit” which hasn’t already been proposed!

Edit 5.4.16 #1: Added a new link to data (to OSF rather than dropbox).
Edit 5.4.16 #2: I used the "SJ" method for ES estimation, which I thought referred to the one recommended by IntHout et al., referenced above. Apparently, it doesn't. Using, instead, the REML method slightly reduces the effect size, to d = 1.2, 95% CI [1.0, 1.4]. The estimate is similarly large to the one described in the text, therefore it does not change the conclusions of the analyses. Thanks to Robert Ross for pointing this out!
