Wednesday, April 19, 2017

How much statistics do psychological scientists need to know? Also, a reading list


TL;DR: As much as possible.

The question of how much statistics psychological scientists should know has been discussed numerous times on Twitter and in psychological methods groups on Facebook. The consensus seems to be that psychologists need to know some stats, but that they don’t need to be statisticians. When it comes to specifics, though, there is no consensus: some argue that knowing the basics of the tests that are useful for one’s specific field is enough, while others argue that a thorough understanding of the underlying concepts is important.

Here, I argue, based on my own experience, that a thorough understanding of statistics substantially enhances the quality of one’s work. The reason why I think statistical knowledge is really important is that the amount of knowledge you have constrains the experiments you can conduct: If one’s only tool is ANOVA, there is only a limited set of possible experiments that fit within the mould of this statistical test. *

First, a little bit about my stats background. I don’t remember much from high school maths: I think I had a lot of motivation to repress any memories of it. One of the biggest mysteries of my undergraduate course is how I even passed my statistics classes; I guess they had to scale everyone’s marks to avoid failing too many students. After these experiences, though, I have spent a lot of time learning about the tools that are the cornerstone of making sense of my experiments. As my supervisor told me: the best way to learn about statistical analyses is when you have some data that you care about. When I started my PhD, I dreaded the day when I would be asked to do anything more complex than a correlation matrix. But during the PhD, I learned, through trial and error and with a lot of guidance from experienced colleagues, to analyse data in R with linear mixed effect models and Bayes Factors. When I started my post-doc, driven by my curiosity about how it is possible to get two identical experiments with completely different results (i.e., with p-values on different sides of the significance threshold), I decided to learn more about how this stats thing actually works. My interest was further sparked by several papers I read on the topic, and by a one-day workshop given by Daniël Lakens in Rovereto, which I happened to hear about via Twitter. It culminated in my signing up for a part-time distance course, a graduate certificate in statistics, which I’m due to finish in June.
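As an illustration of the puzzle that got me started, here is a toy simulation in R (the numbers are made up and have nothing to do with any of the actual studies): with roughly 50% power, two exact copies of the same experiment will land on opposite sides of p = .05 about half the time, purely through sampling error.

    # Two identical two-group experiments with a true effect of d = 0.5 and n = 32 per group
    # (roughly 50% power). How often do their p-values end up on opposite sides of .05?
    set.seed(1)
    one_experiment <- function(n = 32, d = 0.5) {
      t.test(rnorm(n, mean = d), rnorm(n, mean = 0))$p.value
    }
    disagree <- replicate(10000, (one_experiment() < .05) != (one_experiment() < .05))
    mean(disagree)  # close to 0.5: the "identical" experiments disagree about half the time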

This learning process has taken a lot of time. Cynically speaking, I would not recommend it to early career researchers, who would probably maximise their chances of success in academia by focussing on publishing lots of papers (quantity is more important than quality, right?). If you have only a short-term contract (anywhere between 6 months and 2 years), you probably won’t have time to do both. Besides, you will never again want to do N=20 studies, and unless your department is rich, conducting a high-powered experiment might not be feasible during a short-term post-doc contract. Ideologically speaking, I would recommend this learning process to every social scientist who feels that they don’t know enough. In my experience, it’s worth putting one’s research on hold to learn about stats: moving from following a set of arbitrary conventions** to understanding why these conventions make sense is a liberating experience, not to mention the increase in the quality of your work, and the ability to design studies that maximise the chance of getting meaningful results.

Useful resources
For anyone who is reading this blog post because they would like to learn more about statistics, I have compiled a list of resources that I found useful. They contain both statistics-oriented material, and material which is more about philosophy of science. I see them as two sides of the same coin, so I don’t make a distinction between them below.

First of all: If you haven’t already done so, sign up on Twitter, and follow people who tweet about stats. Read their blogs. I am pretty sure that I learned more about stats this way than I did during my undergraduate degree. Some people I’ve learned from (it’s not a comprehensive list, but they’re all interconnected: if you follow some, you’ll find others through their discussions): Daniel Lakens (@lakens), Dorothy Bishop (@deevybee), Andrew Gelman (@StatModeling), Hilda Bastian (@hildabast), Alexander Etz (@AlxEtz), Richard Morey (@richardmorey), Deborah Mayo (@learnfromerror) and Uli Schimmack (@R_Index). If you’re on Facebook, join some psychological methods groups. I frequently lurk on PsycMAP and the Psychological Methods Discussion Group.

Below are some papers (again, a non-comprehensive and somewhat haphazard list) that I found useful. I tried to sort them in order of difficulty, but I didn’t do it in a very systematic way (also, I read some of these papers a long time ago, so I don’t remember how difficult they were). I think most papers should be readable for anyone with some experience with statistical tests. As an aside: even those who don’t know much about statistics may be aware of frequent discussions and disagreements among experts about how to do stats. The readings below contain a mixture of views, some of which I agree with more than with others. However, all of them have been useful for me in the sense that they helped me understand some new concepts.

Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7-29.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12), 1304.

Savalei, V., & Dunn, E. (2015). Is the call to abandon p-values the red herring of the replicability crisis? Frontiers in Psychology, 6, 245.

Gelman, A., & Carlin, J. (2014). Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. Perspectives on Psychological Science, 9(6), 641-651.

Gelman, A., & Weakliem, D. (2009). Of beauty, sex and power: Too little attention has been paid to the statistical challenges in estimating small effects. American Scientist, 97(4), 310-316.

Lakens, D., & Evers, E. R. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9(3), 278-292.

Cramer, A. O., van Ravenzwaaij, D., Matzke, D., Steingroever, H., Wetzels, R., Grasman, R. P., ... & Wagenmakers, E. J. (2016). Hidden multiplicity in exploratory multiway ANOVA: Prevalence and remedies. Psychonomic Bulletin & Review, 23(2), 640-647.

Luck, S. J., & Gaspelin, N. (2017). How to get statistically significant effects in any ERP experiment (and why you shouldn't). Psychophysiology, 54(1), 146-157.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34(2), 103-115.

Meehl, P. E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66(1), 195-244.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1(2), 115-129.

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Schimmack, U. (2012). The ironic effect of significant results on the credibility of multiple-study articles. Psychological Methods, 17(4), 551.

Wagenmakers, E. J., Verhagen, J., Ly, A., Bakker, M., Lee, M. D., Matzke, D., ... & Morey, R. D. (2015). A power fallacy. Behavior Research Methods, 47(4), 913-917.

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E. J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103-123.

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5), 1157-1164.

Kline, R. B. (2004). What's wrong with statistical tests--and where we go from here. Chapter 3 in Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: APA Books.

Royall, R. M. (1986). The effect of sample size on the meaning of significance tests. The American Statistician, 40(4), 313-315.

Schönbrodt, F. D., Wagenmakers, E. J., Zehetleitner, M., & Perugini, M. (2015). Sequential hypothesis testing with Bayes factors: Efficiently testing mean differences. Psychological Methods.

Westfall, J., & Yarkoni, T. (2016). Statistically controlling for confounding constructs is harder than you think. PloS One, 11(3), e0152719.

Forstmeier, W., & Schielzeth, H. (2011). Cryptic multiple hypotheses testing in linear models: overestimated effect sizes and the winner's curse. Behavioral Ecology and Sociobiology, 65(1), 47-55.

In terms of books, I recommend Dienes’ “Understanding Psychology as a Science” and McElreath’s “Statistical Rethinking”.

Then there are some online courses and videos. First, there is Daniel Lakens’ Coursera course in statistical inferences. From a more theoretical perspective, I like this MIT Probability course by John Tsitsiklis, and Meehl’s Lectures. For something more serious, you could also try a university course, like a distance education graduate certificate in statistics. Here is a very positive review of the Sheffield University course which I am currently doing. However, at least if your maths skills are as bad as mine, I would not recommend doing it on top of a full-time job.

Conclusion
Learning stats is a long and never-ending road, but if you are interested in designing strong and informative studies and in being flexible with what you can do with your data, it is a worthwhile investment. There is always more to learn, and no matter how much I learn, I continue to feel like I know less than I should. However, the learning curve is steepest at the beginning, so even a small investment of time and effort pays off quickly. This is possible even if you have only fifteen minutes to spare each day, thanks to the resources that I have tried to do justice to in my list above.

I should conclude, I think, by thanking all those who make these resources available, be it via published papers, lectures, blog posts, or discussions on social media.  

------------------------------------
* To provide an example, I will do some shameless self-advertising: in a paper that came out of my PhD, we got data which seemed uninterpretable at first. However, thanks to collaboration with a colleague with a mathematics background, Serje Robidoux, we could make sense of the data with an optimisation procedure. While an ANOVA would not have given us anything useful, the optimisation procedure allowed us to conclude that readers use different sources of information when they read aloud unfamiliar words, and that there is individual variation in the relative degree to which they rely on these different sources of information. This is one of my favourite papers I’ve published so far, but it’s only been cited by myself to date. (*He-hem!*) Here is the reference:
Schmalz, X., Marinus, E., Robidoux, S., Palethorpe, S., Castles, A., & Coltheart, M. (2014). Quantifying the reliance on different sublexical correspondences in German and English. Journal of Cognitive Psychology, 26(8), 831-852.

** Conventions such as:
“Control for multiple comparisons.”
“Don’t interpret non-significant p-values as evidence for the null hypothesis.”
“If you have a marginally significant p-value, don’t collect more data to see if the p-value drops below the threshold.”
“The p-value relates to the probability of the data, not the hypothesis.”
“For Meehl’s sake, don’t mess up the exact wording of the definition of a confidence interval!”

Tuesday, April 11, 2017

Selective blindness to null results

A while ago, to pass time on a rainy Saturday afternoon, I decided to try out some publication bias detection techniques. I picked the question of gender differences in multitasking. After all, could there be a better question for this purpose than this universally known ‘fact’? I was surprised, however, to find not two, not one, but zero peer-reviewed studies that found that women were better at multitasking than men. The next surprise came when I started sharing my discovery with friends and colleagues. In response to my “Did you know that this women-are-better-at-multitasking-thing is a myth?” I would start getting detailed explanations about possible causes of a gender difference in multitasking.

Here is another anecdote: last year, I did a conference talk where I presented a null-result. A quick explanation of the experiment: a common technique in visual word recognition research is masked priming, where participants are asked to respond to a target word, which is preceded by a very briefly presented prime. The duration of the prime is such that participants don’t consciously perceive it, but the degree and type of overlap between the prime and the target affects the response times to the target. For example, you can swap the order of letters in the prime (jugde – JUDGE), or substitute them for unrelated letters (julme – JUDGE). I wanted to see if it matters whether the transposed letters in the prime create a letter pair that does not exist in the orthography. As it turns out, it doesn’t. But despite my having presented a clear null result (with Bayes factors), several people came up to me after my talk, and asked me if I thought this effect may be a confounding variable for existing studies using this paradigm!
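For readers who have not come across the idea of quantifying evidence for the null: below is a minimal sketch in R of how a Bayes factor for a two-condition comparison can be computed, assuming the BayesFactor package is installed. The data are simulated and have nothing to do with the actual priming experiment.

    # Hypothetical response times for two priming conditions with no true difference
    library(BayesFactor)
    set.seed(1)
    rt_existing_bigram    <- rnorm(40, mean = 600, sd = 80)
    rt_nonexistent_bigram <- rnorm(40, mean = 600, sd = 80)

    bf <- ttestBF(x = rt_existing_bigram, y = rt_nonexistent_bigram)
    bf      # BF10: evidence for a difference between the conditions
    1 / bf  # BF01: evidence for the null; values well above 1 favour "no effect"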

Though I picked only two examples, such selective blindness (or deafness) to being told that an effect is not there seems to be prevalent in academia. I’m not just talking about papers citing the articles which support their hypothesis while conveniently forgetting that a handful of studies failed to find evidence for it (or citing them as providing evidence for it even when they don’t). In those cases, my guess would be that numerous factors are at play, including confirmation bias and deliberate strategies. In addition to this, however, we seem to have some mechanism that makes us preferentially perceive positive results over null results. This goes beyond the common knowledge that non-significant p-values cannot be interpreted as evidence for the null, or the (in many cases well-justified) argument that a null result may simply reflect low power or incorrect auxiliary hypotheses. The lower-level blindness that I’m talking about could reflect our expectations: surely, if someone writes a paper or gives a conference presentation, they will have some positive results to report? Or perhaps we are naturally tuned to grasp the concept of something being there more readily than the concept of something not being there.


I’ve argued previously that we should take null results more seriously. It does happen that null results are uninterpretable or uninformative, but a strong bias towards positive results at any stage of the scientific discourse will provide a skewed view of the world. If selective blindness to null results exists, we should become aware of it: we can only evaluate the evidence if we have a full picture of it.

Tuesday, November 29, 2016

On Physics Envy

I followed my partner to a workshop in plasma physics. The workshop was held in a mountain resort in Poland – getting there was an adventure worthy, perhaps, of a separate blog post.

“I’m probably the only non-physicist in the room”, I say, apologetically, at the welcome reception, when professors come up to me to introduce themselves, upon seeing an unfamiliar face in the close-knit fusion community. 
Remembering this XKCD comic, I ask my partner: “How long do you think I could pretend to be a physicist for?” I clarify: “Let’s say, you are pretending to be a psychological scientist. I ask you what you’re working on. What would you say?”
“I’d say: ‘I’m working on orthographic depth and how it affects reading processes, and also on statistical learning and its relationship to reading’.” Pretty good, that’s what I would say as well.
“So, if you’re pretending to be a physicist, what would you say if I ask you what you’re working on?”, he asks me.
“I’m, uh, trying to implement… what’s it called… controlled fusion in real time.”
The look on my partner’s face tells me that I would not do very well as an undercover agent in the physics community.

The attendees are around 50 plasma physicists, mostly greying, with about three women among the senior scientists and perhaps five female post-docs or PhD students. Halfway through the reception dinner, I am asked about my work. In ten sentences, I describe what a cognitive scientist/psycholinguist does, trying to make it sound as scientific and non-trivial as possible. Several heads turn, curious to listen to my explanation. I’m asked if I use neuroimaging techniques. No, I don’t, but a lot of my colleagues and friends do. For the questions I’m interested in, anyway, I think we know too little about the relationship between brain and mind to draw meaningful conclusions.
“It’s interesting”, says one physicist, “that you could explain to us what you are doing in ten sentences. For us, it’s much more difficult.” More people join in, admitting that they have given up trying to explain to their families what it is they are doing.
“Ondra gave me a pretty good explanation of what he is doing”, I tell them, pointing at my partner. I sense some scepticism. 

Physics envy is a term coined by psychologists (who else?), describing the inferiority complex associated with striving to be taken seriously as a scientific field. Physics is the prototypical hard science: they have long formulae, exact measurements where even the fifth decimal place matters, shiny multi-billion-dollar machines, and stereotypical crazy geniuses who would probably forget their own head if it wasn’t attached to them. Physicists don’t always make it easy for their scientific siblings (or distant cousins)*, but, admittedly, they do have a right to be smug towards psychological scientists, given the replication crisis that we’re going through. The average physicist, unsurprisingly, finds it easier to grasp concepts associated with mathematics than the average psychologist, which means that physicists have, in general, a better understanding of probability. When I tell physicists about some of the absurd statements that some psychologists have made (“Including unpublished studies in the meta-analysis erroneously biases an effect size estimate towards zero.”; “Those replicators were just too incompetent to replicate our results. It’s very difficult to create the exact conditions under which we get the effect: even we had to try it ten times before we got this significant result!”), they start literally rolling on the floor with laughter. “Why do you even want to stay in this area of research?” I was asked once, after the physicist I was talking to had wiped off the tears of laughter. The question sounded neither rhetorical nor snarky, so I gave a genuine answer: “Because there are a lot of interesting questions that can be answered, if we improve the methodology and statistics we use.”

In physics, I am told, no experiment is taken seriously until it has been replicated by an independent lab. (Unless it requires some unique equipment, in which case it can’t be replicated by an independent lab.) Negative results are still considered informative, unless they are due to experimental errors. Physicists still have issues with researchers who make their results look better than they actually are, by cherry-picking the experimental results that fit best with their hypothesis and by post-hoc parameter adjustments – after all, the publish-or-perish system looms over all of academia. However, the importance of replicating results is a lesson that physicists have learnt from their own replication crisis: in the late 1980s, there was a shitstorm about cold fusion, set off by experimental results that were of immense public interest, but theoretically implausible, difficult to replicate, and later shown to be due to sloppy research and/or scientific misconduct. (Sound familiar?)

Physicists take their research very seriously, probably to a large extent because it is often of great financial interest. Some physicists work closely with industry; even for those who don’t, their work often involves very expensive experiments. In plasma physics, a shot on the machine of the Max Planck Institute for Plasma Physics, ASDEX Upgrade, costs several thousand dollars. The number of shots required for an experiment depends on the research aims and on whether other data are available, but can go up to 50 or more. This provides a very strong motivation to make sure that one’s experiment is based on accurate calculations and sound theories which are supported by replicable studies. Furthermore, as there is only one such machine – and only a handful of similar machines across Europe – it needs to be shared with all other internal and external projects. In order to ensure that shots (and experimental time) are not wasted, any team wishing to perform an experiment needs to submit an application; the call for proposals opens only once a year. A representative of the team also needs to give a talk in front of a committee, which consists of the world’s leading experts in the area; the committee decides whether the experiment is likely to yield informative and important results. In short, it is not possible – as it is in psychology – to spend one’s research career testing ideas one has on a whim, with twenty participants, and publishing only if it actually ‘works’. One would be booed off the stage pretty quickly.

It’s easy to slip into an us-and-them mentality and feelings of superiority and inferiority. No doubt all sciences have something of importance and interest to offer to society in general. But it is also important to understand how we can maximise the utility of the research that we produce, and in this sense we can take a leaf out of physicists’ books. The importance of replication should be adopted in the psychological literature too: arguably, we should simply forget all theories that are based on non-replicable experiments. Perhaps more importantly, though, we should start taking our experiments more seriously. We need to increase our sample sizes; this conclusion seems to be gradually emerging as a consensus in psychological science. This means that our experiments will also become more expensive, both in terms of money and time. By conducting sloppy studies, we may still not lose thousands of dollars of taxpayers’ (or, even worse, investors’) money for each botched experiment, but we will waste the time of our participants, the time, nerves and resources of researchers who try to make sense of or replicate our experiments, and we will stall progress in our area of research, which has strong implications for policy makers in areas ranging from education through improving social equality, prisoners’ rehabilitation, and political/financial decision making, to mental health care.
--------------------------------------

* Seriously, though, I haven’t met a physicist who is as bad as the linked comic suggests.   


Acknowledgement: I'd like to thank Ondřej Kudláček, not only for his input into this blogpost and discussions about good science, but also for his unconditional support in my quest to learn about statistics.

Thursday, November 24, 2016

Flexible measures and meta-analyses: The case of statistical learning


On a website called flexiblemeasures.com, Malte Elson lists 156 dependent measures that have been used in the literature to quantify the performance on the Competitive Reaction Time Task. A task which has this many possible ways of calculating the outcome measure is, in a way, convenient for researchers: without correcting for multiple comparisons, the probability that the effect of interest will be significant in at least one of the measures skyrockets.

So does, of course, the probability that a significant result is a Type-I error (false positive). Such testing of multiple variables and reporting only the one which gives a significant result is an instance of p-hacking. It becomes problematic when another researcher tries to establish whether there is good evidence for an effect: if one performs a meta-analysis of the published analyses (using standardised effect sizes to be able to compare the different outcome measures across tasks), one can get a significant effect, even if each study reports only random noise and one creatively calculated outcome variable that ‘worked’.
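To put a rough number on that “skyrockets”: if we treat the measures as independent tests at α = .05 (in reality they are correlated, so this is only a ballpark figure), the chance of at least one false positive grows very quickly with the number of measures.

    # Probability of at least one significant result among k tests when every null is true
    alpha <- 0.05
    k <- c(1, 8, 20, 156)
    1 - (1 - alpha)^k
    # roughly .05, .34, .64, and essentially 1: with 156 measures, a "significant"
    # finding is all but guaranteed even if nothing is going on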

Similarly, it becomes difficult for a researcher to establish how reliable a task is. Take, for example, statistical learning. Statistical learning, the cognitive ability to derive regularities from the environment and apply them to future events, has been linked to everything from language learning to autism. The concept of statistical learning ties into many theoretically interesting and practically important questions: for example, how we learn, and what enables us to use an abstract, complex system such as language before we even learn to tie a shoelace.

Unsurprisingly, many tasks have been developed that are supposed to measure this cognitive ability of ours, and to correlate performance on these tasks to various everyday skills. Let us set aside the theoretical issues with the proposition that a statistical learning mechanism underlies the learning of statistical regularities in the environment, and concentrate on the way statistical learning is measured. This is an important question for someone who wants to study this statistical learning process: before running an experiment, one would like to be sure that the experimental task ‘works’.

As it turns out, statistical learning tasks don’t have particularly good psychometric properties: when the same individuals perform different tasks, the correlations between their performance on the different tasks are rather low, and the test-retest reliability varies across tasks, ranging from pretty good to pretty bad (Siegelman & Frost, 2015). For some tasks, performance is not above chance for the majority of participants, meaning that these tasks cannot be used as valid indicators of individual differences in statistical learning skill. This raises questions about why such a large proportion of published studies find that individual differences in statistical learning are correlated with various life skills, and it explains anecdotal evidence from myself and colleagues of conducting statistical learning experiments that just don’t work, in the sense that there is no evidence of statistical learning.* Relying on flexible outcome measures increases the researcher’s chances of finding a significant effect or correlation, which can be especially handy when the task has sub-optimal psychometric properties (low reliability and validity reduce the statistical power to find an effect if it exists). Rather than trying to improve the validity or reliability of the task, it is easier to keep analysing different variables until something becomes significant.
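As an aside on why poor reliability costs power: by the classical attenuation formula, the correlation you can observe between two measures is the true correlation multiplied by the square root of the product of their reliabilities. The numbers below are invented purely for illustration.

    # Attenuation of an observable correlation by measurement unreliability (illustrative values)
    r_true    <- 0.5  # hypothetical true correlation between the underlying constructs
    rel_task  <- 0.6  # hypothetical reliability of a statistical learning task
    rel_skill <- 0.8  # hypothetical reliability of the everyday-skill measure
    r_observed <- r_true * sqrt(rel_task * rel_skill)
    r_observed  # about 0.35: a smaller observable effect, hence lower power at a given sample size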

The first example of a statistical learning task is the Serial Reaction Time Task. Here, participants respond to a series of stimuli, which appear at different positions on a screen, by pressing buttons which correspond to the location of the stimulus. Unbeknownst to the participants, the sequence of locations repeats, and their error rates and reaction times decrease as the sequence is learned. Towards the end of the experiment, normally in the penultimate block, the order of the locations is scrambled, so that the learned sequence is disrupted; participants perform worse in this scrambled block compared to the sequential ones. Possible outcome variables (which can all be found in the literature) are:
- Comparison of accuracy in the scrambled block to the preceding block
- Comparison of accuracy in the scrambled block to the succeeding (final) block
- Comparison of accuracy in the scrambled block to the average of the preceding and succeeding blocks
- The increase in accuracy across the sequential blocks
- Comparison of reaction times in the scrambled block to the preceding block
- Comparison of reaction times in the scrambled block to the succeeding (final) block
- Comparison of reaction times in the scrambled block to the average of the preceding and succeeding blocks
- The increase in reaction times across the sequential blocks.

This can hardly compare to the 156 dependent variables from the Competitive Reaction Time Task, but it already gives the researcher increased flexibility in selectively reporting only the outcome measures that ‘worked’. As an example of how this can lead to conflicting conclusions about the presence or absence of an effect: in a recent review, we discussed the evidence for a statistical learning deficit in developmental dyslexia (Schmalz, Altoè, & Mulatti, in press). With regard to the Serial Reaction Time Task, we concluded that there was insufficient evidence to decide whether or not performance on this task differs between dyslexic participants and controls. This is partly because researchers tend to report different variables (presumably the ones that ‘worked’): as it is rare for researchers to report the average reaction times and accuracy per block (or to respond to requests for raw data), it was impossible to pick the same dependent measure from all studies (say, the difference between the scrambled block and the one that preceded it) and perform a meta-analysis on it. Today, I stumbled across a meta-analysis of the same question: without taking into account differences between experiments in the dependent variable, Lum, Ullman, and Conti-Ramsden (2013) conclude that there is evidence for a statistical learning deficit in developmental dyslexia.

As a second example: in many statistical learning tasks, participants are exposed to a stream of stimuli which contain regularities. In a subsequent test phase, the participants then need to make decisions about stimuli which either follow the same patterns or not. This task can take many shapes, from a set of letter strings generated by a so-called artificial grammar (Reber, 1967) to strings of syllables with varying transitional probabilities (Saffran, Aslin, & Newport, 1996). It should be noted that both the overall accuracy rates (i.e., the observed rates of learning) and the psychometric properties vary across different variants of this task (see, e.g., Siegelman, Bogaerts, & Frost, 2016, who specifically aimed to create a statistical learning task with good psychometric properties). In these tasks, accuracy is normally too low to allow an analysis of reaction times; nevertheless, different dependent variables can be used: overall accuracy, the accuracy for grammatical items only, or the sensitivity index (d’). And, if there is imaging data, one can apparently interpret brain patterns in the complete absence of any evidence of learning on the behavioural level.
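For concreteness, here is one way the sensitivity index could be computed from such a test phase. The counts are made up, and the log-linear correction is just one common way of avoiding infinite values when hit or false alarm rates are exactly 0 or 1.

    # d' for a test phase with 36 pattern-following and 36 pattern-violating items (made-up counts)
    hits         <- 24  # "follows the pattern" responses to items that do follow it
    false_alarms <- 15  # "follows the pattern" responses to items that do not
    n_signal <- 36
    n_noise  <- 36

    # Log-linear correction: add 0.5 to each count and 1 to each total
    hit_rate <- (hits + 0.5) / (n_signal + 1)
    fa_rate  <- (false_alarms + 0.5) / (n_noise + 1)
    d_prime  <- qnorm(hit_rate) - qnorm(fa_rate)
    d_prime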

In summary, flexible measures could be an issue for evaluating the statistical learning literature: both in finding out which tasks are more likely to ‘work’, and in determining to what extent individual differences in statistical learning may be related to everyday skills such as language or reading. This does not mean that statistical learning does not exist, or that all existing work on this topic is flawed. However, it is cause for healthy scepticism about the published results, and it raises many interesting questions and challenges for future research. Above all, the field would benefit from increased awareness of issues such as flexible measures. The pressure should be on increasing the probability of getting a significant result by maximising statistical power, i.e., decreasing the Type-II error rate (through larger sample sizes and more reliable and valid measures), rather than by using tricks that inflate the Type-I error rate.

References
Lum, J. A., Ullman, M. T., & Conti-Ramsden, G. (2013). Procedural learning is impaired in dyslexia: Evidence from a meta-analysis of serial reaction time studies. Research in Developmental Disabilities, 34(10), 3460-3476.
Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6(6), 855-863.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926-1928.
Schmalz, X., Altoè, G., & Mulatti, C. (in press). Statistical learning and dyslexia: a systematic review. Annals of Dyslexia. doi:10.1007/s11881-016-0136-0
Siegelman, N., Bogaerts, L., & Frost, R. (2016). Measuring individual differences in statistical learning: Current pitfalls and possible solutions. Behavior Research Methods, 1-15.
Siegelman, N., & Frost, R. (2015). Statistical learning as an individual ability: Theoretical perspectives and empirical evidence. Journal of Memory and Language, 81, 105-120.

------------------------------------------------------------------
* In my case, it’s probably a lack of flair, actually.

Wednesday, September 21, 2016

Some thoughts on methodological terrorism


Yesterday, I woke up to a shitstorm on Twitter, caused by an editorial-in-press by social psychologist Susan Fiske (who wrote my undergraduate Social Psych course textbook). The full text of the editorial, along with a superb commentary from Andrew Gelman, can be found here. This editorial, which launches an attack against so-called methodological terrorists who have the audacity to criticise their colleagues in public, has already inspired blog posts such as this one by Sam Schwarzkopf and this one, which broke the space-time continuum, by Dorothy Bishop.

However, I would like to write about one aspect of Susan Fiske’s commentary, which also emerged in a subsequent discussion with her at the congress of the German Society for Psychology (which, alas, I followed only on Twitter). In the editorial, Fiske states that psychological scientists at all stages of their careers are being bullied; she seems especially worried about graduate students who are leaving academia. In the subsequent discussion, as cited by Malte Elson, she specifies that more than 30 graduate students wrote to her in fear of cyberbullies.*

Being an early career researcher myself, I can try to imagine myself in a position where I would be scared of “methodological terrorists”. I can’t speak for all ECRs, but for what it’s worth, I don’t see any reason to stifle public debate. Of course, there is internet harassment which is completely inexcusable and should be punished (as covered by John Oliver in this video). But I have never seen, nor heard of, a scientific debate which dropped to the level of violence, rape or death threats.

So, what is the worst thing that can happen in academia? Someone finds a mistake in your work (or thinks they have found a mistake), and makes it public, either on the internet (Twitter, a blog), in a peer-reviewed paper, or by screaming it out at an international conference after your talk. Of course, on a personal level, it is preferable that, before or instead of making it public, the critic approaches you privately. On the other hand, the critic is not obliged to do this – as others build on your work, it is only fair that the public should be informed about a potential mistake. It is therefore, in practice, up to the critic to decide whether they will approach you first, or whether they think that a public approach would be more effective in getting an error fixed. Similarly, it would be nice of the critic to adopt a kind, constructive tone. It would probably make the experience more pleasant (or less unpleasant) for both parties, and be more effective in convincing the person who is criticised to think about the critic’s point and to decide rationally whether or not it is valid. But again, the critic is not obliged to be nice – someone who stands up at a conference to publicly destroy an early career researcher’s work is an a-hole, but not a criminal. (Though I can even imagine scenarios where such behaviour would be justified, for example, if the criticised researcher has been unresponsive to private expressions of concern about their work.)

As an early career researcher, it can be very daunting to face an audience of potential critics. It is even worse if someone accuses you of having done something wrong (whether it’s a methodological shortcoming of your experiment, or a possibly intentional error in your analysis script). I have received some criticism throughout my five-year academic career; some of it was not fair, though most of it was (even though I would sometimes deny it in the initial stages). Furthermore, there are cultural differences in how researchers express their concern with some aspect of somebody’s work: in English-speaking countries (Australia, the UK, the US), much softer words seem to be used for criticising than in many mainland European countries (Italy, Germany). When I spent six months in Germany during my PhD, I was shocked at some of the conversations I overheard between other PhD students and their supervisors – being used to the Australian style of conversation, it seemed to me that German supervisors could be straight-out mean. Someone who is used to being told about a mistake with the phrase “This is good, but you might want to consider…” is likely to be shocked and offended if they go to an international conference and someone tells them straight out: “This is wrong.” This could lead to some people feeling personally attacked due to what is more or less a cultural misunderstanding.

In any event, it is inevitable that one makes mistakes from time to time, and that someone finds something to criticise about your work. Indeed, this is how science progresses. We make mistakes, and we learn from them. We learn from others’ mistakes. Learning is what science is all about. Someone who doesn’t want to learn cannot be a scientist. And if nobody ever tells you that you made a mistake, you cannot learn from it. Yes, criticism stings, and some people are more sensitive than others. However, responding to criticism in a constructive way, and being aware of potential cultural differences in how criticism is conveyed, is part of the job description of an academic. Somebody who reacts explosively or defensively to criticism cannot be a scientist just like someone who is afraid of water cannot be an Olympic swimmer.

---------------------------
In response to this, Daniël Lakens wrote, in a series of tweets (I can’t phrase it better): “100+ students told me they think of quitting because science is no longer about science. [… They are the] ones you want to stay in science, because they are not afraid, they know what to do, they just doubt if a career in science is worth it.”

Monday, June 27, 2016

What happens when you try to publish a failure to replicate in 2015/2016

Anyone who has talked to me in the last year would have heard me complain about my 8-times-failure-to-replicate which nobody wants to publish. The preprint, raw data and analysis scripts are available here, so anyone can judge for themselves if they think the rejections to date are justified. In fact, if anyone can show me that my conclusions are wrong – that the data are either inconclusive, or that they actually support an opposite view – I will buy them a bottle of drink of their choice*. So far, this has not happened.

I promise to stop complaining about this after I publish this blog post. I think it is important to be aware of the current situation, but I am, by now, just getting tired of debates which go in circles (and I’m sure many others feel the same way). Therefore, I pledge that from now on I will stop writing whining blog posts, and I will only write happy ones – which have at least one constructive comment or suggestion about how we could improve things.

So, here goes my last ever complaining post. I should stress that the sentiments and opinions I describe here are entirely my own; although I’ve had lots of input from my wonderful co-authors in preparing the manuscript of my unfortunate paper, they would probably not agree with many of the things I am writing here.

Why is it important to publish failures to replicate?

People who haven’t been convinced by the arguments put forward to date will not be convinced by a puny little blogpost. In fact, they will probably not even read this. Therefore, I will not go into details about why it is important to publish failures to replicate. Suffice it to say that this is not my opinion – it’s a truism. If we combine a low average experimental power with selective publishing of positive results, we – to use Daniel Lakens’ words – get “a literature that is about as representative of real science as porn movies are representative of real sex”. We get over-inflated effect sizes across experiments, even if an effect is non-existent; or, in the words of Michael Inzlicht, “meta-analyses are fucked”.
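A toy simulation in R makes the point concrete (all numbers are invented): if the true effect is zero, experiments are small, and only ‘significant’ results get published, the published record – and any meta-analysis of it – will show a sizeable effect.

    # True effect = 0, many small experiments (n = 20 per group), but only p < .05 gets "published"
    set.seed(1)
    simulate_experiment <- function(n = 20) {
      x <- rnorm(n)
      y <- rnorm(n)
      d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2)  # Cohen's d
      c(d = d, p = t.test(x, y)$p.value)
    }
    results   <- replicate(5000, simulate_experiment())
    published <- results["d", results["p", ] < .05]
    mean(abs(published))  # the average "published" effect size is substantial, even though the truth is 0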

Our study

The interested reader can look up further details of our study in the OSF folder linked above (https://osf.io/myfk3/). The study is about the Psycholinguistic Grain Size Theory (Ziegler & Goswami, 2005)**. If you type the name of this theory into Google – or some other popular search terms, such as “dyslexia theory”, “reading across languages”, or “reading development theory” – you will see this paper on the first page. It has 1650 citations at the time of writing this blogpost. In other words, this theory is huge. People rely on it to interpret their data, and to guide their experimental designs and theories on diverse topics in reading and dyslexia.

The evidence for the Psycholinguistic Grain Size Theory is summarised in the preprint linked above; the reader can decide for themselves whether they find it convincing. During my PhD, I decided to do some follow-up experiments on the body-N effect (Ziegler & Perry, 1998; Ziegler et al., 2001; Ziegler et al., 2003). Why? Not because I wanted to build my career on the ruins of someone else’s work (which is apparently what some people think of replicators), but because I found the theory genuinely interesting, and I wanted to do further work to specify the locus of this effect. So I ran study after study after study – blaming myself for the messy results – until I realised: I had conducted eight experiments, and the effect just wasn’t there. So I conducted a meta-analysis of all of our data, plus an unpublished study by a colleague with whom I’d talked about this effect, wrote it up and submitted it.

Surely, in our day and age, journals should welcome null-results as much as positive results? And any rejections would be based on flaws in the study?

Well, here is what happened:

Submission 1: Relatively high-impact journal for cognitive psychology

Here is a section directly copied-and-pasted from a review:

“Although the paper is well-written and the analyses are quite substantial, I find the whole approach rather irritating for the following reasons:

1. Typically meta-analyses are done one [sic] published data that meet the standards for publishing in international peer-reviewed journals. In the present analyses, the only two published studies that reported significant effects of body-N and were published in Cognition and Psychological Science were excluded (because the trial-by-trial data were no longer available) and the authors focus on a bunch of unpublished studies from a dissertation and a colleague who is not even an author of the present paper. There is no way of knowing whether these unpublished experiments meet the standards to be published in high-quality journals.”

Of course, I picked the most extreme statement. Other reviewers had some cogent points – however, nothing that would compromise the conclusions. The paper was rejected because “the manuscript is probably too far from what we are looking for”.

Submission 2: Very high-impact psychology journal

As a very ambitious second plan, we submitted the paper to one of the top journals in psychology. It’s a journal which “publishes evaluative and integrative research reviews and interpretations of issues in scientific psychology. Both qualitative (narrative) and quantitative (meta-analytic) reviews will be considered, depending on the nature of the database under consideration for review” (from their website). They have even announced a special issue on Replicability and Reproducibility, because their “primary mission […] is to contribute a cohesive, authoritative, theory-based, and complete synthesis of scientific evidence in the field of psychology” (again, from their website). In fact, they published the original theoretical paper, so surely they would at least consider a paper which argues against this theory? As in, send it out for review? And reject it based on flaws, rather than the standard explanation of it being uninteresting to a broad audience? Given that they published the original theoretical article, and all? Right?

Wrong, on all points.

Submission 3: A well-respected, but not huge impact factor journal in cognitive psychology

I agreed to submit this paper to a non-open-access journal again, but only under the condition that at least one of my co-authors would have a bet with me: if it got rejected, I would get a bottle of good whiskey. Spoiler alert: I am now the proud owner of a 10-year aged bottle of Bushmills.

To be fair, this round of reviews brought some cogent and interesting comments. The first reviewer provided some insightful remarks, but their main concern was that “The main message here seems to be a negative one.” Furthermore, the reviewer “found the theoretical rationale [for the choice of paradigm] to be rather simplistic”. Your words, not mine! However, for a failure to replicate, this is irrelevant. As many researchers rely on what may or may not be a simplistic theoretical framework which is based on the original studies, we need to know whether the evidence put forward by the original studies is reliable.

I could not quite make sense of all of the second reviewer’s comments, but somehow they argued that the paper was “overkill”. (It is very long and dense, to be fair, but I do have a lot of data to analyse. I suspect most readers will skip from the introduction to the discussion anyway – but anyone who wants the juicy details of the analyses should have easy access to them.)

Next step: Open-access journal

I like the idea of open-access journals. However, when I submitted previous versions of the manuscript, I was somewhat swayed by the argument that going open access would decrease the visibility and credibility of the paper. This is probably true, but without any doubt, the next step will be to submit the paper to an open-access journal – preferably one with open review. I would like to see a reviewer calling a paper “irritating” in a public forum.

At least in this case, traditional journals have shown – well, let’s just say that we still have a long way to go in improving replicability in the psychological sciences. For now, I have uploaded a preprint of the paper on OSF and on ResearchGate. On ResearchGate, the article has over 200 views, suggesting that there is some interest in this theory; the finding that the key study is not replicable seems relevant to researchers. Nevertheless, I wonder if the failure to provide support for this theory will ever gain as much visibility as the original study – how many researchers will continue to put their trust in a theory that they might view more sceptically if they knew the key study is not as robust as it seems?

In the meantime, my offer of a bottle of beverage for anyone who can show that the analyses or data are fundamentally flawed, still stands.

-------------------------------------------------------


* Beer, wine, whiskey, brandy: You name it. Limited only by my post-doc budget.
** The full references of all papers cited throughout the blogpost can be found in the preprint of our paper.

-----------------------------------------

Edit 30/6: Thanks all for the comments so far, I'll have a closer look at how I can implement your helpful suggestions when I get the chance!

Please note that I will delete comments from spammers and trolls. If you feel the urge to threaten physical violence, please see your local counsellor or psychologist.