Tuesday, November 29, 2016

On Physics Envy

I followed my partner to a workshop in plasma physics. The workshop was held in a mountain resort in Poland – getting there was an adventure worthy, perhaps, of a separate blog post.

“I’m probably the only non-physicist in the room”, I say, apologetically, at the welcome reception, when professors come up to me to introduce themselves, upon seeing an unfamiliar face in the close-knit fusion community. 
Remembering this XKCD comic, I ask my partner: “How long do you think I could pretend to be a physicist for?” I clarify: “Let’s say, you are pretending to be a psychological scientist. I ask you what you’re working on. What would you say?”
“I’d say: ‘I’m working on orthographic depth and how it affects reading processes, and also on statistical learning and its relationship to reading’.” Pretty good, that’s what I would say as well.
“So, if you’re pretending to be a physicist, what would you say if I ask you what you’re working on?”, he asks me.
“I’m, uh, trying to implement… what’s it called… controlled fusion in real time.”
The look on my partner’s face tells me that I would not do very well as an undercover agent in the physics community.

The attendees are around 50 plasma physicists, mostly greying, with about three women among the senior scientists and perhaps five female post-docs or PhD students. Halfway through the reception dinner, I am asked about my work. In ten sentences, I describe what a cognitive scientist/psycholinguist does, trying to make it sound as scientific and non-trivial as possible. Several heads turn, curious to listen to my explanation. I’m asked if I use neuroimaging techniques. No, I don’t, but a lot of my colleagues and friends do. For the questions I’m interested in, anyway, I think we know too little about the relationship between brain and mind to draw meaningful conclusions.
“It’s interesting”, says one physicist, “that you could explain to us what you are doing in ten sentences. For us, it’s much more difficult.” More people join in, admitting that they have given up trying to explain to their families what it is they are doing.
“Ondra gave me a pretty good explanation of what he is doing”, I tell them, pointing at my partner. I sense some scepticism. 

Physics envy is a term coined by psychologists (who else?), describing the inferiority complex associated with striving to be taken seriously as a scientific field. Physics is the prototypical hard science: they have long formulae, exact measurements where even the fifth decimal place matters, shiny multi-billion-dollar machines, and stereotypical crazy geniuses who would probably forget their own head if it wasn’t attached to them. Physicists don’t always make it easy for their scientific siblings (or distant cousins)* but, admittedly, they do have a right to be smug towards psychological scientists, given the replication crisis that we’re going through. The average physicist, unsurprisingly, finds it easier to grasp concepts associated with mathematics than the average psychologist, which means that physicists have, in general, a better understanding of probability. When I tell physicists about some of the absurd statements that some psychologists have made (“Including unpublished studies in the meta-analysis erroneously biases an effect size estimate towards zero.”; “Those replicators were just too incompetent to replicate our results. It’s very difficult to create the exact conditions under which we get the effect: even we had to try it ten times before we got this significant result!”), they practically roll on the floor with laughter. “Why do you even want to stay in this area of research?” I was asked once, after the physicist I was talking to had wiped away the tears of laughter. The question sounded neither rhetorical nor snarky, so I gave a genuine answer: “Because there are a lot of interesting questions that can be answered, if we improve the methodology and statistics we use.”

In physics, I am told, no experiment is taken seriously until it has been replicated by an independent lab. (Unless it requires some unique equipment, in which case it can't be replicated by an independent lab.) Negative results are still considered informative, unless they are due to experimental errors. Physicists still have issues with researchers who make their results look better than they actually are by cherry-picking the experimental results that fit best with their hypothesis and by making post-hoc parameter adjustments – after all, the publish-or-perish system looms over all of academia. However, the importance of replicating results is a lesson that physicists have learnt from their own replication crisis: in the late 1980s, there was a shitstorm about cold fusion, set off by experimental results that were of immense public interest, but theoretically implausible, difficult to replicate, and later shown to be due to sloppy research and/or scientific misconduct. (Sound familiar?)

Physicists take their research very seriously, probably to a large extent because it is often of great financial interest. There are physicists who work closely with industry; even for those who don’t, their work often involves very expensive experiments. In plasma physics, a single shot on the machine of the Max Planck Institute for Plasma Physics, ASDEX-Upgrade, costs several thousand dollars. The number of shots required for an experiment depends on the research aims, and on whether other data are available, but can go up to 50 or more. This provides a very strong motivation to make sure that one’s experiment is based on accurate calculations and sound theories which are supported by replicable studies. Furthermore, as there is only one such machine – and only a handful of similar machines in all of Europe – it needs to be shared with all other internal and external projects. To ensure that shots (and experimental time) are not wasted, any team wishing to perform an experiment needs to submit an application; the call for proposals opens only once a year. A representative of the team also needs to give a talk in front of a committee, which consists of the world’s leading experts in the area. The committee decides whether the experiment is likely to yield informative and important results. In short, it is not possible – as in psychology – to spend one’s research career testing ideas one has on a whim, with twenty participants, and publishing only if the experiment actually ‘works’. One would be booed off the stage pretty quickly.

It’s easy to get into an us-and-them mentality and feelings of superiority and inferiority. No doubt all sciences have something of importance and of interest to offer to society in general. But it is also important to understand how we can maximise the utility of the research that we produce, and in this sense we can take a leaf out of the physicists’ book. The importance of replication should also be taken seriously in the psychological literature: arguably, we should simply forget all theories that are based on non-replicable experiments. Perhaps more importantly, though, we should start taking our experiments more seriously. We need to increase our sample sizes; this conclusion seems to be gradually emerging as a consensus in psychological science. This means that our experiments, too, will become more expensive, both in terms of money and time. By conducting sloppy studies, we may still not lose thousands of dollars of taxpayers’ (or, even worse, investors’) money for each botched experiment, but we do waste the time of our participants and the time, nerves and resources of researchers who try to make sense of or replicate our experiments, and we stall progress in our area of research – which has strong implications for policy makers in areas ranging from education through social equality, prisoners’ rehabilitation, and political/financial decision making, to mental health care.
--------------------------------------

* Seriously, though, I haven’t met a physicist who is as bad as the linked comic suggests.   


Acknowledgement: I'd like to thank Ondřej Kudláček, not only for his input into this blogpost and discussions about good science, but also for his unconditional support in my quest to learn about statistics.

Thursday, November 24, 2016

Flexible measures and meta-analyses: The case of statistical learning


On a website called flexiblemeasures.com, Malte Elson lists 156 dependent measures that have been used in the literature to quantify performance on the Competitive Reaction Time Task. A task which has this many possible ways of calculating the outcome measure is, in a way, convenient for researchers: without correcting for multiple comparisons, the probability that the effect of interest will be significant in at least one of the measures skyrockets.
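Just how quickly it skyrockets can be seen with a back-of-the-envelope calculation. The sketch below assumes, unrealistically, that there is no true effect and that the measures are independent (in reality they are correlated, so treat the numbers as an upper bound); even a handful of partly redundant measures inflates the rate substantially.

```python
# Back-of-the-envelope: chance of at least one p < .05 among k outcome
# measures, assuming no true effect and (unrealistically) independent measures.
alpha = 0.05

for k in (1, 3, 8, 156):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:>3} measures: P(at least one 'significant') = {p_at_least_one:.3f}")
```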

The probability that a given significant result is a Type-I error (false positive) rises accordingly. Testing multiple variables and reporting only the one which gives a significant result is an instance of p-hacking. It becomes problematic when another researcher tries to establish whether there is good evidence for an effect: if one performs a meta-analysis of the published analyses (using standardised effect sizes to be able to compare the different outcome measures across tasks), one can get a significant effect, even if the underlying data are pure noise and each study reports only the one creatively calculated outcome variable that ‘worked’.
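To make that intuition concrete, here is a small Monte Carlo sketch. It is not a model of any particular study: the numbers of studies, measures, and participants are arbitrary, every simulated study has a true effect of exactly zero, and ‘publication’ means reporting only the largest standardised effect among the available measures. A naive meta-analytic average of the published effects nevertheless ends up well above zero.

```python
import numpy as np

rng = np.random.default_rng(1)

n_studies, n_measures, n_per_group = 50, 8, 20  # arbitrary, hypothetical numbers
published = []

for _ in range(n_studies):
    # Two groups, no true difference, several noisy outcome measures per study.
    a = rng.normal(size=(n_per_group, n_measures))
    b = rng.normal(size=(n_per_group, n_measures))
    # Standardised effect size (Cohen's d) for each measure.
    pooled_sd = np.sqrt((a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1)) / 2)
    d = (a.mean(axis=0) - b.mean(axis=0)) / pooled_sd
    # Selective reporting: only the 'best' measure makes it into the paper.
    published.append(d.max())

print(f"True effect: 0.00; naive meta-analytic mean of published d: "
      f"{np.mean(published):.2f}")
```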

Similarly, it becomes difficult for a researcher to establish how reliable a task is. Take, for example, statistical learning. Statistical learning, the cognitive ability to derive regularities from the environment and apply them to future events, has been linked to everything from language learning to autism. The concept of statistical learning ties into many theoretically interesting and practically important questions, for example, about how we learn, and about what enables us to use an abstract, complex system such as language before we even learn to tie a shoelace.

Unsurprisingly, many tasks have been developed that are supposed to measure this cognitive ability of ours, and performance on these tasks has been correlated with various everyday skills. Let us set aside the theoretical issues with the proposition that a statistical learning mechanism underlies the learning of statistical regularities in the environment, and concentrate on the way statistical learning is measured. This is an important question for anyone who wants to study the statistical learning process: before running an experiment, one would like to be sure that the experimental task ‘works’.

As it turns out, statistical learning tasks don’t have particularly good psychometric properties: when the same individuals perform different tasks, the correlations between their performance on the different tasks are rather low, and the test-retest reliability varies across tasks, ranging from pretty good to pretty bad (Siegelman & Frost, 2015). For some tasks, performance is not above chance for the majority of participants, meaning that these tasks cannot be used as valid indicators of individual differences in statistical learning skill. This raises questions about why such a large proportion of published studies find that individual differences in statistical learning are correlated with various life skills, and it explains anecdotal evidence from colleagues and myself of conducting statistical learning experiments that just don’t work, in the sense that there is no evidence of statistical learning.* Relying on flexible outcome measures increases the researcher’s chances of finding a significant effect or correlation, which can be especially handy when the task has sub-optimal psychometric properties (low reliability and validity reduce the statistical power to find an effect if it exists). Rather than trying to improve the validity or reliability of the task, it is easier to continue analysing different variables until something becomes significant.
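The link between reliability and power can be made explicit with the classical attenuation formula: the correlation one can observe between two measures is the true correlation scaled down by the square root of the product of their reliabilities. The sketch below uses made-up numbers purely for illustration (a true correlation of .4, measured with reliabilities of .5 and .8) and a standard Fisher-z approximation for the sample size needed to detect a correlation.

```python
import math
from statistics import NormalDist

def attenuated(r_true, rel_x, rel_y):
    """Observable correlation, given the true correlation and the
    reliabilities of the two measures (classical attenuation formula)."""
    return r_true * math.sqrt(rel_x * rel_y)

def n_for_power(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation r (two-tailed),
    using the Fisher z approximation."""
    z_r = 0.5 * math.log((1 + r) / (1 - r))
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return math.ceil(((z_a + z_b) / z_r) ** 2 + 3)

# Made-up numbers for illustration: a true correlation of .4 between
# statistical learning and reading, measured with reliabilities of .5 and .8.
r_true, rel_task, rel_reading = 0.40, 0.50, 0.80
r_obs = attenuated(r_true, rel_task, rel_reading)
print(f"Observable correlation: {r_obs:.2f}")
print(f"N for 80% power, perfect measures: {n_for_power(r_true)}")
print(f"N for 80% power, actual measures:  {n_for_power(r_obs)}")
```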

The first example of a statistical learning task is the Serial Reaction Time Task. Here, the participants respond to a series of stimuli, which appear at different positions on a screen. The participant presses buttons which correspond to the location of the stimulus. Unbeknown to the participant, the sequence of locations repeats, so the participants’ error rates and reaction times decrease. Towards the end of the experiment, normally in the penultimate block, the order of the locations is scrambled, meaning that the learned sequence is disrupted. Participants perform worse in this scrambled block than in the sequential ones. Possible outcome variables (which can all be found in the literature, and which are sketched in code after the list) are:
- Comparison of accuracy in the scrambled block to the preceding block
- Comparison of accuracy in the scrambled block to the succeeding (final) block
- Comparison of accuracy in the scrambled block to the average of the preceding and succeeding blocks
- The increase in accuracy across the sequential blocks
- Comparison of reaction times in the scrambled block to the preceding block
- Comparison of reaction times in the scrambled block to the succeeding (final) block
- Comparison of reaction times in the scrambled block to the average of the preceding and succeeding blocks
- The increase in reaction times across the sequential blocks.
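
To make the flexibility tangible, here is a schematic sketch of how the reaction-time contrasts in the list could be computed from per-block means. The block structure and the numbers are invented for illustration and do not come from any particular study.

```python
# Hypothetical per-block mean reaction times (ms) for one participant:
# blocks 1-7 follow the repeating sequence, block 8 is scrambled (the
# penultimate block), block 9 is sequential again. Numbers are made up.
rt = {1: 520, 2: 505, 3: 492, 4: 488, 5: 480, 6: 476, 7: 470, 8: 510, 9: 474}
scrambled = 8

outcomes = {
    "scrambled vs. preceding block": rt[scrambled] - rt[scrambled - 1],
    "scrambled vs. succeeding block": rt[scrambled] - rt[scrambled + 1],
    "scrambled vs. mean of neighbours":
        rt[scrambled] - (rt[scrambled - 1] + rt[scrambled + 1]) / 2,
    "RT decrease across sequential blocks": rt[1] - rt[7],
}

for name, value in outcomes.items():
    print(f"{name}: {value} ms")

# The same four contrasts can be computed on accuracy, giving the eight
# dependent variables listed above -- each a defensible-sounding choice.
```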

This can hardly compare to the 156 dependent variables from the Competitive Reaction Time Task, but it already gives the researcher increased flexibility to selectively report only the outcome measures that ‘worked’. As an example of how this can lead to conflicting conclusions about the presence or absence of an effect: in a recent review, we discussed the evidence for a statistical learning deficit in developmental dyslexia (Schmalz, Altoè, & Mulatti, in press). With regard to the Serial Reaction Time Task, we concluded that there was insufficient evidence to decide whether or not performance on this task differs between dyslexic participants and controls. Partly, this is because researchers tend to report different variables (presumably the ones that ‘worked’): as it is rare for researchers to report the average reaction times and accuracy per block (or to respond to requests for raw data), it was impossible to pick the same dependent measure from all studies (say, the difference between the scrambled block and the one that preceded it) and perform a meta-analysis on it. Today, I stumbled across a meta-analysis on the same question: without taking into account differences between experiments in the dependent variable, Lum, Ullman, and Conti-Ramsden (2013) conclude that there is evidence for a statistical learning deficit in developmental dyslexia.

As a second example: in many statistical learning tasks, participants are exposed to a stream of stimuli which contain regularities. In a subsequent test phase, the participants then need to make decisions about stimuli which either follow the same patterns or do not. This task can take many shapes, from a set of letter strings generated by a so-called artificial grammar (Reber, 1967) to strings of syllables with varying transitional probabilities (Saffran, Aslin, & Newport, 1996). It should be noted that both the overall accuracy rates (i.e., the observed rates of learning) and the psychometric properties vary across different variants of this task (see, e.g., Siegelman, Bogaerts, & Frost, 2016, who specifically aimed to create a statistical learning task with good psychometric properties). In these tasks, accuracy is normally too low to allow an analysis of reaction times; nevertheless, different dependent variables can be used: overall accuracy, the accuracy on grammatical items only, or the sensitivity index (d’). And, if there is imaging data, one can apparently interpret brain patterns in the complete absence of any evidence of learning on the behavioural level.
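For concreteness, here is a minimal sketch of how those three dependent variables could be computed from a test phase in which participants endorse or reject items. The response table is made up, and the d’ calculation uses the standard normal quantile transform with a crude correction to keep extreme proportions finite.

```python
from statistics import NormalDist

# Made-up test phase: each tuple is (item is grammatical?, endorsed as grammatical?).
responses = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]

def adjust(p, n):
    """Keep proportions away from 0 and 1 so that d' stays finite
    (a common, if crude, correction)."""
    return min(max(p, 0.5 / n), 1 - 0.5 / n)

grammatical = [r for r in responses if r[0]]
ungrammatical = [r for r in responses if not r[0]]

# 1. Overall accuracy: endorsing grammatical items and rejecting ungrammatical ones.
overall_acc = sum(g == e for g, e in responses) / len(responses)
# 2. Accuracy on grammatical items only (hit rate).
hit_rate = sum(e for _, e in grammatical) / len(grammatical)
# 3. Sensitivity index d' = z(hit rate) - z(false-alarm rate).
fa_rate = sum(e for _, e in ungrammatical) / len(ungrammatical)
z = NormalDist().inv_cdf
d_prime = z(adjust(hit_rate, len(grammatical))) - z(adjust(fa_rate, len(ungrammatical)))

print(f"overall accuracy = {overall_acc:.2f}, "
      f"grammatical-only accuracy = {hit_rate:.2f}, d' = {d_prime:.2f}")
```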

In summary, flexible measures could be an issue for evaluating the statistical learning literature: both for finding out which tasks are more likely to ‘work’, and for determining to what extent individual differences in statistical learning may be related to everyday skills such as language or reading. This does not mean that statistical learning does not exist, or that all existing work on this topic is flawed. However, it gives cause for healthy scepticism about the published results, and raises many interesting questions and challenges for future research. Above all, the field would benefit from increased awareness of issues such as flexible measures, which would shift the pressure towards increasing the probability of getting a significant result by maximising statistical power, i.e., decreasing the Type-II error rate (through larger sample sizes and more reliable and valid measures), rather than by using tricks that inflate the Type-I error rate.

References
Lum, J. A., Ullman, M. T., & Conti-Ramsden, G. (2013). Procedural learning is impaired in dyslexia: Evidence from a meta-analysis of serial reaction time studies. Research in Developmental Disabilities, 34(10), 3460-3476.
Reber, A. S. (1967). Implicit learning of artificial grammars. Journal of Verbal Learning and Verbal Behavior, 6(6), 855-863.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926-1928.
Schmalz, X., Altoè, G., & Mulatti, C. (in press). Statistical learning and dyslexia: a systematic review. Annals of Dyslexia. doi:10.1007/s11881-016-0136-0
Siegelman, N., Bogaerts, L., & Frost, R. (2016). Measuring individual differences in statistical learning: Current pitfalls and possible solutions. Behavior Research Methods, 1-15.
Siegelman, N., & Frost, R. (2015). Statistical learning as an individual ability: Theoretical perspectives and empirical evidence. Journal of Memory and Language, 81, 105-120.

------------------------------------------------------------------
* In my case, it’s probably a lack of flair, actually.