Tuesday, May 8, 2018

Some thoughts on trivial results, or: Yet another argument for Registered Reports


A senior colleague once joked: “If I read about a new result, my reaction is either: ‘That’s trivial!’, or ‘I don’t believe that!’”

These types of reactions are pretty common when presenting the results of a new study (in my experience, anyway). In peer review, the former in particular can be a reason for rejecting a paper. In conversations with colleagues, one sometimes gets told, jokingly: “Well, I could have told you in advance that you’d get this result, you didn’t have to run the study!” This can be quite discouraging, especially if, while you were planning your study, it did not seem at all obvious to you that you would get the result you obtained.

In many cases, perhaps, the outcome of a study really is obvious, especially to someone who has been in the field for much longer than you have. For some effects, there might be huge file drawers, such that it’s an open secret that an experimental paradigm which seems perfectly reasonable at first sight doesn’t actually work. In this case, it would be very helpful to hear that it’s probably not the best idea to invest time and resources in this paradigm. However, it would be even more helpful to hear about this before you plan and execute your study.

One also needs to take hindsight bias into account. If you hear the results first, it’s easy to come up with an explanation for the exact pattern that was obtained. Thus, a result that seems trivial in hindsight may not have been so easy to predict a priori. There is also often disagreement about the triviality of an outcome: It's not unheard of (not only in my experience) that Reviewer 1 claims that the paper shouldn’t be published because the result is trivial, while Reviewer 2 recommends rejection because (s)he doesn’t believe this result.

Registered reports should strongly reduce the number of times that people tell you that your results are trivial. If you submit a plan to do an experiment that really is trivial, the reviewers should point this out while evaluating the Stage 1 manuscript. If they have a good point, this will save you from collecting data for a study that many people might not find interesting. And if the reviewers agree that the research question is novel and interesting, they cannot later do a backflip and say that it’s trivial after having seen the results.

So, this is another advantage of registered reports. And, if I’m brave enough, I’ll change the way I tell (senior) colleagues about my work in informal conversations, from: “I did experiment X, and I got result Y” to “I did experiment X. What do you think happened?”

Tuesday, March 27, 2018

Can peer review be objective in “soft” sciences?


Love it or hate it – peer review is likely to elicit strong emotional reactions from researchers, at least at those times when they receive an editorial letter with a set of reviews. Reviewer 1 is mildly positive, Reviewer 2 says that the paper would be tolerable if you rewrote the title, abstract, introduction, methods, results and discussion sections, and Reviewer 3 seems to have missed the point of the paper altogether.
 
This is not going to be an anti-peer-review blog post. In general, I like peer review (even if you might get a different impression if you talk to me right after I get a paper rejected). In principle, peer review is an opportunity to engage in academic discussion with researchers whose research interests are similar to yours. I have learned a lot from having my papers peer reviewed, and most of my papers have been substantially improved by feedback from reviewers, who often have different perspectives on my research question.

In practice, unfortunately, the peer review process is not a pleasant chat between two researchers who are experts on the same topic. The reviewer has the power to influence whether the paper will be rejected or accepted. Given the pressure to publish, a reviewer may well delay a PhD student’s graduation or impoverish a researcher’s publication record just before an important deadline. That is a different matter, one that has more to do with the incentive system than with the peer review system itself. The point about peer review, though, is that it should be as objective as possible, especially given that, in practice, a PhD student’s graduation may depend on it.

Writing an objective peer review is probably not a big issue for the harder sciences and mathematics, where verifying the calculations should often be enough to decide whether the conclusions of the paper are warranted. In contrast, in soft sciences, one can always find methodological flaws, think of potential confounds or alternative explanations that could account for the results, or demand stronger statistical evidence. The limit is the reviewer’s imagination: whether your paper gets accepted or not may well be a function of the reviewers’ mood, creativity, and implicit biases for or against your lab.

This leads me to the goal of the current blog post: Can we achieve objective peer review in psychological science? I don’t have an answer to this. But here, I aim to summarise the kinds of things that I generally pay attention to when I review a paper (or would like to pay more attention to in the future), and hope for some discussion (in the comments, on twitter) about whether or not this constitutes as-close-to-objective-as-possible peer review.

Title and abstract
Here, I generally check whether the title and abstract reflect the results of the paper. This shouldn’t even be an issue, but I have reviewed papers where the analysis section described a clear null result, but the title and abstract implied that the effect was found.

Introduction
Here, I aim to check whether the review of the literature is complete and unbiased, to the best of my knowledge. As examples of issues that I would point out: the authors selectively cite studies with positive results (or, worse: studies with null-results as if they had found positive results), or misattribute a finding or theory. As minor points, I note if I cannot follow the authors’ reasoning.
I also consider the a priori plausibility of the authors’ hypothesis. The idea is to try and pick up on instances of HARKing, or hypothesising after results are known. If there is little published information on an effect, but the authors predict the exact pattern of results of a 3x2x2x5-ANOVA, I ask them to clarify whether the results were found in exploratory analyses, and if so, then to rewrite the introduction accordingly. Exploratory results are valuable and should be published, but should not be phrased as confirmatory findings, I write.
It is always possible to list other relevant articles or other theories which the authors should cite in the introduction (e.g., more of the reviewer’s papers). Here, I try to hold back with suggestions: the reader will read the paper through the lens of their own research question, anyway, and if the results are relevant for their own hypothesis, they will be able to relate them without the authors writing a novel on all possible perspectives from which their paper could be interesting.

Methods
No experiment’s methods are perfect. Some imperfections make the results uninterpretable; others should be pointed out as limitations; yet others are found in all papers using the paradigm, so it’s perfectly OK for a published paper to have them, unless they happen to be a reviewer’s personal pet peeve. In some instances, it is even considered rude to point out certain imperfections. Sometimes, pointing out an imperfection will just result in the authors citing some older papers which share it. Some imperfections can be addressed with follow-up analyses (e.g., by including covariates), but then it’s not clear what the authors should do if they get ambiguous results, or results that conflict with the original analyses.
Perhaps this is the section with which you can always sink a paper, if you want to: if for no other reason, then, in most cases, on the grounds that the experiment(s) are underpowered. It probably varies from topic to topic and from lab to lab what level of imperfection can be tolerated. I can’t think of any general rules, or specific things that I look at, when evaluating the adequacy of the experimental methods. If authors reported a priori power analyses, one could objectively scrutinise their proposed effect size. In practice, demanding power analyses in a review would be likely to lead to some post-hoc justifications on the side of the authors, which is not the point of power analyses.
So, perhaps the best thing is to simply ask the authors for the 21-word statement, proposed by Simmons et al., which includes a clarification about whether or not the sample size, analyses, and comparisons were determined a priori. I must admit that I don’t do this (but will start to do this in the future): so far, failing to include such a declaration, in my area, seems to fall into the category of “imperfections that are found in all papers, so it could be seen as rude to point them out”. But this is something that ought to change.
Even though the methods themselves may be difficult to review objectively, one can always focus on the presentation of the methods. Could the experiment be reproduced by someone wanting to replicate the study? It is always best if the materials (actual stimuli that were used, scripts (python, DMDX) that were used for item presentation) are available as appendices. For psycholinguistic experiments, I ask for a list of words with their descriptive statistics (frequency, orthographic neighbourhood, other linguistic variables that could be relevant). 

Results
In some ways, results sections are the most fun to review. (I think some people whose paper I reviewed would say that this is the section that is the least fun to get reviewed by me.) The first question I try to answer is: Is it likely that the authors are describing accidental patterns in random noise? As warning signs, I take a conjunction of small sample sizes, strange supposedly a priori hypotheses (see “Methods” section), multiple comparisons without corrections, and relatively large p-values for the critical effects.
Content-wise: Do the analyses and results reflect the authors’ hypotheses and conclusions? Statistics-wise: Are there any strange things? For example, are the degrees of freedom in order, or could there be some mistake during data processing?
There are other statistical things that one could look at, which I have not done to date, but perhaps should start doing. For example, are the descriptive statistics mathematically possible? One can use Nick Brown’s and James Heathers’ GRIM test for this. Is the distribution of the variables, as described by the means and standard deviations, plausible? If there are multiple experiments, are there suspiciously many significant p-values despite low experimental power? Uli Schimmack’s Incredibility Index can be used to check this. Doing such routine checks is very uncommon in peer review, as far as I know. Perhaps reviewers don’t want to include anything in their report that could be misconstrued (or correctly construed) as implying that there is some kind of fraud or misconduct involved. On the other hand, it should also be in the authors' best interests if reviewers manage to pick up potentially embarrassing honest mistakes. Yes, checking such things is a lot of work, but arguably it is the reviewer’s job to make sure that the paper is, objectively, correct, i.e., that the results are not due to a typo, a trimming error, or something more serious. In the same way, reviewers of mathematics papers have to reproduce all calculations, and journals such as the Journal of Statistical Software verify the reproducibility of all code before sending a paper out to review.
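As an illustration, here is a minimal sketch of the GRIM logic in R (my own toy version, assuming integer-valued raw scores such as Likert responses; it is not Brown and Heathers’ actual implementation):

# GRIM logic: with n integer-valued scores, the mean can only be a multiple of 1/n.
# Check whether a mean reported to 'digits' decimal places is achievable at all.
grim_consistent <- function(reported_mean, n, digits = 2) {
  closest_achievable <- round(reported_mean * n) / n  # nearest multiple of 1/n
  round(closest_achievable, digits) == round(reported_mean, digits)
}
grim_consistent(2.57, n = 10)  # FALSE: ten integer scores can only average to multiples of 0.1
grim_consistent(2.60, n = 10)  # TRUE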
And speaking of reproducibility: Ideally, the analyses and results of any paper should be reproducible. This means that the reviewers (or anyone else, for that matter), can take the same data, run the same analyses, and get the same results. In fact, this is more than an ideal scenario: running the same analyses and getting the same results, as opposed to getting results that are not at all compatible with the authors’ conclusions, is kind of a must. This requires that authors upload their data (unless this is impossible, e.g., due to privacy issues), and an analysis script.
The Peer Reviewer’s Openness (PRO) Initiative proposes that reviewers refuse to review any paper that does not provide the data, i.e., that fails to meet the minimum standard of reproducibility. I have signed this initiative, but admit that I still review papers, even when the data is not available. This is not because I don’t think it’s important to request transparency: I generally get overly excited when I’m asked to review a paper, and get halfway through the review before I remember that, as a signatory of the PRO, I shouldn’t be reviewing it at all until the authors provide the data or a reason why this is not possible. I compromise by including a major concern at the beginning of my reviews, stating that the data should be made available unless there are reasons for not making it public. So far, I think, I’ve succeeded only once in convincing the authors to actually upload their data, and a few editors have mentioned my request in their decision letter. 

Discussion
Here, the main questions I ask are: Are the conclusions warranted by the data? Are any limitations clearly stated? Can I follow the authors’ reasoning? Is it sufficiently clear which conclusions follow from the data, and which are more speculative? 
As with the introduction section, it’s always possible to suggest alternative hypotheses or theories for which the results may have relevance. Again, I try not to get too carried away with this, because I see it as the reader’s task to identify any links between the paper and their own research questions.

In conclusion
Peer review is a double-edged sword. Reviewers have the power to influence an editor’s decision, and should use it wisely. In order to be an unbiased gate-keeper to sift out bad science, a reviewer’s report ought to be as objective as possible. I did not aim to make this blog post about Open Science, but looking through what I wrote so far, making sure that the methods and results of a paper are openly available (if possible) and reproducible might be the major goal of an objective peer reviewer. After all, if all information is transparently presented, each reader has the information they need in order to decide for themselves whether they want to believe in the conclusions of the paper. The probability of your paper being accepted for publication would no longer depend on whether your particular reviewers happen to find your arguments and data convincing.  

I will finish the blog post with an open question: Is it possible, or desirable, to have completely objective peer review in psychological science?

Sunday, February 11, 2018

By how much would we need to increase our sample sizes to have adequate power with an alpha level of 0.005?


At our department seminar last week, the recent paper by Benjamin et al. on redefining statistical significance was brought up. In this paper, a large group of researchers argue that findings with a p-value close to 0.05 reflect only weak evidence for an effect. Thus, to claim a new discovery, the authors propose a stricter threshold, α = 0.005.

After hearing of this proposal, the immediate reaction in the seminar room was horror at some rough estimations of either the loss of power or increase in the required sample size that this would involve. I imagine that this reaction is rather standard among researchers, but from a quick scan of the “Redefine Statistical Significance” paper and four responses to the paper that I have found (“Why redefining statistical significance will not improve reproducibility and could make the replication crisis worse” by Crane, “Justify your alpha” by Lakens et al., “Abandon statistical significance” by McShane et al., and “Retract p < 0.005 and propose using JASP instead” by Perezgonzales & Frías-Navarro), there are no updated sample size estimates.

Required sample size estimates for α = 0.05 and α = 0.005 are very easy to calculate with g*power. So, here are the sample size estimates for achieving 80% power, for two-tailed independent-sample t-tests and four different effect sizes (a rough cross-check in R is sketched below the table):

Alpha    N for d = 0.2    N for d = 0.4    N for d = 0.6    N for d = 0.8
0.05     788              200              90               52
0.005    1336             338              152              88
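
For those who prefer R to g*power, here is a rough cross-check using the built-in power.t.test function; note that it reports n per group for the independent-samples case, so the table entries correspond to twice that value, and the numbers may differ from g*power by a participant or two because of rounding.

# Total N for a two-tailed independent-samples t-test, 80% power, d = 0.2
2 * ceiling(power.t.test(delta = 0.2, sd = 1, sig.level = 0.05, power = 0.8, type = "two.sample")$n)   # approx. 788
2 * ceiling(power.t.test(delta = 0.2, sd = 1, sig.level = 0.005, power = 0.8, type = "two.sample")$n)  # approx. 1336
# The tables further down swap in type = "paired" (within-subject design)
# and alternative = "one.sided" (pre-registered direction).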

It is worth noting that most effects in psychology tend to be closer to the d = 0.2 end of the scale, and that most designs are nowadays more complicated than simple main effects in a between-subject comparison. More complex designs (e.g., when one is looking at an interaction) usually require even more participants.

The argument of Benjamin et al., that p-values close to 0.05 provide very weak evidence, is convincing. But their solution raises practical issues which should be considered. For some research questions, collecting a sample of 1336 participants could be achievable, for example by using online questionnaires instead of testing participants at the lab. For other research questions, collecting these kinds of samples is unimaginable. It’s not impossible, of course, but doing so would require a collective change in mindset, the research structure (e.g., investing more resources into a single project, providing longer-term contracts for early career researchers), and incentives (e.g., relaxing the requirement to have many first-author publications).

If we ignore people’s concerns about the practical issues associated with collecting this many participants, the Open Science movement may lose a great many supporters.

Can I end this blog post on a positive note? Well, there are some things we can do to make the numbers from the table above seem less scary. For example, we can use within-subject designs when possible. Things already start to look brighter: Using the same settings in g*power as above, but calculating the required sample size for “Difference between two dependent means”, we get the following:

Alpha    N for d = 0.2    N for d = 0.4    N for d = 0.6    N for d = 0.8
0.05     199              52               24               15
0.005    337              88               41               25

We could also pre-register our study, including the expected direction of a test, which would allow us to use a one-sided t-test. If we do this, in addition to using a within-subject design, we have:

Alpha    N for d = 0.2    N for d = 0.4    N for d = 0.6    N for d = 0.8
0.05     156              41               19               12
0.005    296              77               36               22

The bottom line is: A comprehensive solution to the replication crisis should address the practical issues associated with getting larger sample sizes.

Thursday, February 8, 2018

Should early-career researchers make their own website?


TL;DR: Yes.

For a while, I have been thinking about whether or not to make my own website. I could see some advantages, but at the same time, I was wondering how it would be perceived. After all, I don’t think any of my superiors at work have their own website, so why should I?

To see what people actually think, I made a poll on Twitter. It received some attention and generated some interesting discussions and many supportive comments (you can read them directly by viewing the responses to the poll I linked above). In this blogpost, I would like to summarise the arguments that were brought up (they were mainly pro-website).

But first, and without any further ado, here are the results:

The results are pretty clear, so – here it is: https://www.xenia-schmalz.net/. It’s still a work-in-progress, so I would be happy to get any feedback!

It is noteworthy that there are some people who did think that it’s narcissistic for Early Career Researchers (ECRs) to create their own website. It would have been interesting to get some information about the demographics of these 5%, and their thoughts behind their vote. If you are an ECR who is weighing up the pros and cons of creating a website, then, as Leonid Schneider pointed out, you may want to think about whether you would want to positively impress someone who judges you for creating an online presence. Either way, I decided that the benefits outweigh any potential costs.

Several people have pointed out in response to the twitter poll that a website is only as narcissistic as you make it. This leads to the question: what comes off as narcissistic? I can imagine that there are many differences in opinion on this. Does one focus on one’s research only? Or include some fun facts about oneself? I decided to take the former approach, for the reason that people who google me are probably more interested in my research rather than my political opinion or to find out whether I’m a cat or a dog person.

In general, people who spend more time on self-promotion than on actually doing things that they brag they can do are not very popular. I would rather not self-promote at all than come off as someone with a head full of hot air. Ideally, I would want to let my work speak for itself and for colleagues to judge me based on the quality of my work. This, of course, requires that people can access my work – which is where the website comes in. Depending on how you design your website, this is literally what it is: A way for people to access your work, so they can make their own opinion about its quality. 

In principle, universities create websites for their employees. However, things can get complicated, especially for ECRs. ECRs often change affiliations, and sometimes go for months without having an official job. For example, I refer to myself as a “post-doc in transit”: My two-year post-doc contract at the University of Padova went until March last year, and I’m currently on a part-time short-term contract at the University of Munich until I will (hopefully) get my own funding. In the meantime, I don’t have a website at the University of Munich, only an out-of-date and incomplete website at the University of Padova, and a still-functioning and rather detailed and up-to-date website at the Centre of Cognition and its Disorders (where I did my PhD in 2011-2014; I’m still affiliated with the CCD as an associate investigator until this year, so probably this site will disappear or stop being updated rather soon). Several people pointed out, in the responses to my twitter poll, that they get a negative impression if they google a researcher and only find an incomplete university page: this may come across as laziness or not caring.

What kind of information should be available about an ECR? First, their current contact details. I somehow thought that my email address should be findable by anyone who looks for it, but come to think of it, I have had people contact me through researchgate or twitter to say that they couldn’t find my email address.

Let’s suppose that Professor Awesome is looking to hire a post-doc, and has heard that you’re looking for a job and have all the skills that she needs. She might google you, only to find an out-dated university website with an email address that doesn’t work anymore, and in order to contact you, she would need to find you on researchgate (where she would probably need to have an account to contact you), or search for your recent publications, find one where you are a corresponding author, and hope that the listed email address is still valid. At some stage, Professor Awesome might give up and look up the contact details of another ECR who fits the job description.

Admittedly, I have never heard of anyone receiving a job offer via email out of the blue. But one can think of other situations where people might want to contact you with good news: Invitations to review, to become a journal editor, to participate in a symposium, to give a talk at someone else’s department, to collaborate, to give an interview about your research, or simply to discuss some aspects of your work. These things are very likely to increase your chances of getting a position in Professor Awesome’s lab. For me, it remains an open question whether having a website will actually result in any of these things, but I will report back in one year with my anecdotal data on this.

Second, having your own website (rather than relying on your university to create one for you) gives you more control of what people find out about you. In my case, a dry list of publications would probably not bring across my dedication to Open Science, which I see as a big part of my identity as a scientist.

Third, a website can be a useful tool for linking to your work: not just a list of publications, but also links to full texts, data, materials and analysis scripts. One can even link to unpublished work. In fact, this was one of my main goals while creating the website. In addition to a list of publications in the CV section, I included information about projects that I’m working on or that I have worked on in the past. This was a good reason to get myself organised. First, I sorted my studies by an overarching research question (which has helped me to figure out: What am I actually doing?). Then, for each study, I added a short description (which has helped me to figure out what I have achieved so far), and links to the full text, data and materials (which helped me to verify that I did make this information publicly accessible, which I always tell everyone else that they should do).

Creating the website is therefore a useful tool for myself to keep track of what I'm doing. People on twitter have pointed out in their comments that it can also be useful for others: not only for the fictional Professor Awesome who is only waiting to make you a job offer, but also, for example, for students who would like to apply for a PhD at your department and are interested to get more information about what people in the department are doing.

I have included information about ongoing projects, published articles, and projects-on-hold. Including information about unpublished projects could be controversial: given that preprints are presented alongside published papers, unsuspecting readers might get confused and mistake an unpublished study for a peer-reviewed paper. However, I think that the benefits of making the data and materials of unpublished studies available outweigh the costs. Some of these papers are unpublished for practical reasons (e.g., because I ran out of resources to conduct a follow-up experiment). Even if an experiment turned out to be unpublishable because I made some mistakes in the experimental design, other people might learn from my mistakes in conducting their own research. This is one of the main reasons why I created the website: To make all aspects of all of my projects fully accessible.

Conclusion
As with everything, there are pros and cons to creating a personal website. A con is that some people might perceive you as narcissistic. There are many pros, though: Especially as an ECR, you get a platform with information about your work which will remain independent of your employment status. You increase your visibility, so that others can contact you more easily. You can control what others can find out about you. And, finally, you can provide information about your work that, for whatever reason, does not come across in your publication list. So, in conclusion: I recommend that ECRs make their own website.

Thursday, February 1, 2018

Why I don’t wish that I had never learned about the replication crisis


Doing good science is really hard. It takes a lot of effort and time, and the latter is critical for early-career researchers who have a very limited amount of time to show their productivity to their next potential employer. Doing bad science is easier: It takes less time to plan an experiment, one needs less data, and if one doesn’t get good results immediately (one hardly ever does), the amount of time needed to make the results look good is still less than the amount of time needed to plan and run a well-designed and well-powered study.

Doing good science is frustrating at times. It can make you wonder if it’s even worth it. Wouldn’t life be easier if we were able to continue running our underpowered studies and publish them in masses, without having a bad conscience? I often wonder about this. But the grass always looks greener from the other side, so it’s worth taking a critical look at the BC (before-crisis) times before deciding whether the good old days really were that good.

I learned about the replication crisis gradually, starting in the second year of my PhD, and came to realise its relevance for my own work towards the end of my PhD. During my PhD, I conducted a number of psycholinguistic experiments. I knew about statistical power, in theory – it was that thingy that we were told we should calculate before we start an experiment, but nobody really does that, anyway. I collected as many participants as was the norm in my field. And then the fun started: I remember sleepless nights, followed by a 5-am trip to the lab because I’d thought of yet another analysis that I could try. Frustration when also that analysis didn’t work. Doubts about the data, the experiment, my own competence, and why was I even doing a PhD? Why was I unable to find a simple effect that others had found and published? Once, I was very close to calling the university which gave me my Bachelor of Science degree, and asking them to take it back: What kind of scientist can’t even replicate a stupid effect?

No, I don’t wish that I had never learned about the replication crisis. Doing good science is frustrating at times, but much more satisfying. I know where to start, even if it takes time to get to the stage when I have something to show. I can stand up for what I think is right, and sometimes, I even feel that I can make a difference in improving the system.

Tuesday, January 9, 2018

Why I love preprints


An increasing number of servers are becoming available for posting preprints. This allows authors to post versions of their papers before publication in a peer-reviewed journal. I think this is great. In fact, based on my experiences with preprints so far, if I didn’t need journal publications to get a job, I don’t think I would ever submit another paper to a journal again. Here, I describe the advantages of preprints, and address some concerns that I’ve heard from colleagues who are less enthusiastic about preprints.

The “How” of preprints
Preprints can be simply uploaded to a preprint server: for example, on psyarxiv.com, via osf.io, or even on researchgate. It’s easy. This covers the “how” part.

The “Why” of preprints
In an ideal world, a publication serves as a starting point for a conversation, or as a contribution to an already ongoing discussion. Preprints fulfil this purpose more effectively than journal publications. Their publication takes only a couple of minutes, while publication in a journal can take anywhere between a couple of months to a couple of years. With modern technology, preprints are findable for researchers. Preprints are often posted on social media websites, such as twitter, and are then circulated by others who are interested in the same topic, and critically discussed. With many preprint servers, preprints become listed on google scholar, which sends alerts to researchers who are following the authors. The preprint can also be linked to supplementary material, such as the data and analysis codes, thus facilitating open and reproducible science.

Preprints are advantageous to show an author’s productivity: If someone (especially an early career researcher) is unlucky in obtaining journal publications, they can demonstrate, on their CV, that they are productive, and potential employers can check the preprint to verify its quality and the match of research interests.

The author has a lot of flexibility in the decision of when to upload a preprint. The earlier a preprint is uploaded, the more possibilities the author has to receive feedback from colleagues and incorporate them in the text. The OSF website, which allows users to upload preprints, has a version control function. This means that an updated version of the file can be uploaded, while the older version is archived. Searches will lead to the most recent version, thus avoiding version confusion. At the same time, it is possible to track the development of the paper.

The “When” of preprints
In terms of timing, one option is to upload a preprint shortly after it has been accepted for publication at a journal. In line with many journals’ policies, this is a way to make your article openly accessible to everyone: while uploading the final, journal-formatted version is a violation of copyright, uploading the author’s version is generally allowed1.

Another option is to post a preprint at the same time as submitting the paper to a journal. This has an additional advantage: It allows the authors to receive more feedback. Readers who are interested in the topic may contact the author with corrections or suggestions. If this happens, the author can still make changes before the paper reaches its final, journal-published version. If, conversely, a mistake is noticed only after journal publication, the author either has to live with it, or issue an often stigmatising correction.

A final possibility is to upload a preprint that one does not want to publish. This could include preliminary work, or papers that have been rejected repeatedly by traditional journals. Preliminary work could be based on research directions which did not work out for whatever reason. This would inform other researchers who might be thinking of going in the same direction of potential issues with a given approach: this, in turn, would stop them from wasting their resources by doing the same thing only to find out, too, that it doesn’t work.

Uploading papers that have been repeatedly rejected is a more hairy issue. Here, it is important for the authors to consider why the paper has been rejected. Sometimes, papers really are fundamentally flawed. They could be p-hacked, contain fabricated data or errors in the analyses; the theory and interpretation could rest on non sequiturs or be presented in a biased way. Such papers have no place in the academic literature. But there are other issues that might make a paper unsuitable for publication in a traditional journal, but still useful for others to know about. For example, one might run an experiment on a theoretically or practically important association, and find that one’s measure is unreliable. In such a scenario, a null result is difficult to interpret, but it is important that colleagues know about this, so they can avoid using this measure in their own work. Or, one might have run into practical obstacles in participant recruitment, and failed to get a sufficiently large sample size. Again, it is difficult to draw conclusions from such studies, but if the details of the experiment are publicly available, the data can be included in a meta-analysis. This can be critical for research questions which concern a special population that is difficult to recruit, and may in fact be the only way in which conducting such research is possible.

With traditional journals, one can also be simply unlucky with reviewers. The fact that luck is a huge component in journals’ decisions can be exemplified with a paper of mine, that was rejected as being “irritating” and “nonsense” from one journal, and accepted with minor revisions by another one. Alternatively, one may find it difficult to find a perfectly matching journal for a paper. I have another anecdote as an example of this: After one paper of mine was rejected by three different journals, I uploaded a preprint. A week later, I had received two emails from colleagues with suggestions about journals that could be interested in this specific paper, and two months later the paper was accepted by the first of these journals with minor revisions.  

The possibility of uploading unpublishable work is probably the most controversial point about preprints. Traditional journals are considered to give a paper a seal of approval: a guarantee of quality, as reflected by positive reports of expert reviewers. In contrast, anyone can publish anything as a preprint. If both preprints and journal articles are floating around on the web, it could be difficult, especially for people who are not experts in the field (including journalists, or people who are directly affected by the research, such as patients reading about a potential treatment), to determine which they can trust. This is indeed a concern – however, I maintain that it is an open empirical question whether or not the increase in preprints will exacerbate the spread of misinformation.

The fact is that traditional journals’ peer review is not perfect. Hardly anyone would contest this: fundamentally flawed papers sometimes get published, and good, sound papers sometimes get repeatedly rejected. Thus, even papers published in traditional journals are a mixture of good and bad papers. In addition, there are the notorious predatory journals, that accept any paper against a fee, and publish it under the appearance of being peer reviewed. These may not fool persons who are experienced with academia, but journalists and consumers may find this confusing.

The point stands that the increase in preprints may increase the ratio of bad to good papers. But perhaps this calls for increased caution in trusting what we read: the probability that a given paper is bad is definitely above zero, regardless of whether it has been published as a preprint or in a traditional journal. Maybe, just maybe, the increase in preprints will lead to papers being evaluated on their own merit, rather than on the journal they were published in. Researchers would become more critical of the papers that they read, and post-publication peer review may increase in importance. And maybe, just maybe, an additional bonus will lie in the realisation that we as researchers need to become better at sharing our research with the general public in a way that provides a clear explanation of our work and doesn’t overhype our results.

Conclusion
I love preprints. They are easy, allow for fast publication of our work, and encourage openness and a dynamic approach to science, where publications reflect ongoing discussions in the scientific community. This is not to say that I hate traditional peer review. I like peer review: I have often received very helpful comments from which I have learned about statistics, theory building, and got a broader picture of the views held by colleagues outside of the lab. Such comments are fundamental for the development of high-quality science. 

But: Let’s have such conversations in public, rather than in anonymous email threads moderated by the editor, so that everyone can benefit. Emphasising the nature of science as an open dialogue may be the biggest advantage of preprints.

 __________________________________________
1 This differs from journal to journal. For specific journals’ policies on this issue, see here.

Wednesday, December 20, 2017

Does action video gaming help against dyslexia?


TL;DR: Probably not.

Imagine there were a way to improve reading ability in children with dyslexia that is fun and efficient. For parents of children with dyslexia this would be great: No more dragging your child to therapists, spending endless hours in the evening trying to get the child to practice their letter-sound rules, or forcing them to sit down with a book. According to several recent papers, a fun and quick treatment to improve reading ability might be in sight, and every parent can apply this treatment in their own home: Action video gaming.

Action video games differ from other types of games, because they involve situations where the player has to quickly shift their attention from one visual stimulus to another. First-person shooter games are a good example: one might focus on one part of the screen, and then an “enemy” appears and one needs to direct the visual attention to him and shoot him1.

The idea that action video gaming could improve reading ability is not as random as it might seem at first sight. Indeed, there is a large body of work, albeit very controversial, that suggests that children or adults with dyslexia might have problems with shifting visual attention. The idea that a visual deficit might underlie dyslexia originates from the early 1980s (Badcock et al., Galaburda et al.; references are in the articles linked below), thus it is not in any way novel or revolutionary. A summary of this work would warrant a separate blog post or academic publication, but for some (favourable) reviews, see Vidyasagar, T. R., & Pammer, K. (2010). Dyslexia: a deficit in visuo-spatial attention, not in phonological processing. Trends in Cognitive Sciences, 14(2), 57-63 (downloadable here) or Stein, J., & Walsh, V. (1997). To see but not to read; the magnocellular theory of dyslexia. Trends in Neurosciences, 20(4), 147-152 (downloadable here), or (for a more agnostic review) Boden, C., & Giaschi, D. (2007). M-stream deficits and reading-related visual processes in developmental dyslexia. Psychological Bulletin, 133(2), 346 (downloadable here). It is worth noting that there is little consensus, amongst the proponents of this broad class of visual-attentional deficit theories, about the exact cognitive processes that are impaired and how they would lead to problems with reading.

The way research should proceed is clear: If there is a theoretical groundwork, based on experimental studies, to suggest that a certain type of treatment might work, one does a randomised controlled trial (RCT): A group of patients are randomly divided into two groups, one is subjected to the treatment in question, and the other to a control treatment, and we compare the improvement between pre- and post-measurement in the two groups. To date, there are three such studies:

Franceschini, S., Gori, S., Ruffino, M., Viola, S., Molteni, M., & Facoetti, A. (2013). Action video games make dyslexic children read better. Current Biology, 23(6), 462-466 (here)

Franceschini, S., Trevisan, P., Ronconi, L., Bertoni, S., Colmar, S., Double, K., ... & Gori, S. (2017). Action video games improve reading abilities and visual-to-auditory attentional shifting in English-speaking children with dyslexia. Scientific Reports, 7(1), 5863 (here), and

Gori, S., Seitz, A. R., Ronconi, L., Franceschini, S., & Facoetti, A. (2016). Multiple causal links between magnocellular–dorsal pathway deficit and developmental dyslexia. Cerebral Cortex, 26(11), 4356-4369 (here).

In writing the current critique, I am assuming no issues with the papers at stake, or with the research skills or integrity of the researchers. Rather, I would like to show that, under the above assumptions, the three studies may provide a highly misleading picture of the effect of video gaming on reading ability. The implications are clear and very important: Parents of children with dyslexia have access to many different sources of information, some of which provide only snake-oil treatments. From a quick google search for “How to cure dyslexia”, the first five links suggest modelling letters out of clay, early assessment, multi-sensory instructions, more clay sculptures, and teaching phonemic awareness. As reading researchers, we should not add to the confusion or divert resources from treatments that have actually been shown to work, by adding yet another “cure” to the list.

So, what is my gripe with these three papers? First, that there are only three such papers. As I mentioned above, the idea that there is a deficit in visual-attentional processing amongst people with dyslexia, and that this might be a cause of their poor reading ability, has been floating around for over 30 years. We know that the best way to establish causality is through a treatment study (RCT): We have known this for well over thirty years2. So, why didn’t more people conduct and publish RCTs on this topic?

The Mystery of Missing Data
Here is a hypothesis which, admittedly, is difficult to test: RCTs have been conducted for 30 years, but only three of them ever got published. This is a well-known phenomenon in scientific publishing: in general, studies which report positive findings are easier to publish. Studies which do not find a significant result tend to get stored in file-drawer archives. This is called the File-Drawer Problem, and has been discussed as early as 1979 (Rosenthal, R. (1979). The "File Drawer Problem" and Tolerance for Null Results. Psychological Bulletin, 86(3), 638-641, here). 

The reason this is a problem goes back to the very definition of the statistical test we generally use to establish significance: The p-value. p-values are considered “significant” if they are below 0.05, i.e., below 5%. The p-value is defined as the probability of obtaining the data, or more extreme observations, under the assumption that the null hypothesis is true. The key is the second part. By rephrasing the definition, we get the following: When the effect is not there, the p-value tells us that it is there 5% of the time. This is a feature, not a bug, as it does exactly what the p-value was designed to do: It gives us a long-run error rate and allows us to keep it constant at 5% across a set of studies. But this desired property becomes invalidated in a world where we only publish positive results. In a scenario where the effect is not there, 5 in 100 studies will give us a significant p-value, on average. If only the five significant studies are published, we have a 100% rate of false positives (significant p-values in the absence of a true effect) in the literature. If we assume that the action video gaming effect is not there, then we would expect, on average, three false positives out of 60 studies3. Is it possible that in 30 years, there is an accumulation of studies which trained dyslexic children’s visual-attentional skills and observed no improvement?
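
To make this concrete, here is a quick simulation in R (my own illustration, assuming 16 participants per group and no true effect):

set.seed(1)
n_studies <- 60
# Each "study" compares two groups of 16 drawn from the same population, i.e., there is no true effect
p_values <- replicate(n_studies, t.test(rnorm(16), rnorm(16))$p.value)
sum(p_values < 0.05)   # around 3 of the 60 null studies come out "significant" by chance
mean(p_values < 0.05)  # the long-run rate hovers around 5%
# If only the significant studies get published, the published record consists entirely of false positives.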

Magnitude Errors
The second issue in the currently published literature relates to the previous point, and extends to the possibility that there might be an effect of action video gaming on reading ability. So, for now, let’s assume the effect is there. Perhaps it is even a big effect, let’s say, it has a standardised effect size (Cohen’s d) of 0.3, which is considered to be a small-to-medium-size effect. Realistically, the effect of action video gaming on reading ability is very unlikely to be bigger, since the best-established treatment effects have shown effect sizes of around 0.3 (Galuschka et al., 2014; here).

We can simulate very easily (in R) what will happen in this scenario. We pick a sample of 16 participants (the number of dyslexic children assigned to the action video gaming group in Franceschini et al., 2017). Then, we calculate the average improvement across the 16 participants, in the standardised score:

x = rnorm(16, 0.3, 1)  # simulate improvement scores for 16 participants, true effect d = 0.3
mean(x)                # the observed effect size in this simulated study

The first time I run this code, I get a mean improvement of 0.24. Not bad. Then I run the code again, and get a whopping 0.44! Next time, not so lucky: 0.09. And then, we even get a negative effect, of -0.30.

This is just a brief illustration of the fact that, when you sample from the population, your observed effect will jump around the true population effect size due to random variation. This might seem trivial to some, but, unfortunately, this fact is often forgotten even by well-established researchers, who may go on to treat an observed effect size as a precise estimate.

When we sample, repeatedly, from a population, and plot a histogram of all the observed means, we get a normal distribution: A fair few observed means will be close to the true population mean, but some will not be at all.
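
Extending the simulation from above: if we repeat the 16-participant "study" many times (again, purely an illustration), we can see how widely the observed effects scatter around the true effect of d = 0.3:

set.seed(2)
observed_d <- replicate(10000, mean(rnorm(16, mean = 0.3, sd = 1)))
hist(observed_d, breaks = 50, main = "Observed effects when the true effect is d = 0.3 (n = 16)")
abline(v = 0.3, lwd = 2)  # the true population effect
quantile(observed_d, c(0.025, 0.975))  # roughly -0.2 to 0.8: single small studies are all over the place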

We’re closing in on the point I want to make here: Just by chance, someone will eventually run an experiment and obtain an effect size of 0.7, even if the true effect is 0.5, 0.2, or even 0. Bigger observed effects, when all else is equal, will yield significant results, while smaller observed effects will be non-significant. This means: If you run a study, and by chance you observe an effect size that is bigger than the population effect size, there will be a higher probability that it will be significant and get published. If your identical twin sibling runs an identical study but happens to obtain an effect size that is smaller than yours – even if it corresponds to the true effect size! – it may not be significant, and they will be forced to stow it in their file drawer.

Given that only the significant effects are published (or even if there is a disproportionate number of positive compared to negative outcomes), we end up with a skewed literature. In the first-case scenario, we considered the possibility that the effect might not be there at all. In the second scenario, we assume that the effect is there, but even so, the published studies, due to the presence of publication bias, may have captured effect sizes that are larger than the actual treatment effect. This has been called by Gelman & Carlin (2014, here) the “Magnitude Error”, and has been described, with an illustration that I like to use in talks, by Schmidt in 1992 (see Figure 2, here).

Getting back to action video gaming and dyslexia: Maybe action video gaming improves dyslexia. We don’t know: Given only three studies, it is difficult to adjudicate between two possible scenarios (no effect + publication bias or small effect + publication bias).

So, let’s have a look at the effects reported in the three published papers. I will ignore the 2013 paper4, because it only provides the necessary descriptives in figures rather than tables, and the journal format hides the methods section with vital information about the number of participants god-knows-where. In the 2017 paper, Table 1 provides the pre- and post-measurement values of the experimental and control group, for word reading speed, word reading accuracy, phonological decoding (pseudoword reading) speed, and phonological decoding accuracy. The paper even reports the effect sizes: The action video game training had no effect on reading accuracy. For speed, the effect sizes are d = 0.27 and d = 0.45 for word and pseudoword reading, respectively. In the 2016 paper (Gori et al.), the effect size for the increase in speed for word reading (second row of the table) is 0.34, and for pseudoword reading ability, it is 0.58.

The effect sizes are thus comparable across studies. Putting the effect sizes into context: The 2017 study found an increase in speed, from 88 seconds to 76 seconds to read a list of words, and from 86 seconds to 69 seconds to read a list of pseudowords. For words, this translates to an increase in speed of 14%: In practical terms, if it takes a child 100 hours to read a book before training, it would take the same child only 86 hours to read the same book after training.

In experimental terms, this is not a huge effect, but it competes with the effect sizes for well-established treatment methods such as phonics instruction (Hedges’ g = 0.32; Galuschka et al., 2014)5. Phonics instruction focuses on a proximal cause of poor reading: A deficit in mapping speech sounds onto print. We would expect a focus on proximal causes to have a stronger effect than a focus on distal causes, where there are many intermediate steps between a deficit and reading ability, as explained by McArthur and Castles (2017) here. In our case, the following things have to happen for a couple of weeks of action video gaming to improve reading ability:

- Playing first-person shooter games has to increase children’s ability to switch their attention rapidly,
- The type of attention switching involved in reading has to be the same as the attention switching towards a stimulus which appears suddenly on the screen,
- Improving your visual attention has to lead to an increase in reading speed.

There are ifs and buts at each of these steps. The link between action video gaming and visual-attentional processing would be diluted by other things which train children’s visual-attentional skills, such as how often they read, played tennis, sight-read sheet music, or looked through “Where’s Wally” books during the training period.6 In between visual-attentional processing and reading ability are other variables which affect reading ability and dilute this link: the amount of time the children read at home, motivation and tiredness at the first versus the second testing time point, and many others. These other factors dilute the treatment effect by adding variability to the experiment that is not due to the treatment. This should lead to smaller effect sizes.

In short: There might be an effect of action video gaming on reading ability. But I’m willing to bet that it will be smaller than the effect reported in the published studies. I mean this literally: I will buy a good bottle of a drink of your choice for anyone who can convince me that the effect of 2 weeks of action video gaming on reading ability is in the vicinity of d = 0.3.

How to provide a convincing case for an effect of action video gaming on reading ability
The idea that something as simple as action video gaming can improve children’s ability to do one of the most complex tasks they learn at school is an incredible claim. Incredible claims require very strong evidence. Especially if the claim has practical implications.

To convince me, one would have to conduct a study which is (1) well-powered, and (2) pre-registered. Let’s assume that the effect is, indeed, d = 0.3. With g*power, we can easily calculate how many participants we would need to recruit for 80% power. Setting “Means: Difference between two dependent means (matched pairs)” under “Statistical test”, a one-tailed test (note that both of these decisions increase power, i.e., decrease the number of required participants), an effect size of 0.3, an alpha of 0.05 and power of 0.8, it shows that we need 71 children in a within-subject design to have adequate power to detect such an effect.
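
The same number can be checked in R with power.t.test (a quick sketch; the result may differ from g*power by a participant because of rounding):

# Within-subject (paired), one-tailed, d = 0.3, alpha = 0.05, power = 0.80
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.8, type = "paired", alternative = "one.sided")$n  # approx. 70-71 children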

A study should also be pre-registered. This would remove the possibility of the authors tweaking the data, analysis and variables until they get significant results. This is important in reading research, because there are many different ways in which reading ability can be calculated. For example, Gori and colleagues (Table 3) present 6 different dependent variables that can be used as the outcome measure. The greater the number of variables one can possibly analyse, the greater the flexibility for conducting analyses until at least some contrast becomes significant (Simmons et al., 2011, here). Furthermore, pre-registration will reduce the overall effect of publication bias, because there will be a record of someone having started a given study.

In short: To make a convincing case that there is an effect of the magnitude reported in the published literature, we would need a pre-registered study with at least 70 participants in a within-subject design.

Some final recommendations
For researchers: I hope that I managed to illustrate how publication bias can lead to magnitude errors: the illusion that an effect is much bigger than it actually is (regardless of whether or not it exists). Your perfect study which you pre-registered and published with a significant result and without p-hacking might be interpreted very differently if we knew about all the unpublished studies that are hidden away. This is a pretty terrifying thought: As long as publication bias exists, you can be entirely wrong with the interpretation of your study, even if you do all the right things. We are quickly running out of excuses: We need to move towards pre-registration, especially for research questions such as the one I discussed here, which has strong practical implications. So, PLEASE PLEASE PLEASE, no more underpowered and non-registered studies of action video gaming on reading ability.

For funders: Unless a study on the effect of action video gaming on reading ability is pre-registered and adequately powered, it will not give us meaningful results. So, please don’t spend any more of the taxpayers’ money on studies that cannot be used to address the question they set out to answer. In case you have too much money and don’t know what to do with it: I am looking for funding for a project on GPC learning and automatisation in reading development and dyslexia.

For parents and teachers who want to find out what’s best for their child or student: I don’t know what to tell you. I hope we’ll sort out the publication bias thing soon. In the meantime, it’s best to focus on proximal causes of reading problems, as proposed by McArthur and Castles (2017) here.

-------------------------------------------------------
1 I know absolutely nothing about shooter games, but from what I understand characters there tend to be males.
2 More like 300 years, Wikipedia informs me.
3 This assumes no questionable research practices: With questionable research practices, the false positive rate may inflate to 60%, meaning that we would need to assume the presence of only 2 unpublished studies which did not find a significant treatment effect (Simmons et al., 2011, here)
4 I can do this in a blog post, right?
5 And this is probably an over-estimation, given publication bias.
6 If playing action video games increases visual-attentional processing ability, then so should, surely, these other things?