Discussion Paper
No. 2017-77 | September 28, 2017
Annette N. Brown and Benjamin Douglas Kuflick Wood
Which tests not witch hunts: a diagnostic approach for conducting replication research
(Published in The practice of replication)

Abstract

This paper provides researchers with an objective list of checks to consider when planning a replication study aimed at validating findings for informing policy. These replication studies should begin with a pure replication of the published results and then reanalyse the original data to address the original research question. The authors present tips for replication exercises in four categories: validity of assumptions, data transformations, estimation methods, and heterogeneous impacts. For each category they offer an introduction, a tips checklist, examples of how these checks have been employed, and a set of resources that provide statistical and econometric details.

JEL Classification:

C10, B41, A20

Cite As

[Please cite the corresponding journal article] Annette N. Brown and Benjamin Douglas Kuflick Wood (2017). Which tests not witch hunts: a diagnostic approach for conducting replication research. Economics Discussion Papers, No 2017-77, Kiel Institute for the World Economy. http://www.economics-ejournal.org/economics/discussionpapers/2017-77


Comments and Questions



Anonymous - Referee report 1
October 19, 2017 - 08:55
This paper suggests checklists to consider in planning and doing a replication study. I agree with the authors on all the points made, and I think that these "tips" will be helpful for many researchers interested in doing replication studies. Furthermore, they are certainly helpful for any original study, too.

Annette Brown - Thank you
December 13, 2017 - 23:53
We thank the referee for his/her comments.

Anonymous - Referee report 2
November 08, 2017 - 15:01
The paper "Which tests not witch hunts: a diagnostic approach for conducting replication research" provides an introductory guidance for how to approach a replication study. The authors provide important considerations for a replication in four dimensions of research: assumptions, data transformations, estimation methods, and impacts. Whereas the title of the paper promises somewhat more concrete tools (tests) for conducting replications, I prefer the approach which is actually given by the authors in the paper. Rather than discussing the methodology of particular tests, the paper boosts the creativity of the reader to find possible and important starting points to conduct a replication study. Although the paper is intended to give guidance for conducting replication studies, the paper just as well suits for conducting original research which is intended to be robust and replicable.

Annette Brown - Thank you
December 13, 2017 - 23:54
We thank the referee for her/his comments.

Anonymous - Referee report 3
December 08, 2017 - 09:14
Summary

Replication is a good and valuable thing: consider the efforts of Camerer et al. in Science, for example, or of Young in his recent working paper. Why don’t we see more replications? What problems might arise, and how might they be solved? In this paper, the authors describe their goal of providing “a neutral checklist that can help replication researchers identify useful ways of validating results.” This very current and topical undertaking is accomplished by summarizing and condensing some recent replication work. Much of the discussion revolves around 3ie-funded replication work, as the authors are or were employed by 3ie. The discussion is broken into four pieces: “validity of assumptions, data transformations, estimation methods, and heterogeneous impacts.”

Discussion

This was an interesting draft to read. As it aims to capitalize on the opportunity presented by a recent surge in replication studies in economics, summarizing the work and formulating a checklist, its reader is prompted to reflect both on what is in the manuscript and on what could be but is not.

Opening

The paper begins by usefully discussing, yet somehow completely mischaracterizing, the recent work of Galiani, Gertler, and Romero (2017). The opening three sentences: “While most researchers accept the scientific premise for replication, many still oppose or resist its practice. One reason given for this resistance is the assumed intent of replication research, especially of internal replication research where the replication researcher works with the data from the original study. Galiani, Gertler, and Romero (2017) claim that replication suffers from ‘overturn bias’, which they attribute to both journal editors and to replication researchers, citing a survey of editors as evidence of the first claim and original authors of replicated studies as evidence of the second claim.” Galiani, Gertler, and Romero’s work is pertinent and worth citing in this context. However, they do not use the word “intent” anywhere in their NBER working paper. They also do not assume it. Nor do they limit themselves to the mentioned survey of editors or comments from authors: on the second page of their working paper, for example, they specifically discuss the 3ie replication effort (the same effort that is the basis for the present manuscript). They point out that although (as one would hope) it is only a minority of replication studies that overturn major results (which accords with this manuscript’s own discussion, so I have no reason to doubt their accuracy), the lone replication to be published comes from that minority. This is an important piece of evidence in the opening arguments made by Galiani, Gertler, and Romero, and it is an error of omission not to mention it in this context. The present manuscript’s characterization of Galiani, Gertler, and Romero (2017) seems to misconstrue it, making it appear flimsy in ways that it is not. Just two sentences later in the present manuscript, though, there is a “challenge” in “avoiding ‘overturn bias.’” The existence of “overturn bias” motivates the manuscript by the end of the first paragraph, a motivation echoed by the title. I am not sure what the rhetorical value was in misconstruing the cited work; it seems this may be a case of inadvertently poor wording. I would suggest that, further to this manuscript’s own goals, the authors characterize at least this cited source more accurately.
Dos and Don’ts

The bigger error of omission in this manuscript, and it is an important one if the goal is in fact to provide a productive set of guidelines, is that it never clearly says what NOT to do. A great opportunity to discuss what not to do, whether in general or in relation to a specific example, comes in the discussion of heterogeneous treatment effects. The authors write: “There are different reasons why original authors might not conduct subgroup analysis or test for heterogeneous impacts. One is simply statistical power. If the dataset is small to begin, there may not be enough power to meaningfully conduct subgroup analysis.” This is a good point (see the numerical sketch after this report). When I read on, to see how this “statistical power” criterion is made practical in the associated checklist, it turns out to be absent: “statistical power” never appears again, either in the checklist or anywhere else in the manuscript. (There was, however, a suggestion to use “machine learning,” which seemed unsubstantiated by any discussion elsewhere in the text.)

Exemplar Don’ts

While this omission seems straightforward to rectify, there is a difficulty in rectifying it. Many suggestions on the checklist are justified by real examples, and to give an example of what not to do is to confront controversy. As Galiani, Gertler, and Romero’s paper describes, the 3ie replications saw no shortage of controversy. Taking the prominent example of the “replication and re-analysis” of Miguel and Kremer (2004), the present manuscript provides the following comment regarding that replication, the authors’ reply, and the ensuing blog-based back-and-forth among original authors, replicators, and other scholars weighing in on various aspects of the re-analysis: “A well-known example of an epidemiological replication study of an econometric paper is Davey, Aiken, Hayes, and Hargreaves (2015).” This is a surprisingly brief understatement, given the extent of the scholarly discussion on this specific example. A challenge here is that this, and several of the other replications, have resulted in (academically) heated exchanges. To provide a checklist that could have remedied any aspect of these conflicts is to weigh in on those conflicts, though doing so might prevent some such conflicts in the future. I recognize that this is a hard thing to do, but other authors in this area have managed to discuss the adversarial character of these exchanges more openly. The present authors are in a unique position, having seen much of the relevant correspondence first-hand, to take an explicit stand on procedural or econometric mistakes that others would do well to avoid. But as it stands, this omission diminishes both the neutrality and the value of the checklist.

Minor note: how to move from example to checklist

In justifying the proposed checklist, it might be helpful to readers to show (perhaps in a table?) which of the replications discussed as examples employs each item on the checklist. Or, if there was another method by which the checklist items were pulled from the examples, it would place the checklist on firmer footing to simply explain the relevant method in relation to each checklist item. I am particularly struck by the “machine learning” checklist item (mentioned earlier), which seems totally unlinked by text to any of the discussed examples, unless I overlooked something.
Minor note: completeness of references for guiding researchers

I should mention that the “validating assumptions” literature section seemed to be missing some of the more comprehensive treatments of quasi-experimental methods in the context of program evaluation. This is easily remedied. In the context of regression discontinuity designs, I respectfully suggest a reference to some of the recent papers by Calonico, Cattaneo, and Titiunik (Econometrica, JASA, R Journal, Stata Journal). More generally, either the Gertler et al. book, “Impact Evaluation in Practice,” or the Angrist and Pischke book, “Mostly Harmless Econometrics,” could provide a good starting point on these methods for researchers unfamiliar with them.

Minor note: completeness of references for replication terminology

Finally, I note that the four-part framework for this paper expands on the framework that the same authors articulated in Brown et al. (2014): as they write, the “middle two categories relate to the measurement and estimation analysis approach to replication (Brown, Cameron and Wood, 2014)” while “The fourth category, heterogeneous outcomes, is related to the theory of change analysis approach to replication.” Since the Brown et al. (2014) paper, however, there have been newer (yet well-cited and well-known) discussions of the broader replication effort accompanied by proposed definitions for the different types of “replication” included in the effort. The present manuscript does not engage with these, but would be of greater use to the profession if it were to do so. Consider referencing Duvendack, Palmer-Jones, and Reed (2014 and 2017) as well as the terminology of Clemens (2017).
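
To make the statistical power point quoted in the report above concrete, here is a minimal sketch, using entirely hypothetical effect size and sample numbers (not drawn from the paper or from any of the studies discussed), of how power for a subgroup analysis can fall well below the power of the full-sample analysis:

```python
# Minimal sketch: an effect that is well powered in the full sample can be
# badly underpowered within a subgroup. All numbers are hypothetical
# placeholders, not values from the manuscript.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
effect_size = 0.3  # assumed standardized effect (Cohen's d)

# Power with the full sample split evenly between treatment and control.
power_full = analysis.power(effect_size=effect_size, nobs1=400,
                            ratio=1.0, alpha=0.05)

# Power within a subgroup one quarter the size of the full sample.
power_sub = analysis.power(effect_size=effect_size, nobs1=100,
                           ratio=1.0, alpha=0.05)

print(f"full-sample power: {power_full:.2f}")  # roughly 0.99
print(f"subgroup power:    {power_sub:.2f}")   # roughly 0.56
```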

Annette Brown - Thank you
December 19, 2017 - 01:15
We thank the referee for her/his comments. We disagree that we mischaracterize the Galiani, et al. study. You are correct that they do not use the word "intent". We will revisit our use of that word in our revision. However, they state "several independent scholars questioned the assumptions made by the replicators, claiming that many of these lacked scientific justification and may have been made to maximize the likelihood of overturning the original results." This reads as Galiani et al. providing evidence to support a claim that replication researchers have the intent of overturning original results, albeit evidence not from replication researchers.

You are correct that we are in a unique position, having seen the relevant correspondence between and among original authors, replication researchers, and ourselves firsthand. Our commentary on that experience is far outside the scope of this paper, but we have started writing a separate manuscript about it. We do not plan to weigh in on the individual conflicts, though -- not in any paper. We established a policy in that regard early on and have tried to remain true to it (although we have sometimes been accused by both sides of weighing in on the other side).

In terms of don'ts, what we have written about in the past and will consider including (or at least referencing) in this paper is how replication researchers interpret and write about their replication results. We spent a lot of time in our roles at 3ie working with replication researchers to edit the way they present their results in an attempt to remedy some of the conflicts.

We agree that the difficulty getting replication studies published is evidence of how editors view replication research. We don't think that anyone doubts that it is hard to get replication studies published, so we're not quite sure what the publication record of the 3ie-funded studies adds to the evidence of Galiani, et al. or to what economists generally know. As it turns out, we are aware that a large number of the 3ie-funded replication studies are under consideration for journal publication right now, so it would be awkward for us to comment one way or another on the evidence that only one study has been published in a journal.

Thanks for catching the machine learning reference. There was a paper we had discussed in the text that used machine learning in a replication study. We deleted the discussion from the text but neglected to remove the item from the list. We will fix this.

The discussion of definitions can get complicated quite quickly, as authors use different characteristics to define terms, not just different terms for the same things. We'll look again at the references you listed and see if we can connect the dots. Thank you for the many great suggestions for additional citations!

Andrew C. Chang - Balance Tests and Negative Recommendations
December 10, 2017 - 00:27 | Author's Homepage
*The opinions expressed here are mine and are not necessarily those of the Board of Governors of the Federal Reserve System* I think Annette and Ben's paper is a useful consolidation of recommendations and associated literature. I found myself thinking about their tips for balance tests in particular. In my opinion, it is usually not useful to run these balance tests, except in the situation where you might be worried that, somehow, there was an error in the randomization procedure itself (e.g., inappropriate intervention in the randomization by a funding agency that was outside of the researcher's control, a point also made by McKenzie, 2017). So I think this recommendation could be more nuanced. Or the paper could be expanded more generally to include some points of things NOT to do, as suggested by referee #3 (although maybe not necessarily the examples articulated by referee #3).
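
To make the discussion concrete, here is a minimal sketch of the kind of covariate balance test at issue; the data frame and column names are hypothetical and not taken from any study cited in the paper. Whether such a table belongs on a replication checklist at all is exactly the nuance raised above.

```python
# Minimal sketch of a covariate balance check: two-sample t-tests of
# baseline covariates across treatment arms. Column names are hypothetical.
import pandas as pd
from scipy import stats

def balance_table(df: pd.DataFrame, treat_col: str, covariates: list) -> pd.DataFrame:
    """Compare baseline covariate means between treatment and control arms."""
    rows = []
    treated = df[df[treat_col] == 1]
    control = df[df[treat_col] == 0]
    for cov in covariates:
        t_stat, p_val = stats.ttest_ind(treated[cov].dropna(),
                                        control[cov].dropna(),
                                        equal_var=False)
        rows.append({"covariate": cov,
                     "treated_mean": treated[cov].mean(),
                     "control_mean": control[cov].mean(),
                     "p_value": p_val})
    return pd.DataFrame(rows)

# Example call (hypothetical data frame and column names):
# print(balance_table(baseline, "treatment", ["age", "female", "baseline_income"]))
```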

Annette Brown - Thank you
December 19, 2017 - 01:20
Thanks Andrew for your comments! We agree that many consider it a bad thing to run balance tests. We will try to be more nuanced in the revision. Needless to say, the don'ts are trickier, and we don't want this paper to look like a commentary on the conflicts from the 3ie-funded studies. As noted in the response to the anonymous referee, we can reference or include some discussion of language (and perhaps even process) to include some don'ts.

Anonymous - Referee report 3
December 18, 2017 - 10:41
This manuscript describes how the replication of RCTs should proceed. It provides checklists for the various phases of a replication, along with selected readings for the various points on the checklists. These checklists remind me of unit test scripts for software, which are applied whenever the code is modified to check that the behavior of the software has not changed in adverse ways. Here, the idea is that the results of an RCT should not change when the checklist is executed. While this is great for replications, I think it should also apply to the initial studies. The published version typically contains one scenario, sometimes with a couple of alternative specifications. Enforcing the use of such a checklist and the publication of its results would greatly help the credibility of RCTs, along with their pre-registration.
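
To spell out the unit-test analogy above, here is a minimal sketch of such a check script; the estimation function and the expected value are hypothetical placeholders, not code from the paper or from any replication study.

```python
# Minimal sketch of the unit-test analogy: a small script that re-runs an
# analysis and checks that a headline estimate has not changed.
import unittest

def estimate_treatment_effect():
    # Placeholder for re-running the study's estimation code on the
    # original data; returns the headline point estimate.
    return 0.12

class TestReplication(unittest.TestCase):
    def test_headline_estimate_unchanged(self):
        # Fails if the re-run estimate drifts from the published value.
        self.assertAlmostEqual(estimate_treatment_effect(), 0.12, places=3)

if __name__ == "__main__":
    unittest.main()
```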

Annette Brown - Thank you
December 19, 2017 - 00:30
We thank the referee for his/her comments. We should clarify that our suggestions are not meant to apply exclusively to RCTs. Many of the checks we present apply to quasi-experimental designs. We are focused on counterfactual-based evaluations of programs or interventions, though, which leads many to think of RCTs. We agree that similar checklists can be quite useful for developing pre-analysis plans for new studies. Much useful guidance has been written by Glennerster, Miguel, and others about pre-analysis plans.

W. Robert Reed - Decision letter
January 20, 2018 - 18:54
see attached file