Discussion Paper

No. 2017-77 | September 28, 2017
Which tests not witch hunts: a diagnostic approach for conducting replication research
(Published in Special Issue The practice of replication)


This paper provides researchers with an objective list of checks to consider when planning a replication study with the objective of validating findings for informing policy. These replication studies should begin with a pure replication of the published results and then reanalyse the original data to address the original research question. The authors present tips for replication exercises in four categories: validity of assumptions, data transformations, estimation methods, and heterogeneous impacts. For each category they offer an introduction, a tips checklist, some examples of how these checks have been employed, and a set of resources that provide statistical and econometric details.

JEL Classification: C10, B41, A20




Cite As

Annette N. Brown and Benjamin Douglas Kuflick Wood (2017). Which tests not witch hunts: a diagnostic approach for conducting replication research. Economics Discussion Papers, No 2017-77, Kiel Institute for the World Economy. http://www.economics-ejournal.org/economics/discussionpapers/2017-77

Comments and Questions

Anonymous - Referee report 1
October 19, 2017 - 08:55

This paper suggests checklists to consider in planning and doing a replication study. I agree with the authors on all the points made and I think that these "tips" will be helpful for many researchers interested in doing replication studies. Furthermore, they are certainly helpful for any original study, too.

Annette Brown - Thank you
December 13, 2017 - 23:53

We thank the referee for his/her comments.

Anonymous - Referee report 2
November 08, 2017 - 15:01

The paper "Which tests not witch hunts: a diagnostic approach for conducting replication research" provides introductory guidance on how to approach a replication study. The authors provide important considerations for a replication in four dimensions of research: assumptions, data transformations, estimation methods, and impacts. Whereas the title of the paper promises somewhat more concrete tools (tests) for conducting replications, I prefer the approach that the authors actually take in the paper. Rather than discussing the methodology of particular tests, the paper boosts the creativity of the reader to find possible and important starting points for a replication study. Although the paper is intended to give guidance for conducting replication studies, it is just as well suited to conducting original research that is intended to be robust and replicable.

Annette Brown - Thank you
December 13, 2017 - 23:54

We thank the referee for her/his comments.

Anonymous - Referee report 3
December 08, 2017 - 09:14


Replication is a good and valuable thing: consider the efforts of Camerer et al. in Science, for example, or of Young in his recent working paper. Why don’t we see more replications? What problems might arise, and how might they be solved? In this paper, the authors describe their goal of providing “a neutral checklist that can help replication researchers identify useful ways of validating results.” This very current and topical undertaking is accomplished by summarizing and condensing some recent replication work. Much of the discussion revolves around 3ie-funded replication work, as the authors are or were employed by 3ie. The discussion is broken into four pieces: “validity of assumptions, data transformations, estimation methods, and heterogeneous impacts.”


This was an interesting draft to read. As it aims to capitalize on the opportunity presented by a recent surge in replication studies in economics, summarizing the work and formulating a checklist, its reader is prompted to reflect on both what is in the manuscript, and what could be but is not.


The paper begins by usefully discussing, yet somehow completely mischaracterizing, the recent work of Galiani, Gertler, and Romero (2017). The opening three sentences:

“While most researchers accept the scientific premise for replication, many still oppose or resist its practice. One reason given for this resistance is the assumed intent of replication research, especially of internal replication research where the replication researcher works with the data from the original study. Galiani, Gertler, and Romero (2017) claim that replication suffers from ‘overturn bias’, which they attribute to both journal editors and to replication researchers, citing a survey of editors as evidence of the first claim and original authors of replicated studies as evidence of the second claim.”

Galiani, Gertler, and Romero’s work is pertinent and worth citing in this context. However, they do not use the word “intent” anywhere in their NBER working paper. They also do not assume it. They do not limit themselves to the mentioned survey of editors or comments from authors: on the second page of their working paper, for example, they specifically discuss the 3ie replication effort (the same effort that is the basis for the present manuscript). They point out that although (as one would hope) it is only a minority of replication studies that overturn major results (which accords with this manuscript’s own discussion, so I have no reason to doubt their accuracy), the lone replication to be published comes from that minority. This is an important piece of evidence in the opening arguments made by Galiani, Gertler, and Romero, and it is an error of omission not to mention it in this context. The present manuscript’s characterization of Galiani, Gertler, and Romero (2017) seems to misconstrue it, making it appear flimsy in ways that it is not.

Just two sentences later in the present manuscript, though, there is a “challenge” in “avoiding ‘overturn bias.’” The existence of “overturn bias” motivates the manuscript by the end of the first paragraph, a motivation echoed by the title. I am not sure what the rhetorical value was in misconstruing the cited work; it seems this may be a case of inadvertently poor wording. I would suggest that, further to this manuscript’s own goals, the authors should characterize at least this cited source more accurately.

Dos and Don’ts

The bigger error of omission in this manuscript, and it is an important one if the goal is in fact to provide a productive set of guidelines, is that it never clearly says what NOT to do. A great opportunity to discuss what not to do, whether in general or in relation to a specific example, comes in the discussion of heterogeneous treatment effects. The authors write:

“There are different reasons why original authors might not conduct subgroup analysis or test for heterogeneous impacts. One is simply statistical power. If the dataset is small to begin, there may not be enough power to meaningfully conduct subgroup analysis.”

This is a good point. When I read on, to see the way this “statistical power” criterion is made practical in the associated checklist, it turns out to be absent: “statistical power” never appears again, either in the checklist or anywhere else in the manuscript. (There was, however, a suggestion to use “machine learning,” which seemed unsubstantiated by any discussion elsewhere in the text.)
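To illustrate the power concern quoted above, here is a minimal sketch of how power falls when a sample is cut down to a subgroup. It is not taken from the manuscript: the function name, the standardized effect size of 0.2, and the sample sizes are all hypothetical, and it uses a normal approximation to a two-sided two-sample test rather than any method the authors propose.

```python
from statistics import NormalDist


def power_two_sample(effect_size, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test.

    effect_size is the standardized mean difference (assumed, not estimated);
    n_per_arm is the number of units in each of the two arms.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed effect.
    shift = effect_size * (n_per_arm / 2) ** 0.5
    # Probability of rejecting in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical study: 400 units per arm in the full sample,
# 100 units per arm in a subgroup covering a quarter of the sample.
full_sample_power = power_two_sample(0.2, 400)  # roughly 0.81
subgroup_power = power_two_sample(0.2, 100)     # roughly 0.29

print(f"full sample: {full_sample_power:.2f}")
print(f"subgroup:    {subgroup_power:.2f}")
```

Under these assumed numbers, a test that is adequately powered in the full sample has well under even odds of detecting the same effect in the subgroup, which is the kind of criterion a checklist item on subgroup analysis could make explicit.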

Exemplar Don’ts

While this omission seems straightforward to rectify, there is a difficulty in rectifying it. Many suggestions on the checklist are justified by real examples, and to give an example of what not to do is to confront controversy. As Galiani, Gertler, and Romero’s paper describes, the 3ie replications saw no shortage of controversy. Taking the prominent example of the “replication and re-analysis” of Miguel and Kremer (2004), the present manuscript provides the following comment regarding that replication, the authors’ reply, and the ensuing blog-based back-and-forth among original authors, replicators, and other scholars weighing in on various aspects of the re-analysis:

“A well-known example of an epidemiological replication study of an econometric paper is Davey, Aiken, Hayes, and Hargreaves (2015).”

This is a surprisingly brief understatement, given the extent of the scholarly discussion on this specific example. A challenge here is that this, and several of the other replications, have resulted in (academically) heated exchanges. To provide a checklist that could have remedied any aspect of these conflicts is to weigh in on those conflicts, though doing so might prevent some such conflicts in the future. I recognize that this is a hard thing to do, but other authors in this area have managed to discuss the adversarial character of these exchanges more openly. The present authors are in a unique position, having seen much of the relevant correspondence first-hand, to take an explicit stand on procedural or econometric mistakes that others would do well to avoid. But as it stands, this omission diminishes both the neutrality and value of the checklist.

Minor note: how to move from example to checklist

In justifying the proposed checklist, it might be helpful to readers to show (perhaps in a table?) which of the replications discussed as examples employs the item on the checklist. Or if there was another method by which the checklist items were pulled from the examples, it would place the checklist on firmer footing to simply explain the relevant method in relation to each checklist item. I am particularly struck by the “machine learning” checklist item (mentioned earlier), which seems totally unlinked by text to any of the discussed examples, unless I overlooked something.

Minor note: completeness of references for guiding researchers

I should mention that the “validating assumptions” literature section seemed to be missing some of the more comprehensive treatments of quasi-experimental methods in the context of program evaluation. This is easily remedied. In the context of regression discontinuity designs, I respectfully suggest a reference to some of the recent papers by Calonico, Cattaneo, and Titiunik (Econometrica, JASA, R Journal, Stata Journal). More generally, either the Gertler, et al. book, “Impact Evaluation in Practice,” or the Angrist and Pischke book, “Mostly Harmless Econometrics,” could provide a good starting point on these methods for researchers unfamiliar with them.

Minor note: completeness of references for replication terminology

Finally, I note that the four-part framework for this paper expands on the framework that the same authors articulated in Brown et al. (2014): as they write, the “middle two categories relate to the measurement and estimation analysis approach to replication (Brown, Cameron and Wood, 2014)” while “The fourth category, heterogeneous outcomes, is related to the theory of change analysis approach to replication.” Since the Brown et al. (2014) paper, however, there have been newer (yet well-cited and well-known) discussions of the broader replication effort accompanied by proposed definitions for the different types of “replication” included in the effort. The present manuscript does not engage with these, but would be of greater use to the profession if it were to do so. Consider referencing Duvendack, Palmer-Jones, and Reed (2014 and 2017) as well as the terminology of Clemens (2017).

Andrew C. Chang - Balance Tests and Negative Recommendations
December 10, 2017 - 00:27

*The opinions expressed here are mine and are not necessarily those of the Board of Governors of the Federal Reserve System*

I think Annette and Ben's paper is a useful consolidation of recommendations and associated literature.

I found myself thinking about their tips for balance tests in particular. In my opinion, it is usually not useful to run these balance tests, except in the situation where you might be worried that, somehow, there was an error in the randomization procedure itself (e.g., inappropriate intervention in the randomization by a funding agency that was outside of the researcher's control, a point also made by McKenzie, 2017). So I think this recommendation could be more nuanced. Or the paper could be expanded more generally to include some points of things NOT to do, as suggested by referee #3 (although maybe not necessarily the examples articulated by referee #3).
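As a rough illustration of why a balance test says little when randomization was carried out correctly, the sketch below simulates it. This is not from the paper or from McKenzie (2017): the function names, sample sizes, and the simulated baseline covariate are all hypothetical, and a normal approximation stands in for a t-test. Because assignment in the simulation is random by construction, any "significant" imbalance it finds is a chance result.

```python
import random
from statistics import NormalDist, mean, variance


def balance_z(treated, control):
    """z-statistic for the difference in means of one baseline covariate."""
    se = (variance(treated) / len(treated) + variance(control) / len(control)) ** 0.5
    return (mean(treated) - mean(control)) / se


def chance_imbalance_rate(n_units=200, n_sims=1000, alpha=0.05, seed=0):
    """Share of correctly randomized trials in which the balance test rejects."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        # Hypothetical baseline covariate, drawn before treatment is assigned.
        covariate = [rng.gauss(0, 1) for _ in range(n_units)]
        # Valid randomization: shuffle units, then treat the first half.
        rng.shuffle(covariate)
        half = n_units // 2
        z_stat = balance_z(covariate[:half], covariate[half:])
        if abs(z_stat) > z_crit:
            rejections += 1
    return rejections / n_sims


rate = chance_imbalance_rate()
print(f"share of valid randomizations flagged as imbalanced: {rate:.3f}")
```

The flagged share should sit near the chosen significance level, so a rejection on its own does not distinguish a flawed randomization from an unlucky draw. That distinction needs outside information about how the assignment was actually done.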