Discussion Paper
No. 2017-79 | September 28, 2017
Richard G. Anderson
Should you choose to do so… A replication paradigm
(Published in The practice of replication)

Abstract

This note introduces the concept of the replication paradigm, a framework that can (and should) be followed in every replication attempt. The paradigm expands, in part, on Bruce McCullough’s well-known paraphrase of Stanford geophysicist Jon Claerbout’s insight – “An applied economics article is only the advertising for the data and code that produced the results” – and on the view that the primary social and scientific value of replication is to measure the scientific contribution of the inferences in an empirical study. The paradigm has four steps. First, in the “candidate study,” identify and state clearly the hypotheses advanced by the study’s authors. Second, provide a clear statement of the authors’ econometric methods. Third, discuss the data. Fourth, discuss the authors’ statistical inference. The author’s purpose in this ordering is to reverse the too-frequent focus in the replication literature on “data.” The correct data, of course, are critical to the replication. But “replication” as a scientific endeavor will never achieve respectability unless and until it abandons a narrow focus on data and expands its focus to the underlying scientific inferences.

JEL Classification:

B41

Cite As

Richard G. Anderson (2017). Should you choose to do so… A replication paradigm. Economics Discussion Papers, No. 2017-79, Kiel Institute for the World Economy. http://www.economics-ejournal.org/economics/discussionpapers/2017-79


Comments and Questions



Anonymous - Referee report 1
October 16, 2017 - 11:38
• The economics profession has hit an all-time low if we now publish replication plans rather than actual replication studies.
• The software used in the study was RATS.
• As for the desire to use 1990–1992 as a structural break, I warn the author about the danger of data mining, as he appears fixated on structural breaks other than the endogenous breaks determined by the Gregory–Hansen procedure. I suggest the author read the cointegration literature on this point regarding exogenously imposed structural breaks.
• As for the reporting of the results (and, for that matter, the so-called replication plan): there is an underlying assumption that the presentation of results is independent of the requests of both referees and editors in the review process.

Anonymous - Referee report 2
November 17, 2017 - 12:24
see attached file

Annette Brown - Some suggestions
December 23, 2017 - 01:32
Hi Richard, I’ve now read the paper three times, and I think I am finally ready to write you some comments. There is a lot of interesting discussion in here – many thoughts clearly based on your years of experience with replication. You have much here that is useful, but I think the paper would make a better contribution if you focused more on your core arguments and tightened the discussion around them. What follows are some suggestions combined with some comments and questions.

I spent quite a bit of time trying to make sure that I understood how exactly you are defining your replication paradigm. In the end, I think you don’t quite nail it down in the discussion paper. I also think that the metaphor of reverse engineering gets confused. Here’s how I would characterize the paradigm:

Replication paradigm
• To maximize the scientific contribution of a pure replication, the study should be conducted as a “do-over” exercise. The do-over exercise begins with documenting the replication space based on the published study and then conducting the empirical analysis from the beginning according to the information in the replication space. The replication space includes:
o The hypotheses advanced by the original authors
o The econometric methods employed by the original authors
o The data used by the original authors
o The basis of inference used by the original authors for the results reported
• To draw conclusions about whether a pure replication is unsuccessful, the replication study should include “reverse engineering” exercises in order to identify why the replication space does not produce the published results, by determining how the published results were produced instead.

I find it confusing to call the first element of the exercise “reverse engineering”. When we talk about reverse engineering a product, we have only the final product in hand, and we want to figure out what materials and production processes were used to make that product. To keep it simple, let’s imagine an Ikea bookcase. For reverse engineering, we would have just the bookcase. We would then take it apart. We would use the taking-apart process to try to guess at the best putting-it-together process, and we would look at the materials we have when it is all apart to figure out what materials we’d need to make another.

I imagine you are thinking that the published article is the bookcase, and you take it apart by pulling out the elements of the replication space and writing them down. But I see the *results* as being the bookcase. The article is the box from Ikea. The box tells me what materials to use (by, hopefully, including all of them in the box) and it tells me what process to use by giving me instructions. When I try to replicate the bookcase, it is a do-over exercise, in that I am taking what Ikea gave me and seeing if I can follow their instructions on their materials to produce their result. In my characterization of your replication paradigm, the replication space is the Ikea box, and documenting that space is like taking everything out of the box and laying it out carefully before you start putting the bookcase together. The replication is trying to put the bookcase together. If the bookcase doesn’t turn out the way it is supposed to (which we all know NEVER happens), then I start reverse engineering it to see if I can figure out what I did wrong. That is, I start to slowly take it apart to see if I can figure out what went wrong. Did I mess up the instructions, or were the instructions wrong? Did the instructions not work because the materials were not the same size or quality as the instructions required? I should not conclude that Ikea sold me a faulty product unless I can show that the problem was really their instructions or their materials, and not that I didn’t put the bookcase together correctly.

Based on similar thinking, I see the second element of the paradigm as being the reverse engineering part of the replication study. It is what is necessary to conclude that the replication was not successful.

To me, the do-over aspect of the first element of the replication paradigm distinguishes it from an auditing exercise. Or perhaps more precisely, it is essentially an audit of the article (did the original authors provide the necessary materials and instructions in the article to build the results?). The replication exercises that I have seen that I would call audits are when a push-button replication fails, i.e. the original authors’ code applied to the original authors’ data does not produce the published results, and the replication researcher reads the code to try to determine why the code does not work. Of course, a code audit might be one of the first reverse engineering exercises that a replication researcher undertakes if the do-over is not successful. It would help answer the question “did I code it wrong, or did they code it wrong?” (Another replication research exercise that might be considered an audit would be comparing the replication study to what was designed or proposed in the replication plan. But, like replication, audit is a term everyone defines differently.)

The purpose of the second element of the replication paradigm as described above – that you should know what isn’t working before you conclude that a replication is unsuccessful – is the same idea that Ben Wood and I talked about in our Development Impact blog post “When is an Error Not an Error?”, where we said, “We submit that the word ‘error’ only be used in replication studies when the replication researcher can identify the source of the mistake.”

I’ll now make some more specific comments on the paper.
• In the abstract you say the value is to “measure the scientific contribution of the inferences in an empirical study”. I’m not sure what you mean by this. A replication, especially a pure replication, which is what you talk about in the discussion paper, verifies published findings. True, those findings cannot make any scientific contribution unless they can be verified, but how much of a contribution they make does not seem a function of the replication.
• I don’t think you need the sentence “I assume that the researcher has selected a single ‘candidate’ empirical study, to be replicated, on top of which she will pursue new work.” You only talk about pure replication here, so whether or not she will pursue new work doesn’t seem relevant.
• I think your discussion of the reverse engineering element of the paradigm, for example the statement “the researcher should explore, to every possible extent, the set of potential causes for the failure to reproduce”, could be made more useful for potential replication researchers if it also included a discussion of time. It seems to me that at some point, the marginal benefit of being able to say “the replication was unsuccessful and here’s why” vs. saying “I tried X, Y, and Z and just can’t figure out why I cannot reproduce the published results” is not worth the marginal cost of a researcher’s time. Ben Wood might be able to provide some perspective based on his experience.
• In addition (or alternatively), it would be useful for potential replication researchers to have some suggested steps or exercises for reverse engineering, like the one I suggested above: start by auditing the original authors’ code to see if their code matches the estimation methods as described in the paper. Another suggested step could be trying alternate versions of the data, as implied by your Dewald et al. (1986) example.
• You use the term replication space (p. 3) before you define it. I recommend you put the full definition of the paradigm, including the definition of the replication space, very early in the article.
• I find the long example from your Dewald et al. (1986) work to be distracting. If I understand correctly, the point of the example is to show the role of data in the replication space and how they can make a difference to the ability to conduct a do-over. I think maybe you are also saying that your reverse engineering exercise in this case was to try alternate versions of the data. Rather than try to present the entire setup from the replication study as an example of the use of the entire paradigm, I’d just use it to make the points about data and/or one exercise for reverse engineering, and not present the full space or any of the equations or graphs. The main objective of the discussion paper is supposed to be the description of a replication plan for a new replication, so I’d focus on that.
• It is really awkward to say here “results of the replication are omitted”. I think that is what turned off one of the referees. The point of the special issue is to present replication plans, so I’d lead with that. If you have indeed conducted the full replication study, you might mention that in a footnote.
• Your presentation of the replication still misses a vital element, which is that you need to say how *you* will determine whether your replication of this study is successful or not. As it stands, you lay out the space and then go back to a general conceptual, “some replicators” discussion. The point of collecting a set of plans is for the readers to learn what you would do. So:
o First, once you have written the code to conduct the empirical analysis based on the information in the replication space, how will you decide for this study whether your results are close enough? The ask here isn’t for you to determine how this should be done for any study, just for this study. You imply that you would compare all published results to yours, for example, and not just key results. For this study, what kinds of differences will cause you to go to the next step?
o If, after the do-over attempt, you do decide that the differences are big enough that you need to identify what’s not working, what are the reverse engineering exercises you would do for this study (and why)? For example, you state that trying different versions of the data does not make sense for this study. Would you look at whether they might have used different variables from the set than those you assumed they used? Would you try different computer programs? Would you try different estimators? OLS or DOLS? What makes the publication of replication plans interesting is learning what you think are the right things to try based on the replication space for this study.
• The discussion starting mid page 11 (comments) and going to the end of section 5 feels extraneous to this paper. I recommend deleting all of it and saving it for a future paper. It is interesting discussion, but you don’t draw any firm conclusions from it, and for me it distracted from the main contributions of the paper, i.e. presenting and describing a replication paradigm and giving an example of how to apply that paradigm to a specific paper.
• It would also be useful to mention whether and when in your replication plan (or under the replication paradigm) you would reach out to the original authors. Should that communication be a necessary part of the reverse engineering exercise, for example?
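
Brown’s “close enough” question can be made operational with a pre-stated tolerance rule. A minimal sketch in Python follows; every name and number in it is an illustrative placeholder, not a value from the candidate study or from any actual replication.

    # Published vs. replicated coefficients, stored as (estimate, standard error).
    # In a real exercise the published values are transcribed by hand from the
    # candidate study's tables; these are hypothetical placeholders.
    published = {
        "beta_income": (0.42, 0.11),
        "beta_price": (-1.07, 0.29),
    }
    replicated = {
        "beta_income": (0.418, 0.112),
        "beta_price": (-1.21, 0.31),
    }

    def close_enough(pub, rep, rel_tol=0.05):
        """Count a coefficient as reproduced if it keeps the published sign
        and lies within rel_tol (relative) of the published value."""
        same_sign = (pub >= 0) == (rep >= 0)
        within_tol = abs(pub - rep) <= rel_tol * abs(pub)
        return same_sign and within_tol

    for name, (pub_b, _pub_se) in published.items():
        rep_b, _rep_se = replicated[name]
        verdict = "reproduced" if close_enough(pub_b, rep_b) else "not reproduced -> reverse engineer"
        print(f"{name}: published {pub_b}, replicated {rep_b}: {verdict}")

The particular rule here (same sign plus a 5% relative tolerance) is only one possibility; the point is that whatever rule is chosen is written into the replication plan before estimation, so the success-or-failure call is not made after the fact.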

Anonymous - Referee report 3
December 23, 2017 - 07:25
Replication studies are important, under-produced, and not very standardized. In his paper, Richard Anderson provides his preferred template for what a replication of a non-experimental paper should be.

I am frequently confused about what a replication should accomplish. Since authors now often post data and code, at one extreme a replication can be downloading these files, running them, and checking the obtained results against those reported in the published paper. This is, I think, what Anderson means by auditing. Basically, it’s a check against typographical error or even fraud, and it can be done in a few hours.

At the other end of the spectrum is something like David Albouy’s work on Acemoglu, Johnson, and Robinson’s famous 2001 AER paper on settler mortality. Albouy painstakingly digs into the historical data used to create the settler mortality variable, pointing out inconsistencies and questionable attributions. He then goes on to show that even a small change in how the variable is constructed can lead to big changes in the results. I’m sure this work took months if not years on Albouy’s part, and it had an impact in helping shift the profession from heavy use of IV to more use of quasi-experimental and experimental methods.

Somewhat in the middle are studies that either obtain data from the original authors or collect as close to the same data as possible, use their own software to calculate results, and compare them to those from the original study. Since the data and code are not identical, the expectation of getting exactly the same coefficients is fairly low, and if the researcher finds the same sign, significance level, and general magnitude of the key coefficients as the original paper, the replication would be deemed a success. Often papers like these go on to extend the dataset and introduce new variables to see if the original result holds under these new circumstances. In some sense this is a combined replication / robustness check.

I personally see value in all three of these approaches. However, for the audit, I think journals should have a graduate assistant perform this task before any paper is accepted for publication. The journal should certify that the results in the paper are mechanically correct.

So, given my view that different enterprises can be considered replications and that they all can have value, I am a bit uncomfortable with Anderson’s message that there is only one right way to do a replication. I guess Anderson could say that my examples above aren’t really replications, but I think most of us would agree they are.

Anderson also places a lot of responsibility on the replicator to be able to say why a replication failed. There are really only three answers to this question: (1) different data, (2) different code or software, (3) error or fraud by either the original author or the replicator. Data, especially macro data, are revised all the time; results from the Penn World Table (PWT) version 7 may not look much like the results obtained from version 8. If the model is non-linear or iterative, then two different programs can produce different results from exactly the same data. So unless the model is basic, it can be very hard for a replicator to tell whether the failure comes from different data or different coding. This is exactly the case in the second example Anderson discusses, where the authors use non-linear techniques (threshold autoregressions) and do not provide their data, their code, or even the name of the software used to produce the results. Anderson applies his paradigm to this paper, but does not tell the reader whether his replication is successful, let alone what caused it to fail, if it did.

To summarize: in my view, replication is an art, not easily reduced to a set formula. There is room for a variety of goals and approaches. I absolutely feel that journals have a responsibility to “audit” empirical papers by running the data and code and ensuring the results in the paper are accurate. Beyond that, I say, “let a thousand flowers bloom”.
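
The push-button audit the referee assigns to a graduate assistant can be scripted in a few lines. The Python sketch below assumes, purely for illustration, that the original authors posted a script that writes its estimates to a CSV file; the file names, script, and expected values are hypothetical stand-ins, not artifacts of any actual paper.

    import csv
    import subprocess

    # Step 1: run the authors' posted code exactly as distributed.
    subprocess.run(["python", "authors_code/estimate.py"], check=True)

    # Step 2: read the coefficients the script wrote out.
    with open("authors_code/output/table2.csv") as f:
        produced = {row["parameter"]: float(row["estimate"])
                    for row in csv.DictReader(f)}

    # Step 3: compare against the values printed in the published table
    # (hand-transcribed placeholders), allowing only for rounding of the
    # published digits.
    published_table2 = {"beta_income": 0.42, "beta_price": -1.07}
    for name, pub in published_table2.items():
        ok = abs(produced[name] - pub) < 5e-3
        print(f"{name}: published {pub}, code produced {produced[name]:.3f} "
              f"-> {'match' if ok else 'MISMATCH'}")

A mismatch at this stage signals exactly the narrow failure the referee wants journals to certify against – the posted code and data do not mechanically reproduce the published table – before any question of different data vintages or different software arises.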

W. Robert Reed - Decision letter
January 20, 2018 - 19:52
see attached file