Discussion Paper

No. 2017-74 | September 28, 2017
A replication recipe: list your ingredients before you start cooking
(Published in Special Issue The practice of replication)

Abstract

The author argues that researchers should do replications using preanalysis plans. These plans should specify at least three characteristics: (1) how much flowtime the researchers will spend, (2) how much money and effort (working hours) the researchers will spend, and (3) the intended results and the precision of the replication necessary for “success”. A researcher’s replication will be “successful” according to context-specific criteria in the preanalysis plan. The author also argues that the two biggest drawbacks of preanalysis plans are less relevant for replications than for new research: (1) that they discount unexpected but extraordinary findings and (2) that they make it difficult for researchers to prespecify all possible actions in their decision trees. The author concludes by describing a preanalysis plan for replicating a paper on housing demand and household formation.

JEL Classification:

B41, C80, C81, R21

Cite As

Andrew C. Chang (2017). A replication recipe: list your ingredients before you start cooking. Economics Discussion Papers, No 2017-74, Kiel Institute for the World Economy. http://www.economics-ejournal.org/economics/discussionpapers/2017-74


Comments and Questions


Anonymous - Referee report 1
October 18, 2017 - 09:42

see attached file


Andrew C. Chang - Referee#1 Reply
November 22, 2017 - 04:25

See attached.


Anonymous - Referee report 2
November 21, 2017 - 11:50

Summary of the Paper’s Findings/Contributions:

The paper does not have empirical or theoretical findings, per se. It is a think piece, arguing the general merits of preanalysis plans for the case of replication studies and commenting on how to construct them. It raises several provocative issues and, in my opinion, has several thoughtful insights on the subject. However, the author has crafted a preanalysis plan that tailors itself very closely to his own current situation, dramatically limiting the usefulness of this discussion to the field at large. I had hoped to find some sort of scientific approach to how one might select and execute a replication study, and the current write-up reflects something far different from that.

If the currently submitted paper is not useful for laying out a defensible scientific scheme for others to follow, again because of the way situational information is called upon so heavily, then this manuscript becomes a simple addendum to the actual replication study of interest (Haurin and Rosenthal, 2007) that was conducted by the author, such that an abbreviated version of this discussion belongs paired with that work, rather than independently published in Economics: The Open-Access, Open-Assessment E-Journal. This becomes even more obvious once one considers the way the selection criteria were specifically crafted around the mentioned special issue’s call for replications. [As a side note, the special issue’s call was referred to multiple times but never actually pinned down. What journal? Perhaps a link to that call for papers could be included. It is hard to evaluate the author’s choices in the preanalysis plan without seeing the specific constraints of the call.] Most of my comments fall along these lines.

Comments:

1. I fail to see how the statements made about replication attempts do not apply more generally to essentially all scholarly research projects. Consider the use of statements like “Without prespecification, the amount of flowtime and budget that you could invest in a replication could grow uncontrollably” and “Your budget for a ‘successful’ replication is, most likely, less than that of the Bill & Melinda Gates Foundation”. Replace the word ‘replication’ with ‘study’ and you still have equally true statements.

2. Several arbitrary and/or unnecessary criteria are included in the selection method. For example:

a. The statement about not selecting from the author’s own previous replication work is purely redundant with the second (more general) criterion, which stated that previously replicated papers were not to be selected. By definition, if the author had previously replicated a study then someone had previously replicated it.
b. To exclude work by those with a connection to the current place of employment and those with a personal correspondence history seems completely arbitrary. When I started into the description of how the paper to target was “selected”, I expected some sort of broad/general criteria that would produce a large population of potential choices not subject to obvious biases, and then some sort of random selection procedure applied to that pool. [For example, alphabetize the entire list of articles then use a random number generator to select one by position on the list.]
c. The criterion of “A paper that I read within a year prior to the special issue’s call” illustrates the arbitrary nature of this discussion even more clearly than points a and b above. I do not actually know how to interpret this. Is the author really suggesting others follow this approach? How does this ‘selection criterion’ not boil down in the end to picking a paper the individual author is comfortable with and interested in for independent reasons? Again, I am not challenging the idea that this ‘comfortable’ method is a fine way to proceed with picking a study to replicate, but my point is that any ‘comfortable’ and/or ‘clearly non-scientific’ selection method does not merit independent publication as a stand-alone academic contribution.

3. Several arbitrary and/or unnecessary criteria are included in the definition of “success”. For example:

a. Point 4 is quite vague. The author would wait a “prespecified amount of time” (later quantified as a ‘few’ weeks) and would engage in a “prespecified number of attempts” (never quantified).
b. Similarly, point 5 mentions a “flowtime of around two months” to do the replication. How is this different from a normal researcher simply thinking to themselves about their research goals? Would the author’s own life/circumstances that played out during the time in question not be accounted for in a reasonable way? Again, I just don’t see a rigorous scientific contribution here that others could benefit from.
c. Most importantly, point 9 indicates that “If the data that I downloaded was obviously flawed, then I would give up and work on another research paper.” The mind (ok, MY mind) reels upon reading this. I trust the authors of the original Haurin and Rosenthal paper because I have no reason at all not to. I trust the author of this paper for the same reason. Setting that aside, it seems that in the case of a purely doctored research endeavor, this step in the process would actually cause the researcher conducting a replication to give up when they should not. Assume the original study misrepresented the steps taken to include/exclude observations. When the replicator checked simple things like raw observation counts, they would get something quite different and ‘give up’ on that replication.

4. I understand that the researcher, and in turn the journal to which the paper is submitted, carries a direct interest in the field of Economics. However, Economics has direct connections to other disciplines including Finance, Accounting, Marketing, Political Science, Geography, Sociology, Urban Planning, and many others. [A simple review of the 750+ journals indexed by EconLit substantiates this point.] In every way, the paper was written as if other fields do not exist. For some topics, I could see that as an acceptable choice, but in this case it is difficult to defend given the position of the hard sciences, medicine, and psychology as being so dramatically ahead of Economics in terms of replication protocol. Since the paper does not shy away from making normative assertions, I’ll make one of my own. Academic research should seek to reflect the current state of knowledge on the investigated topic, regardless of what field has produced the key insights that determine that knowledge. For example, relevant classic research from psychology that considers the key differences between replications and the scientific method in physical sciences relative to social/behavioral sciences is ignored.


Anonymous - Referee#2 Reply
November 25, 2017 - 22:17

See attached.


Anonymous - Referee report 3
November 28, 2017 - 11:48

see attached file


Andrew C. Chang - Referee#3 Reply
December 04, 2017 - 22:58

See attached.


Anonymous - Referee report 4
November 28, 2017 - 11:51

The submitted manuscript provides both a general discussion of the importance of replication plans and how they should be conducted, and a brief "own" replication plan. I enjoyed reading the manuscript and fully agree with the importance of replication plans.

I will evaluate the manuscript in light of the four issues the authors were asked to touch on in their replication plans.

"(i) a general discussion of principles about how one should do a replication":
I particularly liked the argument that a results-free defense is important to defend replications against (some) original authors. This is, in my view, one of the largest problems in publishing replications: original authors who become referees of their (maybe only partly) replicated papers and do not like to see that they cannot be replicated. I have no remarks on this part.

"(ii) an explanation of why the "candidate" paper was selected for replication":
This is not done awfully convincingly. The choice appears quite arbitrary where, eventually, it turns out that the author just chose one paper he is familiar with anyway in order to demonstrate his ideas, without a real interest in its replication. One could also debate whether one (!) citation in Google Scholar makes a paper “influential”. But then, this is not the focus of this manuscript, and, therefore, fine. (Also, as of today, the paper has 58 citations, so the result is fine even though the criterion is a bit lax.) In particular, as it is not required that the replication is actually performed, this whole point seems to be of minor importance.

"(iii) a replication plan that applies these principles to the "candidate" article,":
This has been done convincingly.

"(iv) a discussion of how to interpret the results of the replication (e.g., how does one know when the replication study "replicates" the original study).":
In my view, this is the only weak part. One of the author’s three criteria for a replication plan is to "set (...) the set of estimates and the degree of precision that will define a "successful" replication." (lines 3 and 4 of the article). In his proposed replication plan, step 10 on page 8 reads: "I would be "successful" if I was able to replicate the Figures (...) to a reasonable degree of accuracy." This is quite vague and, in my view, does not meet the author’s own criterion.

I understand that it is much easier when it comes to regression coefficients (or sample means, numbers of observations) to pre-specify up to what difference between original and replicated coefficient we can speak of a successful replication. Here, the plan is about replicating figures. Nevertheless, this is supposed to be part of the plan, and the author might want to try laying out what a "reasonable degree of accuracy" could be and what the figures would have to look like for the replication not to meet that standard. For sure there will not be an objective metric on what a successful replication of a figure is. Yet, this would be quite an interesting and potentially helpful discussion, in particular for researchers who actually plan to replicate figures.


Andrew C. Chang - Referee#4 Reply
December 04, 2017 - 23:47

See attached.


Annette Brown - Comments
December 19, 2017 - 15:40

I will begin these comments with the caveat that I have not read all the referee reports and replies already posted about this paper.

I agree with the point that context, or purpose, of a replication should determine what a "successful" replication means, although I personally try to avoid using the terms successful and failed. The three examples you give are useful. I think they would be more useful if you extended them to talk a bit about what replication procedures you would employ in the three situations and not just what counts as successful. For example, in the first case (verifying for archival record) you may only need a push button replication, i.e. verifying that the authors' code runs on the authors' data to produce the published results. For the third case, where the researcher wants to learn an econometric technique, the researcher would want to begin with raw data and attempt to write the estimation code herself.

I would also question, though, whether one should ever use the terms successful or failed in the third case. Would it ever make sense to suggest that a replication of a paper failed if a replication researcher cannot manage to recreate the code to replicate the published finding from robustness check #534 scribbled out in footnote #81?

When I was at 3ie, we certainly hoped that replication pre-analysis plans would help reduce the criticisms. Unfortunately, the original authors rarely read them at the time they were posted (i.e. prior to the conduct of the replication research) and often didn't read them even when they were writing their replies. I still believe they are the right thing to do! But I am less optimistic about their effect on the less-than-civil rebuttals. Along these lines, you do not say anything about whether and where replication pre-analysis plans should be posted publicly. The formal registries are not well-equipped for this, but I presume that replication plans could be posted easily in OSF.

I like your inclusion in the plan of flowtime and working hours estimates. At 3ie we found that replication studies ended up taking much longer than anyone expected, for various reasons that often involved a lot of time spent very unproductively. One remedy that we introduced for this is requiring of 3ie-funded replication studies, or recommending for others, that replication researchers always begin with a push button replication. If the authors' code doesn't run on the authors' data, there is no point starting a pure replication (new code on raw data) at the beginning (unless, perhaps, it is solely a learning exercise). I think you could usefully add a lot more discussion here, including: How did you come up with the estimates that you have here? What is the breakdown of the timeline and working hours by some key milestones in the replication work (in your case, maybe by figure in the original paper)? What happens if the timeline or working hours are fully expended and you haven't been able to fully replicate? I personally would not argue that a replication should be considered failed if it cannot be completed in the planned time (either duration or working hours), but I also think there needs to be a stopping point so that a replication researcher does not spin her wheels. The question then becomes: what is this situation called? Here again is why I argue against using successful and failed, as there are many grey situations.

Related to my point above is your step #9 where you say if the data set is obviously flawed, you would give up. One of the referees commented on this too. I agree that sometimes you do need to "give up", but there should still be an output in this case. What would you conclude in this situation, and what would you do with that information? I am not suggesting that this should be considered a "failed" replication, but once you have gone to the trouble of writing a replication plan (posting it?) and starting the process, I think you have some responsibility to report your findings, even if they are that the data set used for (or at least posted as the data set used for) the original paper is flawed, and in what way.

I appreciate the pre-specification of the research process that you present in section 3 of the paper, but I am more interested in the analysis plan, about which there are very few details. You do specify that you will conduct a pure replication (your code on raw data), but here you could be more precise. Will you try to code using their estimation methods as described in the publication? Will you look to other sources to learn more about their methods (i.e. is there a working paper with additional information)? Or will you look at their published findings and apply what you think are the most appropriate methods for producing those results? You touch on this a bit in step 7, but for a pre-*analysis* plan, I would expect more details about the methods and your approach to replicating them.

Finally, if you propose to use the label successful, as you suggest in step 10, then I think you need to be more precise about what you consider to be a "reasonable degree of accuracy". It doesn't help you much to pre-specify if the ultimate judgment is still arbitrary and made after the results are seen.


Andrew C. Chang - Thoughts on Annette's Comments
January 16, 2018 - 01:34

*The views and opinions expressed here are mine and are not necessarily those of the Board of Governors of the Federal Reserve System*

Annette,

Thank you for reading my paper. I would like to expand on two of your points.

First, on the potential less-than-civil rebuttals from original authors, I would hope that, should the original authors counter the replication plan with a less-than-civil rebuttal, the preanalysis plan would also allow researchers other than the replicator to defend the replication.

Second, replication preanalysis plans should probably be made public, or at least made public with a delay (e.g., the Open Science Framework allows authors to delay the release of their preanalysis plans by up to four years).

Andrew


W. Robert Reed - Decision letter
January 20, 2018 - 20:05

see attached file