Don’t be (data) evil

A few months ago, one of my coauthors e-mailed a guy who had published a paper in JFE to ask for his data. We meant to use this data for a different purpose. No response. I understand; let me explain. This is an analysis that could, in principle, be redone from scratch (i.e., it does not rely on proprietary data), so the guy probably thought “these guys are lazy, so why bother: they should do their homework.”

Okay, I understand there are plausible reasons but, nevertheless, I would categorize this kind of attitude as evil. Replications don’t come for free: they consume time and effort that could be spent doing something new, and they can introduce extra error, since the follow-up research, whose focus is different, may not apply the same standards as the original. Plus, we all know that there are always many small choices that go unreported and that add up to changes in the results.

Consider the following. In an NBER replication study of 67 papers published in major economics journals, only 1 paper out of 3 could be replicated, a share that rose to slightly less than 1 out of 2 with ‘assistance’ from the authors.

We should all applaud the decision made by some journals, AER and JAR among them, to require code and data. Information is a public good, and I wonder how much can be achieved by relying on data charity (my experience has not been good there). However, I am worried that two fundamental problems are still not being formally addressed by authors and editors.

  • Requiring a “credible source policy” for any empirical claim used in the journal

Even journals that have a data policy do not require the sources their articles cite to meet that same policy, whether through the source journal’s own rules or through the authors voluntarily posting their data publicly. Think about it: (a) the data policy exists because there are doubts as to the credibility of any result that cannot be replicated and (b) a general policy of academic journals is not to include any claim that has not been rigorously established. Shouldn’t (a) and (b) imply that no reference can be used that does not meet the data policy or, if it is used, that the failure to meet the data policy should be noted and the claim qualified as ‘tentative’?

Compare this to theory. I know of no article that would treat a claim whose proof is unavailable as more than a conjecture, even if, say, the proof is too large to fit in the margin. Hard sciences are not protected from fraud, but all serious journals there implement strict policies about sharing data. The lack of a credible source policy in the social sciences may, in the end, encourage less transparent journals to free-ride and hurt the very journals that implement the policy.

  • Requiring peer-review over the supplementary documents

Can rules work without enforcement? I have noticed that, even in journals with a data policy, the supplementary documents do not seem to have been refereed to the same standards as the article itself, if they were refereed at all.

Most of the code that is shared is not commented in a way that lets the user know what the variables are and what they do; for the most part, it appears to be the incomplete notes of one co-author. On occasion, the code builds on a data set that comes from a different paper (and whose code is not available), completely defeating the purpose of the replication. There is very little guidance as to which files should be executed and in which order, or as to how each file relates to each result in the paper. In other words, this code has simply not been designed to be re-used by a third party.
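To make the complaint concrete, here is a minimal sketch of what a self-documenting replication package could declare. Everything in it is hypothetical: the script names, the file paths and the `Step` structure are my own illustration, not any journal’s requirement or any author’s actual package. The idea is simply to state the run order, tie each script to a result in the paper, and flag any input that is neither raw data nor produced by an earlier step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One script in a (hypothetical) replication package."""
    script: str          # file to execute
    inputs: tuple        # files the script reads
    outputs: tuple       # files the script writes
    produces: str = ""   # result in the paper, if any


# Hypothetical raw data shipped with the package.
RAW_DATA = {"data/raw/sample.csv"}

# Hypothetical pipeline, listed in the order it should be run.
PIPELINE = [
    Step("01_clean.py", ("data/raw/sample.csv",), ("data/clean.csv",)),
    Step("02_summary.py", ("data/clean.csv",), ("out/table1.tex",), "Table 1"),
    Step("03_regressions.py", ("data/clean.csv",), ("out/table2.tex",), "Table 2"),
]


def missing_provenance(pipeline, raw_data):
    """Return (script, path) pairs for inputs that nothing in the package provides."""
    available = set(raw_data)
    problems = []
    for step in pipeline:
        for path in step.inputs:
            if path not in available:
                problems.append((step.script, path))
        # Outputs become available to every later step.
        available.update(step.outputs)
    return problems


def run_plan(pipeline):
    """Describe the run order and what each script contributes to the paper."""
    lines = []
    for number, step in enumerate(pipeline, start=1):
        target = step.produces or "intermediate data"
        lines.append(f"{number}. {step.script} -> {target}")
    return lines


if __name__ == "__main__":
    for line in run_plan(PIPELINE):
        print(line)
    for script, path in missing_provenance(PIPELINE, RAW_DATA):
        print(f"WARNING: {script} reads {path}, which no step produces")
```

A referee could read the plan in a minute, and a script that quietly depends on a data set built in another paper would show up as a warning instead of a surprise.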

I’ll point here to J. Shapiro and M. Gentzkow’s excellent advice. They say it all, but I’d like to pick up on one point that connects to my experience as a theory person. Writing code should be like writing a proof: clear, elegant and as concise as possible. As with a proof, one should revisit a correct argument to make it cleaner and more direct, and remove unnecessary repetition. The data portion and code should be refereed, and the referee should be able to run the code, understand what it does and suggest improvements.




