On May 12th and 13th 2017, the Journal of Accounting Research conducted its annual conference as Registered Reports (RR). In RR, the review occurs primarily at the research design stage, before results are known. RR protect science against practices that correlate with acceptance likelihood but bias test statistics in ways that cannot be rigorously controlled for, e.g., changing hypotheses, data or specifications in the face of unexpected results or non-results; see below from the discussion paper by Professor C. Chambers.
Overall, the experience has been a great success and provides a template for adoption as a separate submission path in addition to existing channels. In this entry, I will speak in broad terms of what I have learnt about RR in accounting research as well as some specific papers presented at the conference, all of which make a significant contribution to a variety of questions. Then, I’ll cover possible improvements to put on the table if editors want to make this model sustainable. Lastly, I will engage with some bigger issues that RR connect to and the leadership by example that JAR is setting for accounting and non-accounting journals.
RR teach us that many questions are very hard to answer without pre-commitment to a research design, in cases where the universe of possible tests is unknown but likely large. This is true for experimental studies (run 100 experiments about various implications of a theory), for archival studies (correlate an accounting event to 100 possible outcome variables), and even for theory (try 100 assumptions and keep the counter-intuitive implications). Within this state of affairs, the field has become skeptical of virtually any result that comes out of the major journals, leading to a situation of research relativism that is entirely unacceptable. RR offer the means for authors to focus on sound methods, without fearing their findings. RR are therefore to the benefit of both the community and the authors.
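To put a rough number on the "100 tests" problem, here is a minimal sketch. It assumes, purely for illustration, that every null hypothesis is true and that the tests are independent; the function name and the 5% level are my own choices, not something taken from the conference papers.

```python
def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_tests independent tests of a true null
    rejects at significance level alpha (illustrative assumption: independence)."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Compare a single pre-committed test with a search over many tests.
    for n in (1, 10, 100):
        p = prob_at_least_one_false_positive(n)
        print(f"{n:>3} tests -> P(at least one spurious 'result') = {p:.3f}")
```

Under these assumptions, a researcher free to try 100 tests is almost certain to find something "significant" even when nothing is there, which is the case for committing to the tests before the data are seen.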
A secondary insight from the conference is that non-results can be the stars of the show. Many of the studies were well-motivated and intuitive, and one would have expected some effect – indeed, the studies went after hypotheses widely believed to be true by the researchers studying them. Alongside analyses of power, the non-results are prior-changing breakthroughs. They also reveal that knowing what we don’t know is the first step to seeding the next papers.
The opening paper by Y. Ertimur, C. Rawson, J. Roger and S. Zechman is a fine example of RR as a tool to encourage risky data collection, that is, data that may not pan out with an easy-to-interpret or consistent message. ERRZ hand-collected evidence about the inactivity spells of CEOs, a common occurrence with external hires. I would have conjectured, for the CEO labor market, that high-ability CEOs come with more sensitive knowledge, implying greater likelihood of non-compete agreements, longer spells, and higher subsequent pay as a result of ability differentials and compensation for the gap. By contrast, according to conventional labor market unemployment theories, a longer gap may indicate CEOs seeking a valid match, where a longer unemployment spell may publicly indicate lower innate ability and may even directly erode it. Surprisingly enough, ERRZ find results consistent with the conventional theory, thus showing that CEO gaps may not be conceptually that different from unemployment gaps. Perhaps this is a first step toward integrating the study of the CEO labor market with the many robust insights from labor market theories with search frictions.
The paper by K. Allee, M. DeAngelis and J. Moon uses the RR format for an entirely different purpose. A criticism of the text analysis literature is that the degrees of freedom for cranking out (yet another) measure of information in text documents are too large to separate data mining from truth. What ADM do is construct a new measure on sound principles, before its relation to other variables is known. They define scriptability as a score capturing the difficulty in operationalizing machine reading of common accounting documents. Interestingly, the firm characteristics that were conjectured to correlate with scriptability do so remarkably well, a non-trivial feat in an RR setting. Against all expectations, however, scriptability did not seem to have a clear connection to the mechanisms through which information comes into price. Associations with price discovery, bid-ask spreads and volumes are all over the place – often insignificant or even with the wrong sign depending on the document. The paper makes a strong case that machine reading may not have first-order consequences on market price. Beyond the non-result, the paper shows us how the construction of a measure, for settings in which many ad hoc researcher choices are required, can gain additional value and credibility when conducted in the context of RR.
Two papers went after a field experiment. Field experiments let researchers observe real subjects involved in their habitual decisions, but with the benefit of experimental control samples. S. Li and T. Sandino ran a field experiment testing the effect of information sharing systems on creativity. Most of the literature provides experiments that support the conventional view that creativity can be affected by intervention, but there is valid concern that the experiments that did not support this view were rejected or not written into a paper. LS created an app in which a sample of storekeepers in India could share promotional posters that they designed, and learn from posters submitted by others. The results? The experimental treatment failed to elicit much change in the quality of posters submitted by storekeepers. Perhaps the benefits of such information-sharing systems are low relative to what people can learn on their own from direct physical interactions. This result may not be unique to India, and may extend to creative settings in which interpersonal interactions already do a good job at sharing information. It puts a caveat on current beliefs that social media will dramatically increase flows of creative information.
The next field experiment, by D. Bernard, N. Cade and F. Hodge, is a fascinating example of a creative research design to answer a controversial question: does giving away shares help the company sell products? On the one hand, many researchers believe that individuals dramatically exaggerate the effect of their actions – after all, how many of us vote despite the fact that our chances of being pivotal are virtually zero? On the other hand, rational choice advocates argue that, for such obvious decision problems, biases are unlikely to change customer purchases. I wish the reader to stop here and make a bet on which side wins the argument. BCH pre-commit to giving subjects a $20 share of ownership of stock in Starbucks and then compare Starbucks purchases in the treatment sample against those of a control sample given a $20 share of ownership in a different company. The results? No effect is found. Hence, a non-targeted IPO might do very little good for company sales.
The paper was also an interesting learning experience because there was somewhat of a blowup, midway into the paper, as the authors conducted very compelling analyses that did suggest the existence of an effect. But these analyses were unplanned, and conducted ex-post in response to the absence of a main effect. This may seem obviously misguided, but let’s review what these analyses were. Again, they were very compelling: companies distribute shares to customers, and even an IPO would be over-subscribed by existing clients. Hence, a natural test would be to select subjects who drink coffee. In this subsample, it was found that stock ownership increases purchases. But these analyses were only conducted because the main effect was not found, opening the question as to which other reasonable analyses would have been in the universe of possible tests. The message here is clear: “report but do not conclude”; the door is wide open to further work.
The paper by H. Eyring and V.G. Narayanan is another take on the degree-of-freedom problem in some of the questions we ask. EN were given permission to experiment with the information displayed to takers of Harvard’s online courses, showing either means or upper quartiles of the performance distribution, and measuring whether student achievements improved. The issue with such a question is that one could have reported various quantiles, or reported them in different ways, until some effect was observed. Among the papers, this was the one with the strongest confirmation, as it was clear that the reporting of a bar that could be reasonably met increased performance, while a bar that could not be reasonably met or was too low reduced performance or had no effect.
We can debate ad infinitum as to whether the same insight can be applied to settings in which financial incentives are also a choice, but (i) there are compensation settings in which the employees cannot be primarily rewarded based on financial incentives (e.g., unionized positions, bureaucracies), and (ii) the education application is, even on its own, a very important application that can be directly used at very little cost by the many organizations that provide online learning tools. This is one of the rare studies in our area that immediately translates into concrete useful applications, and the RR format has given additional validity to its insights.
Surprisingly, there was only one experimental paper at the conference. Among the entire set, this is the paper that had the most unambiguously clear non-result – a fact that may be a benefit of clear laboratory data where opportunities to seek other variables are limited. Z. Kowaleski, B. Mayhew and A. Tegeler investigate a setting where interactions between audit and non-audit services could create an environment conducive to collusion between the firm and its auditor. They conduct a repeated game and examine whether the equilibrium that is being played seems to shift toward more or less manipulation in a group with both audit and non-audit services. The results? Not at all. The average occurrence of misreporting and auditor enforcement appears to be the same. This also strikes against a widely-held view in the post-Enron world, namely, that auditors should be strictly regulated because most of their activities will contaminate the quality of audits. The paper speaks to the possibility of collusion in a repeated game even without non-audit services and how the consulting culture does not appear to be a factor moving the equilibrium.
The final paper was an outlier. It was a fundamental contribution to accounting knowledge but not the type of paper that starts with a well-defined hypothesis to reject, or even strong researcher priors. Here, the RR format nevertheless played a major role by eliciting a huge effort in data collection. L. Hail, A. Tahoun and C. Wang go back to old news archives in countries all around the world, up to two hundred years in the past when possible, to search for occurrences of scandals and enforcement actions. This was meant to provide descriptive evidence about the lead-lag relation between regulation and scandals, a fact that was previously unknown. This is critical evidence that speaks to the fundamental question: what is the positive model for what regulators do? It’s an odd duck because, as one of the journal’s editors noted after the talk, this was not a paper that was after some risky results, and so the reason for pre-committing to analyses, versus a direct exploratory review of the data, was not self-evident. The paper did inform us that journals may consider an alternate format with pre-acceptance of data collection that is of general interest but resource-consuming, without some of the strict requirements of the full RR.
In a follow-up panel session, Chris Chambers, a world expert in operationalizing RR, noted that the accounting profession has taken the format as its own, adapting it to the needs and challenges of the field. Also, all involved authors strongly supported opening a path in journals for RR, as one possible submission format. I’d like to share here some additional personal thoughts about what might be part of future iterations of this idea.
– Methodology. Most of the review process was focused on data and experimental design questions, probably because these are the first-order aspects to be considered. In terms of methods, most studies fell back on linear regression as the main tool. There was very little Bayesian analysis (the workhorse for these methods in the natural sciences) or even any of the economic analysis tools that are standard in the economic sciences, e.g., estimation of preferences or some choices made within the data set. That’s too bad, because good data is the main obstacle to methodologies that are rich in insights, so the transparency in data in RR is a complement to stepping beyond interpreting conditional means. In fact, RR that focus on analysis methods are perfectly suited to respond to theoretical overfitting, since there is a large universe of statistical or economic models suited to a data set. I hope future iterations of the RR will involve using the format as a springboard for using cutting-edge methods to draw information from data.
– Theory. A striking feature of the seven studies is that all were based on a relevant hypothesis but none was based on a theory. The difference between a hypothesis and a theory is that a theory is a broad set of concepts that will lead to multiple hypotheses. Even if we find that, say, social media increases creativity, or that a particular game induces collusion, it is not clear which theory is being tested or what theory led to these hypotheses. Starting from a primitive theory is difficult, but the RR offers an ambitious agenda to go after research designs that will falsify theories by looking jointly at the many hypotheses they imply.
– Power. C. Chambers also noted in the panel that power analysis is part of RR, but not necessarily a feasible first step in the absence of an easy formulation of a Bayesian prior. This echoed many concerns by the audience that accounting problems are relatively unstructured and do not lend themselves to analyses of power, relative to fields with more structured knowledge of disturbances. Measuring power requires assumptions. Nevertheless, I find counter-intuitive the principle that not making assumptions and not measuring power should dominate a best-effort estimate of power, one that is transparent in the assumptions that have been made and, in the worst cases, offers alternatives. Without power, many of us were unable to set confidence intervals on the non-result and some of the potential value of RR is not realized.
– Conclusions. C. Chambers also made a point, toward the end, that unplanned results contaminate the RR and, while helpful, should be separated in the conclusions. It was striking that authors, not referees, felt compelled not to stop at non-results, and worked very hard to provide supplementary analyses. The quest for meaning is, perhaps, not surprising but does show us, once again, why the RR format is so desirable. For the term RR not to lose what it conveys, special care should be taken to separate these supplementary parts from the main conclusion, perhaps by omitting them altogether from the abstract and requiring such findings to be outlined in a separate paragraph.
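On the power point above, a best-effort estimate that is transparent about its assumptions can be very short. The sketch below is one possible way to do it, not the method used by any of the conference papers: it assumes a two-sided z-test for a difference in means between two equally sized groups with a common, known standard deviation, and the effect sizes and sample size in the usage section are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def power_two_sample(effect: float, sd: float, n_per_group: int,
                     alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test for a difference in means.

    Assumptions (all illustrative): equal group sizes, common known sd,
    normally distributed estimator.
    """
    se = sd * sqrt(2.0 / n_per_group)            # standard error of the difference
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = effect / se                          # true effect in standard-error units
    # Probability of rejecting in either tail when the true effect is `effect`.
    return _STD_NORMAL.cdf(shift - z_crit) + _STD_NORMAL.cdf(-shift - z_crit)


def confidence_interval(estimate: float, se: float, level: float = 0.95):
    """Normal-approximation confidence interval around an estimate."""
    z = _STD_NORMAL.inv_cdf(0.5 + level / 2.0)
    return estimate - z * se, estimate + z * se


if __name__ == "__main__":
    # Hypothetical design: 64 subjects per group, outcome sd normalized to 1.
    for assumed_effect in (0.1, 0.3, 0.5):
        pw = power_two_sample(assumed_effect, sd=1.0, n_per_group=64)
        print(f"assumed effect {assumed_effect:.1f} sd -> power {pw:.2f}")
    # Interval around a hypothetical non-result (estimate 0.02, same design).
    low, high = confidence_interval(0.02, se=sqrt(2.0 / 64))
    print(f"95% interval on the non-result: [{low:.2f}, {high:.2f}]")
```

Reporting power over a grid of assumed effects, rather than a single number, is one way to offer the alternatives mentioned above, and the interval shows which effect sizes a non-result can and cannot rule out.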
To conclude this blog, let us keep in mind that the RR experience, conducted here within the strictest rules by the editors, can teach us about mixed models that borrow from RR and a conventional review process. C. Chambers mentioned offering authors a result-blind submission path, in which authors submit their paper without results. This is different from RR in that the analysis has been conducted and is known by the author, and does have caveats relative to RR. However, it is a path to re-open important non-results or unexpected results that exist but have been buried by the publishing process. It’s also an opportunity to work hand-in-hand with other follow-up RR, or push the community toward the replications that need to be conducted.
As to the broader message, RR teach us about the value of transparency, namely, transparency about research protocols. The transparency is not just to outsiders but also to ourselves, and it makes us realize how the process can distort significance levels.
I will add three short notes, which are probably slightly on the more controversial side but could nevertheless be items to think about.
First, part of the skepticism in research has led to significance thresholds that are probably too stringent for a strict RR. Requiring significance at the 10% level is fine when authors are expected to pick among many regressions, but a less stringent threshold might speak a lot to noisy problems with large economic significance. E.g., should we only accept global warming if we get to 10% significance, or should we act before this? The answer may be the latter if we know that the research has been conducted without bias. Perhaps more discussion is needed about acceptable p-values, and the format of reporting significance stars should be changed.
Second, are we ready for universal adoption of the RR as the method of empirical analysis? Absolutely not! Most of what we do is exploratory in nature, as we don’t have good theories and data is setting-specific and contains sources of noise that are hard to structure. RR might inform us that the current focus on “testing” is excessive as part of our exploration work, and we should instead rethink conventional studies as helping design better theories – once this is done, RR will be the format for the actual testing.
Third, the RR is part of a broader effort, also engaged by JAR, to offer transparency in how the research has been conducted. With the current policy, which is currently only adopted by 9 journals in the Financial Times 50 list (of which only Journal of Finance in financial economics), authors must share their code; see the end of this post for a list. Since we do not expect most papers to be RR, this transparency approach gives the community tools to evaluate the robustness to alternate specifications or hypotheses, and is the standard companion to RR when RR are not possible.
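To illustrate the first note on significance thresholds, the sketch below compares how often a "significant" finding reflects a true effect under two regimes: a pre-registered study with a loose threshold, and a study with a stricter threshold where the authors can search across specifications. Everything here is an assumption made for illustration: the prior share of true hypotheses, the power, the number of specifications, the independence of specifications, and the simplification that power does not change with the search.

```python
def effective_alpha(alpha: float, n_specs: int) -> float:
    """False-positive rate when the best of n_specs independent
    specifications is reported (independence is an illustrative assumption)."""
    return 1.0 - (1.0 - alpha) ** n_specs


def share_of_true_findings(prior: float, power: float, alpha: float,
                           n_specs: int = 1) -> float:
    """Share of significant findings that reflect a true effect.

    prior: assumed share of tested hypotheses that are true.
    power: assumed chance of detecting a true effect (held fixed for
           simplicity, even when several specifications are tried).
    """
    false_rate = effective_alpha(alpha, n_specs)
    true_hits = prior * power
    false_hits = (1.0 - prior) * false_rate
    return true_hits / (true_hits + false_hits)


if __name__ == "__main__":
    # Hypothetical inputs: half of hypotheses true, power of 0.8.
    registered = share_of_true_findings(prior=0.5, power=0.8, alpha=0.10)
    searched = share_of_true_findings(prior=0.5, power=0.8, alpha=0.05, n_specs=10)
    print(f"RR at the 10% level, one planned test:      {registered:.2f}")
    print(f"5% level, best of 10 specifications tried:  {searched:.2f}")
```

Under these made-up inputs, the pre-registered result at the looser threshold is the more credible of the two, which is the sense in which thresholds calibrated for specification search may be too stringent for RR.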
We shall all be looking forward to a wider adoption of the RR guidelines in accounting, so that accounting takes its natural role as a leader in promoting transparency.
Appendix: data policy in the Financial Times 50
|Journal|Code sharing|Non-proprietary data sharing|
|---|---|---|
|American Economic Review|Yes|Yes|
|Journal of Accounting Research|Yes|No|
|Journal of Applied Psychology|Yes|No|
|Journal of Finance|Yes|No|
|Journal of Political Economy|Yes|Yes|
|Quarterly Journal of Economics|Yes|Yes|
|Review of Economic Studies|Yes|Yes|
|Academy of Management Journal|No|No|
|Academy of Management Review|No|No|
|Accounting, Organizations and Society|No|No|
|Administrative Science Quarterly|No|No|
|Contemporary Accounting Research|No|No|
|Entrepreneurship Theory and Practice|No|No|
|Harvard Business Review|No|No|
|Human Resource Management|No|No|
|Information Systems Research|No|No|
|Journal of Accounting and Economics|No|No|
|Journal of Business Ethics|No|No|
|Journal of Business Venturing|No|No|
|Journal of Consumer Psychology|No|No|
|Journal of Consumer Research|No|No|
|Journal of Financial and Quantitative Analysis|No|No|
|Journal of Financial Economics|No|No|
|Journal of International Business Studies|No|No|
|Journal of Management|No|No|
|Journal of Management Information Systems|No|No|
|Journal of Management Studies|No|No|
|Journal of Marketing|No|No|
|Journal of Marketing Research|No|No|
|Journal of Operations Management|No|No|
|Journal of the Academy of Marketing Science|No|No|
|Manufacturing and Service Operations Management|No|No|
|Organizational Behavior and Human Decision Processes|No|No|
|Production and Operations Management|No|No|
|Review of Accounting Studies|No|No|
|Review of Finance|No|No|
|Review of Financial Studies|No|No|
|Sloan Management Review|No|No|
|Strategic Entrepreneurship Journal|No|No|
|Strategic Management Journal|No|No|
|The Accounting Review|No|No|