Yesterday my phone started buzzing more than usual with Twitter updates. Turns out our working paper on A/B testing has attracted the attention of a few notable academics on Twitter. Again.
I'm not sure what prompted this new wave of attention. Perhaps it was the fact that we uploaded a new version to SSRN. Perhaps it is those end-of-year summaries I see people making.
One tweet, however, from Ron Kohavi, Microsoft VP of Analysis and Experimentation, was interesting:
Claim of widespread p-hacking sensationalized in paper suffering lack of external validity on two axes: skew due to massive selection bias and time bias.
— Ronny Kohavi (@ronnyk) December 16, 2018
You can see the linked(in) post here: https://www.linkedin.com/pulse/p-hacking-ab-testing-sensationalized-ronny-kohavi/
"Oh, dear", I was thinking to myself. Kohavi is one of the stars of the A/B testing world. What's up with that?
I will address Kohavi's points below, to the best of my understanding of the issues he raised. I am not sure what prompted this "sensational tweet", but I very much welcome feedback about our paper. I hear that email still works as a technology, though.
TL;DR: Don't be that person that doesn't read an entire post because it'll take you 3 minutes. You'll end up like those who make judgement calls about papers based on Tweets by other people:
1. "Selection bias" - Kohavi says "Optimizely’s product was designed for the uninformed, so there is massive selection bias." By "selection bias" Kohavi means that the experimenters in our data, who use the platform to run their experiments, do not behave like "normal" ("informed") experimenters.
The fact is that Optimizely is the largest (as far as I can tell) 3rd party A/B testing provider, which is why we worked very hard to convince them to share data with us. The median experimenter in our dataset had run 184 experiments before we collected our data, and there are over 900 of them in our dataset. The platform had over 9,000 experiments started on it in the data collection window. This is more than 6 times higher than what Kohavi mentions they ran on Bing in roughly the same time period (though I am sure Bing.com experiments are larger in sample size).
A study that analyzes experimenter behavior in the wild, on a widely used A/B testing platform, has a lot of external validity about how businesses behave when they experiment. The findings might not apply to Microsoft or Facebook, but these companies do not need our help in analyzing their internal experiments. In contrast, conclusions derived from studies of experiments that are run internally at Microsoft or Facebook probably apply less to a "normal" business.
2. "Selection bias" (2) - Even "uninformed" experimenters should have noticed if, experiment after experiment, their decisions result in lower profit. This is basically the rational decision maker assumption we often make in our analysis. We wanted to test it, and we found interesting results.
3. "Selection bias" (3) - Kohavi claims (or assumes) that informed decision makers do not make such mistakes.
In contrast to just claiming otherwise, our paper actually cites many references that show this is incorrect (medical doctors, informed academics and even statisticians make mistakes in interpreting results).
So to summarize - is our dataset representative? It is, if you're the type of A/B tester that runs experiments similar to those used on the platform we got the data from. Luckily, it applies to many if not most businesses.
4. "Time Bias" - Kohavi says "The paper’s external validity across time is therefore weak: most businesses who use A/B testing in the last 2-3 years are aware of the issue, and Optimizely's software improved."
I am not sure where the "most" comes from in this sentence and what data supports this claim. Many businesses I encounter (and I encounter a wide variety of businesses as a business school professor) are not aware of this issue, or are aware but not sure of the details, or are not sure how to resolve it, etc.
Optimizely's software has indeed improved, but was the problem resolved? Well, it depends on what problem we're talking about.
For this, I recommend that you read the paper and not just the abstract, but this is the gist of it: Optimizely's software in 2014 notified users that they could stop an A/B test when the confidence metric was above 95% or below 5%.
However, and this is a big HOWEVER, this is not where we found experimenters stopped their experiments. Many of them stopped when they reached 90% confidence. So the problem that people might have tried to solve since 2014 might not have been the relevant problem to solve.
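To make the stopping issue concrete, here is a minimal simulation sketch. It is not the analysis from our paper and it does not use our data. It simulates A/A tests (no true difference between the arms) and compares checking the result once, at a planned end date, with peeking every day and stopping as soon as "confidence" crosses 90% or drops below 10%. The assumptions are mine: "confidence" is taken to be one minus a one-sided p-value from a pooled two-proportion z-test, which may differ from what any platform actually computes, and the conversion rate, traffic, and test length are hypothetical numbers.

```python
import math
import random


def confidence(conv_a, n_a, conv_b, n_b):
    """Assumed metric: one-sided 'confidence' that B beats A (pooled z-test)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.5  # no information yet
    z = (conv_b / n_b - conv_a / n_a) / se
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))  # standard normal CDF


def run_aa_test(rng, rate=0.1, days=20, visitors_per_day=100,
                threshold=0.90, peek=True):
    """Simulate one A/A test; return True if a winner or loser is declared.

    Both arms share the same true conversion rate, so any declared
    winner or loser is a false discovery. All defaults are hypothetical.
    """
    conv_a = conv_b = n_a = n_b = 0
    for day in range(1, days + 1):
        for _ in range(visitors_per_day):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
        n_a += visitors_per_day
        n_b += visitors_per_day
        # With peeking we check every day; otherwise only on the last day.
        if peek or day == days:
            c = confidence(conv_a, n_a, conv_b, n_b)
            if c >= threshold or c <= 1 - threshold:
                return True
    return False


def false_positive_rate(peek, trials=500, seed=0):
    """Share of simulated A/A tests that end with a (false) discovery."""
    rng = random.Random(seed)
    hits = sum(run_aa_test(rng, peek=peek) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("fixed horizon:", false_positive_rate(peek=False))
    print("daily peeking:", false_positive_rate(peek=True))
```

Checking once at the end should flag roughly 20% of these no-effect tests by construction (10% in each tail). Daily peeking with the same threshold should flag noticeably more, because every extra look is another chance for noise to cross the line. How much more depends on the assumed parameters.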
When I invest time and money in solving a problem, I prefer to know that the problem exists, and what the negative consequences of the problem are. This is what our paper tries to achieve. We simply ask: "Was there a problem, and if there was, how prevalent was it and what consequences did it have?"
We then just report our findings (and how we found them). It would have been interesting had we found no p-hacking and low false discovery rates (FDRs). It is just as interesting that we did find p-hacking, and that it inflated the FDR.
5. "Sensationalized" - Kohavi uses words such as "widespread", "sensationalized", "massive" and "bias" in his text. By using "bias" he actually hints that our results are biased (although what he actually says is that the results are probably correct, just not generalizable).
The paper's abstract is pretty straightforward. It describes the findings of the paper (like any other abstract), and does not use any superlatives. We use very measured language in the paper, and our conclusions very carefully explain what we believe are limitations to conclusions from the data. We have never made any attempt to "sensationalize" the paper. We presented it at conferences, submitted it to peer review, did a standard Wharton podcast about it, etc. Nothing fancy or unique.
So on this point, I don't have a good answer. I'm not sure why Kohavi thinks our paper is sensational. Maybe he means "sensationally interesting"? 🙂
To summarize, what do we actually do in the paper? We test a tradeoff.
On one hand, experimenters should want to avoid false discoveries, because false discoveries lower their profits, and experimenters who run many experiments should notice this pretty quickly. On the other hand, the platform was designed in a way that allows naive experimenters to make mistakes.
We went about seeing which one prevails. It turns out that mistakes prevailed (sometimes) but not where expected, and not in the form assumed.
Isn't this an interesting finding?