After the spectacular dive by Arturo Vidal in yesterday’s DFB Pokal semifinal, which arguably secured the victory for Bayern Munich over Werder Bremen, this research is more topical than ever. Scientists from the Frankfurt School of Finance and the University of Marburg have analyzed 4248 Bundesliga games between 2000 and 2014. In particular, their data contain information on all referee decisions and a classification of whether each decision was correct, disputable or clearly wrong. And guess what: focusing only on the 666 clear penalties that were not given by the refs, there is a stark tendency to rule in favor of top teams. Moreover, the effect is by far the strongest for Bayern Munich, Germany’s most successful club team. The probability that a wrongly denied penalty is in their favor is three times higher than the average.
But hold back your “I always knew it” outburst for a moment and let’s stick to the science. There is no published paper and no working paper version out yet. So all we’re left with are some interviews with the authors in newspapers. The Frankfurt School has issued a press release that tells readers to contact one of the authors personally for details about the methodology. That’s of course bad science because we cannot immediately check the results ourselves. But unfortunately it’s common in management science not to publish working paper versions and to keep papers behind journal paywalls.
At least we can get an idea of what the study is doing from the newspaper articles. I cannot judge how trustworthy their data source is. But I can at least imagine a problem with the classification of referee decisions mentioned above. The judgement of whether a decision is correct will always remain subjective to a certain extent. No video replay can completely eradicate this problem. And games involving top teams receive much more media attention. Consequently, critical decisions in those games are discussed more intensively too.
But let’s take their measures for granted for the moment. The results establish that top teams benefit unduly from referee decisions when they play against less prominent teams. The same holds for home teams and for teams fighting for a place in the international tournaments (Champions League and Europa League) or against relegation. The authors conclude that refs are driven by public sentiment and unconsciously rule in favor of the more popular teams.
I truly hope that the authors are not just comparing means here, because these results could be driven by many factors and the conclusion of a referee bias is only valid once these other explanations are ruled out. Generally, even if they control for other influences, I’m skeptical that it’s possible to measure and control for all relevant factors in such an analysis. There are just too many unobservable variables at play, above all the behavior of the players on the pitch. What do Champions League aspirants and teams threatened by relegation have in common? Well, they should play more eagerly and with more effort. Players of top teams might also simply act smarter on the pitch and commit fouls more “skillfully”. That’s not an argument against the referee bias, but it casts doubt on the researchers’ interpretation of the results.
More importantly, I wonder whether the study suffers from a severe selection bias (this time the bias is statistical rather than cognitive). The newspaper reports that the probability of a “deprived penalty being a wrong decision” is significantly higher for Bayern Munich than for the average team. But you can only judge the correctness of a referee’s decision once a foul is committed and the referee actually has something to decide on. Top players most likely commit fewer fouls in the box than the average player. The few remaining situations that top players cannot solve with legal means might be very different from the average foul. This holds, for example, when top players resort to fouls in dead-ball situations, when the box is very crowded and fouls are more difficult to observe (and thus more likely to be ruled in favor of the defending team).
In statistical terminology, what the researchers might estimate is the probability of a wrong decision given that a foul occurred, P(wrong decision in their favor | foul, top-team), which is different from the newspaper’s interpretation, namely the probability of a wrong decision in favor of the top-team, P(wrong decision in their favor | top-team). In the former case you would condition on an (intermediate) outcome variable of the model, “committing a foul”, which is likely to introduce bias (Angrist and Pischke, 2009, Chapter 3.2.3). The problem of selection bias is well-known but very hard to cope with using standard statistical tools. For correct inference you would instead need something close to a controlled experiment that keeps the frequency (and intensity) of fouls constant. I wonder whether the study is able to do that or whether it’s even possible in general.*
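To make the difference between the two probabilities concrete, here is a minimal simulation sketch. Everything in it is my own assumption and nothing comes from the study: the share of crowded situations, the foul probabilities and the referee's error rates are made-up numbers, and the names `simulate`, `P_CROWDED` and `P_MISS` are purely illustrative. The referee in this toy world is unbiased by construction, because the chance of missing a foul depends only on how crowded the box is and never on the team.

```python
import random

# All numbers are invented for illustration; none are taken from the study.
P_CROWDED = 0.3                          # share of box situations that are crowded
P_MISS = {"open": 0.1, "crowded": 0.4}   # referee error rate, identical for every team


def simulate(p_foul, n_situations, rng):
    """Simulate one type of defending team.

    Returns (missed fouls / fouls, missed fouls / all situations), i.e. an
    estimate of P(missed | foul, team) and of P(missed | team).
    """
    fouls = 0
    missed = 0
    for _ in range(n_situations):
        kind = "crowded" if rng.random() < P_CROWDED else "open"
        if rng.random() < p_foul[kind]:        # the defender commits a foul
            fouls += 1
            if rng.random() < P_MISS[kind]:    # the referee fails to see it
                missed += 1
    return missed / fouls, missed / n_situations


rng = random.Random(0)
# Assumption: top players rarely need to foul in open play,
# while average players foul equally often in both kinds of situation.
top_given_foul, top_overall = simulate({"open": 0.02, "crowded": 0.10}, 200_000, rng)
avg_given_foul, avg_overall = simulate({"open": 0.10, "crowded": 0.10}, 200_000, rng)

print(f"P(missed | foul): top {top_given_foul:.3f}, average {avg_given_foul:.3f}")
print(f"P(missed):        top {top_overall:.4f}, average {avg_overall:.4f}")
```

With these made-up parameters the expected values are roughly 0.30 versus 0.19 for the probability conditional on a foul, but about 0.013 versus 0.019 per situation overall. So the top team looks favored once you condition on fouls, although the referee treats both teams identically and the top team actually gets away with fewer fouls in total. It is only a toy example, but it shows how the conditional comparison alone can produce a "bonus" without any referee bias.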
To conclude: I’m very curious about this piece of research, but we will have to wait until a paper gets published to judge the robustness of the conclusions. A short disclaimer might also be in order: as a Munich supporter I will of course deny any “Bayern bonus” until ultimately proven wrong. ;)
* In Judea Pearl’s notation you want to estimate P(wrong decision | do(top-team)) where “foul” is a child of “top-team” in the causal graph. That’s a thorny problem.