Judging the sports judges

A new study published on the preprint server arXiv – and thus awaiting peer review – has placed scoring judges in sports under the microscope.

In sports such as gymnastics and figure skating, scoring decisions have fallen under growing scrutiny in recent years as the breadth, depth and intensity of sports coverage in print and broadcast media continue to rise, along with the sums of money involved.

However, technological solutions such as Hawk-Eye, which can determine the most likely trajectory of a moving ball and is used most effectively in tennis and cricket, are not available for these disciplines, which are scored almost entirely on technique. Consequently, the task of scoring falls to a panel of highly qualified, experienced – but ultimately fallible – human beings, working in a noisy and often disagreeable live environment.

Each judge reports a score within a finite range for each performance and, while guidelines exist to assist them in making this determination, and outlying scores are discarded before the remainder are aggregated, each score is inevitably subjective in nature.
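As a rough illustration of that aggregation step, the sketch below discards the highest and lowest marks from a panel and averages the rest. The function name `aggregate_panel`, the `drop` parameter and the choice of a simple mean are assumptions made for this example; each sport defines its own rules for how many marks are dropped and how the remainder are combined.

```python
def aggregate_panel(marks, drop=1):
    """Discard the `drop` highest and `drop` lowest marks, then average the rest.

    Hypothetical helper for illustration only; real scoring rules vary by sport.
    """
    if drop < 0:
        raise ValueError("drop must be non-negative")
    if len(marks) <= 2 * drop:
        raise ValueError("not enough marks left after discarding outliers")
    ordered = sorted(marks)
    kept = ordered[drop:len(ordered) - drop]  # middle of the sorted panel
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Five judges; the 7.0 and the 9.5 are discarded before averaging.
    print(aggregate_panel([9.0, 8.5, 9.5, 7.0, 9.0]))
```

Trimming limits the influence of any single judge on the final result, but the marks that remain are still individual judgements.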

The consequences of perceived erratic judging can be profound. The most famous scoring controversy in recent memory occurred in 2002 at the Winter Olympic Games in Salt Lake City, US, where a scandal in the pairs skating event threatened to overshadow the entire games. It resulted in Canada and original victors Russia sharing the gold medal following an enquiry by the International Skating Union (ISU).

French judge Marie-Reine Le Gougne was allegedly pressured by the Fédération française des sports de glace (FFSG, or French Federation of Ice Sports) into favouring Russia’s duo over their Canadian counterparts. Le Gougne denies the allegations but was suspended from judging for three years by the ISU. She never returned to the sport.

Sandro Heiniger and Hugues Mercier from the Université de Neuchâtel, Switzerland, set out to address such issues by judging the judges, aiming to distinguish suspicious scoring from honest deviations.

To do so, they analysed international competition scores from eight sports with comparable judging systems, namely diving, dressage, figure skating, freestyle skiing, freestyle snowboarding, gymnastics, ski jumping and synchronised swimming. In some sports, the total number of individual marks analysed numbered in the hundreds of thousands.

By calculating a standard deviation of judging error against the median score for each performance, the researchers were able to show that in most sports, judges exhibited greater consensus in scoring the strongest jumps and routines, and less agreement on middling displays.
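The idea can be sketched in a few lines of code. The example below is not the researchers' own model: it assumes each performance is represented by a list of judges' marks, treats the panel median as the reference score, and groups judging errors into score bands of a chosen width. The names `judging_error_sd` and `bin_width` are invented for this illustration.

```python
from collections import defaultdict
from statistics import median, pstdev


def judging_error_sd(panels, bin_width=1.0):
    """Return the standard deviation of judging error per band of median score.

    `panels` is a list of performances, each a list of judges' marks.
    A judge's error is taken as their mark minus the panel median.
    """
    errors_by_band = defaultdict(list)
    for marks in panels:
        reference = median(marks)
        # Lower edge of the score band this performance falls into.
        band = (reference // bin_width) * bin_width
        errors_by_band[band].extend(mark - reference for mark in marks)
    # Population standard deviation of the pooled errors in each band.
    return {band: pstdev(errs) for band, errs in sorted(errors_by_band.items())}


if __name__ == "__main__":
    panels = [
        [9.0, 9.0, 9.1, 9.0, 8.9],  # strong routine, judges largely agree
        [5.0, 5.5, 6.0, 4.5, 5.0],  # middling routine, marks more spread out
    ]
    for band, sd in judging_error_sd(panels).items():
        print(f"median score band {band}: error SD {sd:.3f}")
```

Plotting the resulting values against the score bands gives the kind of curve the study describes, with the spread shrinking where judges agree most.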

For the weakest performances, results varied by sport. In diving and snowboarding, for example, some instances of failure are very clear. These include splashing the water in a dive, or falling during a snowboard run. Such errors produce consistently very low scores with little variability, resulting in a concave quadratic curve – unsurprising, perhaps, as the study notes that the very best and worst performances contain either “less components to evaluate or less errors to deduct”.

In the more artistic forms of gymnastics and synchronised swimming, however, allotting scores to anything other than an outstanding performance is more subjective. The analysis demonstrated this, with fewer marks awarded and greater deviation at the lowest end of the scoring range.

The trend across most sports, however, is consistent enough to suggest that recent frameworks implemented to improve scoring integrity in gymnastics – all but eliminating the possibility of perfect-10 scores, to the horror of many – can be applied successfully to other sports. The researchers also believe their model can be applied to distinguish honest but erratic scoring from biased judging, as the outlier detection threshold for the former is lower than for the latter.

One anomaly is clear from the analysis – in dressage, a popular equestrian sport, the curve showing standard deviation of judging error is convex, with a distinctly higher frequency of inaccurate scoring at the highest and lowest ends of the scale, across all levels of competition and for every judging position around the arena.

The study concludes that judges simply show no consensus on what constitutes an above-average horse-riding performance, and urges the Fédération Équestre Internationale (FEI) to review its practices. No horses were available for comment.
