Do we need a new system of judging?

In 2013, researchers from Oxford University[1] looked at which jobs were susceptible to computer automation and found that there was a 98% chance that Sports Officials would be replaced by computers within 20 years.  I bet they weren’t considering dressage when they came up with that prediction!

Even faced with overwhelming evidence, the leaders of our judging elite don’t believe we have a problem with our current system of judging.  Reading the emails that form part of the joining pack for the FEI Sports Forum 2017 and the Eurodressage.com article, our Dressage Judge General and the president of IDOC (International Dressage Officials Club – the official judges’ club) seem to believe that there is no problem and, even if there is, that it could simply be fixed with more education.

Show me the evidence…

David Stickland’s Global Dressage Analytics (GDA) has moved dressage into the realm of big data.  We now have the means to pinpoint our problems.  Before GDA, we could see if a judge was out compared with the other judges at a competition.  We didn’t know how often it happened, if it was a trend or if it was just one aberrant judge.  If we wanted to see how often this happened it was a lot of work and could only be done for a limited number of tests.

With GDA, we can see the results for every international competition, at each level, even right down to an individual competitor in an individual movement.  We can see if it is just one judge or all judges.  We can use the power of computing to give us unprecedented access to how we perform as a sport and what we can do to improve it.  We don’t have to wait for a major problem to happen at an Olympic Games before we react; we can take the initiative and improve the sport for all competitors, from the grassroots to International Grand Prix.

The best agreement amongst judges is when they award a 7

An interesting example of data providing useful information on judging can be found when looking at the precision with which different scores are given.  We now know that judges are in best agreement when they award a 7±0.5.  If they go above or below this, they start to disagree, a lot.  Clearly this variation is indicative of a problem that we need to address. This can be seen in Figure 1.

Figure 1 (Consistency). Courtesy of David Stickland, Global Dressage Analytics

What makes dressage judging inaccurate?

I would argue that it isn’t the judges.  By and large, they are experts and have spent years learning how to award marks.  They are able to look at a horse and within a few strides work out the Quality of Execution of the movement and spot any mistakes in real time.

It is not only judges that have this ability; most dressage enthusiasts can look at a horse and tell you how well it moves and spot any mistakes.  Quite a few dressage “partners” seem to have this skill even though they may never have sat on a horse – probably due to having had to attend numerous training sessions and competitions – it can’t help but rub off.

What makes a judge different is that they have the ability to translate this perception into a score.

Where is the problem with the system?

The Dressage Handbook is clear on what problems a judge should look for.  It defines 8 categories or components:

  • Precision
  • Rhythm
  • Suppleness
  • Contact
  • Impulsion
  • Straightness
  • Collection
  • Submissiveness

Each component can have multiple problems.  For example, consider Contact. According to the Dressage Handbook, possible problems are:

  • Tight neck
  • Self-carriage
  • Strong in the hand
  • Behind the vertical
  • Not steady
  • Tilted
  • Resistance in the mouth
  • Above the bit

Most of these can have various grades of the problem.

So far, so good.  What is missing is any guidance on what a judge should do if they see several of these problems within Contact, let alone what they should do if there are problems in other components as well.

To give an example, what if the horse is tight in the neck, not steady, above the bit for some strides, behind the vertical for other strides, and has its mouth open?  A horse with all those Contact problems would also have issues in other components such as Rhythm, Impulsion and Collection.  How do you work out what mark between 1 and 10 to award?

My view is that it is impossible to be accurate – at present there is no algorithm that enables a judge to logically make a deduction for any deficit in any component.  This is the core problem.

Sidebar: Is dressage judging more difficult than Quantum Gravity?

If you were going to do a PhD in Quantum Gravity, arguably one of the most difficult qualifications that can be achieved, you would need a grounding in high-school physics and mathematics (2 years), a Bachelor’s degree in Physics (3 years) and then a Doctorate (3 years).  So, acknowledging that you would also need considerable mathematical ability, you could end up with a doctorate in 8 years.

Alternatively, if you want to be an international dressage judge (4*), you will need to spend 10-12 years in training.  To become a 5* judge, who can judge the Olympics, could take about 18-20 years.

Are we really saying that judging dressage is more difficult than Quantum Gravity or is it just that our system has a problem?

So how do judges cope with such an unwieldy system?

According to Dr Inga Wolframm[2], where the processing requirement exceeds the human brain’s capacity, people develop coping strategies.

A good example of this in dressage is that we can predict the ranking and percentages at top competitions by knowing the horses’ history, i.e. if you know what mark a horse normally gets, you are in a good position to guess what its mark will be next time.  I have heard it said that if you want to be a top international judge all you should do is memorise the final mark of the top 100 horses and, if the horse looks great on the day, give it 1% more.  If it is not so good, give it 1% less.  You should not be far off the mark.  Maybe this is what judges mean when they say they judge as a “team” – they know what their colleagues will be awarding a horse – an excellent coping strategy?

Judging biases

As Inga also pointed out, judging (not just dressage) is prone to 7 natural biases:

  • Patriotism bias
  • Halo bias
  • Memory-influence bias
  • Social comparison bias
  • Reputation bias
  • Order bias
  • Conformity bias

Giving the same marks as your colleagues may be a coping strategy; it may also be Conformity bias.  In addition, no rider wants to go first (Order bias) and many riders would claim that they have been beaten by a more famous competitor (Halo bias).  Our current system does not manage these at all well.

What does the dressage public think of our current system?

In February 2015, the International Dressage Riders’ Club (IDRC) did a survey of 3,254 spectators, riders and enthusiasts.  Some results were predictable with over 85% of respondents stating they liked the Freestyle the most. When asked about what they didn’t like, problems with the system of judging were the top three: 58% of respondents considered subjectivity in judging to be the most significant issue in dressage; 50% thought lack of transparency; and 30% had difficulty in understanding judging.

To me, this says that we need to look to improve our system.

 

Part 2 – the way forward

What do we need in a system?

Accuracy – Accuracy is the Holy Grail of a scoring system.  The objective should be that it gets the right score and ranks the competitors in the right order.  This is the minimum.

Transparency – It should also be transparent to competitors and the public.

Workable by humans – educated humans should be able to use the system and be accurate and precise.

Reduce or eliminate natural biases – the system should aim to reduce or eliminate the natural judging biases.

Cost effective – we must be able to afford to use the system.  Horses are expensive and there isn’t a lot of money left for the luxuries of a complicated system that requires a serious investment in high-tech gadgetry.

Scalability – if possible, the system, or elements of it, should be usable from the grassroots right to the Olympic Games.

Let’s start on a 10…

There is an argument that as most horses get an average mark of about a 7, they should start on a 7.  If a movement looks good, it may be awarded more points.  If the horse makes a mistake, then points can be deducted.  Some regard this as a positive approach – a combination can “earn” extra marks, they are “added” on.  The argument goes that this demonstrates that we are not a sport that only looks for mistakes.

I fundamentally disagree.  In practice this is a very negative system that leads to pigeonholing (putting something in a box so that it can be easily categorised).  It really says that you are a 6.5 and you have to do something special to earn even a half point more.  The reality is that it is very difficult to get out of that box once you are in it.  Most riders are well aware of which pigeonhole they are in – once the judges think of you as a 6.5 combination it is devilishly difficult to become a 7 or a 7.5.

In principle, every combination should start on a 10.  As with the law, you are innocent until proven guilty.

Until proven otherwise, the combination is performing at the level of a 10, i.e. excellent, until it either makes a mistake or the Quality of Execution is less than excellent.

If the Quality of Execution is less than excellent, then there will be a specific reason for it.  The Dressage Handbook has already said that there are 8 components that should be evaluated and there will be one or more of these that is less than excellent.

Algorithmic dressage judging? – heresy!

Let’s imagine a test where things are mostly going right – this is the way it should be at international dressage level – the horse has good impulsion but not excellent and it is behind the vertical on some of the steps.  What mark should we give it?  We need a way to work out what the mark is.  One way, our current method, is to look at hundreds of horses that have this level of impulsion and contact issue and know that this is an 8.  What if this horse was also irregular for 2 strides?  Do we now need to see hundreds of horses that have good impulsion, are behind the vertical some of the time and have 2 irregular strides, and know that this is now a 7?  What if it was irregular for more strides and was behind the vertical most of the time?  How many different combinations are there?  Probably tens of thousands – can we really watch and memorise all those videos?  What about mistakes, like one short change, or two, or three or more?  What do we deduct for those?
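To put a rough number on "how many different combinations", here is a minimal back-of-envelope sketch.  It assumes, purely for illustration, that each of the 8 components is graded independently on a small number of coarse levels; the level counts (3 and 4) and the function name are assumptions, not something the Dressage Handbook defines, and mistakes are not counted at all.

```python
COMPONENTS = 8  # the eight components listed from the Dressage Handbook


def quality_profiles(grades_per_component, components=COMPONENTS):
    """Count distinct quality pictures if every component is graded independently."""
    return grades_per_component ** components


# Illustrative grade counts only; neither figure comes from the Handbook.
for grades in (3, 4):
    print(grades, quality_profiles(grades))
# 3 6561
# 4 65536
```

Even with only three or four grades per component, the count is already in the thousands to tens of thousands before a single mistake is considered.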

When you look at it like this, it is clear that the current system’s main problem is that there is no algorithm for making a deduction.  Judges can all see what the problems are, they just don’t have a codified method of knowing what to deduct from the excellent 10.

System-X – a possible approach for making the Dressage Handbook a Code of Points

Two years ago, IDRC President Kyra Kyrklund and I decided to put together a working group to see if we could find a system of judging that would be more consistent AND more accurate. The working group comprised Dane Rawlins, International Dressage Trainers Club, Dr Inga Wolframm, sports psychologist, Kim Ratcliffe, List 1 judge in the UK, Claudia Rees, political scientist and national level show jumper and dressage rider, Kyra and me.

The result was surprisingly simple: the Dressage Handbook is an excellent basis for a Code of Points.  With some small extensions, it could well be possible to use it without too much modification.  The result was System-X, the “X” being the Roman numeral for 10.

The core of System-X

Having studied other judged sports such as gymnastics, it was clear that we needed to address three elements:

  • Errors (mistakes) – the Dressage Handbook already defines the possible errors. What was needed was to define what deduction should be made for each error.  For example, if a horse only does 13 1-tempi changes rather than 15, what should be the deduction?
  • Grade of Execution – there are 8 components in the Dressage Handbook and each component has standard problems. Each of these needs to be assigned a numerical deduction.  It turns out that there are not that many problems so to learn what they are and know what marks should be deducted is not that difficult.  It will take training, but a lot less training than with the current approach.
  • The algorithm for deductions – once we have a template of deductions that should be made for Errors and Grade of Execution, there needs to be a way to take these deductions from a 10.

In practice, the way this would work is very similar to our current system except that it is an algorithm:

Final mark = Starting mark – deduction for Errors – deductions for Quality of Execution.

Imagine a horse that makes no mistakes (no deduction), is a little behind the vertical (deduction of 0.1) and has good but not excellent Impulsion (deduction of 1.0).  Its final mark would be 8.9 (= 10 – 0.1 – 1.0).  This is very simple and has the advantage that judges are already trained to make this kind of evaluation; all they would have to learn is the deductions.
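The formula above can be sketched in a few lines of code.  This is only an illustration under stated assumptions: the table name, its keys and the function name are hypothetical, the only deduction values used are the two from the example (0.1 and 1.0), and the floor of 0 is an assumption.  Decimal arithmetic is used so that tenths of a mark add up exactly.

```python
from decimal import Decimal

STARTING_MARK = Decimal("10")

# Hypothetical excerpt of a deductions template. Only these two values
# appear in the worked example; a real Code of Points would define many more.
QUALITY_DEDUCTIONS = {
    ("Contact", "a little behind the vertical"): Decimal("0.1"),
    ("Impulsion", "good but not excellent"): Decimal("1.0"),
}


def final_mark(error_deductions, quality_deductions, starting_mark=STARTING_MARK):
    """Final mark = starting mark - deductions for Errors - deductions for Quality of Execution."""
    mark = (
        starting_mark
        - sum(error_deductions, Decimal("0"))
        - sum(quality_deductions, Decimal("0"))
    )
    # Assumption: a mark never drops below 0.
    return max(mark, Decimal("0"))


observed = [
    ("Contact", "a little behind the vertical"),
    ("Impulsion", "good but not excellent"),
]
mark = final_mark(
    error_deductions=[],  # no mistakes in this movement
    quality_deductions=[QUALITY_DEDUCTIONS[o] for o in observed],
)
print(mark)  # 8.9
```

The judge's task reduces to noting which entries apply; the subtraction itself is mechanical.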

Only 4 deductions count…

In theory, there could be lots of deductions.  In practice, there are normally not that many, 4 or fewer.  This compares with other sports: in gymnastics, for example, according to one Olympic judge, they normally only have 1-3 deductions at international level and they are all made in the judge’s head.  With the right app, we don’t even have to do it in our heads – we can just make a note of the deduction and the computer will calculate the score.

Scalability

System-X has been designed to be scalable from small shows with only one judge right up to international shows with 5 or more judges.

  • Level 1 – At one-judge shows, up to 4 deductions (including Errors and Grade of Execution) can be made and the final figure score written down or entered in a tablet.
  • Level 2 – At competitions with 3 or more judges, it would be the same as for one-judge shows except all the scores would be added together giving the benefit of different viewing angles.
  • Level 3 – If we ever thought it necessary to get even more accuracy or greater granularity (i.e. more detail) we could split the judging task so that one judge would only look at 2 of the components. This would require 4 or more judges.  Is this necessary or even a good idea?  Frankly, I don’t know if it would give greater accuracy or not.  I do know that it would be a good idea to try it out and see and let the results speak for themselves.
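Levels 1 and 2 could be sketched as follows.  This is a hedged illustration, not a specification: the function names and the sample deductions are made up, the cap of 4 deductions is taken from Level 1, and the marks are simply added together as Level 2 describes.  Dividing the total by the number of judges to report an average is an extra assumption.

```python
from decimal import Decimal

MAX_DEDUCTIONS = 4  # "up to 4 deductions" per movement, as in Level 1


def judge_mark(deductions, starting_mark=Decimal("10")):
    """Level 1: one judge's mark for a movement from at most four deductions."""
    if len(deductions) > MAX_DEDUCTIONS:
        raise ValueError(f"at most {MAX_DEDUCTIONS} deductions per movement")
    # Assumption: a mark never drops below 0.
    return max(starting_mark - sum(deductions, Decimal("0")), Decimal("0"))


def panel_total(deductions_by_judge):
    """Level 2: each judge marks as in Level 1, then the marks are added together."""
    return sum((judge_mark(d) for d in deductions_by_judge), Decimal("0"))


# Three hypothetical judges watching the same movement from different positions.
# The deduction values are invented inputs for illustration only.
panel = [
    [Decimal("0.1"), Decimal("1.0")],
    [Decimal("1.0")],
    [Decimal("0.1"), Decimal("0.5"), Decimal("1.0")],
]
total = panel_total(panel)
print(total)               # 26.3
print(total / len(panel))  # per-judge average; dividing is an assumption
```

Level 3 would need a different split, with each judge responsible for only some of the components, so it is left out of this sketch.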

Conclusion

For dressage to survive in this internet era, we have to be conscious of our Social Licence both in terms of horse welfare and the enthusiasts’ belief in dressage as a sport.

If you have ever read Spencer Johnson’s short book “Who Moved My Cheese?” you will know that sometimes making change is the only way to survive – even if it is, at least at first, uncomfortable.

Wayne Channon

[1] The Future of Employment: How Susceptible Are Jobs to Computerisation?, CF Frey & MA Osborne, 2013

[2] Dr Inga Wolframm, Sports Psychologist, Van Hall Larenstein, 2013.
