truechess.com also did this, in a more comprehensive way than the authors of the paper(s) reported by Chessbase.
Ken Regan has also compiled some results like this in his attempts to devise an automated cheating detection framework.
The problem is that the analysis is fairly shallow, out of necessity.
Once you start comparing classical time control games between very strong players, that becomes an issue, because the engine at that shallow level may well be weaker than the players it's evaluating.
Matej Guid and Ivan Bratko (the authors of several such papers, including the one that is the subject of that Chessbase article) have argued that it doesn't matter if the engine is weaker than the players, but I disagree.
Going into their argument and why I think it fails is beyond the scope of this post, but I'll gladly discuss it if anyone wishes to pursue it.
There are two other big problems with this general approach. The first is that centipawn loss is really a function of two things: your strength and your opponent's strength.
That comes up quite frequently here: a player posts a game with almost no centipawn loss, but only because their opponent played quite badly.
Even top GMs who keep centipawn loss very low in general see it jump up significantly when playing stronger GMs or engines.
Both Guid and Bratko (in their papers) and truechess.com try to account for this by calculating the complexity of a position and factoring it in, but the methodology is a bit suspect in both cases (there's just no easy and accurate way to do this with automated engine analysis).
The second problem is that centipawn loss, useful as it may be as a rough guide, is an average over all your moves.
In chess, as it turns out, your strength of play depends much more on your weakest moves than on your average moves, a point made by Tord Romstad while discussing how to design engine evaluations in his rightfully famous post at
http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=135133&t=15504 .
That is to say, losing about 5 centipawns every move will score much higher than losing 0 centipawns except for every 20th move when you lose a full pawn (that's actually relatively easy to simulate by fiddling with the source of some open source engines and running a match).
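To make the arithmetic concrete, here's a tiny Python sketch (not from any of the papers, just the hypothetical numbers above) showing why averaging hides this: the steady player and the occasional blunderer end up with exactly the same average centipawn loss, even though their worst moves differ by a factor of twenty.

```python
# Two hypothetical move-loss profiles over a 100-move game.
moves = 100

# Player A: loses ~5 centipawns on every single move.
steady = [5] * moves

# Player B: perfect play, except a full-pawn (100 cp) blunder
# on every 20th move.
spiky = [100 if (i + 1) % 20 == 0 else 0 for i in range(moves)]

avg_steady = sum(steady) / moves   # 5.0
avg_spiky = sum(spiky) / moves     # 5.0 -> identical averages
worst_steady = max(steady)         # 5
worst_spiky = max(spiky)           # 100 -> very different worst move

print(avg_steady, avg_spiky, worst_steady, worst_spiky)
```

Average centipawn loss sees these two players as identical, yet in practice player B hangs a pawn five times a game and player A never does, which is exactly Romstad's point about weakest moves mattering most.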
That becomes a big problem when the search is so shallow, because shallow analysis will miss some deep tactical mistakes and class other tactically sound combinations as blunders (truechess.com tries to mitigate this when determining blunder rate, but the mitigation is pretty limited).
That throws off both centipawn loss and blunder rate, which makes any comparison on that basis a bit suspicious, especially when combined with the fact we can only very crudely account for the strength of opposition/difficulty of the positions.
While SF is a stronger engine than any used in the previous studies, the analysis here at lichess is of such limited depth that I don't think it would really be an improvement on previous attempts.
The results would still be interesting though, so I might just upload the games from some matches and see what the results look like.
It's all still very interesting of course, and obviously none of these flaws have prevented me from spending far too much time reading about it, but it's worth taking all these attempts with a couple pounds of salt. :)