
Fitting an Elo model to Titled Tuesday blitz games

ChessAnalysis
Winning Probabilities in Chess: Elo model for blitz

Introduction

In my last blog Winning Probabilities in Chess I calculated the probability of a 45.5/46 score between a player A rated 3300 and a player B rated 2950 based on the following model:

$$P(\mathrm{win}) = \frac{10^{r/s}}{10^{r/s} + 10^{-r/s} + k}, \qquad P(\mathrm{draw}) = \frac{k}{10^{r/s} + 10^{-r/s} + k}, \qquad P(\mathrm{loss}) = \frac{10^{-r/s}}{10^{r/s} + 10^{-r/s} + k}$$

where $r = R_A - R_B$ (the rating of player A minus the rating of player B).

I used s=400 and k=2. These parameters are chosen so that Player A's expected score against Player B is about 0.75 when their rating difference is 200. For details on the choice of these parameters read my previous blog Winning Probabilities in Chess or Wikipedia.
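For readers who prefer code, a minimal sketch of this model in Python (the function name is only illustrative):

```python
def elo_wdl(r, s=400, k=2):
    """Win/draw/loss probabilities for a rating difference r = RA - RB."""
    win = 10 ** (r / s)
    loss = 10 ** (-r / s)
    total = win + loss + k
    return win / total, k / total, loss / total

p_win, p_draw, p_loss = elo_wdl(200)
expected_score = p_win + p_draw / 2  # ~0.76 for a 200-point rating difference
```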

The aim of my previous blog was to show readers without a statistical background how the probability of a player winning/drawing/losing against another can be calculated using the above Elo model.

In this blog I will estimate the parameters s and k using Titled Tuesday tournament games. Titled Tuesday is an 11-round Swiss-system 3+1 blitz chess tournament held every Tuesday on chess.com. Two tournaments are held each Tuesday: an early session and a late session. Only titled players can participate.

Based on the estimated parameters I will recalculate the probability of the 45.5/46 score and also the expected scores for rating differences of 200 and 400.

I did all the programming in Python, using python-chess to parse the PGN files.
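For illustration, reading the games with python-chess could look roughly like this (the file name is a placeholder and the filtering is simplified compared to my actual script):

```python
import chess.pgn

games = []  # (rating difference, result) pairs
with open("titled_tuesday.pgn", encoding="utf-8") as pgn:
    while True:
        headers = chess.pgn.read_headers(pgn)
        if headers is None:
            break
        try:
            diff = int(headers["WhiteElo"]) - int(headers["BlackElo"])
        except (KeyError, ValueError):
            continue  # skip games without both ratings
        result = headers.get("Result", "*")
        if result in ("1-0", "1/2-1/2", "0-1"):
            games.append((diff, result))
```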

As GM RealDavidNavara pointed out in response to my previous post, my calculations, in this post as well, have at least the following drawbacks:

  1. the probabilities are calculated assuming independence between games.
  2. I do not take players' form into consideration.

Empirical analysis

Data sources

I use 2 data sources:

  • garyongguanjie. This database is a collection of CSVs and has games between 2022-08-11 and 2023-10-24. The following table shows what the database entries look like:

White | Result | Black | pairing | round
CM Shuvalov (2483) | 0 - 1 | GM DanielNaroditsky (3181) | 1 | 1
GM Baku_Boulevard (3026) | 1 - 0 | FM Alexei_Gubajdullin (2730) | 1 | 1

  • TWIC (The Week in Chess). This database is a collection of PGN files and has games between 2020-04-28 and 2023-12-12 (see the statistics table below).

An important difference between these 2 datasets is that while the garyongguanjie ratings are chess.com blitz ratings, the TWIC ratings are FIDE ratings for classical chess. Nevertheless, the TWIC dataset has many more games than the garyongguanjie one. Having more data is, in general, better for model estimation.

To be clear, a fundamental difference between these 2 datasets is that the WhiteElo and BlackElo PGN tags will be different for the same game. The following table shows the ratings of a few players in the two datasets. The ratings correspond to round 1 of the early Titled Tuesday of 24 October 2023.

Player   | TWIC | garyongguanjie
Carlsen  | 2839 | 3266
Oparin   | 2681 | 2994
Grischuk | 2732 | 2977
Sarin    | 2694 | 3129

Note that the Elo model does not care about absolute ratings, only about the rating difference. These 2 datasets will allow us to test whether FIDE classical ratings are a good approximation of a player's strength in blitz.

The following table shows some basic statistics for the 2 datasets:

                  | TWIC       | garyongguanjie
# games           | 448488     | 151616
Date first game   | 2020-04-28 | 2022-11-08
Date last game    | 2023-12-12 | 2023-10-24
Average rating    | 2338       | 2510
Highest rating    | 2864       | 3266
Lowest rating     | 2000       | 2000
# white victories | 215709     | 73845
# draws           | 42559      | 12744
# black victories | 190220     | 65027
% white victories | 48.1       | 48.7
% draws           | 9.5        | 8.4
% black victories | 42.4       | 42.9

Empirical win/draw/loss distribution

The next 2 figures show, for TWIC and garyongguanjie respectively, the empirical win/draw/loss distributions, that is, the proportion of games won/drawn/lost as a function of the rating difference.

As can be seen above, the graphs are similar.

For better visualization, the next 3 figures show the win, draw and loss distributions separately and compare the 2 data sets.

By visual inspection of the above figures it is possible to conclude that:

  • FIDE classical ratings are a good proxy for blitz strength.
  • Rating differences determine the win/draw/loss distribution.
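The empirical proportions shown in the figures can be computed by bucketing games by rating difference. A minimal sketch (the 25-point bin width is an arbitrary choice for illustration):

```python
import numpy as np

def empirical_wdl(rating_diff, outcome, bin_width=25):
    """Proportion of wins/draws/losses per rating-difference bucket.

    rating_diff : array of White rating minus Black rating
    outcome     : array with 1.0 for a White win, 0.5 for a draw, 0.0 for a Black win
    """
    bucket = np.round(rating_diff / bin_width) * bin_width
    rows = []
    for b in np.unique(bucket):
        mask = bucket == b
        rows.append((b,
                     np.mean(outcome[mask] == 1.0),
                     np.mean(outcome[mask] == 0.5),
                     np.mean(outcome[mask] == 0.0)))
    return np.array(rows)  # columns: bucket centre, P(win), P(draw), P(loss)
```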

Estimating s and k by the method of maximum likelihood

Model estimation

This section details some technical aspects of the model parameter estimation. If you wish, you can go directly to the next section.

Maximum likelihood estimation is a method for estimating a model's parameters given some observed data.

The Elo model for the win/draw/loss distribution is:

$$P(\mathrm{win}) = \frac{10^{r/s}}{10^{r/s} + 10^{-r/s} + k}, \qquad P(\mathrm{draw}) = \frac{k}{10^{r/s} + 10^{-r/s} + k}, \qquad P(\mathrm{loss}) = \frac{10^{-r/s}}{10^{r/s} + 10^{-r/s} + k}$$

The objective is to maximize the log likelihood, that is, the natural logarithm of the likelihood. The parameters k and s are chosen so that, under the assumed statistical model, the observed data is most probable.

The log likelihood for the above model is:

$$\ell(k, s) = \sum_{i=1}^{N} \left[ w_i \log P_{\mathrm{win}}(r_i) + d_i \log P_{\mathrm{draw}}(r_i) + l_i \log P_{\mathrm{loss}}(r_i) \right]$$

where the sum runs over the N games, $r_i$ is the rating difference (White minus Black) in game $i$, and $w_i$, $d_i$, $l_i$ are indicators equal to 1 when game $i$ was, respectively, a White win, a draw or a Black win, and 0 otherwise.

I used SciPy's L-BFGS-B method to minimize the log loss (log loss = -log likelihood).

As convergence was fast and stable (the optimal parameters are the same when starting from different initial values of k and s), I did not bother to implement the derivatives.
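A minimal sketch of how such a fit can be set up (variable names are illustrative; rating_diff and outcome are NumPy arrays built from the parsed games, with the outcome coded as 1.0 for a White win, 0.5 for a draw and 0.0 for a Black win):

```python
import numpy as np
from scipy.optimize import minimize

def log_loss(params, rating_diff, outcome):
    """Negative log likelihood of the Elo win/draw/loss model."""
    k, s = params
    win = 10 ** (rating_diff / s)
    loss = 10 ** (-rating_diff / s)
    total = win + loss + k
    ll = np.where(outcome == 1.0, np.log(win / total),
                  np.where(outcome == 0.5, np.log(k / total), np.log(loss / total)))
    return -ll.sum()

def fit_elo(rating_diff, outcome):
    """Return the maximum likelihood estimates (k, s)."""
    result = minimize(log_loss, x0=[2.0, 400.0], args=(rating_diff, outcome),
                      method="L-BFGS-B", bounds=[(1e-6, None), (1.0, None)])
    return result.x
```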

For completeness, the derivatives with respect to s and k can be seen below. I did not test, verify or use the derivatives and, as they are relatively complex, it is possible I made a mistake.

The fitted model

Because it has more data, I only used the TWIC data set to fit the model.

The estimated parameters are approximately k=0.23 and s=1070 (used in the tables below). The next figure shows the estimated win/draw/loss distribution:

Comparing fitted model with the empirical data

The next 3 figures compare the fit with the empirical data.

Probabilities recalculated

Expected score when r = 200 and r = 400:

Parameters     | r   | win    | draw   | loss   | Expected score
k=0.23, s=1070 | 200 | 0.6387 | 0.0944 | 0.2669 | 0.6859
k=0.23, s=1070 | 400 | 0.7842 | 0.0756 | 0.1402 | 0.8220
k=2.00, s=400  | 200 | 0.5772 | 0.3651 | 0.0577 | 0.7597
k=2.00, s=400  | 400 | 0.8264 | 0.1653 | 0.0083 | 0.9091
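As a sanity check, the k=2.00, s=400 row for r=200 can be reproduced by hand:

$$10^{200/400} \approx 3.1623, \qquad 10^{-200/400} \approx 0.3162, \qquad 3.1623 + 0.3162 + 2 = 5.4785$$

$$P(\mathrm{win}) \approx \frac{3.1623}{5.4785} \approx 0.5772, \qquad P(\mathrm{draw}) \approx \frac{2}{5.4785} \approx 0.3651, \qquad P(\mathrm{loss}) \approx \frac{0.3162}{5.4785} \approx 0.0577$$

$$\text{Expected score} = 0.5772 + 0.3651/2 \approx 0.7597$$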

Going back now to the two theoretical players, player A rated 3300 and player B rated 2950, and using the estimated parameters, the probabilities of player A winning/drawing/losing against player B are:

     | k=0.23, s=1070 | k=2.00, s=400
win  | 0.7554         | 0.7785
draw | 0.0805         | 0.2076
loss | 0.1641         | 0.0138

The probability of the score 45.5/46 and the number of games necessary until there is a 95% probability of seeing a 45.5/46 score are:

                                | k=0.23, s=1070 | k=2.00, s=400
P(45.5/46)                      | 0.0000122      | 0.0001222
# games until P(45.5/46) = 0.95 | 246067         | 24506
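For reference, under the independence assumption a 45.5/46 score corresponds to 45 wins and one draw, with the draw falling in any of the 46 rounds, so the probability can be computed as

$$P(45.5/46) = 46 \cdot P(\mathrm{win})^{45} \cdot P(\mathrm{draw})$$

which, for example, with k=2.00 and s=400 gives $46 \times 0.7785^{45} \times 0.2076 \approx 0.000122$, in line with the table above.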

Final thoughts

Statistics does not prove that some event happened, is happening or will happen. All it does is give an estimate of how likely it is that the event happened, is happening or will happen.

Plenty of rare events occur, like the Earth being hit by an asteroid or someone winning the EuroMillions. I would imagine that, for instance, some EuroMillions winners won on their first try, in spite of the 1 in 139,838,160 odds (these are the odds quoted on the official EuroMillions website).

It is worth noting again that these calculations assume independence between games and do not take players' form into consideration.