Statistical Significance Testing of an Advanced Metric, xG

A while back in a thread discussion I pointed out the necessity of including some estimate of variance/error when using advanced metrics. I looked around the web to see if I could find anyone doing this. None of the advanced metrics sites that I looked at provide any measure of variation. So I decided to generate some numbers myself.

I downloaded 5v5 ixG data for individual games and calculated the mean, standard deviation and standard error for each player (the standard error is the standard deviation divided by the square root of the sample size). I used this year’s Rangers games, so the sample size is around 50 or less for most players. I calculated 95% confidence intervals around each mean (roughly twice the standard error in each direction for normal distributions; I did not test the data for normality). Here are the results for Rangers forwards, listed as player, mean ixG per game, 95% confidence interval and group letters. I’ve designated groups of players that cannot be distinguished from each other (roughly speaking, you just need to see if their confidence intervals overlap or not) using the same letter:

  • Chris Kreider 0.2120 0.16-0.26 A
  • Filip Chytil 0.1713 0.13-0.21 AB
  • Julien Gauthier 0.1583 0.09-0.23 ABC
  • Barclay Goodrow 0.1500 0.10-0.20 ABC
  • Mika Zibanejad 0.1459 0.11-0.18 ABC
  • Kaapo Kakko 0.1360 0.08-0.19 ABC
  • Ryan Strome 0.1300 0.09-0.17 ABC
  • Sammy Blais 0.1286 0.07-0.18 ABC
  • Dryden Hunt 0.1200 0.08-0.16 ABC
  • Artemi Panarin 0.1164 0.08-0.15 BC
  • Alexis Lafreniere 0.1083 0.07-0.14 BC
  • Kevin Rooney 0.0813 0.05-0.11 C
  • Ryan Reaves 0.0700 0.04-0.10 C
  • Greg McKegg 0.0671 0.03-0.10 C
  • Morgan Barron 0.0560 -0.01-0.12 C
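
For anyone who wants to reproduce this kind of table, here is a minimal Python sketch of the calculation described above. The per-game values are made up for illustration (they are not any real player's data), the function name `mean_with_ci` is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the post does not say which version was used.

```python
import math
import statistics

def mean_with_ci(values, z=1.96):
    """Return (mean, standard error, lower, upper) using a normal approximation."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)   # sample standard deviation (assumed, n - 1 denominator)
    se = sd / math.sqrt(n)          # standard error = SD / sqrt(sample size)
    return mean, se, mean - z * se, mean + z * se

# Made-up per-game ixG values for one hypothetical player (not real data).
games = [0.05, 0.31, 0.00, 0.12, 0.44, 0.08, 0.19, 0.02, 0.27, 0.10]

mean, se, lo, hi = mean_with_ci(games)
print(f"mean={mean:.4f}  SE={se:.4f}  95% CI={lo:.2f} to {hi:.2f}")
```

With only a handful of games the interval comes out wide, and nothing in the normal approximation stops the lower bound from dipping below zero, which is probably how a negative lower bound like Barron's arises.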

What we see is that even the bottom players are statistically different only from the top two players. What we have is just a big mush of indistinguishability for xG. Is a sample size of 50 too small? Let’s see what happens if we increase the sample size to 250. I’m not going to tabulate 250 games of data; that is too much work. Instead I’m going to assume the mean values and the sample standard deviations stay the same. Standard error is inversely proportional to the square root of the sample size, so going from 50 to 250 games divides it by the square root of 5 (about 2.2), which slightly more than halves it.

  • Chris Kreider 0.2120 0.1867-0.2373 A
  • Filip Chytil 0.1713 0.1528-0.1898 AB
  • Julien Gauthier 0.1583 0.1245-0.1921 ABC
  • Barclay Goodrow 0.1500 0.1235-0.1765 BC
  • Mika Zibanejad 0.1459 0.1270-0.1648 BC
  • Kaapo Kakko 0.1360 0.1102-0.1618 BC
  • Ryan Strome 0.1300 0.1106-0.1494 BC
  • Sammy Blais 0.1286 0.1006-0.1566 BC
  • Dryden Hunt 0.1200 0.1019-0.1381 BC
  • Artemi Panarin 0.1164 0.0980-0.1348 C
  • Alexis Lafreniere 0.1083 0.0910-0.1256 CD
  • Kevin Rooney 0.0813 0.0657-0.0969 DE
  • Ryan Reaves 0.0700 0.0552-0.0848 E
  • Greg McKegg 0.0671 0.0483-0.0859 E
  • Morgan Barron 0.0560 0.0218-0.0902 E
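
The two ideas used for this second list, shrinking the interval with sample size and checking whether two intervals overlap, can be sketched in a few lines of Python. The function names are hypothetical, and the three intervals are copied from the 250-game list above. Applying the rescaling to the rounded two-decimal intervals in the first list will not reproduce these four-decimal numbers exactly, because of rounding.

```python
import math

def rescale_half_width(half_width, n_old, n_new):
    """Shrink a CI half-width as if the sample grew from n_old to n_new,
    assuming the mean and standard deviation do not change."""
    return half_width * math.sqrt(n_old / n_new)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any value."""
    return a[0] <= b[1] and b[0] <= a[1]

shrink = math.sqrt(250 / 50)        # about 2.24, a bit more than halving

# Intervals copied from the 250-game list above.
kreider_250 = (0.1867, 0.2373)
chytil_250 = (0.1528, 0.1898)
rooney_250 = (0.0657, 0.0969)

print(round(shrink, 2))
print(intervals_overlap(kreider_250, chytil_250))  # these two share the letter A
print(intervals_overlap(kreider_250, rooney_250))  # these two share no letter
```

Keep in mind that the overlap check is only the rough rule of thumb described above, not a formal test of the difference between two means.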

This helps, but not much. Chris Kreider is distinguishable from most other players (any player without an A). There is a large group of middling players that are indistinguishable (the BCs), but who can be distinguished from a group of bottom feeders (any player with a D or E).
Strangely enough, Artemi Panarin, who is thought to have one of the best offensive skillsets on the team, if not the best, is a middling player according to ixG. I have no explanation for this result, but it certainly should worry anyone who thinks that xG is the be-all and end-all in evaluating offensive skill.

One statistical note: when you compare every player with every other player, the comparisons are what are called "a posteriori" (post hoc) comparisons, and the difference between two mean ixG values needs to be larger than in a planned ("a priori") comparison to be deemed statistically significant. This means that even with an increased sample size the results of a posteriori tests would probably be closer to the first set of comparisons, i.e. most of the players are just a big mush of indistinguishability for xG.
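
To give a feel for how much stricter this gets, here is a rough Python sketch using the Bonferroni correction. That is one common (and conservative) way to adjust for many post hoc comparisons. It is my choice for illustration only; the tables in this post were not built with it.

```python
from math import comb
from statistics import NormalDist

players = 15                 # forwards in the lists in this post
pairs = comb(players, 2)     # every player compared against every other player
alpha = 0.05

# Two-sided critical values from the standard normal distribution.
z_single = NormalDist().inv_cdf(1 - alpha / 2)               # one planned comparison
z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * pairs)) # alpha split across all pairs

print(pairs, round(z_single, 2), round(z_bonferroni, 2))
```

With 15 players there are 105 pairs, so each comparison has to clear a much higher bar than the usual "roughly twice the standard error" before a difference counts as significant.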

So what’s the big picture? xG certainly seems to provide a reasonable measure of offensive skill at a coarse level: the Rangers’ better offensive players are toward the top of the list, and their poorer players are at the bottom. But the metric is so variable that it’s hard to use it to make more specific, fine-grained comparisons, at least if you want to do so with the standard of rigor that is typically used in science when you want to be quite confident in your conclusions.

You can relax that level of rigor: 50% confidence intervals are much, much narrower than 95% intervals, and players are more readily distinguishable. If as a fan you use those, then discussion will be more at the level of "I’m right - no you’re wrong, I’m right" that is typical of much banter. Because that’s what 50% confidence intervals provide: might be wrong, might be right. Here are the results for 50% confidence intervals:

  • Chris Kreider 0.2120 0.2035-0.2205 A
  • Filip Chytil 0.1713 0.1651-0.1775 BC
  • Julien Gauthier 0.1583 0.1470-0.1696 BCD
  • Barclay Goodrow 0.1500 0.1411-0.1589 DE
  • Mika Zibanejad 0.1459 0.1396-0.1522 DE
  • Kaapo Kakko 0.1360 0.1274-0.1446 EF
  • Ryan Strome 0.1300 0.1235-0.1365 FG
  • Sammy Blais 0.1286 0.1192-0.1380 FGH
  • Dryden Hunt 0.1200 0.1139-0.1261 GHI
  • Artemi Panarin 0.1164 0.1102-0.1226 HI
  • Alexis Lafreniere 0.1083 0.1025-0.1141 I
  • Kevin Rooney 0.0813 0.0761-0.0865 J
  • Ryan Reaves 0.0700 0.0651-0.0749 K
  • Greg McKegg 0.0671 0.0608-0.0734 K
  • Morgan Barron 0.0560 0.0455-0.0675 K
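
To see why the 50% intervals are so much narrower, here is a short Python sketch comparing the normal multipliers behind the two confidence levels. The function name `z_multiplier` is hypothetical, and the sketch assumes the same normal approximation as the rest of this post.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Two-sided normal multiplier for a confidence level given as a fraction (0 to 1)."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

z95 = z_multiplier(0.95)   # the "roughly twice the standard error" used earlier
z50 = z_multiplier(0.50)   # noticeably less than one standard error

print(round(z95, 2), round(z50, 2), round(z95 / z50, 1))
```

For the same standard error, a 50% interval is only about a third as wide as a 95% interval, which is why so many more letters are needed in the list above.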

Now we’re talking distinguishability. Kreider is King. Chytil and Gauthier are Princes. Goodrow, Zibanejad and Kakko are trusted advisors. There are a bunch of court attendees: Strome, Blais, Hunt, Panarin and Lafreniere. Finally, the court jesters: Rooney, Reaves, McKegg and Barron.

So keep all this in mind the next time you want to argue that Player A is a better offensive player than Player B because Player A has a higher xG, and advanced metrics, unlike eyes or opinions or simpler metrics, are unbiased, infallible, case closed, end of story. In most cases you need to go to a 50% confidence interval in order to draw that conclusion and that means a 50% chance you’re right and a 50% chance you’re wrong. A lot more humility is in order when using advanced metrics than is typically seen in player discussions.

(By the way, I could not figure out how to get the software to hold on to tabs - the player tables are harder to read than I intended them to be - my apologies)