Enhance your soccer event data with an "expected possession value" model to breathe life into the raw accounting records
As we continue to build out the important steps in a discounted cash flow valuation model and map these steps over to soccer, in the last post we discussed at length how, despite the endless complaints you’ll hear about soccer event data, it’s really quite rich and useful. Just like financial accounting data, it is definitely an incomplete record of everything taking place: financial accounting data says nothing about the R&D happening behind the scenes at a company or its promising sales leads, and without tracking data, our soccer event data doesn’t tell us where all of the players are, and thus where the space is. Ultimately though, all that soccer event data is missing in order to serve as a good accounting record of a player’s or a team’s performance is a “unit of account” to tie it all together, one that matches the club’s overarching competitive focus: goal difference. In today’s post, we take a shot at injecting this unit of value into the raw underlying soccer accounting records using some recent contributions from the analytics community.
To mirror the first step that a financial analyst would perform in gathering up the relevant historical data with the aim of making some projections, as soccer analysts we should grab all of the event data in which our target player was involved to get the fullest, most reliable record of his performance to date (admitting of course that this is an incomplete picture of his contributions). Because soccer event data that is not denominated in goal units is problematic for us, we need to give the accounting records of the sport a common unit of account: goal units. We could convert all of these events into goal units in several different ways, but most of them would fail us. First, here are four such failed attempts to use soccer statistics (some traditional, others advanced) to add the “goal unit” into the raw data records:
1. We could mark every event that directly resulted in a goal with a 1, and mark all other events with a zero (if I were less charitable I might call this the Daryl Morey method from the last episode). This is insufficient for obvious reasons. It tells us basically nothing about 99% of all actions, and only one thing about the rest of them.
2. We could mark every event (e.g. pass, dribble, recovery) that was part of a possession chain that ended in a goal with a 1, and all others with a zero (call this a “goal chain”). This brings more actions to the table than just goals, but it’s still a poor population within which to work. For an action to be considered it has to be attached to the rare goal, making this immediately a worse alternative to #4 below.
3. Since almost all goals come from shots, we could run an algorithm to assign a value to all of the shots in the data, and weight the value of each shot based on the historical probability of scoring similar shots (this is the now ubiquitous shot-based expected goals metric or “xG,” perhaps first imagined by Charles Reep, but brought to the modern masses by Michael Caley and many others). Here we’re back to the problem of looking only at shots, when we know almost the entirety of every football match is not a shot.
4. We could take the shot-based expected goals figures in #3 and assign their values equally to every event (e.g. pass, dribble, recovery) that was part of a possession chain that ended in the shot, and to all other events assign a zero. This approach was first published by Thom Lawrence and Statsbomb, and is referred to as an “expected goal chain.” Admittedly, the purpose of the metric when published was more limited in scope than what we are trying to use it for today, but it runs into the same problem of telling us nothing about most of the actions that happen on the pitch.
The above methods would be reliable and verifiable ways of denominating our historical soccer accounting records in terms of goals. But goals and shots are scarce, almost all of a player’s touches are probably not attempts to score, and when his team does score on possessions he is involved in, our player probably isn’t involved in the final ball. As a result, none of these methods would tell us with much confidence how the player helps his team score goals and avoid conceding them whenever he touches the ball. The first and third methods ignore 99% of the actions the target player has made, and the second and fourth do poor jobs of attributing team success with any accuracy to the individual players involved. In short, these accounting methods are “reliable” but not “relevant,” at least not enough for our needs.
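To see how thin these four labelings are, here is a minimal sketch on a made-up ten-event ledger. The `Event` fields, the toy events, and the choice to treat a chain's xG as its single largest shot value are all my own assumptions for illustration, not any provider's schema or the published definitions. On real match data the shares would be far smaller than in this toy example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    possession_id: int
    action: str            # e.g. "pass", "dribble", "shot" (hypothetical labels)
    xg: float = 0.0        # shot-based expected goals; 0 for non-shots
    is_goal: bool = False


def label_events(events):
    """Return the four candidate 'goal unit' labelings, one value per event."""
    chain_goal = {}  # possession_id -> did the chain contain a goal?
    chain_xg = {}    # possession_id -> largest shot xG in the chain (assumption)
    for e in events:
        pid = e.possession_id
        chain_goal[pid] = chain_goal.get(pid, False) or e.is_goal
        chain_xg[pid] = max(chain_xg.get(pid, 0.0), e.xg)
    return {
        "goal": [1.0 if e.is_goal else 0.0 for e in events],
        "goal_chain": [1.0 if chain_goal[e.possession_id] else 0.0 for e in events],
        "xg": [e.xg for e in events],
        "xg_chain": [chain_xg[e.possession_id] for e in events],
    }


def share_nonzero(values):
    """Fraction of events that receive any value at all."""
    return sum(1 for v in values if v != 0) / len(values)


# Toy ledger: four possessions, ten events, two shots, one goal.
events = [
    Event(1, "pass"), Event(1, "dribble"), Event(1, "pass"),
    Event(2, "recovery"), Event(2, "pass"), Event(2, "shot", xg=0.08),
    Event(3, "pass"), Event(3, "pass"), Event(3, "shot", xg=0.30, is_goal=True),
    Event(4, "pass"),
]

labels = label_events(events)
coverage = {name: share_nonzero(values) for name, values in labels.items()}
for name, share in coverage.items():
    print(f"{name}: {share:.0%} of events carry a value")
```

Even in this generous toy case, every method leaves a large share of the ledger at zero, and the chain methods hand the same number to every event in a possession regardless of what the player actually did.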
Relevancy: What do players actually do?
We accept that the money unit of account for soccer data is broadly a “goal,” but almost all actions in a soccer match are not attempts to score. Instead, they are attempts to bring one’s team closer to scoring, and/or to push one’s opponent further away from scoring. Each pass, each dribble is an action intended to change the likelihood that a goal will be scored in the near future, or that a goal will not be conceded. For this reason, we really need an accounting method that measures all of these individual actions (all of these accounting records) and assigns to them values that correspond to the changes in these probabilities, of scoring and of conceding.
Players increase their team’s chance of scoring by advancing the ball into dangerous areas by making passes, receiving passes, dribbling by defenders, getting shots off, and latching onto shot rebounds. And often those same actions decrease their opponent’s chance of scoring on the next possession, specifically to the extent they move the ball towards the goal being attacked and away from the goal that will be defended after a turnover. Players decrease their team’s chances of scoring by turning the ball over either via a failed pass, or a failed dribble, by being dispossessed, taking a dumb shot, or even sometimes by successfully passing the ball backwards, away from a more threatening territory and towards a less threatening one. Similarly, these actions often increase the team’s opponent’s chances of scoring on the next possession. All of these actions are recorded as-is in the on-ball event data, but if we’re going to be able to “account for” them (and use them as Step 1 of our valuation process), we need a way to assign values to each of those actions based on how each action changes the probability of scoring and the probability of conceding, however small those changes may be. Luckily, this is not just wishful thinking.
Expected Possession Value
In recent years there have been several public attempts to solve this very problem, and I’m probably not alone in saying they are some of the most exciting recent work in soccer analytics. They are broadly referred to as “Expected Possession Value” (“EPV”) models. Kieran Doyle summarized the state of play very well in a recent article for American Soccer Analysis, and I should probably just direct you there to read his summary full-stop. There were also numerous lengthy videos/talks given at the 2019 Statsbomb Conference in the area of EPV models, or extremely adjacent concepts.
A non-exhaustive list of those who have contributed to this area of the field and/or tried their hand at the thing would include Sarah Rudd, Dan Altman, Karun Singh, Thom Lawrence, Luke Bornn and Javier Fernandez, SciSports, KU Leuven, Opta, Dave Laidig, Cheuk Hei Ho and the latest from American Soccer Analysis (more below). And as I’ve mentioned before, for some time now, Ian Graham and Will Spearman at Liverpool have used something similar and probably something even better behind the curtain. The most recent of the public projects, and the one with which I am most familiar is American Soccer Analysis’ “Goals Added (g+)” Model, which I believe at the very least is superbly named for the approach we’re trying to hammer out in this newsletter.
Disclaimer: I was involved at least tangentially in the launch of the “Goals Added” model by American Soccer Analysis. A soccer operations department should consider using any of the available models out there or perhaps building their own for the purpose of completing Step 1 in the “Absolute Unit” framework for projecting player contributions. That said, I’m familiar with g+ and I think it rocks, so I’m going to further illustrate the conceptual points of the overall framework using this specific model below.
Goals Added (g+)
“Goals Added” (“g+”) uses machine learning and historical event data from a decade of Major League Soccer matches to assign a value to every recorded action in a match, computed as the change in the probability that the possessing team will score on the current possession minus the change in the probability that the defending team will score on their next possession when they get the ball back. The team at ASA wrote a number of articles on the goals added model and framework upon its rollout in the spring of 2020.
On the whole, the g+ model assigns a probability of scoring on the current possession and conceding on the next possession to each moment before and after a recorded action in a match (similar to a balance sheet snapshot of a business’ assets and liabilities at a point in time). The changes in these probabilities can then be directly attributed in a stock-flow consistent manner to the recorded actions themselves which move the match forward from one moment to the next in its recorded ledger (similar to a business’ income statement, showing the changes in the business’ assets and liabilities). As an example, a midfielder might receive a pass from a center back near the center circle and the model might assess that based on its training in the historical data set of similar possessions, there is a 1.5% probability that the team with the ball will score on this possession, and also a 1.5% probability that they will concede on their opponent’s next possession, for a net value of zero (0.015-0.015=0). The midfielder might then complete a daring pass forward to the right winger in the final third, after which the team is now 2.5% likely to score and 1.2% likely to concede on the next possession (a net value of 0.013), based on the algorithm learning how various factors of a possession have shaped these probabilities over the last decade of data. The completed pass would then be worth 0.013 goals added to the team in possession (the difference between the now current value of 0.013 and the zero value that existed before the pass attempt). A 1.3% impact on the two-possession goal probability difference may not seem like much, but over hundreds of pass actions, it adds up in a big way. A midfielder who is consistently finding these progressive passes and avoiding turnovers will be scored very well by the g+ model (he will be said to be contributing in a positive way to his team’s goal difference). 
If the player above lingers on the ball instead of passing it, and he’s dispossessed at midfield, his team’s expected possession value has now plummeted from zero (equally likely to score or concede over 2 possessions) to say -0.7% (let’s say for example, the opponent picking the ball up in transition in the center circle has a 2% chance of scoring and the passer’s team now has a 1.3% chance of scoring on their next possession). Over time, a midfielder who is sloppy in possession will be scored low.
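The arithmetic in these two examples can be written out as a short sketch. The function names are mine and this is only the bookkeeping, not the g+ model itself: the probabilities are the illustrative figures from the text, whereas a real model would estimate them from historical data.

```python
def net_value(p_score, p_concede):
    """Two-possession value of a game state from one team's point of view."""
    return p_score - p_concede


def goals_added(before, after):
    """Value of an action: net value after the action minus net value before it.

    `before` and `after` are (p_score, p_concede) pairs, always expressed from
    the point of view of the team that had the ball before the action.
    """
    return net_value(*after) - net_value(*before)


# Midfielder receives the ball near the center circle: 1.5% to score, 1.5% to concede.
before = (0.015, 0.015)

# Completed pass to the winger: 2.5% to score, 1.2% to concede.
completed_pass = goals_added(before, (0.025, 0.012))

# Dispossessed instead: his team is now 1.3% to score on its next possession,
# while the opponent is 2% to score on the possession it just won.
dispossessed = goals_added(before, (0.013, 0.020))

print(f"completed pass: {completed_pass:+.3f} goals added")
print(f"dispossessed:   {dispossessed:+.3f} goals added")
```

The swing between the two outcomes is about 0.02 goals on a single touch, which is why the totals diverge so quickly over a season of touches.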
Allocation to Players
Having valued the successful pass walked through above based on the machine learning algorithm, there remains the question of how much of the 1.3% change in expected goal difference upon the completion of the pass was the contribution of the player making the pass versus the player getting on the end of it. Fundamentally, both players are contributing to the change in probabilities the team has experienced. The passer contributes the decision of whether to attempt a pass and to whom, as well as the passing execution; the receiver contributes the movement to create a passing option and the execution to receive the pass. “Goals Added” solves this accounting problem, which is fundamentally one of allocation, not valuation, by using an “expected pass model.” It assigns the “expected value” of the pass attempt to the passer (i.e. the value most closely tethered to the “decision” the passer made), and it assigns the difference between the realized expected value of the successful pass and the “pre-pass” expected value of the pass attempt to the receiving player (i.e. the value the receiving player created in excess of the average expected value of such a pass). This is grounded in a finding articulated by Anderson and Sally in “The Numbers Game”:
Once [Jaeson Rosenfeld of StatDNA] had taken into account things like pass distance, defensive pressure, where on the field the pass was attempted, in what direction (forward or not), and how (in the air, by head, and one touch), a curious result emerged: "after adjusting for difficulty, pass completion percentage is nearly equal among all players and teams. Said another way, the skill in executing a pass is almost equal across all players and teams, as pass difficulty and pass completion percentage is nearly completely correlated….
It is virtually impossible to differentiate among players' passing skills when it comes to executing any given pass (at least at the level of play in the Brazilian top flight).... As a result, at the elite level, the particular situation the passer finds himself in determined a player's completion percentage, not his foot skills. While their passing skills may be highly similar, this doesn't mean that players have identical possession skills.... it is mostly about being in the right place to receive it, helping a teammate position himself in the right place in the right way, and helping him get rid of the ball in order to maintain control for the team.... Good teams are not better at passing than bad ones. They simply engineer more easy passes in better locations, and therefore limit their turnovers.
This allocation is almost certainly incomplete. Tracking data if available would make clear that an individual event such as a pass between two teammates might involve the contributions of several more teammates via their positioning and movement, or the contribution of a previous pass to create an expected possession value that exceeds the value the non-tracking data-trained model is built on. At any rate, until tracking data is ubiquitous, this allocation between passer and receiver is something that must suffice. I haven’t seen many other EPV models recognize this explicitly and so I think it’s worth highlighting since passes are by far the most common action in the event data. Any accounting records which strive to show the historical contribution of players to their team’s performance must recognize that the value of a completed pass is contributed both by the passer and the receiving player.
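Here is one way to read the passer/receiver split described above, as a minimal sketch. I am assuming that the “expected value” of a pass attempt is the completion-probability-weighted average of the post-pass values, and the 60% completion probability and the -0.007 failure value are hypothetical numbers chosen to match the earlier examples. The actual g+ implementation may well differ in its details.

```python
def allocate_completed_pass(value_before, value_if_completed,
                            value_if_failed, completion_prob):
    """Split the value of a completed pass between passer and receiver.

    Assumed reading of the allocation:
      passer   gets the expected value of the attempt minus the pre-pass value
      receiver gets the realized value minus the expected value of the attempt
    so the two credits always sum to the total value of the completed pass.
    """
    expected_attempt_value = (completion_prob * value_if_completed
                              + (1 - completion_prob) * value_if_failed)
    passer_credit = expected_attempt_value - value_before
    receiver_credit = value_if_completed - expected_attempt_value
    return passer_credit, receiver_credit


# Pre-pass state worth 0.000; completed pass leaves the team at +0.013;
# a failed pass is assumed to leave it at -0.007; completion probability 60%.
passer, receiver = allocate_completed_pass(
    value_before=0.0,
    value_if_completed=0.013,
    value_if_failed=-0.007,
    completion_prob=0.6,
)
print(f"passer: {passer:+.4f}, receiver: {receiver:+.4f}, total: {passer + receiver:+.4f}")
```

Under this reading, a riskier pass (lower completion probability) shifts more of the credit for a completion to the receiver, which fits the idea that execution on a given pass is roughly equal across players while getting open to receive it is not.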
Useful Characteristics of Accounting Records: Disaggregation
Where we are headed, to ensure the team planning is aligning with the organization’s goals (remember way back, our fictional GM needed to improve by +27 GD?), we’re ultimately going to need to project goal difference at the team level based on the contributions of the individual players (steps 3 & 4), and to do that we very well may need to pull apart these accounting records to understand a player’s contributions in different facets of the game. This mirrors a core tenet of the usability of financial statements as well, the importance of disaggregating revenue information enough so that a user can understand the “nature, amount, timing, and uncertainty” of cash flows. Whatever method or model of assigning units of value to the data a football club ends up choosing as part of Step 1, they should make an earnest attempt to disaggregate their “goal difference contribution” accounting records similarly. “Goals Added” as an example, breaks the overall contribution score down into several facets, which have been expertly visualized by Eliot McKinley below.
Certain of the above radials are impacted disproportionately by risk and uncertainty (i.e. they are noisier). Others might be impacted disproportionately by team effects (e.g. a player’s interrupting score may be elevated because his team is constantly under attack, demanding interrupting actions), or they might interact with player aging curves in various ways out in the future. By following accounting principles and disaggregating the information in this way, we’ll be in a better position to project at the individual player level, and most importantly at the team level, once we get past the assigning of accounting units to the data.
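To make the disaggregation idea concrete, here is a minimal sketch that rolls per-action values up into a player-by-facet ledger. The tuple layout, the player names, the values, and the facet labels other than “interrupting” are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def disaggregate(action_values):
    """Sum per-action values into {player: {facet: total}}.

    `action_values` is an iterable of (player, facet, value) tuples.
    """
    ledger = defaultdict(lambda: defaultdict(float))
    for player, facet, value in action_values:
        ledger[player][facet] += value
    # Convert to plain dicts so the result prints and compares cleanly.
    return {player: dict(facets) for player, facets in ledger.items()}


# Hypothetical action-level records, denominated in goals added.
actions = [
    ("Midfielder A", "passing", 0.005),
    ("Winger B", "receiving", 0.008),
    ("Midfielder A", "interrupting", 0.010),
    ("Midfielder A", "passing", -0.007),
]

ledger = disaggregate(actions)
totals = {player: sum(facets.values()) for player, facets in ledger.items()}

for player, facets in ledger.items():
    breakdown = ", ".join(f"{facet} {value:+.3f}" for facet, value in facets.items())
    print(f"{player}: total {totals[player]:+.3f} ({breakdown})")
```

Keeping the facet-level rows, rather than storing only the totals, is what lets a club later apply different noise, team-effect, or aging adjustments to each facet.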
Rates and Quantities
Another essential characteristic of any good player-level accounting record (and something that has nearly always been common practice in the soccer stats community) is the ability to depict the units of value both as a total and per minutes played (per 90 minutes, per 96 minutes, per minute, etc). A player’s ultimate contribution to his team’s performance over the next season will be a function of both a) the rate at which his involvement contributes to his team’s expected goal difference per 90 minutes and b) the number of minutes he is on the field, so ideally we would like to be able to project both of these things. This is similar to thinking of things in terms of prices and quantities in the business world.
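The rate-times-quantity relationship is simple enough to sketch directly. The season totals and minutes below are hypothetical, and the projection assumes the historical rate carries forward unchanged, which is precisely the assumption that later pro-forma adjustments are meant to challenge.

```python
def per_90(total_value, minutes_played):
    """Convert a season total into a per-90-minutes rate."""
    if minutes_played <= 0:
        raise ValueError("minutes_played must be positive")
    return total_value / minutes_played * 90


def projected_total(rate_per_90, projected_minutes):
    """Rate times quantity: expected contribution over a block of minutes."""
    return rate_per_90 * projected_minutes / 90


# Hypothetical player: +2.4 goals added over 1,800 minutes last season.
rate = per_90(total_value=2.4, minutes_played=1800)

# If he holds that rate over 2,700 minutes next season:
projection = projected_total(rate, projected_minutes=2700)

print(f"rate: {rate:+.3f} per 90, projected total: {projection:+.2f}")
```

Separating the two inputs matters because they carry different risks: the rate depends on form, role, and aging, while the minutes depend on health and squad competition.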
Useful Characteristics of Accounting Records: Relevance
While there is still much work to be done to improve these metrics, and this includes “Goals Added” (and I know the team at ASA is working hard every day to do so), one important attribute of these things is that they work. As it relates to Major League Soccer, ASA found a strong relationship between player wages and g+ scores, found that at the macro-level g+ (and several different slices of g+ metrics) performed better than shots, goals, and expected goals at predicting future team performance, and found that at the micro-level, g+ tended to agree with video analysts’ interpretations of match sequences. Moreover, the lists that the model generated of the best players at their positions over the last several seasons of data passed the eye test for MLS and in Europe. This stuff largely works, and because no model is perfect, the more transparent and well-organized the accounting records, the easier it might be for a club to take the necessary precautions and accept the remaining uncertainty when it goes to build this into player projections.
Toward Player Projections
The overall point of this post is to advocate for the use of an EPV model to instill the value unit of soccer (the goal) into the raw soccer event data to set a good foundation for player projections.
While I am biased, I find the g+ model is uniquely suited to meet our exact needs in this exercise of mapping financial analysis over to player evaluation, as the model itself is on its own a means toward imbuing life into the existing records of historical performances that are soccer event data (step 1: gather historical data). Further, because it was built with “player valuation” in mind, certain aspects of g+ were built specifically to focus on predictive properties of past actions and not simply descriptive actions (step 2: make pro-forma adjustments to convert historical data into projections). There are of course other adjustments required to convert these accounting records into future projections, but the model itself is framed in such a way as to make it both a tremendously useful historical accounting record of the contributions of players to their team’s goal difference performance and an input primed and ready for its place in the next step of future projections of player contributions.
In the next post, we’ll talk about what’s required as a first step to take historical financial records and convert them into something useful for projecting future results on the business valuation side of things and what this looks like on the soccer side of things in player recruitment, and further, how the “goals added” model has some of these adjustments already baked into its very design.