In a rush? Check out my summary slides made using reveal.js
This is pretty cool. In 2013, the NBA started tracking player and ball positions at 25 frames per second using SportVU technology. What does that mean? Well, it means for every single game, we collect on the order of $10^6$ $(x, y)$ coordinates documenting player and ball movement!
My understanding is that the NBA used to have this information available on NBA Stats, but I ended up getting some 2015-2016 data from neilmj’s Github page. Let’s break down the data for one game.
The original data file is a .json with the following keys
data.keys() = [u'gamedate', u'gameid', u'events']
Each frame of the position data is found in the list of moments in the list of events. For the GSW vs. MIA game on January 11, 2016, for example, when we
print data['events'][13]['moments'][5]
we get
[1,
1452570163203,
613.9,
22.0,
None,
[[-1, -1, 25.33642, 37.4772, 4.48997],
[1610612744, 101106, 67.40267, 24.15462, 0.0],
[1610612744, 201575, 64.05268, 15.67852, 0.0],
[1610612744, 201939, 42.50676, 36.1873, 0.0],
[1610612744, 202691, 70.54739, 10.26211, 0.0],
[1610612744, 203110, 71.17798, 42.9455, 0.0],
[1610612748, 2548, 67.98542, 2.67883, 0.0],
[1610612748, 2547, 20.10991, 23.12088, 0.0],
[1610612748, 2736, 62.13558, 47.14905, 0.0],
[1610612748, 201609, 25.13207, 38.95987, 0.0],
[1610612748, 1626159, 57.88044, 12.89062, 0.0]]]
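To walk every frame of a game rather than a single one, the nested events and moments can be flattened with a small generator. This is only a sketch: `iter_moments` is a name made up for this example, and the tiny `data` dict is a hand-made stand-in with the same keys as the real file, not actual tracking data.

```python
def iter_moments(data):
    """Yield (event_index, moment_index, moment) for every frame in a game dict."""
    for i, event in enumerate(data['events']):
        for j, moment in enumerate(event['moments']):
            yield i, j, moment


# Hand-made stand-in with the same nesting as the real file (not real data).
data = {
    'gameid': 'example',
    'gamedate': 'example',
    'events': [
        {'moments': [['frame a'], ['frame b']]},
        {'moments': [['frame c']]},
    ],
}

frames = list(iter_moments(data))
print(len(frames))  # total number of frames across all events
```

With the real file loaded into `data`, the same loop would visit `data['events'][13]['moments'][5]` as the tuple with indices `(13, 5)`.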
I used a blog post by Savvas Tjortjoglou to confirm the obvious and make sense of the not-so-obvious numbers.
Let’s define frame = data['events'][13]['moments'][5]. Then
- frame[0]: The current quarter
- frame[1]: The unix timestamp
- frame[2]: The time remaining in the quarter (in seconds)
- frame[3]: The time remaining on the shot clock
- frame[5]: This is a list of 11 elements for the 10 players on the court and the 1 ball. It is of the form [team_id, player_id, x_coord, y_coord, z] where team_id and player_id are set to -1 for the ball, and z represents the elevation of the ball, and is set to 0 for the players.

I couldn’t figure out or find a source that described frame[1] or frame[4].
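The layout described above can be turned into named fields with a few lines of Python. This is a minimal sketch, not the code I actually used: `Position` and `unpack_moment` are names invented for the example, and treating a frame without a ball row as `ball = None` is my own defensive assumption.

```python
from collections import namedtuple

# One row of frame[5]: team, player, court coordinates, elevation.
Position = namedtuple('Position', 'team_id player_id x y z')

# The frame printed above (GSW vs. MIA, event 13, moment 5).
frame = [1, 1452570163203, 613.9, 22.0, None, [
    [-1, -1, 25.33642, 37.4772, 4.48997],
    [1610612744, 101106, 67.40267, 24.15462, 0.0],
    [1610612744, 201575, 64.05268, 15.67852, 0.0],
    [1610612744, 201939, 42.50676, 36.1873, 0.0],
    [1610612744, 202691, 70.54739, 10.26211, 0.0],
    [1610612744, 203110, 71.17798, 42.9455, 0.0],
    [1610612748, 2548, 67.98542, 2.67883, 0.0],
    [1610612748, 2547, 20.10991, 23.12088, 0.0],
    [1610612748, 2736, 62.13558, 47.14905, 0.0],
    [1610612748, 201609, 25.13207, 38.95987, 0.0],
    [1610612748, 1626159, 57.88044, 12.89062, 0.0],
]]


def unpack_moment(frame):
    """Split a raw moment into named fields; frame[4] is kept as 'unknown'."""
    quarter, timestamp, game_clock, shot_clock, unknown, positions = frame
    rows = [Position(*p) for p in positions]
    balls = [r for r in rows if r.player_id == -1]
    return {
        'quarter': quarter,
        'timestamp': timestamp,
        'game_clock': game_clock,
        'shot_clock': shot_clock,
        'unknown': unknown,
        'ball': balls[0] if balls else None,  # assumption: ball row may be absent
        'players': [r for r in rows if r.player_id != -1],
    }


moment = unpack_moment(frame)
print(moment['ball'].z, len(moment['players']))
```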
I created a pandas dataframe to collect the relevant data and added some other columns that were important for my analysis. Here’s a snippet for the GSW vs. MIA game:
game_id game_date period game_clock game_time_remain shot_clock \
0021500568 2016-01-11 1 714.04 2874.04 18.27
0021500568 2016-01-11 1 714.04 2874.04 18.27
0021500568 2016-01-11 1 714.04 2874.04 18.27
0021500568 2016-01-11 1 714.04 2874.04 18.27
team_id player_id x_loc y_loc elevation dist_to_ball \
1610612744 203110 76.21067 39.69772 0.0 8.474352
1610612748 2548 58.96625 5.87971 0.0 40.297756
1610612748 2547 86.69693 21.05452 0.0 29.348603
1610612748 2736 71.32197 45.91093 0.0 2.189873
closest_to_ball player_name player_jersey
False Draymond Green 23
False Dwyane Wade 3
False Chris Bosh 1
True Luol Deng 9
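The dist_to_ball and closest_to_ball columns can be derived from a single frame roughly as follows. This is a sketch under one assumption that I'm labeling loudly: the distance is measured in the (x, y) plane and ignores the ball's elevation. `with_ball_distance` is a name made up for this example, and the rows are a subset of the frame printed earlier.

```python
import math

# Ball and a few players from the frame printed earlier: [team_id, player_id, x, y, z].
positions = [
    [-1, -1, 25.33642, 37.4772, 4.48997],
    [1610612744, 201939, 42.50676, 36.1873, 0.0],
    [1610612748, 2547, 20.10991, 23.12088, 0.0],
    [1610612748, 2736, 62.13558, 47.14905, 0.0],
    [1610612748, 201609, 25.13207, 38.95987, 0.0],
]


def with_ball_distance(positions):
    """Return one dict per player with dist_to_ball and closest_to_ball added."""
    ball = next(p for p in positions if p[1] == -1)
    players = [p for p in positions if p[1] != -1]
    # Assumption: distance in the court plane only, ignoring ball height.
    dists = [math.hypot(p[2] - ball[2], p[3] - ball[3]) for p in players]
    nearest = min(dists)
    return [
        {'team_id': p[0], 'player_id': p[1], 'dist_to_ball': d,
         'closest_to_ball': d == nearest}
        for p, d in zip(players, dists)
    ]


rows = with_ball_distance(positions)
closest = [r['player_id'] for r in rows if r['closest_to_ball']]
print(closest)  # [201609]
```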
To get a feel for the data, I decided to animate a play using matplotlib’s animation module. I scaled the ball according to its height to emphasize how cool this is. This is a clip of the GSW vs. MIA game during the third quarter. You can see Steph Curry (30) attempt a three-point shot, Luol Deng (9) get the rebound, and Dwyane Wade (3) dribble up court.
If you’re interested in the code, you can check it out on my Github page.
You can imagine how useful this kind of information is. We can study a player’s spatial patterns and region preferences. We can see how often and where he sets screens. Can he pass well through a double-team? How often does he drive? These are just some of the things we can’t get just by looking at box-score data.
To simplify my analysis, I started by breaking up the court into regions. Assuming a team is shooting on the right side of the court, here is a visual of my division:
I used the NBA’s court description to get the dimensions, and used Python’s Bokeh library for the plot above. I also drew it in matplotlib so you can pick whichever one you like. I wanted to use these blocks to break down where ‘actions’ occur. The NBA position data that I described above only gives the (x, y) coordinates of each player and the time he is at that point. Here is how I added context to that info:
- I determined the ‘shooting side’ for each team. I used the NBA stats play-by-play log to find who scored first and at what time, and then compared it with my raw position dataframe to get the (x, y) coordinates.
- The play-by-play log also allows us to get the box-score information so I used this to get the time and location whenever there was a: shot (made or attempted), assist, block, foul committed, free throw, rebound, steal or turnover. At these moments, I could confidently say which player had possession of the ball.
- I also wanted to determine who had possession in between these actions and who the ball was passed to. I considered a player to have possession if he was the closest player to the ball and within 2 ft of it for 15 consecutive frames (0.6 seconds). Quick passes and sloppy plays make this number a reasonable starting point. These assumptions turned out to be quite accurate when cross-checked against the NBA play-by-play logs and the known times a player had the ball (took a shot, got a rebound, etc.).
- I concluded a player passed the ball if he had possession and his resulting action was neither a shot nor a turnover. This correlated pretty well with known data on assists, as well as with the total number of times he passed during the game.
- I also used the player position data to deduce defenders. Aside from the NBA play-by-play logs to assert steals or blocks, this was largely determined by looking at the closest player on the opposite team to the offensive player with possession. As of right now, I have only considered defense on the player with the ball.
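The possession rule from the list above (closest player, within 2 ft, for 15 consecutive frames) can be sketched like this. It is an illustration rather than my actual script: `find_possessions` is a made-up name, the input format of one `(closest_player_id, dist_to_ball)` pair per frame is an assumption, and I treat "within 2 ft" as `<= 2.0`.

```python
from itertools import groupby

MAX_DIST_FT = 2.0   # "within a distance of 2 ft"
MIN_FRAMES = 15     # 15 frames at 25 fps = 0.6 seconds


def find_possessions(frames, max_dist=MAX_DIST_FT, min_frames=MIN_FRAMES):
    """frames: time-ordered list of (closest_player_id, dist_to_ball).

    Returns a list of (player_id, start_frame, n_frames) possessions.
    """
    # Candidate holder per frame, or None when the ball is too far from everyone.
    holders = [pid if dist <= max_dist else None for pid, dist in frames]
    possessions = []
    start = 0
    for holder, run in groupby(holders):
        length = len(list(run))
        if holder is not None and length >= min_frames:
            possessions.append((holder, start, length))
        start += length
    return possessions


# Hypothetical sequence: 2547 holds for 17 frames, the ball is in flight,
# 201939 is near it too briefly to count, then 2548 holds for 54 frames.
frames = (
    [(2547, 1.1)] * 17
    + [(2547, 6.0)] * 16
    + [(201939, 1.5)] * 10
    + [(2548, 4.0)] * 3
    + [(2548, 1.0)] * 54
)
print(find_possessions(frames))  # [(2547, 0, 17), (2548, 46, 54)]
```

Under the pass rule above, the gap between the two possessions would then be labeled as a pass by the first holder, provided his possession did not end in a shot or turnover.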
I need to refine the defense component as well as run appropriate statistics on my possession and passing results to verify the validity of my assumptions. NBA stats has a pretty detailed statistical summary page and simple comparisons with a few players seemed to yield pretty decent results.
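For reference, the closest-opponent rule for picking a defender can be sketched as below. `closest_defender` is a hypothetical helper, and treating player 201609 as the ball handler is purely for illustration (he is simply the player nearest the ball in the frame printed earlier).

```python
import math

# Players from the frame printed earlier: (team_id, player_id, x, y).
players = [
    (1610612744, 101106, 67.40267, 24.15462),
    (1610612744, 201575, 64.05268, 15.67852),
    (1610612744, 201939, 42.50676, 36.1873),
    (1610612744, 202691, 70.54739, 10.26211),
    (1610612744, 203110, 71.17798, 42.9455),
    (1610612748, 201609, 25.13207, 38.95987),
]


def closest_defender(handler, players):
    """Return the opposing player nearest to the ball handler."""
    opponents = [p for p in players if p[0] != handler[0]]
    return min(
        opponents,
        key=lambda p: math.hypot(p[2] - handler[2], p[3] - handler[3]),
    )


handler = players[-1]  # assumed ball handler, for illustration only
defender = closest_defender(handler, players)
print(defender[1])  # 201939
```

A rule this simple ignores switches, help defense and double-teams, which is part of why the defense component still needs refining.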
I combined the movement and action parameters I deduced with the position data. For simplicity, I’m only showing a few of the columns in the dataframe snippet below, to highlight the parts relevant to this discussion.
period game_clock shot_clock dist_to_ball player_name region \
1 546.12 16.49 1.108870 Chris Bosh perimeter
1 546.08 16.45 1.512048 Chris Bosh perimeter
1 545.44 15.81 1.268833 Dwyane Wade perimeter
1 545.40 15.77 0.959275 Dwyane Wade perimeter
action defender_name defender_region possession_length
None Andrew Bogut back_court 17
pass Andrew Bogut back_court 17
None Klay Thompson back_court 54
None Klay Thompson back_court 54
You can see that Chris Bosh had possession of the ball for 17 frames (0.68 seconds) and was guarded by Andrew Bogut, before passing in the perimeter region to Dwyane Wade, who had possession for a length of 54 frames (2.16 seconds) and was being defended by Klay Thompson. Note that defender_region is written as back_court since if the current defenders were on offense, they would be in their back court region.
Getting all the relevant NBA game information, parsing it and connecting it with the (x, y) position data took some work so if you are interested in the details, you can check out my python scripts, here, here, and here.
The key idea behind my simulator is to incorporate player position data with their actions. I used the data to collect probabilities of the following:
- Movement: The probability of going from region A to region B, given that a player is in region A.
- Possession: The probability of having possession of the ball when the player’s team is on offense.
- Action: The probability of passing, shooting, or turning the ball over given that a player is in a given region.
- Regional shooting: The probability of making a shot taken from a particular region.
- Defensive parameters: The probability of steals, blocks, offensive rebounds and defensive rebounds per possession.
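As an illustration of the movement probabilities in the list above, transition frequencies can be estimated by counting consecutive region pairs and normalizing per starting region. This is a sketch, not my actual script: `movement_probabilities` is a made-up name, the region sequence is hypothetical, and counting every consecutive pair (including staying in the same region) is an assumption.

```python
from collections import Counter, defaultdict


def movement_probabilities(regions):
    """Estimate P(next region = B | current region = A) from a region sequence."""
    counts = defaultdict(Counter)
    for a, b in zip(regions, regions[1:]):
        counts[a][b] += 1
    return {
        a: {b: n / sum(c.values()) for b, n in c.items()}
        for a, c in counts.items()
    }


# Hypothetical region sequence for one player (not real data).
regions = ['perimeter', 'perimeter', 'mid_range', 'paint', 'mid_range', 'perimeter']
probs = movement_probabilities(regions)
print(probs['paint'])  # {'mid_range': 1.0}
```

The action and regional shooting probabilities can be estimated the same way, by counting actions (or made shots) per region instead of transitions.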
Below are action probability plots of a select group of players from the MIA-GSW game. We can see, for example, that Chris Bosh and Stephen Curry only turned the ball over in the mid-range region, and Steph Curry and Draymond Green have similar shooting tendencies in the paint.
Using the movement and action probabilities per region, I was able to simulate plays, and consequently games. My simulator generates actions, and I assume that there are about six actions per play and that each play lasts about 20 seconds. With 12-minute quarters (36 plays of six actions each), this leads to 216 actions per quarter and thus 864 actions per game. I simulated 100 games using the GSW players: Andrew Bogut, Klay Thompson, Stephen Curry, Draymond Green, and Brandon Rush, and MIA players: Luol Deng, Udonis Haslem, Chris Bosh, Dwyane Wade, and Amar’e Stoudemire. Assuming a constant performance based on the minutes played, I scaled the results of each player as if they played the entire game. These scaled totals are what I treat as the ‘Actual’ results. Below are comparisons of the actual vs. simulated results.
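To make the action-generation idea concrete, here is a minimal sketch of one simulated quarter. Everything numeric in `ACTION_PROBS` is hypothetical (not fitted from the data), the function names are made up, and picking a region uniformly at random is only a stand-in for the movement model.

```python
import random

QUARTER_SECONDS = 12 * 60   # NBA quarters are 12 minutes
SECONDS_PER_PLAY = 20       # assumed play length
ACTIONS_PER_PLAY = 6        # assumed actions per play
# 720 s / 20 s = 36 plays, times 6 actions = 216 actions per quarter.
ACTIONS_PER_QUARTER = QUARTER_SECONDS // SECONDS_PER_PLAY * ACTIONS_PER_PLAY

# Hypothetical action probabilities by region (illustrative values only).
ACTION_PROBS = {
    'perimeter': {'pass': 0.80, 'shoot': 0.15, 'turnover': 0.05},
    'paint': {'pass': 0.40, 'shoot': 0.55, 'turnover': 0.05},
}


def sample_action(region, rng):
    """Draw one action for a player standing in the given region."""
    actions = list(ACTION_PROBS[region])
    weights = [ACTION_PROBS[region][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def simulate_quarter(rng):
    """Generate one quarter's worth of actions and count them by type."""
    counts = {'pass': 0, 'shoot': 0, 'turnover': 0}
    for _ in range(ACTIONS_PER_QUARTER):
        region = rng.choice(list(ACTION_PROBS))  # stand-in for the movement model
        counts[sample_action(region, rng)] += 1
    return counts


counts = simulate_quarter(random.Random(0))
print(ACTIONS_PER_QUARTER, counts)
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when comparing simulated totals across changes to the probabilities.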
Preliminary results after 100 simulations
PTS FGA FGM FG3A FG3M OREB DREB STL BLK TO PASS
GSW
mean 85.43 73.26 32.65 13.25 6.71 6.00 40.26 0.49 2.18 18.25 253.68
std 11.35 6.21 4.85 3.44 2.77 2.38 5.87 0.70 1.44 3.73 21.37
rmse 30.80 16.50 10.80 12.48 3.00 2.92 7.06 4.97 1.43 8.66 173.59
MIA
mean 87.31 89.41 37.94 16.20 3.81 21.42 38.83 1.19 2.32 13.95 273.51
std 9.63 6.89 4.80 4.39 2.00 4.35 5.39 1.23 1.58 3.61 20.39
rmse 44.36 25.82 18.66 6.78 2.17 4.33 22.01 1.71 2.38 6.76 120.61
A visual representation helps highlight the differences.
Although GSW won the actual game on January 11, when we scale the point totals of the players used in the simulation to the entire game, MIA is expected to win (approx. 130-114). In the 100 simulations shown above, MIA won 50% of the time. The relatively large root mean square error (rmse) in the point totals is enough to make sense of this outcome. The rmse of a few parameters is not so bad, however. In particular, the rmse values for blocks (BLK) and three-point field goals made (FG3M) are relatively small. Passes are particularly off. I will test the model with players that play more minutes and with data from more games to assess just how bad the damage is.
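For reference, the mean, std and rmse rows in the table can be computed along these lines. This is a sketch with hypothetical simulated totals, `summarize` is a made-up name, and the use of the population standard deviation (dividing by n) is an assumption.

```python
import math


def summarize(simulated, actual):
    """Return (mean, std, rmse) of simulated values against a reference value."""
    n = len(simulated)
    mean = sum(simulated) / n
    # Population standard deviation (divide by n), an assumption of this sketch.
    std = math.sqrt(sum((s - mean) ** 2 for s in simulated) / n)
    rmse = math.sqrt(sum((s - actual) ** 2 for s in simulated) / n)
    return mean, std, rmse


# Hypothetical simulated point totals and a hypothetical reference total.
simulated = [80, 90, 85, 95, 75]
actual = 114
mean, std, rmse = summarize(simulated, actual)
print(mean, round(std, 2), round(rmse, 2))  # 85.0 7.07 29.85
```

With this definition, rmse² = (mean − actual)² + std², so an rmse much larger than the std points to a systematic offset rather than noise. Read that way, the GSW points row (mean 85.43, std 11.35, rmse 30.80) implies a gap of roughly 28.6 points between the simulated mean and the reference total, which is consistent with the 114 quoted above.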