NBA Game Simulator — Parul Laul Jordon

In a rush? Check out my summary slides made using reveal.js

This is pretty cool. In 2013, the NBA started tracking player and ball positions at 25 frames per second using SportVU technology. What does that mean? Well, it means for every single game, we collect on the order of $10^6$ $(x, y)$ coordinates documenting player and ball movement!

My understanding is that the NBA used to make this information available on NBA Stats, but I ended up getting some 2015-2016 data from neilmj’s GitHub page. Let’s break down the data for one game.

The original data file is a .json with the following keys

  data.keys()  # [u'gamedate', u'gameid', u'events']

Each frame of the position data is found in the list of moments in the list of events. For the GSW vs. MIA game on January 11, 2016, for example, when we run

  print(data['events'][13]['moments'][5])

we get

  [1,
   1452570163203,
   613.9,
   22.0,
   None,
   [[-1, -1, 25.33642, 37.4772, 4.48997],
    [1610612744, 101106, 67.40267, 24.15462, 0.0],
    [1610612744, 201575, 64.05268, 15.67852, 0.0],
    [1610612744, 201939, 42.50676, 36.1873, 0.0],
    [1610612744, 202691, 70.54739, 10.26211, 0.0],
    [1610612744, 203110, 71.17798, 42.9455, 0.0],
    [1610612748, 2548, 67.98542, 2.67883, 0.0],
    [1610612748, 2547, 20.10991, 23.12088, 0.0],
    [1610612748, 2736, 62.13558, 47.14905, 0.0],
    [1610612748, 201609, 25.13207, 38.95987, 0.0],
    [1610612748, 1626159, 57.88044, 12.89062, 0.0]]]

I used a blog post by Savvas Tjortjoglou to confirm the obvious and make sense of the not-so-obvious numbers.

Let’s define frame = data['events'][13]['moments'][5]. Then

frame[0]: The current quarter
frame[1]: The unix timestamp (in milliseconds)
frame[2]: The time remaining in the quarter (in seconds)
frame[3]: The time remaining on the shot clock (in seconds)
frame[5]: A list of 11 elements for the 10 players on the court and the 1 ball. Each element is of the form [team_id, player_id, x_coord, y_coord, z], where team_id and player_id are set to -1 for the ball, and z represents the elevation of the ball and is set to 0 for the players.

I couldn’t figure out or find a source that described frame[4].
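To make the layout concrete, here is a minimal sketch that unpacks one raw frame into named fields. The `Entity` tuple, the `parse_frame` helper, and the dictionary keys are illustrative names of my choosing, not part of the raw data or of my actual scripts.

```python
from collections import namedtuple

# Illustrative labels for the five values in each row of frame[5].
Entity = namedtuple("Entity", ["team_id", "player_id", "x", "y", "z"])


def parse_frame(frame):
    """Split one raw moment into clock fields, the ball, and the players."""
    entities = [Entity(*row) for row in frame[5]]
    # The ball is the row whose team_id and player_id are both -1.
    ball = next(e for e in entities if e.team_id == -1 and e.player_id == -1)
    players = [e for e in entities if e is not ball]
    return {
        "period": frame[0],
        "timestamp": frame[1],   # unix timestamp as stored in the file
        "game_clock": frame[2],  # seconds left in the quarter
        "shot_clock": frame[3],
        "ball": ball,
        "players": players,
    }


# The frame printed above, copied verbatim.
frame = [1, 1452570163203, 613.9, 22.0, None,
         [[-1, -1, 25.33642, 37.4772, 4.48997],
          [1610612744, 101106, 67.40267, 24.15462, 0.0],
          [1610612744, 201575, 64.05268, 15.67852, 0.0],
          [1610612744, 201939, 42.50676, 36.1873, 0.0],
          [1610612744, 202691, 70.54739, 10.26211, 0.0],
          [1610612744, 203110, 71.17798, 42.9455, 0.0],
          [1610612748, 2548, 67.98542, 2.67883, 0.0],
          [1610612748, 2547, 20.10991, 23.12088, 0.0],
          [1610612748, 2736, 62.13558, 47.14905, 0.0],
          [1610612748, 201609, 25.13207, 38.95987, 0.0],
          [1610612748, 1626159, 57.88044, 12.89062, 0.0]]]

parsed = parse_frame(frame)
print(parsed["game_clock"], parsed["ball"].z, len(parsed["players"]))
```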

I created a pandas dataframe to collect the relevant data and added some other columns that were important for my analysis. Here’s a snippet for the GSW vs. MIA game:

  game_id     game_date      period  game_clock  game_time_remain  shot_clock  \
  0021500568  2016-01-11       1       714.04        2874.04          18.27   
  0021500568  2016-01-11       1       714.04        2874.04          18.27   
  0021500568  2016-01-11       1       714.04        2874.04          18.27   
  0021500568  2016-01-11       1       714.04        2874.04          18.27   

  team_id      player_id      x_loc       y_loc      elevation   dist_to_ball  \
  1610612744     203110     76.21067    39.69772        0.0          8.474352   
  1610612748       2548     58.96625     5.87971        0.0         40.297756   
  1610612748       2547     86.69693    21.05452        0.0         29.348603   
  1610612748       2736     71.32197    45.91093        0.0          2.189873   

  closest_to_ball      player_name     player_jersey  
      False           Draymond Green         23  
      False           Dwyane Wade             3  
      False           Chris Bosh              1  
      True            Luol Deng               9
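The dist_to_ball and closest_to_ball columns can be computed along these lines. This is a sketch in plain Python rather than pandas to keep it dependency-free. The `add_ball_distance` helper is a hypothetical name, and I am assuming the distance is measured in the (x, y) plane only, ignoring the ball's elevation.

```python
import math


def add_ball_distance(ball_xy, players):
    """Return one record per player with dist_to_ball and closest_to_ball.

    ball_xy: (x, y) of the ball.
    players: list of (player_id, x, y) tuples for the same frame.
    Assumption: distance is 2D, so the ball's elevation is ignored.
    """
    bx, by = ball_xy
    rows = [
        {"player_id": pid, "x_loc": x, "y_loc": y,
         "dist_to_ball": math.hypot(x - bx, y - by)}
        for pid, x, y in players
    ]
    closest = min(rows, key=lambda row: row["dist_to_ball"])
    for row in rows:
        # Exactly one player per frame is flagged as closest.
        row["closest_to_ball"] = row is closest
    return rows


# Ball and three of the players from the raw frame shown earlier.
rows = add_ball_distance(
    (25.33642, 37.4772),
    [(2547, 20.10991, 23.12088),
     (2736, 62.13558, 47.14905),
     (201609, 25.13207, 38.95987)],
)
for row in rows:
    print(row["player_id"], round(row["dist_to_ball"], 3), row["closest_to_ball"])
```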

To get a feel for the data, I decided to animate a play using matplotlib’s animation module. I scaled the ball according to its height to emphasize how cool this is. This is a clip of the GSW vs. MIA game during the third quarter. You can see Steph Curry (30) attempt a three-point shot, Luol Deng (9) getting the rebound, and Dwyane Wade (3) dribbling up court.

If you’re interested in the code, you can check it out on my GitHub page.

You can imagine how useful this kind of information is. We can study a player’s spatial patterns and region preferences. We can see how often and where he sets screens. Can he pass well through a double-team? How often does he drive? These are just some of the things we can’t get just by looking at box-score data.

To simplify my analysis, I started by breaking up the court into regions. Assuming a team is shooting on the right side of the court, here is a visual of my division:

I used the NBA’s court description to get the dimensions, and used Python’s Bokeh library for the plot above. I also drew it in matplotlib, so you can pick whichever one you like. I wanted to use these blocks to break down where ‘actions’ occur. The NBA position data that I described above only gives the (x, y) coordinates of each player and the time he is at that point. Here is how I added context to that info:

I determined the ‘shooting side’ for each team. I used the NBA stats play-by-play log to find who scored first and at what time, and then compared it with my raw position dataframe to get the (x, y) coordinates at that moment.
The play-by-play log also gives us the box-score information, so I used it to get the time and location whenever there was a shot (made or missed), assist, block, foul committed, free throw, rebound, steal or turnover. At these moments, I could confidently say which player had possession of the ball.
I also wanted to determine who had possession in between these actions, and who the ball was passed to. I considered a player to have possession if he is the closest player to the ball and within 2 ft of it for 15 consecutive frames (0.6 seconds). Quick passes and sloppy plays make this number a reasonable starting point. My assumptions turned out to be quite accurate when cross-checked against the NBA play-by-play logs and the known times a player had the ball (took a shot, got a rebound, etc.).
I concluded that a player passed the ball if he had possession and his resulting action was neither a shot nor a turnover. This agreed pretty well with known data on assists, as well as with the total number of times he passed during the game.
I also used the player position data to deduce defenders. Aside from using the NBA play-by-play logs to identify steals and blocks, this was largely determined by looking at the closest player on the opposite team to the offensive player with possession. As of right now, I have only considered defense on the player with the ball.
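The possession rule above (closest player, within 2 ft, for 15 consecutive frames) can be sketched as a single pass over the frames. This is a simplified illustration, not my production code: `find_possessions` and its input format are hypothetical, and I treat "within 2 ft" as less than or equal to 2.0.

```python
def find_possessions(frames, max_dist=2.0, min_frames=15):
    """Group frames into possessions.

    frames: list of (closest_player_id, dist_to_ball), one per frame,
            in time order.
    Returns a list of (player_id, start_index, length_in_frames).
    """
    possessions = []
    run_player, run_start = None, 0
    # A sentinel frame at the end flushes the final run.
    for i, (pid, dist) in enumerate(frames + [(None, float("inf"))]):
        holder = pid if dist <= max_dist else None
        if holder != run_player:
            # The previous run just ended; keep it only if it was long enough.
            if run_player is not None and i - run_start >= min_frames:
                possessions.append((run_player, run_start, i - run_start))
            run_player, run_start = holder, i
    return possessions


# Made-up frame sequence: 17 frames near player 2547, 5 frames with the
# ball in flight, 54 frames near player 2548, then 4 frames near 2736.
frames = ([(2547, 1.1)] * 17 + [(2548, 6.0)] * 5 +
          [(2548, 1.2)] * 54 + [(2736, 1.0)] * 4)
possessions = find_possessions(frames)
print(possessions)
```

The last run of 4 frames is dropped because it is shorter than the 15-frame minimum. At 25 frames per second, 17 and 54 frames correspond to 0.68 and 2.16 seconds.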

I need to refine the defense component as well as run appropriate statistics on my possession and passing results to verify the validity of my assumptions. NBA stats has a pretty detailed statistical summary page and simple comparisons with a few players seemed to yield pretty decent results.
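As a baseline for that refinement, the current closest-opponent rule for defenders amounts to something like the sketch below. The `closest_defender` helper and its tuple layout are hypothetical, and the distance is again assumed to be 2D.

```python
import math


def closest_defender(holder, players):
    """Return the opposing player nearest to the ball handler.

    holder, players: (team_id, player_id, x, y) tuples from one frame.
    """
    opponents = [p for p in players if p[0] != holder[0]]
    return min(
        opponents,
        key=lambda p: math.hypot(p[2] - holder[2], p[3] - holder[3]),
    )


# Positions from the raw frame shown earlier. Treating player 201609 as
# the ball handler is for illustration only.
players = [
    (1610612744, 101106, 67.40267, 24.15462),
    (1610612744, 201575, 64.05268, 15.67852),
    (1610612744, 201939, 42.50676, 36.1873),
    (1610612744, 202691, 70.54739, 10.26211),
    (1610612744, 203110, 71.17798, 42.9455),
    (1610612748, 2547, 20.10991, 23.12088),
    (1610612748, 201609, 25.13207, 38.95987),
]
holder = players[-1]
defender = closest_defender(holder, players)
print(defender[1])
```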

I combined the movement and action parameters I deduced with the position data. For simplicity, I’m only showing a few of the columns in the dataframe snippet below, to highlight the parts relevant to this discussion.

period  game_clock  shot_clock  dist_to_ball  player_name     region  \
  1      546.12       16.49       1.108870    Chris Bosh    perimeter   
  1      546.08       16.45       1.512048    Chris Bosh    perimeter   
  1      545.44       15.81       1.268833    Dwyane Wade   perimeter   
  1      545.40       15.77       0.959275    Dwyane Wade   perimeter   

action  defender_name   defender_region  possession_length  
 None   Andrew Bogut      back_court              17  
 pass   Andrew Bogut      back_court              17  
 None   Klay Thompson     back_court              54  
 None   Klay Thompson     back_court              54

You can see that Chris Bosh had possession of the ball for 17 frames (0.68 seconds) and was guarded by Andrew Bogut, before passing in the perimeter region to Dwyane Wade, who had possession for a length of 54 frames (2.16 seconds) and was being defended by Klay Thompson. Note that defender_region is written as back_court since if the current defenders were on offense, they would be in their back court region.

Getting all the relevant NBA game information, parsing it and connecting it with the (x, y) position data took some work, so if you are interested in the details, you can check out my Python scripts, here, here, and here.

The key idea behind my simulator is to incorporate player position data with their actions. I used the data to collect probabilities of the following:

Movement: The probability of going from region A to region B, given that a player is in region A.
Possession: The probability of having possession of the ball when the player’s team is on offense.
Action: The probability of passing, shooting, or turning the ball over given that a player is in a given region.
Regional shooting: The probability of making a shot taken from a particular region.
Defensive parameters: The probability of steals, blocks, offensive rebounds and defensive rebounds per possession.
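As an illustration of the first item, movement probabilities can be estimated by counting region-to-region transitions between consecutive frames and normalizing each row. This is a minimal sketch. `movement_probabilities` is a hypothetical name, the region sequence is made up, and it assumes Python 3 division.

```python
from collections import Counter, defaultdict


def movement_probabilities(region_sequence):
    """Estimate P(next region = B | current region = A).

    region_sequence: one player's region label per frame, in time order.
    Returns {A: {B: probability}}, where each inner dict sums to 1.
    """
    counts = defaultdict(Counter)
    for current, nxt in zip(region_sequence, region_sequence[1:]):
        counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(row.values()) for nxt, n in row.items()}
        for current, row in counts.items()
    }


# Made-up sequence of region labels for one player.
sequence = ["perimeter", "perimeter", "paint", "perimeter", "back_court"]
probs = movement_probabilities(sequence)
print(probs["perimeter"])
```

The action and regional shooting probabilities can be built the same way, by counting (region, action) pairs or (region, made/missed) pairs instead of (region, next region) pairs.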

Below are action probability plots of a select group of players from the MIA-GSW game. We can see, for example, that Chris Bosh and Stephen Curry only turned the ball over in the mid-range region, and that Stephen Curry and Draymond Green have similar shooting tendencies in the paint.

Using the movement and action probabilities per region, I was able to simulate plays, and consequently games. My simulator generates actions, and I make the assumption that there are about six actions per play and that each play is about 20 seconds. With 12-minute quarters, this leads to 36 plays and 216 actions per quarter, and thus 864 actions per game. I simulated 100 games using the GSW players Andrew Bogut, Klay Thompson, Stephen Curry, Draymond Green, and Brandon Rush, and the MIA players Luol Deng, Udonis Haslem, Chris Bosh, Dwyane Wade, and Amar’e Stoudemire. Assuming a constant performance based on the minutes played, I scaled the results of each player as if they played the entire game. These scaled totals are what I treat as the ‘Actual’ results. Below are comparisons of the actual vs. simulated results.
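Here is a rough sketch of the action budget and of drawing actions from a probability table. With six actions per 20-second play and 12-minute quarters, the arithmetic gives 36 plays and 216 actions per quarter, and four quarters give 864 actions per game. The `sample_action` and `simulate_quarter` helpers and the example probabilities are hypothetical. The real simulator also tracks regions, possession and defense.

```python
import random
from collections import Counter

ACTIONS_PER_PLAY = 6
SECONDS_PER_PLAY = 20
SECONDS_PER_QUARTER = 12 * 60  # NBA quarters are 12 minutes

plays_per_quarter = SECONDS_PER_QUARTER // SECONDS_PER_PLAY
actions_per_quarter = plays_per_quarter * ACTIONS_PER_PLAY
actions_per_game = 4 * actions_per_quarter


def sample_action(action_probs, rng):
    """Draw one action from a {action: probability} table."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def simulate_quarter(action_probs, rng):
    """Count the actions drawn over one quarter's worth of actions."""
    return Counter(
        sample_action(action_probs, rng) for _ in range(actions_per_quarter)
    )


# Hypothetical probabilities for one player in one region (not from the data).
example_probs = {"pass": 0.70, "shot": 0.25, "turnover": 0.05}
quarter = simulate_quarter(example_probs, random.Random(0))
print(actions_per_quarter, actions_per_game, dict(quarter))
```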

Preliminary results after 100 simulations

        PTS    FGA    FGM   FG3A   FG3M   OREB   DREB    STL    BLK     TO    PASS
GSW
mean  85.43  73.26  32.65  13.25   6.71   6.00  40.26   0.49   2.18  18.25  253.68
std   11.35   6.21   4.85   3.44   2.77   2.38   5.87   0.70   1.44   3.73   21.37
rmse  30.80  16.50  10.80  12.48   3.00   2.92   7.06   4.97   1.43   8.66  173.59

MIA
mean  87.31  89.41  37.94  16.20   3.81  21.42  38.83   1.19   2.32  13.95  273.51
std    9.63   6.89   4.80   4.39   2.00   4.35   5.39   1.23   1.58   3.61   20.39
rmse  44.36  25.82  18.66   6.78   2.17   4.33  22.01   1.71   2.38   6.76  120.61

A visual representation helps highlight the differences.

Although GSW won the actual game on January 11, when we scale the point totals of the players used in the simulation to the entire game, MIA is expected to win (approx. 130-114). In the 100 simulations shown above, MIA won 50% of the time. The relatively large root mean square error (rmse) in the point totals is enough to make sense of this outcome. The rmse of a few parameters is not so bad, however. In particular, the rmse values for blocks (BLK) and three-point field goals made (FG3M) are relatively small. Passes are particularly off. I will test the model with players that play more minutes and with data from more games to assess just how bad the damage is.
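For reference, the mean, std and rmse rows of the table can be computed as in the sketch below. The `summarize` helper is a hypothetical name, and it uses the population standard deviation, which is an assumption about how the table was produced.

```python
import math


def summarize(simulated, actual):
    """Return (mean, std, rmse) of simulated totals against one actual value."""
    n = len(simulated)
    mean = sum(simulated) / n
    # Spread of the simulations around their own mean.
    std = math.sqrt(sum((s - mean) ** 2 for s in simulated) / n)
    # Spread of the simulations around the actual (scaled) value.
    rmse = math.sqrt(sum((s - actual) ** 2 for s in simulated) / n)
    return mean, std, rmse


# Made-up point totals from three simulated games against an actual of 100.
mean, std, rmse = summarize([80, 90, 100], 100)
print(round(mean, 2), round(std, 2), round(rmse, 2))
```

With these definitions, rmse² = (mean − actual)² + std², so a large rmse paired with a modest std points to a biased mean rather than to noisy simulations. Under the same assumption, the PTS columns of the table imply targets of roughly 114 for GSW and 131 for MIA, close to the scaled score quoted above.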