Week 03 - Tables and Visualizations

Sections 31 and 36, solutions at dchotai.github.io/resources

In [1]:
# Import some modules to use
import numpy as np
from datascience import *

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

To learn about table manipulations and visualizing data, we'll use data from Basketball-Reference that describes the average statistics of NBA players for the 2016-17 season. You should know how to use the table methods described on the Data 8 resources page.

Tables

1a.

Read in the nba_16_17.csv table and store it in the nba variable.

In [2]:
# SOLUTION
# We use the Table.read_table method to read a .csv file into a new table
nba = Table.read_table('nba_16_17.csv')
nba
Out[2]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PPG
1 Alex Abrines SG 23 OKC 68 6 15.5 2 5 0.393 1.4 3.6 0.381 0.6 1.4 0.426 0.6 0.7 0.898 0.3 1 1.3 0.6 0.5 0.1 0.5 1.7 6
2 Quincy Acy PF 26 TOT 38 1 14.7 1.8 4.5 0.412 1 2.4 0.411 0.9 2.1 0.413 1.2 1.6 0.75 0.5 2.5 3 0.5 0.4 0.4 0.6 1.8 5.8
3 Steven Adams C 23 OKC 80 80 29.9 4.7 8.2 0.571 0 0 0 4.7 8.2 0.572 2 3.2 0.611 3.5 4.2 7.7 1.1 1.1 1 1.8 2.4 11.3
4 Arron Afflalo SG 31 SAC 61 45 25.9 3 6.9 0.44 1 2.5 0.411 2 4.4 0.457 1.4 1.5 0.892 0.1 1.9 2 1.3 0.3 0.1 0.7 1.7 8.4
5 Alexis Ajinca C 28 NOP 39 15 15 2.3 4.6 0.5 0 0.1 0 2.3 4.5 0.511 0.7 1 0.725 1.2 3.4 4.5 0.3 0.5 0.6 0.8 2 5.3
6 Cole Aldrich C 28 MIN 62 0 8.6 0.7 1.4 0.523 0 0 nan 0.7 1.4 0.523 0.2 0.4 0.682 0.8 1.7 2.5 0.4 0.4 0.4 0.3 1.4 1.7
7 LaMarcus Aldridge PF 31 SAS 72 72 32.4 6.9 14.6 0.477 0.3 0.8 0.411 6.6 13.8 0.48 3.1 3.8 0.812 2.4 4.9 7.3 1.9 0.6 1.2 1.4 2.2 17.3
8 Lavoy Allen PF 27 IND 61 5 14.3 1.3 2.8 0.458 0 0 0 1.3 2.7 0.461 0.4 0.5 0.697 1.7 1.9 3.6 0.9 0.3 0.4 0.5 1.3 2.9
9 Tony Allen SG 35 MEM 71 66 27 3.9 8.4 0.461 0.2 0.8 0.278 3.6 7.6 0.479 1.1 1.8 0.615 2.3 3.2 5.5 1.4 1.6 0.4 1.4 2.5 9.1
10 Al-Farouq Aminu SF 26 POR 61 25 29.1 3 7.6 0.393 1.1 3.5 0.33 1.9 4.2 0.445 1.6 2.2 0.706 1.3 6.1 7.4 1.6 1 0.7 1.5 1.7 8.7

... (476 rows omitted)

If you aren't familiar with basketball terminology, this table may seem daunting. When encountering a new data set, it's always a good idea to look up what each column represents. Since we don't have the time in section, here's a table describing the meaning of each label:

Label Meaning
Pos Position (Point Guard, Shooting Guard, Small Forward, Power Forward, Center)
Tm Team abbreviation
G / GS Games played / Games started
MP Minutes played
FG[A][%] Field goals [attempted] [% made]
3P / 2P Three-point / two-point field goals
FT Free throws
[O][D][T]RB Offensive / Defensive / Total rebounds
AST Assists
STL Steals
BLK Blocks
TOV Turnovers
PF Personal Fouls
PPG Points per game

1b.

In the 2016-17 NBA All-Star game, NBA players Klay Thompson, C.J. McCollum, Kyle Lowry, Eric Gordon, Kyrie Irving, Kemba Walker, Nick Young, and Wesley Matthews were chosen to participate in the Three-Point Contest. We'd expect that the top eight three-point shooters are invited to participate in the contest. Were the contest participants actually the top eight three-point shooters?

To find out, set threes equal to a table that contains last season's NBA players sorted by three-point percentage, with the most accurate players at the top. Hint: First filter out players that did not shoot any three-pointers.

In [3]:
# SOLUTION
# Filter out players that did not make any threes first, following the hint (this tip is specific to this data set)
# Conditioning on '3P' (threes made per game) above 0 also drops players that never attempted a three
# Use the .sort method to order the rows by 3P%
# Because we want the larger values (most accurate players) at the top, set the optional 'descending' argument to True
threes = nba.where('3P', are.above(0)).sort("3P%", descending=True)
threes
Out[3]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PPG
217 Demetrius Jackson PG 22 BOS 5 0 3.4 0.6 0.8 0.75 0.2 0.2 1 0.4 0.6 0.667 0.6 1.2 0.5 0.4 0.4 0.8 0.6 0 0 0 0 2
160 Treveon Graham SG 23 CHO 27 1 7 0.7 1.5 0.475 0.3 0.6 0.6 0.4 0.9 0.4 0.4 0.6 0.667 0.2 0.6 0.8 0.2 0.2 0 0.1 0.7 2.1
146 Pau Gasol C 36 SAS 64 39 25.4 4.7 9.4 0.502 0.9 1.6 0.538 3.9 7.8 0.494 2 2.9 0.707 1.7 6.2 7.8 2.3 0.4 1.1 1.3 1.7 12.4
88 Pat Connaughton SG 24 POR 39 1 8.1 0.9 1.8 0.514 0.4 0.8 0.515 0.5 1 0.513 0.2 0.2 0.778 0.3 1.1 1.3 0.7 0.2 0.1 0.4 0.6 2.5
335 Johnny O'Bryant PF 23 TOT 11 0 7.3 1.4 2.7 0.5 0.3 0.5 0.5 1.1 2.2 0.5 0.5 0.5 0.833 0.8 0.9 1.7 0.5 0 0.1 0.5 0.9 3.5
223 John Jenkins SG 25 PHO 4 0 3.3 0.5 1.3 0.4 0.3 0.5 0.5 0.3 0.8 0.333 0.5 0.5 1 0 0.3 0.3 0.3 0 0 0 0 1.8
207 Josh Huestis PF 25 OKC 2 0 15.5 3 5.5 0.545 1 2 0.5 2 3.5 0.571 0 0.5 0 2 2.5 4.5 1.5 0 1.5 0 0 7
170 A.J. Hammons C 24 DAL 22 0 7.4 0.8 1.9 0.405 0.2 0.5 0.5 0.5 1.5 0.375 0.4 0.9 0.45 0.4 1.3 1.6 0.2 0 0.6 0.5 1 2.2
151 Marcus Georges-Hunt SG 22 ORL 5 0 9.6 0.4 1.4 0.286 0.2 0.4 0.5 0.2 1 0.2 1.8 2 0.9 0.2 1.6 1.8 0.6 0.2 0 0.4 1 2.8
408 Jason Smith C 30 WAS 74 3 14.4 2.4 4.4 0.529 0.5 1.1 0.474 1.9 3.4 0.546 0.5 0.7 0.686 0.9 2.6 3.5 0.5 0.3 0.7 0.8 2.3 5.7

... (370 rows omitted)

Wow, these players are pretty accurate from long-range. What seems to be wrong here? How can we find the top eight players that consistently shot and made three-pointers?

SOLUTION: Many of these players rarely shot three-pointers. If we look at the 3PA column, we see that many of these players attempted fewer than one three-pointer per game, and many of them played very few games in the entire season. We can filter out players that did not shoot consistently or with a large enough volume to get a more accurate result.

Suppose we only want to include players that made at least two three-pointers per game. Set top_eight equal to a table that contains the eight most accurate three-point shooters that made at least 2 three-pointers per game.

In [4]:
# SOLUTION
# We want the players that made at least 2 threes per game, so we condition on '3P' >= 2
# Because we want only the first eight players, we can use the .take method to get the specified rows
top_eight = threes.where('3P', are.above_or_equal_to(2)).take(np.arange(8))
top_eight
Out[4]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PPG
248 Kyle Korver SG 35 TOT 67 22 26.2 3.6 7.7 0.464 2.4 5.4 0.451 1.1 2.3 0.494 0.6 0.6 0.905 0.1 2.7 2.8 1.6 0.5 0.3 1 1.6 10.1
130 Jordan Farmar PG 30 SAC 2 0 17.5 2 6 0.333 2 4.5 0.444 0 1.5 0 0 0 nan 0.5 1 1.5 4.5 1 0 1.5 0.5 6
372 J.J. Redick SG 32 LAC 78 78 28.2 5.1 11.4 0.445 2.6 6 0.429 2.5 5.4 0.462 2.3 2.6 0.891 0.1 2.1 2.2 1.4 0.7 0.2 1.3 1.6 15
97 Seth Curry PG 26 DAL 70 42 29 4.8 10 0.481 2 4.6 0.425 2.9 5.4 0.528 1.2 1.4 0.85 0.4 2.2 2.6 2.7 1.1 0.1 1.3 1.8 12.8
288 C.J. McCollum SG 25 POR 80 80 35 8.7 18 0.48 2.3 5.5 0.421 6.3 12.5 0.506 3.4 3.7 0.912 0.8 2.9 3.6 3.6 0.9 0.5 2.2 2.5 23
427 Klay Thompson SG 26 GSW 78 78 34 8.3 17.6 0.468 3.4 8.3 0.414 4.8 9.3 0.516 2.4 2.8 0.853 0.6 3 3.7 2.1 0.8 0.5 1.6 1.8 22.3
302 C.J. Miles SF 29 IND 76 29 23.4 3.7 8.5 0.434 2.2 5.4 0.413 1.5 3.1 0.471 1.1 1.2 0.903 0.4 2.6 3 0.6 0.6 0.3 0.5 2 10.7
274 Kyle Lowry PG 30 TOR 60 60 37.4 7.1 15.3 0.464 3.2 7.8 0.412 3.9 7.5 0.518 5 6.1 0.819 0.8 4 4.8 7 1.5 0.3 2.9 2.8 22.4

Which players rightfully got to participate in the three-point contest? Which players from the above table were left out? Is there anything wrong with the top eight players we found above?

SOLUTION: Klay Thompson, C.J. McCollum, and Kyle Lowry definitely earned their spots. Prominent three-point shooters like Kyle Korver and J.J. Redick were left out of the contest even though they shot higher percentages. You might notice that Jordan Farmar was the second most accurate shooter in the table, but he only played in 2 games the entire season, so his percentage rests on very few attempts.
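
One possible way to deal with the Jordan Farmar problem is to also require a minimum number of games. The sketch below uses plain NumPy arrays holding four rows copied from the top_eight output above, so it runs without the data file. The 40-game cutoff (MIN_GAMES) is an arbitrary assumption for illustration, not something this worksheet specifies. With the table itself, the same idea would be one more .where call on the 'G' column.

```python
import numpy as np

# Four rows copied from the top_eight output above
players = np.array(["Kyle Korver", "Jordan Farmar", "J.J. Redick", "Seth Curry"])
games = np.array([67, 2, 78, 70])                     # 'G'
three_pct = np.array([0.451, 0.444, 0.429, 0.425])    # '3P%'

MIN_GAMES = 40  # hypothetical cutoff, chosen only for illustration

# Boolean array: True for players that appeared in enough games
enough_games = games >= MIN_GAMES

# Keep the qualifying players, then order them by 3P% (highest first)
qualified = players[enough_games]
order = np.argsort(-three_pct[enough_games])
ranked = qualified[order]
print(ranked)
```

Jordan Farmar drops out because he played only 2 games, while the other three keep their relative order.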

1c.

Let's look at some players that play the Center position. Centers traditionally don't shoot three-pointers, so their field goals are primarily composed of two-point field goals. Assign centers to a table containing only players that play the center position. Exclude the '3P', '3PA', '3P%', '2P', '2PA', and '2P%' columns. There are around 100 centers in the NBA, so let's restrict the table to players that started in at least 20 games.

In [5]:
# SOLUTION
# Dropping the six unwanted labels takes less typing than selecting all the desired labels (both methods are valid)
# After dropping the columns, keep only the rows where 'Pos' is 'C'
# Then condition on Games Started >= 20 using the are.above_or_equal_to predicate
centers = nba.drop('3P', '3PA', '3P%', '2P', '2PA', '2P%').where('Pos', 'C').where('GS', are.above_or_equal_to(20))
centers
Out[5]:
Rk Player Pos Age Tm G GS MP FG FGA FG% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PPG
3 Steven Adams C 23 OKC 80 80 29.9 4.7 8.2 0.571 2 3.2 0.611 3.5 4.2 7.7 1.1 1.1 1 1.8 2.4 11.3
46 Bismack Biyombo C 24 ORL 81 27 22.1 2.2 4.2 0.528 1.5 2.9 0.534 1.9 5.1 7 0.9 0.3 1.1 1.2 2.5 6
51 Andrew Bogut C 32 TOT 27 21 21.6 1.4 3 0.469 0.1 0.4 0.273 2.1 6 8.1 1.8 0.5 0.9 1.6 3.2 2.9
71 Clint Capela C 22 HOU 65 59 23.9 5.6 8.7 0.643 1.4 2.7 0.531 2.7 5.4 8.1 1 0.5 1.2 1.3 2.8 12.6
76 Willie Cauley-Stein C 23 SAC 75 21 18.9 3.4 6.4 0.53 1.3 2 0.669 1.1 3.4 4.5 1.1 0.7 0.6 0.9 2 8.1
77 Tyson Chandler C 34 PHO 47 46 27.6 3.3 4.9 0.671 1.9 2.6 0.734 3.3 8.2 11.5 0.6 0.7 0.5 1.4 2.7 8.4
90 DeMarcus Cousins C 26 TOT 72 72 34.2 9 19.9 0.452 7.2 9.3 0.772 2.1 8.9 11 4.6 1.4 1.3 3.7 3.9 27
100 Anthony Davis C 23 NOP 75 75 36.1 10.3 20.3 0.505 6.9 8.6 0.802 2.3 9.5 11.8 2.1 1.3 2.2 2.4 2.2 28
103 Dewayne Dedmon C 27 SAS 76 37 17.5 2.1 3.4 0.622 0.9 1.2 0.699 1.7 4.8 6.5 0.6 0.5 0.8 0.8 2.4 5.1
115 Andre Drummond C 23 DET 81 81 29.7 6 11.2 0.53 1.7 4.4 0.386 4.3 9.5 13.8 1.1 1.5 1.1 1.9 2.9 13.6

... (28 rows omitted)

Centers are commonly known for their defensive prowess. Set most_blocks equal to the name of the center that averaged the most blocked shots last season, most_rebounds equal to the name of the center that averaged the most rebounds last season, and most_steals equal to the name of the center that averaged the most steals last season.

In [6]:
# SOLUTION
# In general, when a question asks for the highest/lowest item of a specific column, you should first sort
# the table by the desired column in descending/ascending order.
# We can use the Table.row(i) method to get the row at index i
# Because we sorted in descending order, the first row will be the row with the largest value
# After using the Table.row method, we get a row object with multiple attributes, use the .item method to access
# the desired attribute
most_blocks = centers.sort('BLK', descending=True).row(0).item('Player')
most_rebounds = centers.sort('TRB', descending=True).row(0).item('Player')
most_steals = centers.sort('STL', descending=True).row(0).item('Player')
print("Blocks:", most_blocks + ",", "Rebounds:", most_rebounds + ",", "Steals:", most_steals)
Blocks: Rudy Gobert, Rebounds: Hassan Whiteside, Steals: Andre Drummond
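
The sort-then-take-the-first-row pattern can also be sketched with plain NumPy arrays. The example below uses only four rows copied from the displayed part of the centers table, so its answer differs from the full-table answer printed above (Rudy Gobert is not among these four rows). It is meant as an illustration of the idea, not a replacement for the table solution.

```python
import numpy as np

# Four rows copied from the displayed part of the centers table
players = np.array(["Steven Adams", "Clint Capela", "Anthony Davis", "Andre Drummond"])
blocks = np.array([1.0, 1.2, 2.2, 1.1])  # 'BLK'

# Sort-then-take-first: indices ordered from most to fewest blocks
by_blocks = np.argsort(-blocks)
top_by_sort = players[by_blocks[0]]

# Shortcut: np.argmax gives the index of the largest value directly
top_by_argmax = players[np.argmax(blocks)]

print(top_by_sort, top_by_argmax)
```

Both approaches pick the same player. Sorting is still useful when you also want the second or third highest values.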

You'll notice that some players are listed as being on team "TOT", which is actually not a real team. "TOT" indicates that the player switched teams during the season. Who was the oldest center in the centers table that swapped teams last season? Set oldest_swap equal to this player's name. Hint: Using the .row() method may be useful.

In [7]:
# SOLUTION
# First, keep all the rows with players that switched teams (their teams are listed as "TOT")
# Because we want the oldest player, we sort the resulting table in descending order
# Similar to the previous question, use the .row and .item methods to access the player's name
oldest_swap = centers.where('Tm', 'TOT').sort('Age', descending=True).row(0).item('Player')
oldest_swap
Out[7]:
'Andrew Bogut'

Fun fact: the oldest_swap player was on three different teams last season. In his debut with the third team, he broke his leg in under a minute of play and was subsequently waived. :(

1d.

The NBA season consists of 82 games, many of which occur back-to-back or with only one day of rest for players. Players that start in every game they play are generally regarded as the "starters" of their teams. Oftentimes, fatigue can catch up to the starters, which causes them to miss some games or come off the bench for a few games to rest. Set only_started equal to a table containing only players that started in every game they played.

Hint: You probably haven't been explicitly taught this in lecture, but the .where() method can also take a single array of boolean values instead of the standard .where(label, are.predicate) format. The array needs one boolean value per row, and .where() keeps only the rows whose corresponding value is True.
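
To see what such an array of boolean values looks like, here is a minimal sketch using plain NumPy arrays. The three rows are copied from the nba output above; the variable names are chosen only for this illustration.

```python
import numpy as np

# Three rows copied from the nba output above
players = np.array(["Alex Abrines", "Steven Adams", "Arron Afflalo"])
games = np.array([68, 80, 61])   # 'G'
starts = np.array([6, 80, 45])   # 'GS'

# == compares the two arrays elementwise and returns one boolean per row
started_every_game = games == starts
print(started_every_game)

# Indexing with the boolean array keeps only the entries marked True
kept = players[started_every_game]
print(kept)
```

Only Steven Adams has equal values in both arrays, so he is the only player kept.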

In [8]:
# SOLUTION
# You won't have a problem this complicated on an exam. This problem is more for curious practice.
# Outside of this class, predicates won't always be in are.predicate format
# (Though in this class, are.predicate will be the standard format)
# The hint refers to creating an array of boolean values (each value corresponding to a player in the original order
# of the table). We do this using the == operator, which checks if the values on the left side are equal to the 
# values on the right side elementwise.
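# 3 == 4 would return False, make_array(3, 4, 5) == make_array(2, 4, 5) would return array([False, True, True])
# (this only works elementwise on arrays; comparing two plain Python lists with == returns a single True/False)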
# Therefore, we're checking if for each row of the table, does the value of 'G' (games played) == 'GS' (games started)?
# This will return an array of boolean (True/False) values, which we can use as a predicate
# Only rows that have a corresponding True value in the predicate will be included in the resulting table
only_started = nba.where(nba.column('G') == nba.column('GS'))
only_started
Out[8]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PPG
3 Steven Adams C 23 OKC 80 80 29.9 4.7 8.2 0.571 0 0 0 4.7 8.2 0.572 2 3.2 0.611 3.5 4.2 7.7 1.1 1.1 1 1.8 2.4 11.3
7 LaMarcus Aldridge PF 31 SAS 72 72 32.4 6.9 14.6 0.477 0.3 0.8 0.411 6.6 13.8 0.48 3.1 3.8 0.812 2.4 4.9 7.3 1.9 0.6 1.2 1.4 2.2 17.3
15 Ryan Anderson PF 28 HOU 72 72 29.4 4.5 10.7 0.418 2.8 7 0.403 1.7 3.7 0.446 1.8 2.1 0.86 1.6 3 4.6 0.9 0.4 0.2 0.8 2 13.6
16 Giannis Antetokounmpo SF 22 MIL 80 80 35.6 8.2 15.7 0.521 0.6 2.3 0.272 7.6 13.5 0.563 5.9 7.7 0.77 1.8 7 8.8 5.4 1.6 1.9 2.9 3.1 22.9
17 Carmelo Anthony SF 32 NYK 74 74 34.3 8.1 18.8 0.433 2 5.7 0.359 6.1 13.1 0.466 4.1 4.9 0.833 0.8 5.1 5.9 2.9 0.8 0.5 2.1 2.7 22.4
19 Trevor Ariza SF 31 HOU 80 80 34.7 4.1 10 0.409 2.4 6.9 0.344 1.7 3 0.556 1.2 1.6 0.738 0.7 5.1 5.7 2.2 1.8 0.3 0.9 1.7 11.7
28 Harrison Barnes PF 24 DAL 79 79 35.5 7.6 16.2 0.468 1 2.8 0.351 6.6 13.4 0.492 3.1 3.6 0.861 1.2 3.8 5 1.5 0.8 0.2 1.3 1.6 19.2
32 Nicolas Batum SG 28 CHO 77 77 34 5.1 12.7 0.403 1.8 5.3 0.333 3.4 7.4 0.453 3.2 3.7 0.856 0.6 5.6 6.2 5.9 1.1 0.4 2.5 1.4 15.1
36 Bradley Beal SG 23 WAS 77 77 34.9 8.3 17.2 0.482 2.9 7.2 0.404 5.4 10 0.538 3.7 4.4 0.825 0.7 2.4 3.1 3.5 1.1 0.3 2 2.2 23.1
45 Patrick Beverley SG 28 HOU 67 67 30.7 3.4 8.1 0.42 1.6 4.3 0.382 1.8 3.8 0.463 1.1 1.4 0.768 1.4 4.4 5.9 4.2 1.5 0.4 1.5 3.3 9.5

... (73 rows omitted)

Set starter_mean equal to the mean age of the players in the only_started table. Set old_diff equal to the difference in age between the oldest player in the table and the mean age of the players in the table. Set young_diff equal to the difference in age between the mean age of the players in the table and the youngest player in the table.

In [9]:
# SOLUTION
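# We first take the 'Age' column and aggregate it using the np.average() function, which crunches all of the values
# in the array into a single value (which is the mean of the values)
starter_mean = np.average(only_started.column('Age'))
# The oldest and youngest ages are the max and min of the same column
# Each difference is ordered so that the result is positive
old_diff = max(only_started.column('Age')) - starter_mean
young_diff = starter_mean - min(only_started.column('Age'))
print("Mean:", str(starter_mean))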
Mean: 27.1204819277

Are there more starters in the table that are over the mean age or under the mean age? Why do you think this is the case?

In [10]:
# SOLUTION
# You can use the Table.num_rows attribute to access the number of rows in the table
over = only_started.where('Age', are.above(starter_mean)).num_rows
under = only_started.where('Age', are.below(starter_mean)).num_rows
print("Over:", over, "Under:", under)
Over: 38 Under: 45

SOLUTION: There are more starters in our table that are under the mean age than starters that are over the mean age. This is likely because younger players tend to get more playing time, whereas many older players tend to come off the bench in the later stages of their careers.

1e.

The 2016-17 NBA champions were the Golden State Warriors! Create a new table called champs that is the same as the original nba table, but has a column called "Champion". The "Champion" column should include boolean values that indicate whether the given player was a champion or not.

In [11]:
# SOLUTION
# Similar to question 1d.
# We first get the 'Tm' (Team Abbreviation) column using the .column method
# We use the == operator to check if each value of the resulting array is equal to 'GSW', which is the 
# abbreviation for the Golden State Warriors
# The result of this operation is an array of boolean (True/False) values that correspond to the rows in the table
# We use the Table.with_column method to add a column to the table, with the boolean array we just made as the values
# Note, the Table.with_column method returns a new table (it does not mutate the original table), so we store the
# resulting table in the `champs` variable
champ_bools = nba.column('Tm') == 'GSW'
champs = nba.with_column('Champion', champ_bools)
champs.take(np.arange(80, 86))
Out[11]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% ORB DRB TRB AST STL BLK TOV PF PPG Champion
81 Semaj Christon PG 24 OKC 64 1 15.2 1.2 3.5 0.345 0.2 1 0.19 1 2.5 0.406 0.3 0.5 0.548 0.3 1.1 1.4 2 0.4 0.1 0.7 1.2 2.9 False
82 Ian Clark SG 25 GSW 77 0 14.8 2.7 5.6 0.487 0.8 2.1 0.374 1.9 3.5 0.556 0.6 0.8 0.759 0.3 1.3 1.6 1.2 0.5 0.1 0.7 1 6.8 True
83 Jordan Clarkson SG 24 LAL 82 19 29.2 5.8 13.1 0.445 1.4 4.3 0.329 4.4 8.7 0.503 1.6 2 0.798 0.6 2.4 3 2.6 1.1 0.1 2 1.8 14.7 False
84 Norris Cole PG 28 OKC 13 0 9.6 1.2 4 0.308 0.2 1 0.231 1 3 0.333 0.6 0.8 0.8 0 0.8 0.8 1.1 0.6 0 0.5 1.4 3.3 False
85 Darren Collison PG 29 SAC 68 64 30.3 5 10.5 0.476 1.1 2.6 0.417 3.9 7.9 0.495 2.2 2.5 0.86 0.3 1.9 2.2 4.6 1 0.1 1.7 1.8 13.2 False
86 Nick Collison PF 36 OKC 20 0 6.4 0.7 1.2 0.609 0 0.1 0 0.7 1.1 0.636 0.3 0.4 0.625 0.5 1.1 1.6 0.6 0.1 0.1 0.2 0.9 1.7 False
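
The boolean column can be checked by hand on the six rows shown above. This is a small sketch with a plain NumPy array holding the 'Tm' values copied from that output. np.count_nonzero is used here only to count the True values and is not part of the original solution.

```python
import numpy as np

# 'Tm' values copied from the six rows displayed above
teams = np.array(["OKC", "GSW", "LAL", "OKC", "SAC", "OKC"])

# Comparing an array to a single string checks every element against it
is_champion = teams == "GSW"
print(is_champion)

# True counts as 1, so counting the nonzero entries counts the champions
num_champions = np.count_nonzero(is_champion)
print(num_champions)
```

The result matches the Champion column above: only Ian Clark's row is True.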

Visualizations

The two types of data visualizations we learned about last week were bar charts and scatter plots.

Bar charts are used for plotting the distributions of categorical data, which are data that have no numerical significance. For example, if your data was composed of different names of fruit, it wouldn't make sense to perform arithmetic on the fruit names. Categorical data may or may not be ordered; fruit names naturally have no order, but you could alphabetically order them if you wanted to.

In this class, we'll primarily use the Table.barh() method to construct horizontal bar charts. Horizontal bar charts have the category labels on the y-axis and the respective numerical value of each category on the x-axis. The numerical value associated with each category is typically the count/frequency of values that are in that category.

The Table.barh() method's first argument should be the label of the column that contains the categories. The second argument should be the label of the column that contains the associated numerical values of the categories. If you leave out the second argument, the method charts every other column in the table, so for a table with just two columns (one for categories and one for numerical data) the first argument is enough.

2a.

The following positions table contains a column for the different positions in the nba table and a corresponding column with the frequency of each position in the table.

In [12]:
# Just run this. We haven't learned about the Table.group() method in detail yet.
positions = nba.group('Pos')
positions
Out[12]:
Pos count
C 96
PF 97
PF-C 1
PG 97
SF 90
SG 105

Use a horizontal bar chart to visualize the data in this table.

In [13]:
# SOLUTION
# There are just two columns in the `positions` table, so we only need to pass the categorical column label into
# the Table.barh method.
# positions.barh('Pos', 'count') is also a valid answer
positions.barh('Pos')

There seems to be roughly the same number of players per position, except for a lone "PF-C" position. Who is the lone player that played the PF-C position? Set pf_c equal to this player's name.

In [14]:
# SOLUTION
# We first find all the rows where the Position column had the value 'PF-C'
# The resulting table only had 1 row, which is Joffrey Lauvergne
pf_c = nba.where('Pos', 'PF-C').row(0).item('Player')
pf_c
Out[14]:
'Joffrey Lauvergne'

Turns out this player was traded from the Oklahoma City Thunder (where he played Power Forward) to the Chicago Bulls (where he played Center). Of course we could have easily seen this from the positions table without needing to visualize the data, but bar charts sometimes make it easier to understand the data you're working with compared to tables.

2b.

Let's consider the positions excluding the PF-C position. The pos_avg table below contains the average statistics of each position from the table, excluding the PF-C position.

In [15]:
pos_avg = nba.group('Pos', collect=np.mean).take(make_array(0, 1, 3, 4, 5))
pos_avg
Out[15]:
Pos Rk mean Player mean Age mean Tm mean G mean GS mean MP mean FG mean FGA mean FG% mean 3P mean 3PA mean 3P% mean 2P mean 2PA mean 2P% mean FT mean FTA mean FT% mean ORB mean DRB mean TRB mean AST mean STL mean BLK mean TOV mean PF mean PPG mean
C 261.083 26.2812 53.875 25.8333 18.101 3.08542 5.86354 0.52651 0.226042 0.639583 nan 2.85729 5.23125 0.541333 1.39896 2.05417 nan 1.65312 3.75208 5.40625 1.09271 0.495833 0.795833 1.05729 2.04271 7.78542
PF 248.557 26.0309 50.5773 22.9485 18.2474 2.84433 6.20515 0.44634 0.590722 1.74639 nan 2.24948 4.45979 0.494536 1.13711 1.53299 nan 1.10309 3.01237 4.10825 1.12887 0.5 0.445361 0.892784 1.66598 7.40206
PG 253.649 26.732 52.8557 25.2268 20.7629 3.33093 7.81031 0.41268 0.948454 2.66495 0.327031 2.38969 5.13196 0.448175 1.7866 2.17938 nan 0.401031 2.00722 2.40619 3.64742 0.765979 0.179381 1.53299 1.5732 9.39072
SF 214.656 26.9 58.5333 29.5889 22.0044 3.26 7.37444 0.428044 0.947778 2.75222 0.322456 2.31556 4.62556 0.492778 1.54222 1.97111 nan 0.713333 2.87889 3.59667 1.56444 0.747778 0.357778 1.00444 1.68222 9.01667
SG 238 26.1524 53.2857 23.6476 20.5276 3.1019 7.32667 nan 1.10952 3.07905 nan 1.99238 4.24381 nan 1.28667 1.60571 nan 0.419048 2.00762 2.42381 1.70952 0.626667 0.204762 1.00667 1.46381 8.59238
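
As a rough sketch of what grouping by position and averaging computes, the example below works out the mean PPG per position by hand with NumPy. It uses only the ten rows displayed at the top of the nba table, so the numbers will not match the full-table means above. The last line illustrates one likely reason for the nan entries in the output: np.mean returns nan whenever the values it averages include a nan.

```python
import numpy as np

# 'Pos' and 'PPG' copied from the first ten rows of the nba table
positions = np.array(["SG", "PF", "C", "SG", "C", "C", "PF", "PF", "SG", "SF"])
ppg = np.array([6.0, 5.8, 11.3, 8.4, 5.3, 1.7, 17.3, 2.9, 9.1, 8.7])

# For each distinct position, average the PPG of the rows with that position
means = {}
for pos in np.unique(positions):
    means[str(pos)] = np.mean(ppg[positions == pos])
print(means)

# A single missing value makes the whole mean nan
with_missing = np.mean(np.array([0.523, np.nan]))
print(with_missing)
```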

Construct a bar chart that compares the mean points per game, total rebounds, and assists of each position.

In [16]:
# SOLUTION
# We can plot multiple numerical values for a category on a single bar chart!
# The first argument is always the category column label
# If plotting only 1 value column, the second argument is the label of that numerical value column
# If plotting multiple value columns, the second argument should be an array that holds the labels of the desired 
# value columns.
pos_avg.barh('Pos', make_array('PPG mean', 'TRB mean', 'AST mean'))

Scatter plots are used for plotting numerical data. Numerical data have meaningful differences (it makes sense to subtract one value from another) and are ordered. Note that data with numerical values are not always numerical data. For example, the census example from lecture had a numerical SEX code (0, 1, or 2). These values were used as categories and the numbers did not have meaningful differences, so the data were categorical.
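
To make the point about numerical codes concrete, here is a minimal sketch. The sex_codes array is a made-up sample, not data from the census table. Counting how often each code appears is meaningful, while arithmetic on the codes (such as their mean) would run without error but would not describe anything real.

```python
import numpy as np

# Hypothetical sample of SEX codes (0, 1, or 2), made up for illustration
sex_codes = np.array([0, 1, 2, 1, 2, 2])

# Treat the codes as categories: list each distinct code and how often it occurs
codes, counts = np.unique(sex_codes, return_counts=True)
print(codes)
print(counts)
```

These counts are the kind of values a bar chart would display, one bar per code.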

We'll use the Table.scatter() method to construct scatter plots. The first argument should be the label of the column to use for the x-axis values. The second argument should be the label of the column to use for the y-axis values. If you leave out the second argument, the method plots every other column against the first, so for a table with just two numerical columns the first argument is enough.

2c.

We'd expect that players that get more playing time get more chances to score. Draw a scatter plot that compares minutes played to points per game. Does the scatter plot show a reasonable trend?

In [17]:
# SOLUTION
# We see that players that play more minutes tend to also score more points per game. This is a positive,
# roughly linear correlation.
nba.scatter('MP','PPG')
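
One way to put a number on the trend is the correlation coefficient, sketched below with NumPy. This goes slightly beyond what the worksheet asks for. It uses only the ten rows displayed at the top of the nba table, so the value is an illustration and not the correlation for the whole league.

```python
import numpy as np

# 'MP' and 'PPG' copied from the first ten rows of the nba table
minutes = np.array([15.5, 14.7, 29.9, 25.9, 15.0, 8.6, 32.4, 14.3, 27.0, 29.1])
ppg = np.array([6.0, 5.8, 11.3, 8.4, 5.3, 1.7, 17.3, 2.9, 9.1, 8.7])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation
r = np.corrcoef(minutes, ppg)[0, 1]
print(round(r, 2))
```

A value above 0 agrees with the upward trend in the scatter plot: players with more minutes tend to score more.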

2d.

We'd expect that players that make a high percentage of their shots are also good at shooting free throws. Draw a scatter plot that compares the field goal percentage to free throw percentage. Does the scatter plot show a trend you'd expect? Why do you think some players have high field goal percentage but not free throw percentage?

In [18]:
# SOLUTION
# Some players are pretty tall and have high field goal percentages because they dunk the ball often. Some players
# are just not good at shooting free throws. Some examples include Shaquille O'Neal (retired) and Andre Drummond.
nba.scatter('FG%', 'FT%')