Sections 31 and 36, solutions at dchotai.github.io/resources
# Import some modules to use
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
To learn about table maniuplations and visualizing data, we'll use data from Basketball-Reference that describes the average statistics of NBA players for the 2016-17 season. You should know how to use the table methods described on the Data 8 resources page.
Read in the nba_16_17.csv
table and store it in the nba
variable.
# SOLUTION
# We use the Table.read_table method to read in .csv files. Here we read the .csv file into a newly constructed table
nba = Table().read_table('nba_16_17.csv')
nba
If you aren't familiar with basketball terminology, this table may seem daunting. When encountering a new data set, it's always a good idea to look up what each column represents. Since we don't have the time in section, here's a table describing the meaning of each label:
Label | Meaning |
---|---|
Pos | Position (Point Guard, Shooting Guard, Small Forward, Power Forward, Center) |
Tm | Team abbreviation |
G / GS | Games played / Games started |
MP | Minutes played |
FG[A][%] | Field goals [attempted] [% made] |
3P / 2P | Three-point / two-point field goals |
FT | Free throws |
[O][D][T]RB | Offensive / Defensive / Total rebounds |
AST | Assists |
STL | Steals |
BLK | Blocks |
TOV | Turnovers |
PF | Personal Fouls |
PPG | Points per game |
In the 2016-17 NBA All-Star game, NBA players Klay Thompson, C.J. McCollum, Kyle Lowry, Eric Gordon, Kyrie Irving, Kemba Walker, Nick Young, and Wesley Matthews were chosen to participate in the Three-Point Contest. We'd expect that the top eight three-point shooters are invited to participate in the contest. Were the contest participants actually the top eight three-point shooters?
To find out, set threes
equal to a table that contains last season's NBA players sorted by three-point percentage, with the most accurate players at the top. Hint: First filter out players that did not shoot any three-pointers.
# SOLUTION
# Filter out players that did not shoot any threes first, as per the hint (this tip is specific to this data set)
# Use the .sort method to order the rows by 3P%
# Because we want the larger values (most accurate players) at the top, set the optional 'descending' argument to True
threes = nba.where('3P', are.above(0)).sort("3P%", descending=True)
threes
Wow, these players are pretty accurate from long-range. What seems to be wrong here? How can we find the top eight players that consistently shot and made three-pointers?
SOLUTION: Many of these players rarely shot three-pointers. If we look at the 3PA column, we see that many of these players attempted less than one three-pointer per game, and many of them played very few games in the entire season. We can filter out players that did not shoot consistently or with a large enough volume to get a more accurate result.
Suppose we only want to include players that made at least two three-pointers per game. Set top_eight
equal to a table that contains the top eight accurate three-point shooters that made at least 2 three-pointers per game.
# SOLUTION
# We want the players that made at least 2 threes per game, so we condition on '3P' >= 2
# Because we want only the first eight players, we can use the .take method to get the specified rows
top_eight = threes.where('3P', are.above_or_equal_to(2)).take(np.arange(8))
top_eight
Which players rightfully got to participate in the three-point contest? Which players from the above table were left out? Is there anything wrong with the top eight players we found above?
SOLUTION: Klay Thompson, C.J. McCollum, and Kyle Lowry definitely earned their spots. Many prominent three-point shooters like Kyle Korver and J.J. Redick were left out of the contest percentagewise. You might notice that Jordan Farmar was the second most accurate shooter in the table, but he only played in 2 games the entire season.
Let's look at some players that play the Center position. Centers traditionally don't shoot three-pointers, so their field goals are primarily composed of two-point field goals. Assign centers
to a table containing only players that play the center position. Exclude the '3P', '3PA', '3P%', '2P', '2PA', and '2P%' columns. There are around 100 centers in the NBA, so let's restrict the table to players that started in at least 20 games.
# SOLUTION
# Dropping the unwanted labels is faster than selecting the desired labels (though both methods are valid)
# We first keep only players that are Centers to reduce unnecessary computation
# Again, condition on Games Started >= 20 using the are.above_or_equal_to predicate
centers = nba.drop('3P', '3PA', '3P%', '2P', '2PA', '2P%').where('Pos', 'C').where('GS', are.above_or_equal_to(20))
centers
Centers are commonly known for their defensive prowess. Set most_blocks
equal to the name of the center that averaged the most blocked shots last season, most_rebounds
equal to the name of the center that averaged the most rebounds last season, and most_steals
equal to the name of the center that averaged the most steals last season.
# SOLUTION
# In general, when a question asks for the highest/lowest item of a specific column, you should first sort
# the table by the desired column in descending/ascending order.
# We can use the Table.row(i) method to get the row at index i
# Because we sorted in descending order, the first row will be the row with the largest value
# After using the Table.row method, we get a row object with multiple attributes, use the .item method to access
# the desired attribute
most_blocks = centers.sort('BLK', descending=True).row(0).item('Player')
most_rebounds = centers.sort('TRB', descending=True).row(0).item('Player')
most_steals = centers.sort('STL', descending=True).row(0).item('Player')
print("Blocks:", most_blocks + ",", "Rebounds:", most_rebounds + ",", "Steals:", most_steals)
You'll notice that some players are listed as being on team "TOT", which is actually not a real team. "TOT" indicates that the player switched teams during the season. Who was the oldest player that swapped teams last season? Set oldest_swap
equal to this player's name. Hint: Using the .row() method may be useful.
# SOLUTION
# First, keep all the rows with players that switched teams (their teams are listed as "TOT")
# Because we want the oldest player, we sort the resulting table in descending order
# Similar to the previous question, use the .row and .item methods to access the player's name
oldest_swap = centers.where('Tm', 'TOT').sort('Age', descending=True).row(0).item('Player')
oldest_swap
Fun fact: the oldest_swap
player was on three different teams last season. In his debut with the third team, he broke his leg in under a minute of play and was subsequently waived. :(
The NBA season consists of 82 games, many of which occur back-to-back or with only one day of rest for players. Players that start in every game they play are generally regarded as the "starters" of their teams. Often times, fatigue can catch up to the starters, which causes them to take miss some games or come off the bench for a few games to rest. Set only_started
equal to a table containing only players that started in every game they played.
Hint: You probably haven't been explicitly taught this in lecture, but the .where()
method can also take in a value or predicate instead of the standard .where(label, are.predicate)
format. This predicate can also be a series of boolean values. The .where()
method only includes rows that satisfy the predicate elementwise.
# SOLUTION
# You won't have a problem this complicated on an exam, this problem is more for curious practice.
# Outside of this class, predicates won't always be in are.predicate format
# (Though in this class, are.predicate will be the standard format)
# The hint refers to creating an array of boolean values (each value corresponding to a player in the original order
# of the table). We do this using the == operator, which checks if the values on the left side are equal to the
# values on the right side elementwise.
# 3 == 4 would return False, [3, 4, 5] == [2, 4, 5] would return [False, True, True]
# Therefore, we're checking if for each row of the table, does the value of 'G' (games played) == 'GS' (games started)?
# This will return an array of boolean (True/False) values, which we can use as a predicate
# Only rows that have a corresponding True value in the predicate will be included in the resulting table
only_started = nba.where(nba.column('G') == nba.column('GS'))
only_started
Set starter_mean
equal to the mean age of the players in the only_started
player. Set old_diff
equal to the difference in age between the oldest player in the table and the mean age of the players in the table. Set young_diff
equal to the difference in age between the mean age of the players in the table and the youngest player in the table.
# SOLUTION
# We first take the 'Age' column and aggregate it using the np.average() function, which crunches all of the values
# in the array into a single value (which is the mean of the values)
starter_mean = np.average(only_started.column('Age'))
print("Mean:", str(starter_mean))
Are there more starters in the table that are over the mean age or under the mean age? Why do you think this is the case?
# SOLUTION
# You can use the Table.num_rows attribute to access the number of rows in the table
over = only_started.where('Age', are.above(starter_mean)).num_rows
under = only_started.where('Age', are.below(starter_mean)).num_rows
print("Over:", over, "Under:", under)
SOLUTION: There are more starters in our table that are under the mean age than starters that are over the mean age. This is likely because younger players tend to get more playing time, whereas many older players tend to come off the bench in the later roles of their careers.
The 2016-17 NBA champions were the Golden State Warriors! Create a new table called champs
that is the same as the original nba
table, but has a column called "Champion". The "Champion" column should include boolean values that indicate whether the given player was a champion or not.
# SOLUTION
# Similar to question 1d.
# We first get the 'Tm' (Team Abbreviation) column using the .column method
# We use the == operator to check if each value of the resulting array is equal to 'GSW', which is the
# abbreviation for the Golden State Warriors
# The result of this operation is an array of boolean (True/False) values that correspond to the rows in the table
# We use the Table.with_column method to add a column to the table, with the boolean array we just made as the values
# Note, the Table.with_column method returns a new table (it does not mutate the original table), so we store the
# resulting table in the `champs` variable
champ_bools = nba.column('Tm') == 'GSW'
champs = nba.with_column('Champion', champ_bools)
champs.take(np.arange(80, 86))
The two types of data visualizations we learned about last week were bar charts and scatter plots.
Bar charts are used for plotting the distributions of categorical data, which are data that have no numerical significance. For example, if your data was composed of different names of fruit, it wouldn't make sense to perform arithmetic on the fruit names. Categorical data may or may not be ordered; fruit names naturally have no order, but you could alphabetically order them if you wanted to.
In this class, we'll primarily use the Table.barh()
method to construct horizontal bar charts. Horizontal bar charts have the category labels on the y-axis and the respective numerical value of each category on the x-axis. The numerical value associated with each category is typically the count/frequency of values that are in that category.
The Table.barh()
method's first argument should be the label of the column you want to chart. The second argument should be the label of the column that contains the associated numerical values of the categories. If your table has just two columns (one for categories and one for numerical data), you don't need to pass in the second argument since the method will automatically default to it.
The following positions
table contains a column for the different positions in the nba
table and a corresponding column with the frequency of each position in the table.
# Just run this, we haven't learned about the Table.group() method in detail yet
positions = nba.group('Pos')
positions
Use a horizontal bar chart to visualize the data in this table.
# SOLUTION
# There are just two columns in the `positions` table, so we only need to pass the categorical column label into
# the Table.barh method.
# positions.barh('Pos', 'count') is also a valid answer
positions.barh('Pos')
There seems to be roughly the same number of players per position, except for a lone "PF-C" position. Who is the lone player that played the PF-C position? Set pf_c
equal to this player's name.
# SOLUTION
# We first find all the rows where the Position column had the value 'PF-C'
# The resulting table only had 1 row, which is Joffrey Lauvergne
pf_c = nba.where('Pos', 'PF-C').row(0).item('Player')
pf_c
Turns out this player was traded from the Oklahoma City Thunder (where he played Power Forward) to the Chicago Bulls (where he played Center). Of course we could have easily seen this from the positions
table without needing to visualize the data, but bar charts sometimes make it easier to understand the data you're working with compared to tables.
Let's consider the positions excluding the PF-C position. The pos_avg
table below contains the average statistics of each position from the table, excluding the PF-C position.
pos_avg = nba.group('Pos', collect=np.mean).take(make_array(0, 1, 3, 4, 5))
pos_avg
Construct a bar chart that compares the mean points per game, total rebounds, and assists of each position.
# SOLUTION
# We can plot multiple numerical values for a category on a single bar chart!
# The first argument is always the category column label
# The second argument is either the label of the numerical value column to chart (if plotting only 1 value column)
# If plotting multiple value columns, the second argument should be an array that holds the labels of the desired
# value columns.
pos_avg.barh('Pos', make_array('PPG mean', 'TRB mean', 'AST mean'))
Scatter plots are used for plotting numerical data. Numerical data have meaningful differences (it makes sense to subtract one value from another) and are ordered. Note that data with numerical values are not always numerical data. For example, the census example from lecture had a numerical SEX
code (0, 1, or 2). These values were used as categories and the numbers did not have meaningful differences, so the data were categorical.
We'll use the Table.scatter()
method to construct scatter plots. The first argument should be the label of the column to use for the x-axis values. The second argument should be the label of the column to use for the y-axis values. If your table has just two columns with numerical data, you don't need to pass in the second argument since the method will automatically default to it.
We'd expect that players that get more playing time get more chances to score. Draw a scatter plot that compares minutes played to points per game. Does the scatter plot show a reasonable trend?
# SOLUTION
# We see that players that play more minutes tend to also score more points per game. This is a positive,
# roughly linear correlation.
nba.scatter('MP','PPG')
We'd expect that players that make a high percentage of their shots are also good at shooting free throws. Draw a scatter plot that compares the field goal percentage to free throw percentage. Does the scatter plot show a trend you'd expect? Why do you think some players have high field goal percentage but not free throw percentage?
# SOLUTION
# Some players are pretty tall and have high field goal percentages because they dunk the ball often. Some players
# are just not good at shooting free throws. Some examples include Shaquille O'Neal (retired) and Andre Drummond.
nba.scatter('FG%', 'FT%')