Sections 31 and 36, solutions at dchotai.github.io/resources
# Import some modules to use
import numpy as np
from datascience import *
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
This week, we'll use data from Basketball-Reference again that describes the average statistics of NBA players for the 2016-17 season. Here's the key again for reference:
Label | Meaning |
---|---|
Pos | Position (Point Guard, Shooting Guard, Small Forward, Power Forward, Center) |
Tm | Team abbreviation |
G / GS | Games played / Games started |
MP | Minutes played |
FG[A][%] | Field goals [attempted] [% made] |
3P / 2P | Three-point / two-point field goals |
FT | Free throws |
[O][D][T]RB | Offensive / Defensive / Total rebounds |
AST | Assists |
STL | Steals |
BLK | Blocks |
TOV | Turnovers |
PF | Personal Fouls |
PPG | Points per game |
# Just run this
nba = Table().read_table('nba_16_17.csv').drop('ORB', 'DRB').where('FGA', are.above(0))
nba
Histograms are used to visualize the distribution of numerical values. We use bins to group numerical variables into continuous intervals. Bins are inclusive of their lower bounds but exclusive of their upper bounds, which is often expressed as [lower, upper). We place bins on the x-axis and count the number of numerical values that lie in the bins. Note that bin widths may not always be equal, so it is always important to pay attention to the ranges on the x-axis.
The height of each bar measures the proportion of data in the bin relative to the amount of space in the bin. This is another key reason histograms are different than bar charts, because the height of each bar is relative to the width of the bin. In other words, the height of a bar refers to the density, or how crowded the bin is.
The formula for the height of a bar is
Height of bar = (Proportion of data in the bin) / (width of the bin).
Rearranging this formula gives us the formula for the area of a bar (which corresponds to the proportion of data in the bin), which is
Area of bar = (Height of bar) x (Width of bar)
Generally for histograms, to determine the number of items in a bin, it makes sense to calculate the area of the bar, which corresponds to the proportion of total values in the bin. You can then multiply this proportion by the total number of items in the distribution to get the number of items that are in the desired bin. When determining the density of each bin, we use the height of bars.
When creating histograms using the Table.hist
method, the first argument we pass in is the label of the column of desired values. By default, the Table.hist
method automatically determines appropriate bins for the histogram based on the numerical values passed in. If you wish to use your own bins, you can pass in an array of values using the optional bins=YOUR_BINS_ARRAY
parameter.
Let's visualize the distribution of the ages of last season's NBA players. NBA players must be at least 19 years old to play on a team. The oldest player last season was 40 years old. Assign age_bins
to an array of bin values that describe the ages of NBA players with an appropriate bin width.
Hint: To make equally sized bins, we can use the np.arange
function.
# SOLUTION
# Any bin widths in the interval [1, 5] should work fine
# A slightly narrower range of bins, such as np.arange(19, 41, 2) is also valid so long as no values are excluded
age_bins = np.arange(18, 44, 2)
age_bins
To visualize the distribution of player ages, create a histogram of the ages using the age_bins
you just created.
# SOLUTION
nba.hist('Age', bins=age_bins)
Let's look at the distribution of points scored per game. Create a histogram of points scored per game using the bins defined in the cell below.
points_bins = np.arange(0, 35, 5)
# SOLUTION
nba.hist('PPG', bins=points_bins)
Based on this histogram, roughly what percentage of NBA players scored between 10 (inclusive) and 15 (exclusive) points per game?
SOLUTION: We know that the area of a bar corresponds to the proportion of data in the bin. Recall that the formula to calculate the area of a bin is to multiply the width and height of the bin. The width of the [10, 15) bin is 5. We don't have the exact height of the bar, but fortunately we only need an estimate of the percentage of players that scored between 10 and 15 points per game. We can eyeball the height of the bar and see that it is roughly 4. Therefore, the percentage of players that scored between 10 and 15 points per games is (5 x 4) = 20%.
Using only the histogram above, can we find the percentage of players that scored beteween 12 and 15 points per game? Why or why not?
SOLUTION: Using only the histogram above, we cannot find the percentage of players that scored between 12 and 15 points per game. This is because the bins have regular widths of 5; have no way to determine how many players are in the [12, 15) interval out of the [10, 15) bin using only the histogram above.
Functions are used to define a name that refers to blocks of code. Every function is composed of a function name, arugments/parameters in parentheses, and a body, which often contains a return statement.
def hello(x):
"""Returns a greeting for the input"""
return "Hello " + x
In this function above, the function name is hello
, which takes in a single parameter x
, and the body returns the string "Hello x"
, with the x
replaced by whatever the value of x
is. The parameters/arguments can have any names because they serve as placeholders for values passed in by the user.
When defining functions, it's important to appropriately name your function and make sure your code is readable. Often times, you will see a docstring, which is a description of what the function does. Docstrings are written in triple quotation marks directly under the function name and above the body, as shown in the example above.
Often times, you will want to perform a set of operations on every value of a column. An easy way to do this is by defining a function that performs the set of operations on a general input, and using that function with the Table.apply
method.
The syntax for this method is table_name.apply(function_name, 'column_label(s)')
. Note that unlike many of the other the other Table methods, the Table.apply
method does not take a column label as the first argument. The Table.apply
method returns an array containing the ouputs of calling the specified function on each element of the specified column(s).
There are 82 games per season in the NBA. Let's find out what proportion of games each player played in during the 2016-17 season. Define a function named proportion_of_games
that takes in one parameter and returns the value of the parameter divided by the number of games in a NBA season.
# SOLUTION
def proportion_of_games(x):
return x / 82
Using the function we've just made, construct a table that is the same as the nba
table but with an additional column labeled 'Proportion of Season Played'
, which describes the proportion of games that each player played in.
# SOLUTION
# First apply the function to the 'G' column, which represents the number of games played last season
# This will give us an array of the proportion of games played last season for each player
# We use the Table.with_column table to attach these values to the table as a new column
proportions = nba.apply(proportion_of_games, 'G')
with_prop_played = nba.with_column('Proportion of Season Played', proportions)
with_prop_played
Which players were at every game? Assign all_games
to a table of players that participated in every game of the season.
# SOLUTION
# A valid way to find the players that participated in every game of the season is to use .where('G', 82)
# but we were really looking for you to use the proportions calculated above, where a proportion of 1
# corresponds to all 82 games played out of the 82 games of the regular season.
all_games = with_prop_played.where('Proportion of Season Played', 1)
all_games
Players with common first and last names can form "combined names." For example, LeBron James and James Harden have the common name "James," and can therefore combine to form LeBron James Harden. We'll define combined names to have a prefix player and a suffix player; in the example, LeBron James is the prefix player and James Harden is the suffix player. Let's find the combined names in our nba
table.
To find the common names in our table, we first need to isolate the first and last names of each player. Some players may have more than two terms in their full name, so we'll define first names to be the first term in their full name, and last names to be the remaining terms in their full name. Fill out the first_name
and last_name
functions below, which should take in a name as a string and return the first name or last name respectively.
Hint: For a string str
that has words separated by spaces, we can retrieve the first word with the expression str.split(" ")[0]
, and we can retrieve the remaining words in the string using the expression " ".join(str.split(" ")[1:])
. Don't worry, you're not expected to know how these expressions work.
# SOLUTION
# The parameter name is just a variable/placeholder, so we need to use the same parameter names to apply
# the methods given in the hint
def firstname(name):
"""Returns the first name of an input full name string"""
return name.split(" ")[0]
def lastname(name):
"""Returns the last name of an input full name string"""
return " ".join(name.split(" ")[1:])
Now that we have functions to do the heavy lifting for us, assign first_names
to an array containing the first names of all the players in the nba
table. Similarly, assign last_names
to an array containing the last names of all the players in the nba
table.
Hint: Can you think of a way to apply
the functions we just made on the player names?
# SOLUTION
# The hint indicates we want to apply our functions on the 'Player' columns to get the first and last names
# We can do this using the Table.apply method
first_names = nba.apply(firstname, 'Player')
last_names = nba.apply(lastname, 'Player')
Let's organize the information we have so far. Create a names
table that contains the full names of the players, the first names of the players, and the last names of the players. Make sure to label the first name column as first
and the last name column as last
.
# SOLUTION
# Some students may use the Table.with_column method twice, which is valid, but it is a bit easier to use the
# Table.with_columns method when adding multiple columns, as shown in lecture
# Additionally, note that selecting columns before adding new columns is more efficient than
# adding columns and selecting/dropping columns afterwards
names = nba.select('Player').with_columns('first', first_names,
'last', last_names)
names
Now that we've separated each player's name into a first name and a last name, we can determine our prefix and suffix players. A prefix player is a player whose last name matches the first name of any other player in the table. Similarly, a suffix player is a player whose first name matches the last name of any other player in the table. Assign prefixes
to a table containing prefix players, and assign suffixes
to a table containing suffix players.
Hint: We've already made an array of all the first names and an array of all the last names in the table.
# SOLUTION
# The hint indicates that we need to check which players' first names exist in the array of last names,
# and which players' last names exist in the array of first names.
# We can check if a value is in an array using the are.contained_in predicate when we use the Table.where method
# to keep only the rows of prefix or suffix players in our respective tables
prefixes = names.where('last', are.contained_in(first_names))
suffixes = names.where('first', are.contained_in(last_names))
Now that we know our prefix players and suffix players, we can match them up based on their common names. Ideally, we'd have you use the Table.join
method to achieve this on your own, but we'll cover joins in more detail next week. For now, just run the cell below, which joins by prefix name in prefix_joined
and joins by suffix name in suffix_name
.
# We haven't covered joins in detail yet, just run this
prefix_joined = prefixes.join('last', suffixes, other_label='first').select(2, 1, 0, 3, 4)
suffix_joined = suffixes.join('first', prefixes, other_label='last').select(4, 3, 0, 1, 2)
# You don't need to understand the body of this function, just know how to use it
def name_joiner(full, partial):
"""
Combines an array of full names with their corresponding partial names to return an list of combined names.
Note: the first parameter of this function must be a full name
combined name = full + partial
"""
return [pre + " " + suff for pre, suff in zip(full, partial)]
prefix_joined
suffix_joined
You'll notice that the prefix_joined
and suffix_joined
tables contain full names and partial names of players. We want to combine the player names by the common name in the middle column. To do this, we can use the name_joiner
function provided in the previous cell, which takes in an array of full names and an array of their corresponding partial names, and combines them to return a list of combined names. Note that the first parameter of the function must be a full name, which corresponds to the left common terms of a combined name.
The prefix_joined
and suffix_joined
tables have already been reorganized to make it easier for you to identify which columns to use with the name_joiner
function. Assign pref_combined
to the list of names that result from combining names in the prefix_joined
table, and assign suff_combined
to the list of names that result from combining names in the suffix_joined
table.
# SOLUTION
# In the prefix_joined table, the full name is in the 'Player' column and the partial name to combine
# is in the 'last_2' column
# In the suffix_joined table, the full name is in the 'Player_2' column and the partial name to combine
# is in the 'last' column
# We pass in these columns as arrays in the appropriate order into the 'name_joiner' function
pref_combined = name_joiner(prefix_joined.column('Player'), prefix_joined.column('last_2'))
suff_combined = name_joiner(suffix_joined.column('Player_2'), suffix_joined.column('last'))
All that's left to do is compile our combined names into a single table. We can sort the names in alphabetical order and remove duplicate values using the Table.sort
method. Run the cell below to see the combined names!
combined_names = Table().with_column('Combined Names', pref_combined + suff_combined).sort('Combined Names', distinct=True)
combined_names # Use combined_names.show() to see all of the rows
We use the Table.group
method to aggregate rows with similar values for a column into single rows for those values. Grouping is often useful if you want to organize your data by a specific column or variable.
The syntax for the method is Table.group(label_to_group_by, collect=aggregateFunction)
. The first argument of the method is the column label to group by. If grouping by multiple columns, the first argument should be an array of column labels to group by. The second parameter is a function that aggregates the grouped values into a single value. The second parameter is optional, and uses the len
function by default that returns a count of the number of grouped values. Note that the second parameter is only applied on values in the columns not specified in the first parameter.
Pivots are useful for cross-classifying your data by two categorical variables. Using the Table.pivot
method yields a grid of aggregated values after grouping together rows with shared values. The syntax of the method is Table.pivot('col_label_one', 'col_label_two', values='AGG_VALUES_LABEL', collect=AGG_FUNCTION)
The Table.pivot('col_label_one', 'col_label_two')
method requires 2 parameters. The first parameter will form the column labels of the resulting pivot table, and the second parameter will form the row labels of the table. By default, the method fills in the counts of values that correspond to the first and second category. It may be helpful to imagine the table as a grid where every value's x-coordinate is a category from the first parameter, and every value's y-coordinate is a category from the second parameter.
If you wish to aggregate custom values using a different function, you can use the two optional arguments values='AGG_VALUES_LABEL', collect=AGG_FUNCTION
. When using pivot, you must include either both of these optional arguments, or neither of them. The values
argument should be the label of the column you'd like to supply as values instead of counts. The collect
function aggregates the grouped values into a single value, similar to the collect
function in the Table.group
method.
Create a table avg_team_age
that contains teams and the average age of each team's players.
Hint: It may be easier to work with just the 'Tm' and 'Age' columns instead of the entire table.
# SOLUTION
# As per the hint, we only need the 'Tm' and 'Age' columns, so our computation may be faster if we exclude
# unneccessary columns before grouping
# We can then use the Table.group method to aggregate the values in the table by team. We pass in the optional
# `collect` argument as np.average to combine all of the ages per team using the np.average function. This gives us
# the average age per team. If we wanted to find the total age per team, we could pass in the `np.sum` or `sum`
# as the collect function.
avg_team_age = nba.select('Tm', 'Age').group('Tm', collect=np.average)
avg_team_age
Which team had the youngest players on average? Which team had the oldest players on average?
SOLUTION: We can sort the avg_team_age
table to answer these questions. The Portland Trailblazers had the youngest players on average. The Cleveland Cavaliers had the oldest players on average.
Create a histogram that compares average field goal percentage by team. Set the most_accurate
variable to the abbreviation of the most accurate team.
# SOLUTION
# Similar to the previous problem, we group the table by the 'Tm' column and aggregate using the `np.average` function
# Note that because we did not select or drop any specific columns, all columns of the table with numerical values
# will be aggregated into an average by team
# For instance, if you look at the `team_avg` table, you'll notice that column's like 'Player' will be blank because
# we can't take an average of strings or names. When the `collect` function does not work on a column's type, the
# aggregated value for that column will be empty
# Once we've grouped by Team, we can sort by 'FG%' to get the most accurate teams, then use the .row and .item methods
# to get the team abbreviation of the team with the highest field goal percentage
team_avg = nba.group('Tm', np.average)
most_accurate = team_avg.sort('FG% average', descending=True).row(0).item('Tm')
print('Most accurate team:', most_accurate)
team_avg.barh('Tm', 'FG% average')
Let's further examine the most_accurate
team. Which position on the most_accurate
team had the highest average field goal percentage? Which position on the most_accurate
team had the highest average points per game? Are these the same positions?
# SOLUTION
# We first find the players that were on the most accurate team (in the `most_accurate` table) and
# group them by position using the `np.average` function to aggregate the values into averages for each column
# We can then sort by our desired column and use the .row and .item methods to access the positions that
# had the highest average field goal percentage and the highest average points per game on the most
# accurate team
most_accurate_positions = nba.where('Tm', most_accurate).group('Pos', np.average)
highest_avg_fg = most_accurate_positions.sort('FG% average', descending=True).row(0).item('Pos')
highest_avg_ppg = most_accurate_positions.sort('PPG average', descending=True).row(0).item('Pos')
print("Highest Average FG%:", highest_avg_fg + ",", "Highest average PPG:", highest_avg_ppg)
SOLUTION: The most accurate team was the Golden State Warriors. The Small Forward position had the highest average field goal percentage on this team. The Small Forward position also had the highest average points scored per game.
Create a table that describes how many players of each position are on each team. Your first column should be the team's abbreviation, and the remaining columns should be number of players for the different positions on the team. Be sure to first exclude the TOT team, so that any players that were on multiple teams during the season are left out of the table.
# SOLUTION
# Notice that we want to group by two categorical variables: team and position
# We can use the `Table.pivot` to do this
# First we exclude the players that were listed as being on the "TOT" team
# We then use the `Table.pivot` method with the positions column as the first argument and the team abbreviation
# column as the second argument, as specified in the problem. We don't need to pass in the optional `values` and
# `collect` parameters because we only want the counts of the players
nba.where('Tm', are.not_equal_to('TOT')).pivot('Pos', 'Tm')
Now instead of the counts, find the average points per game of each position for every team. The format of your table should be the same as the table above. Remember to first exclude the TOT team.
# SOLUTION
# This is very similar to the previous problem, except instead of counts, we specifically want aggregates of
# the points per game values. We can do this by supplying the points per game column in the `values` optional
# parameter and the `np.average` function for the `collect` parameter to aggregate the values into an average.
nba.where('Tm', are.not_equal_to('TOT')).pivot('Pos', 'Tm', values='PPG', collect=np.average)