Week 04 - Histograms, Functions, Groups

Sections 31 and 36, solutions at dchotai.github.io/resources

In [1]:
# Import some modules to use
import numpy as np
from datascience import *

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

This week, we'll use data from Basketball-Reference again that describes the average statistics of NBA players for the 2016-17 season. Here's the key again for reference:

Label Meaning
Pos Position (Point Guard, Shooting Guard, Small Forward, Power Forward, Center)
Tm Team abbreviation
G / GS Games played / Games started
MP Minutes played
FG[A][%] Field goals [attempted] [% made]
3P / 2P Three-point / two-point field goals
FT Free throws
[O][D][T]RB Offensive / Defensive / Total rebounds
AST Assists
STL Steals
BLK Blocks
TOV Turnovers
PF Personal Fouls
PPG Points per game
In [2]:
# Just run this
nba = Table().read_table('nba_16_17.csv').drop('ORB', 'DRB').where('FGA', are.above(0))
nba
Out[2]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% TRB AST STL BLK TOV PF PPG
1 Alex Abrines SG 23 OKC 68 6 15.5 2 5 0.393 1.4 3.6 0.381 0.6 1.4 0.426 0.6 0.7 0.898 1.3 0.6 0.5 0.1 0.5 1.7 6
2 Quincy Acy PF 26 TOT 38 1 14.7 1.8 4.5 0.412 1 2.4 0.411 0.9 2.1 0.413 1.2 1.6 0.75 3 0.5 0.4 0.4 0.6 1.8 5.8
3 Steven Adams C 23 OKC 80 80 29.9 4.7 8.2 0.571 0 0 0 4.7 8.2 0.572 2 3.2 0.611 7.7 1.1 1.1 1 1.8 2.4 11.3
4 Arron Afflalo SG 31 SAC 61 45 25.9 3 6.9 0.44 1 2.5 0.411 2 4.4 0.457 1.4 1.5 0.892 2 1.3 0.3 0.1 0.7 1.7 8.4
5 Alexis Ajinca C 28 NOP 39 15 15 2.3 4.6 0.5 0 0.1 0 2.3 4.5 0.511 0.7 1 0.725 4.5 0.3 0.5 0.6 0.8 2 5.3
6 Cole Aldrich C 28 MIN 62 0 8.6 0.7 1.4 0.523 0 0 nan 0.7 1.4 0.523 0.2 0.4 0.682 2.5 0.4 0.4 0.4 0.3 1.4 1.7
7 LaMarcus Aldridge PF 31 SAS 72 72 32.4 6.9 14.6 0.477 0.3 0.8 0.411 6.6 13.8 0.48 3.1 3.8 0.812 7.3 1.9 0.6 1.2 1.4 2.2 17.3
8 Lavoy Allen PF 27 IND 61 5 14.3 1.3 2.8 0.458 0 0 0 1.3 2.7 0.461 0.4 0.5 0.697 3.6 0.9 0.3 0.4 0.5 1.3 2.9
9 Tony Allen SG 35 MEM 71 66 27 3.9 8.4 0.461 0.2 0.8 0.278 3.6 7.6 0.479 1.1 1.8 0.615 5.5 1.4 1.6 0.4 1.4 2.5 9.1
10 Al-Farouq Aminu SF 26 POR 61 25 29.1 3 7.6 0.393 1.1 3.5 0.33 1.9 4.2 0.445 1.6 2.2 0.706 7.4 1.6 1 0.7 1.5 1.7 8.7

... (475 rows omitted)

Histograms

Histograms are used to visualize the distribution of numerical values. We use bins to group numerical variables into continuous intervals. Bins are inclusive of their lower bounds but exclusive of their upper bounds, which is often expressed as [lower, upper). We place bins on the x-axis and count the number of numerical values that lie in the bins. Note that bin widths may not always be equal, so it is always important to pay attention to the ranges on the x-axis.

The height of each bar measures the proportion of data in the bin relative to the amount of space in the bin. This is another key reason histograms are different than bar charts, because the height of each bar is relative to the width of the bin. In other words, the height of a bar refers to the density, or how crowded the bin is.

The formula for the height of a bar is

Height of bar = (Proportion of data in the bin) / (width of the bin).

Rearranging this formula gives us the formula for the area of a bar (which corresponds to the proportion of data in the bin), which is

Area of bar = (Height of bar) x (Width of bar)

Generally for histograms, to determine the number of items in a bin, it makes sense to calculate the area of the bar, which corresponds to the proportion of total values in the bin. You can then multiply this proportion by the total number of items in the distribution to get the number of items that are in the desired bin. When determining the density of each bin, we use the height of bars.

When creating histograms using the Table.hist method, the first argument we pass in is the label of the column of desired values. By default, the Table.hist method automatically determines appropriate bins for the histogram based on the numerical values passed in. If you wish to use your own bins, you can pass in an array of values using the optional bins=YOUR_BINS_ARRAY parameter.

1a.

Let's visualize the distribution of the ages of last season's NBA players. NBA players must be at least 19 years old to play on a team. The oldest player last season was 40 years old. Assign age_bins to an array of bin values that describe the ages of NBA players with an appropriate bin width.

Hint: To make equally sized bins, we can use the np.arange function.

In [3]:
# SOLUTION
# Any bin widths in the interval [1, 5] should work fine
# A slightly narrower range of bins, such as np.arange(19, 41, 2) is also valid so long as no values are excluded
age_bins = np.arange(18, 44, 2)
age_bins
Out[3]:
array([18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42])

To visualize the distribution of player ages, create a histogram of the ages using the age_bins you just created.

In [4]:
# SOLUTION
nba.hist('Age', bins=age_bins)

1b.

Let's look at the distribution of points scored per game. Create a histogram of points scored per game using the bins defined in the cell below.

In [5]:
points_bins = np.arange(0, 35, 5)
# SOLUTION
nba.hist('PPG', bins=points_bins)

Based on this histogram, roughly what percentage of NBA players scored between 10 (inclusive) and 15 (exclusive) points per game?

SOLUTION: We know that the area of a bar corresponds to the proportion of data in the bin. Recall that the formula to calculate the area of a bin is to multiply the width and height of the bin. The width of the [10, 15) bin is 5. We don't have the exact height of the bar, but fortunately we only need an estimate of the percentage of players that scored between 10 and 15 points per game. We can eyeball the height of the bar and see that it is roughly 4. Therefore, the percentage of players that scored between 10 and 15 points per games is (5 x 4) = 20%.

1c.

Using only the histogram above, can we find the percentage of players that scored beteween 12 and 15 points per game? Why or why not?

SOLUTION: Using only the histogram above, we cannot find the percentage of players that scored between 12 and 15 points per game. This is because the bins have regular widths of 5; have no way to determine how many players are in the [12, 15) interval out of the [10, 15) bin using only the histogram above.

Functions

Functions are used to define a name that refers to blocks of code. Every function is composed of a function name, arugments/parameters in parentheses, and a body, which often contains a return statement.

def hello(x): """Returns a greeting for the input""" return "Hello " + x

In this function above, the function name is hello, which takes in a single parameter x, and the body returns the string "Hello x", with the x replaced by whatever the value of x is. The parameters/arguments can have any names because they serve as placeholders for values passed in by the user.

When defining functions, it's important to appropriately name your function and make sure your code is readable. Often times, you will see a docstring, which is a description of what the function does. Docstrings are written in triple quotation marks directly under the function name and above the body, as shown in the example above.

Often times, you will want to perform a set of operations on every value of a column. An easy way to do this is by defining a function that performs the set of operations on a general input, and using that function with the Table.apply method.

The syntax for this method is table_name.apply(function_name, 'column_label(s)'). Note that unlike many of the other the other Table methods, the Table.apply method does not take a column label as the first argument. The Table.apply method returns an array containing the ouputs of calling the specified function on each element of the specified column(s).

2a.

There are 82 games per season in the NBA. Let's find out what proportion of games each player played in during the 2016-17 season. Define a function named proportion_of_games that takes in one parameter and returns the value of the parameter divided by the number of games in a NBA season.

In [6]:
# SOLUTION
def proportion_of_games(x):
    return x / 82

2b.

Using the function we've just made, construct a table that is the same as the nba table but with an additional column labeled 'Proportion of Season Played', which describes the proportion of games that each player played in.

In [7]:
# SOLUTION
# First apply the function to the 'G' column, which represents the number of games played last season
# This will give us an array of the proportion of games played last season for each player
# We use the Table.with_column table to attach these values to the table as a new column 
proportions = nba.apply(proportion_of_games, 'G')
with_prop_played = nba.with_column('Proportion of Season Played', proportions)
with_prop_played
Out[7]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% TRB AST STL BLK TOV PF PPG Proportion of Season Played
1 Alex Abrines SG 23 OKC 68 6 15.5 2 5 0.393 1.4 3.6 0.381 0.6 1.4 0.426 0.6 0.7 0.898 1.3 0.6 0.5 0.1 0.5 1.7 6 0.829268
2 Quincy Acy PF 26 TOT 38 1 14.7 1.8 4.5 0.412 1 2.4 0.411 0.9 2.1 0.413 1.2 1.6 0.75 3 0.5 0.4 0.4 0.6 1.8 5.8 0.463415
3 Steven Adams C 23 OKC 80 80 29.9 4.7 8.2 0.571 0 0 0 4.7 8.2 0.572 2 3.2 0.611 7.7 1.1 1.1 1 1.8 2.4 11.3 0.97561
4 Arron Afflalo SG 31 SAC 61 45 25.9 3 6.9 0.44 1 2.5 0.411 2 4.4 0.457 1.4 1.5 0.892 2 1.3 0.3 0.1 0.7 1.7 8.4 0.743902
5 Alexis Ajinca C 28 NOP 39 15 15 2.3 4.6 0.5 0 0.1 0 2.3 4.5 0.511 0.7 1 0.725 4.5 0.3 0.5 0.6 0.8 2 5.3 0.47561
6 Cole Aldrich C 28 MIN 62 0 8.6 0.7 1.4 0.523 0 0 nan 0.7 1.4 0.523 0.2 0.4 0.682 2.5 0.4 0.4 0.4 0.3 1.4 1.7 0.756098
7 LaMarcus Aldridge PF 31 SAS 72 72 32.4 6.9 14.6 0.477 0.3 0.8 0.411 6.6 13.8 0.48 3.1 3.8 0.812 7.3 1.9 0.6 1.2 1.4 2.2 17.3 0.878049
8 Lavoy Allen PF 27 IND 61 5 14.3 1.3 2.8 0.458 0 0 0 1.3 2.7 0.461 0.4 0.5 0.697 3.6 0.9 0.3 0.4 0.5 1.3 2.9 0.743902
9 Tony Allen SG 35 MEM 71 66 27 3.9 8.4 0.461 0.2 0.8 0.278 3.6 7.6 0.479 1.1 1.8 0.615 5.5 1.4 1.6 0.4 1.4 2.5 9.1 0.865854
10 Al-Farouq Aminu SF 26 POR 61 25 29.1 3 7.6 0.393 1.1 3.5 0.33 1.9 4.2 0.445 1.6 2.2 0.706 7.4 1.6 1 0.7 1.5 1.7 8.7 0.743902

... (475 rows omitted)

Which players were at every game? Assign all_games to a table of players that participated in every game of the season.

In [8]:
# SOLUTION
# A valid way to find the players that participated in every game of the season is to use .where('G', 82)
# but we were really looking for you to use the proportions calculated above, where a proportion of 1
# corresponds to all 82 games played out of the 82 games of the regular season.
all_games = with_prop_played.where('Proportion of Season Played', 1)
all_games
Out[8]:
Rk Player Pos Age Tm G GS MP FG FGA FG% 3P 3PA 3P% 2P 2PA 2P% FT FTA FT% TRB AST STL BLK TOV PF PPG Proportion of Season Played
56 Corey Brewer SF 30 TOT 82 11 15.6 1.8 4.2 0.422 0.3 1.4 0.229 1.5 2.8 0.522 0.6 0.8 0.735 2 1.2 0.7 0.2 0.7 1.6 4.5 1
79 Marquese Chriss PF 19 PHO 82 75 21.3 3.5 7.7 0.449 0.9 2.7 0.321 2.6 5 0.52 1.4 2.2 0.624 4.2 0.7 0.8 0.9 1.3 3.2 9.2 1
83 Jordan Clarkson SG 24 LAL 82 19 29.2 5.8 13.1 0.445 1.4 4.3 0.329 4.4 8.7 0.503 1.6 2 0.798 3 2.6 1.1 0.1 2 1.8 14.7 1
93 Jamal Crawford SG 36 LAC 82 1 26.3 4.4 10.6 0.413 1.4 3.9 0.36 3 6.7 0.443 2.1 2.5 0.857 1.6 2.6 0.7 0.2 1.6 1.4 12.3 1
111 Gorgui Dieng PF 27 MIN 82 82 32.4 4 8.1 0.502 0.2 0.5 0.372 3.9 7.5 0.511 1.7 2 0.814 7.9 1.9 1.1 1.2 1.3 3.1 10 1
159 Marcin Gortat C 32 WAS 82 82 31.2 4.8 8.2 0.579 0 0 0 4.8 8.2 0.58 1.3 1.9 0.648 10.4 1.5 0.5 0.7 1.5 2.6 10.8 1
180 Tobias Harris PF 24 DET 82 48 31.3 6.2 13 0.481 1.3 3.8 0.347 4.9 9.1 0.537 2.3 2.8 0.841 5.1 1.7 0.7 0.5 1.2 1.6 16.1 1
192 Buddy Hield SG 23 TOT 82 55 23 4 9.4 0.426 1.8 4.6 0.391 2.2 4.7 0.461 0.8 0.9 0.842 3.3 1.5 0.5 0.1 1.2 1.4 10.6 1
199 Justin Holiday SG 27 NYK 82 4 20 2.8 6.6 0.433 1.2 3.3 0.355 1.7 3.2 0.513 0.8 1 0.825 2.7 1.2 0.8 0.4 0.8 1.3 7.7 1
212 Ersan Ilyasova PF 29 TOT 82 52 26.1 4.7 10.9 0.431 1.7 4.9 0.353 3 6 0.496 2 2.5 0.778 5.9 1.7 0.7 0.3 1.4 2.6 13.1 1

... (7 rows omitted)

2c.

Players with common first and last names can form "combined names." For example, LeBron James and James Harden have the common name "James," and can therefore combine to form LeBron James Harden. We'll define combined names to have a prefix player and a suffix player; in the example, LeBron James is the prefix player and James Harden is the suffix player. Let's find the combined names in our nba table.

To find the common names in our table, we first need to isolate the first and last names of each player. Some players may have more than two terms in their full name, so we'll define first names to be the first term in their full name, and last names to be the remaining terms in their full name. Fill out the first_name and last_name functions below, which should take in a name as a string and return the first name or last name respectively.

Hint: For a string str that has words separated by spaces, we can retrieve the first word with the expression str.split(" ")[0], and we can retrieve the remaining words in the string using the expression " ".join(str.split(" ")[1:]). Don't worry, you're not expected to know how these expressions work.

In [9]:
# SOLUTION
# The parameter name is just a variable/placeholder, so we need to use the same parameter names to apply
# the methods given in the hint
def firstname(name):
    """Returns the first name of an input full name string"""
    return name.split(" ")[0]

def lastname(name):
    """Returns the last name of an input full name string"""
    return " ".join(name.split(" ")[1:])

Now that we have functions to do the heavy lifting for us, assign first_names to an array containing the first names of all the players in the nba table. Similarly, assign last_names to an array containing the last names of all the players in the nba table.

Hint: Can you think of a way to apply the functions we just made on the player names?

In [10]:
# SOLUTION
# The hint indicates we want to apply our functions on the 'Player' columns to get the first and last names
# We can do this using the Table.apply method
first_names = nba.apply(firstname, 'Player')
last_names = nba.apply(lastname, 'Player')

Let's organize the information we have so far. Create a names table that contains the full names of the players, the first names of the players, and the last names of the players. Make sure to label the first name column as first and the last name column as last.

In [11]:
# SOLUTION
# Some students may use the Table.with_column method twice, which is valid, but it is a bit easier to use the 
# Table.with_columns method when adding multiple columns, as shown in lecture
# Additionally, note that selecting columns before adding new columns is more efficient than 
# adding columns and selecting/dropping columns afterwards
names = nba.select('Player').with_columns('first', first_names,
                                         'last', last_names)
names
Out[11]:
Player first last
Alex Abrines Alex Abrines
Quincy Acy Quincy Acy
Steven Adams Steven Adams
Arron Afflalo Arron Afflalo
Alexis Ajinca Alexis Ajinca
Cole Aldrich Cole Aldrich
LaMarcus Aldridge LaMarcus Aldridge
Lavoy Allen Lavoy Allen
Tony Allen Tony Allen
Al-Farouq Aminu Al-Farouq Aminu

... (475 rows omitted)

Now that we've separated each player's name into a first name and a last name, we can determine our prefix and suffix players. A prefix player is a player whose last name matches the first name of any other player in the table. Similarly, a suffix player is a player whose first name matches the last name of any other player in the table. Assign prefixes to a table containing prefix players, and assign suffixes to a table containing suffix players.

Hint: We've already made an array of all the first names and an array of all the last names in the table.

In [12]:
# SOLUTION
# The hint indicates that we need to check which players' first names exist in the array of last names, 
# and which players' last names exist in the array of first names.
# We can check if a value is in an array using the are.contained_in predicate when we use the Table.where method
# to keep only the rows of prefix or suffix players in our respective tables
prefixes = names.where('last', are.contained_in(first_names))
suffixes = names.where('first', are.contained_in(last_names))

Now that we know our prefix players and suffix players, we can match them up based on their common names. Ideally, we'd have you use the Table.join method to achieve this on your own, but we'll cover joins in more detail next week. For now, just run the cell below, which joins by prefix name in prefix_joined and joins by suffix name in suffix_name.

In [13]:
# We haven't covered joins in detail yet, just run this
prefix_joined = prefixes.join('last', suffixes, other_label='first').select(2, 1, 0, 3, 4)
suffix_joined = suffixes.join('first', prefixes, other_label='last').select(4, 3, 0, 1, 2)

# You don't need to understand the body of this function, just know how to use it
def name_joiner(full, partial):
    """
    Combines an array of full names with their corresponding partial names to return an list of combined names.

    Note: the first parameter of this function must be a full name
    combined name = full + partial
    """
    return [pre + " " + suff for pre, suff in zip(full, partial)]
In [14]:
prefix_joined
Out[14]:
first Player last Player_2 last_2
Lavoy Lavoy Allen Allen Allen Crabbe Crabbe
Tony Tony Allen Allen Allen Crabbe Crabbe
Alan Alan Anderson Anderson Anderson Varejao Varejao
Justin Justin Anderson Anderson Anderson Varejao Varejao
Kyle Kyle Anderson Anderson Anderson Varejao Varejao
Ryan Ryan Anderson Anderson Anderson Varejao Varejao
Carmelo Carmelo Anthony Anthony Anthony Bennett Bennett
Joel Joel Anthony Anthony Anthony Bennett Bennett
Avery Avery Bradley Bradley Bradley Beal Beal
Tyson Tyson Chandler Chandler Chandler Parsons Parsons

... (21 rows omitted)

In [15]:
suffix_joined
Out[15]:
first_2 Player_2 first Player last
Lavoy Lavoy Allen Allen Allen Crabbe Crabbe
Alan Alan Anderson Anderson Anderson Varejao Varejao
Carmelo Carmelo Anthony Anthony Anthony Bennett Bennett
Carmelo Carmelo Anthony Anthony Anthony Brown Brown
Carmelo Carmelo Anthony Anthony Anthony Davis Davis
Carmelo Carmelo Anthony Anthony Anthony Morrow Morrow
Carmelo Carmelo Anthony Anthony Anthony Tolliver Tolliver
Avery Avery Bradley Bradley Bradley Beal Beal
Tyson Tyson Chandler Chandler Chandler Parsons Parsons
Norris Norris Cole Cole Cole Aldrich Aldrich

... (28 rows omitted)

You'll notice that the prefix_joined and suffix_joined tables contain full names and partial names of players. We want to combine the player names by the common name in the middle column. To do this, we can use the name_joiner function provided in the previous cell, which takes in an array of full names and an array of their corresponding partial names, and combines them to return a list of combined names. Note that the first parameter of the function must be a full name, which corresponds to the left common terms of a combined name.

The prefix_joined and suffix_joined tables have already been reorganized to make it easier for you to identify which columns to use with the name_joiner function. Assign pref_combined to the list of names that result from combining names in the prefix_joined table, and assign suff_combined to the list of names that result from combining names in the suffix_joined table.

In [16]:
# SOLUTION
# In the prefix_joined table, the full name is in the 'Player' column and the partial name to combine 
# is in the 'last_2' column
# In the suffix_joined table, the full name is in the 'Player_2' column and the partial name to combine
# is in the 'last' column
# We pass in these columns as arrays in the appropriate order into the 'name_joiner' function
pref_combined = name_joiner(prefix_joined.column('Player'), prefix_joined.column('last_2'))
suff_combined = name_joiner(suffix_joined.column('Player_2'), suffix_joined.column('last'))

All that's left to do is compile our combined names into a single table. We can sort the names in alphabetical order and remove duplicate values using the Table.sort method. Run the cell below to see the combined names!

In [17]:
combined_names = Table().with_column('Combined Names', pref_combined + suff_combined).sort('Combined Names', distinct=True)
combined_names # Use combined_names.show() to see all of the rows
Out[17]:
Combined Names
Aaron Gordon Hayward
Aaron Harrison Barnes
Alan Anderson Varejao
Andrew Harrison Barnes
Anthony Davis Bertans
Avery Bradley Beal
C.J. Miles Plumlee
Carmelo Anthony Bennett
Carmelo Anthony Brown
Carmelo Anthony Davis

... (39 rows omitted)

Groups and Pivots

We use the Table.group method to aggregate rows with similar values for a column into single rows for those values. Grouping is often useful if you want to organize your data by a specific column or variable.

The syntax for the method is Table.group(label_to_group_by, collect=aggregateFunction). The first argument of the method is the column label to group by. If grouping by multiple columns, the first argument should be an array of column labels to group by. The second parameter is a function that aggregates the grouped values into a single value. The second parameter is optional, and uses the len function by default that returns a count of the number of grouped values. Note that the second parameter is only applied on values in the columns not specified in the first parameter.

Pivots are useful for cross-classifying your data by two categorical variables. Using the Table.pivot method yields a grid of aggregated values after grouping together rows with shared values. The syntax of the method is Table.pivot('col_label_one', 'col_label_two', values='AGG_VALUES_LABEL', collect=AGG_FUNCTION)

The Table.pivot('col_label_one', 'col_label_two') method requires 2 parameters. The first parameter will form the column labels of the resulting pivot table, and the second parameter will form the row labels of the table. By default, the method fills in the counts of values that correspond to the first and second category. It may be helpful to imagine the table as a grid where every value's x-coordinate is a category from the first parameter, and every value's y-coordinate is a category from the second parameter.

If you wish to aggregate custom values using a different function, you can use the two optional arguments values='AGG_VALUES_LABEL', collect=AGG_FUNCTION. When using pivot, you must include either both of these optional arguments, or neither of them. The values argument should be the label of the column you'd like to supply as values instead of counts. The collect function aggregates the grouped values into a single value, similar to the collect function in the Table.group method.

3a.

Create a table avg_team_age that contains teams and the average age of each team's players.

Hint: It may be easier to work with just the 'Tm' and 'Age' columns instead of the entire table.

In [18]:
# SOLUTION
# As per the hint, we only need the 'Tm' and 'Age' columns, so our computation may be faster if we exclude
# unneccessary columns before grouping
# We can then use the Table.group method to aggregate the values in the table by team. We pass in the optional
# `collect` argument as np.average to combine all of the ages per team using the np.average function. This gives us
# the average age per team. If we wanted to find the total age per team, we could pass in the `np.sum` or `sum`
# as the collect function.
avg_team_age = nba.select('Tm', 'Age').group('Tm', collect=np.average)
avg_team_age
Out[18]:
Tm Age average
ATL 27
BOS 25.2667
BRK 26.6429
CHI 25.5385
CHO 25.5
CLE 29.9286
DAL 27.0625
DEN 25.5333
DET 25.4667
GSW 27.6

... (21 rows omitted)

Which team had the youngest players on average? Which team had the oldest players on average?

SOLUTION: We can sort the avg_team_age table to answer these questions. The Portland Trailblazers had the youngest players on average. The Cleveland Cavaliers had the oldest players on average.

3b.

Create a histogram that compares average field goal percentage by team. Set the most_accurate variable to the abbreviation of the most accurate team.

In [19]:
# SOLUTION
# Similar to the previous problem, we group the table by the 'Tm' column and aggregate using the `np.average` function
# Note that because we did not select or drop any specific columns, all columns of the table with numerical values
# will be aggregated into an average by team
# For instance, if you look at the `team_avg` table, you'll notice that column's like 'Player' will be blank because
# we can't take an average of strings or names. When the `collect` function does not work on a column's type, the
# aggregated value for that column will be empty
# Once we've grouped by Team, we can sort by 'FG%' to get the most accurate teams, then use the .row and .item methods
# to get the team abbreviation of the team with the highest field goal percentage
team_avg = nba.group('Tm', np.average)
most_accurate = team_avg.sort('FG% average', descending=True).row(0).item('Tm')
print('Most accurate team:', most_accurate)
team_avg.barh('Tm', 'FG% average')
Most accurate team: GSW

Let's further examine the most_accurate team. Which position on the most_accurate team had the highest average field goal percentage? Which position on the most_accurate team had the highest average points per game? Are these the same positions?

In [20]:
# SOLUTION
# We first find the players that were on the most accurate team (in the `most_accurate` table) and
# group them by position using the `np.average` function to aggregate the values into averages for each column
# We can then sort by our desired column and use the .row and .item methods to access the positions that 
# had the highest average field goal percentage and the highest average points per game on the most 
# accurate team
most_accurate_positions = nba.where('Tm', most_accurate).group('Pos', np.average)
highest_avg_fg = most_accurate_positions.sort('FG% average', descending=True).row(0).item('Pos')
highest_avg_ppg = most_accurate_positions.sort('PPG average', descending=True).row(0).item('Pos')
print("Highest Average FG%:", highest_avg_fg + ",", "Highest average PPG:", highest_avg_ppg)
Highest Average FG%: SF, Highest average PPG: SF

SOLUTION: The most accurate team was the Golden State Warriors. The Small Forward position had the highest average field goal percentage on this team. The Small Forward position also had the highest average points scored per game.

3c.

Create a table that describes how many players of each position are on each team. Your first column should be the team's abbreviation, and the remaining columns should be number of players for the different positions on the team. Be sure to first exclude the TOT team, so that any players that were on multiple teams during the season are left out of the table.

In [21]:
# SOLUTION
# Notice that we want to group by two categorical variables: team and position
# We can use the `Table.pivot` to do this
# First we exclude the players that were listed as being on the "TOT" team
# We then use the `Table.pivot` method with the positions column as the first argument and the team abbreviation
# column as the second argument, as specified in the problem. We don't need to pass in the optional `values` and 
# `collect` parameters because we only want the counts of the players
nba.where('Tm', are.not_equal_to('TOT')).pivot('Pos', 'Tm')
Out[21]:
Tm C PF_2 PG SF SG
ATL 2 4 2 4 3
BOS 3 3 3 3 3
BRK 2 3 4 2 3
CHI 2 2 3 2 4
CHO 2 3 3 1 5
CLE 4 1 2 4 3
DAL 3 5 5 1 2
DEN 2 3 2 4 4
DET 3 3 3 3 3
GSW 6 2 2 2 3

... (20 rows omitted)

Now instead of the counts, find the average points per game of each position for every team. The format of your table should be the same as the table above. Remember to first exclude the TOT team.

In [22]:
# SOLUTION
# This is very similar to the previous problem, except instead of counts, we specifically want aggregates of 
# the points per game values. We can do this by supplying the points per game column in the `values` optional 
# parameter and the `np.average` function for the `collect` parameter to aggregate the values into an average.
nba.where('Tm', are.not_equal_to('TOT')).pivot('Pos', 'Tm', values='PPG', collect=np.average)
Out[22]:
Tm C PF_2 PG SF SG
ATL 9.85 6.7 11.65 6.65 6.1
BOS 8.83333 3.93333 12.1333 8.7 9.73333
BRK 13.7 6.7 7.875 8.45 8.83333
CHI 7.6 8.7 6.76667 14.7 7
CHO 6.35 8.06667 10.9667 9.2 7.52
CLE 5.075 19 14.6 10.975 6.83333
DAL 3.93333 8.42 8.2 2.8 7.75
DEN 9.1 6.96667 10.1 9.025 10.575
DET 8 9.83333 9.9 7.63333 5.83333
GSW 3.75 6.5 15.2 16.35 11.0333

... (20 rows omitted)