Week 05 - Midterm Practice

Sections 31 and 36, solutions at dchotai.github.io/resources

In [1]:
# Import some modules to use
import numpy as np
from datascience import *

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")

Tables

We'll look at some data regarding the prices of mineral water in different countries, as well as the average precipitation of some countries. Unfortunately, our datasets don't describe the world in the same year, but for our purposes we'll overlook the differences in recency.

In [2]:
# Just run this
mineral_water = Table().read_table('mineral-water-price-by-city-2015.csv').drop('FREQ', 
                                                                               'TIME_FORMAT', 
                                                                               'UNIT_MEASURE', 
                                                                               'UNIT_MULT')
# OBS_VALUE represents approximate price ranking of a 1.5 liter bottle of mineral water
mineral_water
Out[2]:
Year OBS_VALUE COUNTRY CITY
2014 0.28 Algeria Algiers
2012 1.24 Argentina Buenos Aires
2011 1.77 Australia Melbourne
2013 2.04 Australia Melbourne
2011 0.54 Azerbaijan Baku
2011 1.59 Belgium Antwerp
2011 1.36 Belgium Brussels
2012 0.58 Bosnia And Herzegovina Banja Luka
2012 0.55 Bosnia And Herzegovina Sarajevo
2014 0.63 Brazil Belo Horizonte

... (1573 rows omitted)

In [3]:
# Just run this
precipitation = Table().read_table('precipitation2014.csv')
# Average precipitation in 2014 by country in millimeters
precipitation
Out[3]:
Country mm
Afghanistan 327
Albania 1485
Algeria 89
Angola 1010
Argentina 591
Armenia 562
Australia 534
Austria 1110
Azerbaijan 447
Bahamas, The 1292

... (156 rows omitted)

Does mineral water cost more in countries with lower average precipitation throughout the year? To find out, we need to determine if there is a relationship between mineral water prices and average precipitation in a country.

1a.

Notice that our mineral_water table has data from different years, whereas our precipitation table only contains data from 2014. How can we construct a table water2014 that contains the 2014 data from mineral_water as well as each country's corresponding average precipitation?

In [4]:
# SOLUTION
water2014 = mineral_water.where('Year', 2014).join('COUNTRY', precipitation, 'Country')
water2014
Out[4]:
COUNTRY Year OBS_VALUE CITY mm
Albania 2014 0.47 Tirana 1485
Algeria 2014 0.28 Algiers 89
Argentina 2014 1.35 Buenos Aires 591
Armenia 2014 0.53 Yerevan 562
Australia 2014 1.86 Adelaide 534
Australia 2014 1.93 Brisbane 534
Australia 2014 1.92 Perth 534
Australia 2014 4.26 Darwin 534
Australia 2014 1.39 Melbourne 534
Australia 2014 1.86 Canberra 534

... (306 rows omitted)
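
The filter-then-join logic can also be sketched in plain Python, without the datascience library. This is only an illustration: the rows are copied from the table previews above, and the names water_rows, precip_by_country, and joined are made up for this sketch. As with join, a row survives only if its country appears in both tables.

```python
# A few rows copied from the mineral_water preview: (Year, OBS_VALUE, COUNTRY, CITY)
water_rows = [
    (2014, 0.28, "Algeria", "Algiers"),
    (2012, 1.24, "Argentina", "Buenos Aires"),
    (2011, 1.77, "Australia", "Melbourne"),
    (2014, 0.63, "Brazil", "Belo Horizonte"),
]

# Precipitation in mm, copied from the precipitation preview.
# Brazil is not in that preview, so it is left out here on purpose
# to show how an unmatched row drops out of the result.
precip_by_country = {"Algeria": 89, "Argentina": 591, "Australia": 534}

# Step 1: keep only the 2014 rows (the role of .where('Year', 2014))
rows_2014 = [row for row in water_rows if row[0] == 2014]

# Step 2: attach precipitation, keeping only countries found in both
# collections (the role of .join)
joined = [
    (country, year, obs, city, precip_by_country[country])
    for (year, obs, country, city) in rows_2014
    if country in precip_by_country
]

print(joined)
```

In this sketch only the Algeria row remains: Argentina and Australia are removed by the year filter, and Brazil is removed because the lookup has no entry for it.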

1b.

The water2014 table has multiple cities listed for each country, which means we have multiple OBS values for a specific country. However, each city of a given country has the same average precipitation. We can make our table more uniform by listing the average OBS value of each country. Construct a table uniform2014 that contains the countries from water2014, their average OBS values, and their average precipitation.

In [5]:
# SOLUTION
uniform2014 = water2014.group('COUNTRY', np.average).drop('Year average', 'CITY average')
uniform2014
Out[5]:
COUNTRY OBS_VALUE average mm average
Albania 0.47 1485
Algeria 0.28 89
Argentina 1.35 591
Armenia 0.53 562
Australia 1.99375 534
Austria 0.84 1110
Azerbaijan 0.86 447
Bangladesh 0.35 2666
Belarus 0.94 618
Belgium 1.32 847

... (82 rows omitted)

1c.

Is there a relationship between average OBS value and average precipitation? Why or why not? Make an appropriate visualization to help answer this question.

In [6]:
# SOLUTION
uniform2014.scatter('mm average', 'OBS_VALUE average')

SOLUTION: According to the scatter plot, there seems to be no association between the two variables.

Functions

2a.

Suppose you open a special bank account that has a 4.2% annual interest rate (your balance grows 4.2% every year). Because the interest rate is relatively high, the bank says that the maximum amount you can initially deposit is $100; you cannot deposit anything after opening the account. You decide to take the bank’s offer and deposit $100 when you open your account. Write a function new_balance that returns the balance of your account after x years. Assume that you never withdraw any money from the account within those x years.

In [7]:
# SOLUTION
def new_balance(x):
    return 100 * (1 + .042) ** x
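
As a sanity check, the closed-form expression can be compared against applying the 4.2% growth one year at a time. This is a minimal sketch; balance_by_loop is a hypothetical helper introduced only for the comparison, and new_balance is repeated so the block runs on its own.

```python
def new_balance(x):
    # Closed form: the balance is multiplied by 1.042 once per year
    return 100 * (1 + .042) ** x

def balance_by_loop(years):
    # Hypothetical helper: apply one year of interest at a time
    balance = 100
    for _ in range(years):
        balance = balance * 1.042
    return balance

# The two columns should agree up to floating-point rounding
for years in (0, 1, 10):
    print(years, round(new_balance(years), 2), round(balance_by_loop(years), 2))
```

Both approaches give the same balance because multiplying by 1.042 a total of x times is exactly what raising 1.042 to the power x means.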

2b.

Even Steven and Odd Todd decide to have a contest to see who can do more pushups. Even Steven is allergic to odd numbers, and Odd Todd has a strong aversion to even numbers. The two friends settle on a compromise for the rules: Even Steven will do pushups on even-indexed days, and Odd Todd will do pushups on odd-indexed days. They decide to hold the contest for 20 days, so that each person does pushups on 10 days. Suppose they log their results in a table called pushups after ending the contest. Here are the first 10 rows of pushups:

Day Pushups
0 44
1 29
2 34
3 25
4 28
5 19
6 28
7 1
8 22
9 5

Even Steven claims that the winner should be the person that had the most pushups per day on average. Odd Todd is still mad that he only had one pushup on day 7, so he claims that the winner should be the person that had more pushups overall. Write a function steven_method that returns the name of the winner using Even Steven’s method, and write a function todd_method that returns the name of the winner using Odd Todd’s method. For both functions, in the case of a tie, return the string “Tie”. Both functions should take in some table as input; assume this table will have the same labels and dimensions as pushups above (20 rows, 2 columns).

In [8]:
# SOLUTION
def steven_method(tbl):
    even_pushups = tbl.take(np.arange(0, 20, 2)).column('Pushups')
    odd_pushups = tbl.take(np.arange(1, 20, 2)).column('Pushups')
    even_mean = np.average(even_pushups)
    odd_mean = np.average(odd_pushups)
    if even_mean > odd_mean:
        return "Even Steven"
    elif even_mean < odd_mean:
        return "Odd Todd"
    else:
        return "Tie"
    
def todd_method(tbl):
    # Use the tbl argument rather than the global pushups table
    even_pushups = tbl.take(np.arange(0, 20, 2)).column('Pushups')
    odd_pushups = tbl.take(np.arange(1, 20, 2)).column('Pushups')
    even_total = np.sum(even_pushups)
    odd_total = np.sum(odd_pushups)
    if even_total > odd_total:
        return "Even Steven"
    elif even_total < odd_total:
        return "Odd Todd"
    else:
        return "Tie"
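
The even/odd split can be checked by hand on the ten rows shown above. The sketch below uses a plain Python list and slicing instead of a Table, so it is an illustration of the logic rather than the solution itself; the variable names are made up for this example.

```python
# Pushup counts for days 0 through 9, copied from the table preview
pushups_first_10 = [44, 29, 34, 25, 28, 19, 28, 1, 22, 5]

# Every second entry starting at index 0 (Even Steven) or index 1 (Odd Todd)
even_days = pushups_first_10[0::2]
odd_days = pushups_first_10[1::2]

# Odd Todd's method compares totals
even_total = sum(even_days)
odd_total = sum(odd_days)

# Even Steven's method compares per-day averages
even_mean = even_total / len(even_days)
odd_mean = odd_total / len(odd_days)

print(even_total, odd_total)
print(even_mean, odd_mean)
```

Because both contestants do pushups on the same number of days, comparing averages and comparing totals always pick the same winner under these rules.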

2c.

Even Steven and Odd Todd decide to hold another contest with the same rules. This contest becomes so popular that people begin betting on who they think will win. The betting community's consensus is that Even Steven's method of determining a winner is best for simulation. The grand prize for betting on a tie with Even Steven's method includes a large sum of money and free lunch with both Even Steven and Odd Todd after the contest. An obsessed fan wants to know his chances of winning the grand prize for betting on a tie. Write a function tie_prob that simulates 1000 contests of the same rules between Even Steven and Odd Todd, and returns the probability of the end result being a tie using Even Steven's method of determining a winner. Assume that the maximum possible pushups either person can do is 50.

Hint: You'll need to construct a new table in the same format as pushups to use with steven_method. You can use np.random.randint(start, end, size=n) to make an array of n random integers in the half-open interval [start, end), so end itself is never drawn.

In [9]:
# SOLUTION
def tie_prob():
    outcomes = make_array()
    for i in np.arange(1000):
        # np.random.randint excludes its upper bound, so 51 allows a maximum of 50 pushups
        sim_table = Table().with_columns('Day', np.arange(20),
                            'Pushups', np.random.randint(0, 51, size=20))
        outcomes = np.append(outcomes, steven_method(sim_table))
    tie_proportion = np.count_nonzero(outcomes == 'Tie') / 1000
    return tie_proportion

In [10]:
# Just how small is this probability?
tie_prob()
Out[10]:
0.003
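
For a rough cross-check that does not depend on the datascience library, the same simulation can be sketched with Python's standard random module. The function name tie_prob_sketch and its trials and seed parameters are assumptions made for this example. Note that random.randint includes both endpoints, unlike np.random.randint.

```python
import random

def tie_prob_sketch(trials=10000, seed=0):
    """Estimate the chance of a tie under Even Steven's method (a sketch)."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same estimate
    ties = 0
    for _ in range(trials):
        # One contest: 20 days, each with 0 to 50 pushups (both ends included)
        pushups = [rng.randint(0, 50) for _ in range(20)]
        even_mean = sum(pushups[0::2]) / 10
        odd_mean = sum(pushups[1::2]) / 10
        if even_mean == odd_mean:
            ties += 1
    return ties / trials

print(tie_prob_sketch())
```

Using more trials than the 1000 asked for in the question should make the estimate less noisy, since a tie is rare enough that 1000 contests produce only a handful of them.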

Probability

Marshawn Lynch has a jar of assorted Skittles. This table describes the distribution of flavors present in the jar:

Flavor Quantity
Grape 36
Lemon 14
Green Apple 41
Orange 34
Strawberry 25

3a.

Write a Python expression that evaluates to the probability that Marshawn draws a lemon Skittle followed by a grape Skittle (without replacement).

In [11]:
# SOLUTION
(14/150)*(36/149)
Out[11]:
0.0225503355704698

3b.

What is the probability that Marshawn will draw three non-orange Skittles in a row without replacement? Write a Python expression that evaluates to this probability.

In [12]:
# SOLUTION
# There are 150 total Skittles at the start, 34 of which are orange
# After drawing one, we have 149 left (sampling without replacement), and 148 after another draw
(150 - 34)/150 * (149 - 34)/149 * (148-34)/148
Out[12]:
0.45974968256847454
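
The same probability can be reached by counting instead of multiplying draw by draw: choose 3 Skittles from the 116 non-orange ones, out of all ways to choose 3 from 150. The sketch below compares the two approaches; the variable names are illustrative.

```python
from math import comb

total, orange, draws = 150, 34, 3

# Draw-by-draw product, as in the solution above
p_sequential = (150 - 34)/150 * (149 - 34)/149 * (148 - 34)/148

# Counting argument: favorable 3-Skittle selections over all 3-Skittle selections
p_counting = comb(total - orange, draws) / comb(total, draws)

print(p_sequential, p_counting)
```

The two values should agree up to floating-point rounding, because the factor of 3! from ordering the draws cancels between numerator and denominator.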

3c.

Suppose Marshawn draws four Skittles without replacement. Write a Python expression that evaluates to the probability that at least one of his drawn Skittles is grape-flavored.

In [13]:
# SOLUTION
# There are 150 total Skittles at the start, 36 of which are grape-flavored
# P(at least one of his Skittles is grape-flavored) is the same as 1 - P(none are grape-flavored)
1 - ((150 - 36)/150 * (149 - 36)/149 * (148 - 36)/148 * (147 - 36)/147)
Out[13]:
0.6706423777564717
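
The complement pattern generalizes to any flavor and any number of draws. Below is a minimal sketch of a helper, prob_none_of_flavor, which is a hypothetical name introduced here and not part of the original solutions.

```python
def prob_none_of_flavor(total, flavor_count, draws):
    """P(no Skittle of the given flavor appears in `draws` draws without replacement)."""
    prob = 1.0
    for i in range(draws):
        # On draw i, (total - i) Skittles remain, and
        # (total - flavor_count - i) of them are not the given flavor
        prob *= (total - flavor_count - i) / (total - i)
    return prob

# At least one grape Skittle in four draws (36 grape out of 150)
p_at_least_one_grape = 1 - prob_none_of_flavor(150, 36, 4)
print(p_at_least_one_grape)

# The same helper covers three non-orange Skittles in a row (34 orange out of 150)
print(prob_none_of_flavor(150, 34, 3))
```

Computing the "none" case and subtracting from 1 avoids adding up the separate cases of exactly one, two, three, or four grape Skittles.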