04 04 Case Study Summer Olympics - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki

Medals in the Summer Olympics

Loading Olympic edition DataFrame

#Import pandas
import pandas as pd

# Create file path: file_path
file_path = 'Summer Olympic medallists 1896 to 2008 - EDITIONS.tsv'

# Load DataFrame from file_path: editions
editions = pd.read_csv(file_path,sep='\t')

# Extract the relevant columns: editions, dont need 'gold', 'silver', 'bronze'
editions = editions['Edition','Grand Total','City','Country'](/HannaAA17/Data-Scientist-With-Python-datacamp/wiki/'Edition','Grand-Total','City','Country')

# Print editions DataFrame
print(editions)

output

    Edition  Grand Total         City                     Country
0      1896          151       Athens                      Greece
1      1900          512        Paris                      France
2      1904          470    St. Louis               United States
3      1908          804       London              United Kingdom
4      1912          885    Stockholm                      Sweden
5      1920         1298      Antwerp                     Belgium
6      1924          884        Paris                      France
7      1928          710    Amsterdam                 Netherlands
8      1932          615  Los Angeles               United States
9      1936          875       Berlin                     Germany
10     1948          814       London              United Kingdom
11     1952          889     Helsinki                     Finland
12     1956          885    Melbourne                   Australia
13     1960          882         Rome                       Italy
14     1964         1010        Tokyo                       Japan
15     1968         1031  Mexico City                      Mexico
16     1972         1185       Munich  West Germany (now Germany)
17     1976         1305     Montreal                      Canada
18     1980         1387       Moscow       U.S.S.R. (now Russia)
19     1984         1459  Los Angeles               United States
20     1988         1546        Seoul                 South Korea
21     1992         1705    Barcelona                       Spain
22     1996         1859      Atlanta               United States
23     2000         2015       Sydney                   Australia
24     2004         1998       Athens                      Greece
25     2008         2042      Beijing                       China

Loading IOC codes DataFrame

For the analysis that follows, you want to keep only the useful columns from ioc_codes: 'Country' and 'NOC' (the column 'NOC' contains three-letter codes representing each country).

# Import pandas
import pandas as pd

# Create the file path: file_path
file_path = 'Summer Olympic medallists 1896 to 2008 - IOC COUNTRY CODES.csv'

# Load DataFrame from file_path: ioc_codes
ioc_codes = pd.read_csv(file_path)

# Extract the relevant columns: ioc_codes
ioc_codes = ioc_codes['Country','NOC'](/HannaAA17/Data-Scientist-With-Python-datacamp/wiki/'Country','NOC')

# Print first and last 5 rows of ioc_codes
print(ioc_codes.head())
print(ioc_codes.tail())

output

          Country  NOC
0      Afghanistan  AFG
1          Albania  ALB
2          Algeria  ALG
3  American Samoa*  ASA
4          Andorra  AND
             Country  NOC
196          Vietnam  VIE
197  Virgin Islands*  ISV
198            Yemen  YEM
199           Zambia  ZAM
200         Zimbabwe  ZIM

Building medals DataFrame

You have a sequence of files summer_1896.csv, summer_1900.csv, …, summer_2008.csv, one for each Olympic edition (year).

# Import pandas
import pandas as pd

# Create empty dictionary: medals_dict
medals_dict = {}

for year in editions['Edition']:

    # Create the file path: file_path
    file_path = 'summer_{:d}.csv'.format(year)
    
    # Load file_path into a DataFrame: medals_dict[year]
    medals_dict[year] = pd.read_csv(file_path)
    
    # Extract relevant columns: medals_dict[year]
    medals_dict[year] = medals_dict[year]['Athlete','NOC','Medal'](/HannaAA17/Data-Scientist-With-Python-datacamp/wiki/'Athlete','NOC','Medal')
    
    # Assign year to column 'Edition' of medals_dict
    medals_dict[year]['Edition'] = year
    
# Concatenate medals_dict: medals
medals = pd.concat(medals_dict,ignore_index=True)

# Print first and last 5 rows of medals
print(medals.head())
print(medals.tail())

output

              Athlete  NOC   Medal  Edition
0       HAJOS, Alfred  HUN    Gold     1896
1    HERSCHMANN, Otto  AUT  Silver     1896
2   DRIVAS, Dimitrios  GRE  Bronze     1896
3  MALOKINIS, Ioannis  GRE    Gold     1896
4  CHASAPIS, Spiridon  GRE  Silver     1896
                    Athlete  NOC   Medal  Edition
29211        ENGLICH, Mirko  GER  Silver     2008
29212  MIZGAITIS, Mindaugas  LTU  Bronze     2008
29213       PATRIKEEV, Yuri  ARM  Bronze     2008
29214         LOPEZ, Mijain  CUB    Gold     2008
29215        BAROEV, Khasan  RUS  Silver     2008

Quantifying Performance

Counting medals by country/edition in a pivot table

# Construct the pivot_table: medal_counts
medal_counts = medals.pivot_table(values='Athlete',index='Edition',aggfunc='count',columns='NOC')

# Print the first & last 5 rows of medal_counts
print(medal_counts.head())
print(medal_counts.tail())

output

NOC      AFG  AHO  ALG   ANZ  ARG  ...  VIE  YUG  ZAM  ZIM   ZZX
Edition                            ...                          
1896     NaN  NaN  NaN   NaN  NaN  ...  NaN  NaN  NaN  NaN   6.0
1900     NaN  NaN  NaN   NaN  NaN  ...  NaN  NaN  NaN  NaN  34.0
1904     NaN  NaN  NaN   NaN  NaN  ...  NaN  NaN  NaN  NaN   8.0
1908     NaN  NaN  NaN  19.0  NaN  ...  NaN  NaN  NaN  NaN   NaN
1912     NaN  NaN  NaN  10.0  NaN  ...  NaN  NaN  NaN  NaN   NaN

[5 rows x 138 columns]
NOC      AFG  AHO  ALG  ANZ   ARG  ...  VIE   YUG  ZAM  ZIM  ZZX
Edition                            ...                          
1992     NaN  NaN  2.0  NaN   2.0  ...  NaN   NaN  NaN  NaN  NaN
1996     NaN  NaN  3.0  NaN  20.0  ...  NaN  26.0  1.0  NaN  NaN
2000     NaN  NaN  5.0  NaN  20.0  ...  1.0  26.0  NaN  NaN  NaN
2004     NaN  NaN  NaN  NaN  47.0  ...  NaN   NaN  NaN  3.0  NaN
2008     1.0  NaN  2.0  NaN  51.0  ...  1.0   NaN  NaN  4.0  NaN

[5 rows x 138 columns]

Computing fraction of medals per Olympic edition

# Set Index of editions: totals
totals = editions.set_index('Edition')

# Reassign totals['Grand Total']: totals is a Pandas Series
totals = totals['Grand Total']

# Divide medal_counts by totals: fractions
fractions = medal_counts.divide(totals,axis='rows')

# Print first & last 5 rows of fractions
print(fractions.head())
print(fractions.tail())

output

NOC      AFG  AHO  ALG       ANZ  ARG  ...  VIE  YUG  ZAM  ZIM       ZZX
Edition                                ...                              
1896     NaN  NaN  NaN       NaN  NaN  ...  NaN  NaN  NaN  NaN  0.039735
1900     NaN  NaN  NaN       NaN  NaN  ...  NaN  NaN  NaN  NaN  0.066406
1904     NaN  NaN  NaN       NaN  NaN  ...  NaN  NaN  NaN  NaN  0.017021
1908     NaN  NaN  NaN  0.023632  NaN  ...  NaN  NaN  NaN  NaN       NaN
1912     NaN  NaN  NaN  0.011299  NaN  ...  NaN  NaN  NaN  NaN       NaN

[5 rows x 138 columns]
NOC          AFG  AHO       ALG  ANZ       ARG  ...       VIE       YUG       ZAM       ZIM  ZZX
Edition                                         ...                                             
1992         NaN  NaN  0.001173  NaN  0.001173  ...       NaN       NaN       NaN       NaN  NaN
1996         NaN  NaN  0.001614  NaN  0.010758  ...       NaN  0.013986  0.000538       NaN  NaN
2000         NaN  NaN  0.002481  NaN  0.009926  ...  0.000496  0.012903       NaN       NaN  NaN
2004         NaN  NaN       NaN  NaN  0.023524  ...       NaN       NaN       NaN  0.001502  NaN
2008     0.00049  NaN  0.000979  NaN  0.024976  ...  0.000490       NaN       NaN  0.001959  NaN

[5 rows x 138 columns]

Computing percentage change in fraction of medals won

The expanding mean provides a way to see this down each column. It is the value of the mean with all the data available up to that point in time.

# Apply the expanding mean: mean_fractions
mean_fractions = fractions.expanding().mean()

# Compute the percentage change: fractions_change
fractions_change = mean_fractions.pct_change()*100

# Reset the index of fractions_change: fractions_change ??why reset
fractions_change = fractions_change.reset_index()

# Print first & last 5 rows of fractions_change
print(fractions_change.head())
print(fractions_change.tail())

output

NOC  Edition  AFG  AHO  ALG        ANZ  ...  VIE  YUG  ZAM  ZIM        ZZX
0       1896  NaN  NaN  NaN        NaN  ...  NaN  NaN  NaN  NaN        NaN
1       1900  NaN  NaN  NaN        NaN  ...  NaN  NaN  NaN  NaN  33.561198
2       1904  NaN  NaN  NaN        NaN  ...  NaN  NaN  NaN  NaN -22.642384
3       1908  NaN  NaN  NaN        NaN  ...  NaN  NaN  NaN  NaN   0.000000
4       1912  NaN  NaN  NaN -26.092774  ...  NaN  NaN  NaN  NaN   0.000000

[5 rows x 139 columns]
NOC  Edition  AFG  AHO        ALG  ANZ  ...       VIE       YUG        ZAM        ZIM  ZZX
21      1992  NaN  0.0  -7.214076  0.0  ...       NaN  0.000000   0.000000   0.000000  0.0
22      1996  NaN  0.0   8.959211  0.0  ...       NaN -2.667732 -10.758472   0.000000  0.0
23      2000  NaN  0.0  19.762488  0.0  ...       NaN -2.696445   0.000000   0.000000  0.0
24      2004  NaN  0.0   0.000000  0.0  ...  0.000000  0.000000   0.000000 -43.491929  0.0
25      2008  NaN  0.0  -8.197807  0.0  ... -0.661117  0.000000   0.000000 -23.316533  0.0

[5 rows x 139 columns]

Reshaping and plotting

Building hosts DataFrame

# Import pandas
import pandas as pd

# Left join editions and ioc_codes: hosts
hosts = pd.merge(editions,ioc_codes,on='Country',how='left')

# Extract relevant columns and set index: hosts
hosts = hosts['Edition','NOC'](/HannaAA17/Data-Scientist-With-Python-datacamp/wiki/'Edition','NOC').set_index('Edition')

# Fix missing 'NOC' values of hosts
print(hosts.loc[hosts.NOC.isnull()])
hosts.loc[1972, 'NOC'] = 'FRG'
hosts.loc[1980, 'NOC'] = 'URS'
hosts.loc[1988, 'NOC'] = 'KOR'

# Reset Index of hosts: hosts
hosts = hosts.reset_index()

# Print hosts
print(hosts)

output

         NOC
Edition     
1972     NaN
1980     NaN
1988     NaN
    Edition  NOC
0      1896  GRE
1      1900  FRA
2      1904  USA
3      1908  GBR
4      1912  SWE
5      1920  BEL
6      1924  FRA
7      1928  NED
8      1932  USA
9      1936  GER
10     1948  GBR
11     1952  FIN
12     1956  AUS
13     1960  ITA
14     1964  JPN
15     1968  MEX
16     1972  FRG
17     1976  CAN
18     1980  URS
19     1984  USA
20     1988  KOR
21     1992  ESP
22     1996  USA
23     2000  AUS
24     2004  GRE
25     2008  CHN

Reshaping for analysis

reshape fraction_change

# Import pandas
import pandas as pd

# Reshape fractions_change: reshaped
reshaped = pd.melt(fractions_change, id_vars='Edition',value_name='Change')

# Print reshaped.shape and fractions_change.shape
print(reshaped.shape, fractions_change.shape)

# Extract rows from reshaped where 'NOC' == 'CHN': chn
chn = reshaped[reshaped['NOC']=='CHN']

# Print last 5 rows of chn with .tail()
print(chn.tail())

output

(3588, 3) (26, 139)
     Edition  NOC     Change
567     1992  CHN   4.240630
568     1996  CHN   7.860247
569     2000  CHN  -3.851278
570     2004  CHN   0.128863
571     2008  CHN  13.251332

Merging to compute influence

# Import pandas
import pandas as pd

# Merge reshaped and hosts: merged
merged = pd.merge(reshaped,hosts,how="inner")

# Print first 5 rows of merged
print(merged.head())

# Set Index of merged and sort it: influence
influence = merged.set_index('Edition').sort_index()

# Print first 5 rows of influence
print(influence.head())

output

   Edition  NOC     Change
0     1956  AUS  54.615063
1     2000  AUS  12.554986
2     1920  BEL  54.757887
3     1976  CAN  -2.143977
4     2008  CHN  13.251332
         NOC      Change
Edition                 
1896     GRE         NaN
1900     FRA  198.002486
1904     USA  199.651245
1908     GBR  134.489218
1912     SWE   71.896226

Plotting influence of host country

# Import pyplot
import matplotlib.pyplot as plt

# Extract influence['Change']: change
change = influence['Change']

# Make bar plot of change: ax
ax = change.plot(kind='bar')

# Customize the plot to improve readability
ax.set_ylabel("% Change of Host Country Medal Count")
ax.set_title("Is there a Host Country Advantage?")
ax.set_xticklabels(editions['City'])

# Display the plot
plt.show()