Coursera – Introduction to Data Science in Python #12 (2週目その7)

Coursera『Introduction to Data Science in Python – by University of Michigan』の学習メモ

Assignment 2 – Pandas Introduction

Part 1

引き継ぎWikipediaのオリンピックメダルのデータを使ったアサインメント。


import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index) 
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')
df.head()

Question 1

Which country has won the most gold medals in summer games?

sort_valuesを使って指定したカラムの降順で並び替える。


# This function should return a single string value.
def answer_one():
    answer = df.sort_values(by='Gold', ascending=False).iloc[0].name
    return answer
answer_one()
# 'United States'

Question 2

Which country had the biggest difference between their summer and winter gold medal counts?

“Gold dif”カラムを追加し、夏と冬の金メダル数の差を絶対値 “abs()” で求める。


# This function should return a single string value.
def answer_two():
    df["Gold dif"] = abs(df["Gold"] - df["Gold.1"])
    answer = df.sort_values(by='Gold dif', ascending=False).iloc[0].name
    return answer
answer_two()
# 'United States'

Question 3

Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count? Only include countries that have won at least 1 gold in both summer and winter.

夏・冬ともに金メダルを獲得したことがある国のみにソートしてから並べ替え。


# This function should return a single string value.
def answer_three():
    df_three = df[(df['Gold'] > 0) & (df['Gold.1']  > 0)]
    df_three["Dif ratio"] = abs((df_three["Gold"] - df_three["Gold.1"]) / (df_three["Gold"] + df_three["Gold.1"]))
    answer = df_three.sort_values(by='Dif ratio', ascending=False).iloc[0].name
    return answer
answer_three()
# 'Bulgaria'

Question 4

Write a function that creates a Series called “Points” which is a weighted value where each gold medal (Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2) for 1 point. The function should return only the column (a Series object) which you created, with the country names as indices.

新しく作ったカラムの値と国名(インデックス)でSeriesを生成する。


# This function should return a Series named Points of length 146
def answer_four():
    df["Points"] = (df["Gold.2"] * 3) + (df["Silver.2"] * 2) +  (df["Bronze.2"] * 1)
    countries = pd.Series(df["Points"], index=df.index.values)
    return countries
answer_four()

Part 2

続いて Census Bureauのアメリカの人口統計データを使ったアサインメント。


census_df = pd.read_csv('census.csv')
census_df.head()

Question 5

Which state has the most counties in it? (hint: consider the sumlevel key carefully! You’ll need this for future questions too…)

ドキュメントを見ると “SUMLEV” に関する説明がある。

ここでは Countryと同等レベルの “050” のみを取り扱う。


# This function should return a single string value.
def answer_five():
    c_df = census_df[census_df['SUMLEV'] == 50]
    c_df = c_df.groupby("STNAME").count()
    return  c_df.sort_values(by='COUNTY', ascending=False).iloc[0].name
answer_five()
# 'Texas'

Question 6

Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.


# This function should return a list of string values.
def answer_six():
    c_df = census_df[census_df['SUMLEV'] == 50]
    df = pd.DataFrame([], columns=["STNAME", "TOTALPOP"])
    index_array = c_df["STNAME"].unique()    
    c_df = c_df.sort_values(by='CENSUS2010POP', ascending=False)
    for i in index_array:
        totalpop = c_df[c_df["STNAME"] == i].iloc[:3]["CENSUS2010POP"].sum()
        i_df = pd.DataFrame([[i, totalpop]], columns=["STNAME", "TOTALPOP"])
        df = df.append(i_df)
    answer = df.sort_values(by='TOTALPOP', ascending=False).iloc[:3]["STNAME"].tolist()
    return answer
answer_six()

Question 7

Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)


# This function should return a single string value.
def answer_seven():
    c_df = census_df[census_df['SUMLEV'] == 50]
    columns_to_keep = ['CTYNAME',
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
    df = c_df[columns_to_keep]
    columns_to_calculate = [
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
    df["CHANGE"] = df[max(df[columns_to_calculate])] - df[min(df[columns_to_calculate])]
    answer = df.sort_values(by='CHANGE', ascending=False).iloc[0]["CTYNAME"]
    return answer
answer_seven()
# 'Harris County'

Question 8

In this datafile, the United States is broken up into four regions using the “REGION” column. Create a query that finds the counties that belong to regions 1 or 2, whose name starts with ‘Washington’, and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.


# This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index).
def answer_eight():
    c_df = census_df[census_df['SUMLEV'] == 50][census_df['POPESTIMATE2015'] > census_df['POPESTIMATE2014']]
    c_df = c_df.query('REGION in [1, 2]')
    c_df = c_df[c_df['CTYNAME'].str.startswith('Washington')]
    return c_df[['STNAME', 'CTYNAME']]
answer_eight()

所感

調べながら解いてなんとか全問正解できた。データの加工の仕方とか全然慣れてなくて、絶対冗長な部分が多いと思う。3週目に進む前に軽くpandasの作法や使えるメソッドを復習した方が良さそう。でも、実際に考えながらコード書いてみて、データ整形するのには便利なライブラリだと実感した。

Source

兵庫県西宮市生まれのフリーランスRailsエンジニア。海外を拠点にデジタルノマド生活中。/ 前職・資格:公認会計士 / プログラミング言語:Ruby, JavaScript, HTML, CSS / 日本語・英語
コメントを残す

メールアドレスが公開されることはありません。 * が付いている欄は必須項目です