04 02 Concatenating data - HannaAA17/Data-Scientist-With-Python-datacamp GitHub Wiki
Appending and Concatenating Series
.append()
:Series and DataFrame method -> s1.append(s2) row-wiseconcat()
: pandas module function -> pd.concat([s1,s2,s3)] row-wise and column wise
Concatenating panda Series along row axis
# Initialize empty list: units
units = []
# Build the list of Series
for month in [jan, feb, mar]:
units.append(month['Units'])
# Concatenate the list: quarter1
quarter1 = pd.concat(units)
# Print slices from quarter1
print(quarter1.loc['jan 27, 2015':'feb 2, 2015'])
print(quarter1.loc['feb 26, 2015':'mar 7, 2015'])
Output:
Date
2015-01-27 07:11:55 18
2015-02-02 08:33:01 3
2015-02-02 20:54:49 9
Name: Units, dtype: int64
Date
2015-02-26 08:57:45 4
2015-02-26 08:58:51 1
2015-03-06 10:11:45 17
2015-03-06 02:03:56 17
Name: Units, dtype: int64
.reset_Index()
:new_east = northeast.append(south).reset_index(drop=True)
如果不加reset_index的话 就是0-10,0-10; 加了后是0-21ignore_index()
:new_east = pd.concat([northeast,south],ignore_index=True)
Appending and concatenating DataFrames
Appending DataFrames with ignore_index
# Add 'year' column to names_1881 and names_1981
names_1881['year'] = 1881
names_1981['year'] = 1981
# Append names_1981 after names_1881 with ignore_index=True: combined_names
combined_names = names_1881.append(names_1981, ignore_index=True)
# Print shapes of names_1981, names_1881, and combined_names
print(names_1981.shape)
print(names_1881.shape)
print(combined_names.shape)
# Print all rows that contain the name 'Morgan'
print(combined_names.loc[combined_names.name == 'Morgan'])
# output
(19455, 4)
(1935, 4)
(21390, 4)
name gender count year
1283 Morgan M 23 1881
2096 Morgan F 1769 1981
14390 Morgan M 766 1981
Concatenating pandas DataFrames along column axis
- specify
axis=1
oraxis='columns'
to concatenate DFs horizontally pd.concat([population, unemployment], axis=1)
Reading multiple files to build a DataFrame
It is often convenient to build a large DataFrame by parsing many files as DataFrames and concatenating them all at once. You'll do this here with three files, but, in principle, this approach can be used to combine data from dozens or hundreds of files.
#Initialize an empty list: medals
medals =[]
for medal in medal_types: #preloads ['bronze','silver','gold']
# Create the file name: file_name
file_name = "%s_top5.csv" % medal
# Create list of column names: columns
columns = ['Country', medal]
# Read file_name into a DataFrame: medal_df
medal_df = pd.read_csv(file_name,header=0,index_col='Country',names=columns)
# Append medal_df to medals
medals.append(medal_df)
# Concatenate medals horizontally: medals_df
medals_df = pd.concat(medals,axis='columns')
# Print medals_df
print(medals_df)
# output
bronze silver gold
France 475.0 461.0 NaN
Germany 454.0 NaN 407.0
Italy NaN 394.0 460.0
Soviet Union 584.0 627.0 838.0
United Kingdom 505.0 591.0 498.0
United States 1052.0 1195.0 2088.0
Concatenation, keys and MultiIndexes
- specify the keys parameter in the call to pd.concat(), which generates a hierarchical index with the labels from keys as the outermost index label
for medal in medal_types:
file_name = "%s_top5.csv" % medal
# Read file_name into a DataFrame: medal_df
medal_df = pd.read_csv(file_name,index_col='Country')
# Append medal_df to medals
medals.append(medal_df)
# Concatenate medals: medals
medals = pd.concat(medals, keys=['bronze', 'silver', 'gold'])
# Print medals in entirety
print(medals)
# output
Total
Country
bronze United States 1052.0
Soviet Union 584.0
United Kingdom 505.0
France 475.0
Germany 454.0
silver United States 1195.0
Soviet Union 627.0
United Kingdom 591.0
France 461.0
Italy 394.0
gold United States 2088.0
Soviet Union 838.0
United Kingdom 498.0
Italy 460.0
Germany 407.0
Slicing MultiIndexed DataFrames
# Sort the entries of medals: medals_sorted
medals_sorted = medals.sort_index(level=0)
# Print the number of Bronze medals won by Germany
print(medals_sorted.loc[('bronze','Germany')])
# Print data about silver medals
print(medals_sorted.loc['silver'])
# Create alias for pd.IndexSlice: idx
idx = pd.IndexSlice
# Print all the data on medals won by the United Kingdom
print(medals_sorted.loc[idx[:,'United Kingdom'],:])
# Output
Total 454.0
Name: (bronze, Germany), dtype: float64
Total
Country
France 461.0
Italy 394.0
Soviet Union 627.0
United Kingdom 591.0
United States 1195.0
Total
Country
bronze United Kingdom 505.0
gold United Kingdom 498.0
silver United Kingdom 591.0
Concatenating DataFrames from a dict
# Make the list of tuples: month_list
month_list = [('january', jan),('february', feb),('march', mar)]
# Create an empty dictionary: month_dict
month_dict = {}
for month_name, month_data in month_list:
# Group month_data: month_dict[month_name]
month_dict[month_name] = month_data.groupby('Company').sum()
# Concatenate data in month_dict: sales
sales = pd.concat(month_dict)
# Print sales
print(sales)
# Print all sales by Mediacore
idx = pd.IndexSlice
print(sales.loc[idx[:,'Mediacore'],:])
Outer and Inter
- Outer Join: Missing fields with NaN, all labels
- Inner Join: Intersection of index sets (onlu common labels)
Resampling & concatenating DataFrames with inner join
# Resample and tidy china: china_annual
china_annual = china.resample('A').last().pct_change(10).dropna()
# Resample and tidy us: us_annual
us_annual = us.resample('A').last().pct_change(10).dropna()
# Concatenate china_annual and us_annual: gdp
gdp = pd.concat([china_annual,us_annual],join='inner',axis=1)
# Resample gdp and print
print(gdp.resample('10A').last())