Groupby and Multi Index in Data Frame
In this class, We discuss Groupby and Multi-Index in Data Frame.
For Complete YouTube Video: Click Here
The reader should have prior knowledge of data frame methods. Click here.
Groupby and Multi-Index in Data Frame
Groupby Method
The groupby method is used to group the values in the data frame.
Take an example and try to understand the concept of groupby in the data frame.
We consider sample superstore data set. The discussion about the data set is done in our previous class.
In our data set, the sale is considered in four regions.
Central, south, east, and west regions.
We want to consider the total sale according to region. We have to group the data according to region.
To group the elements according to the region, we use the groupby method.
The groupby method will group the data according to the region and return the groupby object.
On the groupby object, we select the sales and sum all the sales according to region.
The final result is converted to a data frame.
The code to groupby region and sum sales is given below.
import pandas as pd
df=pd.read_excel('sampledata.xls',sheet_name='Orders')
print(df.head())
# Applying group by on Region and Region wise sum of sale
temp=pd.DataFrame(df.groupby(['Region'])['Sales'].agg('sum'))
print(temp)
Output:
Sales
Region
Central 501239.8908
East 678781.2400
South 391721.9050
West 725457.8245
print(temp.loc['Central'])
Sales 501239.8908
Name: Central, dtype: float64
The result obtained after the above code consists of region values as index and sales sum as a data column.
Multi-Index using Groupby
If we apply the groupby method on two columns. We get multi-index.
We consider region and the category-wise sum of sales.
Example: Central region Furniture category total sale.
The categories available are furniture, office supplies, and Technology.
Groupby on multiple columns code is shown below.
# Groupby on Multiple columns we get multi index
temp1=pd.DataFrame(df.groupby(['Region','Category'])['Sales'].agg('sum'))
print(temp1)
Output:
Sales
Region Category
Central Furniture 163797.1638
Office Supplies 167026.4150
Technology 170416.3120
East Furniture 208291.2040
Office Supplies 205516.0550
Technology 264973.9810
South Furniture 117298.6840
Office Supplies 125651.3130
Technology 148771.9080
West Furniture 252612.7435
Office Supplies 220853.2490
Technology 251991.8320
print(temp1.index)
Output:
MultiIndex([('Central', 'Furniture'),
('Central', 'Office Supplies'),
('Central', 'Technology'),
( 'East', 'Furniture'),
( 'East', 'Office Supplies'),
( 'East', 'Technology'),
( 'South', 'Furniture'),
( 'South', 'Office Supplies'),
( 'South', 'Technology'),
( 'West', 'Furniture'),
( 'West', 'Office Supplies'),
( 'West', 'Technology')],
names=['Region', 'Category'])
In the output shown above, we have a multi-index. The first index is a region, and the second index is in the category.
How to access data using multi-index in a data frame?
Accessing data using multi-index
One way to access data using multi-index is by the xs method.
The code to access data is given below.
# xs method to access elements using multi index
print(temp1.xs('Central'))
Output:
Sales
Category
Furniture 163797.1638
Office Supplies 167026.4150
Technology 170416.3120
print(temp1.xs('Furniture',level='Category'))
Output:
Sales
Region
Central 163797.1638
East 208291.2040
South 117298.6840
West 252612.7435
In the first option, we have given the input argument central. The index related to the central region is displayed.
To consider the second index, we pass the argument category to the level parameter.
In our second example, we have given argument furniture and level as a category to access only furniture values.
Another way to access the data using multi-index by loc attribute
# accessing multi index using loc
print(temp1.loc['Central',:])
Output:
Sales
Category
Furniture 163797.1638
Office Supplies 167026.4150
Technology 170416.3120
In the above example, the argument central is taken to search in the first index.
, : is given to take all the columns in the data. We have only one column of sales.
A tuple of arguments is to be given to take multi-index.
The example of taking a tuple of arguments is shown below.
print(temp1.loc[('Central','Furniture'),:])
Output:
Sales 163797.1638
Name: (Central, Furniture), dtype: float64
An example to represent column names is given below.
# we mention column name
print(temp1.loc[('Central','Furniture'),'Sales'])
Output:
163797.1638
An example of accessing multiple lines is given below.
print(temp1.loc[[("Central", "Furniture"), ("East", "Office Supplies")]])
Output:
Sales
Region Category
Central Furniture 163797.1638
East Office Supplies 205516.0550
Access multiple lines of data using slicing is given below.
print(temp1.loc[("Central", "Furniture"):("East", "Office Supplies")])
Output:
Sales
Region Category
Central Furniture 163797.1638
Office Supplies 167026.4150
Technology 170416.3120
East Furniture 208291.2040
Office Supplies 205516.0550
Slicing can also be done using the slice function.
Examples using the slicing function are provided below.
print(temp1.loc[slice('Central','East'),:])
Output:
Sales
Region Category
Central Furniture 163797.1638
Office Supplies 167026.4150
Technology 170416.3120
East Furniture 208291.2040
Office Supplies 205516.0550
Technology 264973.9810
print(temp1.loc[slice('Central','East'),slice('Furniture','Office Supplies'),:])
Output:
Sales
Region Category
Central Furniture 163797.1638
Office Supplies 167026.4150
East Furniture 208291.2040
Office Supplies 205516.0550
print(temp1.loc[slice(None),slice('Furniture'),:])
Output:
Sales
Region Category
Central Furniture 163797.1638
East Furniture 208291.2040
South Furniture 117298.6840
West Furniture 252612.7435