Python for Data Science Groupby and Multi Index in Data Frame

Groupby and Multi Index in Data Frame

In this class, We discuss Groupby and Multi-Index in Data Frame.

For Complete YouTube Video: Click Here

The reader should have prior knowledge of data frame methods. Click here.

Groupby and Multi-Index in Data Frame

Groupby Method

The groupby method is used to group the values in the data frame.

Take an example and try to understand the concept of groupby in the data frame.

We consider sample superstore data set. The discussion about the data set is done in our previous class.

In our data set, the sale is considered in four regions.

Central, south, east, and west regions.

We want to consider the total sale according to region. We have to group the data according to region.

To group the elements according to the region, we use the groupby method.

The groupby method will group the data according to the region and return the groupby object.

On the groupby object, we select the sales and sum all the sales according to region.

The final result is converted to a data frame.

The code to groupby region and sum sales is given below.

import pandas as pd
df=pd.read_excel('sampledata.xls',sheet_name='Orders')
print(df.head())

# Applying group by on Region and Region wise sum of sale
temp=pd.DataFrame(df.groupby(['Region'])['Sales'].agg('sum'))
print(temp)

Output:
               Sales
Region              
Central  501239.8908
East     678781.2400
South    391721.9050
West     725457.8245

print(temp.loc['Central'])

Sales    501239.8908
Name: Central, dtype: float64

The result obtained after the above code consists of region values as index and sales sum as a data column.

Multi-Index using Groupby

If we apply the groupby method on two columns. We get multi-index.

We consider region and the category-wise sum of sales.

Example: Central region Furniture category total sale.

The categories available are furniture, office supplies, and Technology.

Groupby on multiple columns code is shown below.

# Groupby on Multiple columns we get multi index
temp1=pd.DataFrame(df.groupby(['Region','Category'])['Sales'].agg('sum'))
print(temp1)

Output:
                               Sales
Region  Category                    
Central Furniture        163797.1638
        Office Supplies  167026.4150
        Technology       170416.3120
East    Furniture        208291.2040
        Office Supplies  205516.0550
        Technology       264973.9810
South   Furniture        117298.6840
        Office Supplies  125651.3130
        Technology       148771.9080
West    Furniture        252612.7435
        Office Supplies  220853.2490
        Technology       251991.8320

print(temp1.index)

Output:
MultiIndex([('Central',       'Furniture'),
            ('Central', 'Office Supplies'),
            ('Central',      'Technology'),
            (   'East',       'Furniture'),
            (   'East', 'Office Supplies'),
            (   'East',      'Technology'),
            (  'South',       'Furniture'),
            (  'South', 'Office Supplies'),
            (  'South',      'Technology'),
            (   'West',       'Furniture'),
            (   'West', 'Office Supplies'),
            (   'West',      'Technology')],
           names=['Region', 'Category'])

In the output shown above, we have a multi-index. The first index is a region, and the second index is in the category.

How to access data using multi-index in a data frame?

Accessing data using multi-index

One way to access data using multi-index is by the xs method.

The code to access data is given below.

# xs method to access elements using multi index
print(temp1.xs('Central'))

Output:
                       Sales
Category                    
Furniture        163797.1638
Office Supplies  167026.4150
Technology       170416.3120

print(temp1.xs('Furniture',level='Category'))

Output:
               Sales
Region              
Central  163797.1638
East     208291.2040
South    117298.6840
West     252612.7435

In the first option, we have given the input argument central. The index related to the central region is displayed.

To consider the second index, we pass the argument category to the level parameter.

In our second example, we have given argument furniture and level as a category to access only furniture values.

Another way to access the data using multi-index by loc attribute

# accessing multi index using loc
print(temp1.loc['Central',:])

Output:
                       Sales
Category                    
Furniture        163797.1638
Office Supplies  167026.4150
Technology       170416.3120

In the above example, the argument central is taken to search in the first index.

, : is given to take all the columns in the data. We have only one column of sales.

A tuple of arguments is to be given to take multi-index.

The example of taking a tuple of arguments is shown below.

print(temp1.loc[('Central','Furniture'),:])

Output:
Sales    163797.1638
Name: (Central, Furniture), dtype: float64

An example to represent column names is given below.

# we mention column name
print(temp1.loc[('Central','Furniture'),'Sales'])

Output:
163797.1638

An example of accessing multiple lines is given below.

print(temp1.loc[[("Central", "Furniture"), ("East", "Office Supplies")]])

Output:
                               Sales
Region  Category                    
Central Furniture        163797.1638
East    Office Supplies  205516.0550

Access multiple lines of data using slicing is given below.

print(temp1.loc[("Central", "Furniture"):("East", "Office Supplies")])

Output:
                               Sales
Region  Category                    
Central Furniture        163797.1638
        Office Supplies  167026.4150
        Technology       170416.3120
East    Furniture        208291.2040
        Office Supplies  205516.0550

Slicing can also be done using the slice function.

Examples using the slicing function are provided below.

print(temp1.loc[slice('Central','East'),:])

Output:
                               Sales
Region  Category                    
Central Furniture        163797.1638
        Office Supplies  167026.4150
        Technology       170416.3120
East    Furniture        208291.2040
        Office Supplies  205516.0550
        Technology       264973.9810

print(temp1.loc[slice('Central','East'),slice('Furniture','Office Supplies'),:])

Output:
                               Sales
Region  Category                    
Central Furniture        163797.1638
        Office Supplies  167026.4150
East    Furniture        208291.2040
        Office Supplies  205516.0550

print(temp1.loc[slice(None),slice('Furniture'),:])

Output:
                         Sales
Region  Category              
Central Furniture  163797.1638
East    Furniture  208291.2040
South   Furniture  117298.6840
West    Furniture  252612.7435

Previous Lesson

Back to Course

Next Lesson