Pandas DataFrame – Grouping using DataFrame.groupby()

Python Pandas DataFrame Groupby

Pandas DataFrame – Grouping is a continuation of the post on the pandas DataFrame series.  If you are new to Python or DataFrames then make sure to check the previous two articles on DataFrames.

In this post, we will explore DataFrame.groupby() function.

Grouping Data - groupby

We are using MTCARS dataset for this exercise. You can download the dataset from UCI Machine Learning Reporsitory

mtcars dataset was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.

Attribute Information:

A data frame with 32 observations on 11 variables.

  • model – model name, categorical variable
  • mpg – Miles/(US) gallon, numeric variable
  • cyl  – Number of cylinders, values are 4, 6 and 8
  • disp – Displacement (cu.in.)
  • hp – Gross horsepower
  • drat –  Rear axle ratio
  • wt – Weight (1000 lbs)
  • qsec – 1/4 mile time
  • vs – Engine (0 = V-shaped, 1 = straight)
  • am – Transmission (0 = automatic, 1 = manual)
  • gear – Number of forward gears
  • carb – Number of carburetors

Source: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

Load the mtcars dataset to python workspace.

import pandas as pd
mtcars = pd.read_csv("mtcars.csv")
mtcars.head()

Output:

pandas dataframe mtcars

The DataFrame.groupby operation involves the following three steps:

  • Splitting the object
  • Applying a function
  • Combining the results

“cyl” variable, represents the number of cylinders, contains values like 4, 6, and 8. When we group by ‘cyl’ it should create three groups. Let’s check it out.

# Group by Cyl
cylGrp = mtcars.groupby('cyl')
cylGrp.groups

Output:

{4: Int64Index([2, 7, 8, 17, 18, 19, 20, 25, 26, 27, 31], dtype='int64'),
 6: Int64Index([0, 1, 3, 5, 9, 10, 29], dtype='int64'),
 8: Int64Index([4, 6, 11, 12, 13, 14, 15, 16, 21, 22, 23, 24, 28, 30], dtype='int64')}

Other important functions to remember are size(), first(), and last().

print(cylGrp.size())

Output:

cyl
4    11
6     7
8    14
dtype: int64

Selecting the first rows for each group.

print(cylGrp.first())

Output:

                 model   mpg   disp   hp  drat    wt   qsec  vs  am  gear  carb 
cyl                                                                         
4           Datsun 710  22.8  108.0   93  3.85  2.32  18.61   1   1     4     1   
6            Mazda RX4  21.0  160.0  110  3.90  2.62  16.46   0   1     4     4
8    Hornet Sportabout  18.7  360.0  175  3.15  3.44  17.02   0   0     3     2

You can also specify more than one column as a list in groupby() function. This is also known as the multicolumn groupby.

cyl_gear_Grp = mtcars.groupby(['cyl', 'gear'])
cyl_gear_Grp.groups

Output:

{(4, 3): Int64Index([20], dtype='int64'),
 (4, 4): Int64Index([2, 7, 8, 17, 18, 19, 25, 31], dtype='int64'),
 (4, 5): Int64Index([26, 27], dtype='int64'),
 (6, 3): Int64Index([3, 5], dtype='int64'),
 (6, 4): Int64Index([0, 1, 9, 10], dtype='int64'),
 (6, 5): Int64Index([29], dtype='int64'),
 (8, 3): Int64Index([4, 6, 11, 12, 13, 14, 15, 16, 21, 22, 23, 24], dtype='int64'),
 (8, 5): Int64Index([28, 30], dtype='int64')}

Group level information can be extracted using key values.

# Key value - (6, 3)
print(cyl_gear_Grp.get_group((6, 3)))

Output:

            model   mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb 
3  Hornet 4 Drive  21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1 
5         Valiant  18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1

Aggregate Method

We can apply the aggregate function sum() to the group, observe the DataFrame structure. 

cyl_gear_Grp_df  = cyl_gear_Grp.sum()
print(cyl_gear_Grp_df)

Output:

            mpg    disp    hp   drat      wt    qsec  vs  am  carb
cyl gear                                                          
4   3      21.5   120.1    97   3.70   2.465   20.01   1   0     1
    4     215.4   821.0   608  32.88  19.025  156.90   8   6    12
    5      56.4   215.4   204   8.20   3.653   33.60   1   2     4
6   3      39.5   483.0   215   5.84   6.675   39.66   2   0     2
    4      79.0   655.2   466  15.64  12.375   70.68   2   2    16
    5      19.7   145.0   175   3.62   2.770   15.50   0   1     6
8   3     180.6  4291.4  2330  37.45  49.249  205.71   0   0    37
    5      30.8   652.0   599   7.76   6.740   29.10   0   2    12

Applying multiple functions us agg() function.

cyl_gear_Grp_df = cyl_gear_Grp.agg(["sum", "mean", "size"]).reset_index()
cyl_gear_Grp_df)

Output:

groupby multiple functions

This returns the multi-index DataFrame, lets check the structure.

cyl_gear_Grp_df.columns

Output:

MultiIndex([( 'cyl',     ''),
            ('gear',     ''),
            ( 'mpg',  'sum'),
            ( 'mpg', 'mean'),
            ( 'mpg', 'size'),
            ('disp',  'sum'),
            ('disp', 'mean'),
            ('disp', 'size'),
            (  'hp',  'sum'),
            (  'hp', 'mean'),
            (  'hp', 'size'),
            ('drat',  'sum'),
            ('drat', 'mean'),
            ('drat', 'size'),
            (  'wt',  'sum'),
            (  'wt', 'mean'),
            (  'wt', 'size'),
            ('qsec',  'sum'),
            ('qsec', 'mean'),
            ('qsec', 'size'),
            (  'vs',  'sum'),
            (  'vs', 'mean'),
            (  'vs', 'size'),
            (  'am',  'sum'),
            (  'am', 'mean'),
            (  'am', 'size'),
            ('carb',  'sum'),
            ('carb', 'mean'),
            ('carb', 'size')],
           )

How to convert multi-index to single index pandas?

One of the common queries that beginner had. There are many ways to solve it. We would like to solve using converting multi-index to the list then assigning it as columns name.

cyl_gear_Grp_df.columns = [i[0] +"_" +i[1] for i in cyl_gear_Grp_df.columns]
cyl_gear_Grp_df

Output:

multindex to index pandas

Conclusion

Pandas is a very powerful Python package, and you can perform multi-dimensional analysis on the dataset. So it is extremely important to get a good hold on pandas.

The purpose of this article to touch upon the basics of groupby function, and how you can use it for your data analysis. 

We will continue to explore more about pandas in detail in upcoming articles and also look for some advanced data analysis method using python.

Leave a Comment

Your email address will not be published. Required fields are marked *