Pandas DataFrame – Pivot Tables

summary, background, texture

Pivot tables are one of the most frequently used functions in data analysis. It is a table that summarizes the data of a comprehensive table.  Although the term “pivot” is frequently referred to in Excel, and it is one of the most powerful features. 

Pandas, a python package,  has provided two important method pivot() and pivot_table(). We will look into these methods with examples, and also look at some of the most frequent questions related to pivot tables.

This article is the continuation of our Pandas DataFrame tutorial series, recommends you to check previous articles.

What is Pivot Table?

A pivot table is a summary of data based on single or multiple columns from the original table. It is used to group data in meaningful ways so you can inference conclusive information. In other words, the pivot table takes the complex data and simplify the information by applying various summary function in such a way that it can easily comprehend.

Let’s try to understand the pivot table using an example.

A “count” summary table generated on “Grade” as a row label and “Education” as a column label.

pivot table excel

Pandas Pivot Method

Pandas pivot method reshapes data based on the index and columns. You must use this method only if the specified index values are unique. 

Constructor

DataFrame.pivot(index=Nonecolumns=Nonevalues=None)

Parameters:

  • index – column or list of columns to group the data, and will be the index of the new DataFrame
  • columns – accept column or list of columns, and will be the columns in the new DataFrame
  • values – column or list of columns to use for populating new Dataframe’s values. 

Let’s take a couple of examples to sink-in the pivot table concept.

You need to create a DataFrame to perform pivot operations.

# import pandas
import pandas as pd

# creating a dataframe
df = pd.DataFrame({'Names': ['Harvey', 'Michael', 'Olivia', 'George', 'Harry', 'Jack', 'Jacob', 'Emma', 'Sophia'],
'Education': ['Masters', 'Bachelors', 'Bachelors', 'Masters', 'Masters', 'Masters', 'Bachelors', 'Masters', 'Bachelors'],
'Age': [27, 23, 24, 22, 29, 28, 31, 25, 30],
'Gender': ['M', 'M', 'F', 'M', 'M', 'M', 'M', 'F', 'F'],
'Grades': ['A+', 'B+', 'A', 'A', 'B+', 'A+', 'B', 'A+', 'A'],
'Location': ['Chicago', 'New York', 'Chicago', 'New York', 'Chicago', 'New York', 'Chicago', 'New York', 'New York']})

print(df)

Output:

     Names  Education  Age Gender Grades  Location
0   Harvey    Masters   27      M     A+   Chicago
1  Michael  Bachelors   23      M     B+  New York
2   Olivia  Bachelors   24      F      A   Chicago
3   George    Masters   22      M      A  New York
4    Harry    Masters   29      M     B+   Chicago
5     Jack    Masters   28      M     A+  New York
6    Jacob  Bachelors   31      M      B   Chicago
7     Emma    Masters   25      F     A+  New York
8   Sophia  Bachelors   30      F      A  New York

Create a pivot table on the index “Names”, columns as “Education”, and value as “Age”.  This is the simplest form of the pivot method.

# Single Index Pivot Table
print(df.pivot(index ='Names', columns = 'Education', values = 'Age'))

Output:

Education  Bachelors  Masters
Names                        
Emma             NaN     25.0
George           NaN     22.0
Harry            NaN     29.0
Harvey           NaN     27.0
Jack             NaN     28.0
Jacob           31.0      NaN
Michael         23.0      NaN
Olivia          24.0      NaN
Sophia          30.0      NaN

Observe the columns name, so basically the unique values from “Education” become the column names for the output DataFrame. And, all missing values filled with NaN.

How to create a pivot table using multiple columns?

Pass the list of columns to the index to create a pivot table. Here is an example.

# Multi-index Pivot table
print(df.pivot_table(index =['Education', 'Gender'], columns = 'Names', values = 'Age'))

Output:

pivot multindex

Error: Index contains duplicate entries

One of the most frequent errors related to the pandas pivot method. This usually happens when your index column has duplicate values.

# Create piovt table for column with duplicate values
print(df.pivot(index ='Education', columns = 'Gender', values = 'Age'))

Output:

duplicate error pandas pivot

Pandas pivot_table Method

Pandas pivot_table() method is similar to the Excel pivot function. It provides additional parameters to run aggregation such as mean, sum, etc on numeric data.

Constructor

DataFrame.pivot_table(datavalues=Noneindex=Nonecolumns=Noneaggfunc=‘mean’fill_value=None, dropna=True)

Parameters:

  • data – a DataFrame object
  • index – column or list of columns as keys to group on the pivot table on the index
  • columns – column or list of columns as keys to group on the pivot table on the columns
  • values – a column or a list of columns to aggregate.
  • aggfunc – aggregation functions
  • fill-value – replace the missing value with. The default value is NaN
  • dropna – ignore columns if all NaNs

Let’s have a look at the pivot_table example. This calculates the sum of “Age” on “Education” and “Grades”.

# pivot_table
import numpy as np
print(df.pivot_table(index="Education", columns="Grades", values="Age", aggfunc=sum))

Output:

Grades        A    A+     B    B+
Education                        
Bachelors  54.0   NaN  31.0  23.0
Masters    22.0  80.0   NaN  29.0

How to create pivot table using groupby?

As mentioned earlier, it is possible to create the pivot table using groupby method. 

# create pivot table using groupby
print(df.groupby(['Education', 'Grades'])['Age'].sum().unstack())

Output:

Grades        A    A+     B    B+
Education                        
Bachelors  54.0   NaN  31.0  23.0
Masters    22.0  80.0   NaN  29.0

Pivot table on multiple columns and aggregation methods

Pandas is flexible and powerful. Its pivot_table method accepts multiple columns as the index, also you can pass as many aggregation functions to aggfunc. Notice, we have used NumPy here.

# Multiple aggregation
import numpy as np
pd.pivot_table(df, index=['Education', 'Grades'], columns='Gender', values=['Age'], aggfunc=[np.sum, np.mean])

Output:

multiple columns and aggregation functions

Pivot Table: Replacing missing values with Zeros

When there are missing column values, the pivot tables method fills it with NaN values. But you can override this default behavior using fill_value.

import numpy as np
pd.pivot_table(df, index=['Education', 'Grades'], columns='Gender', values=['Age'], aggfunc=[np.sum, np.mean], fill_value=0)

Output:

fill_value_pivot

Conclusion

This is the fifth article in the Pandas DataFrame tutorial series. You may be surprised why so many articles on DataFrames. The simple answer to this question that most of the data scientist effort goes into analyzing the data, so it is extremely important to master pandas. Believe me, this will make your life much easier.

Sharing some more interesting blogs on this site:

Leave a Comment

Your email address will not be published. Required fields are marked *