Pandas DataFrame – Merge and Join

merge and join

Merge and Join are important functions to handle day-to-day challenges associated with the real-world data science problem. In many real-world applications, data may spread across multiple files or tables, and to perform in-depth studies of data requires concatenation of multiple datasets. It may be simple concatenation to complicated key-based joins and merges. Therefore, it is really important to master the merge and join functions. Luckily, pandas has provided various functions to perform the merge and join operation on DataFrames and Series.

In this post, we will explore merge and join functions, and also tries to find out the answers for some of the most frequent questions.

Merge and Join function covered in this tutorial are:

  • pandas.append() – append rows or columns of other to the end of the DataFrame object
  • pandas.concat() – joining DataFrames across rows or columns
  • pandas.merge() – joining is done on columns or indexes
  • pandas.join() – joining columns on an index or on a key column

If you are new to Pandas DataFrame, make sure to follow our Pandas DataFrame series:

Pandas append()

The first function that we are going to cover is append(). It is used to add the rows at the end of the DataFrame and return a new object. Append columns that are not in the original DataFrames are added as new columns. The append() function in Pandas does not modify the original DataFrame object.

DataFrame.append(otherignore_index=Falseverify_integrity=Falsesort=False)

# Pandas.Append()
import pandas as pd
df1 = pd.DataFrame({'Elements': ['a', 'b', 'c', 'd'],
'Numbers': ['1', '2', '3', '4'],
'CapLetters': ['A', 'B', 'C', 'D']})

df2 = pd.DataFrame({'Elements': ['e', 'f', 'g', 'h'],
'Numbers': ['5', '6', '7', '8'],
'CapLetters': ['E', 'F', 'G', 'H']})

# Merge DataFrames
print(df1.append(df2))

Output:

  Elements Numbers CapLetters
0        a       1          A
1        b       2          B
2        c       3          C
3        d       4          D
0        e       5          E
1        f       6          F
2        g       7          G
3        h       8          H

Notice the index of the appended DataFrame. Now, set the ignore_index to true.

print(df1.append(df2, ignore_index=True))

Output:

  Elements Numbers CapLetters
0        a       1          A
1        b       2          B
2        c       3          C
3        d       4          D
4        e       5          E
5        f       6          F
6        g       7          G
7        h       8          H

Append Single Row to DataFrame by Passing Dictionary Object

You can append a single row to a DataFrame by passing a series or dictionary. Create a dictionary object and pass it to the append method.

# Appending dict object to DataFrame
dict1 = {'Elements': 't', 'Numbers': '15', 'CapLetters': 'T'}
print(df1.append(dict1, ignore_index=True))

Output:

  Elements Numbers CapLetters
0        a       1          A
1        b       2          B
2        c       3          C
3        d       4          D
4        t      15          T

Append Single Row to DataFrame by Passing Series Object

Create a series object, and pass it to the append method to add the row to the original DataFrame.

# Appending series object to DataFrame Object
series1 = pd.Series(['x', 30, 'X'], index=['Elements','Numbers', 'CapLetters'])
print(df1.append(series1, ignore_index=True))

Output:

  Elements  Numbers CapLetters
0        a        1          A
1        b        2          B
2        c        3          C
3        d        4          D
4        x       30          X

How to Append Columns to DataFrame using append()

You can also append columns to the original DataFrame using the append method. Non-existed columns filled with NaN values.

# Pandas.Append() - Append Columns to Original DataFrame
import pandas as pd
df1 = pd.DataFrame({'Elements': ['a', 'b', 'c', 'd'],
'Numbers': [1, 2, 3, 4],
'CapLetters': ['A', 'B', 'C', 'D']})

df2 = pd.DataFrame({'Elements': ['e', 'f', 'g', 'h'],
'Numbers': [5, 6, 7, 8],
'Binary': [1, 0, 0, 1]})

print(df1.append(df2))

Output:

  Elements  Numbers CapLetters  Binary
0        a        1          A     NaN
1        b        2          B     NaN
2        c        3          C     NaN
3        d        4          D     NaN
0        e        5        NaN     1.0
1        f        6        NaN     0.0
2        g        7        NaN     0.0
3        h        8        NaN     1.0

How to Concatenate Series using Pandas Append Method

You can append the multiples series using the pandas append method. Let take a look at an example.

# Concatenate Multiple series using append method
series_1 = pd.Series([1, 2, 3, 4])
series_2 = pd.Series([11, 12, 13, 14])
series_3 = pd.Series([21, 22, 23, 24])

series_4 = series_1.append(series_2, ignore_index=True)
series_5 = series_4.append(series_3, ignore_index=True)
print(series_5)

Output:

0      1
1      2
2      3
3      4
4     11
5     12
6     13
7     14
8     21
9     22
10    23
11    24
dtype: int64

Pandas concat()

The pandas concat() method is used to join multiple DataFrames along a specified axis. It is more efficient than the append() method.

Combine multiple DataFrames using the concat method across the index.

df1 = pd.DataFrame({'Elements': ['a', 'b', 'c', 'd'],
'Numbers': ['1', '2', '3', '4'],
'CapLetters': ['A', 'B', 'C', 'D']},
index=[0, 1, 2, 3])

df2 = pd.DataFrame({'Elements': ['e', 'f', 'g', 'h'],
'Numbers': ['5', '6', '7', '8'],
'CapLetters': ['E', 'F', 'G', 'H']},
index=[0, 1, 2, 3])

df3 = pd.DataFrame({'Elements': ['i', 'j', 'k', 'l'],
'Numbers': ['9', '10', '11', '12'],
'CapLetters': ['I', 'J', 'K', 'L']},
index=[4, 5, 6, 7])

print(pd.concat([df1, df2, df3], ignore_index = True))

Output:

  Elements Numbers CapLetters
0        a       1          A
1        b       2          B
2        c       3          C
3        d       4          D
0        e       5          E
1        f       6          F
2        g       7          G
3        h       8          H
4        i       9          I
5        j      10          J
6        k      11          K
7        l      12          L

In our last example, we have used the concat method to join the DataFrames across the index. Now, will see how to merge the two DataFrames across columns. 

# Combine two dataframes - using axis parameter
df1 = pd.DataFrame({'Elements_1': ['a', 'b', 'c', 'd'],
'Numbers_1': ['1', '2', '3', '4'],
'CapLetters_1': ['A', 'B', 'C', 'D']})

df2 = pd.DataFrame({'Elements_2': ['i', 'j', 'k', 'l'],
'Numbers_2': ['9', '10', '11', '12'],
'Letters_2': ['I', 'J', 'K', 'L']})

print(pd.concat([df1, df2], axis=1, ignore_index = False))

Output:

  Elements_1 Numbers_1 CapLetters_1 Elements_2 Numbers_2 Letters_2
0          a         1            A          i         9         I
1          b         2            B          j        10         J
2          c         3            C          k        11         K
3          d         4            D          l        12         L

By default, the axis is set to “0” which means DataFrame concatenate across the index. Whereas axis=1 joins the DataFrames across the columns. The concat method gives the flexibility to concatenate multiple DataFrames in a single line of a python script.

Another important parameter is “join” which controls how to handle indexes on other axes. The default value for the join parameter is “outer”. 

# utilizing join parameter to combine dataframe
df1 = pd.DataFrame({'Elements_1': ['a', 'b', 'c', 'd'],
'Numbers_1': ['1', '2', '3', '4'],
'CapLetters_1': ['A', 'B', 'C', 'D']},
index=[0, 1, 2, 4])

df2 = pd.DataFrame({'Elements_2': ['i', 'j', 'k', 'l'],
'Numbers_2': ['9', '10', '11', '12'],
'CapLetters_2': ['I', 'J', 'K', 'L']},
index=[2, 4, 5, 6])

print(pd.concat([df1, df2], join="inner", ignore_index=False, axis=1))

Output:

  Elements_1 Numbers_1 CapLetters_1 Elements_2 Numbers_2 CapLetters_2
2          c         3            C          i         9            I
4          d         4            D          j        10            J

The concat inner join has resulted into merged DataFrame across the common index.  You can observe some data loss here.

The concat method with join parameter set as “outer”, resultant a DataFrame with all index values included in it. All the missing column values filled with NaN.

# Concat outer join
print(pd.concat([df1, df2], join="outer", ignore_index=False, axis=1))

Output:

  Elements_1 Numbers_1 CapLetters_1 Elements_2 Numbers_2 CapLetters_2
0          a         1            A        NaN       NaN          NaN
1          b         2            B        NaN       NaN          NaN
2          c         3            C          i         9            I
4          d         4            D          j        10            J
5        NaN       NaN          NaN          k        11            K
6        NaN       NaN          NaN          l        12            L

In the last example, we have combined the DataFrames across the column. Notice the changes made to the column names for df1 and df2. There are two common columns between the df1 and df2.

In this example, join is set to “inner”, and the axis is set to 0.

df1 = pd.DataFrame({'Elements_1': ['a', 'b', 'c', 'd'],
'Numbers_1': ['1', '2', '3', '4'],
'CapLetters_1': ['A', 'B', 'C', 'D']},
index=[0, 1, 2, 4])

df2 = pd.DataFrame({'Elements_1': ['i', 'j', 'k', 'l'],
'Numbers_1': ['9', '10', '11', '12'],
'CapLetters_2': ['I', 'J', 'K', 'L']},
index=[2, 4, 5, 6])

print(pd.concat([df1, df2], join="inner", ignore_index=False, axis=0))

Output:

  Elements_1 Numbers_1
0          a         1
1          b         2
2          c         3
4          d         4
2          i         9
4          j        10
5          k        11
6          l        12

Resultant DataFrame doesn’t have two columns CapLetters_1 and CapLetters_2, and only contains common columns.

The output DataFrame contains all the columns from df1 and df2 and missing values filled with NaN.

# Concat function - axis = 0 and join = "outer"
print(pd.concat([df1, df2], join="outer", ignore_index=False, axis=0))

Output:

  Elements_1 Numbers_1 CapLetters_1 CapLetters_2
0          a         1            A          NaN
1          b         2            B          NaN
2          c         3            C          NaN
4          d         4            D          NaN
2          i         9          NaN            I
4          j        10          NaN            J
5          k        11          NaN            K
6          l        12          NaN            L

How to Combine Series using Pandas Concat Method

Have you observed how easily in a single line code you can combine multiple series?

# Combining multiple series using Concat Method
series_1 = pd.Series([1, 2, 3, 4])
series_2 = pd.Series([11, 12, 13, 14])
series_3 = pd.Series([21, 22, 23, 24])

pd.concat([series_1, series_2, series_3])

Output:

0     1
1     2
2     3
3     4
0    11
1    12
2    13
3    14
0    21
1    22
2    23
3    24
dtype: int64

Pandas merge()

Pandas merge() gives the flexibility to perform the database-like join operations. You can combine DataFrame objects based on single or multiple keys. The merge is done on indexes or columns. 

DataFrame.merge(righthow=‘inner’on=Noneleft_on=Noneright_on=Noneleft_index=Falseright_index=Falsesort=Falsesuffixes=‘_x’, ‘_y’copy=Trueindicator=Falsevalidate=None)

Broadly classified, there are three categories of joins: one-to-one, many-to-one, and many-to-many joins. You can apply these three joins to your dataset using pandas merge(). Will discuss all these joins in detail below.

One-to-Many Joins

One of the simplest forms of join, also known as column-wise merge.  Each row in one table is associated with a single row in another table using a key column.

# One-to-One Join
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
'value': [101, 201, 301, 551]})
df2 = pd.DataFrame({'key': ['D', 'A', 'B', 'C'],
'value': [501, 601, 701, 801]})
print(pd.merge(df1, df2, on='key'))

Output:

  key  value_x  value_y
0   A      101      601
1   B      201      701
2   C      301      801
3   D      551      501

Many-to-One Joins

Many-to-one joins are in which each row in one table is associated with one or more rows in another table using a key column. Many-to one joins will result in a DataFrame with duplicate entries.

# Many-to-One joins - Key A
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A'],
'value': [101, 201, 301, 551]})
df2 = pd.DataFrame({'key': ['A', 'B', 'C'],
'value': [501, 601, 701]})

print(pd.merge(df1, df2, on='key'))

Output:

  key  value_x  value_y
0   A      101      501
1   A      551      501
2   B      201      601
3   C      301      701

Resulting DataFrame has duplicate entries for key “A”.

Many-to-Many Joins

Many-to-many joins are in which one, or more, rows in one table is associated with one, or more, rows in another table using a key column. Key columns in both tables have duplicate value.

# Many-to-Many joins - Key A in table 1 and C in table 2
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'A'],
'value': [101, 201, 301, 551]})
df2 = pd.DataFrame({'key': ['A', 'B', 'C', 'C'],
'value': [501, 601, 701, 801]})

print(pd.merge(df1, df2, on='key'))

Output:

  key  value_x  value_y
0   A      101      501
1   A      551      501
2   B      201      601
3   C      301      701
4   C      301      801

Closely, observe the key value for table 1 and table 2. Both the tables have duplicate key values. 

Types of Joins

Before moving to the joins, first, let’s understand the types of joins. There are four types of joins in python:

  • inner join – inner join return the rows that exist in both the DataFrame
  • outer join – returns all the rows from both the DataFrame
  • left join – return all the rows from left DataFrame, and only rows that match from right DataFrame
  • right join – return all the rows from right DataFrame, and only rows that match from left DataFrame

Let us create sample DataFrame, and perform these join operations one by one.

Create two DataFrames and perform an inner join.

# Create two sample DataFrames
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'E'],
'value': [101, 201, 301, 551, 341]})
df2 = pd.DataFrame({'key': ['B', 'C', 'G', 'H'],
'value': [501, 601, 701, 801]})
# Inner join
print(pd.merge(df1, df2, on='key', how='inner'))

Output:

  key  value_x  value_y
0   B      201      501
1   C      301      601

Inner join operation using merge(), resultant into the DataFrame having only common rows using key column. Key-value “B” and “C” are common to both the DataFrames.

Move into our next example: Outer Join operation

# Outer Join
print(pd.merge(df1, df2, on='key', how='outer'))

Output:

  key  value_x  value_y
0   A    101.0      NaN
1   B    201.0    501.0
2   C    301.0    601.0
3   D    551.0      NaN
4   E    341.0      NaN
5   G      NaN    701.0
6   H      NaN    801.0

Outer join operation returns all the rows, combined, from both the DataFrames. And all the missing values are filled with NaN. The next example is a left outer join using a merge method. 

# Left join
print(pd.merge(df1, df2, on='key', how='left'))

Output:

  key  value_x  value_y
0   A      101      NaN
1   B      201    501.0
2   C      301    601.0
3   D      551      NaN
4   E      341      NaN

This results in the matched rows from both the DataFrames and all the rows from left DataFrame. The last example is right join using merge method.

# Right join
print(pd.merge(df1, df2, on='key', how='right'))

Output:

  key  value_x  value_y
0   B    201.0      501
1   C    301.0      601
2   G      NaN      701
3   H      NaN      801

All unmatched values are filled with NaN. The Right join operation returns the matched rows and all the rows from the right table.

Pandas join()

The pandas join() method merge columns with other DataFrame either on an index or on a key column. The output DataFrame retains all the columns from both the DataFrames.

In this example, df1 and df2 DataFrames have columns as key and value. Check the resultant DataFrame, it has four columns.

# Join Dataframe
df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D', 'E'],
'value': [101, 201, 301, 551, 341]})
df2 = pd.DataFrame({'key': ['B', 'C', 'G', 'H'],
'value': [501, 601, 701, 801]})

print(df1.join(df2, lsuffix='_left', rsuffix='_right'))

Output:

  key_left  value_left key_right  value_right
0        A         101         B        501.0
1        B         201         C        601.0
2        C         301         G        701.0
3        D         551         H        801.0
4        E         341       NaN          NaN

By default, pandas.join() will do a left join on the index. If you would like to join on columns, then set the columns as the indexes. Here is an example to make a thing clear.

# Join on column
df1_key = df1.set_index('key')
df2_key = df2.set_index('key')
print(df1_key.join(df2_key, lsuffix='_left', rsuffix='_right'))

Output:

     value_left  value_right
key                         
A           101          NaN
B           201        501.0
C           301        601.0
D           551          NaN
E           341          NaN

How to Join DataFrames on Multiple Columns

As of now, we have looked into a single column merge or join. We will look into two examples where we combine DataFrame using multiple columns.

Combine DataFrames on Multiple Columns using the pandas merge() method

# Combine DataFrames on Multiple Columns using the pandas merge method
df1 = pd.DataFrame({'keyA': ['A', 'B', 'C', 'D', 'E'],
'keyB': ['a', 'b', 'c', 'd', 'e'],
'value_1': [101, 201, 301, 551, 341],
'value_2': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'keyA': ['B', 'C', 'G', 'E'],
'keyB': ['b', 'c', 'g', 'e'],
'value': [501, 601, 701, 801],
'value_2': [6, 7, 8, 9]})
print(pd.merge(df1,df2, on=['keyA', 'keyB']))

Output:

  keyA keyB  value_1  value_2_x  value  value_2_y
0    B    b      201          2    501          6
1    C    c      301          3    601          7
2    E    e      341          5    801          9

Merge DataFrames on Multiple Columns using the pandas join method

# Merge DataFrames on Multiple Columns using the pandas join method
df1 = pd.DataFrame({'keyA': ['A', 'B', 'C', 'D', 'E'],
'keyB': ['a', 'b', 'c', 'd', 'e'],
'value_1': [101, 201, 301, 551, 341],
'value_2': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'keyA': ['B', 'C', 'G', 'E'],
'keyB': ['b', 'c', 'g', 'e'],
'value': [501, 601, 701, 801],
'value_2': [6, 7, 8, 9]})

print(df1.join(df1.set_index(['keyA', 'keyB']), on=['keyA', 'keyB'], lsuffix='_left', rsuffix='_right'))

Output:

  keyA keyB  value_1_left  value_2_left  value_1_right  value_2_right
0    A    a           101             1            101              1
1    B    b           201             2            201              2
2    C    c           301             3            301              3
3    D    d           551             4            551              4
4    E    e           341             5            341              5

Conclusion

In this tutorial, we have covered various methods to combine DataFrame in pandas using python. These four methods are:

  • pandas.append() – append rows or columns of other to the end of the DataFrame object
  • pandas.concat() – joining DataFrames across rows or columns
  • pandas.merge() – joining is done on columns or indexes
  • pandas.join() – joining columns on an index or on a key column

Apart from that, we have also covered frequent questions related to merge, join, and concatenate. 

Hopefully, you have enjoyed this article.

Leave a Comment

Your email address will not be published. Required fields are marked *