28. Using Pandas to merge or lookup data

This post is also available as a PDF or as a Jupyter Notebook.

Sometimes we may want to cross-reference data between different data tables. This may be in order to perform a full merge of data, or just to produce a summary lookup table referencing across different tables.

import numpy as np
import pandas as pd

# Set up the first data frame

df1 = pd.DataFrame()

names = ['Gandolf','Gimli','Frodo','Legolas','Bilbo']
types = ['Wizard','Dwarf','Hobbit','Elf','Hobbit']
magic = [10, 1, 4, 6, 4]
aggression = [7, 10, 2, 5, 1]
stealth = [8, 2, 5, 10, 5]

df1['names'] = names
df1['type'] = types
df1['magic_power'] = magic
df1['aggression'] = aggression
df1['stealth'] = stealth

# Set up the second dataframe

names = ['Gandolf','Gimli','Frodo','Aragorn','Sauron']
popularity = ['High','Medium','High','Medium','Low']

df2 = pd.DataFrame()

df2['name'] = names
df2['popularity'] = popularity

We will look here at where the reference fields are not the index fields. We are going to want to merge using ’names’ in df1 and ’name’ in df2, and we are going to keep all records from df1, and add data from df2 where it is available.

merged_df = pd.merge(df1, df2, 
                     left_on = 'names', 
                     right_on = 'name', 
                     how='left')

print (merged_df)

OUT:

     names    type  magic_power  aggression  stealth     name popularity
0  Gandolf  Wizard           10           7        8  Gandolf       High
1    Gimli   Dwarf            1          10        2    Gimli     Medium
2    Frodo  Hobbit            4           2        5    Frodo       High
3  Legolas     Elf            6           5       10      NaN        NaN
4    Bilbo  Hobbit            4           1        5      NaN        NaN

We can keep all data from both databases by using how=outer (an outer database join).

merged_df = pd.merge(df1, df2, 
                     left_on = 'names', 
                     right_on = 'name', 
                     how='outer')

print (merged_df)


OUT:

     names    type  magic_power  aggression  stealth     name popularity
0  Gandolf  Wizard         10.0         7.0      8.0  Gandolf       High
1    Gimli   Dwarf          1.0        10.0      2.0    Gimli     Medium
2    Frodo  Hobbit          4.0         2.0      5.0    Frodo       High
3  Legolas     Elf          6.0         5.0     10.0      NaN        NaN
4    Bilbo  Hobbit          4.0         1.0      5.0      NaN        NaN
5      NaN     NaN          NaN         NaN      NaN  Aragorn     Medium
6      NaN     NaN          NaN         NaN      NaN   Sauron        Low

Or we could use how=inner to produce a dataframe for only the rows with reference columns in both dataframes, or how=right to keep all rows in df2 and add data from df1 where it exists.

In the above examples we have returned all fields both both dataframes. We can choose to select just the fields we wish to return (though we need the reference field from both dataframes):

merged_df = pd.merge(df1[['names','type']],
                     df2[['name','popularity']],
                     left_on = 'names', 
                     right_on = 'name', 
                     how='left')

print (merged_df)

OUT:

     names    type     name popularity
0  Gandolf  Wizard  Gandolf       High
1    Gimli   Dwarf    Gimli     Medium
2    Frodo  Hobbit    Frodo       High
3  Legolas     Elf      NaN        NaN
4    Bilbo  Hobbit      NaN        NaN

Using this method we need our reference fields, used to join data, as part of the dataframe and not as the index. To use this method where there is an index column that you wish to use as a joining field you will need to reset the index to a new numbered column. That is done by the command df.reset_index(). Alternatively you can use a method that joins using the index fields of each dataframe, as described in the link below.

There are many ways of joining and merging Pandas dataframes. I have set out just one method here. See https://pandas.pydata.org/pandas-docs/stable/merging.html for all methods.

 

One thought on “28. Using Pandas to merge or lookup data

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

w

Connecting to %s