Pandas append(), assign(), compare(), join(), merge(), update()

The Pandas DataFrame/Series has several methods to combine/compare/join and merge the data.

Preparation

Before any data manipulation can occur, two (2) new libraries will require installation.

The Pandas library enables access to/from a DataFrame.
The NumPy library supports multi-dimensional arrays and matrices in addition to a collection of mathematical functions.

To install these libraries, navigate to an IDE terminal. At the command prompt ($), execute the code below. For the terminal used in this example, the command prompt is a dollar sign ($). Your terminal prompt may be different.

$ pip install pandas

Hit the <Enter> key on the keyboard to start the installation process.

$ pip install numpy

Hit the <Enter> key on the keyboard to start the installation process.

If the installations were successful, a message displays in the terminal indicating the same.

FeFeel free to view the PyCharm installation guide for the required libraries.

Add the following code to the top of each code snippet. This snippet will allow the code in this article to run error-free.

import pandas as pd
import numpy

DataFrame append()

The append() method adds rows to the bottom (end) of a DataFrame/Series. A new DataFrame/Series returns with the appropriate rows appended. Columns not existing in the calling object add as new column(s).

The syntax for this method is as follows:

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)

Parameter	Description
`other`	This parameter can be a DataFrame, Series, dictionary, or a list. These column(s) append to the original calling object.
`ignore_index`	If `True`, ignore the original index: `False` use the original index.
`verify_integrity`	If True, raise a ValueError if duplicates exist.
`sort`	Sort the column(s) if the calling object and the other parameter do not align.

For this example, we have two (2) DataFrames. One with existing customer login credentials and one with new customer credentials. The code below appends them to form one (1) DataFrame.

Code – Example 1

df_custs = pd.DataFrame({('jkende',  'Vzs*@4:kNq%)'), 
                         ('sarahJ',  '{M$*3zB~-a-W'), 
                         ('AmyKerr', '*7#<bSt?Y_Z<')}, 
                         columns=['username', 'password'], 
                         index=['user-a', 'user-b', 'user-c'])
print(df_custs)

df_new = pd.DataFrame({('twilles',    '&4&F#@[>g$+%'), 
                         ('cindylou',   'JBW!ktA3;9sD')},
                         columns=['username', 'password'], 
                         index=['user-d', 'user-e'])
print(df_new)

df = df_custs.append(df_new)
print(df)

Line [1] creates a DataFrame from a dictionary of tuples and assigns it to df_custs.
Line [2] outputs this DataFrame to the terminal.
Line [3] creates a DataFrame from a dictionary of tuples and assigns it to df_new.
Line [4] outputs this DataFrame to the terminal.
Line [5] appends the DataFrame df_new to the end of the DataFrame df_custs. This output saves to a new DataFrame (df).
Line [6] outputs this DataFrame to the terminal.

Output

df_custs

	username	password
user-a	jkende	Vzs*@4:kNq%)
user-b	AmyKerr	*7#<bSt?Y_Z<
user-c	sarahJ	{M$*3zB~-a-W

df_new

	username	password
user-d	twilles	&4&F#@[>g$+%
user-e	cindylou	JBW!ktA3;9sD

df

	username	password
user-a	jkende	Vzs*@4:kNq%)
user-b	AmyKerr	*7#<bSt?Y_Z<
user-c	sarahJ	{M$*3zB~-a-W
user-d	twilles	&4&F#@[>g$+%
user-e	cindylou	JBW!ktA3;9sD

For this example, one (1) record is appended to the DataFrame df_custs using loc.

Code – Example 2

df_custs = pd.DataFrame({('jkende',  'Vzs*@4:kNq%)'), 
                         ('sarahJ',  '{M$*3zB~-a-W'), 
                         ('AmyKerr', '*7#<bSt?Y_Z<')}, 
                         columns=['username', 'password'], 
                         index=['user-a', 'user-b', 'user-c'])

df_custs.loc['user-d'] = ('jkende',  'Vzs*@4:kNq%)')
print(df_custs)

Line [1] creates a DataFrame from a Dictionary of Tuples and assigns it to df_custs.
Line [2] uses loc to append one (1) record to the end of the DataFrame.
Line [3] outputs the DataFrame to the terminal.

Output

df_custs

	username	password
user-a	jkende	Vzs*@4:kNq%)
user-b	AmyKerr	*7#<bSt?Y_Z<
user-c	sarahJ	{M$*3zB~-a-W

updated df_custs

	username	password
user-a	jkende	Vzs*@4:kNq%)
user-b	AmyKerr	*7#<bSt?Y_Z<
user-c	sarahJ	{M$*3zB~-a-W
user-d	twilles	&4&F#@[>g$+%

DataFrame assign()

The assign() method adds (assigns) column(s) to an existing DataFrame.

The syntax for this method is as follows:

DataFrame.assign(**kwargs)

Parameter	Description
`**kwargs`	The column(s) name(s) is assigned as keywords.

For this example, a new column (accessed) adds to the DataFrame df_custs. The column fills in with random integer values. In a real-life, this column would keep track of how many times the user logged in to their account.

df_custs = pd.DataFrame({('jkende',  'Vzs*@4:kNq%)'), 
                         ('sarahJ',  '{M$*3zB~-a-W'), 
                         ('AmyKerr', '*7#<bSt?Y_Z<')}, 
                         columns=['username', 'password'], 
                         index=['user-a', 'user-b', 'user-c'])

result = df_custs.assign(accessed=pd.Series(np.random.randint(0,500,size=3)).values)
print(result)

Line [1] creates a DataFrame from a Dictionary of Tuples and assigns it to df_custs.
Line [2] assigns a new column (accessed) with random integers values to fill in this column. This output saves to result.
Line [3] outputs result to the terminal.

Output

df_custs

	username	password	accessed
user-a	sarahJ	{M$*3zB~-a-W	155
user-b	jkende	Vzs*@4:kNq%)	472
user-c	AmyKerr	*7#<bSt?Y_Z<	354

💡 Note: The order of the records changes.

DataFrame compare()

The compare() method compares two (2) DataFrames. This method returns the differences between them.

The syntax for this method is as follows:

DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)

Parameter	Description
`other`	This parameter is the object (DataFrame) to use for comparison.
`align_axis`	This parameter determines the axis to align the comparison. If zero (0) or index is selected, apply to each column. Default is 0 (column). If zero (1) or columns, apply to each row.
`keep_shape`	If set to `True`, all column(s) stay. If `False`, only the ones with differing values remain.
`keep_equal`	If set to `True`, keep equal values. If `False`, equal values display as `NaN` values.

For this example, we have two (2) DataFrames. One with existing customer login credentials and one with new customer credentials. This code compares the DataFrames and returns the results (the differences).

df_custs = pd.DataFrame({('jkende',  'Vzs*@4:kNq%)'), 
                         ('sarahJ',  '{M$*3zB~-a-W'), 
                         ('AmyKerr', '*7#<bSt?Y_Z<')}, 
                         columns=['username', 'password'], 
                         index=['user-a', 'user-b', 'user-c'])
print(df_custs)

df_new    = pd.DataFrame({('jkende',  'Vzs*@4:kNq%)'), 
                         ('sarahJ',   'xc^O3&43P'), 
                         ('AmyKerr',  '*7#<bSt?Y_Z<')}, 
                         columns=['username', 'password'], 
                         index=['user-a', 'user-b', 'user-c'])
print(df_new)

result = df_custs.compare(df_new)
print(result)

Line [1] creates a DataFrame from a Dictionary of Tuples and assigns it to df_custs.
Line [2] outputs the DataFrame to the terminal.
Line [3] creates a DataFrame from a Dictionary of Tuples and assigns it to df_new.
Line [4] outputs the DataFrame to the terminal.
Line [5] compares the two DataFrames. This output saves to result.
Line [6] outputs the result to the terminal.

Output

df_custs

	username	password
user-a	AmyKerr	*7#<bSt?Y_Z<
user-b	sarahJ	{M$*3zB~-a-W
user-c	jkende	Vzs*@4:kNq%)

df_new

	username	password
user-a	AmyKerr	*7#<bSt?Y_Z<
user-b	sarahJ	xc^O3&43P
user-c	jkende	Vzs*@4:kNq%)

result

	password
	self	other
user-b	{M$*3zB~-a-W	xc^O3&43P

💡 Note: The user sarahJ resides in each DataFrame with different passwords.

DataFrame join()

The join() method joins columns by an index/key column. This method is great for joining multiple objects by their index.

The syntax for this method is as follows:

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

Parameter	Description
`other`	This parameter can be a DataFrame/Series/List and must be the same as a column in the existing object. If a Series, the name attribute needs to be set.
`on`	This parameter is the column(s)/index(es) to join the index in `other`. If empty, the join uses `index-on-index`.
`how`	This parameter can be one of the following options: – `left`: uses calling index (or column `on`, if used). – `right`: uses the index specified in the `other` parameter. – `outer`: creates union on calling index (or column, if `on`) with `other` index and `sort`. – `inner`: creates intersection on calling index (or column, if `on`) and preserves the order. – `cross`: creates the cartesian product from both and preserves the order of the left keys.
`lsuffix`	This parameter is the suffix to use from the left DataFrame over-lapping column(s).
`rsuffix`	This parameter is the suffix to use from the right DataFrame over-lapping column(s).
`sort`	If `False`, the order of the join key depends on the how parameter selection.

For this example, we have two (2) DataFrames.

The first DataFrame contains student names and ages. The second DataFrame includes student classes and grades. The code below joins the two (2) DataFrames.

df = pd.DataFrame({'key': ['Lucy', 'Carl', 'Wanda'],
                   'age':  [21, 18, 18]})

other = pd.DataFrame({key:  ['Python', 'PHP', 'Java'],
                   'grade':  [80, 56, 76]})

result = df.join(other, lsuffix='_caller', rsuffix='_other')
print(result)

Line [1] creates a DataFrame and saves it to df.
Line [2] creates a DataFrame and saves it to other.
Line [3] joins the two (2) DataFrames and appends suffixes to each column name. This output saves to result.
Line [4] outputs the result to the terminal.

Output

	key_caller	age	key_other	grade
0	Lucy	21	Python	80
1	Carl	18	PHP	56
2	Wanda	18	Java	76

DataFrame merge()

The merge() method joins the DataFrame(s) on columns/indexes.

Points to note:

If joining columns, the DataFrame indexes ignore.
If joining indexes, the index passes on.
If cross-merge, no column specs to merge done.

💡 Note: If the key columns contain rows where the key is NULL (empty), the rows match against each other.

This method does not act like a SQL join.

The syntax for this method is as follows:

DataFrame.merge(right, how='inner', on=None, left_on=None, right_on=None, 
                left_index=False, right_index=False, sort=False, 
                suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)

Description	Parameter
`right`	This parameter is the DataFrame/Series to merge.
`how`	This parameter can be one of the following options: – `left`: uses calling index (or column on, if used). – `right`: uses the index specified in the other parameter. – `outer`: creates union on calling index (or column, if on) with other index and sort. – `inner`: creates intersection on calling index (or column, if on) and preserves the order. – `cross`: creates the cartesian product from both and preserves the order of the left keys.
`on`	This parameter is the column/index name(s) to join and must exist in DataFrames/Series. The format is a label/list.
`left_on`	The column/index names to join on the left DataFrame.
`right_on`	The column/index name(s) to join on the right DataFrame.
`left_index`	The index from the left DataFrame uses as join key(s).
`right_index`	The index from the right DataFrame uses as join key(s).
`sort`	This parameter sorts the resultant DataFrame.
`suffixes`	This parameter must be two (2) characters. The suffix adds to the end of each over-lapping column on the left/right.
`copy`	If `True`, applies a merge on the original DataFrame.
`indicator`	If `True`, a column adds to the resultant DataFrame called `_merge`. This column provides details on the source of each row.
`validate`	These options indicate validation type (strings): – `one_to_one`/`1:1`: are merge keys unique in both DataFrames. – `one_to_many`/`1:m`: are merge keys unique in the left DataFrame. – `many_to_one`/`m:1`: are merge keys unique in right DataFrame. – `many_to_many`/`m:m`: allowed but does not check.

For these examples, merge using various options appear below (see above for details).

The inner option returns rows df1 and df2 that share a common key.

Code – Inner Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]})
df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='inner')
print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the inner option.
Line [4] outputs the result to the terminal.

Output

	key	value_x	value_y
0	France	67081000	33
1	Spain	47431256	34

The details for Germany (in df1) and DE (in df2) do not display.

Note: To rename the column names, use the rename() method.

The outer option uses the keys from the DataFrames (df1 and df2). A NaN value adds for missing data in both DataFrames.

Code – Outer Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]})
df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='outer')
print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the outer option.
Line [4] outputs the result to the terminal.

Output

	key	value_x	value_y
0	Germany	83783942.0	NaN
1	France	67081000.0	33.0
2	Spain	47431256.0	34.0
3	DE	NaN	49.0

Details for Germany (value_y field) and DE (value_x field) display NaN values.

The left option uses the keys from df1. A NaN value adds for missing data in both DataFrames.

Code – Left Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]})
df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='left')
print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the left option.
Line [4] outputs the result to the terminal.

Output:

	key	value_x	value_y
0	Germany	83783942.0	NaN
1	France	67081000.0	33.0
2	Spain	47431256.0	34.0

The row DE does not display. The Germany value_y field displays NaN.

The right option takes keys from the df2. Missing data from df1 changes to NaN value(s).

Code – Right Join

df1 = pd.DataFrame({'key': ['Germany', 'France', 'Spain'], 'value': [83783942, 67081000, 47431256]})
df2 = pd.DataFrame({'key': ['DE', 'France', 'Spain'], 'value': [49, 33, 34]})

result = df1.merge(df2, on='key', how='right')
print(result)

Line [1] creates a DataFrame from a Dictionary of Lists and saves it to df1.
Line [2] creates a DataFrame from a Dictionary of Lists and saves it to df2.
Line [3] merges the two (2) DataFrames on the key using the right option.
Line [4] outputs the result to the terminal.

Output

	key	value_x	value_y
0	DE	NaN	49
1	France	67081000.0	33
2	Spain	47431256.0	34

The row for Germany does not display. The value_x of DE displays NaN.

DataFrame update()

The update() method performs modifications in-place using non-NA values from a second DataFrame/Series. This method aligns with the index(es).

The syntax for this method is as follows:

DataFrame.update(other, join='left', overwrite=True, filter_func=None, errors='ignore'

Parameter	Description
`other`	Must have (minimum) one matching column/index with the original DataFrame. If a Series, the name attribute is used as the align column with the original DataFrame.
`join`	Left is the only option. This option keeps the index/columns of the original DataFrame/Series.
`overwrite`	This parameter determines how to deal with non-NA values for over-lapping keys. – If `True`, over-write original with values from other. By default, True. – If `False`, only update values that are NA in the original.
`filter_func`	This parameter takes a 1-dimensional array or 1-dimension Boolean array.
`errors`	If ‘`raise`’ is selected, a ValueError occurs if both originating and other contain non-NA values in the same position.

For this example, the first three (3) records of the countries.csv file are read in. The population is increased and updated.

df1  = pd.read_csv('countries.csv').head(3)
amt = 1.4
tmp = list(df1['Population'].apply(lambda x: x*amt))
df2 = pd.DataFrame({'Population': tmp})
df1.update(df2, overwrite=True)
print(df1)

Line [1] reads in the first three (3) records of the comma-delimited CSV file and saves to df1.
Line [2] sets the amount of the increase applied to the Population column.
Line [3] applies the increase (amt) to each Population cell and converts this column to a list. The output saves to df2.
Line [4] updates df1 with the new Population values residing in df2.
Line [5] outputs the updated DataFrame to the terminal.

Further Learning Resources

This is Part 16 of the DataFrame method series.

Part 1 focuses on the DataFrame methods abs(), all(), any(), clip(), corr(), and corrwith().
Part 2 focuses on the DataFrame methods count(), cov(), cummax(), cummin(), cumprod(), cumsum().
Part 3 focuses on the DataFrame methods describe(), diff(), eval(), kurtosis().
Part 4 focuses on the DataFrame methods mad(), min(), max(), mean(), median(), and mode().
Part 5 focuses on the DataFrame methods pct_change(), quantile(), rank(), round(), prod(), and product().
Part 6 focuses on the DataFrame methods add_prefix(), add_suffix(), and align().
Part 7 focuses on the DataFrame methods at_time(), between_time(), drop(), drop_duplicates() and duplicated().
Part 8 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
Part 9 focuses on the DataFrame methods equals(), filter(), first(), last(), head(), and tail()
Part 10 focuses on the DataFrame methods reset_index(), sample(), set_axis(), set_index(), take(), and truncate()
Part 11 focuses on the DataFrame methods backfill(), bfill(), fillna(), dropna(), and interpolate()
Part 12 focuses on the DataFrame methods isna(), isnull(), notna(), notnull(), pad() and replace()
Part 13 focuses on the DataFrame methods drop_level(), pivot(), pivot_table(), reorder_levels(), sort_values() and sort_index()
Part 14 focuses on the DataFrame methods nlargest(), nsmallest(), swap_level(), stack(), unstack() and swap_axes()
Part 15 focuses on the DataFrame methods melt(), explode(), squeeze(), to_xarray(), t() and transpose()
Part 16 focuses on the DataFrame methods append(), assign(), compare(), join(), merge() and update()
Part 17 focuses on the DataFrame methods asfreq(), asof(), shift(), slice_shift(), tshift(), first_valid_index(), and last_valid_index()
Part 18 focuses on the DataFrame methods resample(), to_period(), to_timestamp(), tz_localize(), and tz_convert()
Part 19 focuses on the visualization aspect of DataFrames and Series via plotting, such as plot(), and plot.area().
Part 20 focuses on continuing the visualization aspect of DataFrames and Series via plotting such as hexbin, hist, pie, and scatter plots.
Part 21 focuses on the serialization and conversion methods from_dict(), to_dict(), from_records(), to_records(), to_json(), and to_pickles().
Part 22 focuses on the serialization and conversion methods to_clipboard(), to_html(), to_sql(), to_csv(), and to_excel().
Part 23 focuses on the serialization and conversion methods to_markdown(), to_stata(), to_hdf(), to_latex(), to_xml().
Part 24 focuses on the serialization and conversion methods to_parquet(), to_feather(), to_string(), Styler.
Part 25 focuses on the serialization and conversion methods to_bgq() and to_coo().

Also, have a look at the Pandas DataFrame methods cheat sheet!