The `pandas.concat( )`

function combines the data from multiple Series and/or DataFrames fast and in an intuitive manner. It is one of the most basic data wrangling operations used in Pandas. In general, we draw some conclusions from the data by analyzing it. The confidence in our conclusions increases as we include more variables or meta-data about our data. This is achieved by combining data from a variety of different data sources. The basic Pandas objects, Series, and DataFrames are created by keeping these relational operations in mind. For example, `pd.concat([df1, df2])`

concatenates two DataFrames `df1`

, `df2`

together horizontally and results in a new DataFrame.

# Pandas Concat Two or More DataFrames

The most important and widely used use-case of Pandas concat – `pd.concat( )`

is to concatenate DataFrames.

For example, when you’re buying a new smartphone, often you might like to compare the specifications and price of the phones. This makes you take an informed decision. Such a comparison can be viewed below as an example from the amazon website for recent OnePlus phones.

In the above image, the data about four different smartphones are concatenated with their features as an index.

Let us construct two DataFrames and combine them to see how it works.

>>> import pandas as pd >>> df1 = pd.DataFrame( ... {"Key": ["A", "B", "A", "C"], "C1":[1, 2, 3, 4], "C2": [10, 20, 30, 40]}) >>> df1.index = ["L1", "L2", "L3", "L4"] >>> print(df1) Key C1 C2 L1 A 1 10 L2 B 2 20 L3 A 3 30 L4 C 4 40 >>> df2 = pd.DataFrame( ... {"Key": ["A", "B", "C", "D"], "C3": [100, 200, 300, 400]}) >>> df2.index = ["R1", "R2", "R3", "R4"] >>> print(df2) Key C3 R1 A 100 R2 B 200 R3 C 300 R4 D 400

From the official Pandas documentation of Pandas concat;

The two major arguments used in `pandas.concat( )`

from the above image are,

**objs**– A sequence of Series and/or DataFrame objects**axis**– Axis along which**objs**are concatenated

Out of the two arguments, `objs`

remains constant. But, based on the value of the axis, the concatenation operation differs. Possible values of the axis are,

`axis = 0`

– Concatenate or stack the DataFrames down the rows`axis = 1`

– Concatenate or stack the DataFrames along the columns

Remember this `axis`

argument functionality, because it comes in many other Pandas functions. Let us see them in action using the above created Dataframes.

## 1. Row-Wise Concatenation (axis = 0 / ’index’)

>>> df3 = pd.concat([df1, df2], axis=0) >>> print(df3) Key C1 C2 C3 L1 A 1.0 10.0 NaN L2 B 2.0 20.0 NaN L3 A 3.0 30.0 NaN L4 C 4.0 40.0 NaN R1 A NaN NaN 100.0 R2 B NaN NaN 200.0 R3 C NaN NaN 300.0 R4 D NaN NaN 400.0 >>> df3_dash = pd.concat([df1, df2]) >>> print(df3_dash) Key C1 C2 C3 L1 A 1.0 10.0 NaN L2 B 2.0 20.0 NaN L3 A 3.0 30.0 NaN L4 C 4.0 40.0 NaN R1 A NaN NaN 100.0 R2 B NaN NaN 200.0 R3 C NaN NaN 300.0 R4 D NaN NaN 400.0 >>> print(len(df3) == len(df1) + len(df2)) True

Any number of DataFrames can be given in the first argument which has a list of DataFrames like `[df1, df2, df3, ..., dfn]`

.

Some observations from the above results:

- Note the outputs of
`df3`

and`df3_dash`

are the same. So, we need not explicitly mention the axis when we want to concatenate down the rows. - The number of rows in the output DataFrame = Total number of rows in all the input DataFrames.
- The columns of the output DataFrame = Combination of distinct columns of all the input DataFrames.
- There are unique columns present in the input DataFrames. The corresponding values at the row labels of different input DataFrames are filled with
`NaN`

s (Not a Number – missing values) in the output DataFrame.

Let’s visualize the above process in the following animation:

## 2. Column-Wise Concatenation (axis = 1 / ’columns’)

>>> df3 = pd.concat([df1, df2], axis=1) >>> print(df3) Key C1 C2 Key C3 L1 A 1.0 10.0 NaN NaN L2 B 2.0 20.0 NaN NaN L3 A 3.0 30.0 NaN NaN L4 C 4.0 40.0 NaN NaN R1 NaN NaN NaN A 100.0 R2 NaN NaN NaN B 200.0 R3 NaN NaN NaN C 300.0 R4 NaN NaN NaN D 400.0 >>> print("The unique row indexes of df1 and df2:", '\n\t', df1.index.append(df2.index).unique()) The unique row indexes of df1 and df2: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object') >>> print("The row indexes of df3:", "\n\t", df3.index) The row indexes of df3: Index(['L1', 'L2', 'L3', 'L4', 'R1', 'R2', 'R3', 'R4'], dtype='object') >>> print("The column indexes of df1 and df2:", "\n\t", df1.columns.append(df2.columns)) The column indexes of df1 and df2: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object') >>> print("The column indexes of df3:", "\n\t", df3.columns) The column indexes of df3: Index(['Key', 'C1', 'C2', 'Key', 'C3'], dtype='object')

Some observations from the above results:

- The DataFrames are concatenated side by side.
- The columns in the output DataFrame = Total columns in all the input DataFrames.
- Rows in the output DataFrame = Unique rows in all the input DataFrames.
- There are unique rows present in all the input DataFrames. The corresponding values at the column labels of different input DataFrames are filled with
`NaN`

s (Not a Number – missing values) in the output DataFrame.

Let’s visualize the above process in the following animation:

# Pandas Concat Columns

Please take a look at the initial OnePlus phones comparison table from the amazon website. A column in that table constitutes all the specifications of a given smartphone. Such all equivalent specifications (row labels) of all varieties (phones – column labels) are concatenated as columns to form the final comparison table.

So, to concatenate columns, we should have the same row indexes. In Pandas, the Series data structure is exactly designed to represent the columns and their combination forms the DataFrame data structure.

Let us construct two Series and concatenate them as columns to form a resultant DataFrame.

>>> ser1 = pd.Series([10, 20, 30, 40], name='C1') >>> ser2 = pd.Series([100, 200, 300, 400], name='C2') >>> print("Series 1:", "\n", ser1, "\n\n", "Series 2:", "\n", ser2) Series 1: 0 10 1 20 2 30 3 40 Name: C1, dtype: int64 Series 2: 0 100 1 200 2 300 3 400 Name: C2, dtype: int64 >>> df = pd.concat([ser1, ser2], axis=1) >>> print("DataFrame:", "\n", df) DataFrame: C1 C2 0 10 100 1 20 200 2 30 300 3 40 400

# Pandas Concat MultiIndex

Let us consider a use-case where we have hourly weather data for 4 hours about two cities. The data that we have are only the temperature (degC) and wind speed (kmph). One way of storing their data is to store them in different DataFrames per city. It can be done the following way,

>>> Date_Hourly = pd.date_range(start = '2020-11-20', periods = 4, freq = 'H') >>> df_city1 = pd.DataFrame( ... {"temp(degC)": [27, 24, 22, 20], ... "windspeed(kmph)": [18, 17, 17, 18]}, ... index = Date_Hourly ... ) >>> df_city2 = pd.DataFrame( ... {"temp(degC)": [30, 33, 33, 34], ... "windspeed(kmph)": [23, 25, 27, 30]}, ... index = Date_Hourly ... ) >>> print("Weather Data of City 1:", "\n", df_city1) Weather Data of City 1: temp(degC) windspeed(kmph) 2020-11-20 00:00:00 27 18 2020-11-20 01:00:00 24 17 2020-11-20 02:00:00 22 17 2020-11-20 03:00:00 20 18 >>> print("Weather Data of City 2:", "\n", df_city2) Weather Data of City 2: temp(degC) windspeed(kmph) 2020-11-20 00:00:00 30 23 2020-11-20 01:00:00 33 25 2020-11-20 02:00:00 33 27 2020-11-20 03:00:00 34 30

Now, we might want to collect data of two cities into one DataFrame for easier analysis. MultiIndex keys serve as identifiers to specify the source of the data. This can be achieved by MultiIndex concatenation.

Multi-Index Concatenation is done in two ways;

## 1. Row-Wise Concatenation (axis = 0 / ’index’)

>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=0, keys=['City1', 'City2']) >>> print("Row-Wise Multi-Index Concatenation:", "\n", df_concat_rowwise) Row-Wise Multi-Index Concatenation: temp(degC) windspeed(kmph) City1 2020-11-20 00:00:00 27 18 2020-11-20 01:00:00 24 17 2020-11-20 02:00:00 22 17 2020-11-20 03:00:00 20 18 City2 2020-11-20 00:00:00 30 23 2020-11-20 01:00:00 33 25 2020-11-20 02:00:00 33 27 2020-11-20 03:00:00 34 30

## 2. Column-Wise Concatenation (axis = 1 / ’columns’)

>>> df_concat_rowwise = pd.concat([df_city1, df_city2], axis=1, keys=['City1', 'City2']) >>> print("Column-Wise Multi-Index Concatenation:", "\n", df_concat_colwise) Column-Wise Multi-Index Concatenation: City1 City2 temp(degC) windspeed(kmph) temp(degC) windspeed(kmph) 2020-11-20 00:00:00 27 18 30 23 2020-11-20 01:00:00 24 17 33 25 2020-11-20 02:00:00 22 17 33 27 2020-11-20 03:00:00 20 18 34 30

The same can be achieved for many cities. After concatenation, all of the data is in one single DataFrame. This makes us analyze the weather efficiently instead of fetching data from multiple sources.

# Pandas concat vs append

Concatenation along the rows (`axis = 0`

) is very common. If you observe the weather data scenario after each hour data gets appended in the next row. So, for that purpose, a method called `append( )`

is built on top of DataFrame to append another DataFrame row-wise. This makes you achieve the same results as `pd.concat( )`

with few keystrokes.

It can be implemented as follows,

>>> df1 = pd.DataFrame({'C1': ['A', 'B', 'C', 'D']}) >>> df2 = pd.DataFrame({'C1': ['E', 'F', 'G', 'H']}) >>> print("DataFrame 1:", "\n", df1) DataFrame 1: C1 0 A 1 B 2 C 3 D >>> print("DataFrame 2:", "\n", df2) DataFrame 2: C1 0 E 1 F 2 G 3 H >>> pd.concat([df1, df2]) C1 0 A 1 B 2 C 3 D 0 E 1 F 2 G 3 H >>> df1.append(df2) C1 0 A 1 B 2 C 3 D 0 E 1 F 2 G 3 H

You can observe above the same results for `pd.concat([df1, df2])`

and `df1.append(df2)`

.

# Pandas concat slow

Each and every time we do a concatenation operation, it creates a new DataFrame. DataFrame concatenation operates equivalent to an SQL join operation. So, the output DataFrame’s index is formed first by join operation. Resolving all the mismatches between indexes of input DataFrames makes it slow. In some scenarios, indexes might not be of importance. In such cases, we can ignore indexes to make the concat operation faster.

Ignoring the index is done by the following way,

>>> df = pd.DataFrame({'C1': [10, 20, 30, 40]}, index=['R1', 'R2', 'R3', 'R4']) >>> df C1 R1 10 R2 20 R3 30 R4 40 >>> df.reset_index(drop=True) C1 0 10 1 20 2 30 3 40

Along with `concat`

, all other Pandas functions are executed by utilizing only a single core in the CPU. Operations on smaller datasets run in a seamless manner. As the dataset size increases, the functions of Pandas start to throttle because they do only one operation at once.

Modin is the python package created to speed up the execution of Pandas functions. It distributes the computation load to all the available cores. It does so by fragmenting the DatFrame and making the function run on DataFrame fragments in other cores parallelly. Please look after this article to know about it in detail.