When working with data in Python, you might encounter a situation where you need to add a new column to an existing Pandas DataFrame. This new column could be based on calculations, static values, or derived from other columns. Let’s say we have a DataFrame containing product sales data, and we want to add a column that calculates the sales tax for each product. The desired outcome is a DataFrame that preserves the original data and includes the new sales tax column.
Method 1: Using DataFrame Assignment
Direct assignment is a straightforward way to add a new column to a DataFrame. You simply assign a value or an array of values to a new column name on the DataFrame. If the column doesn’t exist, it will be created.
Here’s an example:
import pandas as pd
df = pd.DataFrame({'Product': ['Widget', 'Gadget'], 'Price': [20.00, 30.00]})
df['Sales Tax'] = df['Price'] * 0.08
print(df)
Output:
  Product  Price  Sales Tax
0  Widget   20.0        1.6
1  Gadget   30.0        2.4
This code snippet creates a new column called ‘Sales Tax’ by multiplying the ‘Price’ column by a sales tax rate of 8%. The new column is appended to the existing DataFrame.
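Direct assignment also accepts a single static value or a plain list, as mentioned above. The following is a minimal sketch of both cases; the 'Currency' and 'In Stock' columns and their values are made up for illustration and are not part of the sales data used elsewhere in this article.

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Widget', 'Gadget'], 'Price': [20.00, 30.00]})

# A scalar is broadcast to every row (hypothetical 'Currency' column).
df['Currency'] = 'USD'

# A list must supply exactly one value per row (hypothetical 'In Stock' column).
df['In Stock'] = [True, False]

# A list of the wrong length is rejected with a ValueError.
try:
    df['Too Long'] = [1, 2, 3]
except ValueError as exc:
    print(f"Assignment failed: {exc}")

print(df)
```

Assigning to a column name that already exists overwrites that column instead of adding a second one.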
Method 2: Insert Function
The insert() method offers more control over the location of the new column within the DataFrame by specifying the index position.
Here’s an example:
df.insert(1, 'Quantity', [10, 15])
print(df)
Output:
  Product  Quantity  Price  Sales Tax
0  Widget        10   20.0        1.6
1  Gadget        15   30.0        2.4
By using the insert() method, we’ve added a ‘Quantity’ column at index position 1 in the DataFrame, shifting other columns to the right.
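Two behaviours of insert() are easy to miss: it modifies the DataFrame in place and returns None, and by default it refuses to add a column whose name already exists. A small sketch of both, rebuilding a two-column version of the example DataFrame so the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Widget', 'Gadget'], 'Price': [20.00, 30.00]})

# insert() changes df in place, so there is nothing useful to assign back.
result = df.insert(1, 'Quantity', [10, 15])
print(result)            # None
print(list(df.columns))  # ['Product', 'Quantity', 'Price']

# Inserting the same column name again raises ValueError by default.
try:
    df.insert(0, 'Quantity', [1, 2])
except ValueError as exc:
    print(f"Insert failed: {exc}")
```

Because of the None return value, writing df = df.insert(...) would discard the DataFrame.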
Method 3: Using Assign Function
The assign() method is a powerful tool that allows for the addition of new columns to a DataFrame in a single operation, which can be particularly useful for method chaining.
Here’s an example:
df = df.assign(Total=df['Price'] * df['Quantity'])
print(df)
Output:
  Product  Quantity  Price  Sales Tax  Total
0  Widget        10   20.0        1.6  200.0
1  Gadget        15   30.0        2.4  450.0
This code utilises the assign() method to create a new ‘Total’ column, which is the product of the ‘Price’ and ‘Quantity’ columns. The assign() method returns a new DataFrame with all original columns in addition to the new one.
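To illustrate the method-chaining use case, here is a sketch that passes callables to assign() so that each step can refer to columns created earlier in the chain. The 'TotalWithTax' column name and the reuse of the 8% rate are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Widget', 'Gadget'],
    'Quantity': [10, 15],
    'Price': [20.00, 30.00],
})

# Each lambda receives the DataFrame as it exists at that point in the chain.
result = (
    df
    .assign(Total=lambda d: d['Price'] * d['Quantity'])
    .assign(TotalWithTax=lambda d: d['Total'] * 1.08)  # hypothetical column
)

print(result)

# The original DataFrame is left untouched.
print(list(df.columns))  # ['Product', 'Quantity', 'Price']
```

Keyword arguments must be valid Python identifiers, so a column name containing a space, such as 'Sales Tax', would have to be passed via dictionary unpacking instead.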
Method 4: Concatenating DataFrames
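Concatenation is useful when you want to add a column that you’ve calculated separately or that comes from another DataFrame. Note that pd.concat() aligns rows by index label rather than by position, so having the same number of rows is not enough: the two DataFrames should also share the same index.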
Here’s an example:
new_col = pd.DataFrame({'Discount': [0.1, 0.15]})
df = pd.concat([df, new_col], axis=1)
print(df)
Output:
  Product  Quantity  Price  Sales Tax  Total  Discount
0  Widget        10   20.0        1.6  200.0      0.10
1  Gadget        15   30.0        2.4  450.0      0.15
In this snippet, we’ve concatenated a new DataFrame ‘new_col’ containing a ‘Discount’ column to our original DataFrame ‘df’. We specify axis=1 for column-wise concatenation.
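Column-wise concatenation matches rows by index label, which can produce surprising results when the two DataFrames are indexed differently. The sketch below gives 'df' a non-default index (the labels 10 and 11 are arbitrary, chosen only for this illustration) and shows one possible way to realign before concatenating.

```python
import pandas as pd

# df uses custom index labels; new_col gets the default labels 0 and 1.
df = pd.DataFrame(
    {'Product': ['Widget', 'Gadget'], 'Price': [20.00, 30.00]},
    index=[10, 11],
)
new_col = pd.DataFrame({'Discount': [0.1, 0.15]})

# No labels match, so the result has four rows padded with NaN.
misaligned = pd.concat([df, new_col], axis=1)
print(misaligned.shape)  # (4, 3)

# Resetting the index on df makes the labels line up again.
aligned = pd.concat([df.reset_index(drop=True), new_col], axis=1)
print(aligned.shape)  # (2, 3)
```

Resetting the index discards the original labels; if they matter, assigning df.index to the new DataFrame before concatenating is an alternative.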
Bonus One-Liner Method 5: Using Lambda Expressions
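Lambda expressions, combined with the apply() function, provide a concise way to add a column with values derived from operations on existing columns. Keep in mind that row-wise apply() calls a Python function once per row, so on large DataFrames it is typically slower than vectorized column arithmetic.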
Here’s an example:
df['Net Price'] = df.apply(lambda row: row['Price'] * (1 - row['Discount']), axis=1)
print(df)
Output:
  Product  Quantity  Price  Sales Tax  Total  Discount  Net Price
0  Widget        10   20.0        1.6  200.0      0.10       18.0
1  Gadget        15   30.0        2.4  450.0      0.15       25.5
The lambda function calculates the ‘Net Price’ after the discount per row in the DataFrame. The apply() function applies this lambda function across the DataFrame row-wise, i.e., axis=1.
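For simple arithmetic like this, the same column can usually be computed without apply() by operating on whole columns at once. The following sketch compares the two approaches on a cut-down version of the example data; it is meant as an illustration of the equivalence, not a benchmark.

```python
import pandas as pd

df = pd.DataFrame({'Price': [20.00, 30.00], 'Discount': [0.1, 0.15]})

# Row-wise: the lambda runs once per row.
via_apply = df.apply(lambda row: row['Price'] * (1 - row['Discount']), axis=1)

# Vectorized: one expression over entire columns.
vectorized = df['Price'] * (1 - df['Discount'])

# Largest absolute difference between the two results.
print((via_apply - vectorized).abs().max())
```

Row-wise apply() remains useful when the per-row logic involves branching or calls to functions that cannot operate on whole columns.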
Summary/Discussion
- Method 1: Direct Assignment. Simple and straightforward. Best for adding a single column, including one computed from several existing columns. Modifies the DataFrame in place, so it does not fit into a method chain.
- Method 2: Insert Function. Offers positional control. Good for inserting columns in specific places. More verbose compared to direct assignment.
- Method 3: Assign Function. Ideal for method chaining. Versatile, and returns a modified copy without altering the original DataFrame. However, it can be less intuitive than direct assignment.
- Method 4: Concatenating DataFrames. Useful when adding externally calculated columns. Rows are aligned by index, so mismatched indexes produce NaN values. It can be less efficient for large DataFrames due to the overhead of creating an additional DataFrame before concatenation.
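- Method 5: Lambda Expressions. A concise one-liner approach. Great for per-row calculations that reference multiple columns and need custom logic. Can be less readable and more difficult to debug for complex expressions, and row-wise apply() is typically slower than vectorized operations on large DataFrames.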
