5 Best Ways to Select Multiple Columns from a Pandas DataFrame in Python

💡 Problem Formulation: In data analysis, you often need to extract specific columns from a dataset for further operations, visualization, or analysis. Given a Pandas DataFrame, suppose you want to create a new DataFrame containing only a subset of its columns. This article shows several ways to select those columns with Python’s Pandas library, covering the syntax and typical use case of each.

Method 1: Square Brackets with Column Names List

This method is the most straightforward way to select multiple columns from a Pandas DataFrame. By using square brackets and passing a list of column names, you can slice the DataFrame and create a new DataFrame with only the specified columns.

Here’s an example:

import pandas as pd

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)

# Select 'Name' and 'City' columns
selected_columns = df[['Name', 'City']]

print(selected_columns)

Output:

      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago

This code snippet creates a new DataFrame selected_columns containing only the ‘Name’ and ‘City’ columns from the original DataFrame df. The list ['Name', 'City'] specifies which columns to include.
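One detail worth seeing in code is the difference between passing a list and passing a single string. The sketch below rebuilds the sample DataFrame so it runs on its own; the 'Country' column is a hypothetical name used only to trigger the error case.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# A list of names returns a DataFrame, even when the list has one element.
as_frame = df[['Name']]
# A bare string returns a Series instead.
as_series = df['Name']

print(type(as_frame).__name__)   # DataFrame
print(type(as_series).__name__)  # Series

# Requesting a column that does not exist raises KeyError.
missing_error = None
try:
    df[['Name', 'Country']]  # 'Country' is a hypothetical, absent column
except KeyError as exc:
    missing_error = exc
    print('KeyError raised for the missing column')
```

If a later step expects a DataFrame, keeping the double brackets avoids an accidental Series.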

Method 2: Using the DataFrame’s .loc[] Method

The .loc[] indexer performs label-based selection, so it can pick columns by name. It is more powerful than plain square brackets because a single expression can select both rows and columns.

Here’s an example:

# Select 'Name' and 'Age' columns using .loc
selected_columns = df.loc[:, ['Name', 'Age']]

print(selected_columns)

Output:

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35

This example shows the use of .loc[] where the colon : indicates the selection of all rows, and ['Name', 'Age'] specifies the columns to select. The result is a DataFrame with the ‘Name’ and ‘Age’ columns only.
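To illustrate the row-and-column flexibility mentioned above, here is a small sketch using the same sample data. The age threshold of 26 is an arbitrary value chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# A label slice includes BOTH endpoints: every column from 'Name' through 'Age'.
name_to_age = df.loc[:, 'Name':'Age']

# Filter rows with a boolean condition and pick columns in the same expression.
over_26 = df.loc[df['Age'] > 26, ['Name', 'City']]

print(list(name_to_age.columns))  # ['Name', 'Age']
print(over_26['Name'].tolist())   # ['Bob', 'Charlie']
```

Note that label slices differ from ordinary Python slices, which exclude the end position.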

Method 3: DataFrame’s .iloc[] Method

The .iloc[] method allows for integer-location based indexing, which means you can select columns by their integer position instead of their names. This can be particularly useful when dealing with large DataFrames with many columns.

Here’s an example:

# Select the first and third columns using .iloc
selected_columns = df.iloc[:, [0, 2]]

print(selected_columns)

Output:

      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago

In this code, .iloc[:, [0, 2]] selects all rows and the first and third columns (since Python uses zero-based indexing). As a result, we obtain a DataFrame with the ‘Name’ and ‘City’ columns.
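Position-based selection also accepts slices and negative positions. The sketch below shows both, and additionally looks up positions from names with Index.get_indexer for cases where you know the labels but need integers.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# An integer slice excludes the end position: columns 0 and 1 only.
first_two = df.iloc[:, 0:2]

# A negative position counts from the end; the list keeps the result a DataFrame.
last_col = df.iloc[:, [-1]]

# Translate column names into integer positions, then select by position.
positions = df.columns.get_indexer(['Name', 'City'])
by_position = df.iloc[:, positions]

print(list(first_two.columns))    # ['Name', 'Age']
print(list(last_col.columns))     # ['City']
print(list(by_position.columns))  # ['Name', 'City']
```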

Method 4: The filter() Function

The filter() method is a built-in Pandas way to subset columns by their labels. It accepts an exact list of names (items), a substring (like), or a regular expression (regex), which makes it useful when you need columns that share a naming pattern or prefix.

Here’s an example:

# Suppose we have additional columns with a prefix 'Info_'
df['Info_Salary'] = [70000, 80000, 90000]
df['Info_Hobbies'] = ['Skiing', 'Surfing', 'Photography']

# Use the filter method to select columns whose names contain 'Info'
selected_columns = df.filter(like='Info')

print(selected_columns)

Output:

   Info_Salary  Info_Hobbies
0        70000        Skiing
1        80000       Surfing
2        90000   Photography

The filter(like='Info') function call returns a new DataFrame with columns that include ‘Info’ in their names. Thus, the output DataFrame includes ‘Info_Salary’ and ‘Info_Hobbies’ columns.
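Because like= matches a substring anywhere in the label, it can pick up more than a prefix match would. The sketch below contrasts the three keyword options of filter(); the 'Extra_Info' and 'Missing' names are hypothetical and exist only to show the difference.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Info_Salary': [70000, 80000, 90000],
    'Extra_Info': ['a', 'b', 'c'],  # hypothetical column with 'Info' as a suffix
})

# like= keeps every column whose label contains the substring.
contains_info = df.filter(like='Info')

# regex= with a ^ anchor keeps only labels that start with 'Info'.
starts_with_info = df.filter(regex='^Info')

# items= keeps exact labels and silently skips ones that are absent.
exact = df.filter(items=['Name', 'Missing'])

print(list(contains_info.columns))     # ['Info_Salary', 'Extra_Info']
print(list(starts_with_info.columns))  # ['Info_Salary']
print(list(exact.columns))             # ['Name']
```

Unlike square brackets, items= does not raise KeyError for a missing label, which can hide typos.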

Bonus One-Liner Method 5: List Comprehension and startswith()

List comprehensions combined with the startswith() method provide a concise way to select columns. This can be an elegant solution when working with columns that have a common prefix or specific pattern, and you want a quick one-liner.

Here’s an example:

# Select columns that start with 'Info' using list comprehension
selected_columns = df[[col for col in df.columns if col.startswith('Info')]]

print(selected_columns)

Output:

   Info_Salary  Info_Hobbies
0        70000        Skiing
1        80000       Surfing
2        90000   Photography

This one-line code iterates over all column names in df.columns and selects only those that start with ‘Info’. A new DataFrame is then created using these filtered column names.
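A possible variation, sketched here as an alternative rather than a replacement, builds a boolean mask with the vectorized string accessor on df.columns and passes it to .loc. Both approaches assume every column label is a string.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Info_Salary': [70000, 80000, 90000],
    'Info_Hobbies': ['Skiing', 'Surfing', 'Photography'],
})

# List comprehension: collect the matching names, then index with the list.
via_comprehension = df[[col for col in df.columns if col.startswith('Info')]]

# Vectorized alternative: one boolean per column, applied through .loc.
mask = df.columns.str.startswith('Info')
via_mask = df.loc[:, mask]

print(mask.tolist())                       # [False, True, True]
print(via_mask.equals(via_comprehension))  # True
```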

Summary/Discussion

  • Method 1: Square Brackets with Column Names List. Simple and direct. Limited to specifying exact names.
  • Method 2: .loc[] Method. Flexible label-based selection of rows and columns together. More verbose than square brackets when only columns are needed.
  • Method 3: .iloc[] Method. Good for position-based selection. Requires knowing column positions.
  • Method 4: filter() Function. Enables substring and regex matching on column names. Less intuitive for direct name selection.
  • Bonus Method 5: List Comprehension with startswith(). Elegant one-liner. Requires familiarity with list comprehensions.
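
As a rough sanity check on the comparison above, the sketch below selects the same two columns with each approach and confirms the results match. The target columns and the variable names are choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

wanted = ['Name', 'City']

results = {
    'brackets': df[wanted],
    'loc': df.loc[:, wanted],
    'iloc': df.iloc[:, [0, 2]],
    'filter': df.filter(items=wanted),
    'comprehension': df[[col for col in df.columns if col in wanted]],
}

# Every approach should produce an identical DataFrame.
reference = results['brackets']
all_equal = all(frame.equals(reference) for frame in results.values())
print(all_equal)  # True
```

Since the outputs are the same, the choice mostly comes down to whether you know names, positions, or a naming pattern.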