5 Best Ways to Select a Subset of Rows and Columns in Python Pandas

💡 Problem Formulation: When working with data in Python Pandas, it’s a common task to extract just the relevant piece of your dataset. Whether it’s for initial data inspection, further data analysis, or preprocessing for machine learning tasks, being able to slice your DataFrame efficiently is essential. This article dives into how to select a subset of both rows and columns simultaneously, using different methods that Pandas offers, taking a DataFrame as an input and showcasing the corresponding slicing outputs.

Method 1: Using `loc[]` for label-based indexing

The loc[] accessor is perfect for selecting rows and columns by labels. It takes two single arguments; the first specifies the row labels while the second specifies the column labels. The magic of loc[] is that it can handle slicing, single labels, label lists, and boolean arrays.

Here’s an example:

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
}, index=['row1', 'row2', 'row3'])

result = df.loc['row1':'row2', 'A':'B']
print(result)

Output:

      A  B
row1  1  4
row2  2  5

This snippet creates a DataFrame and uses loc[] to select the first two rows and the first two columns. The result is a new DataFrame containing only the specified subset.

Method 2: Using `iloc[]` for position-based indexing

The iloc[] accessor is useful for selecting rows and columns by integer position. Like loc[], it takes two arguments for row and column selection but works with indices instead of labels. It’s great for scenarios where the positional location is more relevant than the label.

Here’s an example:

result = df.iloc[0:2, 0:2]
print(result)

Output:

      A  B
row1  1  4
row2  2  5

In this code, iloc[] selects the subset of the DataFrame corresponding to the first two rows and columns by their integer indices.

Method 3: Boolean indexing with `loc[]`

Boolean indexing allows for more complex selection criteria. It uses a boolean vector to filter rows. The loc[] accessor pairs well with a boolean series to select rows based on a condition, combined with specific column labels.

Here’s an example:

result = df.loc[df['A'] > 1, ['B', 'C']]
print(result)

Output:

      B  C
row2  5  8
row3  6  9

The example uses boolean indexing to select rows where column ‘A’ has values greater than 1 and only the columns ‘B’ and ‘C’.

Method 4: Using `query()` and `loc[]`

The combination of query() and loc[] methods provides a clean and readable way to select subsets. The query() method is used to filter rows based on a query string, while loc[] is still used for column selection.

Here’s an example:

result = df.query('A > 1').loc[:, ['B', 'C']]
print(result)

Output:

      B  C
row2  5  8
row3  6  9

This snippet uses the query() method to filter rows where ‘A’ is greater than 1, and then loc[] selects only columns ‘B’ and ‘C’.

Bonus One-Liner Method 5: Using `df[]` and Boolean indexing

For a quick, one-liner solution to select rows based on a condition and certain columns, you can combine boolean indexing directly with column selection using df[].

Here’s an example:

result = df[df['A'] > 1][['B', 'C']]
print(result)

Output:

      B  C
row2  5  8
row3  6  9

The code uses boolean indexing to filter rows where ‘A’ is greater than 1 and then selects columns ‘B’ and ‘C’ using a list.

Summary/Discussion

Method 1: Using loc[]. Strengths: Selection by labels, versatile for different types of indexing. Weaknesses: Not suitable for positional selection.
Method 2: Using iloc[]. Strengths: Selection strictly by integer position, good for scenarios ignoring labels. Weaknesses: Less intuitive if data has meaningful labels.
Method 3: Boolean indexing with loc[]. Strengths: Allows complex row selection based on conditionals. Weaknesses: Requires more steps and can be less readable.
Method 4: Using query() and loc[]. Strengths: Provides very readable code for complex queries. Weaknesses: Query strings are less flexible than pure Python expressions.
Bonus One-Liner Method 5: Using df[] and Boolean indexing. Strengths: Quick and straightforward for simple selections. Weaknesses: Can become unwieldy with more complex selections or conditions.

Method 1: Using loc[] for label-based indexing

Method 2: Using iloc[] for position-based indexing

Method 3: Boolean indexing with loc[]

Method 4: Using query() and loc[]

Bonus One-Liner Method 5: Using df[] and Boolean indexing

Summary/Discussion

Method 1: Using `loc[]` for label-based indexing

Method 2: Using `iloc[]` for position-based indexing

Method 3: Boolean indexing with `loc[]`

Method 4: Using `query()` and `loc[]`

Bonus One-Liner Method 5: Using `df[]` and Boolean indexing