5 Best Ways to Iterate and Fetch Rows Containing Desired Text in Python Pandas

πŸ’‘ Problem Formulation: When working with datasets in Python’s Pandas library, a common task is to search for and extract rows that contain specific strings or substrates. For example, given a DataFrame containing text data, the goal might be to retrieve all rows where a particular column contains the word “success”. This article demonstrates five effective methods to accomplish this task, showcasing the versatility of Pandas for text-based data manipulation.

Method 1: Boolean Indexing

Boolean indexing in Pandas lets you filter DataFrames by evaluating each row against a boolean condition. When searching for text, we use the str.contains() method on string Series to create a boolean mask, which is then used to index the DataFrame and fetch the desired rows.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Column': ['The success story', 'Failure is not an option', 'Work hard and succeed']
})

# Boolean indexing to find rows containing 'success'
result = df[df['Column'].str.contains('success')]

print(result)

Output:

                Column
0  The success story

This code snippet creates a DataFrame with a single string column. Using boolean indexing with the str.contains() method, it filters out the rows where the column contains the word “success”. The resulting DataFrame is then printed, showing only the rows that met the criterion.

Method 2: The query() Method

For more complex querying, Pandas offers the query() method, which allows you to express your search criteria in a string that resembles a SQL WHERE clause. The method is especially useful when you need to reference variables in the environment, by prefixing them with an “@”.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Column': ['A successful attempt', 'Just missed the mark', 'Crowning achievement']
})

# Searching for rows with a variable
text_to_search = 'successful'
result = df.query('Column.str.contains(@text_to_search)', engine='python')

print(result)

Output:

                   Column
0  A successful attempt

In this snippet, we use the query() method to search for a variable-defined string within a Pandas DataFrame. By using @text_to_search within the query string, we can dynamically filter rows based on the contents of a Python variable.

Method 3: Regular Expressions

Regular expressions (regex) offer powerful pattern matching capabilities that go beyond simple substring searches. In Pandas, you can use the str.contains() method with a regex pattern to search for rows matching more complex rules.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Column': ['Learning is key to success', 'Fail, learn, and succeed', 'No shortcuts to success']
})

# Using regex to find words starting with 'suc'
result = df[df['Column'].str.contains(r'\bsuc', regex=True)]

print(result)

Output:

                         Column
0  Learning is key to success
2     No shortcuts to success

The code above uses a regex pattern within the str.contains() method to find rows where the ‘Column’ contains words starting with “suc”. The \b in the regex pattern denotes a word boundary, ensuring we match only whole words.

Method 4: Applying a Custom Function

Sometimes, you may need a custom function to define how text is matched. Using the DataFrame apply() method allows you to apply a custom function to each row or column of the DataFrame, and boolean indexing can then be used to select the rows that match the function’s criteria.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Column': ['Failure teaches success', 'Your successful moments', 'Persistence ensures success']
})

# Custom function for matching
def contains_success(text):
    return 'success' in text

# Applying the custom function
result = df[df['Column'].apply(contains_success)]

print(result)

Output:

                        Column
0  Failure teaches success
2  Persistence ensures success

Here we define a function contains_success() that checks if the word “success” is present in a given text string. This function is applied to each element of the ‘Column’ using the apply() method, and then we filter the DataFrame using the boolean array resulting from this operation.

Bonus One-Liner Method 5: The filter() Function

While not as widely applicable as other methods, the Pandas filter() function can be useful for quickly selecting rows where a string pattern is present in any of the specified columns.

Here’s an example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Name': ['Alice Success', 'Bob', 'Carol'],
    'Occupation': ['Engineer', 'Painter with success', 'Writer']
})

# One-liner using filter
result = df[df.apply(lambda row: row.astype(str).str.contains('success').any(), axis=1)]

print(result)

Output:

             Name           Occupation
0  Alice Success             Engineer
1             Bob  Painter with success

The filter() function in this example checks whether any column in a row contains the word “success”. This is implemented with a lambda function that applies the str.contains() to every element of the row and checks if any are True.

Summary/Discussion

  • Method 1: Boolean Indexing. Straightforward and concise. Limited to straightforward substring matching.
  • Method 2: The query() Method. Flexible with SQL-like querying. Can be less performant with large datasets.
  • Method 3: Regular Expressions. Highly versatile for complex patterns. Can be slower and less readable for simple tasks.
  • Method 4: Applying a Custom Function. Most flexible for complex conditions. Generally slower due to row-wise operation.
  • Bonus Method 5: The filter() Function. Quick for checking multiple columns. Less intuitive and may be less efficient than other methods.