Real-world data is noisy. But as a data scientist, you get paid to get rid of the noise, make the data accessible, and create meaning. Thus, filtering data is vital for real-world data science tasks.
In this article, you’ll learn how to create a minimal filter function in a single line of code. I first give you the code and explain the basics afterward.
# Option 1
my_list = [x for x in my_list if x.attribute == value]

# Option 2 (in Python 3, filter() returns an iterator, so wrap it in list())
my_list = list(filter(lambda x: x.attribute == value, my_list))
A popular Stack Overflow answer discusses which of the solutions is better. In my opinion, the first option is better because list comprehensions are efficient, they avoid calling a lambda function for every element, and they need fewer characters. 🤓
But this is only my opinion. Comment your opinion at the end of the article!
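To make the two options concrete, here is a minimal sketch. The `Book` class, its `rating` attribute, and the sample data are hypothetical stand-ins for `x.attribute` and `value`; they are not part of the original snippet.

```python
from dataclasses import dataclass


@dataclass
class Book:
    # Hypothetical record type, used only for illustration
    title: str
    rating: float


my_list = [Book("A", 4.6), Book("B", 3.9), Book("C", 4.6)]
value = 4.6

# Option 1: list comprehension
option_1 = [x for x in my_list if x.rating == value]

# Option 2: filter() with a lambda; list() materializes the lazy iterator
option_2 = list(filter(lambda x: x.rating == value, my_list))

# Both options select the same elements
print(option_1 == option_2)
```

Both variants keep the books "A" and "C", so the choice between them is mostly a matter of readability and taste.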
So how do you create a function in one line? The lambda function is your friend! Lambda functions are anonymous functions that can be defined in a single line of code. If you want to learn more about lambda functions, check out this 3-min article.
lambda <arguments> : <expression>
You define a comma-separated list of arguments that serve as an input. The lambda function then evaluates the expression and returns the result of the expression.
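As a quick illustration of this syntax, the following sketch defines a lambda with two arguments and compares it to an ordinary function. The names `add` and `add_def` are made up for this example.

```python
# A lambda with two comma-separated arguments and a single expression
add = lambda a, b: a + b

# The equivalent function written with def
def add_def(a, b):
    return a + b

print(add(2, 3))      # 5
print(add_def(2, 3))  # 5
```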
Without further discussion of the basics, let’s explore how to solve the following data science problem by creating a filter function using the lambda function definition.
Consider the following problem: “Create a filter function that takes a list of books x and a minimal rating y and returns a list of potential bestsellers whose rating y' is higher than the minimal rating, that is, y' > y.”
## Dependencies
import numpy as np

## Data (row = [title, rating])
books = np.array([['Coffee Break NumPy', 4.6],
                  ['Lord of the Rings', 5.0],
                  ['Harry Potter', 4.3],
                  ['Winnie Pooh', 3.9],
                  ['The Clown of God', 2.2],
                  ['Coffee Break Python', 4.7]])

## One-liner
predict_bestseller = lambda x, y: x[x[:,1].astype(float) > y]

## Results
print(predict_bestseller(books, 3.9))
Take a guess: what’s the output of this code snippet?
The data consists of a two-dimensional NumPy array where each row holds a book title and the book's average user rating (a floating point number between 0.0 and 5.0). There are six different books in the data set.
The goal is to create a filter function that takes such a book rating data set x and a threshold rating y as input, and returns only the books whose rating is strictly higher than the threshold y.
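For comparison, the same goal can be sketched in plain Python with a list comprehension before turning to NumPy. The names `books_list` and `predict_bestseller_py` and the shortened data set are assumptions made for this illustration only.

```python
# Shortened, hypothetical data set: (title, rating) tuples
books_list = [("Coffee Break NumPy", 4.6),
              ("Winnie Pooh", 3.9),
              ("Harry Potter", 4.3)]


def predict_bestseller_py(x, y):
    # Keep only the books whose rating is strictly above the threshold y
    return [book for book in x if book[1] > y]


print(predict_bestseller_py(books_list, 3.9))
# [('Coffee Break NumPy', 4.6), ('Harry Potter', 4.3)]
```

A book rated exactly 3.9 is dropped because the comparison is strict.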
The one-liner achieves this objective by defining an anonymous lambda function that simply returns the result of the following expression:
x[x[:,1].astype(float) > y]
The array “x” is assumed to have a shape like our book rating array “books”.
First, we carve out the second column, which holds the book ratings, and convert it to a float array by calling astype(float) on the column x[:,1]. This is necessary because a NumPy array has a single data type: creating the array from a mix of strings and floats converts every entry, including the ratings, to a string.
Second, we create a Boolean array that holds the value “True” if the book at the respective row index has a rating larger than “y”. Note that the float “y” is implicitly broadcast across the rating array, so the comparison operator “>” is applied element by element.
Third, we use the Boolean array as an indexing array on the original book rating array to carve out all the books that have above-threshold ratings.
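The three steps can be sketched one at a time as follows. The shortened data set and the intermediate names `ratings`, `mask`, and `result` are introduced here for illustration and do not appear in the one-liner.

```python
import numpy as np

# Shortened data set for illustration (row = [title, rating])
books = np.array([['Coffee Break NumPy', 4.6],
                  ['Winnie Pooh', 3.9],
                  ['Harry Potter', 4.3]])
y = 3.9

# NumPy stores one data type per array, so the ratings became strings
print(books.dtype.kind)  # 'U' (Unicode string)

# Step 1: carve out the second column and convert it to floats
ratings = books[:, 1].astype(float)

# Step 2: compare every rating with the scalar y (broadcasting)
mask = ratings > y
print(mask.tolist())  # [True, False, True]

# Step 3: use the Boolean array to select the matching rows
result = books[mask]
print(result.tolist())
# [['Coffee Break NumPy', '4.6'], ['Harry Potter', '4.3']]
```

Note that the selected rows still hold the ratings as strings, because indexing does not change the data type of the original array.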
The result of this one-liner is the following array:
## Results
print(predict_bestseller(books, 3.9))
"""
[['Coffee Break NumPy' '4.6']
 ['Lord of the Rings' '5.0']
 ['Harry Potter' '4.3']
 ['Coffee Break Python' '4.7']]
"""
This article is based on the book “Coffee Break NumPy” which I co-authored in 2019. It uses the scientifically proven method of teaching called “puzzle-based learning”. By solving the NumPy puzzles you will improve your NumPy skills and learn about your true skill level in comparison to other coders. Check it out, it’s fun!