nonzero() function in NumPy — using a practical data science example.
Urbanization is a megatrend that shows no sign of slowing down. Cars, factories, and other sources of pollution ultimately lead to poor air quality. This one-liner shows you how to approach the problem from a data scientist’s perspective!
This example is taken from my book Python One-Liners:
Python programmers will improve their computer science skills with these useful one-liners.
Python One-Liners will teach you how to read and write “one-liners”: concise statements of useful functionality packed into a single line of code. You’ll learn how to systematically unpack and understand any line of Python code, and write eloquent, powerfully compressed Python like an expert.
The book’s five chapters cover tips and tricks, regular expressions, machine learning, core data science topics, and useful algorithms. Detailed explanations of one-liners introduce key computer science concepts and boost your coding and analytical skills. You’ll learn about advanced Python features such as list comprehension, slicing, lambda functions, regular expressions, map and reduce functions, and slice assignments. You’ll also learn how to:
• Leverage data structures to solve real-world problems, like using Boolean indexing to find cities with above-average pollution
• Use NumPy basics such as array, shape, axis, type, broadcasting, advanced indexing, slicing, sorting, searching, aggregating, and statistics
• Calculate basic statistics of multidimensional data arrays and the K-Means algorithms for unsupervised learning
• Create more advanced regular expressions using grouping and named groups, negative lookaheads, escaped characters, whitespaces, character sets (and negative character sets), and greedy/nongreedy operators
• Understand a wide range of computer science topics, including anagrams, palindromes, supersets, permutations, factorials, prime numbers, Fibonacci numbers, obfuscation, searching, and algorithmic sorting
By the end of the book, you’ll know how to write Python at its most refined, and create concise, beautiful pieces of “Python art” in merely a single line.
The nonzero() Trick
The air quality index (AQI) measures the danger of adverse health effects. It is commonly used to compare the air quality of different cities. In this one-liner, you are going to dive into the air quality index of four cities: Hong Kong, New York, Berlin, and Montreal.
The one-liner addresses the problem of finding above-average polluted cities. We define them as cities where the peak AQI value is above the overall average among all the measurements of all cities.
An important element of our solution will be to find elements that meet a certain condition in a NumPy array. This is a common problem in data science and you will need it constantly in your practical code projects.
So, let’s explore how you can find array elements that meet a certain condition.
NumPy offers the function nonzero() that finds indices of elements in an array that are, well, not equal to zero. Here is an example:
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

print(np.nonzero(X))
# (array([0, 1, 1, 2], dtype=int64), array([0, 1, 2, 0], dtype=int64))
The result is a tuple of two NumPy arrays. The first array gives the row indices of non-zero elements. The second array gives the column indices of non-zero elements.
There are four non-zero elements in the two-dimensional array: 1, 2, 2, and 3. These four non-zero elements are at positions (0,0), (1,1), (1,2), and (2,0) in the array.
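To make the link between the two index arrays and the positions listed above explicit, the row and column arrays can be paired up. The following is a minimal sketch; the names rows, cols, and positions are chosen here for illustration, and np.argwhere() is a NumPy function that the original example does not use but that returns the same coordinates as a single array.

```python
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

rows, cols = np.nonzero(X)

# Pair each row index with its column index to get (row, column) positions.
positions = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(positions)
# [(0, 0), (1, 1), (1, 2), (2, 0)]

# Indexing X with both index arrays returns the non-zero values themselves.
print(X[rows, cols])
# [1 2 2 3]

# np.argwhere() gives the same coordinates, one (row, column) pair per row.
print(np.argwhere(X))
```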
Now, how can you use nonzero() to find elements that meet a certain condition in your array? Simply use another great NumPy feature: Boolean array operations with broadcasting!
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

print(X == 2)
"""
[[False False False]
 [False  True  True]
 [False False False]]
"""
This is actually an instance of broadcasting: the integer value “2” is copied (conceptually) into a new array with equal shape. NumPy then performs an element-wise comparison and returns the resulting Boolean array.
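One way to convince yourself of this is to build the conceptual array of twos explicitly and compare the results. This is only an illustrative sketch: the names twos, implicit, and explicit are made up for this example, and NumPy does not actually allocate such an array when it broadcasts a scalar.

```python
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

# Explicitly build the array that broadcasting creates only conceptually.
twos = np.full(X.shape, 2)

implicit = (X == 2)      # the scalar 2 is broadcast to the shape of X
explicit = (X == twos)   # element-wise comparison of two equal-shaped arrays

print(np.array_equal(implicit, explicit))
# True
```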
Do you have an idea of how to combine the two features, nonzero() and Boolean array operations, to find elements that meet a certain condition?
Tip: Python treats the Boolean value “False” as the integer “0” (and “True” as “1”).
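The tip is easy to verify in isolation. The short sketch below uses a made-up one-dimensional array called mask (not part of the air quality example) to show that nonzero() treats True entries as non-zero.

```python
import numpy as np

# In Python, bool is a subclass of int, so True and False behave like 1 and 0.
print(isinstance(False, int))   # True
print(False == 0, True == 1)    # True True
print(True + True)              # 2

# NumPy follows the same convention: nonzero() on a Boolean array
# returns the indices of the True entries.
mask = np.array([False, True, False, True])
true_indices = np.nonzero(mask)[0]
print(true_indices)
# [1 3]
```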
Have a look at the following code snippet to see how this can be done.
In the code snippet, we explore the following problem: “Find cities with above-average pollution peaks!”
## Dependencies
import numpy as np

## Data: air quality index AQI data (row = city)
X = np.array(
    [[ 42, 40, 41, 43, 44, 43 ],   # Hong Kong
     [ 30, 31, 29, 29, 29, 30 ],   # New York
     [  8, 13, 31, 11, 11,  9 ],   # Berlin
     [ 11, 11, 12, 13, 11, 12 ]])  # Montreal

cities = np.array(["Hong Kong", "New York", "Berlin", "Montreal"])

## One-liner
# nonzero() returns a (row indices, column indices) tuple; [0] selects the row indices
polluted = set(cities[np.nonzero(X > np.average(X))[0]])

## Result
print(polluted)
# a set containing Hong Kong, New York, and Berlin (set order may vary)
The data array X contains four rows (one row per city) and six columns (one column per measurement, for example one per day). The string array cities contains the four city names in the same order as the rows of the data array.
The task is to find the names of the cities that have at least one above-average AQI value. Again, here is the one-liner that accomplishes that:
## One-liner
polluted = set(cities[np.nonzero(X > np.average(X))[0]])
Let’s deconstruct the one-liner starting from within:
print(X > np.average(X))
"""
[[ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [False False  True False False False]
 [False False False False False False]]
"""
The Boolean expression uses broadcasting to bring both operands to the same shape. It then performs an element-wise comparison and produces a Boolean array that contains “True” wherever a measurement is above the average AQI value. We use the function np.average() to compute the average AQI value over all elements of the NumPy array.
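For the data in this example, the threshold can be checked by hand. The sketch below recomputes the overall average from the sum and the number of elements; the variable name overall is introduced here only for illustration.

```python
import numpy as np

X = np.array([[42, 40, 41, 43, 44, 43],   # Hong Kong
              [30, 31, 29, 29, 29, 30],   # New York
              [ 8, 13, 31, 11, 11,  9],   # Berlin
              [11, 11, 12, 13, 11, 12]])  # Montreal

# Without an axis argument, np.average() averages over all 24 values.
overall = np.average(X)
print(round(float(overall), 2))
# 24.33

# The same value by hand: sum of all elements divided by their count.
print(X.sum(), X.size)
# 584 24

# Number of individual measurements above the overall average.
print(int((X > overall).sum()))
# 13
```

So every measurement for Hong Kong and New York lies above the threshold of roughly 24.33, while Berlin exceeds it only once, with the value 31.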
By generating this Boolean array, we know exactly which elements satisfy the condition of being above-average and which elements don’t.
Recall that Python treats “True” as “1” and “False” as “0”. Hence, we can use the function nonzero() to find all row and column indices that meet the condition. Here is how you can do this:
print(np.nonzero(X > np.average(X)))
"""
(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2], dtype=int64),
 array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 2], dtype=int64))
"""
The first tuple value holds all row indices with non-zero elements and the second tuple value holds their respective column indices. In other words, these are the row and column positions of all AQI values that exceed the average pollution in the data matrix.
We are only interested in the row indices, the first element of the tuple, because we are looking for cities where such pollution peaks happened in our data. We select them with [0] and use them to extract the city names from our string array:
print(cities[np.nonzero(X > np.average(X))[0]])
"""
['Hong Kong' 'Hong Kong' 'Hong Kong' 'Hong Kong' 'Hong Kong' 'Hong Kong'
 'New York' 'New York' 'New York' 'New York' 'New York' 'New York'
 'Berlin']
"""
There are a lot of duplicates in the resulting sequence of strings. The reason is that Hong Kong and New York have many such above-average AQI measurements.
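To see exactly how often each city appears, the duplicates can be counted. This sketch uses np.unique() with return_counts=True, which is not part of the original one-liner; the names rows, names, counts, and per_city are chosen here for illustration.

```python
import numpy as np

X = np.array([[42, 40, 41, 43, 44, 43],   # Hong Kong
              [30, 31, 29, 29, 29, 30],   # New York
              [ 8, 13, 31, 11, 11,  9],   # Berlin
              [11, 11, 12, 13, 11, 12]])  # Montreal
cities = np.array(["Hong Kong", "New York", "Berlin", "Montreal"])

# Row index of every above-average measurement (duplicates included).
rows = np.nonzero(X > np.average(X))[0]

# np.unique() returns the sorted distinct names and how often each occurs.
names, counts = np.unique(cities[rows], return_counts=True)
per_city = dict(zip(names.tolist(), counts.tolist()))
print(per_city)
# {'Berlin': 1, 'Hong Kong': 6, 'New York': 6}
```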
Now, there is only one thing left: removing duplicates. This can be easily achieved by converting the sequence to a Python set. Sets are duplicate-free, so the result contains each city whose pollution exceeded the average AQI value exactly once: Hong Kong, New York, and Berlin.
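As a cross-check, the same cities can be found without nonzero() and without a set. The following is an alternative sketch, not the method from the book: it reduces the Boolean array to one value per row with any(axis=1) and uses that mask for Boolean indexing. The names row_mask and polluted_alt are made up for this example.

```python
import numpy as np

X = np.array([[42, 40, 41, 43, 44, 43],   # Hong Kong
              [30, 31, 29, 29, 29, 30],   # New York
              [ 8, 13, 31, 11, 11,  9],   # Berlin
              [11, 11, 12, 13, 11, 12]])  # Montreal
cities = np.array(["Hong Kong", "New York", "Berlin", "Montreal"])

# True for each row (city) with at least one above-average measurement.
row_mask = (X > np.average(X)).any(axis=1)
print(row_mask)
# [ True  True  True False]

# Boolean indexing keeps each matching city once, in row order.
polluted_alt = cities[row_mask]
print(polluted_alt)
# ['Hong Kong' 'New York' 'Berlin']
```

Because each row contributes at most one entry to the mask, no duplicates arise in the first place, and the original row order is preserved.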
Where to go from here?
This article has shown you one of many NumPy tricks that can boost your productivity and data science skills.
Want to learn more such tricks? Get your “Coffee Break NumPy”! The book is full of tips, tricks, and NumPy puzzles to lift you to an expert NumPy level. But the best thing is: it’s 100% fun!
While working as a researcher in distributed systems, Dr. Christian Mayer found his love for teaching computer science students.
To help students reach higher levels of Python success, he founded the programming education website Finxter.com. He is the author of the popular programming book Python One-Liners (No Starch Press, 2020), coauthor of the Coffee Break Python series of self-published books, a computer science enthusiast, a freelancer, and the owner of one of the top 10 largest Python blogs worldwide.
His passions are writing, reading, and coding. But his greatest passion is to serve aspiring coders through Finxter and help them to boost their skills. You can join his free email academy here.