The urbanization megatrend shows no sign of slowing down. Cars, factories, and other sources of pollution ultimately lead to poor air quality. This one-liner shows you how to deal with it from a data scientist’s perspective!
The Nonzero() Trick
The air quality index (AQI) measures the danger of adverse health effects. It is commonly used to compare the air quality of different cities. In this one-liner, you are going to dive into the air quality index of four cities: Hong Kong, New York, Berlin, and Montreal.
The one-liner addresses the problem of finding above-average polluted cities. We define them as cities where the peak AQI value is above the overall average among all the measurements of all cities.
An important element of our solution will be to find elements that meet a certain condition in a NumPy array. This is a common problem in data science and you will need it constantly in your practical code projects.
So, let’s explore how you can find array elements that meet a certain condition.
NumPy offers the function nonzero() that finds indices of elements in an array that are, well, not equal to zero. Here is an example:
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

print(np.nonzero(X))
# (array([0, 1, 1, 2], dtype=int64), array([0, 1, 2, 0], dtype=int64))
The result is a tuple of two NumPy arrays. The first array gives the row indices of non-zero elements. The second array gives the column indices of non-zero elements.
There are four non-zero elements in the two-dimensional array: 1, 2, 2, and 3. These four non-zero elements are at positions (0,0), (1,1), (1,2), and (2,0) in the array.
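To make the link between the two index arrays and the positions listed above explicit, here is a minimal sketch that pairs them up. The variable names rows, cols, positions, and values are only illustrative.

```python
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

# nonzero() returns one index array per dimension.
rows, cols = np.nonzero(X)

# Pair each row index with its column index to get (row, col) positions.
positions = [(int(r), int(c)) for r, c in zip(rows, cols)]
print(positions)  # [(0, 0), (1, 1), (1, 2), (2, 0)]

# Indexing X with both index arrays returns the non-zero values themselves.
values = X[rows, cols]
print(values)  # [1 2 2 3]
```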
Now, how can you use nonzero() to find elements that meet a certain condition in your array? Simply use another great NumPy feature: Boolean array operations with broadcasting!
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

print(X == 2)
"""
[[False False False]
 [False  True  True]
 [False False False]]
"""
This is actually an instance of broadcasting: the integer value “2” is copied (conceptually) into a new array with equal shape. NumPy then performs an element-wise comparison and returns the resulting Boolean array.
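The conceptual copy can be spelled out by hand. The sketch below builds the array of twos explicitly (the name twos is only illustrative) and checks that the comparison gives the same Boolean array as the broadcast version. NumPy does not actually allocate such an array when broadcasting; this is just a way to picture it.

```python
import numpy as np

X = np.array([[1, 0, 0],
              [0, 2, 2],
              [3, 0, 0]])

# What broadcasting does conceptually: expand the scalar to the shape of X.
twos = np.full(X.shape, 2)

implicit = (X == 2)     # the scalar is broadcast by NumPy
explicit = (X == twos)  # the same comparison against an explicit array

print(np.array_equal(implicit, explicit))  # True
```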
Do you have an idea of how to combine both features nonzero() and Boolean array operations to find elements which meet a certain condition?
Tip: Python treats the Boolean value “False” as the integer “0” (and “True” as the integer “1”).
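The tip can be checked directly in a few lines. This is a small sketch on a made-up Boolean array (mask is not part of the air quality data):

```python
import numpy as np

# In Python, bool is a subclass of int: False behaves like 0, True like 1.
print(False == 0, True == 1)  # True True
print(True + True)            # 2

# Consequently, nonzero() on a Boolean array returns the indices of the True entries.
mask = np.array([False, True, False, True])
idx = np.nonzero(mask)[0]
print(idx)  # [1 3]
```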
Have a look at the following code snippet to see how this can be done.
In the code snippet, we explore the following problem: “Find cities with above-average pollution peaks!”
## Dependencies
import numpy as np

## Data: air quality index AQI data (row = city)
X = np.array(
    [[ 42, 40, 41, 43, 44, 43 ],  # Hong Kong
     [ 30, 31, 29, 29, 29, 30 ],  # New York
     [  8, 13, 31, 11, 11,  9 ],  # Berlin
     [ 11, 11, 12, 13, 11, 12 ]]) # Montreal
cities = np.array(["Hong Kong", "New York", "Berlin", "Montreal"])

## One-liner
# [0] selects the row indices from the tuple returned by nonzero()
polluted = set(cities[np.nonzero(X > np.average(X))[0]])

## Result
print(polluted)
The data array X contains four rows (one row for each city) and six columns (one column for each measurement, e.g., one per day). The string array cities contains the four city names in the same order as the rows of the data array.
The task is to find the names of the cities for which above-average AQI values were observed. Again, here is the one-liner that accomplishes that:
## One-liner
polluted = set(cities[np.nonzero(X > np.average(X))[0]])
Let’s deconstruct the one-liner starting from within:
print(X > np.average(X))
"""
[[ True  True  True  True  True  True]
 [ True  True  True  True  True  True]
 [False False  True False False False]
 [False False False False False False]]
"""
The Boolean expression uses broadcasting to bring both operands to the same shape. Then it performs an element-wise comparison, which yields a Boolean array that contains “True” wherever the respective measurement is above the average AQI value. We use the function np.average() to compute the average AQI value over all elements of the NumPy array.
By generating this Boolean array, we know exactly which elements satisfy the condition of being above-average and which elements don’t.
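To see the numbers behind this Boolean array, the following sketch prints the overall average and counts the above-average measurements per city. The names avg, above, and per_city are only illustrative.

```python
import numpy as np

X = np.array(
    [[42, 40, 41, 43, 44, 43],   # Hong Kong
     [30, 31, 29, 29, 29, 30],   # New York
     [ 8, 13, 31, 11, 11,  9],   # Berlin
     [11, 11, 12, 13, 11, 12]])  # Montreal

# Average over all 24 measurements: 584 / 24
avg = np.average(X)
print(round(float(avg), 2))  # 24.33

# Boolean array of above-average measurements
above = X > avg

# Summing along each row counts the True entries per city.
per_city = above.sum(axis=1)
print(per_city)  # [6 6 1 0]
```

Berlin has a single above-average value (31), which is enough to count it as a city with a pollution peak.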
Recall that Python treats “True” as “1” and “False” as “0”. Hence, we can use the function nonzero() to find all row and column indices that meet the condition. Here is how you can do this:
print(np.nonzero(X > np.average(X)))
"""
(array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2], dtype=int64),
 array([0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 2], dtype=int64))
"""
The first tuple value holds all row indices with non-zero elements and the second tuple value holds their respective column indices. In other words, these are all rows and columns where the AQI values exceed the average pollution in the data matrix.
We are only interested in the row indices because we are looking for cities where such pollution peaks happened in our data. Taking the first element of the tuple with [0] gives these row indices, which we can use to extract the string names from our string array. (Passing the whole tuple to the one-dimensional cities array would raise an IndexError.)
print(cities[np.nonzero(X > np.average(X))[0]])
"""
['Hong Kong' 'Hong Kong' 'Hong Kong' 'Hong Kong' 'Hong Kong' 'Hong Kong'
 'New York' 'New York' 'New York' 'New York' 'New York' 'New York' 'Berlin']
"""
There are a lot of duplicates in the resulting sequence of strings. The reason is that Hong Kong and New York have many such above-average AQI measurements.
Now, there is only one thing left: removing duplicates. This can be easily achieved by converting the sequence to a Python set. Sets are duplicate-free, so the result gives us all city names where pollution exceeded the average AQI values.
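As a cross-check, the same set of cities can be obtained without nonzero() by asking whether any measurement in a row is above average. This is an alternative sketch, not the one-liner's method. The names via_nonzero and via_any are only illustrative, and the nonzero() variant uses the row indices (the first tuple element), as described above.

```python
import numpy as np

X = np.array(
    [[42, 40, 41, 43, 44, 43],   # Hong Kong
     [30, 31, 29, 29, 29, 30],   # New York
     [ 8, 13, 31, 11, 11,  9],   # Berlin
     [11, 11, 12, 13, 11, 12]])  # Montreal
cities = np.array(["Hong Kong", "New York", "Berlin", "Montreal"])

# nonzero() approach: row indices of all above-average measurements
via_nonzero = set(cities[np.nonzero(X > np.average(X))[0]])

# Alternative: one Boolean flag per row, True if any measurement is above average
via_any = set(cities[(X > np.average(X)).any(axis=1)])

print(via_nonzero == via_any)           # True
print(sorted(str(c) for c in via_any))  # ['Berlin', 'Hong Kong', 'New York']
```

The any(axis=1) variant yields one flag per city, so no duplicates arise in the first place. The set conversion is kept here only to make the two results directly comparable.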
Where to go from here?
This article has shown you one of many NumPy tricks that can boost your productivity and data science skills.
Want to learn more such tricks? Get your “Coffee Break NumPy”! The book is full of tips, tricks, and NumPy puzzles to lift you to an expert NumPy level. But the best thing is: it’s 100% fun!