NumPy Boolean Indexing


You can select specific values from a NumPy array by indexing it with a second NumPy array of Boolean values that marks which positions to keep. For example, to access the second and third values of array a = np.array([4, 6, 8]), you can use the expression a[np.array([False, True, True])], where the Boolean array acts as an indexing mask.

1D Boolean Indexing Example

Here’s a minimal example for one-dimensional NumPy arrays:

import numpy as np

# 1D Boolean Indexing
a = np.array([4, 6, 8])
b = np.array([False, True, True])
print(a[b])
# [6 8]

2D Boolean Indexing Example

And here’s a minimal example for 2D arrays:

import numpy as np

# 2D Boolean Indexing
a = np.array([[1, 2, 3],
              [4, 5, 6]])
b = np.array([[True, False, False],
              [False, False, True]])
print(a[b])
# [1 6]
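As a small sketch of what the 2D case does to the shape of the result: a mask with the same shape as the data picks individual elements and returns them as a flat 1D array, while a 1D mask with one entry per row selects whole rows. The variable names full_mask and row_mask are chosen here for illustration only.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

# A mask with the same shape as the data picks single elements;
# the result is always a flat 1D array.
full_mask = np.array([[True, False, False],
                      [False, False, True]])
picked = a[full_mask]
print(picked)        # [1 6]
print(picked.shape)  # (2,)

# A 1D mask with one entry per row selects entire rows instead.
row_mask = np.array([False, True])
rows = a[row_mask]
print(rows)          # [[4 5 6]]
```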

Let’s dive into another example. Have a look at the following code snippet.

import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

indices = np.array([[False, False, True],
                    [False, False, False],
                    [True, True, False]])

print(a[indices])
# [3 7 8]

We create two arrays a and indices.

  • The first array contains two-dimensional numerical data – you can think of it as the data array.
  • The second array has the same shape and contains Boolean values – think of it as the indexing array.

A great feature of NumPy is that you can use the Boolean array as an indexing scheme to access specific values from the data array. In plain English, we create a new NumPy array from the data array that contains only those elements for which the indexing array holds True at the same position. Thus, the resulting array contains the three values 3, 7, and 8.
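One way to see which positions the mask selects is to convert it into explicit indices with np.nonzero(). The following is only a sketch of that equivalence, reusing the arrays from the snippet above; the names rows, cols, via_mask, and via_nonzero are chosen for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

indices = np.array([[False, False, True],
                    [False, False, False],
                    [True, True, False]])

# np.nonzero returns one index array per axis for the True positions,
# listed in row-major order (row by row, left to right).
rows, cols = np.nonzero(indices)
print(rows, cols)   # [0 2 2] [2 0 1]

# Indexing with these coordinates gives the same values as the mask.
via_nonzero = a[rows, cols]
via_mask = a[indices]
print(via_nonzero)  # [3 7 8]
```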

Python One-Liner Example: Boolean Indexing


In the following one-liner, you are going to use this feature for miniature social network analysis.

We are examining the following problem: “Find the names of the Instagram superstars with more than 100 million followers!”

## Dependencies
import numpy as np

## Data: popular Instagram accounts (followers in millions)
inst = np.array([[232, "@instagram"],
                 [133, "@selenagomez"],
                 [59,  "@victoriassecret"],
                 [120, "@cristiano"],
                 [111, "@beyonce"],
                 [76,  "@nike"]])

## One-liner
superstars = inst[inst[:,0].astype(float) > 100, 1]

## Results
print(superstars)

You can compute the result of this one-liner in your head, can’t you?

The data consists of a two-dimensional array where each row represents an Instagram influencer. The first column states their number of followers (in millions), and the second column states their Instagram name. The task is to find the names of the Instagram influencers with more than 100 million followers.

The following one-liner is one way of solving this problem. Note that there are many alternatives – this is just the one I found that uses the fewest characters.

## One-liner
superstars = inst[inst[:,0].astype(float) > 100, 1]

Let’s deconstruct this one-liner in a step by step manner.

First, we compute a Boolean value for each influencer that states whether they have more than 100 million followers:

print(inst[:,0].astype(float) > 100)
# [ True  True False  True  True False]

The first column of the data array contains the number of followers, so we use slicing to access this data (inst[:,0] returns all rows but only the first column). However, the data array contains mixed data types (integers and strings). Therefore, NumPy automatically assigns a non-numerical (string) data type to the whole array, and the follower counts are stored as text.
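To check this behavior yourself, you can inspect the data type of the array. The sketch below uses a shortened two-row version of the data, and the names inst_sample and followers are chosen for illustration.

```python
import numpy as np

# Shortened sample of the data: integers mixed with strings.
inst_sample = np.array([[232, "@instagram"],
                        [133, "@selenagomez"]])

# The dtype kind 'U' means Unicode string: the numbers were stored as text.
print(inst_sample.dtype.kind)     # U
print(inst_sample[0, 0] == "232") # True

# Converting the first column gives numbers you can compare.
followers = inst_sample[:, 0].astype(float)
print(followers)                  # [232. 133.]
```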

But as we want to perform numerical comparisons on the first column of the data array (checking whether each value is larger than 100), we first need to convert the array into a numerical type (for example float).

At this point, we check whether a NumPy array of type float is larger than an integer value. What exactly happens here? You have already learned about broadcasting: NumPy automatically brings the two operands into the same shape. Then, it compares the two equally-shaped arrays element-wise. The result is an array of Boolean values. Four influencers have more than 100 million followers.
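The broadcasting step can be made visible by building the second operand by hand. This is only an illustrative sketch: the follower numbers are taken from the data above, and np.full() is used here to create an array of the same shape filled with the value 100.

```python
import numpy as np

followers = np.array([232., 133., 59., 120., 111., 76.])

# Comparison against a scalar: NumPy broadcasts 100 to the array's shape.
mask_scalar = followers > 100

# The same comparison with the broadcast written out explicitly.
mask_explicit = followers > np.full(followers.shape, 100)

print(np.array_equal(mask_scalar, mask_explicit))  # True
print(mask_scalar.sum())                           # 4
```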

We now take this Boolean array as an indexing array to select the influencers with more than 100 million followers (the rows).

inst[inst[:,0].astype(float) > 100, 1]

As we are only interested in the names of these influencers, we select the second column (index 1) as the final result stored in the superstars variable.

The influencers with more than 100 million Instagram followers are:

# ['@instagram' '@selenagomez' '@cristiano' '@beyonce']
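As one of the alternatives mentioned earlier, you could keep the numbers and the names in two separate arrays so that no type conversion is needed. This is a sketch of a different data layout, not the one-liner from this article; the names followers and names are chosen for illustration.

```python
import numpy as np

# Same data as above, split into a numeric array and a string array.
followers = np.array([232, 133, 59, 120, 111, 76])
names = np.array(["@instagram", "@selenagomez", "@victoriassecret",
                  "@cristiano", "@beyonce", "@nike"])

# The Boolean mask computed on one array indexes the other.
superstars = names[followers > 100]
print(superstars)
# ['@instagram' '@selenagomez' '@cristiano' '@beyonce']
```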

What’s Next?

Learning NumPy will not only make you a better Python coder, it will also improve your chances of finding profitable positions as a data scientist and solving important real-world problems.

To help you increase your value to the marketplace, I’ve written a new NumPy book — 100% based on the proven principle of puzzle-based learning.