People Who Bought X Also Bought …? [A Simple NumPy Tutorial]

Imagine you are Jeff Bezos. One of the most successful features of your company Amazon is product recommendation.

“People who bought X also bought Y.”

Roughly speaking, this feature alone has made you billions. For you, Jeff Bezos, product recommendation is the most important algorithm in the world, isn’t it?

In this article, you’ll learn about the basics of association analysis, the underlying algorithmic framework of product recommendation used by companies such as Amazon or Netflix.

I’ll show you the basic ideas of association analysis in a single line of code. A one-liner is a powerful lever. Don’t believe me? Read this data science tutorial and learn

  • the ideas behind association analysis, an important data science technique, and where it is applied,
  • how to use important NumPy and Python functions and concepts such as slicing, list comprehension, and element-wise array operations, and
  • how to analyze complex code in a rigorous manner.

But first things first: what is association analysis?

A Conceptual Introduction to Association Analysis

Association analysis is based on historical (customer) data. For instance, you may have already read the recommendation “People who bought X also bought Y” on Amazon. This association of different products is a powerful marketing concept: it not only ties together related, complementary products, but also provides an element of “social proof”. The fact that other people have bought the product increases your psychological safety to buy it yourself. This is an excellent tool for marketers.

Let’s have a look at a practical example:

Consider four people: Alice, Bob, Louis, and Larissa. Each has bought a different set of products (book, game, football, notebook, headphones). Say we know every purchase made by all four people, except whether Louis has bought the notebook. What would you say: is Louis likely to buy the notebook?

Association analysis (or collaborative filtering) provides an answer to this problem. The underlying assumption is that if two people performed similar actions in the past (e.g., bought similar products), they are more likely to keep performing similar actions in the future. If you compare the customer profiles, you will quickly see that Louis’s buying behavior is similar to Alice’s. Both Louis and Alice have bought the game and the football but not the headphones and the book. For Alice, we also know that she bought the notebook. Thus, the recommender system will predict that Louis is likely to buy the notebook, too.
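The reasoning above can be sketched in a few lines of NumPy. Treat this as an illustration under assumptions rather than a real recommender: the rows for Alice and Louis follow the text, the rows for Bob and Larissa are hypothetical placeholders, and “similarity” here simply means the number of products on which two customers agree.

```python
import numpy as np

products = ["book", "game", "football", "notebook", "headphones"]
customers = ["Alice", "Bob", "Larissa"]

# 1 = bought, 0 = not bought. Alice's row follows the text;
# Bob's and Larissa's rows are hypothetical, made up for illustration.
known = np.array([[0, 1, 1, 1, 0],   # Alice
                  [1, 0, 0, 0, 1],   # Bob (hypothetical)
                  [1, 1, 0, 1, 1]])  # Larissa (hypothetical)

# Louis: game and football bought, book and headphones not bought.
# His notebook entry is unknown, so it is left out of the comparison.
louis = np.array([0, 1, 1, 0])
compare_cols = [0, 1, 2, 4]          # every product except the notebook

# Similarity = number of products on which Louis and a customer agree.
matches = np.sum(known[:, compare_cols] == louis, axis=1)
most_similar = customers[int(np.argmax(matches))]

# Predict Louis's notebook purchase from the most similar customer.
notebook_col = products.index("notebook")
prediction = int(known[np.argmax(matches), notebook_col])

print(most_similar, prediction)
# Alice 1
```

With these assumed rows, Alice agrees with Louis on all four known products, so her notebook purchase becomes the prediction for Louis.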

Let’s explore the topic of association analysis in more detail. Ready?

Consider the example of the previous section: your customers purchase individual products from a corpus of four different products. Your company wants to upsell products to customers. Thus, your boss tells you to calculate for each combination of products how often they have been purchased by the same customer – and find the two products which were purchased most often together.

How to apply association analysis in a single line of NumPy code?

The problem: find the two items that were purchased most often together.

## Dependencies
import numpy as np


## Data: row is customer shopping basket
## row = [course 1, course 2, ebook 1, ebook 2]
## value 1 indicates that an item was bought.
basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])


## One-liner (broken down in two lines;)
copurchases = [(i,j,np.sum(basket[:,i] + basket[:,j] == 2))
               for i in range(4) for j in range(i+1,4)]

## Result
print(max(copurchases, key=lambda x:x[2]))

What’s the output of this one-liner?

Explanation and discussion of the code

The data matrix consists of historical purchasing data with one row per customer and one column per product. Our goal is to find a list of tuples so that each tuple describes a combination of products and how often these were bought together. For each list element, the first two tuple values are column indices (the combination of two products) and the third tuple value is the number of times these products were bought together. Here is an example of such a tuple:

(0,1,4)

The meaning of this tuple is the following: four customers bought both product 0 and product 1.

So how can we achieve this objective? Let’s break the one-liner down (I split the one-liner across two lines so that it does not get too wide).

## One-liner (broken down in two lines;)
copurchases = [(i,j,np.sum(basket[:,i] + basket[:,j] == 2))
               for i in range(4) for j in range(i+1,4)]

The outer structure shows that we create a list of tuples using list comprehension (see Chapter 3). We are interested in every unique combination of column indices of an array with four columns. Here is what the outer part of this one-liner looks like:

print([(i,j) for i in range(4) for j in range(i+1,4)])
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

So there are six different tuples in the list – each being a unique combination of column indices.
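As a side note, the same index pairs can also be produced with itertools.combinations from the standard library. The article’s one-liner does not use it; this is just a small sketch of an equivalent way to enumerate the pairs.

```python
from itertools import combinations

# All unordered pairs of the column indices 0..3, in the same order
# as the nested loops of the list comprehension.
pairs = list(combinations(range(4), 2))
print(pairs)
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```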

Knowing this, we can now dive into the third tuple element: the number of times these two products i and j have been bought together:

np.sum(basket[:,i] + basket[:,j] == 2)

We use slicing to extract columns i and j from the original NumPy array. Then we add them together element-wise. For the resulting array, we check element-wise whether the sum is equal to 2. Why? Because each entry is either 0 or 1, so the sum is 2 exactly when the customer bought both products. This gives us a Boolean array that is True for each customer who bought both products.

Because Boolean values count as integers (True as 1, False as 0), we simply sum over all array elements to get the number of customers who bought both products i and j. We store all resulting tuples in the list “copurchases”.
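To make the individual steps concrete, here is a small sketch that walks through them for one pair of columns, i = 1 and j = 2. The intermediate variable names (col_sum, both, count_alt) are introduced only for this illustration.

```python
import numpy as np

basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

i, j = 1, 2

# Step 1: slice out both columns and add them element-wise.
col_sum = basket[:, i] + basket[:, j]
print(col_sum.tolist())
# [2, 0, 1, 2, 2, 2, 1, 2]

# Step 2: a sum of 2 means the customer bought both products.
both = col_sum == 2
print(both.tolist())
# [True, False, False, True, True, True, False, True]

# Step 3: summing the Boolean array counts the True values.
print(int(np.sum(both)))
# 5

# An equivalent formulation for 0/1 data: element-wise AND, then count.
count_alt = int(np.count_nonzero(basket[:, i] & basket[:, j]))
```

Both formulations rely on the data containing only 0 and 1; with purchase quantities instead of flags, the comparison with 2 would no longer mean “bought both”.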

Want to see the elements of the list?

print(copurchases)
# [(0, 1, 4), (0, 2, 2), (0, 3, 2), (1, 2, 5), (1, 3, 3), (2, 3, 2)]

Now there is one thing left: find the two products that have been copurchased most often.

## Result
print(max(copurchases, key=lambda x:x[2]))

We simply use the max function that gives us the maximum element in the list. Maximum for tuples? Yes, simply define a key function that takes a tuple and returns the third tuple value. Roughly speaking, the third tuple value (number of copurchases) determines the maximum of this copurchasing list. Hence, the result of this code snippet is:

## Result
print(max(copurchases, key=lambda x:x[2]))
# (1, 2, 5)
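If you want the full ranking instead of only the top pair, the same key function works with sorted. This is an optional sketch that goes beyond the one-liner; it starts from the copurchases list printed earlier.

```python
copurchases = [(0, 1, 4), (0, 2, 2), (0, 3, 2),
               (1, 2, 5), (1, 3, 3), (2, 3, 2)]

# Sort by the third tuple value (number of copurchases), highest first.
ranking = sorted(copurchases, key=lambda x: x[2], reverse=True)
print(ranking)
# [(1, 2, 5), (0, 1, 4), (1, 3, 3), (0, 2, 2), (0, 3, 2), (2, 3, 2)]
```

Note that max returns the first maximal element it encounters, so if two pairs were tied for first place, only the one appearing earlier in the list would be reported.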

The second and the third products (column indices 1 and 2, that is, course 2 and ebook 1) have been purchased together by five customers. No other product combination was copurchased as often. Hence, you can tell your boss to upsell product 2 when selling product 1 and the other way around.
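For larger product catalogs, looping over all pairs in Python can become slow. One common alternative, sketched here as an assumption and not as part of the original one-liner, is to compute all copurchase counts at once with a matrix product: for 0/1 data, entry (i, j) of basket.T @ basket is the number of customers who bought both products i and j.

```python
import numpy as np

basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

# Co-occurrence matrix: cooc[i, j] counts customers who bought both i and j.
# The diagonal holds the plain purchase counts of each product.
cooc = basket.T @ basket

# Look only at pairs with i < j (strict upper triangle).
rows, cols = np.triu_indices(basket.shape[1], k=1)
best = int(np.argmax(cooc[rows, cols]))
i, j = int(rows[best]), int(cols[best])

print(i, j, int(cooc[i, j]))
# 1 2 5
```

Using basket.shape[1] instead of the hard-coded 4 also lets the sketch adapt to a different number of products.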

Where to go from here?

A thorough understanding of the NumPy library is crucial for your data science education. Virtually every data science practitioner working in Python relies on NumPy to solve their problems.

If you feel like you need to read a bit more about the NumPy library, go over my free NumPy tutorial on the blog.
