Simple Association Analysis in One Line of Python

Have you ever bought a product recommended by Amazon’s algorithms? Chances are that you are guilty of having purchased many such products. The recommendation algorithms are often based on a technique called “association analysis”.

In this article, you’ll learn the basic idea of association analysis and how to dip your toe into the deep ocean of recommender systems, all in a single line of NumPy code.

Understanding the Basics of Association Analysis

Association analysis is based on historical (customer) data. For instance, you may have already read the recommendation “People who bought X also bought Y” on Amazon. This association of different products is a powerful marketing concept: it not only ties together related, complementary products, it also provides an element of “social proof”. The fact that other people have bought the product makes you feel psychologically safer buying it yourself. This is an excellent tool for marketers.

Let’s have a look at a practical example:

There are four persons: Alice, Bob, Louis, and Larissa. Each person has bought different products (book, game, football, notebook, headphones). Say we know every product bought by all four persons, except whether Louis has bought the notebook. What would you say: is Louis likely to buy the notebook?

Association analysis (or collaborative filtering) provides an answer to this problem. The underlying assumption is that if two persons performed similar actions in the past (e.g., bought similar products), they are more likely to keep performing similar actions in the future. If you look closely at the customer profiles above, you will quickly realize that Louis has a similar buying behavior to Alice. Both Louis and Alice have bought the game and the football but neither the headphones nor the book. For Alice, we also know that she bought the notebook. Thus, the recommender system will predict that Louis is likely to buy the notebook, too.
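The reasoning above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not a full recommender system: Alice’s row and Louis’s known purchases follow the description in the text, while the rows for Bob and Larissa are hypothetical values made up for this example. Similarity is measured simply as the number of products on which two customers agree.

```python
import numpy as np

# Columns: book, game, football, notebook, headphones
# Alice's row follows the text; Bob's and Larissa's rows are
# hypothetical values chosen only for illustration.
names = ["Alice", "Bob", "Larissa"]
others = np.array([[0, 1, 1, 1, 0],   # Alice
                   [1, 0, 0, 0, 1],   # Bob (hypothetical)
                   [1, 1, 0, 0, 1]])  # Larissa (hypothetical)

# Louis's known purchases: book, game, football, headphones
louis_known = np.array([0, 1, 1, 0])

known_cols = [0, 1, 2, 4]  # every column except the notebook
notebook_col = 3

# Count the known products on which each customer agrees with Louis
agreement = np.sum(others[:, known_cols] == louis_known, axis=1)

# The most similar customer decides the prediction for the notebook
best = np.argmax(agreement)
print(names[best], agreement[best], others[best, notebook_col])
# Alice 4 1
```

Alice agrees with Louis on all four known products and has bought the notebook, so this simple nearest-neighbor rule predicts that Louis will buy it as well.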

The code snippet in the next section tackles a simplified version of this problem.

The Code

We consider the following problem: What’s the fraction of customers who bought both eBooks together?

## Dependencies
import numpy as np

## Data: row is customer shopping basket
## row = [course 1, course 2, ebook 1, ebook 2]
## value 1 indicates that an item was bought.
basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

## One-liner
copurchases = np.sum(np.all(basket[:,2:], axis = 1)) / basket.shape[0]

## Result
print(copurchases)

What is the output of this code snippet?

Explaining the Code

The basket data array consists of customer data with one row per customer and one column per product (see the Figure above). Say, the first two products with column indices 0 and 1 are online courses and the latter two products with column indices 2 and 3 are eBooks. The value “1” in cell (i,j) indicates that customer i has bought the product j.

The problem is to find the fraction of customers who bought both eBooks (columns 2 and 3). In other words, we need to count the number of customers who have a value “1” at both columns 2 and 3. Thus, we first carve out the relevant columns from the original array to get the following sub-array:

[[1 0]
 [0 1]
 [0 0]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 1]]

The slicing operation ensures that only the third and the fourth column – but all rows – remain in the array.
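To check the slicing step in isolation, the following small sketch reuses the basket array from the article and confirms that basket[:, 2:] keeps all eight rows but only the two eBook columns. The variable name ebooks is introduced here purely for illustration.

```python
import numpy as np

basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

# ":" selects all rows, "2:" selects columns 2 up to the last column
ebooks = basket[:, 2:]

print(ebooks.shape)
# (8, 2)

# Selecting columns 2 and 3 explicitly gives the same values
print(np.array_equal(ebooks, basket[:, [2, 3]]))
# True
```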

As you would intuitively guess, the NumPy all() function checks whether all values in a NumPy array evaluate to “True”. If this is the case, it returns “True”, otherwise it returns “False”. When used with the axis argument, the function performs this operation along the specified axis. Note that the axis argument is a recurring element for many different NumPy functions. Take your time to understand the axis argument properly: The specified axis is collapsed into a single value.
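A tiny example may help to see how the axis argument behaves. The 2x2 array a below is not part of the original data; it is a made-up array chosen so that the three variants of np.all() give different results.

```python
import numpy as np

a = np.array([[1, 0],
              [1, 1]])

# No axis: all four values must be truthy
print(np.all(a))
# False

# axis=0: the rows are collapsed, leaving one result per column
print(np.all(a, axis=0))
# [ True False]

# axis=1: the columns are collapsed, leaving one result per row
print(np.all(a, axis=1))
# [False  True]
```

With one row per customer, axis=1 is the variant that yields one Boolean value per customer.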

Thus, the result of applying the all() function on the sub-array is the following:

print(np.all(basket[:,2:], axis = 1))
# [False False False  True False False False  True]

In plain English: only the fourth and the last customer have bought both eBooks.

As we are interested in the fraction of customers, we sum over this Boolean array (side note: the Boolean value “True” is represented by an integer value of “1” and “False” by an integer value of “0”) and divide by the number of customers. The result is the fraction of customers who bought both eBooks (which is 0.25).
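The final step can be reproduced on the Boolean array alone. The sketch below hard-codes the result of the all() call (the variable name both is an illustrative choice) and shows that summing and dividing is equivalent to taking the mean of the Boolean array, which is an alternative formulation and not the one-liner used in this article.

```python
import numpy as np

# Result of np.all(basket[:, 2:], axis=1) from the article
both = np.array([False, False, False, True, False, False, False, True])

# True counts as 1 and False as 0, so the sum is the number of customers
print(np.sum(both))
# 2

# Divide by the number of customers to get the fraction
print(np.sum(both) / len(both))
# 0.25

# The mean of a Boolean array gives the same fraction directly
print(np.mean(both))
# 0.25
```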

Where to go from here?

The code in this article uses several advanced NumPy features. If you are serious about your career as a data scientist, you need to learn NumPy.

There is no way around it, so why not tackle NumPy head-on and become a NumPy master? This book combines a practical, tutorial-style introduction with puzzle-based learning grounded in the latest research in educational science. It’s effective and fun!
