People Who Bought X Also Bought ...? An Introduction to NumPy Association Analysis

Imagine you are Jeff Bezos. One of the most successful features of your company Amazon is product recommendation.

“People who bought X also bought Y.”

Roughly speaking, this feature alone has made you billions. For you, Jeff Bezos, product recommendation is the most important algorithm in the world, isn’t it?

In this article, you’ll learn about the basics of association analysis, the underlying algorithmic framework of product recommendation used by companies such as Amazon or Netflix.

I’ll show you the basic ideas of association analysis in a single line of code. In this data science tutorial, you’ll learn:

the ideas behind association analysis and its applications in data science,
how to use important NumPy and Python functions and concepts such as slicing, list comprehension, and element-wise array operations, and
how to analyze complex code in a rigorous manner.

But first things first: what is association analysis?

A Conceptual Introduction to Association Analysis

Association analysis is based on historical (customer) data. For instance, you may have already read the recommendation “People who bought X also bought Y” on Amazon. This association of different products is a powerful marketing concept because it not only ties together related and complementary products but also provides an element of “social proof”: the fact that other people have bought the product increases your psychological safety to buy it yourself. This is an excellent tool for marketers.

Let’s have a look at a practical example:

Example Association Analysis Product Matrix

There are four persons: Alice, Bob, Louis, and Larissa. Each person has bought different products (book, game, football, notebook, headphones). Say we know every product bought by all four persons, except whether Louis has bought the notebook. What would you say: is Louis likely to buy the notebook?

Definition: Association analysis (or collaborative filtering) provides an answer to this problem. The underlying assumption is that if two persons performed similar actions in the past (e.g. bought a similar product), it is more likely that they keep performing similar actions in the future.

If you look closely at the customer profiles above, you will quickly realize that Louis has a similar buying behavior to Alice. Both Louis and Alice have bought the game and the football but not the headphones and the book. For Alice, we also know that she bought the notebook. Thus, the recommender system will predict that Louis is likely to buy the notebook, too.
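The reasoning above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the purchase rows for Bob and Larissa are made up (the text only describes Alice and Louis), and counting matching entries is just one simple way to measure how similar two customers are.

```python
import numpy as np

# Column order: book, game, football, notebook, headphones
NOTEBOOK = 3
known_cols = [0, 1, 2, 4]  # every column except the notebook

others = {
    "Alice":   np.array([0, 1, 1, 1, 0]),  # as described in the text
    "Bob":     np.array([1, 0, 0, 1, 1]),  # hypothetical row
    "Larissa": np.array([1, 1, 0, 0, 1]),  # hypothetical row
}

# Louis's known purchases: book, game, football, headphones
louis_known = np.array([0, 1, 1, 0])

# Similarity = number of known products on which both customers agree
similarity = {name: int(np.sum(row[known_cols] == louis_known))
              for name, row in others.items()}

most_similar = max(similarity, key=similarity.get)
prediction = int(others[most_similar][NOTEBOOK])

print(similarity)    # {'Alice': 4, 'Bob': 0, 'Larissa': 1}
print(most_similar)  # Alice
print(prediction)    # 1 -> Louis is predicted to buy the notebook
```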

Let’s explore the topic of association analysis in more detail. Ready?

Consider a scenario similar to the example of the previous section: your customers purchase individual products from a corpus of four different products. Your company wants to upsell products to customers. Thus, your boss tells you to calculate, for each pair of products, how often they have been purchased by the same customer, and to find the two products that were purchased together most often.

How to Apply Association Analysis in a Single Line of NumPy Code?

Problem Formulation: find the two items that were purchased most often together.

## Dependencies
import numpy as np


## Data: row is customer shopping basket
## row = [course 1, course 2, ebook 1, ebook 2]
## value 1 indicates that an item was bought.
basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])


## One-liner (broken down in two lines;)
copurchases = [(i,j,np.sum(basket[:,i] + basket[:,j] == 2))
               for i in range(4) for j in range(i+1,4)]

## Result
print(max(copurchases, key=lambda x:x[2]))

Exercise: What’s the output of this one-liner?

Code Discussion & Explanation

The data matrix consists of historical purchasing data with one row per customer and one column per product. Our goal is to find a list of tuples so that each tuple describes a combination of products and how often these were bought together. For each list element, the first two tuple values are column indices (the combination of two products) and the third tuple value is the number of times these products were bought together.

Here is an example of such a tuple:

(0,1,4)

The meaning of this tuple is the following: four customers bought both product 0 and product 1.

So how can we achieve this objective? Let’s break the one-liner down (I split it across two lines so that it does not get too wide).

## One-liner (broken down in two lines;)
copurchases = [(i,j,np.sum(basket[:,i] + basket[:,j] == 2))
               for i in range(4) for j in range(i+1,4)]

The outer format indicates that we create a list of tuples using list comprehension. We are interested in every unique combination of column indices of an array with four columns. Here is what the outer part of this one-liner looks like:

print([(i,j) for i in range(4) for j in range(i+1,4)])
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]

So there are six different tuples in the list – each being a unique combination of column indices.
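As a side note, the same index pairs can be generated with the standard library. The following sketch uses itertools.combinations, which is not part of the original one-liner but yields the pairs in the same order:

```python
from itertools import combinations

# All unordered pairs of the column indices 0..3
pairs = list(combinations(range(4), 2))

print(pairs)
# [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
```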

Knowing this, we can now dive into the third tuple element: the number of times these two products i and j have been bought together:

np.sum(basket[:,i] + basket[:,j] == 2)

We use slicing to extract columns i and j from the original NumPy array. Then we add them together element-wise. For the resulting array, we check element-wise whether the sum is equal to 2. Why? Because each entry is either 0 or 1, the sum is 2 exactly when the same customer purchased both products. The result is a Boolean array with a True value for every customer who bought both products.

Because Boolean values count as integers when summed (True is 1, False is 0), we simply sum over all array elements to obtain the number of customers who bought both products i and j. We store all resulting tuples in the list “copurchases”.
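To make these steps concrete, here is a small walk-through for the pair i=1, j=2 on the basket array from above. The intermediate variable names (col_i, col_j, total, both) are only for illustration and do not appear in the one-liner.

```python
import numpy as np

basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

i, j = 1, 2
col_i = basket[:, i]   # slicing: all rows, column i
col_j = basket[:, j]   # slicing: all rows, column j
total = col_i + col_j  # element-wise addition
both = total == 2      # element-wise comparison -> Boolean array

print(col_i.tolist())  # [1, 0, 1, 1, 1, 1, 1, 1]
print(col_j.tolist())  # [1, 0, 0, 1, 1, 1, 0, 1]
print(total.tolist())  # [2, 0, 1, 2, 2, 2, 1, 2]
print(both.tolist())   # [True, False, False, True, True, True, False, True]
```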
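The following short sketch illustrates why summing a Boolean array counts its True values. The array is the Boolean result for products 1 and 2; np.count_nonzero is shown only as an equivalent alternative and is not used in the one-liner.

```python
import numpy as np

# In Python, bool is a subclass of int: True behaves like 1, False like 0
print(True + True)            # 2
print(isinstance(True, int))  # True

# Boolean array for the product pair (1, 2)
both = np.array([True, False, False, True, True, True, False, True])

count = int(np.sum(both))                # each True contributes 1 to the sum
alternative = int(np.count_nonzero(both))

print(count)        # 5
print(alternative)  # 5
```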

Want to see the elements of the list?

print(copurchases)
# [(0, 1, 4), (0, 2, 2), (0, 3, 2), (1, 2, 5), (1, 3, 3), (2, 3, 2)]
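If you prefer linear algebra over list comprehension, the same counts can be obtained with a matrix product. This is an alternative sketch, not the approach of the one-liner: for a 0/1 matrix, entry (i, j) of basket.T @ basket is the number of customers who bought both products i and j, and the diagonal holds the number of buyers per product. The counts are cast to int so the printed tuples look the same across NumPy versions.

```python
import numpy as np

basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

# Co-purchase matrix: products x products
co_matrix = basket.T @ basket
print(co_matrix.tolist())
# [[4, 4, 2, 2], [4, 7, 5, 3], [2, 5, 5, 2], [2, 3, 2, 4]]

# Keep only the entries above the diagonal (each pair once, no self-pairs)
rows, cols = np.triu_indices(basket.shape[1], k=1)
pairs = [(int(i), int(j), int(co_matrix[i, j])) for i, j in zip(rows, cols)]
print(pairs)
# [(0, 1, 4), (0, 2, 2), (0, 3, 2), (1, 2, 5), (1, 3, 3), (2, 3, 2)]
```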

Now there is one thing left: find the two products that have been copurchased most often.

## Result
print(max(copurchases, key=lambda x:x[2]))

We simply use the max function that gives us the maximum element in the list. Maximum for tuples? Yes, simply define a key function that takes a tuple and returns the third tuple value. Roughly speaking, the third tuple value (number of copurchases) determines the maximum of this copurchasing list. Hence, the result of this code snippet is:

## Result
print(max(copurchases, key=lambda x:x[2]))
# (1, 2, 5)
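The max function returns only a single winner. If you want the full ranking, a possible variation is to sort the list with the same key function. The sketch below assumes the copurchases list from above, written out here with plain integers so that it runs on its own.

```python
copurchases = [(0, 1, 4), (0, 2, 2), (0, 3, 2),
               (1, 2, 5), (1, 3, 3), (2, 3, 2)]

# Same key function as before: compare tuples by their third value
best = max(copurchases, key=lambda x: x[2])

# Full ranking, most frequent pair first; ties keep their original order
ranking = sorted(copurchases, key=lambda x: x[2], reverse=True)

print(best)
# (1, 2, 5)
print(ranking)
# [(1, 2, 5), (0, 1, 4), (1, 3, 3), (0, 2, 2), (0, 3, 2), (2, 3, 2)]
```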

The second and third products (column indices 1 and 2) were purchased together by five customers. No other product combination is copurchased as often. Hence, you can tell your boss to upsell the product at index 2 when selling the product at index 1, and the other way around.
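To report the result to your boss in readable form, you could map the column indices back to product names. The names list below is taken from the comment in the data section (course 1, course 2, ebook 1, ebook 2); the variable name and the output format are just assumptions for this sketch.

```python
import numpy as np

# Column labels, following the comment in the data section
names = ["course 1", "course 2", "ebook 1", "ebook 2"]

basket = np.array([[0, 1, 1, 0],
                   [0, 0, 0, 1],
                   [1, 1, 0, 0],
                   [0, 1, 1, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 0],
                   [1, 1, 0, 1],
                   [1, 1, 1, 1]])

copurchases = [(i, j, int(np.sum(basket[:, i] + basket[:, j] == 2)))
               for i in range(4) for j in range(i + 1, 4)]

# Unpack the winning tuple and translate indices into names
i, j, count = max(copurchases, key=lambda x: x[2])
print(f"{names[i]} + {names[j]}: bought together by {count} customers")
# course 2 + ebook 1: bought together by 5 customers
```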

This tutorial is based on my book Python One-Liners—feel free to check it out!
