Proportional Sampling Using Weighted Values

Probability and Statistics play a very important role in the field of data science and machine learning. In this blog post you will learn the concept of proportional sampling and how can we implement it from scratch without using any library

Proportional Sampling

Let us take an example of tossing a die to better understand the concept of proportional sampling. An unbiased die is a die in which the probability of getting a number between 1 and 6 is equal. Let us now imagine that the die is biased i.e a weight value is given to each side of the die.

123456
Fig 1: Die represented in the form of an array

                                                       

201260583310
Fig 2: Weight values given to the different sides of a die

                     

Proportional sampling is a technique in which the probability of selecting a number is proportional to the weight of that number. So, for instance if we run  an experiment  of tossing a die 100 times, then the probability of getting a 6 would be the lowest since the weight value of the side 6 is 10 which is the lowest amongst all other weight values. On the other hand, the probability of getting a 4 would be the highest since the weight value for 3 is 60 which is the highest amongst all other values.

There are 3 essentials steps to proportionally sample a number from a list.

  1. Computing the cumulative normalized sum values
  2. Choosing a random value from uniform distribution
  3. Sampling a value

Cumulative Normalized Sum

In order to compute the cumulative normalized sum value we first need to calculate the total sum of the weight values and then normalize the weight values by dividing each weight value by the total sum. After normalizing the weight values, we will have all the values between 0 and 1 and the sum of all the values will always be equal to 1.

Let us declare a variable called dice and weights which represents the 6 sides of the die and the corresponding weight values

dice = [1, 2, 3, 4, 5, 6]
weights = [20, 12, 60, 58, 33, 10]

We will now compute the sum of all weights and store it in a variable called total_sum. We  can use the in-built sum function to do this.

total_sum = sum(weights)
normalized_weights = [weight/total_sum for weight in weights]
print(normalized_weights)

The normalized weights have values between 0 and 1 and the sum of all the values is equal to 1

[0.10362694300518134, 0.06217616580310881, 0.31088082901554404, 0.3005181347150259, 0.17098445595854922, 0.05181347150259067]

The cumulative sum is used for monitoring change detection in a sequential data-set. Let us denote the cumulative sum by a variable called weight_cum_sum and computing it as follows

weight_cum_sum[0] = normalized_weights[0]
weight_cum_sum[1] = weight_cum_sum[0] +  normalized_weights[1]
weight_cum_sum[2] = weight_cum_sum[1] +  normalized_weights[2]
weight_cum_sum[3] = weight_cum_sum[2] +  normalized_weights[3]
weight_cum_sum[4] = weight_cum_sum[3] +  normalized_weights[4]
weight_cum_sum[5] = weight_cum_sum[4] +  normalized_weights[5]

We can do this efficiently in python by running a for loop and appending the cumulative sum values in a list

cum_sum = [normalized_weights[0]]
for i in range(1, len(normalized_weights)):
    cum_sum.append(cum_sum[i-1] +  normalized_weights[i])

If we print cum_sum, we will get the following values

[0.10362694300518134, 0.16580310880829013, 0.47668393782383417,  0.7772020725388601,  0.9481865284974094, 1.0]

Choosing a random value

Now that we have calculated the cumulative sum of the weight values, we will now randomly choose a number between 0 and 1 from a uniform distribution. We can do this by using the uniform function from the random module in python. We will denote this number by r.

from random import uniform
r = uniform(0,1)

Sampling

We will now loop through the cum_sum array and if the value of r is less than or equal to the cum_sum value at a particular index, then we will return the die value at that index

for index, value in enumerate(cum_sum):
    if r <= value:
      return dice[index]  

You can see the entire code below

from random import uniform

def proportional_sampling(dice, weights):
    total_sum = sum(weights)
    normalized_weights = [weight/total_sum for weight in weights]
    cum_sum = [normalized_weights[0]]
    r = uniform(0,1)
    for i in range(1, len(normalized_weights)):
        cum_sum.append(cum_sum[i-1] + normalized_weights[i])
    for index, value in enumerate(cum_sum):
        if r <=  value:
            return dice[index]
       
dice = [1,2,3,4,5,6]
weights = [20, 12, 60, 58, 33, 10]  
sampled_value = proportional_sampling(dice, weights)

Experimentation

We will now run an experiment where will call the proportional_sampling 100 times and analyze the result of sampling a number

dice_result = {}
for i in range(0, 100):
    sampled_value = proportional_sampling(dice, weights)
    if sampled_value not in dice_result:
        dice_result[sampled_value] = 1
    else:
        dice_result[sampled_value] += 1

As you can see from the above figure the probability of getting a 3 is the highest since 3  was given a weight of 60 which was the largest number in the weights array. If we  run this experiment for 1000 iterations instead of 100 you can expect to get even more precise results.