Probability and Statistics play a very important role in the field of data science and machine learning. In this blog post you will learn the concept of proportional sampling and how can we implement it from scratch without using any library
Proportional Sampling
Let us take an example of tossing a die to better understand the concept of proportional sampling. An unbiased die is a die in which the probability of getting a number between 1 and 6 is equal. Let us now imagine that the die is biased i.e a weight value is given to each side of the die.
1 | 2 | 3 | 4 | 5 | 6 |
20 | 12 | 60 | 58 | 33 | 10 |
Proportional sampling is a technique in which the probability of selecting a number is proportional to the weight of that number. So, for instance if we run an experiment of tossing a die 100 times, then the probability of getting a 6 would be the lowest since the weight value of the side 6 is 10 which is the lowest amongst all other weight values. On the other hand, the probability of getting a 4 would be the highest since the weight value for 3 is 60 which is the highest amongst all other values.
There are 3 essentials steps to proportionally sample a number from a list.
- Computing the cumulative normalized sum values
- Choosing a random value from uniform distribution
- Sampling a value
Cumulative Normalized Sum
In order to compute the cumulative normalized sum value we first need to calculate the total sum of the weight values and then normalize the weight values by dividing each weight value by the total sum. After normalizing the weight values, we will have all the values between 0 and 1 and the sum of all the values will always be equal to 1.
Let us declare a variable called dice and weights which represents the 6 sides of the die and the corresponding weight values
dice = [1, 2, 3, 4, 5, 6] weights = [20, 12, 60, 58, 33, 10]
We will now compute the sum of all weights and store it in a variable called total_sum
. We can use the in-built sum function to do this.
total_sum = sum(weights) normalized_weights = [weight/total_sum for weight in weights] print(normalized_weights)
The normalized weights have values between 0 and 1 and the sum of all the values is equal to 1
[0.10362694300518134, 0.06217616580310881, 0.31088082901554404, 0.3005181347150259, 0.17098445595854922, 0.05181347150259067]
The cumulative sum is used for monitoring change detection in a sequential data-set. Let us denote the cumulative sum by a variable called weight_cum_sum
and computing it as follows
weight_cum_sum[0] = normalized_weights[0] weight_cum_sum[1] = weight_cum_sum[0] + normalized_weights[1] weight_cum_sum[2] = weight_cum_sum[1] + normalized_weights[2] weight_cum_sum[3] = weight_cum_sum[2] + normalized_weights[3] weight_cum_sum[4] = weight_cum_sum[3] + normalized_weights[4] weight_cum_sum[5] = weight_cum_sum[4] + normalized_weights[5]
We can do this efficiently in python by running a for
loop and appending the cumulative sum values in a list
cum_sum = [normalized_weights[0]] for i in range(1, len(normalized_weights)): cum_sum.append(cum_sum[i-1] + normalized_weights[i])
If we print cum_sum
, we will get the following values
[0.10362694300518134, 0.16580310880829013, 0.47668393782383417, 0.7772020725388601, 0.9481865284974094, 1.0]
Choosing a random value
Now that we have calculated the cumulative sum of the weight values, we will now randomly choose a number between 0 and 1 from a uniform distribution. We can do this by using the uniform function from the random module in python. We will denote this number by r.
from random import uniform r = uniform(0,1)
Sampling
We will now loop through the cum_sum
array and if the value of r is less than or equal to the cum_sum
value at a particular index, then we will return the die value at that index
for index, value in enumerate(cum_sum): if r <= value: return dice[index]
You can see the entire code below
from random import uniform def proportional_sampling(dice, weights): total_sum = sum(weights) normalized_weights = [weight/total_sum for weight in weights] cum_sum = [normalized_weights[0]] r = uniform(0,1) for i in range(1, len(normalized_weights)): cum_sum.append(cum_sum[i-1] + normalized_weights[i]) for index, value in enumerate(cum_sum): if r <= value: return dice[index] dice = [1,2,3,4,5,6] weights = [20, 12, 60, 58, 33, 10] sampled_value = proportional_sampling(dice, weights)
Experimentation
We will now run an experiment where will call the proportional_sampling
100 times and analyze the result of sampling a number
dice_result = {} for i in range(0, 100): sampled_value = proportional_sampling(dice, weights) if sampled_value not in dice_result: dice_result[sampled_value] = 1 else: dice_result[sampled_value] += 1
As you can see from the above figure the probability of getting a 3 is the highest since 3 was given a weight of 60 which was the largest number in the weights array. If we run this experiment for 1000 iterations instead of 100 you can expect to get even more precise results.