Probability and Statistics play a very important role in the field of data science and machine learning. In this blog post you will learn the concept of **proportional sampling** and how can we implement it from scratch **without using any library**

## Proportional Sampling

Let us take an example of tossing a die to better understand the concept of proportional sampling. An unbiased die is a die in which the probability of getting a number between 1 and 6 is equal. Let us now imagine that the die is biased i.e a weight value is given to each side of the die.

1 | 2 | 3 | 4 | 5 | 6 |

**Fig 1**: Die represented in the form of an array

20 | 12 | 60 | 58 | 33 | 10 |

**Fig 2**: Weight values given to the different sides of a die

Proportional sampling is a technique in which the probability of selecting a number is proportional to the weight of that number. So, for instance if we run an experiment of tossing a die 100 times, then the probability of getting a 6 would be the lowest since the weight value of the side 6 is 10 which is the lowest amongst all other weight values. On the other hand, the probability of getting a 4 would be the highest since the weight value for 3 is 60 which is the highest amongst all other values.

There are 3 essentials steps to proportionally sample a number from a list.

- Computing the cumulative normalized sum values
- Choosing a random value from uniform distribution
- Sampling a value

## Cumulative Normalized Sum

In order to compute the cumulative normalized sum value we first need to calculate the total sum of the weight values and then normalize the weight values by dividing each weight value by the total sum. After normalizing the weight values, we will have all the values between 0 and 1 and the sum of all the values will always be equal to 1.

Let us declare a variable called dice and weights which represents the 6 sides of the die and the corresponding weight values

dice = [1, 2, 3, 4, 5, 6] weights = [20, 12, 60, 58, 33, 10]

We will now compute the sum of all weights and store it in a variable called `total_sum`

. We can use the in-built sum function to do this.

total_sum = sum(weights) normalized_weights = [weight/total_sum for weight in weights] print(normalized_weights)

The normalized weights have values between 0 and 1 and the sum of all the values is equal to 1

[0.10362694300518134, 0.06217616580310881, 0.31088082901554404, 0.3005181347150259, 0.17098445595854922, 0.05181347150259067]

The cumulative sum is used for monitoring change detection in a sequential data-set. Let us denote the cumulative sum by a variable called `weight_cum_sum`

and computing it as follows

weight_cum_sum[0] = normalized_weights[0] weight_cum_sum[1] = weight_cum_sum[0] + normalized_weights[1] weight_cum_sum[2] = weight_cum_sum[1] + normalized_weights[2] weight_cum_sum[3] = weight_cum_sum[2] + normalized_weights[3] weight_cum_sum[4] = weight_cum_sum[3] + normalized_weights[4] weight_cum_sum[5] = weight_cum_sum[4] + normalized_weights[5]

We can do this efficiently in python by running a `for `

loop and appending the cumulative sum values in a list

cum_sum = [normalized_weights[0]] for i in range(1, len(normalized_weights)): cum_sum.append(cum_sum[i-1] + normalized_weights[i])

If we print `cum_sum`

, we will get the following values

[0.10362694300518134, 0.16580310880829013, 0.47668393782383417, 0.7772020725388601, 0.9481865284974094, 1.0]

## Choosing a random value

Now that we have calculated the cumulative sum of the weight values, we will now randomly choose a number between 0 and 1 from a uniform distribution. We can do this by using the uniform function from the random module in python. We will denote this number by r.

from random import uniform r = uniform(0,1)

## Sampling

We will now loop through the `cum_sum`

array and if the value of r is less than or equal to the `cum_sum`

value at a particular index, then we will return the die value at that index

for index, value in enumerate(cum_sum): if r <= value: return dice[index]

You can see the entire code below

from random import uniform def proportional_sampling(dice, weights): total_sum = sum(weights) normalized_weights = [weight/total_sum for weight in weights] cum_sum = [normalized_weights[0]] r = uniform(0,1) for i in range(1, len(normalized_weights)): cum_sum.append(cum_sum[i-1] + normalized_weights[i]) for index, value in enumerate(cum_sum): if r <= value: return dice[index] dice = [1,2,3,4,5,6] weights = [20, 12, 60, 58, 33, 10] sampled_value = proportional_sampling(dice, weights)

## Experimentation

We will now run an experiment where will call the `proportional_sampling`

100 times and analyze the result of sampling a number

dice_result = {} for i in range(0, 100): sampled_value = proportional_sampling(dice, weights) if sampled_value not in dice_result: dice_result[sampled_value] = 1 else: dice_result[sampled_value] += 1

As you can see from the above figure the probability of getting a 3 is the highest since 3 was given a weight of 60 which was the largest number in the weights array. If we run this experiment for 1000 iterations instead of 100 you can expect to get even more precise results.