**Problem**: How to calculate the entropy with the SciPy library?

**Solution**: Import the `entropy()` function from the `scipy.stats` module and pass the probabilities and the base of the logarithm into it.

```python
from scipy.stats import entropy

p = [0.5, 0.25, 0.125, 0.125]
e = entropy(p, base=2)

print(e)
# 1.75
```


**Exercise**: Change the probabilities. How does the entropy change?

Let’s start slowly! You’re going to learn the most relevant background about entropy next.

## Entropy Introduction

In thermodynamics, entropy is explained as a **state of uncertainty** or randomness.

In statistics, we borrow this concept as it easily applies to calculating probabilities.

When we calculate **statistical entropy**, we are quantifying the amount of information in an event, variable, or distribution. Understanding this measurement is useful in machine learning in many cases, such as building decision trees or choosing the best classifier model.

We will discuss the applications of entropy later in this article, but first we will dig into the theory of entropy and how to calculate it with the use of SciPy.

## Calculating the Entropy

The method for calculating the information of a variable was developed by **Claude Shannon**, whose approach answers the question: how many "yes" or "no" questions would you expect to ask to get the correct answer?

Consider flipping a coin. Assuming the coin is fair, you have a 1 in 2 chance of predicting the outcome. You would guess either heads or tails, and whether you are correct or incorrect, you need just one question to determine the outcome.

Now, say we have a bag with four equally sized disks, each a different color: Blue, Red, Green, and Gray.

To guess which disk has been drawn from the bag, one of the better strategies is to eliminate half of the colors. For example, start by asking if it is Blue or Red. If the answer is yes, then only one more question is required since the answer must be Blue or Red. If the answer is no, then you can assume it is Green or Gray, so only one more question is needed to correctly predict the outcome. Either way, the total is two questions, regardless of which color was drawn.

We can see that when an event is less likely to occur, choosing 1 in 4 compared to 1 in 2, there is more information to learn, i.e., two questions needed versus one.

Shannon wrote his calculation this way:

Information(x) = -log(p(x))

In this formula, `log()` is a base-2 logarithm (because the answer to each question is either true or false), and `p(x)` is the probability of `x`.

The higher the information value, the less predictable the outcome.

When a probability is certain (e.g., a two-headed coin flip coming up heads), the probability is 1.0, which yields an information calculation of 0.

We can run Shannon's calculation in Python using the `math` library. For the fair coin, `p = 0.5` yields an information value of 1.0, matching the single question needed.

When we change the probability to 0.25, as in the case of choosing the correct color of the disk, the result is 2.0, matching the two questions needed in the disk example.
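The calculation described above can be sketched with the standard `math` module. The helper name `information` is an assumption made for illustration; it is not part of any library.

```python
import math

def information(p):
    """Shannon information, in bits, of an event with probability p."""
    # Equivalent to -log2(p); written as log2(1 / p) so that a certain
    # event (p = 1.0) prints 0.0 rather than -0.0.
    return math.log2(1 / p)

print(information(0.5))   # fair coin flip -> 1.0
print(information(0.25))  # one of four disks -> 2.0
print(information(1.0))   # certain event -> 0.0
```

The three calls correspond to the coin flip, the four-disk draw, and the two-headed coin discussed above.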

While it appears that the increase in information is linear, what happens when we calculate the roll of a single die, or ask someone to guess a number between 1 and 10? Here is a visual of the information calculations for a list of probabilities from less certain (`p = 0.1`) to certain (`p = 1.0`):

The graph shows that information does not grow linearly with uncertainty. It follows a logarithmic curve: each halving of the probability adds exactly one more question.
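The values behind such a graph can be reproduced with a short loop. This is a minimal sketch under the assumption that the probabilities run from 0.1 to 1.0 in steps of 0.1; the variable names are chosen for illustration.

```python
import math

# Probabilities from less certain (0.1) to certain (1.0).
probabilities = [round(0.1 * k, 1) for k in range(1, 11)]

# Information in bits for each probability: log2(1 / p) == -log2(p).
info = [math.log2(1 / p) for p in probabilities]

for p, bits in zip(probabilities, info):
    print(f"p = {p:.1f} -> {bits:.3f} bits")
```

Guessing a number between 1 and 10 corresponds to `p = 0.1`, and rolling a single die to `p = 1/6`, which `math.log2(6)` puts at roughly 2.585 bits.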

**Unequal Probabilities**

Going back to the colored disks example, what if we now have 8 disks in the bag, and they are not equally distributed? Look at this breakdown by color:

| Color | Quantity |
| --- | --- |
| Blue | 1 |
| Green | 1 |
| Red | 2 |
| Gray | 4 |
| Total | 8 |

If we use the original strategy of eliminating half of the colors by asking if the disk is Blue or Green, we become less efficient since there is only a combined 0.25 probability of either color being correct in this scenario.

We know that Gray has the highest probability. Using a slightly different strategy, we first ask if Gray is correct (1st question), then move on to the next highest probability, Red (2nd question), and then check if it is Blue or Green (3rd question).

In this new scenario, weighting our guesses will lead to less information required. The tables below show the comparison of the two methods. The info column is the product of the Probability and Questions columns.

**Equal Guesses**

| Color | Prob | Q’s | Info |
| --- | --- | --- | --- |
| Blue | 0.25 | 2 | 0.50 |
| Green | 0.25 | 2 | 0.50 |
| Red | 0.25 | 2 | 0.50 |
| Gray | 0.25 | 2 | 0.50 |
| Total | 1 | 8 | 2.00 |

**Weighted Guesses**

| Color | Prob | Q’s | Info |
| --- | --- | --- | --- |
| Blue | 0.125 | 3 | 0.375 |
| Green | 0.125 | 3 | 0.375 |
| Red | 0.25 | 2 | 0.50 |
| Gray | 0.5 | 1 | 0.50 |
| Total | 1 | 9 | 1.75 |

The Equal guess method takes an average of 2 questions, but the weighted guess method takes an average of 1.75.
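The two averages can be checked with a few lines of plain Python. This sketch simply mirrors the rows of the two tables; the function name `expected_questions` is an assumption for illustration.

```python
# Each row mirrors the tables above: (color, probability, questions needed).
equal_guesses = [
    ("Blue", 0.25, 2),
    ("Green", 0.25, 2),
    ("Red", 0.25, 2),
    ("Gray", 0.25, 2),
]
weighted_guesses = [
    ("Blue", 0.125, 3),
    ("Green", 0.125, 3),
    ("Red", 0.25, 2),
    ("Gray", 0.5, 1),
]

def expected_questions(rows):
    """Probability-weighted average number of questions (the Info total)."""
    return sum(prob * questions for _, prob, questions in rows)

print(expected_questions(equal_guesses))     # 2.0
print(expected_questions(weighted_guesses))  # 1.75
```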

We can use the SciPy library to perform the entropy calculation. SciPy's `stats` sub-library has an `entropy()` function that we can use. For the scenario where the four disks have different probabilities, the call is `entropy([0.5, 0.25, 0.125, 0.125], base=2)`, which returns 1.75.

The `entropy()` function takes two arguments here: the list of probabilities and the base of the logarithm. `base=2` is the choice here since we are using a binary log for the calculation. If the base is omitted, SciPy uses the natural logarithm instead.
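As a small sketch of how the `base` argument changes the result, the following compares base 2 with SciPy's default natural logarithm. It also passes the raw disk counts from the Quantity table, relying on `scipy.stats.entropy` normalizing its input when it does not sum to 1. The variable names are chosen for illustration.

```python
import math
from scipy.stats import entropy

probabilities = [0.5, 0.25, 0.125, 0.125]  # Gray, Red, Blue, Green
counts = [4, 2, 1, 1]                      # the same disks as raw quantities

bits = entropy(probabilities, base=2)  # entropy in bits
nats = entropy(probabilities)          # default: natural logarithm
from_counts = entropy(counts, base=2)  # counts are normalized internally

print(bits)         # approximately 1.75
print(nats)         # approximately 1.75 * ln(2)
print(from_counts)  # approximately 1.75
```

Dividing the natural-log result by `math.log(2)` converts it back to bits.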

We get the same result as in the table shown above. With minimal code, the SciPy library allows us to quickly calculate Shannon's entropy.

## Further Uses

Entropy calculations are used successfully in real-world machine learning applications. Here are some examples.

### Decision Trees

A Decision Tree is based on a set of binary decisions (True or False, Yes or No). It is constructed from a series of nodes where each node is a question: Does color == blue? Is the test score > 90? Each node splits into two, and the data decomposes into smaller and smaller subsets as you move through the tree.

A Decision Tree is most accurate when each split reduces uncertainty as much as possible, and entropy is a good criterion for measuring this. At each candidate split, entropy is calculated before and after the split. If the entropy decreases, the split is worthwhile, and the size of the decrease is known as the information gain. Otherwise, you must try another split.
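The before-and-after comparison can be sketched in plain Python. This is an illustrative sketch, not how any particular library builds its trees; the helper names and the toy "yes"/"no" labels are assumptions.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    # (c / total) * log2(total / c) equals -p * log2(p) with p = c / total.
    return sum((c / total) * math.log2(total / c)
               for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(parent)
    after = (len(left) / n) * entropy_of_labels(left) \
          + (len(right) / n) * entropy_of_labels(right)
    return entropy_of_labels(parent) - after

# Toy data: 4 "yes" and 4 "no" labels, split into two branches.
parent = ["yes"] * 4 + ["no"] * 4
left = ["yes"] * 3 + ["no"]
right = ["yes"] + ["no"] * 3

gain = information_gain(parent, left, right)
perfect_gain = information_gain(parent, ["yes"] * 4, ["no"] * 4)
print(round(gain, 4))
print(perfect_gain)
```

A split that separates the classes completely removes all uncertainty, so its gain equals the parent entropy; a partial separation, like the first split here, yields a smaller positive gain.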

### Classification with Logistic Regression

The key to a logistic regression is minimizing the loss or error for the best model fit. Cross-entropy, a loss built on the same entropy formula, is the standard loss function for logistic regression and for neural network classifiers.
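To make the loss concrete, here is a minimal sketch of binary cross-entropy (log loss) in plain Python. The function name, the clipping constant `eps`, and the example predictions are assumptions for illustration; real libraries provide their own optimized versions.

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Average log loss for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        # Clip predictions away from 0 and 1 so that log() stays finite.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.9]  # predictions close to the true labels
unsure = [0.6, 0.4, 0.5, 0.6]     # predictions close to 0.5

loss_confident = binary_cross_entropy(labels, confident)
loss_unsure = binary_cross_entropy(labels, unsure)
print(loss_confident)
print(loss_unsure)
```

The confident, correct predictions produce the lower loss, which is the behavior a model fit tries to achieve by minimizing this quantity.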

### Code Sample

There are several choices for using entropy as your loss function in machine learning, and the selection is typically made during model compilation. In Keras, for example, the loss is passed as the `loss` argument of `model.compile()`, e.g. `loss='binary_crossentropy'`.

## Conclusion

The purpose of this article was to shed some light on the use of entropy with Machine Learning and how it can be calculated with Python.