This tutorial shows you how I created a model to predict football results using the Poisson distribution. You’ll learn how I designed an interactive dashboard on Streamlit where users can select the teams and see the probabilities of a home win, draw, or away win.
Here’s a live demo of using the app to predict different games, such as Arsenal vs. Southampton:
The purpose of this tutorial is purely educational, to introduce you to some concepts in Python. Using this app for anything else, for example to compare bookmakers’ odds and place a stake, is entirely at your own risk.
We will be predicting the English Premier League, as it’s widely regarded as the most-watched football league in the world.
Poisson Distribution
Speaking in a football context, how likely will a match result in a win or draw within 90 minutes of gameplay? If it’s to result in a win, what are the chances of a team scoring 3 goals with a clean sheet?
That is exactly the kind of question a Poisson distribution helps to answer.
ℹ️ Info: A Poisson distribution is a type of probability distribution that helps to calculate the chance of a certain number of events happening in a given space or time period. It considers the average rate of these events and assumes they are independent of each other.
So, here are our assumptions:
- Events occur independently of each other. This means that if Tottenham were to pack the box, it would not prevent Manchester City from scoring against them in a match.
- Two events cannot occur at exactly the same time. This means that if Chelsea were to score a goal, it would not result in an instant equalizer.
- The number of events occurring in a given time interval can be counted. This means we can precisely say that Liverpool will commit a painful mistake that will gift their rival the trophy.
As the examples above show, these assumptions do not always hold in real life, which can make the Poisson distribution seem to offer little of use. Despite its inherent limitations, we can still draw insight from this model and see whether it can form a basis for further research into predictive football models.
Sparing you the theory and mathematical formulas, let’s get down to business and see how we can implement the Poisson distribution using Python.
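For readers who want one concrete number before moving on, here is a minimal sketch of the question asked earlier: how likely is a 3-0 scoreline? The two scoring rates are hypothetical values I picked for illustration, not figures from the dataset, and poisson_pmf is a helper written for this sketch.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the average number of events is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# Hypothetical scoring rates (goals per match), chosen only for illustration.
home_rate = 1.6
away_rate = 1.2

# Chance that the home side scores exactly 3 and the away side scores 0,
# treating the two goal counts as independent.
p_three_nil = poisson_pmf(3, home_rate) * poisson_pmf(0, away_rate)
print(f"P(3-0 home win) = {p_three_nil:.4f}")
```

Multiplying the two probabilities is only valid under the independence assumption listed above.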
The Dataset
We will import match results from the English Premier League (EPL). There are various sources for this data, such as Kaggle [1], GitHub [2], and football APIs [3], but we will source our data from football-data.co.uk [4].
⚽ At the time of writing, the EPL season is at its halfway point, and it is more interesting now than when it started. Arsenal’s dramatic resurgence means they are seen by many as favorites to win the crown. Manchester City are in hot pursuit, especially with the arrival of Erling Haaland. Newcastle have become a surprising contender for the title.
On the other hand, Chelsea are nowhere to be found in the Champions League places, and neither are Liverpool. This shows that football is unpredictable. Hence, using the past to predict the future may not yield the expected results.
Furthermore, some Premier League clubs have undergone dramatic changes, from changes of ownership to managerial changes to the transfer of players in and out of the competition. All of this makes football prediction very difficult.
For these and other reasons, I used only the data from the current season to train the model.
import pandas as pd

data = pd.read_csv('https://www.football-data.co.uk/mmz4281/2223/E0.csv')
print(data.shape)
# (199, 106)
We will not save the data. Instead, we read it from the URL each time, so that the predictions are always based on the latest results. The data has 106 columns, but we are only interested in 4 of them.
Let’s select and rename them.
epl = data[['HomeTeam', 'AwayTeam', 'FTHG', 'FTAG']]
epl = epl.rename(columns={'FTHG': 'HomeGoals', 'FTAG': 'AwayGoals'})
print(epl.head())
Output:
HomeTeam AwayTeam HomeGoals AwayGoals
0 Crystal Palace Arsenal 0 2
1 Fulham Liverpool 2 2
2 Bournemouth Aston Villa 2 0
3 Leeds Wolves 2 1
4 Newcastle Nott'm Forest 2 0
We want to compare our predictions with live results. So, we will reserve the last 20 rows representing two game weeks. Then we see if we can draw insights from the home and away goals.
# hold out the last 20 matches (two game weeks) for comparison
test = epl[-20:]
epl = epl[:-20]
print(epl[['HomeGoals', 'AwayGoals']].mean())
Output:
HomeGoals 1.631285
AwayGoals 1.217877
dtype: float64
We now have 179 rows and 4 columns. You can see that, on average, the home team scores more goals than the away team but only by a small margin.
This information is vital. If an event follows a Poisson distribution, the mean, also known as lambda, is the only thing we need to know to find the probability of that event occurring a certain number of times.
A Skellam distribution describes the difference between two independent Poisson-distributed counts (home goals minus away goals in our case), and it is defined by the two Poisson means.
We can then calculate the probability mass function (PMF) for a skellam distribution using the mean goals to determine the probability of a draw or a win between home and away teams.
from scipy.stats import skellam, poisson

# probability of a draw
skellam.pmf(0.0, epl.HomeGoals.mean(), epl.AwayGoals.mean())
# Output: 0.24434197359198495

# probability of a home win by one goal
skellam.pmf(1.0, epl.HomeGoals.mean(), epl.AwayGoals.mean())
# Output: 0.22500333061251618
The result shows that the probability of a draw in the EPL is about 24%, while the probability of a home win by exactly one goal is about 22.5%. Remember, this combines all the matches. We will then follow this process to model specific matches.
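To see where the Skellam figures come from, here is a small sketch that rebuilds the draw probability from two Poisson distributions and also reads off the overall home-win and away-win probabilities. It uses the rounded league means printed above; the cutoff of 20 goals in the sum is my own choice and is simply large enough for the remaining terms to be negligible.

```python
from scipy.stats import poisson, skellam

# League-wide mean goals, rounded, taken from the output printed earlier.
home_mean = 1.631285
away_mean = 1.217877

# A draw means both sides score the same number of goals k, so we add up
# P(home = k) * P(away = k) over k = 0..20.
draw_from_poisson = sum(
    poisson.pmf(k, home_mean) * poisson.pmf(k, away_mean) for k in range(21)
)
draw_from_skellam = skellam.pmf(0, home_mean, away_mean)

# The Skellam variable is the goal difference: home goals minus away goals.
away_win = skellam.cdf(-1, home_mean, away_mean)  # difference <= -1
home_win = skellam.sf(0, home_mean, away_mean)    # difference >= 1

print(draw_from_poisson, draw_from_skellam)
print(home_win, draw_from_skellam, away_win)
```

The two draw values should agree to many decimal places, and the three outcome probabilities should add up to 1.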
Data Preparation
Before we begin building the model, let’s first prepare our data, making it suitable for modeling.
# one row per team per match, seen from the home side...
home = epl.iloc[:, 0:3].assign(home=1).rename(
    columns={'HomeTeam': 'team', 'AwayTeam': 'opponent', 'HomeGoals': 'goals'})
# ...and from the away side
away = epl.iloc[:, [1, 0, 3]].assign(home=0).rename(
    columns={'AwayTeam': 'team', 'HomeTeam': 'opponent', 'AwayGoals': 'goals'})
df = pd.concat([home, away])
print(df)
Output:
team opponent goals home
0 Crystal Palace Arsenal 0 1
1 Fulham Liverpool 2 1
2 Bournemouth Aston Villa 2 1
3 Leeds Wolves 2 1
4 Newcastle Nott'm Forest 2 1
.. ... ... ... ...
174 Tottenham Crystal Palace 4 0
175 Man City Chelsea 1 0
176 Chelsea Fulham 1 0
177 Leeds Aston Villa 1 0
178 Man City Man United 1 0
[358 rows x 4 columns]
We wanted each row to describe one team in one match, whether that team played at home or away.
So, we selected the home and away columns separately, gave them the same names, and then concatenated them.
To differentiate away goals from home goals, we created a column and assigned 1 to represent home goals and 0 for away goals. Our data is now suitable for modeling.
The Generalized Linear Model
Generalized linear models are a family of models that includes the linear regression and logistic regression models we use in machine learning. They are used to model different types of data. Poisson regression, which is part of this family, is used to analyze count data.
Remember, we are dealing with count data, for example the number of goals per match. Since count data is commonly modeled with a Poisson distribution, we will be using Poisson regression to build our model.
import statsmodels.api as sm
import statsmodels.formula.api as smf

formula = 'goals ~ team + opponent + home'
model = smf.glm(formula=formula, data=df, family=sm.families.Poisson()).fit()
print(model.summary())
We imported the statsmodels library to help us build the model.
The formula to predict the number of goals is defined as the combination of the team, the opponent, and whether the goals are home or away goals. Take a look at the summary. The result of the Generalized Linear Model contains so much that we cannot explain all of it in this article.
But let’s focus on the coef column.
The team rows describe how strongly a side attacks, the opponent rows describe how many goals a side tends to concede, and the home row captures home advantage. Each team and opponent value is measured relative to a reference club that is left out of the table (with this formula, the first club alphabetically). A positive team value means the side scores more than the reference club, so it has a strong attacking ability. A negative value indicates a not-so-strong attacking ability, and a value close to 0 means the side scores at about the same rate as the reference club. A positive opponent value means that teams tend to score more against that club.
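Poisson regression in statsmodels uses a log link by default, so the coefficients are added together and then exponentiated to give expected goals. The sketch below shows that arithmetic. All four coefficient values are made up for illustration; the real ones can be read from model.params after fitting.

```python
import math

# Hypothetical coefficients, invented for this example only.
intercept = 0.30
team_coef = -0.20      # attacking strength relative to the reference club
opponent_coef = 0.15   # how many goals the opponent concedes relative to the reference club
home_coef = 0.25       # home advantage

# With a log link, expected goals = exp(sum of the relevant coefficients).
expected_goals_at_home = math.exp(intercept + team_coef + opponent_coef + home_coef)
expected_goals_away = math.exp(intercept + team_coef + opponent_coef)

# exp(coefficient) is the multiplicative effect of that term on expected goals.
home_multiplier = math.exp(home_coef)

print(expected_goals_at_home, expected_goals_away, home_multiplier)
```

Because the effects are multiplicative, the same team against the same opponent is expected to score exp(home_coef) times as many goals at home as away.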
Having trained the model, we can now use it to make predictions. Let’s create a function to do so.
import numpy as np

def predict_match(model, homeTeam, awayTeam, max_goals=10):
    # expected goals for the home side
    home_goals = model.predict(pd.DataFrame(data={'team': homeTeam, 'opponent': awayTeam, 'home': 1}, index=[1])).values[0]
    # expected goals for the away side
    away_goals = model.predict(pd.DataFrame(data={'team': awayTeam, 'opponent': homeTeam, 'home': 0}, index=[1])).values[0]
    # Poisson probability of each goal count from 0 to max_goals, for both sides
    pred = [[poisson.pmf(i, team_avg) for i in range(0, max_goals + 1)] for team_avg in [home_goals, away_goals]]
    # rows are home goals, columns are away goals
    return np.outer(np.array(pred[0]), np.array(pred[1]))
The function has four parameters:
- the Poisson model to be used to make the predictions,
- the home team,
- the away team, and
- the maximum number of goals.
We set the default to 10, which is more than a team is realistically likely to score within 90 minutes of gameplay. Remember, the formula combines all these to predict the number of goals.
The function first asks the model for the expected number of home goals and away goals. Then, for each of those two averages, it loops over every goal count from 0 to the maximum.
In each iteration, we calculate the probability mass function of the Poisson distribution. This tells us the probability of a team scoring exactly that number of goals. Finally, the function takes the outer product of the two sets of probabilities and returns the resulting matrix.
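If the outer product is unfamiliar, this small sketch shows what it does. The probabilities here are made up and cover only 0, 1, or 2 goals, so that the whole grid is easy to read.

```python
import numpy as np

# Made-up probabilities of scoring 0, 1, or 2 goals (each list sums to 1).
home_probs = np.array([0.5, 0.3, 0.2])
away_probs = np.array([0.6, 0.3, 0.1])

# score_grid[i, j] = P(home scores i) * P(away scores j)
score_grid = np.outer(home_probs, away_probs)

print(score_grid)
print(score_grid[2, 1])  # probability of a 2-1 home win
```

Each cell is the probability of one exact scoreline, which is why the rows can be read as home goals and the columns as away goals.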
Let’s assume Arsenal and Manchester City are to face each other at the Emirates Stadium and you want to make the prediction.
print(model.predict(pd.DataFrame(data={'team': 'Arsenal', 'opponent': 'Man City', 'home':1}, index=[1])))
Output:
1    2.026391
dtype: float64
The model is predicting Arsenal to score two goals…
print(model.predict(pd.DataFrame(data={'team': 'Man City', 'opponent': 'Arsenal', 'home':0}, index=[1])))
Output:
1 1.284658
dtype: float64
… and Manchester City to score about 1.28 goals, for a total of roughly 3 goals in the match.
The model roughly predicts a 2-1 home win for Arsenal.
Now that the three members of the formula are complete, we can feed them to the predict_match() function to get the odds of a home win, away win, and a draw.
ars_man = predict_match(model, 'Arsenal', 'Man City', max_goals=3)
Result:
array([[0.03647786, 0.04686159, 0.03010057, 0.01288965],
       [0.07391843, 0.09495992, 0.06099553, 0.02611947],
       [0.07489383, 0.09621298, 0.06180041, 0.02646414],
       [0.05058807, 0.06498838, 0.04174394, 0.01787557]])
The rows and columns represent the chances of Arsenal and Manchester City, respectively, scoring a particular number of goals (0 to 3).
The diagonal entries represent a draw, since that is where both teams score the same number of goals. Below the diagonal (the lower triangle of the array, found using numpy.tril) is Arsenal’s victory, and above it (the upper triangle of the array, found using numpy.triu) is Man City’s.
Let’s automate this with Python.
import numpy as np

# victory for Arsenal
np.sum(np.tril(ars_man, -1)) * 100
# 40.23456259724963

# victory for Man City
np.sum(np.triu(ars_man, 1)) * 100
# 20.34309498981432

# a draw
np.sum(np.diag(ars_man)) * 100
# 21.111376045176485
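Our model tells us that Arsenal has a 40% chance of winning, about twice Man City’s 20%, with a 21% chance of a draw. These figures add up to about 82% rather than 100% because max_goals=3 leaves out every scoreline in which either team scores four or more goals. The earlier prediction of a 2-1 home win is consistent with these probabilities.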
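The three sums can be wrapped in a small helper. The sketch below reuses the matrix printed above. The function name outcome_probabilities and the optional rescaling step are my own additions for illustration and are not part of the original script; calling predict_match() with its default max_goals=10 is the simpler way to get figures that sum to almost exactly 100%.

```python
import numpy as np


def outcome_probabilities(score_grid, normalize=False):
    """Return (home win, draw, away win) from a grid whose rows are home goals."""
    grid = np.asarray(score_grid, dtype=float)
    home_win = np.tril(grid, -1).sum()  # home goals > away goals
    draw = np.trace(grid)               # equal goals
    away_win = np.triu(grid, 1).sum()   # away goals > home goals
    if normalize:
        # Rescale so the three outcomes sum to 1 despite the truncated grid.
        total = home_win + draw + away_win
        return home_win / total, draw / total, away_win / total
    return home_win, draw, away_win


# The Arsenal vs. Man City matrix shown earlier (max_goals=3).
ars_man = np.array([
    [0.03647786, 0.04686159, 0.03010057, 0.01288965],
    [0.07391843, 0.09495992, 0.06099553, 0.02611947],
    [0.07489383, 0.09621298, 0.06180041, 0.02646414],
    [0.05058807, 0.06498838, 0.04174394, 0.01787557],
])

print(outcome_probabilities(ars_man))
print(outcome_probabilities(ars_man, normalize=True))
```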
Feel free to compare your predictions with the test data and see how close they come to the actual results. We can now proceed to create a football prediction app on Streamlit.
Check my GitHub page to see the full script.
Check out the live demo app to play with it!
Streamlit Dashboard
In the file named app.py, you will see how I used st.sidebar.selectbox to display a list of all the clubs in the Premier League. This will appear on the left-hand side. Since the names of the clubs appear twice, I made sure that only one was selected for prediction.
The rest of the code has been explained. If the button is pressed, the get_scores() function is executed and displays the prediction results.
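For readers who have not opened app.py, here is a rough sketch of how such a sidebar could be wired up. It is not the code from my repository: unique_teams and render_app are hypothetical helpers, the labels are invented, and I am assuming get_scores takes the home and away team names, which may differ from the real function.

```python
def unique_teams(home_teams, away_teams):
    """Return each club once, sorted, from the home and away columns."""
    return sorted(set(home_teams) | set(away_teams))


def render_app(teams, get_scores):
    """Draw the sidebar and show a prediction when the button is pressed."""
    # Imported here so the helper above can be used without Streamlit installed.
    import streamlit as st

    st.title("Football match predictor")
    home = st.sidebar.selectbox("Home team", teams)
    # Offer only the remaining clubs so a team cannot face itself.
    away = st.sidebar.selectbox("Away team", [t for t in teams if t != home])
    if st.sidebar.button("Predict"):
        # get_scores is assumed to accept (home, away) and return something displayable.
        st.write(get_scores(home, away))
```

In an actual app you would call render_app(unique_teams(epl.HomeTeam, epl.AwayTeam), get_scores) at the bottom of the script and start it with streamlit run app.py.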
Notice that I didn’t save the dataset.
Whenever the app is opened, it downloads the latest results and retrains the model before making the next prediction. Also, since not all of the code is wrapped in functions, the order of the statements matters.
That is why the get_scores() function is called last. Of course, there are many ways to write the code and get the same result.
A Word of Caution
I clarified to you from the beginning that this article is for educational purposes only and should not be used for anything else.
Many things that the model does not take into account can affect the result of a match: a change of manager, injuries, refereeing decisions, player fitness, team morale, and weather conditions, on top of the limitations of the Poisson distribution used to make these predictions.
Of course, no model is perfect. So, use responsibly.
Prediction Result
I deployed the app on Streamlit Cloud and tried to predict upcoming matches in the English Premier League.
The results were amazing. You can give it a try. I don’t expect the Premier League clubs to match the predicted scores exactly, since a predicted result is not always the same as the actual result. But I would rate the model as performing well if some, if not all, of the home wins, draws, and away wins are predicted correctly.
Conclusion
We have learned a lot today, ranging from data manipulation to model building.
You learned how to make football predictions using the Poisson distribution. I did my best to keep the explanation simple by leaving out the mathematical theory and calculations. If you want to know more, you have the internet at your disposal. Alright, have a nice day.