Tennis Win Prediction App

WHAT

This project develops a model that predicts a tennis player’s win probability from the competing players’ performance statistics, the players’ characteristics, and details about the match. The performance statistics used are those that tournament websites usually post during a match, as shown in the example below:

HOW

  • Match statistics data was extracted from tennisabstract.com
  • Multiple models were used to extract information from different data subsets
  • Stacking was used to combine the results of the different models

DATASET

Match statistics from 1995 to 2020 were used to train and test the base models and the ensemble model. The raw dataset features fall into five categories, as shown in the features table below:

The aim is to use the model either to predict win probability in-match from mid-match statistics, or to predict it before the match from the players’ career-average performance. To define the target, the dataset is rearranged so that the two competing players in each match are labelled player 1 and player 2 according to the alphabetical order of their names. The target is then player 1 win/loss (1/0). Some feature engineering was performed to match the statistics that tournaments publish (features are combined to create new features, as marked in the table below).
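The player 1 / player 2 rearrangement described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes each raw match is stored as a winner/loser record, and the field names (`winner`, `loser`, `winner_stats`, `loser_stats`) and the example values are hypothetical.

```python
def to_player_rows(match):
    """Re-key a winner/loser match record as player 1 / player 2.

    `match` is a hypothetical dict with 'winner' and 'loser' names and
    per-player stats under 'winner_stats' and 'loser_stats'.
    """
    winner, loser = match["winner"], match["loser"]
    # Player 1 is whichever name sorts first alphabetically (case-insensitive).
    winner_is_p1 = winner.lower() <= loser.lower()
    if winner_is_p1:
        p1, p2 = winner, loser
        p1_stats, p2_stats = match["winner_stats"], match["loser_stats"]
    else:
        p1, p2 = loser, winner
        p1_stats, p2_stats = match["loser_stats"], match["winner_stats"]
    return {
        "player_1": p1,
        "player_2": p2,
        "p1_stats": p1_stats,
        "p2_stats": p2_stats,
        # Target: 1 if player 1 won the match, 0 otherwise.
        "target": 1 if winner_is_p1 else 0,
    }


# Made-up example record, for illustration only.
example = {
    "winner": "Novak Djokovic",
    "loser": "Andy Murray",
    "winner_stats": {"aces": 7},
    "loser_stats": {"aces": 4},
}
row = to_player_rows(example)
```

Because the ordering depends only on the names, the target is not tied to who won, which keeps the model from simply learning that "the first player always wins".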

To extract the most from the dataset, it is subdivided into four subsets as shown below. Each subset is used to train one base model, giving the four base models that are combined in the ensemble.

BASE MODELS

The following models were compared for each subset:

  • Random Forest
  • Extra Trees
  • Logistic Regression

Each subset of the dataset is further divided into train, base test, and ensemble test sets. The base models are trained on the train set and tested on the base test set. Each subset requires different preprocessing, so custom preprocessors are created for each subset to allow a separate pipeline per base model.
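A three-way split of this kind can be sketched as below. The split fractions, the random shuffle, and the function name `three_way_split` are assumptions made for illustration; the proportions actually used in the project are not stated here.

```python
import random


def three_way_split(rows, base_test_frac=0.2, ensemble_test_frac=0.2, seed=42):
    """Shuffle rows and split them into train, base test and ensemble test sets.

    The fractions are illustrative assumptions, not the project's values.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # seeded so the split is reproducible
    n_base = int(round(len(rows) * base_test_frac))
    n_ens = int(round(len(rows) * ensemble_test_frac))
    base_test = rows[:n_base]
    ensemble_test = rows[n_base:n_base + n_ens]
    train = rows[n_base + n_ens:]
    return train, base_test, ensemble_test


# Integers stand in for match records here.
train, base_test, ensemble_test = three_way_split(range(100))
```

The three sets are disjoint, so the ensemble test set stays unseen by both the base models and the ensemble until the final evaluation.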

Below are the best-performing models for each subset of the data and their respective performance metrics.

ENSEMBLE MODEL

Results from the base test set will be used to train the ensemble model as shown in the diagram below.

A voting classifier is used for the ensemble model. I wanted to use my pre-trained base models in the ensemble, but I found that scikit-learn’s ensemble models do not support pre-trained base models. To implement the final ensemble, I opted for the mlxtend library’s ensemble voting classifier. Soft voting was used with voting weights of 40, 20, 20, and 40.
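To make the weighted soft vote concrete, here is a minimal pure-Python sketch of the calculation. It is not the library's implementation: the function name `soft_vote` and the base-model probabilities are made up for illustration, and only the weights come from the text above.

```python
def soft_vote(probabilities, weights):
    """Weighted average of each base model's P(player 1 wins)."""
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per base model")
    total = sum(weights)
    return sum(p * w for p, w in zip(probabilities, weights)) / total


WEIGHTS = [40, 20, 20, 40]  # voting weights from the text

# Hypothetical outputs of the four base models for a single match.
base_probs = [0.80, 0.55, 0.60, 0.70]

p_win = soft_vote(base_probs, WEIGHTS)
# With two classes, picking the higher averaged probability is the same
# as thresholding P(player 1 wins) at 0.5.
prediction = 1 if p_win >= 0.5 else 0
```

Since the weights are divided by their sum, only their ratio matters: 40, 20, 20, 40 behaves the same as 2, 1, 1, 2.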

The final ensemble model was tested on the ensemble test set and scored 91.4%, about 0.3% higher than the best base model. Precision and recall are both 0.91, and AUC is 0.98.
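For reference, the sketch below shows how accuracy, precision, and recall are computed from binary predictions. It assumes the 91.4% score is accuracy, which the text does not state, and the labels and predictions are made-up values, not the project's results.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels (1 = player 1 wins)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted/present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Illustrative labels and predictions only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```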

DEPLOYING THE MODEL

The model is deployed using streamlit.io.

LINKS

Code for the app is available on my GitHub at the following: link.

The dataset was also used to create a tennis dashboard in Tableau. The dashboard is available at the following: link.