Predicting the NCAA Tournament in R

Win your pool and impress your friends with analytics!

Welcome to the “Illumined Insights” newsletter! Thank you so much for subscribing. This weekly newsletter touches on all things analytics and data science with a focus on areas such as data visualization and sports analytics. We take a quick break from our series on building Shiny applications in R to focus on building a predictive model for the 2023 Men’s NCAA basketball tournament. We’ll be back with Part 3 of the Shiny series soon. Thanks again!

Stephen Hill, Ph.D.

This week’s edition of the newsletter is being published on one of the greatest days on the American sports calendar: the beginning of the NCAA men’s basketball tournament (i.e., March Madness). I suppose you could be technical and say that the tournament really began on Tuesday with two games from the “First Four” (a more flattering label for what are really just play-in games), but today is the “real” first day as far as I’m concerned.

Today (Thursday, March 16th), 16 teams will have their dreams of a national championship shattered. Another 16 will go home on Friday. The tournament will culminate on April 3rd with the crowning of a national champion. Over these three weeks of games, much of the nation will be wrapped up in the tournament, watching their predictions go up in smoke or celebrating victory in one of the many tournament pools. In a tournament pool, participants fill out a bracket, selecting the winner of each game. Points are awarded (under various scoring structures) for correctly picking winners, and pool winners take home anything from bragging rights to significant prizes.

Last year, I participated in Kaggle’s “March Machine Learning Mania” competition. The goal of this competition is not just to build a bracket of predictions, but to build a model that can predict the probability of any team winning any possible match-up in the tournament. I didn’t have a lot of time on my hands, so I spent a few afternoons in March 2022 building a couple of basic models. I submitted my predictions and sat back and waited for the tournament to unfold. I ended up placing 134th out of 930 competitors (see below). I felt pretty good about this result and looked forward to entering the competition again this year.

2022 Kaggle March Machine Learning Mania result

However, life snuck up on me. It’s been a busy Spring semester, and procrastination set in. I had planned to enhance last year’s model, but I ended up reusing it as-is (updated with data from the 2022 tournament and new rankings data). I also noticed, too late, that Kaggle had changed its competition format. So no Kaggle glory for me this year. Instead, I’ve entered a few tournament pools and used my model to develop predictions. Let’s take a look at how the model is developed and what its output looks like.

The model is a logistic regression model with least absolute shrinkage and selection operator (LASSO) regularization. We use logistic regression when our response variable (the thing we are trying to predict) is binary, and a basketball game is a binary event: each team either wins or loses. The LASSO approach helps prevent overfitting (building a model that performs well on the data on which it was built but poorly on new data) and can also remove weak predictors from the model by driving their coefficients to zero.
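
If you’d like to experiment with this kind of model yourself, the glmnet package is one common way to fit a LASSO-regularized logistic regression in R. Here is a minimal sketch on simulated data (the predictors and outcome below are stand-ins for illustration, not the actual tournament data):

```r
# Minimal sketch: LASSO logistic regression with glmnet on simulated data
library(glmnet)

set.seed(2023)
n <- 400
x <- matrix(rnorm(n * 5), nrow = n,
            dimnames = list(NULL, paste0("predictor_", 1:5)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.0 * x[, 2]))  # 1 = "selected team won"

# family = "binomial" gives logistic regression; alpha = 1 requests the LASSO penalty
lasso_fit <- glmnet(x, y, family = "binomial", alpha = 1)

# At a stiffer penalty, some coefficients are driven exactly to zero
coef(lasso_fit, s = 0.05)
```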

The model is formulated to predict the probability that one of the teams in a given match-up wins the game. I use historical data from the NCAA tournaments played from 2016 to 2022, along with ranking and performance data collected just prior to each tournament. The list below shows the predictor variables used in the model (some of these predictors will be removed by the LASSO technique):

  • The teams’ final, pre-tournament T-Rank from https://barttorvik.com/ 

  • The difference in tournament seeds between the teams in each game

  • The adjusted offensive efficiency (also from Torvik) for each team

  • The adjusted defensive efficiency for each team

  • The effective field goal percentage for each team

  • The turnover percentage for each team

  • The offensive rebounding percentage for each team

  • The free throw rate for each team

The LASSO lambda tuning parameter is selected via k-fold cross-validation, where the folds correspond to the seasons in the dataset (each tournament year is held out in turn).
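
With glmnet, this season-as-fold scheme can be expressed through the foldid argument of cv.glmnet(). A sketch, continuing the simulated example above (the season labels here are made up):

```r
# Continuing the simulated sketch: pretend each game carries a season label
season  <- sample(2016:2022, n, replace = TRUE)
fold_id <- as.integer(factor(season))   # fold 1 = 2016, fold 2 = 2017, ...

# Cross-validate over lambda, holding out one season at a time
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                    foldid = fold_id, type.measure = "deviance")
cv_fit$lambda.min   # one common choice for the "optimal" lambda
```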

The resulting model, with the “optimal” lambda value, retains the following variables (for brevity I’ve left off the coefficient values, but I note the direction of each coefficient):

  • The selected team’s T-Rank (Up)

  • Seed difference (Down)

  • The team’s adjusted offensive efficiency (Up)

  • The opponent’s adjusted offensive efficiency (Down)

  • The team’s adjusted defensive efficiency (Down)

  • The opponent’s adjusted defensive efficiency (Up)

  • The team’s effective field goal percentage (Up)

  • The team’s free throw rate (Down)

  • The opponent’s free throw rate (Up)

The direction (Up or Down) refers to the change in the team’s probability of winning a game as the value of the predictor increases. For example, increasing offensive efficiency has the intuitive effect of increasing the probability of winning a game.
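
If you are fitting something similar with cv.glmnet(), the retained variables and the signs of their coefficients can be read directly from the coefficient vector at the chosen lambda (again using the simulated fit from above):

```r
# Predictors dropped by the LASSO appear as zeros (printed as ".") at the chosen lambda;
# the sign of each remaining coefficient gives the "Up" or "Down" direction
coef(cv_fit, s = "lambda.min")
```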

We can use the model to predict the probability that a team wins a particular match-up in the NCAA tournament. Feeding the first-round match-ups into the model yields the win probabilities shown in the table below:

Round of 64 Predictions
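
Behind a table like this there is nothing more exotic than predict() with type = "response". Here is a sketch using the simulated fit from above (the match-up values are random, purely to show the mechanics):

```r
# One hypothetical match-up with the same predictor columns as the training matrix
new_game <- matrix(rnorm(ncol(x)), nrow = 1,
                   dimnames = list(NULL, colnames(x)))

win_prob <- predict(cv_fit, newx = new_game, s = "lambda.min", type = "response")
win_prob                                    # predicted probability the selected team wins
ifelse(win_prob > 0.5, "advance the team", "advance the opponent")
```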

These types of models tend to be pretty “chalky” (i.e., prone to picking favorites and avoiding upsets), and we see that here. If we take every team with a win probability greater than 0.5 as the projected winner, we would see only three first-round upsets (by seed): Auburn over Iowa, West Virginia over Maryland, and Utah State over Missouri.

For my complete bracket, see the image below (taken from my entry in the CBS Sports bracket contest). While my model likes UCLA quite a bit, I’m personally rooting for my alma mater, Alabama, to make its first-ever Final Four and win the championship.

Model Bracket

Are you interested in learning more about data visualization using R? Click below to get notified about my upcoming book “Data Visualization in R”.

Each week we’ll feature a dataset that we find interesting, useful, etc. This week I refer you to my favorite resource for college basketball data. Bart Torvik runs barttorvik.com (also linked below). This is an excellent site for college basketball analytics. During the season (especially this season, since I’m an Alabama graduate), I visit this site nearly every day to see his projections and rankings data. There is also an R package, “toRvik”, that provides an easy way to access his data in R (link to package here). Check it out and start your basketball analytics journey in time for March Madness!

Feedback?

Did you enjoy this week’s newsletter? Do you have a topic, tool, or technique that you would like to see featured in a future edition? I’d love to hear from you!

Support the Newsletter?

Support this newsletter with a “coffee” (optional, but appreciated).

Start Your Own Newsletter?

This newsletter is created on and distributed via Beehiiv, the world’s best newsletter platform. Want to start your own newsletter? Click below to get started. Please note that this is an affiliate link. I may receive a small commission if you sign up for Beehiiv via this link.