NFL Win Total Prediction

ML/Analysis Techniques

  • Linear Regression
  • Polynomial Regression
  • Web scraping

Libraries/tools

  • scikit-learn
  • Statsmodels
  • Matplotlib/seaborn
  • BeautifulSoup
  • Pandas

Overview

As my first regression project at Metis, I decided to build a linear regression model to predict an NFL team's win total for the following season. To accomplish this, I first needed to scrape as many years of team-level NFL statistics as I could from websites like pro-football-reference.com and NFL.com. Since 2002 was the year the NFL added its 32nd team, bringing the league to its current size, I decided to scrape data from the 2002-2018 seasons.
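To make the scraping step concrete, here is a minimal sketch of pulling rows out of an HTML stats table with BeautifulSoup. The HTML fragment, the `team_stats` table id, the column names, and the `parse_team_table` helper are all hypothetical stand-ins; the real pages on the sites above have their own markup, and this is not the code used in the project.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded season page.
# Table id, column names, and values are invented for illustration.
html = """
<table id="team_stats">
  <thead><tr><th>Team</th><th>W</th><th>PF</th><th>PA</th></tr></thead>
  <tbody>
    <tr><td>Team A</td><td>11</td><td>400</td><td>320</td></tr>
    <tr><td>Team B</td><td>6</td><td>300</td><td>350</td></tr>
  </tbody>
</table>
"""


def parse_team_table(page_html, table_id):
    """Return one dict per table row, keyed by the header text."""
    soup = BeautifulSoup(page_html, "html.parser")
    table = soup.find("table", id=table_id)
    if table is None:
        return []
    headers = [th.get_text(strip=True) for th in table.find("thead").find_all("th")]
    rows = []
    for tr in table.find("tbody").find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        rows.append(dict(zip(headers, cells)))
    return rows


rows = parse_team_table(html, "team_stats")
```

In practice the same function would be called once per season page, with a `season` field added to each row before the rows are loaded into a pandas DataFrame.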

After collecting and cleaning the data, I began performing EDA and feature engineering. Some of the engineered features that improved model performance were net points, points per minute of offense, and a combined first- and third-down conversion rate. As you can see in the presentation below, the net points metric, or the difference between a team's points scored and points allowed, is highly correlated with team win totals.
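As a rough illustration of two of these features, the sketch below computes net points and points per minute of offense with pandas. The column names and sample values are hypothetical, and treating "minutes of offense" as time of possession is an assumption. The combined conversion rate is left out because its exact formula is not stated here.

```python
import pandas as pd

# Hypothetical team-season rows; column names and values are illustrative only.
df = pd.DataFrame({
    "team": ["Team A", "Team B"],
    "season": [2017, 2017],
    "points_for": [400, 300],
    "points_against": [320, 350],
    "time_of_possession_min": [500.0, 480.0],
})


def add_engineered_features(frame):
    """Return a copy of the frame with two engineered columns added."""
    out = frame.copy()
    # Net points: points scored minus points allowed.
    out["net_points"] = out["points_for"] - out["points_against"]
    # Points per minute of offense (assumed: points scored / time of possession).
    out["points_per_minute"] = out["points_for"] / out["time_of_possession_min"]
    return out


features = add_engineered_features(df)
```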

Next, I began modeling using the scikit-learn and statsmodels linear regression modules. The best linear regression model included L2 regularization and achieved an R² of 0.79 and an MAE of 1.12 wins on the test set. However, I made the mistake of using KFold cross-validation, which shuffled the dataset and almost certainly caused some data leakage, since the model could be trained on later seasons and validated on earlier ones. If I were to redo this stage of the project, I would use a time series cross-validation technique similar to the one in my second NFL project, which preserves the chronology of the data and avoids leaking future information to the model.
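A minimal sketch of what chronology-preserving cross-validation could look like with an L2-regularized (Ridge) model follows. The `season_forward_splits` helper, the `alpha` value, and the synthetic data are assumptions made for illustration; this is not the code from either project, and the scores it produces say nothing about the real model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error


def season_forward_splits(seasons, min_train_seasons=3):
    """Yield (train_idx, test_idx) pairs where every training row
    comes from a season strictly before the test season."""
    seasons = np.asarray(seasons)
    ordered = np.sort(np.unique(seasons))
    for i in range(min_train_seasons, len(ordered)):
        train_idx = np.where(seasons < ordered[i])[0]
        test_idx = np.where(seasons == ordered[i])[0]
        yield train_idx, test_idx


# Synthetic stand-in data: 32 teams over 8 seasons, one feature (net points).
rng = np.random.default_rng(0)
seasons = np.repeat(np.arange(2002, 2010), 32)
net_points = rng.normal(0, 100, size=seasons.size)
X = net_points.reshape(-1, 1)
y = 8 + 0.025 * net_points + rng.normal(0, 1.5, size=seasons.size)

fold_mae = []
for train_idx, test_idx in season_forward_splits(seasons):
    model = Ridge(alpha=1.0)  # L2-regularized linear regression
    model.fit(X[train_idx], y[train_idx])
    predictions = model.predict(X[test_idx])
    fold_mae.append(mean_absolute_error(y[test_idx], predictions))

mean_mae = float(np.mean(fold_mae))
```

Splitting by whole seasons, rather than by row position, keeps all 32 teams from a given year on the same side of each split, so no fold is ever validated on seasons older than its training data.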

Please see the presentation on this project I gave to my peers at Metis and the project files in my GitHub repository.