Match Attendance Prediction for a Professional Sports League

  1. Previous Game Stats — Taken from www.baseball-reference.com
  2. Next Match Conditions — MLB official season schedule
  3. Environmental Factors — MIQ’s weather data partner
  4. Derived Variables — Derived from the previous game stats
Previous Game Stats

Next Match Conditions
  1. Day/Night fixture for the next game
  2. Event (Y/N) — Whether the next game falls on a holiday or special event
  3. Opposition for the next match
  4. Days to the next match

Environmental Factors
  1. Temperature forecast
  2. Wind speed forecast
  3. Humidity forecast
  4. Precipitation forecast

Derived Variables
  1. Percentage cumulative losses up to the imminent match
  2. Percentage cumulative wins up to the imminent match
  3. Cumulative Wins − Cumulative Losses
  4. Runs Scored − Runs Allowed
  5. Opposition Team Rating (aggregated at the player level for each team) — http://www.espn.com/mlb/playerratings
  1. Conversion of categorical features into factors
  2. PDF & CDF inspection — for variable transformation and stats gathering
  3. Basic outlier treatment — Based on the inspection done earlier, some of the numerical variables were scaled, and data points lying beyond 3 standard deviations of the mean were removed (see the sketch after this list).
  4. Correlation Analysis — Multicollinearity occurs when independent variables in a model are highly correlated, causing problems during model fitting and interpretation. Variables with more than 75% correlation among themselves were removed from the data set, with their effect accounted for by keeping the best predictor among them (see the sketch after this list).
  5. Chi-square test for categorical variables — A chi-square test of independence was used to determine whether two categorical variables had any significant association, the hypotheses being H0: the two categorical variables are independent, vs. H1: the two categorical variables are related. In our analysis, the Day/Night condition of the match and the weekday of the match were suitable candidates for checking such an association. We obtained a chi-squared statistic of 103.22; since the p-value was below the 0.05 significance level, we rejected the null hypothesis and concluded that the two variables are, in fact, dependent (a sketch of this test follows the list).
  6. One-hot encoding for categorical variables — Many machine learning algorithms cannot operate on label (categorical) data directly; they require all input variables to be numeric. One-hot encoding was therefore used to convert categorical data into numerical form wherever required (see the sketch after this list).
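A minimal sketch of the PDF & CDF inspection in step 2, assuming a pandas DataFrame df with an attendance column (the column name is illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

# Empirical PDF (histogram) and CDF of a numeric feature, used to judge
# skewness and decide whether a transformation (e.g. log) is warranted.
# df is assumed to be the prepared DataFrame from the steps above.
values = np.sort(df["attendance"].dropna().to_numpy())

fig, (ax_pdf, ax_cdf) = plt.subplots(1, 2, figsize=(10, 4))
ax_pdf.hist(values, bins=30, density=True)
ax_pdf.set_title("Empirical PDF")
ax_cdf.plot(values, np.arange(1, len(values) + 1) / len(values))
ax_cdf.set_title("Empirical CDF")
plt.show()
```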
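For steps 3 and 4, a hedged sketch of the 3-standard-deviation outlier rule and the 0.75 correlation filter, again assuming a DataFrame df whose numeric feature columns are listed in numeric_cols. The article keeps the best predictor from each correlated group; the sketch simplifies this by dropping the later column of each highly correlated pair:

```python
import numpy as np

def treat_outliers_and_collinearity(df, numeric_cols, z_thresh=3.0, corr_thresh=0.75):
    """Drop rows beyond z_thresh standard deviations, then drop one column
    from each pair of features correlated above corr_thresh."""
    out = df.copy()

    # Standardize and keep only rows within +/- z_thresh standard deviations
    z = (out[numeric_cols] - out[numeric_cols].mean()) / out[numeric_cols].std()
    out = out[(z.abs() <= z_thresh).all(axis=1)]

    # Absolute pairwise correlations, upper triangle only to avoid double counting
    corr = out[numeric_cols].corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

    # For each highly correlated pair, drop the later column and keep the other
    to_drop = [col for col in upper.columns if (upper[col] > corr_thresh).any()]
    return out.drop(columns=to_drop), to_drop
```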
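The chi-square check in step 5 can be reproduced with scipy.stats.chi2_contingency; the column names day_night and weekday are assumed placeholders, and the statistic of 103.22 quoted above of course depends on the actual data:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of Day/Night condition vs. weekday of the match
# (column names are assumed placeholders for the actual data set)
contingency = pd.crosstab(df["day_night"], df["weekday"])

chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")

# Reject H0 (independence) at the 5% significance level
if p_value < 0.05:
    print("Day/Night fixture and weekday appear to be dependent")
```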
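And a sketch of the one-hot encoding in step 6 using pandas; the list of categorical columns is illustrative:

```python
import pandas as pd

# Categorical columns to expand into 0/1 indicator columns (names are illustrative)
categorical_cols = ["day_night", "weekday", "opposition", "event"]

# drop_first avoids perfectly collinear dummy columns, which matters for the
# linear models fitted later
X = pd.get_dummies(df, columns=categorical_cols, drop_first=True)
```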
List of Independent Variables
  1. Linear Regression — A linear model was fitted on each year’s data, followed by the combined data set. We observed that the R-squared took a hit once we combined the data of all the years, probably because the yearly models were overfitting each individual year’s data.
  2. Regularized Linear Models — These models typically keep the same number of features but shrink the magnitude of the coefficients (by penalizing them) in order to avoid overfitting. For our case, we deployed Ridge, Lasso, and a series of ElasticNet models with varying penalty magnitude. The results of these and the subsequent models are shown together in the Model Performance Comparison illustration.
  3. Ensemble Models — We deployed Random Forest and Gradient Boosting Machine (GBM) ensembles on our training data. A grid search was used to tune the parameters of the Random Forest model. In parallel, we boosted a large number of shallow decision trees, with the objective of comparing and contrasting these popular ensemble techniques. The results of both models appear in the Model Performance Comparison illustration.
  4. XGBoost — XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. Boosting is an ensemble technique in which new models are added to correct the errors made by existing models, and it has recently been dominating applied machine learning and data science competitions on structured, tabular data. The performance is illustrated in the Model Performance Comparison chart below (Figure 1). Hedged code sketches of these models follow the chart.
Figure 1: A comparison of model performance on test data
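A minimal sketch of the linear and regularized linear models, assuming X (the one-hot encoded feature matrix) and y (attendance) from the preparation steps above; the alpha values are illustrative, not the tuned penalties:

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# X: one-hot encoded features, y: attendance (assumed from the steps above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=0.1),
}
# A small series of ElasticNet models with varying penalty magnitude
for alpha in (0.01, 0.1, 1.0):
    models[f"elasticnet_alpha={alpha}"] = ElasticNet(alpha=alpha, l1_ratio=0.5)

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(r2_score(y_test, model.predict(X_test)), 3))
```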
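A sketch of the Random Forest grid search and the GBM of many shallow trees; the hyperparameter grid and values are assumptions, not the grid actually used:

```python
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Grid search over a few Random Forest hyperparameters (illustrative grid)
rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"n_estimators": [200, 500], "max_depth": [None, 8, 16]},
    scoring="r2",
    cv=5,
)
rf_grid.fit(X_train, y_train)

# Boosting with a large number of shallow trees
gbm = GradientBoostingRegressor(n_estimators=500, max_depth=3, learning_rate=0.05)
gbm.fit(X_train, y_train)

print("Random Forest best params:", rf_grid.best_params_)
print("Random Forest test R^2:", rf_grid.score(X_test, y_test))
print("GBM test R^2:", gbm.score(X_test, y_test))
```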
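And a sketch of the XGBoost regressor via its scikit-learn wrapper; the parameter values shown are placeholders rather than the tuned settings behind Figure 1:

```python
from xgboost import XGBRegressor

# Gradient boosted trees; parameter values are illustrative placeholders
xgb_model = XGBRegressor(
    n_estimators=500,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
)
xgb_model.fit(X_train, y_train)
print("XGBoost test R^2:", xgb_model.score(X_test, y_test))
```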
  1. Data Limitation — We had a total of five game seasons, with very little variation in most of the independent variables.
  2. Non-homogeneity of participants — As mentioned earlier, the baseball teams from the American League that win and get promoted to the National League keep changing year on year without any external control factor. This introduces external variance in the data set that is not consistent on a yearly basis, which is also evident in the performance of the yearly models. We considered removing these records from our model, but that would have further reduced data availability; hence, we decided to accept a hit on accuracy rather than drop them.
  3. Performance of ensemble models on training vs. testing data — Tree ensemble models such as GBM and XGBoost are known to overfit the training data set, which held true in our case too. However, with some hyperparameter tuning, their performance was comparable to the best performing ElasticNet model.
