So, it all started with me just being tired of my bracket being busted every year. I thought, "There's gotta be a better way than just guessing!"
First thing I did was grab a bunch of data. I'm talking historical game stats, team rankings, offensive and defensive stats – the whole shebang. I scraped some of it from various college football data sites, and some I just downloaded as CSVs. It was messy, I ain't gonna lie.
Then came the fun part (not really): cleaning the data. Missing values, weird formatting, you name it. I mostly used Python with Pandas for this. I loaded the CSVs into DataFrames, filled missing values with means or medians (depending on the stat), and converted data types where needed. This took way longer than I expected!
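If you're curious what that cleanup looked like, here's a minimal sketch of the pattern. The column names here are made up for illustration (my real CSVs had way more columns), but the fill-and-convert steps are the same:

```python
import pandas as pd
import numpy as np

# Toy stand-in for one of the scraped CSVs -- hypothetical column names.
games = pd.DataFrame({
    "date": ["2022-09-03", "2022-09-10", "2022-09-17"],
    "points_for": [35.0, np.nan, 21.0],
    "turnovers": [1.0, 3.0, np.nan],
    "won": ["1", "0", "1"],
})

# Numeric stats: fill gaps with the mean, or the median for skewed stats.
games["points_for"] = games["points_for"].fillna(games["points_for"].mean())
games["turnovers"] = games["turnovers"].fillna(games["turnovers"].median())

# Fix types: dates arrive as strings, win/loss flags as text.
games["date"] = pd.to_datetime(games["date"])
games["won"] = games["won"].astype(int)
```

Which stat got a mean versus a median mostly came down to how skewed the distribution looked.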

After the data was relatively clean, I started engineering some features. I wanted to go beyond just raw stats. So, I calculated things like:
- Moving averages of offensive and defensive stats
- Win percentages over the last few seasons
- Strength of schedule (based on opponents' win percentages)
- Home field advantage (a simple binary: 1 if home, 0 if away)
I figured these might give the model a better overall picture of each team.
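The features above can be sketched in a few lines of pandas. This is a toy single-team example with made-up numbers and column names; the one thing worth copying is the `shift(1)`, which keeps each game's features built only from *earlier* games (otherwise you leak the result you're trying to predict):

```python
import pandas as pd

# Toy per-game frame for one team, in chronological order.
df = pd.DataFrame({
    "points_for": [28, 35, 14, 42, 21],
    "points_against": [21, 17, 31, 10, 24],
    "won": [1, 1, 0, 1, 0],
    "opp_win_pct": [0.50, 0.66, 0.75, 0.30, 0.60],
    "is_home": [1, 0, 1, 1, 0],  # home field advantage as a binary
})

# Moving averages over the previous 3 games (shift so a game sees only the past).
df["off_avg_3"] = df["points_for"].shift(1).rolling(3).mean()
df["def_avg_3"] = df["points_against"].shift(1).rolling(3).mean()

# Running win percentage going into each game.
df["win_pct"] = df["won"].shift(1).expanding().mean()

# Strength of schedule: average win percentage of opponents faced so far.
df["sos"] = df["opp_win_pct"].shift(1).expanding().mean()
```

The first few rows come out as NaN (not enough history yet), which is another thing the cleaning step has to deal with.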
Now for the model! I’m no ML expert, so I stuck with something relatively simple: a Logistic Regression model. I used scikit-learn in Python. I split the data into training and testing sets (80/20 split), trained the model on the training data, and then evaluated it on the testing data.
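The whole modeling step fits in about ten lines of scikit-learn. Here's the shape of it, using a synthetic dataset as a stand-in for my real feature matrix:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered feature matrix and win/loss labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

acc = accuracy_score(y_test, model.predict(X_test))
```

Nothing fancy, and that's the point: logistic regression is a sane baseline before reaching for anything heavier.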
Evaluating was… underwhelming. The initial accuracy was around 65%, which is better than a coin flip, but not by much. I tried a few things to improve it:
- Hyperparameter tuning (using GridSearchCV)
- Adding more features (some more complex ones based on point differentials)
- Trying a different model (Random Forest Classifier)
The Random Forest Classifier gave me a slightly better result, bumping the accuracy up to around 70%.
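For anyone who hasn't used GridSearchCV before, swapping in the Random Forest and tuning it looks roughly like this. The parameter grid here is a small hypothetical one, just to show the mechanics; the real search covered more values:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data again.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Cross-validated search over a small hypothetical grid.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [5, None]},
    cv=5,
)
grid.fit(X_train, y_train)

best = grid.best_estimator_       # refit on all training data
test_acc = best.score(X_test, y_test)
```

GridSearchCV does the cross-validation for you, so the held-out test set only gets touched once at the end.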
Finally, I used the trained model to predict the North Texas vs. SMU game. I fed it the relevant stats and features for each team, and it spit out a probability score. I’m not going to tell you who it predicted to win, because, honestly, I'm too embarrassed if it was wrong! Let's just say it was a "learning experience."
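Getting that probability score is just a `predict_proba` call. Here's the shape of it with a trained toy model and a made-up matchup row (the real input was the engineered features for the two teams):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Train on synthetic data; the real model used the engineered team features.
X, y = make_classification(n_samples=300, n_features=6, random_state=1)
clf = RandomForestClassifier(random_state=1).fit(X, y)

# One hypothetical matchup row with the same six features.
matchup = np.zeros((1, 6))
win_prob = clf.predict_proba(matchup)[0, 1]  # probability of class 1 winning
```

Unlike `predict`, which just gives you a 0 or 1, `predict_proba` tells you how confident the model is, which matters a lot when the matchup is close.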
Lessons Learned: Data is key. The more high-quality, relevant data you have, the better your model will perform. Feature engineering is also super important. Spending time thinking about what features might be predictive can make a big difference. And finally, don't expect miracles. Predicting sports outcomes is hard! But it was a fun project, and I learned a lot about data science along the way.
Now, I'm off to find more data and improve my model for next season. Wish me luck!