Korean Zombie vs Ortega


After their bout on December 21, 2019 was cancelled, and with some drama in between,
The Korean Zombie is finally going to fight Ortega on October 17, 2020.

What is the UFC?

The Ultimate Fighting Championship (UFC) is a promotion company that organizes MMA fights.
Two fighters face each other in the octagon in rounds of 5 minutes each. A championship match has 5 rounds, and a standard match has 3.

A fight can go the full 3 or 5 rounds, or it can end early when a fighter is knocked out, taps out, or sometimes just forfeits.



For me, although it may be fun to watch 2 people fight it out, it is even more fun when I see the guy I rooted for win.

However, when I root for a fighter and he loses, I get disappointed.
So in order to ease my disappointment, what if I could predict the outcome of a fight before it happens?

Here is what I did.


Data Collection

First I needed to collect data on all past UFC events.

I used Scrapy and BeautifulSoup to scrape data from ufcstats.com.

I collected data on every UFC fighter, match, and event from 1993 to 2020 and saved it in a SQLite database.
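
As a rough illustration of the storage step, here is a minimal sketch that writes scraped rows into SQLite with Python's built-in sqlite3 module. The table name fights, its columns, and the placeholder row are made up for the example and are not the real schema.

```python
import sqlite3

def save_fights(conn, rows):
    """Create a (hypothetical) fights table if needed and insert scraped rows."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS fights (
            event_date TEXT,
            fighter_1  TEXT,
            fighter_2  TEXT,
            winner     TEXT
        )
        """
    )
    # Parameterized inserts avoid quoting problems with fighter names.
    conn.executemany("INSERT INTO fights VALUES (?, ?, ?, ?)", rows)
    conn.commit()

# In-memory database for the example; passing a file path would persist the data.
conn = sqlite3.connect(":memory:")
save_fights(conn, [("2020-01-01", "Fighter A", "Fighter B", "Fighter A")])
stored = conn.execute("SELECT COUNT(*) FROM fights").fetchone()[0]
```

Keeping everything in one database file makes it easy to re-run the later steps without scraping the site again.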

I documented my data collection process which can be found here.

Feature Engineering

After collecting all the data, I realized how messy it was. 
Some strings had to be changed to numbers, columns had to be split and new features had to be created.

Ex: For column td_1, the word "of" had to be removed and the two numbers had to be split into 2 separate columns.
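
A small sketch of that split, assuming the raw value looks like "3 of 7". The function name and the reading of the two numbers as landed and attempted are my assumptions for the example.

```python
import re

def split_landed_attempted(value):
    """Turn a string like '3 of 7' into the pair of integers (3, 7)."""
    match = re.fullmatch(r"\s*(\d+)\s+of\s+(\d+)\s*", value)
    if match is None:
        # Fail loudly on unexpected formats instead of guessing.
        raise ValueError(f"unexpected format: {value!r}")
    return int(match.group(1)), int(match.group(2))

# Assumed meaning: first number is landed, second is attempted.
landed, attempted = split_landed_attempted("3 of 7")
```

Each half can then be stored in its own numeric column.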


Some features I created were the age of each fighter at the time of the fight, cumulative statistics for each fighter based on their previous fights, and the number of days since the fighter's last fight, win, and loss.
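
The date-based features can be sketched with the standard library alone. This is only an illustration: the function names are hypothetical and the dates in the example are invented.

```python
from datetime import date

def age_at_fight(birth_date, fight_date):
    """Age in whole years on the day of the fight."""
    had_birthday = (fight_date.month, fight_date.day) >= (birth_date.month, birth_date.day)
    return fight_date.year - birth_date.year - (0 if had_birthday else 1)

def days_since_last_fight(previous_fight_dates, fight_date):
    """Days since the most recent earlier fight, or None for a debut."""
    earlier = [d for d in previous_fight_dates if d < fight_date]
    if not earlier:
        return None
    return (fight_date - max(earlier)).days

# Invented dates, only to show the calls.
age = age_at_fight(date(1990, 6, 15), date(2020, 6, 14))
layoff = days_since_last_fight([date(2019, 5, 1), date(2020, 1, 1)], date(2020, 1, 31))
```

Filtering to fights strictly before the current one matters here, since using later fights would leak future information into the features. The days since the last win or loss follow the same pattern with a filtered list of dates.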

I documented the whole feature engineering process which can be found here.


Building the Model

After feature engineering I decided to build the model using supervised machine learning algorithms.

The first machine learning algorithm I chose was Logistic Regression.

Logistic Regression is an algorithm that can solve classification problems. Before using it, however, one must check that its assumptions are met:

  •  Data must have minimal multicollinearity
  •  Observations must be independent
  •  Target variable must be binary
  •  Sample size should be adequate

To check for multicollinearity, I computed the Pearson correlation between each pair of features and plotted it on a heat map. Features that were highly correlated with another feature were removed.
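
The correlation filter can be sketched roughly as follows. The 0.9 cutoff, the greedy keep-first strategy, and the toy numbers are assumptions for illustration; the post does not state the exact threshold that was used.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (non-constant inputs)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def drop_correlated(features, threshold=0.9):
    """Keep features in order, skipping any whose |r| with a kept feature exceeds threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Toy data: height moves in lockstep with reach, so one of the two is redundant.
toy = {
    "reach": [70, 72, 74, 76],
    "height": [68, 70, 72, 74],
    "age": [31, 24, 35, 28],
}
kept = drop_correlated(toy)
```

With this toy data, height is dropped because it is perfectly correlated with reach, while age is kept.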

The final model consisted of just 18 features.

The graph above shows the features used for the logistic regression model and the model coefficients.


Because of limited data, I chose to do cross-validation with 6 splits on the logistic regression model. I repeated the same steps with other classification algorithms: decision tree, random forest, and support vector machine.
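
To show what 6 splits means in practice, here is a minimal sketch of how k-fold cross-validation partitions the rows. It is a plain, unshuffled split written by hand for illustration; libraries such as scikit-learn provide this out of the box, and whether the data was shuffled or stratified is not something this sketch tries to reproduce.

```python
def k_fold_indices(n_samples, n_splits=6):
    """Yield (train_idx, test_idx) pairs; every sample is in the test set exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# 20 rows split into 6 folds: the model is trained 6 times, each time tested on a different fold.
folds = list(k_fold_indices(20, n_splits=6))
```

Averaging the score over the 6 held-out folds gives a steadier estimate than a single train/test split when data is limited.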

To find the best algorithm, I first recorded how long each of the 4 models took to make its predictions. Logistic Regression was the fastest.



I then measured the ability of the 4 models to correctly classify by plotting an ROC curve.

The model with the greatest area under the ROC curve (AUC) performs the best. In this case it was the Logistic Regression model once again.
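
For intuition, the area under the ROC curve equals the probability that a randomly chosen win gets a higher predicted score than a randomly chosen loss. The sketch below computes it that way on made-up labels and scores; it is an illustration of the metric, not the plotting code behind the curves.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return correct / (len(pos) * len(neg))

# Made-up example: 3 of the 4 (win, loss) pairs are ranked correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.2])
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking.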


And so, for the final model, I chose the Logistic Regression Algorithm.

The Logistic Regression Model had an accuracy score of 77%!


However, since this is a classification problem with unevenly split classes, accuracy alone is less informative than precision and recall.

This model had a recall of 88%, a precision of 77%, and an F1 score of 82%.
Of all the wins present in the data set, the model correctly labeled 88% of them.
Of all the wins that the model predicted, 77% were actual wins.
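
As a quick consistency check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the two reported numbers. The helper below shows the usual definitions from confusion-matrix counts; its name is only illustrative.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Recompute F1 from the reported precision (77%) and recall (88%).
precision, recall = 0.77, 0.88
f1 = 2 * precision * recall / (precision + recall)
f1_rounded = round(f1, 2)  # 0.82, matching the reported F1 score
```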

Prediction

And finally the model predicts our winner to be.....

FIGHTER 1

Unfortunately, this model gives higher weight to the first fighter on the card.
This means that if this were a championship match, the contender would have a higher chance of losing.

In the data, the first-listed fighter wins almost 66% of the time.

So if I place Chan Sung Jung as fighter 1 and Brian Ortega as fighter 2, the model predicts that Chan Sung Jung will win. However, if Chan Sung Jung is fighter 2 and Brian Ortega is fighter 1, it predicts that Brian Ortega will win.

In this case the event is called Ortega vs The Korean Zombie.
So Ortega will win according to this model.

If it were The Korean Zombie vs Ortega, then The Korean Zombie would win.

For my prediction, the model had Jung as fighter 1 and Ortega as fighter 2, so it picked Jung to win with 65% probability.


Future

It can be hard to predict the outcome of a fight because many factors can affect performance, such as the condition of the fighter coming into the fight.

To improve this model, I will need to balance the data set so that fighter 1 and fighter 2 each have an equal, 50/50 chance of winning. This can be done by swapping the positions of the fighters in some fights, so that fighter 1 and fighter 2 win an equal number of times.
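
A minimal sketch of that balancing idea: swap the two fighters in a random half of the rows and flip the label to match. The dictionary keys and the fixed seed are assumptions for the example, and draws or no-contests are ignored. In the real data set, every per-fighter feature column would have to be swapped along with the names.

```python
import random

def winner(fight):
    """Name of the winning fighter for a row."""
    return fight["fighter_1"] if fight["fighter_1_won"] else fight["fighter_2"]

def balance_by_swapping(fights, seed=0):
    """Swap fighter positions in roughly half of the rows, flipping the label accordingly."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    balanced = []
    for fight in fights:
        if rng.random() < 0.5:
            balanced.append({
                "fighter_1": fight["fighter_2"],
                "fighter_2": fight["fighter_1"],
                "fighter_1_won": not fight["fighter_1_won"],
            })
        else:
            balanced.append(dict(fight))
    return balanced

# Placeholder rows where fighter 1 always wins, mimicking the position bias.
fights = [{"fighter_1": "A", "fighter_2": "B", "fighter_1_won": True} for _ in range(10)]
balanced = balance_by_swapping(fights)
```

The actual winner of every fight stays the same; only the position of the winner changes, so the model can no longer lean on who is listed first.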

In the future, I could also scrape weather data to check whether the weather has an effect on an individual's performance.

Source:

http://www.ufcstats.com/
https://github.com/yoonsunghwan/UFC_DATA



