Healthy Dose of Skepticism with Machine Learning

This week, two small changes to the way I built two models resulted in an absurd boost in the performance of both, to the point where it doesn’t even make sense for the models to be this good. The obvious conclusion is that I’ve made a mistake somewhere. The thing is, I have absolutely no idea where the mistake lies. But in trying to figure it out, I was forced to take a closer look at both the models and the data, aided by a healthy dose of skepticism towards what the models do and how good their performance can be.

Problem 1, The UFC Fight Classifier

This is a pet project a close friend and I have been working on since our third year of undergrad: for a given fight, can we predict the winner using each fighter’s tendencies? To do this, we built a dataset by scraping the UFC website (the UFC gets its data from an organization called FightMetric LLC), which has an unsecured API that you can query to get JSON results about every single UFC fight since 2003, with a round-wise breakdown of the strikes, takedowns and defenses used by each fighter. Given this information, we took each fight and worked our way backwards, figuring out each fighter’s average number of strikes, takedowns and defenses per fight across the fights they had been in. In isolation this was potentially a mistake, since averages are resistant to change over a small number of fights. However, the idea was to create a ‘fingerprint’ of the fighter’s style: are fighters that use a lot of volume more or less likely to win fights? What about grapplers vs boxing/kickboxing/muay thai specialists? We attempted to create a few other features -

  • how long the fighter’s current win streak was; this feature turned out to be useless, since the vast majority of these were 1 fight streaks.
  • how many fights the fighter had been in; also useless, since this feature turned out to be a proxy for the fighter’s age, which was already in the dataset and is obviously a really important feature for any sport.
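The per-fighter averages described above can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the row layout, the function name `prior_averages` and the stat name `strikes` are all hypothetical, and it assumes the averages should only use fights that happened before the one being predicted, so that a fight’s own numbers never end up in its own features.

```python
from collections import defaultdict


def prior_averages(fights, stat_names):
    """For each fight, average each fighter's stats over their EARLIER fights only.

    `fights` must be sorted oldest first. Hypothetical row layout:
    {"red": {"name": ..., <stat>: ...}, "blue": {"name": ..., <stat>: ...}}
    """
    totals = defaultdict(lambda: defaultdict(float))  # fighter -> stat -> running sum
    counts = defaultdict(int)                         # fighter -> fights seen so far
    rows = []
    for fight in fights:
        row = {}
        for corner in ("red", "blue"):
            name = fight[corner]["name"]
            seen = counts[name]
            row[f"{corner}_prior_fights"] = seen
            for stat in stat_names:
                # None for a debut: there is no history to average yet.
                row[f"{corner}_avg_{stat}"] = totals[name][stat] / seen if seen else None
        rows.append(row)
        # Only now fold this fight into the running totals, so a fight's own
        # numbers never appear in its own features.
        for corner in ("red", "blue"):
            name = fight[corner]["name"]
            counts[name] += 1
            for stat in stat_names:
                totals[name][stat] += fight[corner][stat]
    return rows


# Made-up toy data, purely to show the shape of the output.
example_fights = [
    {"red": {"name": "A", "strikes": 10}, "blue": {"name": "B", "strikes": 20}},
    {"red": {"name": "A", "strikes": 30}, "blue": {"name": "C", "strikes": 5}},
    {"red": {"name": "B", "strikes": 40}, "blue": {"name": "A", "strikes": 50}},
]
features = prior_averages(example_fights, ["strikes"])
```

The `*_prior_fights` column is the "how many fights" feature from the list above, and it also shows why the averages are shaky: with only one or two earlier fights, a single unusual performance dominates the fingerprint.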

While we were in undergrad, neither of us had a good enough understanding of what we were doing, so we essentially went through the motions in a very robotic fashion: take the data, use random forests/correlation to figure out the important features, throw this reduced set of features at every single model we’d heard of and compare the results. Nothing really helped. We had around 52-53% prediction accuracy on a dataset where the red side fighter won ~60% of the time, which is worse than always picking red. The red fighter’s age was significantly more important than anything else in our dataset, and everything else was about equally important. At the time we had a few thousand fights.

Cue this semester, when my friend was taking a machine learning class during his PhD and wanted to work on improving this project. The dataset had increased in size (we had about 2k fights now), but the same models and processes only improved our prediction accuracy by about 3%. We were consistently getting about 56% now, better than before but still nothing to write home about. This time around, though, a few more features started to become important. The fighters’ ages were still important, but the red fighter’s statistics, whether striking or takedowns or whatever, dominated the top 10 most important features. This led us to believe that perhaps the models were overfitting rather heavily on the red side fighter; after all, 60% of the victories in the dataset belonged to the red side fighter.

Cue crazy idea: we took our dataset, swapped the red and blue fighter statistics, replaced the winner column appropriately and added the result to the original dataset. To call this data augmentation is a bit of a stretch; essentially all we’ve done is swap the labels and duplicate all our training data. However, when we did this, our accuracy suddenly shot up to 92% (after 10-fold cross validation) for an untuned model. Before you ask, the F1 was .84 and you can judge the ROC curve for yourself.

[Figure: UFC ROC Curve]
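The red/blue swap can be sketched like this. It is a minimal illustration under assumptions: the `red_`/`blue_` column prefixes, the `winner` values and the added `fight_id` column are hypothetical names, not the real schema.

```python
def mirror_fight(row):
    """Return a copy of `row` with red_/blue_ columns exchanged and the winner flipped."""
    mirrored = {}
    for key, value in row.items():
        if key.startswith("red_"):
            mirrored["blue_" + key[len("red_"):]] = value
        elif key.startswith("blue_"):
            mirrored["red_" + key[len("blue_"):]] = value
        else:
            mirrored[key] = value
    # Anything other than "red"/"blue" (e.g. a draw) is left unchanged.
    flip = {"red": "blue", "blue": "red"}
    mirrored["winner"] = flip.get(row["winner"], row["winner"])
    return mirrored


def augment(rows):
    """Return every fight twice: once as recorded, once mirrored.

    Both copies share a fight_id so the pair can be identified later.
    """
    out = []
    for fight_id, row in enumerate(rows):
        out.append({**row, "fight_id": fight_id})
        out.append({**mirror_fight(row), "fight_id": fight_id})
    return out


# Made-up toy rows, only to show the transformation.
toy_rows = [
    {"red_age": 30, "blue_age": 25, "winner": "red"},
    {"red_age": 28, "blue_age": 33, "winner": "blue"},
]
augmented = augment(toy_rows)
red_win_rate = sum(r["winner"] == "red" for r in augmented) / len(augmented)
```

One side effect of mirroring every row is that the red side’s win rate becomes exactly 50% (ignoring draws), so "red wins" stops being a useful shortcut for the model.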

Essentially what this means is that every row in our dataset now has a mirrored duplicate, so the score could just be the model overfitting (though the learning curve doesn’t reflect that). So we created a few artificial test sets to see how the model would respond: a holdout set that the model wasn’t specifically trained on, where all the data points were in their original orientation, and another made up only of rows where the blue fighter wins. The accuracy dropped, but was still well above 90% and 80% respectively, significantly better than the previous best score we had.

The point I’m trying to make is that I find it necessary to deliberately throw adversarial situations at a trained model to try and understand how it will react, which is quite important considering the eventual goal is to move models to production. Observing how the model reacts in these situations reminds you that models do not need to be black boxes, and that they are functions of the bias and distributions in your training data. Sometimes it helps to dive a little deeper to create a robust model, something that I never thought about during my undergrad.
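One adversarial check worth adding to that list: when every fight has a mirrored twin, a row-level split can put a fight in the test fold and its twin in the training fold. Whether that explains the 92% is not something this sketch can settle; it only shows how to count such leaked pairs and how to split by fight instead of by row. The function names and the `group_ids` layout are hypothetical, and the code assumes each row carries an id shared with its mirrored copy.

```python
import random


def group_k_fold(group_ids, n_splits, seed=0):
    """Yield (train_idx, test_idx) with all rows of a group kept in the same fold."""
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)
    fold_of = {group: i % n_splits for i, group in enumerate(unique)}
    for fold in range(n_splits):
        test = [i for i, g in enumerate(group_ids) if fold_of[g] == fold]
        train = [i for i, g in enumerate(group_ids) if fold_of[g] != fold]
        yield train, test


def leaked_rows(group_ids, train_idx, test_idx):
    """Count test rows whose group (fight) also appears in the training rows."""
    train_groups = {group_ids[i] for i in train_idx}
    return sum(1 for i in test_idx if group_ids[i] in train_groups)


# 10 toy fights, each present twice (original at even index, mirror at odd index).
group_ids = [i // 2 for i in range(20)]

# Worst case for a row-level split: test on the originals, train on the mirrors.
naive_test = list(range(0, 20, 2))
naive_train = list(range(1, 20, 2))
naive_leaks = leaked_rows(group_ids, naive_train, naive_test)

# Fight-level split: no test fight has a twin in the training rows.
grouped_leaks = [
    leaked_rows(group_ids, train, test)
    for train, test in group_k_fold(group_ids, n_splits=5)
]
```

scikit-learn’s `GroupKFold` gives the same guarantee for real cross validation runs. If the accuracy stays high when fights are split this way, the duplication is probably not what is inflating it.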

Problem 2, Jiu Jitsu Move Classification

I’m currently working my way through the v3 course, and a fun project I picked up along the way was to create a classifier that can look at images of people doing jiu jitsu and determine whether they’re trying to pull off/escape from an armbar or a heelhook. I grabbed some 500 images of armbars and 170 images of heelhooks, and an out-of-the-box resnet50 gives me a <1% error rate after about 8 epochs (I used shuf to create a test set from 30% of each class’s images). The problem here is a lot easier to figure out: my test set is tiny, which leads to one-off results that look amazing. My entire dataset is tiny, and moreover it’s heavily imbalanced. Traditionally the way to deal with this is to oversample the underrepresented class, but in my fairly limited experience that’s not needed most of the time. More importantly, given that the dataset is so imbalanced, my immediate instinct is to look at the F1, because that should give me a better picture of how the model will generalize, right? Well, this is what my confusion matrix looks like.

[Figure: Confusion Matrix]
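To see why accuracy alone is a weak signal here, it helps to compute the metrics for a model that has learned nothing. The sketch below assumes the 30% test split holds roughly 150 armbar and 51 heelhook images (derived from the approximate counts above, so treat them as estimates) and treats heelhook as the positive class; `binary_metrics` is a hypothetical helper, not something from the course library.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Estimated test split: 30% of ~500 armbars and 30% of 170 heelhooks.
n_armbar, n_heelhook = 150, 51

# Baseline that always answers "armbar": every heelhook is a false negative.
always_armbar = binary_metrics(tp=0, fp=0, fn=n_heelhook, tn=n_armbar)

# How much one misclassified image moves the error rate on a test set this small.
error_per_image = 1 / (n_armbar + n_heelhook)
```

The do-nothing baseline already scores about 75% accuracy with an F1 of 0 on heelhooks, and each single mistake is worth roughly half a percentage point of error, so a "<1% error rate" on this split comes down to a couple of images at most.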

Again, trying to understand why the model works as well as it does involves throwing some slightly irregular images at it in the test set. The course recommends looking at things like heatmaps of the filters applied to the images, to see which regions of an image the convolutional neural net looks at more closely. In addition to that, I had a friend take actual pictures of himself trying out these moves on one of his students in a poorly lit gym. These images were high quality (specifically, higher quality than most of the training images) and had a large mirror in the background, which added more elements to the images and which I deliberately did not crop out. The model still works really well, but I have less of an intuition for how to generate deliberately adversarial examples for images, since I don’t really have a mental model of how images get vectorized. Yes, you take the individual RGB channels and pixel values and flatten them out as integers/floats, but it’s more difficult for me to get a sense of what CNNs would find adversarial. I’ve read some papers about randomly inserting blank pixels and doing strange things like reducing opacity/contrast between different shades in the images, but the latter doesn’t necessarily make a lot of sense to me, since the pixel values will typically get normalized anyway.
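Whether normalization really cancels out a contrast change depends on which kind of normalization the pipeline uses, and that is easy to check on a toy example. The sketch below is an assumption-laden illustration: an "image" is just a flat list of pixel values in [0, 1], the helper names are hypothetical, and the fixed mean/std of 0.485/0.229 are the commonly quoted ImageNet red-channel statistics, which may or may not match what the course library applies.

```python
from statistics import fmean, pstdev


def reduce_contrast(pixels, factor):
    """Pull every pixel towards the image mean; factor=1 leaves the image unchanged."""
    mean = fmean(pixels)
    return [mean + factor * (p - mean) for p in pixels]


def standardize_per_image(pixels):
    """Normalize using this image's own mean and standard deviation."""
    mean, std = fmean(pixels), pstdev(pixels)
    return [(p - mean) / std for p in pixels]


def normalize_fixed(pixels, mean, std):
    """Normalize using fixed dataset-level statistics, the same for every image."""
    return [(p - mean) / std for p in pixels]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


image = [0.1, 0.4, 0.5, 0.9]             # made-up pixel values
washed_out = reduce_contrast(image, 0.5)  # same picture at half the contrast

# Per-image standardization: the contrast change disappears.
diff_per_image = max_abs_diff(
    standardize_per_image(image), standardize_per_image(washed_out)
)

# Fixed dataset statistics: the network still sees a different input.
diff_fixed = max_abs_diff(
    normalize_fixed(image, 0.485, 0.229), normalize_fixed(washed_out, 0.485, 0.229)
)
```

With per-image statistics the two versions come out numerically identical, so a contrast perturbation would indeed be pointless. With fixed dataset-level statistics they do not, which means reduced contrast can still be a meaningful stress test in that setup.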