Overfitting
This quarter at work I get to work with Google Brain models to detect specific kinds of abuse, which is awesome because it coincides with my personal learning goals (beefing up my ML knowledge on the theoretical level, since most of what I know so far is practical or abstracted). So far in my (short) professional career I’ve built and used simple models (mostly SVMs), counting methods, time-series analysis, and graph processing / clustering. These have mostly been in the service of fighting spam and abuse.
Of course, what’s in vogue now is deep learning. I’m learning a lot about how impractical or unnecessary “deep learning applied to everything” can be. It’s not necessarily the best way to approach every problem, and there are major caveats to using ML tech that is not easily understandable.
Contrary to what people might assume, even companies with significant ML talent such as Facebook and Google have only just begun to try applying complex models to tasks such as abuse fighting. In a world where false positives on customers cost a lot of real money and goodwill (…which also translates to money), it’s important to get things right and fix them when things go wrong. This is very tricky.
The High Interest Credit Card of ML
Part of the deep learning hype is the unreasonable efficacy of neural nets: given a large amount of data encoding unknown features, automated feature selection often yields measurably better results than handcrafted features + an equivalent ML model. Now, previously we learned that more data trumps more complex classifiers, but with multilayer neural nets the practical effectiveness gains on tasks such as ASR and image recognition are frankly astounding, taking previously unusable technologies to new levels of awesome (e.g. Siri, Cortana, Amazon Echo, ToyTalk).
At the same time, it’s difficult to intuit how things can go wrong, and when they do, if you built the model you will likely be responsible for debugging it.
Julie Evans has a really great checklist that can help make good ML decisions. From the post:
A model you don’t understand is
 awesome. It can perform really well, and you can save time at first by ignoring the details.
 scary. It will make unpredictable and sometimes embarrassing mistakes. You’re responsible for them.
 only as good as your data. Often when I train a new model I think at some point “NO PLZ DON’T USE THAT DATA TO MAKE DECISION OH NOOOOO”
Some way to make it less scary:
 have a human double check the scariest choices
 use complicated models when it’s okay to make unpredictable mistakes, simple models when it’s less okay
 use ML for research, learn why it’s doing better, incorporate your findings into a less complex system
At a previous company we often used linear SVMs + hand-engineered features to detect spam and abuse. Having just graduated from college I couldn’t believe it – surely we used more interesting things! And if SVMs, why not a more robust or complex kernel, e.g. RBF? The reasons I got in response were exactly those on the checklist: it would be expensive to have false positives, and it wasn’t OK to make unpredictable mistakes.
Overfitting the World
Using “understandable models” as a concept may seem vague and arbitrary but there is rigor in sight! Specifically, there is a more rigorous answer to my question: “why not use RBF kernels? It seems like we get better results on our training set.” The answer lies within VC theory.
(The basics of Vapnik-Chervonenkis theory are less intimidating than the name makes them sound.)
Imagine you have 3 points being classified positive or negative. A question you can ask is, “is it possible to draw a line that separates the three points perfectly into positive and negative halves?” Let’s try:
It turns out that 3 points can always be separated into positive and negative groups perfectly, however they are labeled (provided they are not in a line). However, if you have 4 points, there is always at least one way of labeling them that no single line can separate.
Since 3 is the limit for the number of points in a plane that a single line can classify perfectly under every labeling, a classifier that draws a single line can be said to have a VC dimension of 3.
Now of course most classifiers draw something more complicated than a single line (e.g. polynomial SVMs draw a family of functions in N-dimensional space). But the principle is the same – given n (binary) data points you have 2^n possible labelings. And if your classifier can separate the points perfectly for every one of those labelings, that set is considered shattered. The VC dimension is the size of the largest set your classifier can shatter, which makes it an upper bound on the separating power of your classifier.
Note that a function family with VC dimension n is not guaranteed to shatter all possible sets of n points; it is only guaranteed to shatter some set of n points. This is why we call it an upper bound: if your VC dimension is 5, you’re guaranteed that no set of 6 points can be shattered by your classifier.
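The shattering argument can be checked by brute force. Below is a minimal sketch in Python, assuming a plain perceptron as the way to test whether a labeling is linearly separable (my choice for illustration, not something from the original discussion). The function names and the three example point sets are hypothetical, and the epoch cutoff is a heuristic rather than a proof of non-separability.

```python
from itertools import product

def linearly_separable(points, labels, max_epochs=1000):
    """Try to find a separating line w.x + b with the perceptron rule.

    The perceptron converges on any linearly separable labeling, so running
    out of epochs is treated as "not separable" (a heuristic cutoff).
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for (x, y), label in zip(points, labels):
            if label * (w[0] * x + w[1] * y + b) <= 0:
                # Misclassified (or on the line): nudge the line toward the point.
                w[0] += label * x
                w[1] += label * y
                b += label
                mistakes += 1
        if mistakes == 0:
            return True
    return False

def shattered_by_lines(points):
    """True if every one of the 2^n labelings of the points is separable."""
    return all(
        linearly_separable(points, labels)
        for labels in product([-1, 1], repeat=len(points))
    )

triangle = [(0, 0), (1, 0), (0, 1)]          # 3 points, not in a line
collinear = [(0, 0), (1, 0), (2, 0)]         # 3 points in a line
square = [(0, 0), (1, 1), (1, 0), (0, 1)]    # 4 points

print(shattered_by_lines(triangle))
print(shattered_by_lines(collinear))
print(shattered_by_lines(square))
```

Under these assumptions the triangle is shattered, while the collinear points and the square are not: the collinear case shows that not every set of 3 points is shattered, and the square fails on the XOR-style labeling where opposite corners share a label.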
Knowing the VC dimension for a given classifier is really good: it lets you put an upper bound on the test error of your classification model, given its training error and the amount of training data. In other words, you can know rather definitively how badly your model can overfit the training set. And I think that lets us define a rigorous / discrete criterion for “understandable model” – if you know you can’t overfit by much, you can understand (in a specific sense of that word) the model.
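To make that bound a bit more concrete, here is a small sketch of one commonly quoted form of Vapnik’s bound: with probability at least 1 - eta, test error <= training error + sqrt((h * (ln(2n/h) + 1) - ln(eta/4)) / n), where h is the VC dimension and n the number of training examples. The function name and the example numbers are my own, chosen only for illustration, and the formula is only meaningful when h is well below n.

```python
import math

def vc_confidence_term(h, n, eta=0.05):
    """Gap between training and test error allowed by Vapnik's bound.

    h: VC dimension, n: number of training examples (assumes h < n),
    eta: the bound holds with probability at least 1 - eta.
    """
    return math.sqrt((h * (math.log(2 * n / h) + 1) - math.log(eta / 4)) / n)

# A line in the plane (h = 3) versus a much more powerful classifier (h = 300),
# both trained on 1000 examples.
print(vc_confidence_term(3, 1000))
print(vc_confidence_term(300, 1000))

# More data tightens the bound for the same classifier.
print(vc_confidence_term(3, 100_000))
```

The gap grows with h and shrinks with n, so a high VC dimension on a small dataset gives a bound close to 1, which says almost nothing about test error.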
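The details are more complicated, but can you guess why RBF kernels aren’t good when you need understandability? That’s right: their VC dimension is infinite, so you can’t bound it. Deep neural nets are similar in practice: for a fixed architecture the VC dimension is finite, but it grows with the number of weights and is usually so large that the bound tells you nothing. That means these models have effectively unlimited separating power, and can always overfit your data. This isn’t true for linear and polynomial SVMs, which have bounded (and much smaller) VC dimension.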
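As a rough illustration of that separating power, the sketch below runs a kernel perceptron with an RBF kernel over every labeling of four points arranged in a square, a set a single line cannot shatter. This is a toy stand-in for an RBF SVM, not an SVM itself; the function names and the gamma value are assumptions picked for the example.

```python
import math
from itertools import product

def rbf(a, b, gamma=5.0):
    """RBF kernel between two 2-D points; gamma is an arbitrary choice here."""
    return math.exp(-gamma * ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2))

def rbf_perceptron_fits(points, labels, max_epochs=1000):
    """Kernel perceptron: True if it classifies every point correctly."""
    alpha = [0] * len(points)  # how many times each point triggered an update
    for _ in range(max_epochs):
        mistakes = 0
        for i, (p, y) in enumerate(zip(points, labels)):
            score = sum(a * yj * rbf(q, p)
                        for a, yj, q in zip(alpha, labels, points))
            if y * score <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            return True
    return False

square = [(0, 0), (1, 1), (1, 0), (0, 1)]
fits_every_labeling = all(
    rbf_perceptron_fits(square, labels)
    for labels in product([-1, 1], repeat=len(square))
)
print(fits_every_labeling)
```

All 16 labelings are fit, including the XOR-style one. The same trick works for any set of distinct points, which is the sense in which the RBF kernel has unbounded separating power.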
Practical Matters
“So, are you telling me that you can’t know if my complex model overfits?”
Yes. There’s no theoretical assurance that a given complex model will work in practice. There is no useful shattered-set bound for neural nets. However, from a practical perspective cross-validation does a good job of telling us whether our models work.
In addition, despite neural networks not actually being emulations of brains or neurons in almost any way, we know that human brains overfit massively. All the time. Symptoms of racism and sexism can be attributed to overfitting, and many of our other cognitive biases are fueled by overfitting. And if our goal is to do well computationally what our brains seem to do, this suggests we are on a closer path than we were in the past.
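Here is a minimal sketch of what that check looks like, using only the standard library. A 1-nearest-neighbor classifier (chosen because it can memorize anything) is trained on labels that are pure noise, and its training accuracy is compared with its k-fold cross-validated accuracy. The helper names, the fold count, and the toy data are all assumptions made for the example.

```python
import random

def one_nn_predict(train_x, train_y, x):
    """Return the label of the training point closest to x."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(one_nn_predict(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

def cross_val_accuracy(xs, ys, k=5, seed=0):
    """Average held-out accuracy over k shuffled folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        scores.append(accuracy([xs[i] for i in train], [ys[i] for i in train],
                               [xs[i] for i in fold], [ys[i] for i in fold]))
    return sum(scores) / k

rng = random.Random(0)
xs = [float(i) for i in range(200)]
ys = [rng.choice([0, 1]) for _ in xs]  # labels are noise: nothing to learn

train_acc = accuracy(xs, ys, xs, ys)   # evaluated on the data it memorized
cv_acc = cross_val_accuracy(xs, ys)    # evaluated on held-out folds
print(train_acc, cv_acc)
```

The training accuracy is perfect because every point is its own nearest neighbor, while the cross-validated accuracy should hover around chance. That gap is the practical signal of overfitting that the theory cannot give us for complex models.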
That said, complex models are pretty scary! We could just be overfitting the world, and if that’s the case, it seems honest to ask if deep networks are the path to strong AI (seems like not by themselves). What are we – are we all just automatons overfitting to our input?
Julie Evans’ comment about models we don’t understand is also particularly apt: “a model you don’t understand is only as good as your data.” Whatever biases exist in your data will exist in your bleeding-edge-definitely-overfitting-mega-complex-model. These biases can be pretty awful.
Which Classifier?
Since there’s no useful upper bound on the separating power of RBF SVMs and neural networks, we consider them complex models at risk of overfitting. However, bounded models can overfit too. A classifier with very high VC dimension is at higher risk of overfitting than one with lower VC dimension, since VC dimension gives an upper bound on separating “power.”
A rule of thumb is to choose the lowest-power classifier you can get away with and still have decent accuracy. However, if all your accuracy is coming from overfitting… you may still end up with problems (something every deep learning enthusiast has to contend with).
In practice we find some carefully examined cases that are right for complex models, cases where the lower error outweighs overfitting risk. Cases like these include speech recognition and scene understanding.
What you find in the more common case is smaller amounts of data, less clean data, and a high cost for errors. In these cases other approaches work well. Some interesting bits I learned via practical applications:

Linear and simple polynomial SVMs seem to work really well for basic spam and abuse fighting. I’ve seen that RBF kernels can be really good too, given enough data and a very careful approach. The theory behind SVMs is still more complete than that behind neural nets; a lot of the parameter tuning for neural nets is actually guessing / not based on any rigorous theory.

Random Forests work incredibly well for many datasets for a few reasons:

You don’t need to normalize your data. If you have something scaled from 0-1, and some other signals scaled from 0-10000, it’ll still work

Many datasets are medium-small, and mediocrely labeled. This weakens many of the more complex models, and SVMs too. In practical cases like these Random Forests seem to get the job done better.
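The point about normalization comes from how tree splits work: a split only asks whether a feature is above or below a threshold, so rescaling the feature moves the threshold but not the resulting partition. Here is a small sketch with a single decision stump (one split, the building block of the trees in a forest, not a Random Forest itself). The function name and the toy data are made up for illustration.

```python
def best_stump(xs, ys):
    """Find the threshold split on one feature with the fewest mistakes.

    Returns (errors, partition), where partition records "x <= threshold"
    for every point so that splits can be compared across feature scales.
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = tuple(x <= threshold for x in xs)
        # Predict the majority label on each side and count the mistakes.
        errors = 0
        for side in (True, False):
            labels = [y for y, is_left in zip(ys, left) if is_left == side]
            errors += len(labels) - max(labels.count(0), labels.count(1))
        if best is None or errors < best[0]:
            best = (errors, left)
    return best

ys = [0, 0, 0, 1, 0, 1, 1, 1]
small_scale = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.95]  # roughly 0-1
large_scale = [x * 10000 for x in small_scale]            # roughly 0-10000

print(best_stump(small_scale, ys) == best_stump(large_scale, ys))
```

The stump picks the same partition with the same number of errors at both scales. A distance-based or margin-based model would not have this property, which is part of why SVMs usually want normalized inputs.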
