When data is bad, use simple models

Simple models outperform complex models when data is bad or sparse, because there are fewer parameters to estimate from unreliable observations. For the same reason, simple models work better in VUCA (volatile, uncertain, complex, ambiguous) environments.

From the SFI Podcast Apr 6, 2020, discussing Jurgen Jost and Luu Hoang's paper on making the most of bad data:

When data is really bad, you should use the simplest model at hand. When data is very good, you can use complicated models.

And:

There's been a lot of conversation about this early prognostication that came out of the Imperial College model. It had, in retrospect, wildly overestimated the number of fatalities. And now, of course, that model has been modified so as to reduce that number. One of the problems with that model, philosophically, is it was vastly too complicated given the data that we had. There was a temptation, because of policy, to put everything in. So we're going to put in the number of schools, the age distribution of the households, the number of hospitals, their spatial locations, the position of airports... These are the very complicated agent-based models. These models are absolutely critical when you have really good data, but what Jurgen and Luu Hoang are saying is, what if you don't? Well now what you should do is the paradoxical opposite. Use the simplest model you possibly can, because they're much less sensitive to fluctuations in the data. They don't overfit the data. The last thing you want to do is overfit, or parameterize the model on bad or sparse data.
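A minimal numerical sketch of that point (my own illustration, not from the paper or the podcast), assuming only NumPy: fit a straight line and a high-degree polynomial to a handful of noisy observations, then compare each against the true underlying trend.

```python
# Sketch: simple vs. complex model on sparse, noisy data.
# The degree-1 fit is the "simple" model; degree-9 stands in for an
# over-parameterized one. All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def true_signal(x):
    return 2.0 * x + 1.0                      # the trend we want to recover

n = 12                                        # sparse: only a dozen points
x_train = rng.uniform(0, 1, n)
y_train = true_signal(x_train) + rng.normal(0, 0.5, n)   # bad: noisy outcomes

x_test = np.linspace(0, 1, 200)               # dense grid for evaluation
y_test = true_signal(x_test)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares fit
    pred = np.polyval(coeffs, x_test)
    rmse = np.sqrt(np.mean((pred - y_test) ** 2))
    print(f"degree {degree}: error vs. true trend = {rmse:.2f}")

# Typically the degree-9 fit chases the noise and its error blows up,
# while the straight line stays close to the true trend.
```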

A good rule of thumb for regression models: at least 10 outcomes (events) per predictor variable.
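Applied to hypothetical numbers (not from the source), the heuristic caps model size like this:

```python
# 10-outcomes-per-variable heuristic with made-up figures.
n_outcomes = 80                  # e.g. observed events in the dataset
outcomes_per_variable = 10       # the rule of thumb
max_predictors = n_outcomes // outcomes_per_variable
print(f"With {n_outcomes} outcomes, fit at most ~{max_predictors} predictors.")
```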

Tags

Related

Measure it
Bootstrapping