More Data Isn't Better Data

When it’s warm, people drink more soda. When it rains, they stay home. So a model that knows the weather should predict restaurant sales better than one that doesn’t. I was pretty sure of that. I was wrong.

I didn’t pick the question at random. I tested it for my bachelor’s thesis, and it matters beyond that: at Nightscale, three of us are building a business intelligence platform for the hospitality industry, and demand forecasting is exactly the kind of problem we want to solve. Before you build a feature, you should know whether the idea behind it actually holds up. So I tested it properly.

To do that, I had two models compete against each other. One was only allowed to know the day of the week; the other also got temperature and rain. Both were trained on a little over two years of real sales data from a small restaurant, the same kind of data that Nightscale brings together.

The result was sobering, but in a good way: the model with weather data was worse almost across the board. It missed the most on soda, of all things, the product where the weather connection looked strongest. Adding weather made that forecast 17% worse.

Why? By far the most important factor wasn’t the weather but the day of the week. It explained so much of the variation in sales that there was barely anything left for other variables to explain. The weekly rhythm of when people go out simply mattered far more than whether the sun happened to be shining.

The real lesson: more data isn’t automatically better data. Every additional variable sounds like more knowledge, but it can just as easily clog up a model with noise. Sometimes the lean model that does one thing really well is the smarter choice.

Of course, this is just the analysis for one restaurant. For a different business, such as a beer garden, an ice cream shop, or a beach bar, the weather absolutely can have a huge impact. So the result isn’t some universal truth that holds for everyone. What the thesis really delivers is a blueprint for testing the question: I’ve got Python scripts in place to quickly spin up new models on different datasets and check, business by business, whether a given variable actually pulls its weight.
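
To give a feel for what such a check can look like, here is a minimal sketch. It is not the thesis code. The data is synthetic and built so that only the weekday matters, the two "models" are simple averages standing in for whatever a real study would use, and mean absolute error is just one reasonable way to score a forecast. All function names and numbers are made up for illustration.

```python
import random
from collections import defaultdict
from statistics import mean


def make_synthetic_days(n_days=730, seed=0):
    """Generate made-up daily sales that depend only on the weekday, plus noise."""
    rng = random.Random(seed)
    weekday_level = [40, 42, 45, 50, 80, 95, 60]  # hypothetical Mon..Sun averages
    days = []
    for i in range(n_days):
        weekday = i % 7
        days.append({
            "weekday": weekday,
            "temp": rng.gauss(15, 8),      # weather is drawn independently of sales
            "rain": rng.random() < 0.3,
            "sales": weekday_level[weekday] + rng.gauss(0, 10),
        })
    return days


def fit_weekday_means(train):
    """Lean model: predict the training average for the same weekday."""
    by_day = defaultdict(list)
    for d in train:
        by_day[d["weekday"]].append(d["sales"])
    means = {wd: mean(values) for wd, values in by_day.items()}
    return lambda d: means[d["weekday"]]


def fit_weekday_weather_cells(train):
    """Richer model: average per (weekday, 5-degree temperature band, rain) cell."""
    fallback = fit_weekday_means(train)  # used for cells never seen in training
    cells = defaultdict(list)
    for d in train:
        cells[(d["weekday"], int(d["temp"] // 5), d["rain"])].append(d["sales"])
    means = {key: mean(values) for key, values in cells.items()}

    def predict(d):
        key = (d["weekday"], int(d["temp"] // 5), d["rain"])
        return means[key] if key in means else fallback(d)

    return predict


def mae(predict, rows):
    """Mean absolute error of a prediction function on the given rows."""
    return mean([abs(predict(d) - d["sales"]) for d in rows])


def compare(days, train_share=0.8):
    """Train on the earlier days, score both models on the later ones."""
    cut = int(len(days) * train_share)
    train, test = days[:cut], days[cut:]
    base = mae(fit_weekday_means(train), test)
    rich = mae(fit_weekday_weather_cells(train), test)
    # Positive relative_change means the extra variables made the forecast worse.
    return {
        "weekday_only": base,
        "with_weather": rich,
        "relative_change": (rich - base) / base,
    }


if __name__ == "__main__":
    for name, value in compare(make_synthetic_days()).items():
        print(f"{name}: {value:.3f}")
```

The split is chronological on purpose: a forecast only ever sees the past, so the models are scored on days that come after everything they were trained on. To run the same check for another business, you would swap `make_synthetic_days` for a loader that returns that business's real rows in date order.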

That’s exactly what I’m taking away for Nightscale. A good product isn’t created by dumping in every available data source, but by knowing which of them actually explains something. My gut feeling was wrong, and honestly, that’s the best outcome I could have hoped for. Better to find out in a thesis than in a product.