Market Logic

Is data mining really the problem?

Posted in data mining, finance by mktlogic on August 12, 2009

In the WSJ, Jason Zweig writes,

“The stock market generates such vast quantities of information that, if you plow through enough of it for long enough, you can always find some relationship that appears to generate spectacular returns — by coincidence alone. This sham is known as ‘data mining.’

“Every year, billions of dollars pour into data-mined investing strategies. No one knows if these techniques will work in the real world. Their results are hypothetical — based on “back-testing,” or a simulation of what would have happened if the manager had actually used these techniques in the past, typically without incurring any fees, trading costs or taxes.”

I think I agree with the spirit of what Zweig says, but articles like this always bug me for a handful of reasons.

First, any investing thesis is based either on past data or on no data at all. There are problems with back-testing relative to other ways of using data, but the reliance on past data is not the problem.

Second, using the term “data mining” as a slur for sloppy exploratory data analysis is misleading. Most of what Zweig criticizes isn’t strictly data mining, and in fact his recommended alternatives are closer to actual data mining practice.

What Zweig actually seems to criticize are specification searching and parameter searching, and those really are problems. (What are the odds of not getting a t-stat greater than 2 if you try 50 variations on a model?) But that’s not data mining.

Zweig does recommend some alternatives, but it’s worth mentioning the alternative implied by textbook statistical analysis: come up with an idea, test it under one specification, and let that be the end of it. I doubt that anyone actually does this. What people try to do is come up with an idea and test it under a handful of reasonable-seeming specifications, which has a high risk of devolving into a statistically sloppy specification search. Or you can do real data mining, i.e. exhaustively or nearly exhaustively testing lots of models and using cross-validation and out-of-sample analysis. So the choice is really between testing a handful of specifications and testing lots of them.
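The odds question above is easy to make concrete. Assuming each variation is a null model with an independent 5% chance of producing |t| > 2 by luck, the chance that at least one of 50 variations “works” is 1 − 0.95⁵⁰, roughly 92%. A quick sketch, with a Monte Carlo check:

```python
import random

# Analytic answer: if each of 50 null specifications has a 5% chance of a
# spurious |t| > 2, the probability that at least one looks significant is
# 1 - 0.95**50 (assuming independence across specifications).
analytic = 1 - 0.95 ** 50  # roughly 0.92

# Monte Carlo check: simulate many analysts, each trying 50 null models,
# and count how often at least one model appears to "work".
random.seed(0)
trials = 100_000
hits = sum(
    any(random.random() < 0.05 for _ in range(50))
    for _ in range(trials)
)
simulated = hits / trials

print(f"analytic:  {analytic:.3f}")
print(f"simulated: {simulated:.3f}")
```

The independence assumption is generous (variations on one model are usually correlated), but the point stands: with enough specifications, *failing* to find a significant result would be the surprise.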

Zweig actually recommends the use of out-of-sample analysis and giggle testing, a.k.a. asking “Does this idea make sense?”, but I have no idea why he presents these as alternatives to data mining. Out-of-sample testing is standard practice in data mining. Giggle testing can be used in conjunction with any other approach, but it really just amounts to asking “Is this idea compatible with what I believed yesterday?”
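To illustrate why out-of-sample testing belongs to data mining practice rather than opposing it, here is a toy sketch (all names and parameters are made up for illustration): run a specification search over 50 variants of a trading rule on in-sample data, then score only the in-sample winner on held-out data. The data is a simulated random walk, so any in-sample “edge” is spurious and should tend to evaporate out of sample.

```python
import random

# Simulated daily returns: a pure random walk, so no rule has a real edge.
random.seed(1)
returns = [random.gauss(0, 0.01) for _ in range(2000)]
insample, outsample = returns[:1500], returns[1500:]

def rule_pnl(rets, lookback):
    """Toy momentum rule: go long after a positive trailing sum, else stay flat."""
    pnl = 0.0
    for t in range(lookback, len(rets)):
        if sum(rets[t - lookback:t]) > 0:
            pnl += rets[t]
    return pnl

# Specification search: try 50 lookbacks, keep the in-sample best.
best_lb = max(range(1, 51), key=lambda lb: rule_pnl(insample, lb))
print("best lookback in-sample:", best_lb)
print("in-sample PnL:", round(rule_pnl(insample, best_lb), 4))
print("out-of-sample PnL:", round(rule_pnl(outsample, best_lb), 4))
```

The in-sample number is flattering by construction, since it is the maximum over 50 tries; the out-of-sample number is the honest one. That discipline is part of data mining done properly, not an alternative to it.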

Zweig isn’t all wet. In fact, most of what he criticizes as data mining really is worthy of criticism. It just isn’t data mining.