## Is data mining really the problem?

In the WSJ, Jason Zweig writes,

I think I agree with the spirit of what Zweig says, but articles like this always bug me for a handful of reasons.

First, any investing thesis is either based on past data or based on no data. Backtesting has problems that other ways of using data may not share, but reliance on past data is not itself the problem.

Second, using the term “data mining” as a slur for sloppy exploratory data analysis is misleading. Most of what Zweig criticizes isn’t strictly data mining, and in fact his recommended alternatives are closer to actual data mining practice.

What Zweig actually seems to criticize is specification searching and parameter searching, and those really are problems. (What are the odds of *not* getting a t-stat greater than 2 if you try 50 variations on a model?) But that’s not data mining. Zweig does recommend some alternatives, but it’s worth mentioning the alternative implied by textbook statistical analysis: come up with an idea, test it under one specification, and let that be the end of it. I doubt that anyone actually does this. What people try to do is come up with an idea and test it under a handful of reasonable-seeming specifications, which has a high risk of devolving into a statistically sloppy specification search. Or you can do real data mining, i.e. exhaustively or nearly exhaustively testing lots of models and using cross-validation and out-of-sample analysis. So the choice is really between testing a handful of specifications and testing lots.
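To put a rough number on the parenthetical question above, here is a minimal sketch. It rests on assumptions that are mine rather than Zweig's: the 50 variations are treated as independent tests, the true effect is zero in every one, and the t-statistic is approximated by a standard normal. Real specification searches reuse the same data, so the tests are correlated and the true figure will differ. The function names are hypothetical.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_no_hit(n_specs: int, threshold: float = 2.0, two_sided: bool = True) -> float:
    """Chance that NONE of n_specs independent null tests clears the threshold.

    Assumes independence and a normal approximation to the t-statistic.
    """
    tail = 1.0 - normal_cdf(threshold)          # P(Z > threshold)
    p_hit = 2.0 * tail if two_sided else tail   # count |Z| > threshold if two-sided
    return (1.0 - p_hit) ** n_specs


if __name__ == "__main__":
    print(f"two-sided, 50 specs: {prob_no_hit(50):.3f}")
    print(f"one-sided, 50 specs: {prob_no_hit(50, two_sided=False):.3f}")
```

Under these assumptions the chance of coming up empty after 50 tries is roughly 0.10 if either sign counts as a hit and roughly 0.32 if only positive t-stats count, so a "significant" result is the expected outcome of the search, not evidence for the idea.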

Zweig actually recommends out-of-sample analysis and giggle testing, a.k.a. asking “Does this idea make sense?”, but I have no idea why he presents these as alternatives to data mining. Out-of-sample testing is standard practice in data mining. Giggle testing can be used in conjunction with any other approach, but it really just amounts to asking “Is this idea compatible with what I believed yesterday?”

Zweig isn’t all wet. In fact most of what he criticizes as data mining is really worthy of criticism. It just isn’t data mining.

## An exception to the tendency for actuarial methods to outperform clinical methods?

On page 103 of “The Death of Economics” Paul Ormerod writes:

“In the same way, the macro-economic models are unable to produce forecasts on their own. The proprietors of the models interfere with their output before it is allowed to see the light of day. These ‘judgmental adjustments’ can be, and often are, extensive. Every model builder and model operator knows about the process of altering the output of a model, but this remains something of a twilight world, and is not well documented in the literature. One of the few academics to take an interest is Mike Artis of Manchester University, a former forecaster himself, and his study carried out for the Bank of England in 1982 showed definitively that the forecasting record of models, without such human intervention, would have been distinctly worse than it has been with the help of the adjustments, a finding which has been confirmed by subsequent studies.”

## Recursive justification vs. probabilism

We are like sailors who on the open sea must reconstruct their ship but are never able to start afresh from the bottom. Where a beam is taken away a new one must at once be put there, and for this the rest of the ship is used as support. In this way, by using the old beams and driftwood the ship can be shaped entirely anew, but only by gradual reconstruction.

— Otto Neurath

This sort of recursive justification doesn’t seem to work very well, at least not according to the laws of probability. Consider the simple case where A and B are used to justify C, B and C are used to justify A, and A and C are used to justify B.

Since C is derived from A and B, the probability we assign to C cannot exceed the greater of P(A) and P(B), and should in fact be less, e.g. to account for the possibility that we’ve reasoned incorrectly and mistakenly concluded that A and B imply C. The same applies to each of the other claims and to whatever other claims serve as their basis.

According to a thoroughgoing probabilist, all beliefs are subject to review, so the probability of each must be less than 1. How might this actually work out? What probabilities could we assign that satisfy some basic rules of probability so that

P(A) < max[P(B),P(C)] < 1

P(B) < max[P(C),P(A)] < 1

P(C) < max[P(A),P(B)] < 1

and correspond to believing A, B and C and yet still leaving open the possibility of revision along Bayesian lines or something similar so that

m < P(A) < 1

m < P(B) < 1

m < P(C) < 1

where m is the minimum sufficient degree of belief such that assigning P(X) = m is equivalent to believing that X?

As it happens, this is a problem without a solution. That is, if our confidence in each of our beliefs comes only from our ability to derive it from some subset of our other beliefs, then any level of confidence is unreasonable according to the laws of probability. Why? Because if we hold N beliefs, recursive justification implies that our degree of confidence in any belief can be no greater than (and, allowing for mistaken inference, must be less than) our degree of confidence in its most strongly held supporting belief, and so on around the circle, implying some clearly impossible relation along the lines of

P(Belief 1) < P(Belief 2) < P(Belief 3) < … < P(Belief N) < P(Belief 1).
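For the three-belief case, the impossibility can be checked directly. Whichever of P(A), P(B) and P(C) is largest (or tied for largest) cannot be strictly less than the maximum of the other two, so one of the three inequalities always fails. The sketch below simply confirms this by brute force over a grid. The function names, the grid resolution and the default threshold m = 0.5 are illustrative choices of mine, not part of the argument.

```python
from itertools import product


def satisfies_circular_constraints(p_a: float, p_b: float, p_c: float, m: float) -> bool:
    """True if each probability exceeds m and is strictly below the max of the other two."""
    return (
        m < p_a < max(p_b, p_c) < 1
        and m < p_b < max(p_c, p_a) < 1
        and m < p_c < max(p_a, p_b) < 1
    )


def search(steps: int = 50, m: float = 0.5) -> list:
    """Return every grid triple (P(A), P(B), P(C)) that meets all the constraints."""
    grid = [i / steps for i in range(steps + 1)]
    return [
        triple
        for triple in product(grid, repeat=3)
        if satisfies_circular_constraints(*triple, m)
    ]


if __name__ == "__main__":
    print(len(search()))  # number of admissible triples found on the grid
```

A grid search is not a proof, but the list comes back empty for any resolution and any m, for the reason given above: the largest value has nothing larger to sit beneath.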

One attempt at rehabilitation of recursive justification might be to suppose that each belief can be justified by multiple non-intersecting subsets of other beliefs. For example, A is implied by (B,C) and also by (D,E). For the moment, assume away the possibility of mistaken inference to A, so that P(A) = P((B,C) or (D,E)). Could each belief in the set {A,B,C,D,E} be justified from the other beliefs in the set in a way which is consistent with the laws of probability theory? (Whether or not any actual human holds actual beliefs in such a relation is another matter.)
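As a starting point, and only that, the probability of the disjunction can be expanded by inclusion-exclusion. Here (B,C) is read as the conjunction of B and C, which is my reading of the notation above:

```latex
\begin{align*}
P(A) &= P\big((B \wedge C) \vee (D \wedge E)\big) \\
     &= P(B \wedge C) + P(D \wedge E) - P(B \wedge C \wedge D \wedge E) \\
     &\ge \max\{\, P(B \wedge C),\; P(D \wedge E) \,\}
\end{align*}
```

So with two independent routes of support, P(A) is at least as large as the probability of either supporting conjunction, and larger whenever one conjunction can hold without the other. This removes the ceiling that drove the earlier contradiction, but it does not by itself show that all five beliefs can be mutually justified in this way.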

To do: Find numerical values for P(A), P(B), P(C), P(D), and P(E) that allow each belief to be justified by the others, and that satisfy the laws of probability, or show that no such set of values exists.
