Guide

Data Mining in the Real World

“I’m not really opposed to data mining. I believe in it. After all, it’s my career. But data mining in the real world is a lot different from the way it’s described in textbooks, for many reasons.

“One is that the data are always dirty, with missing values, values way out of the range of possibility, and time values that make no sense. Here’s an example: Somebody sets the server system clock incorrectly and runs the server for a while with the wrong time. When they notice the mistake, they set the clock to the correct time. But all of the transactions that were running during that interval have an ending time before the starting time. When we run the data analysis, and compute elapsed time, the results are negative for those transactions.

“Missing values are a similar problem. Consider the records of just 10 purchases. Suppose that two of the records are missing the customer number, and one is missing the year part of the transaction date. So you throw out three records, which is 30 percent of the data. You then notice that two more records have dirty data, and so you throw them out, too. Now you’ve lost half your data.

“Another problem is that you know the least when you start the study. So you work for a few months and learn that if you had another variable (say, the customer’s ZIP code, or age, or something else) you could do a much better analysis. But those other data just aren’t available. Or maybe they are available, but to get the data you have to reprocess millions of transactions, and you don’t have the time or budget to do that.

“Overfitting is another problem, a huge one. I can build a model to fit any set of data you have. Give me 100 data points and in a few minutes, I can give you 100 different equations that will predict those 100 data points. With neural networks, you can create a model of any level of complexity you want, except that none of those equations will predict new cases with any accuracy at all. When using neural nets, you have to be very careful not to overfit the data.

“Then, too, data mining is about probabilities, not certainty. Bad luck happens. Say I build a model that predicts the probability that a customer will make a purchase. Using the model on new customer data, I find three customers who have a .7 probability of buying something. That’s a good number, well over a 50–50 chance, but it’s still possible that none of them will buy. In fact, the probability that none of them will buy is .3 × .3 × .3, or .027, which is 2.7 percent.
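The dirty-data and missing-value points in the guide can be illustrated with a small sketch. Everything in it is hypothetical: the ten records, the field names (`customer`, `year`, `start`, `end`), and the choice to store times as seconds are assumptions made only to reproduce the 30 percent and 50 percent figures from the text.

```python
# Ten made-up purchase records; field names and values are illustrative only.
# "start" and "end" are transaction times in seconds.
records = [
    {"customer": "C01", "year": 2014, "start": 100, "end": 160},
    {"customer": None,  "year": 2014, "start": 200, "end": 230},  # missing customer
    {"customer": "C03", "year": 2014, "start": 300, "end": 340},
    {"customer": "C04", "year": None, "start": 400, "end": 450},  # missing year
    {"customer": "C05", "year": 2014, "start": 900, "end": 520},  # end before start
    {"customer": None,  "year": 2014, "start": 600, "end": 610},  # missing customer
    {"customer": "C07", "year": 2014, "start": 700, "end": 790},
    {"customer": "C08", "year": 2014, "start": 950, "end": 810},  # end before start
    {"customer": "C09", "year": 2014, "start": 820, "end": 880},
    {"customer": "C10", "year": 2014, "start": 890, "end": 940},
]


def is_complete(record):
    """True when neither the customer number nor the year is missing."""
    return record["customer"] is not None and record["year"] is not None


def has_valid_times(record):
    """True when the transaction does not end before it starts."""
    return record["end"] >= record["start"]


incomplete = [r for r in records if not is_complete(r)]
complete = [r for r in records if is_complete(r)]
dirty = [r for r in complete if not has_valid_times(r)]
clean = [r for r in complete if has_valid_times(r)]

# Elapsed times for the dirty records come out negative, as in the clock example.
negative_elapsed = [r["end"] - r["start"] for r in dirty]

print(f"dropped for missing values: {len(incomplete)} of {len(records)}")
print(f"dropped for impossible times: {len(dirty)} of {len(records)}")
print(f"records left to analyze: {len(clean)} of {len(records)}")
```

Discarding records is only one way to handle such cases, and it is the one the guide describes. Whether to drop, repair, or impute a record depends on the study.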
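The overfitting warning can be made concrete with a toy case. The sketch below swaps the neural networks mentioned in the guide for plain polynomial interpolation, and uses 10 made-up points instead of 100. The data (a straight-line trend with alternating noise of plus or minus 1) and all function names are assumptions chosen for illustration. A degree-9 polynomial reproduces every training point exactly, yet its prediction for the next case is far from the trend, while a simple straight-line fit stays close.

```python
from fractions import Fraction

# Made-up training data: the trend y = x plus alternating noise of +1 / -1.
xs = list(range(10))
ys = [x + (1 if x % 2 == 0 else -1) for x in xs]


def interpolate(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x
    (Lagrange form, exact rational arithmetic)."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = Fraction(yi)
        for j, xj in enumerate(xs):
            if j != i:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    n = len(xs)
    mean_x = Fraction(sum(xs), n)
    mean_y = Fraction(sum(ys), n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# The complex model "predicts" every training point perfectly...
fits_training_data = all(interpolate(xs, ys, x) == y for x, y in zip(xs, ys))

# ...but for a new case at x = 10 (trend value 10) it is wildly off.
new_x = 10
complex_prediction = interpolate(xs, ys, new_x)
slope, intercept = fit_line(xs, ys)
simple_prediction = slope * new_x + intercept

print("exact fit on training data:", fits_training_data)
print("degree-9 polynomial at x=10:", float(complex_prediction))
print("straight line at x=10:", float(simple_prediction))
```

In practice, the usual guard is to hold back some data the model never sees during fitting and judge the model on that data, not on the points it was fitted to.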
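The arithmetic behind the 2.7 percent figure can be checked in a few lines. The sketch assumes, as the multiplication in the text implicitly does, that the three customers decide independently of one another. The variable names are illustrative.

```python
p_buy = 0.7        # model's predicted purchase probability for each customer
n_customers = 3

# Assuming independent decisions, multiply the individual "no purchase" chances.
p_none_buy = (1 - p_buy) ** n_customers
p_at_least_one_buys = 1 - p_none_buy

print(f"P(none buy)         = {p_none_buy:.3f}")
print(f"P(at least one buys) = {p_at_least_one_buys:.3f}")
```

If the customers' decisions are correlated (for example, all three respond to the same promotion), the chance of a complete miss can differ from this figure.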
Source: Maksim Kabakou/Fotolia
