
Guide





Data Mining in the Real World








“I’m not really opposed to data mining. I believe in it. After all, it’s my career. But data mining in the real world is a lot different from the way it’s described in textbooks, for many reasons.

“One is that the data are always dirty, with missing values, values way out of the range of possibility, and time values that make no sense. Here’s an example: Somebody sets the server system clock incorrectly and runs the server for a while with the wrong time. When they notice the mistake, they set the clock to the correct time. But all of the transactions that were running during that interval have an ending time before the starting time. When we run the data analysis and compute elapsed time, the results are negative for those transactions.

“Missing values are a similar problem. Consider the records of just 10 purchases. Suppose that two of the records are missing the customer number, and one is missing the year part of the transaction date. So you throw out three records, which is 30 percent of the data. You then notice that two more records have dirty data, and so you throw them out, too. Now you’ve lost half your data.

“Another problem is that you know the least when you start the study. So you work for a few months and learn that if you had another variable (say, the customer’s ZIP code, or age, or something else) you could do a much better analysis. But those other data just aren’t available. Or maybe they are available, but to get the data you have to reprocess millions of transactions, and you don’t have the time or budget to do that.

“Overfitting is another problem, a huge one. I can build a model to fit any set of data you have. Give me 100 data points and in a few minutes, I can give you 100 different equations that will predict those 100 data points. With neural networks, you can create a model of any level of complexity you want, except that none of those equations will predict new cases with any accuracy at all. When using neural nets, you have to be very careful not to overfit the data.

“Then, too, data mining is about probabilities, not certainty. Bad luck happens. Say I build a model that predicts the probability that a customer will make a purchase. Using the model on new customer data, I find three customers who have a .7 probability of buying something. That’s a good number, well over a 50–50 chance, but it’s still possible that none of them will buy. In fact, the probability that none of them will buy is .3 × .3 × .3, or .027, which is 2.7 percent.
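The "bad luck" arithmetic in the three-customer example can be checked with a short Python sketch. It assumes the three purchase decisions are independent, which the speaker's multiplication implies but does not state. The function names `prob_none_buy` and `prob_at_least_one_buys` are illustrative, not from the text.

```python
from math import prod

def prob_none_buy(purchase_probs):
    """Probability that no customer buys.

    Assumption (not stated in the text): each customer's decision is
    independent, so the individual "no purchase" probabilities multiply.
    """
    return prod(1.0 - p for p in purchase_probs)

def prob_at_least_one_buys(purchase_probs):
    """Complement of the event that nobody buys."""
    return 1.0 - prob_none_buy(purchase_probs)

# Three customers, each with a .7 predicted probability of buying.
three_customers = [0.7, 0.7, 0.7]
p_none = prob_none_buy(three_customers)
p_some = prob_at_least_one_buys(three_customers)

print(f"P(no purchases)       = {p_none:.3f}")
print(f"P(at least one buys)  = {p_some:.3f}")
```

Under the independence assumption, the first value matches the .027 quoted in the passage. If the customers' decisions were correlated, the product rule would not apply as written.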
Source: Maksim Kabakou/Fotolia
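The record-loss arithmetic from the dirty-data and missing-values discussion (10 purchases, three discarded for missing fields, two more for dirty data) can be sketched as a simple filter. The record layout, the field names `customer_id`, `start`, and `end`, and the timestamps are all hypothetical. A missing `start` stands in for the incomplete transaction date, and an ending time before the starting time stands in for the clock-reset problem.

```python
from datetime import datetime, timedelta

def is_usable(record):
    """Keep a record only if it has a customer number and a sane time span."""
    if record["customer_id"] is None:
        return False
    start, end = record["start"], record["end"]
    if start is None or end is None:
        return False
    # A negative elapsed time signals a clock problem like the one described.
    return end >= start

# Ten hypothetical purchase records, each lasting five minutes.
base = datetime(2024, 1, 1, 9, 0)  # arbitrary illustrative timestamp
records = []
for i in range(10):
    start = base + timedelta(hours=i)
    records.append({
        "customer_id": f"C{i}",
        "start": start,
        "end": start + timedelta(minutes=5),
    })

# Two records are missing the customer number.
records[0]["customer_id"] = None
records[1]["customer_id"] = None
# One record has an incomplete (here: missing) transaction date.
records[2]["start"] = None
# Two records end before they start (server clock set incorrectly).
for i in (3, 4):
    records[i]["end"] = records[i]["start"] - timedelta(minutes=30)

kept = [r for r in records if is_usable(r)]
lost_fraction = 1 - len(kept) / len(records)
print(f"kept {len(kept)} of {len(records)} records; lost {lost_fraction:.0%}")
```

Dropping every imperfect record is the approach described in the passage. Other choices, such as imputing missing fields or correcting timestamps, are possible but are not covered here.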
