June 9, 2008

Data Mining Comes Of Age: Overcoming The Myths And Misconceptions

"Data mining" is one of those mysterious phrases with many accumulated meanings. The term first gained widespread use about 1990, but only after a lot of activity without an underlying discipline-like statistics-to back it.

Because practitioners were "doing" without thinking much about underlying principles, this gave rise to some overly simple concepts. These in turn led to damaging myths, and tarnished the reputation of the activity. Among many serious analysts, "data mining" still can have all the appeal of rare roast beef at a vegetarian convention.

Let's take a look at some harmful myths that arose, and the emerging realities.

Myth: Data mining is just poking around
Reality: The opposite

Data mining needs a systematic approach. It's true that many people will call random flailing at data (in hopes of finding a magic "something") data mining. But by the mid-1990s, more serious practitioners realized this would never work.

Experts in data mining formed a consortium, called CRISP-DM, that explained the approaches data mining requires. Above all, it needs domain knowledge, for two reasons. First, data never arrives in pristine condition, so somebody needs to know what makes sense and what needs cleaning. Second, somebody needs to vet the analyses to see if they fit with what experts know about the field. An analyst without domain knowledge may find what seems an "exciting insight," but to an expert this may be completely self-evident, or a non-starter because (for instance) the factor is known to be a symptom rather than a cause.

As a trivial example, let's use the well-documented fact that, in grade school, children with larger shoe sizes score higher on standardized math tests. From this, can we then conclude that math skills reside in the feet? No. What we should conclude is that children with larger shoe sizes tend to be older and in higher grades. Once we see the explanation, we easily understand that the shoe size/math relationship really shows two factors that both depend on a third (grade in school), not one factor that causes another.

Unfortunately, a database will never point out that we are omitting a factor that explains why several others are related, and in most cases the omission is harder to see than in the simple example above.
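The shoe size example can be made concrete with a small simulation. This is only a sketch under invented assumptions: the grade range, the averages, and the noise levels are made up for illustration and are not real test data. Shoe size and math score are each generated from grade alone, yet they correlate strongly until we look within a single grade.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate_children(n_per_grade=500, seed=0):
    """Hypothetical children: shoe size and score both depend on grade only."""
    rng = random.Random(seed)
    rows = []
    for grade in range(1, 7):
        for _ in range(n_per_grade):
            shoe = 10 + grade + rng.gauss(0, 1)        # invented scale
            score = 40 + 8 * grade + rng.gauss(0, 5)   # invented scale
            rows.append((grade, shoe, score))
    return rows

rows = simulate_children()

# Correlation across all grades pooled together.
overall = pearson([r[1] for r in rows], [r[2] for r in rows])

# Correlation inside each grade, where the hidden factor is held fixed.
within = {}
for grade in range(1, 7):
    subset = [r for r in rows if r[0] == grade]
    within[grade] = pearson([r[1] for r in subset], [r[2] for r in subset])

print("all grades pooled:", round(overall, 2))
for grade, r in within.items():
    print("grade", grade, "only:", round(r, 2))
```

The pooled correlation is strong, while the within-grade correlations hover near zero. The catch, as noted above, is that we could only run this check because we already knew grade was the factor to hold fixed.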

Enter the CRISP-DM approach
The approach this consortium elucidated appears in the diagram below. It shows how "business understanding" leads the process. This, in turn, informs "data understanding," which must be used in data preparation. Only then can we start with modeling. Modeling is not the end in itself: we must also have evaluation and "deployment" (using the findings).

The arrows surrounding the diagram represent a truth won with hard experience: the process is iterative, or more to the point, it rarely works well on the first pass.

Also, arrows point back and forth between data preparation and data modeling, another hard-won lesson. There is nothing like trying to make a working model to uncover areas where the data is weak or needs more preparation.

Some procedures, in fact, almost leap at irregularities (or noise) in the data, and proclaim those to be the most important predictors. This is far from a weakness; these procedures can help clean up subtle flaws. All seasoned data miners have a few they can rely on in this way.

Finally, in the diagram, evaluation points back to business understanding. It is only when domain knowledge supports the conclusions that we can move forward to the final step, actually using the model.

Myth: Data mining is awfully complex
Reality: Complexity should be weighed against usefulness

Data mining requires a lot of hard work. But that does not also mean the models generated need to be highly complex. In many cases, simple models can capture everything that's practically useful.

This insight has been public since 1993 (Holte, 1993). However, you would never know it when looking at many attempts at data mining, where even diagrams explaining the processes used are masterpieces of confusing over-complication.

Beyond this, no matter how careful we are, highly complex models are prone to failure in the real world. As models get more complicated, chances rise that they are modeling bumps and crannies peculiar to the data set at hand, and so will fit the outside world less well than simpler approaches.
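A minimal sketch of this effect, using invented data: the underlying relationship is a straight line plus noise. A simple fitted line is compared with a deliberately over-complex "model" that memorizes every training case and answers with the nearest one. The data-generating rule, the noise level, and the sample sizes are all assumptions chosen for illustration.

```python
import random

def make_data(n, rng):
    """Hypothetical data: y is linear in x, plus random noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    return [(x, 2 * x + rng.gauss(0, 3)) for x in xs]

def fit_line(train):
    """Simple model: ordinary least-squares straight line."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxy = sum((x - mx) * (y - my) for x, y in train)
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def memorize(train):
    """Over-complex model: return the y of the closest training case."""
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared prediction error."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(42)
train = make_data(500, rng)   # the data set at hand
test = make_data(500, rng)    # stands in for the outside world

line = fit_line(train)
memorizer = memorize(train)

print("line      train/test error:", round(mse(line, train), 1), round(mse(line, test), 1))
print("memorizer train/test error:", round(mse(memorizer, train), 1), round(mse(memorizer, test), 1))
```

The memorizer looks perfect on the data it was built from, because it has captured every bump and cranny, and then does worse than the straight line on fresh cases drawn from the same source.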

Some modeling methods are indeed complicated, and some only a machine can truly understand, but these have little usefulness if they cannot produce diagnostic output showing that they are at least pointing in the right direction.

Myth: You have to look at mountains of data
Reality: Too much can be too much

Here's an example: The client has a million customers and a 20% annual attrition ("churn") rate, adding up to many terabytes of data. Do we need to plot graphs and build models using all million, or even 500,000? Consider the following questions, with answers from domain experts:

Q: How many different "churn profiles" do we expect to find?
A: No more than ten.
Q: What is the largest number of examples of each profile we need?
A: Maybe a thousand.

Therefore, a sample of 10,000-20,000 churners and proportionally as many non-churners likely will suffice for this analysis. The terabytes have disappeared! (Khabaza, 2005)
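The arithmetic behind this estimate can be sketched as follows. The function name, its parameters, and the safety factor are hypothetical. The code also adopts one reading of "proportionally as many non-churners," namely that the sample keeps the same churner to non-churner ratio as the customer base; that reading is an assumption.

```python
def churn_sample_plan(customers, churn_rate, profiles, examples_per_profile,
                      safety_factor=1.0):
    """Rough sample-size plan built from the domain experts' answers."""
    churners_available = round(customers * churn_rate)
    churners_needed = round(profiles * examples_per_profile * safety_factor)
    churners_sampled = min(churners_needed, churners_available)
    # Assumption: keep the population's churner/non-churner ratio.
    non_churners_sampled = round(churners_sampled * (1 - churn_rate) / churn_rate)
    total = churners_sampled + non_churners_sampled
    return {
        "churners_available": churners_available,
        "churners_sampled": churners_sampled,
        "non_churners_sampled": non_churners_sampled,
        "total_sampled": total,
        "share_of_customers": total / customers,
    }

# Figures from the example: 1,000,000 customers, 20% churn,
# at most 10 profiles, about 1,000 examples of each.
plan = churn_sample_plan(1_000_000, 0.20, 10, 1_000)
cautious_plan = churn_sample_plan(1_000_000, 0.20, 10, 1_000, safety_factor=2.0)

print(plan)
print(cautious_plan)
```

Under these assumptions the lower end of the range uses 10,000 of the 200,000 available churners, and even the doubled version touches only a tenth of the customer base.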

Too much definitely can hurt. Truly massive data sets can take many hours to run, if they don't simply break down ordinary computers. Unless you are Google, just getting the equipment and software to handle terabytes of data can be prohibitively expensive. And unless you are Google, samples in the "mere" hundreds of thousands or fewer (which, by the way, were considered unthinkably large until recently) often answer all the important questions.

Myth: Data mining should be automatic
Reality: Automation works only in specific circumstances

If you are the phone company, and have experts with tons of experience who have identified all possible patterns of fraudulent phone use, then you can train a machine to seek out those patterns automatically.

In nearly all cases we are not so lucky. Rarely is the required domain knowledge this narrow and well defined, and the data so neatly at hand, that easy preprogramming is possible. We typically get inconsistent, sloppy data, need to make a lot of decisions about what values in it really mean, need to handle missing data, need to know where to look, and then determine how much it all makes sense. All of these require heavy human intervention.

Myth: Data mining is all about advanced algorithms
Reality: Methods really have improved, but still are not enough by themselves

If for some reason you ever found yourself at a conference about data mining, you would imagine that data mining is all about new computer algorithms. It's true that new approaches sometimes can squeeze a little extra from data sets, and at times even solve once intractable problems, but they are only part of what's needed.

New and interesting methods in fact are arriving at a remarkable rate. The pace of innovation itself may be part of the problem: it's hard to keep up with all the advances. We will give a brief nod to a few of the shiny new approaches and methods later, since no self-respecting paper on this topic could entirely avoid them.

Having so many novel, captivating toys for the analyst does not pose that much difficulty. As one writer puts it: "The problem occurs when data miners focus…on the algorithms and ignore the other 90-95% of the data mining process" (Khabaza, 2005).

What's intriguing in new methods
Over 100 new and often useful methods are now widely available that did not exist in 2000. This is a remarkable torrent of innovation after many years of promises that usually delivered little. Data mining is now intermingled with the field of "machine learning." This discipline encompasses approaches that can become mind-bending for nearly anybody. We will just glimpse a few broader concepts.

Perhaps one of the most amazing recent findings is that a number of relatively weak models can be combined into one "average" model with results that typically are better than any of the models that went into the average. Results based on averaging or "voting" even can improve with more to average or more votes.
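A back-of-the-envelope calculation shows why voting can help. It rests on a strong assumption that real models rarely satisfy fully: each weak model is right 60% of the time and the models err independently of one another. The 60% figure and the function name are invented for illustration.

```python
from math import comb

def majority_vote_accuracy(n_voters, p_correct):
    """Chance that a strict majority of n independent voters is right.

    Assumes an odd number of voters (no ties) and fully independent errors.
    """
    needed = n_voters // 2 + 1
    return sum(
        comb(n_voters, k) * p_correct ** k * (1 - p_correct) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )

voter_counts = (1, 5, 25, 101)
accuracies = [majority_vote_accuracy(n, 0.6) for n in voter_counts]

for n, acc in zip(voter_counts, accuracies):
    print(n, "voters:", round(acc, 3))
```

Accuracy climbs as votes are added. In practice the gain is smaller than this idealized figure, because models built from the same data tend to make some of the same mistakes.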

Some approaches are designed to take many "passes" through the data. At each pass, something in the data or the analysis changes randomly. This might involve drawing a new sample or trying new variables as predictors.

It turns out that just evaluating many different "cuts" of the data greatly reduces the effects of any anomalous values on the results.
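One simple way to see this, sketched on invented data: fit a straight line to 50 points, one of which is badly wrong, and compare a single fit on everything with the median of fits on many small random subsamples. Summarizing the passes with a median is a choice made here for illustration; the methods described above combine their passes in various ways.

```python
import random
from statistics import median

def ols_slope(points):
    """Least-squares slope of y on x."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    return sxy / sxx

rng = random.Random(1)

# Hypothetical data: true slope is 3, with modest noise.
points = [(x, 3 * x + rng.gauss(0, 1)) for x in range(50)]

# Plant one anomalous value at the last point.
points[-1] = (49, points[-1][1] + 500)

# A single pass over all the data.
full_slope = ols_slope(points)

# Many passes, each on a different random "cut" of 10 points.
subsample_slopes = [ols_slope(rng.sample(points, 10)) for _ in range(500)]
median_slope = median(subsample_slopes)

print("single fit on all data:", round(full_slope, 2))
print("median over 500 cuts: ", round(median_slope, 2))
```

Most of the small cuts never include the bad point, so the typical result stays close to the true slope of 3, while the single fit on all the data is pulled well away from it.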

Some approaches go even further and have each pass through the data "learn" from the earlier passes. For instance, if we are trying to identify different groups of customers, each new run can focus on the people who were incorrectly identified in the last iteration. This often works quite well in predicting the right group for as many individuals as possible.
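The idea of each pass focusing on earlier mistakes can be sketched with a bare-bones version of AdaBoost, one well-known method of this kind, chosen here as an illustration rather than as the specific method the passage has in mind. The weak models are one-cut rules ("stumps") on a single variable, and the ten-point data set is invented.

```python
import math

def stump_predict(x, threshold, sign):
    """One-cut rule: predict `sign` above the threshold, `-sign` otherwise."""
    return sign if x > threshold else -sign

def best_stump(xs, ys, weights):
    """Find the one-cut rule with the lowest weighted error."""
    best = None
    thresholds = [min(xs) - 0.5] + [x + 0.5 for x in sorted(set(xs))]
    for t in thresholds:
        for s in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, s) != y)
            if best is None or err < best[0]:
                best = (err, t, s)
    return best

def boost(xs, ys, rounds):
    """Each round re-weights the cases so earlier mistakes count for more."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, t, s = best_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)      # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # this rule's voting weight
        ensemble.append((alpha, t, s))
        # Misclassified cases gain weight; correctly classified ones lose it.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, s))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def classify(ensemble, x):
    """Weighted vote of all the rules collected so far."""
    score = sum(alpha * stump_predict(x, t, s) for alpha, t, s in ensemble)
    return 1 if score > 0 else -1

def error_rate(ensemble, xs, ys):
    wrong = sum(1 for x, y in zip(xs, ys) if classify(ensemble, x) != y)
    return wrong / len(xs)

# Hypothetical customers: group +1 sits in a middle band, so no single
# cut can separate the two groups.
xs = list(range(10))
ys = [1 if 3 <= x <= 6 else -1 for x in xs]

for rounds in (1, 2, 3):
    print(rounds, "round(s):", error_rate(boost(xs, ys, rounds), xs, ys))
```

On this toy data a single rule misclassifies 3 of the 10 cases, and after three rounds the weighted vote places all 10 in the right group. Real data is noisier, and too many rounds can start chasing the noise.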

Many new methods are emerging for looking at similarities and patterns. In some of them, we can see influences from information theory: for instance, balancing the effort needed to describe more details against the gains in accuracy from the added description. This is an important concept with very large samples. As you get to hundreds of thousands, or millions, of cases, even minuscule differences can appear immensely significant, and so we need methods like these, which go beyond traditional statistical tests to determine what truly matters.
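The sample-size effect is easy to demonstrate with a textbook two-proportion z-test. The rates and group sizes below are invented: two groups differ by a fifth of a percentage point, a gap that would rarely matter in practice.

```python
import math

def two_proportion_p_value(p1, p2, n_per_group):
    """Two-sided p-value for a difference between two observed proportions.

    Uses the normal approximation with equal group sizes.
    """
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(p1 - p2) / se
    return math.erfc(z / math.sqrt(2))

# Same tiny difference (50.0% vs. 50.2%), very different sample sizes.
p_small = two_proportion_p_value(0.500, 0.502, 1_000)
p_large = two_proportion_p_value(0.500, 0.502, 1_000_000)

print("1,000 per group:     p =", round(p_small, 3))
print("1,000,000 per group: p =", round(p_large, 4))
```

With a thousand cases per group the difference is nowhere near significant; with a million per group it passes the conventional 1% threshold, although its practical size has not changed at all.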

Of course, the statistician side of your author would truly enjoy lingering to discuss the many specific methods that seem intriguing, with all the highly technical details that go into them. We hope you won't be too disappointed if we skip this.

Summary: What we learned that makes data mining work
Data mining can be highly useful, but it requires domain knowledge, plans, discipline, and plenty of effort to work. It needs to be treated as the opposite of unstructured fishing around in the data. Experts in the field have formalized an excellent set of procedures; these should be applied to every data mining project.

We should never expect to throw "a bunch of stuff" into a hopper and get results. Data preparation and cleaning typically are major components of any project.

Data mining does not need to be complex. Often we will capture everything of practical value with relatively simple models.

Data mining does not need to use massive data sets. Careful definition of the problem, and application of domain knowledge, can allow us to use samples and get highly useful results.

There is plenty that is both new and interesting in the field, but data mining is about far more than algorithms and methods. We need models that make sense, and that can be applied correctly, in the light of domain expertise. Of course, new methods often can help get more from data sets. And if these are applied as part of the entire data mining process, they can produce really strong results.

References

Oates, T. and D. Jensen. (1998) Large datasets lead to overly complex models: An explanation and a solution. Proceedings of The Fourth International Conference on Knowledge Discovery and Data Mining, pp. 294-298.

Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11, 1 (April 1993), pp. 63-90.

Jensen, D. (2000). Data snooping, dredging and fishing: the dark side of data mining: a SIGKDD99 panel report. ACM SIGKDD Explorations, Vol. 1, 2, pp. 52-56 (January 2000).

The land mines of data mining, on http://www.praxagora.com

Khabaza, T., (2005) Hard hats for data miners, DM Direct Special Report (4/3/2005 edition)

Leamer, E. (1978) Specification Searches: Ad Hoc Inference with Nonexperimental Data. Wiley.

Steven Struhl
Steven is Director of Marketing Sciences at Maritz Research, specializing in health care. He has over 20 years' experience in market research, multivariate analysis and consulting. His recent work experience includes a position as Senior Vice President, Senior Methodologist at Harris Interactive (and Total Research Corporation before its merger with Harris). Earlier experience includes working as Director of Market Research at SPSS, Inc., where he guided development of new statistical software. He also has held senior positions in financial services, advertising, and consulting.

He has done extensive hands-on work in many areas of market modeling and statistical analysis, including: discrete choice modeling and conjoint analysis, price sensitivity and elasticity modeling, market segmentation and definition, machine learning, best-prospect or best-customer identification models, and graphical display of complex data. He has designed and analyzed several hundred discrete choice and conjoint studies, and has extensive experience developing market simulator programs.

He has written a book, Market Segmentation, and many articles on multivariate analysis, computer software, and psychology. He also speaks frequently at conventions, and has given many seminars on market segmentation, pricing and choice modeling, in addition to teaching graduate courses in statistical methods and data analysis.

He holds an MBA from the University of Chicago, a doctorate in psychology from the Chicago School of Professional Psychology, and MA and BA degrees from Boston University.


