Keys to the Commercial Success of Data Mining
A workshop held in conjunction
August 31, 1998
New York City
Kurt Thearling, Exchange Applications
Roger Stein, Moody's
Some thoughts on the current state of data mining software applications
Director of Analytics
89 South Street
Boston, MA 02111
As a former developer of data mining software, I can understand how difficult it is to create applications that are relevant to business users. Much of the data mining community comes from an academic background and has focused on the algorithms buried deep in the bowels of the technology. But algorithms are not what business users care about. Over the past few years the technology of data mining has moved from the research lab to Fortune 500 companies, requiring a significant change in focus. The core algorithms are now a small part of the overall application, being perhaps 10% of a larger part, which itself is only 10% of the whole.
That being said, the focus of this article is to point out some areas in the remaining 99% that need to be improved upon. Here's my current top ten list:
Kurt Thearling is Director of Analytics at Exchange Applications, a Boston-based database marketing company, where he directs the use of data mining and visualization technology in EA's database marketing software and consulting practice. Over the past decade he has developed a number of commercial data mining software products, including Thinking Machines' Darwin and Pilot Software's Discovery Server. He is also an independent consultant in areas related to data mining and decision support technologies. His data mining web page can be found at http://www.thearling.com.
Interview with D S * (November 4, 1997)
Roger M. Stein
Vice President, Senior Credit Officer
Quantitative Analytics and Knowledge Based Systems
Moody's Investors Service
99 Church Street
New York, NY 10007
D S * : What are the most common serious mistakes made by CIOs implementing data mining and knowledge discovery technologies?
STEIN: "At the CIO level, it's hard to characterize something as a mistake after only a short period of time, which is typically how long most firms have been undertaking data mining programs. This is particularly so given the amount of infrastructure development and learning that must sometimes take place. Having said that, certain patterns of behavior seem to emerge in many data mining projects.
"There is, first, a tendency to focus closely on the tools, getting excited about one or another, as opposed to really looking at how the business problems are structured. Yet it is the structure of the problems that allows the solutions to fall out. You can use any tool to solve almost any problem, if you're willing to work hard enough on it. It then becomes a question of whether you're doing things efficiently and what your likelihood of success will be.
"People sometimes think of these technologies as magic bullets. Much early commercial work in neural networks, for example, took the position that you didn't have to understand either your data or statistics -- just dump the data in, and the technology would find the relationships. But it turns out that you must have a very firm background in modeling, statistics and the business domain in order to structure, validate and justify any model, whether it's a decision tree, neural network, discriminant analysis or whatever. So the tool wasn't the answer.
"Because most of these methods involve extensive searches for patterns through very large data spaces, to the extent that you make the process more difficult by a poor problem formulation, you're far less likely to find interesting information.
"There is another side to that coin: the problem of overfitting. It's human nature to see patterns in things. When people look at the output of, say, a particular data mining algorithm -- a rule, perhaps that bubbles to the surface of the data -- there's a desire to explain that rule using intuition and background knowledge. And people are very good at figuring out such explanations, even if they're wrong.
"Most people who work in this field have had the experience of finding an interesting (but false) rule, only to later realize that they made some error in problem formulation or picked up some spurious relationship in the data. What is interesting is that people can usually generate a very good explanation for a rule even when the rule is wrong! So it is very easy to fool yourself: that's another mistake you run into frequently. This points out the need for more rigor in some data mining approaches."
D S * : Specifically, how can you guard against the perception of spurious patterns or relationships?
STEIN: "It is a tricky problem. Rigorous testing procedures are important: this breaks with some of the more common statistical approaches to problem-solving that concentrate on evaluating the significance of the parameters of the model itself. I, like most people who work in this field, typically favor out-of-sample testing. But even this can be tricky! People tend to make technical mistakes during testing. There are lots of stories of developers who thought they were performing exceptionally rigorous testing when, in fact, they were missing a fundamental assumption in their whole approach."
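The value of out-of-sample testing can be illustrated with a minimal sketch. It is not drawn from the interview: the data is pure noise, the "model" is a deliberately overfit lookup table, and every name in it (split_holdout, MemorizingModel, accuracy) is hypothetical. The point is only that a pattern which looks perfect on the data used to find it can vanish on data held back from the search.

```python
import random


def split_holdout(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


class MemorizingModel:
    """Deliberately overfit model: it memorizes every training label."""

    def fit(self, rows):
        self.table = {features: label for features, label in rows}
        labels = [label for _, label in rows]
        # Fall back to the most common training label for unseen records.
        self.default = max(set(labels), key=labels.count)
        return self

    def predict(self, features):
        return self.table.get(features, self.default)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    hits = sum(1 for features, label in rows if model.predict(features) == label)
    return hits / len(rows)


# Pure noise: the labels have no relationship to the features.
noise = random.Random(42)
rows = [((i, noise.random()), noise.choice([0, 1])) for i in range(2000)]

train, test = split_holdout(rows)
model = MemorizingModel().fit(train)

in_sample = accuracy(model, train)    # looks perfect
out_of_sample = accuracy(model, test)  # close to a coin flip
print(f"in-sample: {in_sample:.2f}, out-of-sample: {out_of_sample:.2f}")
```

The in-sample score here is an artifact of memorization. Only the held-out score says anything about whether a real relationship was found.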
D S * : Should a CIO work closely with a statistician?
STEIN: "Since the CIO sets the vision for the organization, it behooves him or her to have a firm grasp of data mining technology at least at the intuitive level. But this doesn't necessarily mean that the CIO needs to work intimately with a statistician or programmer. That level of in-depth technical familiarity may not be warranted.
"However, the team responsible for development of data mining in a particular domain should be made up of domain experts (business folks), as well as specialists with strong backgrounds in mathematics, statistics, and database programming. This is how I usually structure projects, and I've seen it work quite well and deliver remarkable results. The key is understanding how the technologies will fit into larger business solutions.
"Unfortunately, though, what happens in many organizations is that you get a lopsided team: people with strong business backgrounds who don't understand the technology well or technical people who do all kinds of things to obtain "interesting" information from data, but sometimes end up solving the wrong problems or solving the right problem the wrong way from a business needs perspective.
"There must be a good partnership between the two types of people, not merely a superficial one."
D S * : How can two groups with such different perspectives liaise effectively?
STEIN: "You need a way to map problems onto business solutions and vice-versa. But often all you get, say in vendor literature, is someone pushing a particular tool.
"It is vital that a person who is business-savvy in a particular area understand a little about the technology, at least at a basic level. By the same token, technologists must actively work to understand how business people will utilize the output of their technologies. Both sides need to talk to one another in a sort of middle-ground language, a language that is both technical and business-focused, a language that does not require either a Ph.D. in mathematics or an MBA. This language is what Vasant Dhar and I attempted to provide in our book."
D S * : What kind of resources must be utilized to implement this?
STEIN: "It depends upon the structure and culture of the organization. Some firms work well with project teams. Here, a business team comes together to solve a particular problem, and the technologists act in the capacity of a consulting service, going from team to team and problem to problem. Other firms form a specific group to solve a specific business problem, so a given business unit will "own" that process and take full responsibility for its solution. It really depends on the scope and structure of the project.
"Typically I favor finding a single strategic business need where a moderate increase in data understanding can potentially produce a fairly large impact. This is what I call a "quick kill." If all goes well, more challenging problems can be attacked next. This lets organizations get familiar with the culture of leveraging data and fitting business problems to different technologies. It also allows the technologies to build a track record within the firm.
"The key is that the organization's data and the expertise of its people must be deployed and managed as any other asset, as opposed to being thought of as a support-type function that is drawn upon by the business, like the opening of a faucet. The very concept of using data strategically must be important to the organization! Otherwise, what you end up with is a bunch of technical people trying to get business people excited about a particular idea or a group of business people getting excited about a new toy and trying to figure out how to apply it to their problems."
D S * : What specific role does education play here?
STEIN: "Of course, people should be well-trained, and the assumption is that teams are made up of the types of people I described earlier: specialists in statistical modeling and business strategy, etc. I'm also a strong advocate of self-education in this context: finding out what others have done, going to conferences, reading, etc., and trying technologies out to understand how they can fit into a firm's business strategy. This is especially useful on the management side, where people are not familiar with the technology and may well feel intimidated by the literature. However, I don't think attendance at, say, a three-day course will teach people how to solve these problems in one shot. The learning experience must be iterative.
"But I want to re-emphasize that there is no way you can get either the business side or the technology side out of the picture. I would never think of developing this type of business process and system without an interdisciplinary group -- period."
D S * : How can an executive realistically benchmark the results obtained from these technologies? How do you judge whether outcomes are marginal, adequate or exceptional?
STEIN: "Technologies cannot provide an answer here. This depends heavily on which problems are to be solved within a business domain and the quality of the data and expertise that are available.
"There is a tendency to say 'This group improved their profitability by x percent...' or 'That project failed by y percent.' But ultimately these are all domain-specific criteria. Accuracy is only one dimension. From a business perspective, things like business flexibility or decision explainability may be far more important, depending on the intended use.
"For example, the Chicago Bulls coaching staff reported that data mining had improved performance by 2-3 points per game. That's very different from US WEST saying they've saved millions of dollars using OLAP and a data warehouse to improve internal processes by highlighting lapses in transaction processing.
"Results depend upon the goals of an organization, the structure of the entire problem and the structure of the data. And optimal results always reflect where the organization is currently. An organization that's already been using data efficiently for a certain application may obtain only moderate added benefits from data mining. But, as it turns out, most organizations don't take very good advantage of their data, so most can realize large improvements from even simple projects.
"Evaluating results is intimately dependent upon the particular problem itself; it cannot be generalized. It's a chicken-and-egg situation: the problem defines the solution, and the solution defines the results, given the business context."
D S * : How should executives prioritize involvement with data mining technologies?
STEIN: "Again, let business needs drive development. Find a problem that the line cares a lot about. If you then concentrate on the dynamics of that problem -- and what a solution can provide -- you can intelligently evaluate the function of various tools: neural networks, genetic algorithms, recursive partitioning, etc. Each of these has a specific footprint or characteristic in terms of what sort of solution can be provided. Once the business needs are explicitly understood, certain tools and approaches will rule themselves out while others will suggest themselves.
"A classic example is neural networks, where the structure of the final decision model is often considered hard to interpret as compared to, say, a pattern coming out of CART, a rule-tree generating algorithm. On the other hand, neural networks give better fitting of complex surfaces, whereas a CART tree would have to be extremely complex to map such surfaces. There are trade-offs that must be made, and the key is deciding what is important for a particular business application. Note that two people could attack the same problem but nevertheless need very different things on the business side! The ultimate solutions required will dictate different approaches.
"Above all, though, firms need to begin to explore these technologies, if they haven't already, and understand how they fit into their business strategies. It can be very hard to play catch-up in this arena."
Roger Stein is a Vice President, Senior Credit Officer, Quantitative Analytics and Knowledge-Based Systems at Moody's Investors Service in New York. He has been working for Moody's in the field of data mining, mathematical modeling, applied AI, and stochastic simulation since 1989.
In the past nine years, Stein has developed and deployed dozens of models and systems that use applied technologies including fuzzy logic, genetic algorithms, neural networks, etc. as well as models that use standard and more traditional statistical methods. His research has spanned fields from credit and finance to operations research.
In addition to work with learning systems, Stein also spent several years as a rating analyst in Moody's Structured Finance group, where he rated various types of asset-backed securities. He was instrumental in developing several analytic methodologies to quantify the risks associated with new types of structured instruments.
Stein is a frequent invited lecturer and instructor at the NYU Stern School of Business, and has also spoken at The Wharton School and The Santa Fe Institute. Along with Vasant Dhar, he is the author of Seven Methods for Transforming Corporate Data Into Business Intelligence (1997, Prentice-Hall) which focuses on the application of intelligent methods to business problems.
Keys to the Commercial Success of Data Mining
Michel Adar and Nicolas Bonnet
2121 South El Camino Real, Suite 1200
San Mateo, CA 94403
What's in it for me - the business person?
Awesome data mining tools, fantastic algorithms, rapidly converging neural networks, highly accurate classification methods, clustering methodologies, etc. are neat and useful tools for the knowledge discovery professional, but they are far from demonstrating significant value to the business person.
The key to Commercial Success of Data Mining lies in providing true value to the business person in a form that can be used and understood by the business community.
We present here several of the most important aspects of how the DataCruncher was designed to accomplish the business user's goals.
Recognize who the customer is
To be commercially successful the first thing to realize is that the customer is not your fellow scientist. The customer is not the statistical analyst and the customer is not the mathematician. Sure, you can sell tools to all of them, but in order to be a commercial success you have to sell to the business person.
Then you need to realize that business people spend their money on tools that help solve specific business problems. So your tools need to demonstrate that they are useful in business situations and that they have a visible impact on the business.
The tools have to make themselves comprehensible to the business users. The language used has to be simple and business oriented. Results should be explained in terms that are comprehensible to the business user.
It is important to note that this is not just a matter of replacing a statistical concept with the English sentence that describes it. Rather, it means communicating with the user in terms and concepts that are familiar to the user.
Task and Goal Oriented
The business user does not want to create a neural network, a decision tree or an Agent Network. The user wants to solve a specific problem, to find a specific answer. The user wants to know into what segments it makes sense to divide the customers. The user wants to know which customers are most likely to churn. The user wants to know how likely a specific customer is to respond to a product currently being promoted.
The data mining tool's user interface should reflect the problems that the user is trying to solve. The specific approach used in the DataCruncher to attack this issue is the concept of Assistants. Assistants somewhat resemble wizards in the sense that they guide the user through a set of steps, but they are more complete than wizards. They provide random access to the steps and they follow the user through the whole process, always being there, accessible and well documented. They guide the user through the mining process. The assistants also allow the user to step outside the assistant and do things using the full flexibility of the tools when necessary, and then go back to the assistant. The DataCruncher assistants are customizable to many different business situations through the use of a scripting language.
Bridge the gap between analysis and deployment
Data mining models are developed for a purpose. The data mining tools should help the user apply the model to its purpose. For example, if the model is developed with the goal of identifying the best customers for a mailing campaign, then the model should be available where the mailing list is built. Approaches to solving this issue include providing APIs that enable other applications to make use of the model, adding capabilities to build a mailing list to the data mining tool itself, or integrating the data mining pieces into mailing list generation.
Another approach is to provide access to the data mining models through a service-oriented interface, where the models are published to a centralized server and can then be used by any application wanting to evaluate specific models against specific records. For example, several models may be built to determine customer segmentation, likelihood to churn, customer value, cross-selling opportunities, etc. These models can then be made available through the server, and any number of applications can apply the models to different customers by sending the appropriate messages to the server. For example, a mailing list building application may consult a model to score the likelihood of the customers to respond to the mailer.
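A minimal sketch of this service-oriented arrangement follows. It runs in a single process rather than over a network, and the class ModelServer, its publish and score methods, the toy scoring rules and the customer fields are all assumptions made for illustration; the text does not specify any particular interface.

```python
from typing import Callable, Dict, List

Record = Dict[str, float]
Model = Callable[[Record], float]


class ModelServer:
    """In-process stand-in for a central model server (hypothetical API)."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def publish(self, name: str, model: Model) -> None:
        """Register a model under a name so that applications can request it."""
        self._models[name] = model

    def score(self, name: str, record: Record) -> float:
        """Evaluate the named model against a single record."""
        if name not in self._models:
            raise KeyError(f"no model published under {name!r}")
        return self._models[name](record)


# Toy models with made-up scoring rules, for illustration only.
def churn_model(record: Record) -> float:
    return min(1.0, 0.25 * record["complaints"])


def mailer_response_model(record: Record) -> float:
    return 0.9 if record["purchases_last_year"] >= 3 else 0.1


server = ModelServer()
server.publish("churn", churn_model)
server.publish("mailer_response", mailer_response_model)

customers: Dict[str, Record] = {
    "c1": {"complaints": 0, "purchases_last_year": 5},
    "c2": {"complaints": 4, "purchases_last_year": 1},
}


def build_mailing_list(server: ModelServer, customers: Dict[str, Record],
                       threshold: float = 0.5) -> List[str]:
    """A client application: keep customers likely to respond to the mailer."""
    return [cid for cid, record in customers.items()
            if server.score("mailer_response", record) >= threshold]


mailing_list = build_mailing_list(server, customers)
print(mailing_list)
```

The mailing list application never sees how a model was built. It only names the model and supplies a record, which is what lets several applications share the same published models.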
Combine explicit with implicit knowledge
An important aspect of bridging the deployment gap is to understand that data mining models alone cannot make decisions. The models represent the implicit or learned knowledge. Model results have to be filtered through business rules (which represent the explicit knowledge) before they are put to work. These business rules may contain overrides, additional targeting criteria, geographic or time restrictions, etc. For example, a company that sells video cassettes may want to avoid offering an R-rated film to a customer who is a minor, even if the cross-selling model says that the customer's profile indicates that this is a good title to offer. Another example may be a churn avoidance campaign targeted at the residents of California: in this case, even if the data mining model indicates that a customer is about to churn, the offer should not be made if the customer does not live in California.
An additional advantage of having business rules combined with the data mining models is that the same rules and models can be used at the many different points where a decision is made. For example, a marketing campaign targeted at attracting new customers may use several models, such as customer segmentation, value and likelihood to accept the offer, combined with some business logic, to decide whether the offer should be made. If these models are deployed to a central server together with the business logic, then the same selection criteria can be used at the many points of contact between the company and the customer: for example, the company's call center, the mailing of the next bill, or the customer's visit to the company's web pages.
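The two examples above can be sketched as a model score filtered through explicit rules. Everything here is an assumption for illustration: the Customer fields, the function names, the 0.5 score threshold, and the choice of 18 as the age below which a customer counts as a minor.

```python
from dataclasses import dataclass

ADULT_AGE = 18  # assumption: customers younger than this count as minors


@dataclass
class Customer:
    name: str
    age: int
    state: str


def may_offer_title(customer: Customer, rating: str) -> bool:
    """Explicit rule: never offer an R-rated film to a minor."""
    return not (rating == "R" and customer.age < ADULT_AGE)


def in_campaign_region(customer: Customer, state: str = "CA") -> bool:
    """Explicit rule: the churn campaign targets residents of one state."""
    return customer.state == state


def decide_cross_sell(customer: Customer, rating: str,
                      model_score: float, threshold: float = 0.5) -> bool:
    """Offer the title only if the model likes it AND the rule allows it."""
    return model_score >= threshold and may_offer_title(customer, rating)


def decide_churn_offer(customer: Customer, churn_score: float,
                       threshold: float = 0.5) -> bool:
    """Make the retention offer only to likely churners inside the region."""
    return churn_score >= threshold and in_campaign_region(customer)


teen = Customer("Pat", 15, "CA")
adult = Customer("Lee", 40, "NY")

# The model is enthusiastic in every case; the rules have the final say.
print(decide_cross_sell(teen, "R", 0.9))   # False: blocked by the age rule
print(decide_cross_sell(adult, "R", 0.9))  # True
print(decide_churn_offer(adult, 0.9))      # False: not a California resident
```

Keeping the rules in separate functions from the model score means a rule can be changed, for example when the campaign moves to another state, without retraining or touching the model.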
Make it responsive and easy to change
Business situations change very rapidly. It is very important for the business user to react quickly to changing business conditions. In today's competitive world it is not acceptable to have the answer to a question delivered three months after it was asked. For example, a marketing person developing a promotion may be interested in modeling the customers' behavior to fine-tune the targeting. It is important for this model to be available very soon. In addition, the whole package that includes the several data mining models and business logic should be readily available and easy to modify to adapt to the necessary changes in the promotion.
Decision Delivery Systems
Decision Delivery Systems are designed as a vehicle for bringing decisions to different applications. These systems can typically combine the results of different data mining models with business logic to generate the decisions. As a centralized facility they provide a focused point for the deployment of models and knowledge, helping bridge the gap between the development of useful data mining models and putting them to work.
Skills and tasks of a data mining practitioner: A report from the trenches
Golden Books Family Entertainment
Previously I have stressed the importance of the knowledge discovery process as opposed to the data mining algorithm. Initially I thought that the development of data mining algorithms was somehow removed from the understanding and documentation of the knowledge discovery process. I now believe that it is not the process that needs documentation; it is the algorithms that need documentation, with the purpose of understanding their role in the knowledge discovery process. It is possible that this documentation and subsequent analysis will lead to the discovery that some of our most popular data mining algorithms need to be modified so that they fit into the knowledge discovery process.
In this position paper I will discuss the skills that a data mining practitioner who works for a mid-sized (less than $500 million in annual revenue) non-technology commercial business is likely to have. I will then discuss some of the tasks that these practitioners are expected to perform. Finally, I will describe how a popular class of data mining algorithms was augmented in one tool to support one of these tasks and conjecture on how we should modify/augment other data mining algorithms to support the remaining tasks.
Skills
In my experience a large number of data mining practitioners have finance or marketing as their core skills. These people are usually creative and understand the semantics behind business numbers. On average these individuals are adept personal computer users for the purposes of (1) sorting, averaging, taking percentiles, joining (as in database tables) and categorizing numbers; (2) presenting numbers as creatively formatted tables and charts; and (3) providing concise summaries of all of the numbers in a narrative. Please note that these individuals have very few skills that we would consider core statistical or database skills. Usually individuals with statistical skills play supporting technical roles and are removed from the business knowledge that data mining practitioners need. My guess is that there are approximately 50 data mining practitioners in a mid-sized company without any core statistical skills and approximately 5 with statistical skills playing support roles.
Tasks
Most of the tasks being performed by data mining practitioners can be described as follows.
1. Identification of "root causes" for increases or decreases in sales revenue and/or costs when compared to some base.
2. Forecasting of sales revenue and/or costs for existing and new products/services.
3. Analysis of trends associated with the market, customers and competitors.
4. Continuous classification/categorization of the company's business.
All four types of analysis very rarely include the explicit development of analytical models. I will refer to the above four tasks as "strategic analysis".
Current Situation
I believe that most existing data mining algorithms and products are focused towards the development of analytic models. In my opinion the development of such models requires core statistical expertise even if one is using algorithms developed by the machine learning community. By the way, just because a product has an easy-to-use graphical interface, it does not automatically become amenable to use by the non-technical community. For true ease of use, the non-technical community should easily understand the semantics of the task that a software product requires its users to carry out. Also, analytic models are often focused on narrow areas of the business, and their application is in the operational end of the business, not the strategic end. The current situation results in three consequences:
(1) We narrow the data mining market to the practitioners with statistical skills.
(2) We lose the opportunity to increase the quality of strategic analyses being conducted by businesses.
(3) We lose the opportunity to show quick cost savings in terms of personnel reduction and process improvements.
How can the current situation be improved?
What has got me excited and intrigued is the realization that the analytic process that data mining practitioners involved in strategic analysis go through is no different from the process that we go through in the development of analytic models. In the past we have referred to this process as the knowledge discovery process. So if we can focus on embedding algorithms without altering the semantics of the analytic process, we will greatly increase the quality of the strategic analysis being produced by businesses.
For example, let us look at a data mining product called Forecast Pro. Forecast Pro has embedded sophisticated exponential smoothing algorithms within the forecasting process. When using Forecast Pro the analyst is engaged in tasks that are no different than if exponential smoothing algorithms were not being used. Forecast Pro helps the analyst identify whether the forecast being created deals with seasonality, cycles or exceptional market conditions. Forecast Pro then selects the best algorithm, creates the forecast and visually allows the analyst to understand the accuracy of the forecast. Most of the technical jargon and steps are not visible to the analyst. Obviously Forecast Pro does not produce a forecast that is as good as that produced by a custom analytic model, but it does far better than a forecast that the analyst would have produced based on intuition without the use of any data mining algorithm. Forecast Pro is designed in a way that I feel comfortable giving it to someone without any core statistical skills to use. The immediate impact of using this tool is that the business has a better forecast, the forecast is a lot less cumbersome to generate, and one analyst instead of three can generate the forecast. Of course, I would not use Forecast Pro in an "operational" environment for producing detailed forecasts to drive logistics. There I would prefer a custom analytic model.
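The idea of hiding an exponential smoothing algorithm behind the forecasting task can be sketched as follows. This is not Forecast Pro's actual method, which is not described here in detail. It is plain simple exponential smoothing, with the smoothing weight picked automatically by one-step-ahead squared error; the function names, the candidate weights and the sample series are assumptions for illustration.

```python
from typing import Sequence, Tuple


def smooth_forecast(series: Sequence[float], alpha: float) -> float:
    """One-step-ahead forecast from simple exponential smoothing."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        # New level = weighted mix of the latest value and the old level.
        level = alpha * value + (1 - alpha) * level
    return level


def one_step_sse(series: Sequence[float], alpha: float) -> float:
    """Sum of squared one-step-ahead errors over the history."""
    level = series[0]
    sse = 0.0
    for value in series[1:]:
        sse += (value - level) ** 2  # error of the forecast made before seeing value
        level = alpha * value + (1 - alpha) * level
    return sse


def auto_forecast(series: Sequence[float],
                  candidates: Sequence[float] = (0.1, 0.5, 0.9)) -> Tuple[float, float]:
    """Pick the weight with the lowest historical error, then forecast.

    The analyst calls this one function and never has to choose alpha.
    """
    best_alpha = min(candidates, key=lambda a: one_step_sse(series, a))
    return best_alpha, smooth_forecast(series, best_alpha)


monthly_sales = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up, steadily rising series
best_alpha, forecast = auto_forecast(monthly_sales)
print(best_alpha, round(forecast, 4))
```

On a steadily rising series the selection favors a high weight, because recent values are the best guide to the next one. Simple smoothing still lags a trend, which is one reason a real tool would also need to detect trend and seasonality as the text describes.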
If you agree with my analysis above, then we should start working on embedding existing data mining algorithms within the analytic processes of the various tasks listed above. Induction or regression algorithms can be modified to help analysts within the "root cause" analysis process. The goal here would not be to develop precise causal models but to produce causal analysis that is more accurate and insightful than what is currently being produced, and to produce this analysis faster and cheaper than it is currently being produced. Similarly, clustering algorithms can be modified to help analysts within the continuous classification/categorization process. Finally, time-series and sequence analysis algorithms can be modified to help analysts with the trend analysis process. My position is that if we work on embedding data mining algorithms within the analytic process, we will discover the true reason for the research dedicated towards integrating data mining algorithms with database management systems, the development of knowledge discovery process models, the development of "hybrid" data mining algorithms and the increasing use of data visualization. I also believe that the more appropriate accuracy and performance comparisons are not among data mining algorithms but between an easy-to-use data mining algorithm and no algorithm at all. Finally, I believe that work on embedding data mining algorithms within the analytic process will lead to the creation of a community of experts in the analytic process. Currently I think we are a community of algorithm (or tool) experts. Imagine a carpenter who only knows how to use a saw!
Intelligent Information Delivery: When too Much Knowledge is a Dangerous Thing
Judy Bayer, Ph.D.
Vice President, Analytic Solutions
Ceres Integrated Solutions
Recently, users of advanced information systems have begun to realize the value of incorporating automated alerts into their systems. Automated alerts are analytical agents that are designed to automatically find managerially interesting and important information in a database. The agents operate without user intervention, but report important information back to users whenever critical events are found in the database.
Alerts can be a powerful analytical tool to keep managers informed as to critical problems and important business opportunities. All it takes, it seems, is having the correct underlying sources of data for the alerts to operate on and then creating the appropriate set of alerts. The problem with automated alerts is that the volume of information automatically returned to the user can quickly become overwhelming. Analysis is easy. Knowledge is hard. A vital component in the development of knowledge is the recognition that an event is something that is important for the recipient to know about; in fact, that it is more important to know about than other significant events.
In this paper, we examine the implications of "alerts run rampant" on the ability of the alerts system to provide actionable knowledge to the organization. We then provide a simple example of an Intelligent Information Delivery (IID) mechanism that functions as a meta-layer to the alerts system. The IID layer evaluates the importance and criticality of alert-created information across all alerts in the system. It then decides on the disposition of specific pieces of information. Finally, we describe how the IID layer can be used as a mechanism to derive knowledge out of analyzed information from data mining systems, in general.
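A minimal sketch of such a meta-layer is given below, under stated assumptions. The text does not define how importance is measured or what dispositions exist, so the scoring rule (magnitude times a business weight), the three dispositions (deliver, digest, suppress), the thresholds and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class AlertEvent:
    alert_name: str         # which alert produced the event
    subject: str            # what the event is about
    magnitude: float        # size of the change, scaled to 0..1 (assumption)
    business_weight: float  # how much the recipient cares, 0..1 (assumption)


def importance(event: AlertEvent) -> float:
    """Hypothetical importance score used to compare events across alerts."""
    return event.magnitude * event.business_weight


def dispatch(events: List[AlertEvent], deliver_top: int = 2,
             digest_threshold: float = 0.2) -> Dict[str, List[AlertEvent]]:
    """Rank events from ALL alerts together and decide each one's disposition."""
    ranked = sorted(events, key=importance, reverse=True)
    result: Dict[str, List[AlertEvent]] = {"deliver": [], "digest": [], "suppress": []}
    for rank, event in enumerate(ranked):
        if rank < deliver_top:
            result["deliver"].append(event)    # sent to the user right away
        elif importance(event) >= digest_threshold:
            result["digest"].append(event)     # rolled into a periodic summary
        else:
            result["suppress"].append(event)   # logged but not reported
    return result


events = [
    AlertEvent("price_change", "competitor item A", 0.6, 0.5),
    AlertEvent("share_change", "own item B", 0.9, 1.0),
    AlertEvent("new_item", "competitor item C", 0.5, 0.8),
    AlertEvent("promotion", "competitor item D", 0.1, 0.5),
]

outcome = dispatch(events)
for disposition, items in outcome.items():
    print(disposition, [e.subject for e in items])
```

The essential design choice is that ranking happens across all alerts at once. An event that passes its own alert's trigger can still be held back because other events matter more to the recipient.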
Alerts Run Rampant
Data mining systems, in general, are geared towards the analysis of vast amounts of data, and are designed to produce large quantities of analyzed information that, essentially, have to be sifted through and analyzed before they become useful as business decision making aids. This fact can become a critical issue when applied to automated alert systems. These systems are designed to perform data mining automatically and continuously. An example from the consumer packaged goods (CPG) industry will show the magnitude of the problem.
The CPG industry has, for many years, had the availability of rich sources of data. For most grocery products, vendors such as A.C. Nielsen and IRI sell sales scanner data that tracks all competitive products in a category, by UPC (the individual product, the lowest level information that manufacturers track for sales purposes), in each of fifty or more markets. A typical category can have 1,200 or more UPCs in each market. Most packaged goods manufacturers receive updates weekly. This means that a single alert measure can be tracking 60,000 possible events each week.
There is, however, far more than one important alert measure that packaged goods manufacturers need to track. Some key alert measures for the packaged goods industry include short term market share changes for all UPCs in the market, trends in market share changes, introductions of new competitive items, competitor price changes, and changes in competitive promotional activity. Competitive activity is inferred by observing such things as retailer promotion pricing actions, increased levels of distribution for a competitor's UPCs, and retailer promotions, such as increases in point of purchase displays, major ads and coupon activity. Since packaged goods marketers typically micro-market, each UPC has to be tracked, by market, for each alert measure.
There can easily be hundreds of thousands (or many more) events being tracked automatically. Because the CPG marketing environment is highly competitive and dynamic, there can easily be thousands of events that set off trigger conditions to alert a user. The situation gets even more overwhelming when we consider the fact that advanced marketing analysis systems in the CPG industry sometimes also embed sophisticated data mining technology that automatically analyzes causal factors associated with some alert conditions. The ensuing report, then, includes not only alert information, but also details of an analysis. The information overload that results can set up a condition where the user has to either spend all his or her time on reviewing the results of alerts, or ends up just ignoring the output of the alerts system.
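As a rough back-of-the-envelope check on these volumes, the sketch below multiplies out the figures quoted above. The count of five alert measures is an assumption taken from the examples listed earlier, and the 1% trigger rate is purely hypothetical, chosen only to illustrate the scale of the problem.

```python
# Figures quoted in the text.
MARKETS = 50             # "fifty or more markets"
UPCS_PER_MARKET = 1200   # "1,200 or more UPCs in each market"

# Assumption: the five example alert measures named in the text.
ALERT_MEASURES = [
    "short-term market share change",
    "market share trend",
    "new competitive item",
    "competitor price change",
    "competitive promotional activity",
]

# Hypothetical share of tracked events that trip a trigger in a given week.
ASSUMED_TRIGGER_RATE = 0.01

events_per_measure = MARKETS * UPCS_PER_MARKET
events_per_week = events_per_measure * len(ALERT_MEASURES)
alerts_per_week = round(events_per_week * ASSUMED_TRIGGER_RATE)

print(f"events per measure per week: {events_per_measure:,}")
print(f"events tracked per week:     {events_per_week:,}")
print(f"alerts at a 1% trigger rate: {alerts_per_week:,}")
```

Even under these conservative assumptions, a weekly run produces alerts by the thousand, which is the volume the rest of this paper tries to tame.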
What an Intelligent Information Delivery System is
An Intelligent Information Delivery system is essentially a knowledge-based system that:
The IID system functions as a meta-analysis layer for the alerts system. It evaluates alert-created information across all alerts in the system. Based on results of analysis and the rules contained in the system, it decides on the relative importance of the various alerts. The IID system also decides who, that is, which users, should receive specific pieces of information. The IID knowledge base contains rules related to managerial objectives that guide the selection of output for individual users. Development of this knowledge base is based on conducting knowledge engineering sessions with key business users to determine specific business rules to incorporate in the system. Actual application to individual users is based on creating user settings stored in a database table and accessed by the meta-analysis layer.
Simple Example of an Intelligent Information Delivery Mechanism
As an illustration, we provide a simple example of an IID system. The system has all three components: 1) an alert monitor, 2) meta-analysis capabilities, and 3) a business rule knowledge base. The IID system is designed to support a consumer packaged goods alert system.
The alert system in the example polls the database and performs its analyses weekly to coincide with database updates based on marketplace scanner data purchased from IRI or A.C. Nielsen. The focus of the system is on information contained in this data. The IID Alert Monitor intercepts all alerts that are in its domain of knowledge. No alerts are passed onto users at this time. The Monitor holds the alert information until all the alerts have finished processing the updated information in the database.
The Meta-Analysis Layer synthesizes results of the alerts process and performs further analysis. For example, it will do cross-market analysis of alerts to discover whether or not an alert condition is specific to a single market, or whether it reflects a more general condition. It will also check if the alert is a one-time occurrence or whether there has been a pattern of these conditions over time.
The Meta-Analysis Layer also makes an assessment of the overall volatility in the marketplace. Highly volatile markets can be expected to have many fluctuations in market share, retailer promotional activity, and competitor product introductions. After this assessment, some alerts that might be considered significant in a non-volatile market may no longer be important enough to report.
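The monitor and meta-analysis steps described above might be sketched as follows. This is an illustration only, not the implementation of any actual IID system: the `Alert` fields, the market names, the volatility proxy and the threshold formula are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    measure: str      # e.g. "share_drop" (hypothetical measure name)
    upc: str
    market: str
    week: int
    magnitude: float  # size of the change that tripped the alert


class AlertMonitor:
    """Holds alerts until the weekly alert run has finished."""

    def __init__(self):
        self._held = []

    def intercept(self, alert):
        self._held.append(alert)

    def release(self):
        held, self._held = self._held, []
        return held


def meta_analyze(current, history, markets_tracked, base_threshold=1.0):
    """Annotate this week's alerts and drop those lost in market noise."""
    # Cross-market view: in how many markets did the same condition fire?
    markets_hit = defaultdict(set)
    for a in current:
        markets_hit[(a.measure, a.upc)].add(a.market)

    # Pattern over time: earlier weeks in which the same condition fired.
    prior_weeks = defaultdict(set)
    for a in history:
        prior_weeks[(a.measure, a.upc, a.market)].add(a.week)

    # Crude volatility proxy (assumption): share of markets with any alert.
    volatility = len({a.market for a in current}) / markets_tracked
    threshold = base_threshold * (1.0 + volatility)

    results = []
    for a in current:
        if a.magnitude < threshold:
            continue  # not important enough given current volatility
        results.append({
            "alert": a,
            "general_condition": len(markets_hit[(a.measure, a.upc)]) > 1,
            "recurring": len(prior_weeks[(a.measure, a.upc, a.market)]) > 0,
        })
    return results


monitor = AlertMonitor()
monitor.intercept(Alert("share_drop", "UPC-1", "Boston", 10, 3.0))
monitor.intercept(Alert("share_drop", "UPC-1", "Atlanta", 10, 2.5))
monitor.intercept(Alert("price_cut", "UPC-2", "Boston", 10, 1.02))
history = [Alert("share_drop", "UPC-1", "Boston", 9, 2.0)]

report = meta_analyze(monitor.release(), history, markets_tracked=50)
for entry in report:
    print(entry["alert"].market, entry["general_condition"], entry["recurring"])
```

In this toy run the small price cut falls below the volatility-adjusted threshold and is dropped, while the two share alerts are kept and flagged as part of a multi-market condition.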
Business Rules Knowledge Base
The Business Rules Knowledge Base contains rules developed by conducting in-depth interviews with key business managers in the organization responsible for taking action based on the results of the alert system. The business rules are mapped against the meta-analyzed alerts to determine which alerts are really important to know about, and who receives which alerts.
Consumer packaged goods marketers often focus on Brand Development Index (BDI) and Category Development Index (CDI) measures when running their business. BDI ranks markets as to the strength of the brand in that market. Markets where the brand has a high market share, high BDI markets, are ranked ahead of markets where the brand has a low market share. CDI ranks markets as to the strength of the overall category in the market. Markets where category sales are high (high CDI markets) are ranked ahead of markets where category sales are low.
Brand strategies often incorporate the relative importance of these measures and how to use them. For example, a brand strategy that focuses on increasing market share may often focus on high opportunity markets, that is, those with high CDI but low BDI. A brand strategy that focuses on maintaining current brand strength may focus on high BDI markets. An important element, then, of the business rules knowledge base may be the incorporation of rules related to BDI and CDI.
An emphasis on BDI and CDI could lead to the following rules:
IF Brand Strategy is to focus on High Opportunity Markets
THEN Alerts should be ranked by the CDI of the market they relate to
IF Brand Strategy is to focus on High Brand Strength Markets
THEN Alerts should be ranked by the BDI of the market they relate to
There will be other rules relating to other strategies that incorporate additional factors.
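The two ranking rules above can be expressed as a lookup from brand strategy to ranking index. The sketch below is a minimal illustration; the market names, index values and strategy keys are hypothetical, and real BDI and CDI figures would come from the marketer's own data.

```python
# Hypothetical index values per market.
MARKET_INDEX = {
    "Boston":  {"BDI": 130, "CDI": 90},
    "Atlanta": {"BDI": 70,  "CDI": 140},
    "Denver":  {"BDI": 100, "CDI": 110},
}

# The two rules above: each brand strategy selects the ranking index.
RANKING_INDEX = {
    "high_opportunity": "CDI",
    "high_brand_strength": "BDI",
}


def rank_alerts(alerts, strategy):
    """Order alerts by the index of the market each alert relates to."""
    index = RANKING_INDEX[strategy]
    return sorted(
        alerts,
        key=lambda a: MARKET_INDEX[a["market"]][index],
        reverse=True,
    )


alerts = [
    {"upc": "UPC-1", "market": "Boston"},
    {"upc": "UPC-1", "market": "Atlanta"},
    {"upc": "UPC-2", "market": "Denver"},
]

by_opportunity = [a["market"] for a in rank_alerts(alerts, "high_opportunity")]
by_strength = [a["market"] for a in rank_alerts(alerts, "high_brand_strength")]
print(by_opportunity)
print(by_strength)
```

Keeping the strategy-to-index mapping in a table rather than in code makes it easy to add the other strategies and factors mentioned above.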
Other rules in the knowledge base may relate to results of cross-market analysis, prioritization of negative information about the marketer's brand, prioritization of positive information about competitors' brands, priority given to trends versus one-time events, and thresholds related to when to consider competitive activity important.
IF Cross-market analysis shows an overall strong pattern
THEN This is an important alert
IF UPC is for OUR Brand
AND There is a downward trend in market share
THEN This is an important alert
IF UPC is for Key Competitor's Brand
AND There has been a Highly Significant increase in market share
THEN This is an important alert
IF Market share change for a UPC is > twice the average Market Share
THEN This is a Highly Significant increase in market share
IF UPC is for Key Competitor's Brand
AND There has been at least a three month trend in price decreases
THEN This is an important alert
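Rules of this kind map naturally onto predicates over the facts known about an alert. The following sketch is one possible encoding, not the representation used by any particular system. The fact names are hypothetical, and the "Highly Significant" rule is read here, as an assumption, as comparing a UPC's share change against twice the average share change.

```python
def highly_significant_increase(facts):
    # Assumption: "average" is read as the average share change across UPCs.
    return facts["share_change"] > 2 * facts["avg_share_change"]


# Each rule pairs a label with a predicate over one alert's facts.
IMPORTANCE_RULES = [
    ("strong cross-market pattern",
     lambda f: f["cross_market_pattern"] == "strong"),
    ("our brand, downward share trend",
     lambda f: f["brand"] == "ours" and f["share_trend"] == "down"),
    ("key competitor, highly significant share increase",
     lambda f: f["brand"] == "key_competitor"
     and highly_significant_increase(f)),
    ("key competitor, three-month price decrease trend",
     lambda f: f["brand"] == "key_competitor"
     and f["months_of_price_decreases"] >= 3),
]


def important_reasons(facts):
    """Return the label of every rule that fires; empty means not important."""
    return [label for label, rule in IMPORTANCE_RULES if rule(facts)]


competitor_alert = {
    "brand": "key_competitor",
    "cross_market_pattern": "weak",
    "share_trend": "up",
    "share_change": 2.4,
    "avg_share_change": 1.0,
    "months_of_price_decreases": 1,
}
minor_alert = dict(competitor_alert, brand="other")

reasons = important_reasons(competitor_alert)
no_reasons = important_reasons(minor_alert)
print(reasons)
print(no_reasons)
```

Returning the labels of the rules that fired, rather than a bare yes or no, lets the delivery layer tell each user why an alert was judged important.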
The above is just a small sample of the business rules knowledge base that would be developed for even a simple Intelligent Information Delivery system. However, even a simple IID system can reduce the volume of output of alerts from hundreds of pages containing thousands of analyses to just the few most important findings.
In this paper, we introduce the concept of Intelligent Information Delivery systems: systems that form a meta-layer on top of an alert system or other type of data mining system. The IID monitors the alerts produced and decides which information is most critical to bring to users' attention. We also give a brief example of an IID system.
As data mining systems and systems of alerts that repetitively and automatically analyze information in databases become more prevalent, the problem of what to do with all the answers that come out will become increasingly important. The alternative is that over time, users of these systems will find that the more analysis they receive, the less they end up knowing.
Judy Bayer is Vice President, Analytic Solutions for Ceres Integrated Solutions. In that capacity, she has worked to help companies integrate marketing information into the general decision-making process. Prior to joining Ceres, Dr. Bayer taught marketing at the MBA, Ph.D., and undergraduate levels at Carnegie Mellon University and New York University, and was Vice President of Advanced Technologies at a Business Intelligence Consulting firm. Her expertise includes marketing research, business and market modeling, data mining and knowledge-based systems for managing information intensive environments, customer database marketing, and technology adoption. She has worked with leading companies in the packaged goods, computer, retail, insurance, financial, telecommunications and defense contractor industries and has presented her work to executive groups such as The Conference Board and the Advertising Research Foundation. Recently, she has led strategic seminars in data mining concepts and products.
Dr. Bayer's research on knowledge-based systems has been widely cited in books and articles on marketing management and advanced marketing information systems. She has authored or coauthored more than 25 professional journal publications, white papers and technical reports.
Data Mining and Visualization for Agent-Based Modeling
Robert N. Bernard and Alan R. Shapiro
1301 Avenue of the Americas
New York, NY 10019-6013
PricewaterhouseCoopers is a global accounting and management consulting organization that has seen a steady increase in its data mining practice over the past three years. Over 65 employees in the U.S. practice work full-time and exclusively on the application and development of data mining techniques. The areas of application include demand forecasting, supply chain management, market segmentation, customer lifetime profitability estimation, trading surveillance, detection of opportunities for cross-selling, and fraud detection, to name a few. Data mining at PricewaterhouseCoopers is viewed as an analytic process for defining and meeting clients' information needs rather than as simply a set of techniques.
In this paper, we delve into the work of a particular group at PricewaterhouseCoopers Consulting, the Emergent Solutions Group (ESG). ESG provides forecasts of customer demand in a variety of industries for clients who are interested in near real-time decision support. ESG forecasts customer demand primarily through a technique called adaptive agent-based simulation modeling. Instead of using standard numerical forecasting techniques (e.g., regression, ARIMA), ESG's adaptive agent-based simulation modeling attempts to replicate the decision processes and interactions of actual consumers in an environment. For instance, during the simulation of the activities that occur in a retail store during a day, we would model each consumer that enters the store, what the consumer thinks about while browsing the store, what kind of information consumers might exchange with each other and with the (simulated) store clerks, and the contents and location of each transaction that takes place.
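To make the idea concrete, here is a deliberately tiny agent-based sketch of one store day. It is not ESG's model: the `Consumer` fields, the word-of-mouth effect and all parameter values are hypothetical, chosen only to show individual decisions and interactions adding up to a demand figure.

```python
import random
from dataclasses import dataclass


@dataclass
class Consumer:
    interest: float        # hypothetical propensity to buy, between 0 and 1
    bought: bool = False


def simulate_day(n_consumers, word_of_mouth=0.05, seed=0):
    """Simulate one store day and return the number of purchases."""
    rng = random.Random(seed)
    consumers = [Consumer(interest=rng.random()) for _ in range(n_consumers)]
    purchases = 0
    for i, shopper in enumerate(consumers):
        # Interaction: talk to one earlier shopper; a buyer raises interest.
        if i > 0:
            peer = consumers[rng.randrange(i)]
            if peer.bought:
                shopper.interest = min(1.0, shopper.interest + word_of_mouth)
        # Decision: buy with probability equal to current interest.
        if rng.random() < shopper.interest:
            shopper.bought = True
            purchases += 1
    return purchases


demand = simulate_day(n_consumers=200)
print(f"simulated purchases: {demand}")
```

Because the random generator is seeded, a run is repeatable, which makes it possible to compare scenarios by changing one parameter at a time.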
Data Mining for Agent-Based Modeling
PricewaterhouseCoopers Consulting's Emergent Solutions Group uses data mining for agent-based modeling in two separate contexts: to imbue its agents with realistic knowledge, and to extract information from clients' data and from our own simulations. First, we are exploring text mining as a method of extracting realistic behaviors for our agents. Second, we have developed several methods and practices for high quality visualization of the results and processes in agent-based modeling.
Agents need realistic behaviors in order to be useful in forecasting customer demand. We are exploring the use of text mining as a technique in enhancing the complexity and realism of our agent models. It has become increasingly clear that purely numeric data may not contain all the essential details needed for modeling human behavior. Textual data that describes unusual circumstances, or that gives insight into reasons why actions were taken, clearly contains meaningful information not to be found in a simple number. If there is excessive reduction for purposes of numerical manipulation, the information crucial for explanation may no longer still be in the analysis. A variety of approaches, from full natural language processing to simple identification of noun phrases, have been used for extracting information from text. Approaches which rely on the identification and exploitation of restricted sublanguages, together with limitations on the types of information processed, have produced useable, albeit limited, results (Grishman, 1997). Using such an approach, automated text analysis has been combined with techniques such as association analysis and rule induction to explore text-containing databases for new insights (Shapiro, 1983). More recently, the most productive approach we have found has been to create an environment in which extensive processing is used to complement human context and pattern recognition capabilities.
The ability to extract reliable qualitative information from written text (such as interviews or transcripts) is exceedingly valuable in imbuing ESG's agents with realistic behaviors. Decision processes of actual people that may have gone unaccounted for or unnoticed when relying only on traditional large-scale survey techniques can now be captured and utilized in an agent-based simulation model. Of course, ESG also uses traditional survey techniques to gauge the demographic characteristics, declarative knowledge, interactions between one another, and other cross sectional properties of the agents. Combining the dynamic behavioral data garnered from the text mining process and the static characteristics from traditional survey techniques provides a rich source of data that eventually results in forecasts that are more accurate than those provided by conventional numeric techniques.
Once the agents have obtained real characteristics and behaviors, we run them in our simulations; as stated above, many of these simulations are forecasts of consumer demand. These simulations use our IceCore technology, which allows seamless communication between the simulation code and a relational database. These simulations serve three purposes: first, they allow the results of the simulations to be interpreted as forecasts of consumer demand; second, the data generated by these simulations can be mined (a la Stein and Bernard, 1998), both with standard mining techniques and visually, to see if any interesting patterns exist; and third, the simulation can be examined while it is running to see whether any interesting patterns can be detected visually.
As mentioned above, data visualization of simulation results and client data is a key component of much of ESG's work. Using the graphics capabilities of Silicon Graphics workstations, ESG can animate high dimensional data so that clients can see the simulation develop over time. The ability to explore, discover, and portray patterns visually appeals to some clients' non-quantitative inclinations, and frequently provides a more holistic and gestalt understanding of the hidden messages in the data than presenting simple rules or single numerical answers.
Visually mining forecasts done through agent-based simulation can also provide insight faster than a prose document can. By visually observing several different variables over time, clients can better grasp the intricacies of the dynamic nature of many consumer markets. Furthermore, clients are also able to peer into the nature of the decision processes of individual agents (i.e., the synthetic consumers) and obtain an intuitive explanation as to why the agent purchased or did not purchase a particular product. Intuitive explanations are not available from the results of a neural net, for instance.
Finally, to allow clients to view a simulation as it progresses, ESG has developed a non-commercial public domain usage protocol for doing three-dimensional visualization of agent-based simulation models, the Remote Simulation Visualization Protocol, a.k.a. RSVP (Borges and Sigvaldason, 1998). RSVP consists of a series of simple commands that attach to a programming language such as C++. These commands control the movement and display of agents in a three-dimensional simulated environment. Clients can see the movement of agents in an environment as the simulation progresses. Thus, they have the unusual ability to peer into the behavior of a world and use a very valuable data mining tool that we sometimes overlook: the human brain.
Grishman, R. 1997. Information extraction: techniques and challenges. In M. Pazienza (ed.), Information Extraction. Berlin: Springer-Verlag, 10-27.
Shapiro, A.R. 1983. Exploratory analysis of the medical record. Medical Informatics (Special issue -- New methods for the analysis of clinical data), 8(3),163-171.
Borges, B. and T. Sigvaldason. 1998. Bar stool theorizing: on the validity of economic signals in bounded rational worlds. Paper presented at A-LIFE 6. Los Angeles, CA. June 1998.
Stein, R. M., and R. N. Bernard. 1998. Data mining the future: genetic discovery of good trading rules in agent-based financial market simulations. Proceedings of the IEEE/IAFE/INFORMS 1998 Conference on Computational Intelligence for Financial Engineering (CIFEr): 171-179.
Rob Bernard is a Senior Associate in, and the leader of the New York office of PricewaterhouseCoopers Consulting's Emergent Solutions Group (ESG). Rob specializes in statistical and qualitative analysis of ESG's forecasting capabilities. In addition, he develops adaptive agent-based simulations for governmental policy makers combining federal, state, and local data sources. Rob is currently finishing his Ph.D. in Urban Planning and Policy Development at Rutgers University.
Alan R. Shapiro is in the Business Intelligence Practice at PricewaterhouseCoopers with a primary focus on data and text mining. Dr. Shapiro trained in multivariate statistics at the University of North Carolina at Chapel Hill and then in statistical pattern recognition and adaptive signal processing at Stanford University. He taught applications of statistical pattern recognition as a professor in the Department of Mathematics at the University of California, San Diego and at the Medical University of South Carolina. Over the past twenty years, Dr. Shapiro has directed the development of multiple analytic database systems in medicine and finance. His current research interests involve methods for the analysis and visualization of the information contained in text.
Business focus on data engineering
Data mining has emerged this decade as a key technology for areas such as business intelligence, marketing, and so forth. For the purposes of discussion, application and business domains I will consider here include telecommunications, medical devices, space science (vehicle health management and scientific instrumentation), targeted marketing, and mining.
From a technical view, I don't consider data mining to be a new field, but rather another discipline in the lengthy history of engineering sciences that use data as a core focus for developing knowledge. This family of disciplines I'll consider here under the term "data engineering" (see our company position at http://www.ultimode.com/papers/data.html).
Some traditional and non-traditional examples follow. Data engineers working with physicists to analyze spectral data from a high-resolution imaging spectrometer develop sophisticated models of the spectrometer's complex error modalities (registration, response function, calibration, measurement glitches) so that a high-fidelity model of the spectrometer's measurements can be built. Data engineers investigating the performance of an industrial-strength place-and-route package uncover useful characteristics of the optimization process and thereby improve the performance of the algorithm. Data engineers work with astronomers in analyzing infra-red data from an electronic star catalogue; the analysis, in concert with the astronomers' interpretations, reveals new, publishable classes of stars and also uncovers troublesome, never-before-recognized artifacts of the original instrument. Data engineers in a large corporation investigate the bad debts database and uncover useful patterns for selecting targets for debt recovery, thereby dramatically improving the corporation's debt recovery.
At the time of the development, the individuals performing these tasks may have considered themselves applied machine learning researchers, decision analysts, statisticians, or neural network researchers; however, they were all performing data engineering. You may have also heard of the terms data mining and knowledge discovery, exploratory data analysis, intelligent data analysis, and so forth. These areas perform similar tasks, but each has a particular emphasis that reflects its origins, whether it be the applications they serve or the algorithms for data analysis that they use. Data engineering is inherently a multi-disciplinary field because of the number of technologies involved: visualization, data analysis, knowledge engineering, perhaps databases, and of course the subject matter of the application.
So there we have the technical background of the community, and some idea of the range of applications. What are the business implications here? A number of factors have emerged in our consulting work that are beginning to give me a better understanding of the business nature of the discipline. First, the community has a number of different focuses.
Our experiences in this third focus present an interesting conundrum for the business manager. We find that in this third focus, there is a big difference between the results of the "average" practitioner and the "quality" practitioner. Every good software manager would know that a really good programmer can produce 100 times more code than an average programmer, partly due to the net result of subsequent maintenance, reduction in overhead and systems validation, and so forth. We find the same with data engineering. Except with data engineering, we find there are a few key insights made in a project that make all the difference. Mundane use of the "usual tools" in the "usual manner" by the average practitioner gets you so far. But a big difference in performance is gained by the quality practitioner who makes a few key insights to change the project.
I will give one technical example. For our mining and targeted marketing clients, for instance, we are under NDA not to disclose the details of our key discoveries. Related public-domain examples where non-trivial analysis of the data makes a key difference can be found at the Ultimode Systems Case Studies page.
The following example comes from a long-term project between Ultimode Systems and NASA Marshall Space Flight Center called OPAD. Looking at the high-resolution spectrometer data taken of the NASA space shuttle main engine, everyone thought the significant OH component of the spectrum varied significantly from engine firing to engine firing, and thus that the task of determining the subtle but significant metal lines in it would be difficult. They were wrong. To everyone's surprise, I managed to show that the instrument's irradiance calibration data was integrated over too short a period, thus producing fluctuations. Unfortunately, the manufacturer had hard-coded the integration period into the instrument. I also managed to show that smoothing methods (of a kind only an experienced professional would know about) could be used to correct for the problem, and thus we now obtain lovely, consistent OH spectra from engine firing to engine firing.
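The smoothing methods actually used on the OPAD data are not described here, so the sketch below substitutes a plain centered moving average, applied to synthetic data, simply to illustrate how smoothing suppresses the kind of fluctuation that a too-short integration period might introduce. The noise level, window size and flat "true" response are all assumptions.

```python
import random
import statistics


def moving_average(values, window=5):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Synthetic stand-in for calibration data: a flat response plus noise.
rng = random.Random(42)
true_response = 1.0
noisy = [true_response + rng.gauss(0, 0.1) for _ in range(200)]
smooth = moving_average(noisy, window=9)

noise_before = statistics.pstdev(noisy)
noise_after = statistics.pstdev(smooth)
print(f"fluctuation before: {noise_before:.3f}, after: {noise_after:.3f}")
```

A real calibration curve is not flat, so the window would have to be chosen narrow enough to preserve genuine spectral structure while still averaging out the noise.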
What are business implications here? There are several.
Regardless of what happens to data mining as a community, we know that data engineering in one form or another will continue to remain a key enabling technology for many businesses, and thus finding the right balance between software, intellectual property, and so forth, is all part of the evolution of the industry.
Kensington Approach Towards Enterprise Data Mining
Jaturon Chattratichat, Yike Guo, Stefan Hedvall, and Martin Kohler
Data Mining Group, Imperial College Parallel Computing Centre
University of London
The Kensington system, which is being developed at the Imperial College Parallel Computing Centre in the University of London, aims to provide an enterprise solution for large-scale data mining in environments where data is logically and geographically distributed over multiple databases. Supported by an integrated visual programming environment, the system allows an analyst to explore remote databases and visually define and execute procedures that model the entire data mining process. It also provides learning algorithms, optimised for high-performance platforms, for the most common data mining tasks. Decision models generated by the system are evaluated and manipulated using powerful interactive visualisation techniques. The overall aim of the system design is to provide an integrated, flexible and powerful data mining environment as the basis for customised domain-specific applications. The main features of the system design are discussed in turn below.
Distributed database support
Today, many companies store large quantities of data in data warehouses. The data is potentially rich and useful for data mining. A data mining system should allow seamless integration of both local files and remote databases. The Kensington system enables database integration in preparation of data mining by providing remote database access via JDBC. Analysts can query and retrieve data from their remote and distributed databases across the Internet. The ability to query several remote databases concurrently means that an analyst can now efficiently combine and enrich the data for mining.
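The pattern of querying several databases concurrently and then combining the results can be illustrated independently of JDBC. The sketch below is not Kensington code: it uses two local SQLite files from the Python standard library as stand-ins for remote databases, and the table names, columns and rows are hypothetical.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_db(path, table, rows):
    """Create a small SQLite file standing in for one remote database."""
    conn = sqlite3.connect(path)
    try:
        conn.execute(f"CREATE TABLE {table} (customer_id INTEGER, value TEXT)")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()


def fetch(path, query):
    """Open a connection inside the worker thread and run one query."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()


with tempfile.TemporaryDirectory() as tmp:
    sales_db = os.path.join(tmp, "sales.db")
    demo_db = os.path.join(tmp, "demographics.db")
    make_db(sales_db, "sales", [(1, "widget"), (2, "gadget")])
    make_db(demo_db, "demographics", [(1, "urban"), (2, "rural")])

    # Query both sources concurrently.
    with ThreadPoolExecutor(max_workers=2) as pool:
        sales_job = pool.submit(
            fetch, sales_db, "SELECT * FROM sales ORDER BY customer_id")
        demo_job = pool.submit(
            fetch, demo_db, "SELECT * FROM demographics ORDER BY customer_id")
        sales, demographics = sales_job.result(), demo_job.result()

# Enrich each sales row with the matching demographic value.
region = dict(demographics)
enriched = [(cid, item, region.get(cid)) for cid, item in sales]
print(enriched)
```

Each worker opens its own connection, so no connection object is shared between threads; the join on `customer_id` happens only after both queries have returned.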
Distributed Object Management
The Kensington system adopts a three-tier approach based on the Enterprise JavaBeans (EJB) component architecture, to support data mining in an enterprise environment. The component-based middleware is designed to support scalability and extensibility. Application servers can be transparently distributed for scalability or replicated for increased availability. The system also supports efficient management of resources and multi-tasking capabilities. In an enterprise where resources such as databases and high-performance servers are shared, the Kensington system enables efficient resource management and scheduling.
The data mining procedures that are defined and customised with the Kensington system can be flexibly deployed in the enterprise.
Because a data mining procedure is treated as a graph of components, each component can be scheduled to use appropriate resources. The middleware's management strategy for logical components and physical resources ensures that all facilities are used efficiently.
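Treating a procedure as a graph of components can be sketched with a topological ordering, so that each component runs only after the components it depends on. This is an illustration of the general idea rather than Kensington's scheduler; the component names and functions are hypothetical, and everything runs in a single process.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical components: each receives the outputs of its predecessors.
def load(inputs):
    return [3, 1, 2, None, 5]


def clean(inputs):
    return [x for x in inputs["load"] if x is not None]


def summarize(inputs):
    data = inputs["clean"]
    return {"n": len(data), "mean": sum(data) / len(data)}


COMPONENTS = {"load": load, "clean": clean, "summarize": summarize}

# Edges of the procedure graph: component -> components it depends on.
DEPENDS_ON = {"load": set(), "clean": {"load"}, "summarize": {"clean"}}


def run_procedure(components, depends_on):
    """Execute every component once all of its predecessors have run."""
    outputs = {}
    order = []
    for name in TopologicalSorter(depends_on).static_order():
        inputs = {dep: outputs[dep] for dep in depends_on[name]}
        outputs[name] = components[name](inputs)
        order.append(name)
    return outputs, order


outputs, order = run_procedure(COMPONENTS, DEPENDS_ON)
print(order)
print(outputs["summarize"])
```

A scheduler built on the same graph could hand components whose predecessors have finished to different servers, which is where the resource management described above would come in.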
Groupware, Security and Persistent Objects
In an enterprise where information is often shared within a workgroup, it is important that a data mining system supports the exchange of information in order to enhance productivity. Therefore, the Kensington system enables persistent storage of components so that they may be transparently shared and reused. Important information such as data, defined data mining procedures/templates, or generated decision models is managed as persistent objects, which can easily be exchanged between group members. The system provides strong security for data transfer and model distribution through secure socket communications. Access control mechanisms protect a user's or group's private resources from unauthorised access.
Universal clients - user friendly data mining
For maximum flexibility and easy deployment, client tools are Java applets that run securely in Web browsers anywhere on the Internet. A data analyst is therefore not bound to any specific location or computer.
Effective human-computer interaction is a strong feature of the Kensington system. Based on the visual programming paradigm, the Kensington client provides an integrated workspace for the visual construction of data mining procedures. The workspace includes wizards and templates for database connection, shows the user's view of the persistent object store and provides the data mining task construction area. A data mining procedure is built visually as a connected graph and executed on request. The model or models returned by the mining components can be viewed with appropriate visualisation applets in the client.
Besides various data mining algorithms and data manipulation tools, the client interface also provides Java-based visualisation tools for data and model analysis. Data visualisation allows users to view and manipulate data before it is mined. Complex models, produced from data mining algorithms, are presented as interactive visual objects. The Kensington system provides various 2D graphing tools and a 3D scatter visualiser for data visualisation. A decision tree visualiser, association rule visualiser, and a cluster visualiser are examples of tools used to present mining models to the user.
High Performance Server
An important issue in data mining is the speed and performance of the task. In a competitive business environment where quick and precise decisions are needed, it is essential that a data mining task is performed within a reasonable time. Given the enormous size of data accumulated today, many analysts have turned to high performance computers for a solution. Kensington's middleware serves as a gateway for connecting high performance servers to thin clients and distributed databases. In addition, the system provides several optimised parallel algorithms to support common data mining tasks. These components include data mining algorithms for classification, clustering, association rule analysis and neural networks.
Although parts of the Kensington system are still under development, it has attracted enthusiastic interest from users in application areas ranging from retail information system providers to the food and chemical industries and bio-informatics service providers.
We have applied the system to various real world applications including cluster analysis of the UK National Transport Survey (in collaboration with the University of London Centre for Transport Studies), the classification of large software codes of an international IT consultancy and intrusion analysis of network security, using classification algorithms and association rule discovery.
Completing a Solution for Market Basket Analysis
Scott Cunningham, Srikant Sreedhar, and Bill Smart
Knowledge Discovery Group
Human Interface Technology Center
5 Executive Parkway, N.E.
Atlanta, GA 30329
Golden Books, Inc.
Algorithms for finding rules or affinities between items in a database are well known and well documented in the knowledge discovery community. A prototypical application of such affinity algorithms is in "market basket" analysis: the application of affinity rules to analyzing consumer purchases. Such analyses are of particular importance to the consumer package goods industry. The retailers and wholesalers in this industry generate over 300 billion dollars of sales every year in the United States alone. Despite the economic importance of this industry, data mining solutions to the key business problems have yet to be developed. This position paper discusses some of the problems of the consumer package goods industry, notes a case study of some of the challenges presented to data miners within this industry, and critiques current knowledge discovery research in these areas.
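For readers less familiar with affinity rules, the sketch below computes the two standard measures, support and confidence, for pairs of items. It is a brute-force illustration rather than any of the published algorithms, and the baskets and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets, one set of items per transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def pair_rules(baskets, min_support=0.4, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in b)
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(b), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Share of baskets with lhs that also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Counting every pair in every basket is feasible only for toy data; with hundreds of competing products per category, pruning by support before generating candidates becomes essential.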
The consumer package goods industry exists within a complex economic and informational environment. Mass merchandizing of products is in decline; U.S. consumers are increasing recognized as belonging to fifty (or more) distinct segments, each with its own demographic profile, buying power, product preferences and media access. The items being sold, consumer products, are more diverse than ever before; a single category of food may easily contain hundreds of competing products. Within this highly differentiated environment, strong product brand names continue to offer a strong competitive advantage. By themselves temporary price reductions are not sufficient for establishing consumer loyalty to either a store or product. Consumers are knowledgeable, and mobile, enough to seek out the lowest possible prices for a product. Ultimately consumer value is gained by those retailers able to negotiate favorable terms with their suppliers. Retailers gain the requisite detailed knowledge of customers through the creation of consumer loyalty programs and the use of on-line transaction processing systems; this information about the consumer is a crucial component in retailer-supplier negotations.
Consumer package goods is a mature industry in the United States. Profit is no longer merely a matter of opening more stores, and selling to increasing numbers of consumers; the market is becoming saturated, and the available consumer disposable income largely consumed.
Maintenance of an existing customer base is more important than growing entirely new customers; this new phase of retail growth is based upon selling more and a greater variety of products to pre-existing consumers. The most profitable retailers are those that are able to maintain or reduce their operating costs. Economies of scope, not scale, determine profitability. Data warehousing is one of the foremost technological means of increasing operational efficiency. Efficient consumer response systems, based upon data warehouses, are expected to save the industry $30 billion a year. Category management, an organizational strategy for enhancing retailer-wholesaler coordination, is another means of increasing operational efficiency. In the following two brief case studies we examine how data warehouses, category management, and data mining techniques show promise for answering the concerns of two large consumer package goods companies.
A major international food manufacturer, with significant brand equity and a wide variety of manufactured products, is interested in optimizing its product advertising budget. Like many package goods retailers, this manufacturer has an extensive and rapidly growing advertising budget. Essential to the endeavor is the cooperation of its independent retail outlets in the creation and design of product promotions. The manufacturer sought to create a suite of software tools for the design of promotions, utilizing the newest data mining technology, and to make these tools available in real time to its category managers and to the managers of its retail outlets. The business case suggested that there would be at least three sources of return in the creation of this tool: improved coordination with retailers; more effective cross-sales across product categories; reduced promotional competition from other manufacturers; and enhanced promotional returns. NCR proposed and designed a state-of-the-art neural network for forecasting and optimizing planned promotions. The network met or exceeded industry standards for promotional forecasts (within 15% of actual sales, 85% of the time). Despite the statistical quality of the results, the application was never put into production by the manufacturer; the software design necessary to implement the results was too complex. Part of the application complexity stemmed from the hierarchical data types necessitated by the varied products and markets; another component of the complexity was reconciling the different product world-views of manufacturer and retailer.
A major regional food retailer, a grocer, sought analyses of its consumer transactions within its produce and salad dressing departments. The retailer anticipated improved design of store layouts, improved promotional design, and an insight into the market role of the various highly differentiated products within the category. The retailer clearly anticipated a causal analysis which would reveal the products which, when purchased by consumers, would lead to additional add-on sales of other products. NCR produced a market basket analysis which revealed the distinctive purchasing profiles that are associated with each major brand of interest. The NCR analysis revealed that the best selling brands were not those that resulted in the greatest amount of attendant sales. The NCR analysis supported the existing category management plans by the retailer, and also independently confirmed the results of a demographic panel survey. Despite these successes the market basket analysis, by itself, did not produce any new actionable results for the retailer. In the next section, on data mining, key data mining algorithms and outputs are examined for their suitability for answering these, and other, consumer package goods questions.
Data Mining Solutions
Affinity algorithms are well-understood and well-documented by the data mining community. The quintessential application of affinity algorithms is in the area of market basket analysis. For instance, these algorithms when applied to market basket analysis produce rules such as "Those baskets containing product X are also 75% likely to contain product Y." Additional research has focused on optimizing the speed and efficiency with which these rules are found; however, additional applied research is needed to support the decision-making needs of the consumer goods industry (and other relevant business groups).
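As an illustration of the kind of rule quoted above, the following minimal sketch computes the support and confidence of a single rule "X implies Y" from a list of baskets. The baskets, the product names, and the function name rule_stats are made up for the example; production affinity algorithms search over many candidate rules rather than scoring one.

```python
def rule_stats(baskets, x, y):
    """Return (support, confidence) for the rule 'baskets with x also contain y'."""
    n = len(baskets)
    with_x = [b for b in baskets if x in b]
    with_both = [b for b in with_x if y in b]
    # Support: share of all baskets that contain both products.
    support = len(with_both) / n if n else 0.0
    # Confidence: share of baskets with x that also contain y.
    confidence = len(with_both) / len(with_x) if with_x else 0.0
    return support, confidence


# Hypothetical transactions, one set of products per basket.
baskets = [
    {"lettuce", "dressing", "tomato"},
    {"lettuce", "dressing"},
    {"lettuce", "dressing", "bread"},
    {"lettuce", "bread"},
    {"bread", "milk"},
]

support, confidence = rule_stats(baskets, "lettuce", "dressing")
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

Here three of the four baskets containing lettuce also contain dressing, so the rule has a confidence of 75%. Note that the figure is a point estimate from five baskets and says nothing about causation.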
First, affinity algorithms produce individual, isolated rules; associations between groups of products are not revealed. While the analysis can be repeated across all products in a category, or even a store, the number of rules produced grows exponentially. Not only is this computationally complex, but the resulting welter of rules is hard to interpret as well. Second, the output of affinity algorithms seems to suggest causal relationships between products. Yet the algorithms themselves embody no causal assumptions. The nature of product affinities needs to be reconsidered; either a new and causal form of affinities analysis needs to be produced, or a thorough understanding of non-causal applications and use of affinity rules needs to be obtained. Third, affinity algorithms lack robustness. The algorithms produce a point estimate of affinity; yet retailers need to understand how (and if) these rules apply across larger groups of transactions. A similar issue is the minimum sample size needed to produce robust results. Fourth, market basket analyses carry implicit information about consumer preferences. Even when consumer identification is missing from transaction data, the data can still be grouped or segmented using data mining techniques to reveal distinct groups of consumer preferences. Affinity algorithms implicitly assume that samples are taken from homogeneous groups of customers; yet business knowledge suggests that consumers are highly varied in taste and expenditure. Fifth, the market basket analyses, for some set of business questions, may require the rigor of a properly designed statistical experiment. Reasoning from standard to promotional pricing, as well as reasoning from standard display conditions to promotional display conditions, is unwarranted. Yet much of the potential of market basket analysis stems from the capacity of retailers to manipulate product pricing, display or even attributes to meet consumer need.
Sixth, and finally, standard forecasting tools produce estimates of sales of single goods across time. (This is not conventionally the domain of market basket or affinities analysis.) However, retailers and manufacturers need to have forecasts for whole groups of products. Producing individual product forecasts, and then aggregating, will not produce optimum forecasts since sales of one product contain information about the potential sales of other products; indeed, the forecasts may not even aggregate correctly. Techniques such as "state space analysis", which combine forecasting with multivariate analysis, may prove useful.
The consumer package goods industry is an important, and expansive, industrial segment of the economy. This industry is dependent upon information for its continued economic growth. It is therefore making great progress in collecting large databases of relevant data about its industry. The corresponding questions the industry has about its data are both interesting and economically fruitful. This paper considered two case studies of applying standard data mining techniques to industrial questions in the area of consumer package goods. The examples discussed a manufacturer and a retailer that sought better management of product categories, and a resulting improved economy of scope. Commercial success of data mining will, in part, depend upon the capacity of algorithms to model complex, hierarchical arrangements of goods and products.
Scott Cunningham received a D.Phil. in Science and Technology Policy from the University of Sussex (1997). His thesis research involved using principal components analysis for the automatic classification of text. Scott has consulted for the British and Malaysian governments on matters of science policy and the analysis of large databases of published sciences. Since joining NCR's Human Interface Technology Center he has been working on customer funded research on business applications of data mining. Most recently he has been working on applications of data mining technology to develop intelligent web commerce applications.
Keys to the Commercial Success of Data Mining
Department of Computer Science
Williamstown, MA 01267
I am currently Assistant Professor of Computer Science at Williams College in Williamstown, Massachusetts. I am also employed as a consultant by Bell Atlantic's Science and Technology Center in White Plains, New York. Prior to taking the faculty position at Williams, I was employed by the Bell Atlantic (then NYNEX) Science and Technology Center for four years. At Bell Atlantic I work with Foster Provost and Tom Fawcett, as both a developer and user of data mining technology. I have taken my experience with applications of interest to the telecommunications business and have carried it over to my academic research position, where I focus on machine learning algorithms.
A large class of data mining algorithms have developed out of ideas investigated earlier by researchers and developers of machine learning algorithms. Notable examples include CART, C4.5, neural networks, and Bayesian classifiers, among others. One of the assumptions made by these algorithms, which is carried over into data mining applications, is that of clean data.
All of these algorithms, and others like them, do relax the assumption from its strictest terms. They do not assume perfectly clean data, but rather assume that the data might be noisy. While the ability to handle noise is obviously critical to the successful application of data mining algorithms, the treatment of noise typically falls short of handling the complete problem of data error.
Systematic errors arise in many applications, and they may be due to any of the following:
We have found many examples of these in some of the telecommunications applications we've investigated at NYNEX and Bell Atlantic.
One of these applications is classification of customer-reported telephone problems in the local loop of the telephone network. Problem diagnoses are high level, describing roughly that segment of the local loop where the trouble might be found, so that an appropriate technician might be dispatched to repair the trouble. The diagnoses are: dispatch to the customer's premises; dispatch to the cable; dispatch to the central office; hold for further testing. The data describing the troubles include information about the type of switch to which the customer's line is connected and electrical readings such as voltages and resistances, among others. The data mining problem here is to consider a large database of past troubles and their resolutions, and to develop rules for sending the appropriate technicians out to fix problems that have a certain profile.
The electrical readings that are a large component of the data are obtained via an automated line testing system. The line testing system must be calibrated regularly, but in practice this rarely occurs. As a result, the system becomes miscalibrated, and all readings reported for a set of lines on a given day might be off by a systematic amount. Furthermore, the system's baseline readings can differ from day to day. This source of systematic error is known, but there are no mechanisms in place to handle the error so that it can be eliminated from the data. Given the heavy load handled by the company, it is not clear that careful calibration can become a high priority item. Thus we can expect that the problem will persist.
People can also affect the data in a systematic way. In particular, one source of the diagnoses for troubles is the technicians who fix the problems. They report results using a complex coding system. If a technician has memorized the wrong code to represent the outcome of a repair, it will be wrong consistently. Again, we have a good sense of the source of the problem, but it is not clear that it can be controlled. Also, aside from maintaining a profile of each technician, it is not clear that there is a mechanism that could automatically correct for these errors.
There are a number of different scenarios that arise with respect to systematic data error.
(1) The systematic error is well-understood. In these cases, the data can be "cleaned" and data mining algorithms can be applied to the clean data.
(2) The errors can be reconciled. There are applications in which data may be obtained from several sources. In these cases, it may be possible to retain data that are consistent over the sources. This has the effect of cleaning the data, by making the assumption that the data might have errors but that the errors won't be consistent over the various sources. We found that with the local-loop diagnosis application, we were able to use a variety of data sources to reconcile diagnostic error (though we were not able to account for calibration error).
(3) The data cannot be cleaned. These are cases where the error exists, but cannot be removed from the data. It is important to note that in these cases, the sources of the error might, in fact, be quite well-known, but that additional complications make it difficult to pull the error out of the data.
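Scenario (2) can be sketched in a few lines. The example below keeps a trouble record only when at least two sources report it and all of them agree on the diagnosis. The source names, trouble identifiers, and the min_sources rule are hypothetical choices for illustration, not a description of the procedure used in the local-loop application.

```python
from collections import defaultdict


def reconcile(sources, min_sources=2):
    """Keep troubles whose diagnosis is unanimous across at least min_sources sources.

    sources maps a source name to a dict of trouble_id -> diagnosis.
    """
    labels = defaultdict(set)   # trouble_id -> distinct diagnoses seen
    counts = defaultdict(int)   # trouble_id -> number of sources reporting it
    for records in sources.values():
        for trouble_id, diagnosis in records.items():
            labels[trouble_id].add(diagnosis)
            counts[trouble_id] += 1
    return {
        tid: next(iter(seen))
        for tid, seen in labels.items()
        if len(seen) == 1 and counts[tid] >= min_sources
    }


# Hypothetical sources of diagnoses for the same troubles.
sources = {
    "technician_codes": {"T1": "cable", "T2": "central_office", "T3": "premise"},
    "expert_system": {"T1": "cable", "T2": "cable", "T3": "premise"},
    "followup_calls": {"T1": "cable", "T4": "premise"},
}

clean = reconcile(sources)
print(clean)
```

T2 is dropped because the sources disagree, and T4 is dropped because only one source reports it. The price of this kind of cleaning is a smaller training set, which may itself be biased toward the easy cases.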
One obvious reaction to these situations is to throw up our hands and assume that the application of data mining techniques will provide no useful results. But this reaction is unreasonable.
(1) If the amount of systematic error is small, or if the right algorithm is applied, the impact of the error might be small relative to other gains of the data mining.
(2) Data mining techniques might be useful for helping to identify systematic error, making the process of cleaning one's data a possibility.
(3) There are many applications for which only a small amount of mined information can go a long way to benefiting a company. In these cases, it is not in our best interest as data miners to simply dismiss an application as being "too hard". In the application described above, an improvement of only 1% over the current dispatch procedure could save the company over $3,000,000 annually.
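As one illustration of point (2), a simple screen can flag days on which all readings appear shifted by a common amount. The sketch below compares each day's median reading with the median across days. It assumes that a comparable mix of lines is tested every day, so that a large shift in the daily median suggests miscalibration rather than a real change in the network. The readings, the function name, and both thresholds are invented for the example.

```python
from statistics import median


def flag_offset_days(readings_by_day, k=3.0, min_offset=0.5):
    """Return {day: estimated offset} for days whose median reading looks shifted.

    A day is flagged when its median differs from the median of all daily
    medians by more than max(k * MAD, min_offset).
    """
    day_medians = {day: median(vals) for day, vals in readings_by_day.items()}
    overall = median(day_medians.values())
    # Median absolute deviation of the daily medians: a robust spread estimate.
    mad = median(abs(m - overall) for m in day_medians.values())
    limit = max(k * mad, min_offset)
    return {
        day: m - overall
        for day, m in day_medians.items()
        if abs(m - overall) > limit
    }


# Hypothetical voltage readings; Wednesday is shifted by about 3 volts.
readings = {
    "mon": [48.1, 47.9, 48.0, 48.2],
    "tue": [48.0, 48.3, 47.8, 48.1],
    "wed": [51.0, 50.8, 51.2, 51.1],
    "thu": [47.9, 48.0, 48.2, 48.1],
    "fri": [48.2, 47.9, 48.1, 48.0],
}

for day, offset in flag_offset_days(readings).items():
    print(f"{day}: readings appear shifted by about {offset:+.1f}")
```

A screen like this only identifies candidate days. Whether the estimated offset can safely be subtracted from the readings is a separate question that depends on how the test system drifts.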
More work needs to be done on:
(1) Developing data mining algorithms for cleaning systematic error out of data.
(2) Analyzing the tools we have so that we can determine how they are actually affected by different types of error.
Pointers to my work on data mining and telecommunication applications include:
Danyluk, A. P. and Provost, F. J. (1993). Small Disjuncts in Action: Learning to Diagnose Errors in the Local Loop of the Telephone Network. In Proceedings of the Tenth International Conference on Machine Learning, p. 81-88. San Mateo, CA: Morgan Kaufmann Publishers, Inc.
Danyluk, A. P. and Provost, F. J. (1993). Adaptive Expert Systems: Applying Machine Learning to NYNEX MAX. In Working Notes of the AAAI-93 Workshop on AI in Service and Support: Bridging the Gap Between Research and Applications.
Danyluk, Andrea (1995) A Comparison of Data Sources for Machine Learning in a Telephone Trouble Screening Expert System, in Working Notes of the Workshop on Data Engineering for Inductive Learning: A Workshop at the International Joint Conference on Artificial Intelligence, Montreal, Canada, pp. 1-10.
Merz, C. J., Pazzani, M., and Danyluk, A. P. (1996). Tuning Numeric Parameters to Troubleshoot a Telephone Network Local Loop. IEEE Expert, 11(1), p. 44-49.
Provost, Foster J. and Danyluk, Andrea (1995) Learning from Bad Data, in Aha, D. W. & Riddle, P. J. (eds.) Working Notes for Applying Machine Learning in Practice: A Workshop at the Twelfth International Machine Learning Conference (Technical Report AIC-95-023). Washington, DC: Naval Research Laboratory, Navy Center for Applied Research in Artificial Intelligence, pp. 27-33.
Andrea Danyluk is assistant professor of computer science at Williams College. She received her Ph.D. in computer science from Columbia University in 1992, and was a researcher at NYNEX Science and Technology for a number of years before coming to Williams. Her research interests include the effects of systematic error on induction, time-series problems, and the application of machine learning to real-world problems.
Keys to the Commercial Success of Data Mining Workshop
40 Sylvan Road
Waltham, MA 02254
As data mining and machine learning techniques move from research algorithms to business applications, it is becoming obvious that the acceptance of data mining systems for practical business problems relies heavily on the integration of the data mining system into the business process. Some key dimensions that data mining developers must address include understanding the business process from the end user's perspective, understanding the environment in which the system will be applied, including end users throughout the lifecycle of the development process, and building user confidence in and familiarity with the techniques.
One critical aspect of building a practical and useful system is showing that the techniques can tackle the business problem. Traditionally, machine learning and data mining research areas have used classification accuracy in some form to show that the techniques can predict better than chance. While this is necessary, it is not sufficient to sell data mining systems. The evaluation methods need to more closely resemble how the system will work while in place.
One way to evaluate data mining software more closely in its intended setting is to incorporate time into the evaluation process. Although the time issue makes the prediction scenario more complicated, many data warehouses have data that is time dependent. For example, billing data stores billing, payment, and usage data for each customer indexed by time. As Kurt mentions in "Some thoughts on the current state of data mining software applications", few if any data mining systems deal with the time variable indirectly stored in the data.
Although the time dimension is left out during the prediction process, data mining systems should be evaluated with some aspects of time kept in mind. Researchers and developers can simulate time dependent evaluation by evaluating models on historical data stored in the data warehouse. For example, suppose we are building a model to predict whether a customer will churn in a given month. Suppose we have the data for the independent variables at time t and we make a prediction for person x saying that they will churn. When will x churn? Will they churn immediately, in the next two weeks, in the next month? For what time period should we evaluate the model? An intuitive guess would say that the model is most accurate at predicting churn the closer it occurs in time to the independent data. When does the model predict the same as the background churn rate? While accuracy is a valid evaluation criterion, determining how long the model is valid is also important information. Instead of showing the accuracy of a model as a single number, the accuracy could be shown as a function of time. This information can also be used to compare different data mining techniques. The characteristics of the model should be tested while increasing time to better simulate the data mining software while used in a business process. The accuracy of the model given time and other evaluation criteria should be provided to the end users so they can determine which characteristics are more important to their business task.
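One way to make "accuracy as a function of time" concrete is to compare, period by period, the churn rate among customers the model flagged with the background churn rate among all customers. The ratio (lift) shows how long the prediction stays informative. The sketch below is an assumption-laden illustration: the customers, the churn periods, and the choice of lift as the measure are all invented here, and a real evaluation would use historical warehouse data.

```python
def period_churn_rate(customers, churn_period, k):
    """Churn rate in period k among customers still active at its start."""
    active = [c for c in customers
              if churn_period[c] is None or churn_period[c] >= k]
    if not active:
        return 0.0
    return sum(1 for c in active if churn_period[c] == k) / len(active)


def lift_by_period(predicted, churn_period, n_periods):
    """Return (period, model_rate, background_rate, lift) for each period."""
    everyone = list(churn_period)
    rows = []
    for k in range(1, n_periods + 1):
        model_rate = period_churn_rate(predicted, churn_period, k)
        base_rate = period_churn_rate(everyone, churn_period, k)
        lift = model_rate / base_rate if base_rate else float("nan")
        rows.append((k, model_rate, base_rate, lift))
    return rows


# Hypothetical history: the period (after time t) in which each customer
# churned, or None if the customer never churned. p* were flagged at time t.
churn_period = {f"p{i}": None for i in range(1, 9)}
churn_period.update({f"o{i}": None for i in range(1, 13)})
churn_period.update({"p1": 1, "p2": 1, "p3": 1, "p4": 1, "p5": 2, "p6": 3,
                     "o1": 1, "o2": 2, "o3": 2, "o4": 3, "o5": 3, "o6": 3})
predicted = [f"p{i}" for i in range(1, 9)]

for k, model_rate, base_rate, lift in lift_by_period(predicted, churn_period, 3):
    print(f"period {k}: model {model_rate:.2f}, background {base_rate:.2f}, "
          f"lift {lift:.2f}")
```

In this made-up history the flagged customers churn at twice the background rate in the first period, and by the third period the two rates coincide. That crossover is the point at which the model "predicts the same as the background churn rate".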
The fact that a model's accuracy is a function of time implies that models should be relearned or refreshed after some amount of time. If the accuracy of a model becomes close to random chance after some time, a new model should be learned. The older model and the newer model should be somewhat consistent and similar. For instance, if a model at time t is based on attributes A, B, and C, we would expect the refreshed model to use a similar set of attributes. If they change radically, the models may be overfitting the data or the models may reflect seasonal trends. The end users of the system expect the models to be somewhat consistent. They might lose confidence if the models change radically, because intuitively the radical change may not make sense.
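A crude way to quantify how similar the refreshed model is to the old one is the overlap between the attribute sets the two models use. The sketch below uses the Jaccard index for this, which is our choice for illustration rather than an established consistency measure, and the attribute names are placeholders.

```python
def attribute_overlap(old_attrs, new_attrs):
    """Jaccard index of two attribute sets: 1.0 means identical, 0.0 disjoint."""
    old, new = set(old_attrs), set(new_attrs)
    union = old | new
    return len(old & new) / len(union) if union else 1.0


model_t = {"A", "B", "C"}          # attributes used by the model at time t
refreshed_similar = {"A", "B", "D"}
refreshed_radical = {"X", "Y", "Z"}

print(attribute_overlap(model_t, refreshed_similar))
print(attribute_overlap(model_t, refreshed_radical))
```

A low overlap does not by itself say whether the cause is overfitting or a seasonal trend. It is only a signal that the change deserves an explanation before the new model is shown to end users.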
As stated earlier, the evaluation should reflect the business process it will be applied in. For example, if the churn system is used to identify churners monthly for targeted campaigns, then an interesting question from an end user may be to ask what percentage of churners in month y would the data mining software predict to churn ahead of time? The results of running experiments on historical data to answer this question may give some indication of how often campaigns need to be run to capture a certain percentage of churners. In addition, by noting which customers end up being on the predicted churn list for successive months we also find out more about the consistency of the models. These types of questions come about by interacting with end users and by looking at the task through their perspective.
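The end user's question above can be answered on historical data with a short calculation: take the customers who actually churned in month y and count how many of them appeared on the predicted lists produced in earlier months. The months, customer identifiers, and function name below are hypothetical.

```python
def captured_ahead(actual_churners, earlier_lists):
    """Fraction of actual churners that appear on at least one earlier list."""
    if not actual_churners:
        return 0.0
    flagged = set().union(*earlier_lists) if earlier_lists else set()
    return len(actual_churners & flagged) / len(actual_churners)


# Hypothetical data: who churned in June, and who was flagged in April and May.
june_churners = {"c1", "c2", "c3", "c4", "c5"}
april_list = {"c1", "c4", "c9"}
may_list = {"c1", "c2", "c3", "c8"}

print(captured_ahead(june_churners, [may_list]))              # May campaign only
print(captured_ahead(june_churners, [april_list, may_list]))  # April and May
print(sorted(april_list & may_list))                          # flagged in both months
```

Comparing the two fractions gives a rough sense of what running the campaign every month, rather than every other month, would add. The customers who appear on successive lists are the ones to look at when judging the consistency of the models.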
In summary, we contend that data mining techniques should be evaluated according to the business task. This requires knowledge of the business process and interaction with the end users. Although most traditional evaluation has held time constant, the time variable cannot be forgotten when data mining software is put into the business process. The learned models can be evaluated and compared along the time dimension. By understanding the characteristics of the learned models, developers as well as end users can make more informed decisions.
Piew Datta is a Senior Member of Technical Staff in the Knowledge Discovery in Databases group at GTE Laboratories in Waltham, MA. She received her Ph.D. in Information and Computer Science from the University of California at Irvine. Her research interests in machine learning include clustering, prototype learning, and concept sharing. She has applied various machine learning techniques to Alzheimer's disease classification and churn prediction.
The Hardest Thing is Getting Into People's Heads: Quotable Quotes and Their Implications
Department of Information Systems
Stern School of Business
New York University
The hardest part is getting into people's heads. Getting into people's heads involves a lot more than clear communication. It is the ability to lead people to consider useful aspects of a problem that they would not otherwise. Why is this a challenging problem, and how can we think about addressing it? I present some insights into these questions based on my experience in the financial industry.
In practice, Data Mining is a collaborative theory building exercise. Given limited time and resources, it is important to explore and probe the interesting aspects of the problem as quickly and thoroughly as possible. The theory building exercise can be characterized in terms of two loops, which I call the inner and outer loop. The inner loop is where machine learning is applied. The outer loop is where results are discussed with experts and/or business users, setting the stage for an iteration through the inner loop. I illustrate the difficulties involved in this learning exercise using a number of "quotable quotes" and discuss how and to what extent these can be addressed. Some of these are motivated by research on financial markets, but they apply to a number of other areas.
Patterns emerge before the reasons for them are apparent
Implication: makes it difficult to evaluate new hypotheses
Most new ideas are, on average, poor
Implication: makes us reticent to propose new hypotheses, hence myopia
I would sacrifice performance for understandability
Implication: makes it important to make simplicity an important part of the search criteria
The trend is your friend until it isn't
Implication: makes it hard to come up with simple explanations
Don't confuse me with data
Implication: if results do not fit into a prior conceptual framework, the data may not be perceived as credible or complete.
Don't confuse me with results
Implication: in the early stages of a project when the problem is less well understood, results are more useful in guiding further probing, and not necessarily as usable outputs
Don't confuse brains with a bull market
Implication: if the sample used to construct the model is biased, the results will not be general
The market is like a barometer not a thermometer
Implication: it is more important to understand
A system is only as strong as its weakest link
Implication: Execution, or use of the model, must be the driving force in deciding what types of questions are worth addressing in the first place.
I shall present examples of each of these situations and discuss their implications more fully.
Keys to the Commercial Success of Data Mining
Science and Technology Center
White Plains, NY
Foster Provost and I are in-house data mining experts at Bell Atlantic's Science and Technology Center in White Plains, New York. We serve both as developers and users of data mining technology: we do research in the field but are also responsible for applying the technology to domains of interest to Bell Atlantic.
Because we straddle the line between developer and user, we have a unique perspective on data mining problems. Our combined experience applying data mining technology to many domains over the years has taught us several lessons that are not commonly discussed in the community, by either vendors, researchers or business users. I present three of them below.
1) Before business problems can be solved with data mining, they must be transformed to match existing tools.
Data mining tools perform a small set of basic tasks such as classification, regression and time-series analysis. Rarely is a business problem exactly in one of these forms. Usually it must be transformed into (or rephrased as) one of these basic tasks before a data mining tool can be applied. Often, in order to solve a problem it must be decomposed into a series of basic tasks. Indeed, much of the art of data mining involves the creative decomposition of a problem into a sequence of such sub-tasks that are solvable by existing tools.
For example, our work on cellular phone fraud detection transformed the problem of fraud detection into a sequence of knowledge discovery, regression and classification tasks (mining for indicators of fraud, profiling customer behavior, combining evidence to classify behavior as fraudulent). No single type of task was adequate to solve the problem.
2) Evaluation of data mining results is more complex than either developers or users believe.
Most data mining tools, like the research prototypes from which they were derived, measure performance in terms of accuracy or classification error. A tacit assumption in the use of classification accuracy as an evaluation metric is that the class distribution among examples is constant and relatively balanced.
In the real world this is rarely the case. Classifiers are often used to sift through a large population of normal or uninteresting entities in order to find a relatively small number of unusual ones; for example, looking for fraudulent transactions or checking an assembly line for defective parts. Because the unusual or interesting class is rare within the general population, the class distribution is very skewed.
Evaluation by classification accuracy also assumes equal error costs. In the real world this is unrealistic because classifications lead to actions which have consequences, sometimes grave. Rarely are mistakes evenly weighted in their cost. We have yet to encounter a domain in which they are.
The class skew (as well as error costs) may change over time, after a data mining solution is deployed. Indeed, error costs and class distributions in the field may never be known exactly.
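A small worked example shows why accuracy misleads when classes are skewed and error costs are unequal. The confusion-matrix counts and the two cost figures below are invented for illustration: a population of 10,000 transactions with 1% fraud, a cost of 5 per false alarm and 500 per missed fraud.

```python
def accuracy(tp, fp, tn, fn):
    """Fraction of all cases classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of errors given a cost per false positive and false negative."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical error costs.
COST_FP, COST_FN = 5, 500

# Classifier A labels everything as normal.
a = dict(tp=0, fp=0, tn=9900, fn=100)
# Classifier B catches 80 of 100 frauds at the price of 400 false alarms.
b = dict(tp=80, fp=400, tn=9500, fn=20)

for name, m in (("A", a), ("B", b)):
    acc = accuracy(**m)
    cost = total_cost(m["fp"], m["fn"], COST_FP, COST_FN)
    print(f"classifier {name}: accuracy {acc:.3f}, cost {cost}")
```

Classifier A is 99% accurate and finds no fraud at all. Classifier B is less accurate (95.8%) but, under these assumed costs, its errors cost 12,000 against 50,000 for A. With different costs or a different class skew the ranking could reverse, which is why a single accuracy figure is not enough.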
Unfortunately, the importance and difficulty of evaluation are often not appreciated by business users either. The business user usually knows the general problem to be solved, but may not be able to specify error costs or even advise in their calculation. Sometimes the business user does not know how well current procedures solve the problem, and has no mechanisms in place to evaluate their performance. We are sympathetic to this, since evaluating performance often takes time and effort away from the task itself. However, it makes measuring the efficacy of a data mining solution difficult or impossible.
These recurring difficulties with evaluation have directed our research at Bell Atlantic. We have developed a technique based on ROC analysis that greatly facilitates comparison and evaluation of data mining results. The technique is especially useful when error costs and class distributions are only known approximately, or may change. We now use this technique in all of our work.
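For readers unfamiliar with ROC analysis, the following generic sketch builds the ROC points of a scoring classifier and picks the operating point with the lowest expected cost for a given class prior and pair of error costs. It is a textbook-style illustration only, not the technique referred to above (for which see the papers listed at the end of this paper). The scores, labels, priors, and costs are all made up.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr, threshold) points, from the strictest threshold down."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0, float("inf"))]  # classify nothing as positive
    tp = fp = 0
    i = 0
    while i < len(ranked):
        thr = ranked[i][0]
        # Consume all examples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == thr:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos, thr))
    return points


def best_operating_point(points, p_pos, cost_fp, cost_fn):
    """Pick the ROC point minimizing expected cost per case."""
    def expected_cost(pt):
        fpr, tpr, _ = pt
        return p_pos * (1 - tpr) * cost_fn + (1 - p_pos) * fpr * cost_fp
    return min(points, key=expected_cost)


# Hypothetical classifier scores (higher means more likely fraud) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 0, 0]
points = roc_points(scores, labels)

for p_pos, cost_fp, cost_fn in ((0.5, 1, 1), (0.01, 1, 1), (0.01, 1, 500)):
    fpr, tpr, thr = best_operating_point(points, p_pos, cost_fp, cost_fn)
    print(f"prior={p_pos}, costs=({cost_fp},{cost_fn}): "
          f"threshold {thr}, tpr {tpr:.2f}, fpr {fpr:.2f}")
```

The ROC points themselves do not depend on the class distribution or the costs. Only the choice among them does, so the same curve can be reused when those conditions are known approximately or change after deployment.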
3) Data preparation and data cleaning are more time-consuming and knowledge intensive than is acknowledged.
In our experience, understanding the data, reducing noise and converting the data to an appropriate representation is the most time-consuming part of the data mining process. Furthermore, the process is usually iterative and knowledge intensive: as the project progresses, we learn more about the process that generates the data and we have to go back and re-clean them based on the new knowledge. Although the provider usually has information about the data, we are often the first people ever to analyze the data carefully. We have uncovered errors, idiosyncrasies and artifacts of the data gathering process that were unknown to the provider. These discoveries sometimes end up changing how we approach the data mining task.
Data preparation and cleaning are often tedious, uninteresting tasks. However, over the life of a data mining project, these tasks account for far more time than that taken by applying the machine learning algorithms.
Robust Classification Systems for Imprecise Environments, Foster Provost and Tom Fawcett, To be presented at AAAI-98 (Fifteenth National Conference on Artificial Intelligence)
The Case Against Accuracy Estimation for Comparing Induction Algorithms, Foster Provost, Tom Fawcett and Ron Kohavi, To be presented at ICML-98 (Fifteenth International Conference on Machine Learning)
Analysis and Visualization of Classifier Performance: Comparison under Imprecise Class and Cost Distributions, Foster Provost and Tom Fawcett, Presented at KDD-97 (Third International Conference on Knowledge Discovery and Data Mining) -- Winner of Best Paper Award
Adaptive Fraud Detection, Tom Fawcett and Foster Provost, Published in Journal of Data Mining and Knowledge Discovery, v.1 n.3, 1997
Combining Data Mining and Machine Learning for Effective User Profiling, Tom Fawcett and Foster Provost, Presented at KDD-96 (Second International Conference on Knowledge Discovery and Data Mining)
Insight into Some Commercial Data-Mining Problems
Yizhak Idan and Saharon Rosset
Amdocs (Israel) Ltd.
16 Abba Hillel St. Ramat Gan 52506, Israel
Amdocs has been supplying large-scale information systems to the telecommunications area since 1982. A data processing staff of over 2,800 professionals is fully dedicated to this area. With installations in over 50 of the major telecommunications companies and directory publishers around the globe, Amdocs is a world leader in the development and implementation of Customer Care and Billing systems for providers of telecom services.
Amdocs carries out a major Data Mining and Decision Support activity as part of its R&D Division tasks. Among other things, we have developed simulation and run-time KDD environments that are used by our customers and are integrated in several products and applications, such as Fraud Management and Churn Management.
Following are some interesting points we have encountered, together with some insight into desirable and less desirable solutions.
Consideration of customer value in the data-mining process
One of the most important issues for business-oriented use of data mining is the incorporation of value considerations into the analysis process. Value is a general term that may mean different things in different settings, such as the average monthly revenue from a customer, the number of lines the customer owns, or some other combined measure we would like to consider at a certain point in time. In the context of churn management, some of the tactics and ideas often employed are:
We propose an original approach in which value is integrated into our data-mining algorithm, so that the data partitioning process considers the distribution of value as well as the size of the populations.
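One way to picture value-aware partitioning is to weight each customer by value when scoring a candidate split, so that a segment counts for its revenue rather than its head count. The sketch below compares the same two candidate splits under count weighting and under value weighting, using Gini impurity on the churn label. The attributes, the customer data and the choice of Gini impurity are all assumptions made for illustration; this is not a description of the Amdocs algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Customer:
    young: bool           # hypothetical attribute
    heavy_user: bool      # hypothetical attribute
    monthly_value: float  # e.g. average monthly revenue
    churned: bool


def _weight(rows: List[Customer], use_value: bool) -> float:
    """Segment size: number of customers, or their total value."""
    return sum(c.monthly_value if use_value else 1.0 for c in rows)


def _impurity(rows: List[Customer], use_value: bool) -> float:
    """Gini impurity of the churn label, optionally weighted by value."""
    total = _weight(rows, use_value)
    if total == 0:
        return 0.0
    rate = _weight([c for c in rows if c.churned], use_value) / total
    return 2.0 * rate * (1.0 - rate)


def split_gain(rows: List[Customer], condition: Callable[[Customer], bool],
               use_value: bool) -> float:
    """Impurity reduction from splitting non-empty rows on the condition."""
    total = _weight(rows, use_value)
    parts = ([c for c in rows if condition(c)],
             [c for c in rows if not condition(c)])
    children = sum(_weight(p, use_value) / total * _impurity(p, use_value)
                   for p in parts)
    return _impurity(rows, use_value) - children


# Made-up population: many low-value customers, two high-value ones.
customers = (
    [Customer(True, False, 10.0, True)] * 4
    + [Customer(False, False, 10.0, False)] * 4
    + [Customer(False, True, 200.0, True), Customer(False, False, 200.0, False)]
)


def by_young(c: Customer) -> bool:
    return c.young


def by_heavy_user(c: Customer) -> bool:
    return c.heavy_user


count_prefers_young = (split_gain(customers, by_young, False)
                       > split_gain(customers, by_heavy_user, False))
value_prefers_heavy = (split_gain(customers, by_heavy_user, True)
                       > split_gain(customers, by_young, True))
```

With these toy numbers, counting heads favors the split on `young`, while weighting by value favors the split that isolates the high-value churner.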
Effective incentive allocation
In several applications, data mining is used for analysis followed by a countermeasure. For example, in churn management, the analysis of churning customers will normally result in incentive campaigns. This means that we will offer incentives to valuable customers who are predicted to churn.
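A simple way to make "valuable customers predicted to churn" concrete is to rank customers by expected loss (churn probability times value) and spend a fixed incentive budget from the top down. The sketch below assumes that a churn probability and a monthly value are already available for each customer; the field names, the flat incentive cost and the budget are hypothetical.

```python
from typing import List, NamedTuple


class ScoredCustomer(NamedTuple):
    customer_id: str
    churn_probability: float  # output of some churn model (assumed given)
    monthly_value: float      # chosen value measure (assumed given)


def select_for_incentive(customers: List[ScoredCustomer],
                         incentive_cost: float,
                         budget: float) -> List[str]:
    """Choose customers by descending expected loss until the budget runs out."""
    ranked = sorted(customers,
                    key=lambda c: c.churn_probability * c.monthly_value,
                    reverse=True)
    chosen: List[str] = []
    spent = 0.0
    for c in ranked:
        expected_loss = c.churn_probability * c.monthly_value
        if expected_loss <= incentive_cost:
            break  # the incentive would cost more than the expected loss
        if spent + incentive_cost > budget:
            break  # budget exhausted
        chosen.append(c.customer_id)
        spent += incentive_cost
    return chosen


scored = [
    ScoredCustomer("a", 0.9, 20.0),    # likely to churn, low value
    ScoredCustomer("b", 0.3, 200.0),   # moderate risk, high value
    ScoredCustomer("c", 0.8, 100.0),   # high risk, medium value
    ScoredCustomer("d", 0.05, 500.0),  # unlikely to churn, very high value
]
targets = select_for_incentive(scored, incentive_cost=20.0, budget=50.0)
```

Note that the customer most likely to churn ("a") is not selected here, because the value at risk is lower than the cost of the incentive.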
There are two main areas of interaction between the incentive component and the data-mining component in such an application: the attribution of incentives to population segments and the measurement of their effects in future analysis.
We propose to use the data-mining results (in our implementation, induced rules) for incentive allocation. The generated data-mining rules and their related customer segments can be useful symptom descriptors for matching effective incentives. For example, following a rule stating that young customers in a certain area are disconnecting in large numbers, the analyst may design a campaign that offers an attractive price plan to customers with a young-customer usage profile through aggressively targeted media in this area.
Next, we address two questions: how to account for the attribution of incentives in future data mining, and how to evaluate their effectiveness a posteriori.
Incorporation of external events into the data
A churn prediction model would usually be constructed from data extracted from the corporate data warehouse, such as usage history and trends (number of calls, duration, services used, destinations, etc.) and socio-demographic data (income, city, education, profession, etc.). At the same time, there are some implicit one-off events, past or present (for example, a competitor launching a promotion campaign, or a major financial crisis in a certain area), which may have a major effect on the behavior of the customers. This effect may be different for different segments of the customer population, depending on geographical area, usage patterns, etc. Ignoring these events may lead to wrong prediction models. Therefore, any successful analysis cannot ignore their existence and must incorporate them into the model. In general, we distinguish three relevant approaches to this issue:
Letting Business Users Loose
George H. John, Ph.D.
2300 Geng Road, Suite 200
Palo Alto, CA 94303
In the commercial world, the term data mining has become associated with large, multi-hundred-thousand dollar projects taking several months and requiring the skills of PhD data miners with years of experience. A business manager wanting to use data mining to gain some competitive advantage might embark upon a large project and work with skilled data miners. But this can be cumbersome: large projects require many rounds of review, assessment of the set of vendors and consultants who offer products and services, project scoping, possibly buying new hardware, working with their IT department, etc. It's quite an obstacle.
If this situation is one point on a spectrum, another point at the opposite end would be where applications already exist within the company that would allow a businessperson to mine their data easily. Our business manager might just point her web browser at the site that contains the data and applications for analysis, look around for a while and discover, perhaps, that the jump in her group's expenses this quarter is primarily attributable to travel and hiring expenses, or that the majority of her lost customers were affluent single men who lived close to her competitor's store.
Giving power tools to novice business users can be a recipe for disaster -- what if they're pointing the tools at data they don't fully understand? What if they build a predictive model and accidentally include input variables that weren't all known before the action the model suggests needs to be taken? What if their data isn't clean? And so forth. An experienced data miner is someone who has learned to use considerable caution, and knows how to avoid pitfalls and traps.
But does this mean we can't give "data mining" applications to business users? What if instead of just putting a nice GUI on top of a powerful data mining tool like a neural network and calling it easy to use, we built simpler analysis methods, whose results were easy to understand? What if the results were phrased as statements about the existing data, not predictions about future data that could be misconstrued? What if, instead of a raw tool that could be turned on any data they happened to have lying around, they had applications that were integrated with a datamart that was constantly updated with sales and other data sources, and analyses were restricted to use only data that was known to be clean? These "analysis methods" might look preposterously simple to expert data miners, yet they might be very useful to business users.
A robotics professor once asked his class to design a robot to wash dishes. The inventions that were turned in stretched the technology to its limits: micro-sensors and force feedback control to avoid breaking the dishes, vision systems to see the dirt, and joints with many degrees of freedom to allow reaching into the sink, wiping the dish, and setting it into the drying rack.
The next day, the professor pointed out that dishwashing machines already exist, are relatively cheap to manufacture, and work quite well.
Large data mining projects certainly have their place: in some cases, small improvements in a model can result in huge savings, making the investment easily worthwhile. In other cases, a large project may not be explored due to lack of resources. If no applications exist to allow an executive or manager to look at their business and their data, there may be a huge missed opportunity cost from their lack of understanding of their business.
Benefits of a Standard Data Mining Process Model
Randy Kerber and Jens Hejlesen
Human Interface Technology Center
5 Executive Parkway, N.E.
Atlanta, GA 30329
Data mining success stories have triggered increased interest within the business community, particularly in large corporations with vast stores of data about their customers and business operations. Their interest appears to be following a path similar to that of early research in machine learning: the tendency to view data mining as the isolated application of a data mining algorithm to a pre-existing dataset, where the key determinant of success is selecting (or creating) the "best" model-building algorithm.
As businesses continue to use data mining technology, they are likely to discover, as experienced practitioners and researchers already have, that:
While business customers would eventually learn these lessons through experience, it is hazardous for the health of the industry to allow this tool-centric focus to continue. At this early stage, too many disaster stories would be fatal to the field. Tool vendors have found it effective to sell tools based on the premise that they are like golden geese: feed in your data at one end and golden nuggets of knowledge will magically come out the other. However, when someone's goose produces the wrong kind of nuggets, their first impulse is not to question whether they are properly skilled in raising golden geese or providing the right food. They will scream "it's a hoax; data mining doesn't work". If so, data mining will be discredited like other technologies that were over-hyped and failed to meet the inflated expectations.
For the data mining industry to prosper on a wide scale, it is necessary to create the perception that "data mining works". It is fine if the perception is "data mining works if done properly". Of course, then the obvious question is "how?" For acceptance to spread beyond the "early adopters", customers must feel confident that they will know how to manage a data mining project to ensure success. A major part of the comfort level is to understand what the stages of a project are, what issues will need to be dealt with, what tasks will need to be performed, etc. Having a process model goes a long way towards creating this initial comfort level. Customers would not feel as if they are wandering into completely uncharted territory. They realize they will encounter many difficult situations that will need to be dealt with, but at least they know what they are likely to encounter. A service provider that can communicate a convincing process will have a big advantage over one that cannot.
The existence of a process model is a big improvement over the situation without one. However, a great deal of confusion will result if a customer is presented with several different process models. It might be that in reality they are very similar. However, they might sound quite different to a customer who is not a data mining expert, and is then presented with the dilemma of trying to decide if one is better than the other. Faced with such a situation, the conservative customer may well delay a data mining project, preferring to wait until the picture becomes clearer.
Contrast this with the situation where nearly every prospective service provider describes the same, standard, data mining process model. This removes a major obstacle to the decision to do a project. The customer is still confronted with which service provider to hire, but he/she will no longer need to get tangled up in confusing arguments about which process models are really different and which is best. The key issue is that if a sufficient comfort level is achieved, the customer will be much more willing to proceed with the project.
Benefits of a Common Process Model
By adopting and promoting a common view of the data mining process, the data mining community would benefit in a number of ways:
However, different groups are affected differently by a standard model. The major communities, and their expected relation to a process model, are discussed below.
For tool vendors the key question is the market's verdict on the value of data mining. If the perception is positive, customers will buy lots of tools. Of course, some vendors will succeed more than others. Naturally, the prospective market for tools is significantly larger if the belief of the market is that data mining can be successfully performed by less experienced people. However, if the verdict is that data mining is another over-hyped hoax, the market will shrink, and no one will sell many tools. In this case, tool vendors will suffer the fate of any company whose habitat disappears: most will become extinct, others will somehow find a way to adapt to a new environment. The conflict for tool vendors is that there is short-term benefit to downplaying the amount of effort and skill needed to achieve high-quality results.
Service providers have a different conflict. Certainly, they would benefit if the perception becomes "Data mining can produce great rewards, but only if performed by experts who know what they're doing", i.e., data mining service providers. In this situation, a service provider with a high quality, proprietary process has a distinct competitive advantage. However, prospective customers confronted with arguments over conflicting process descriptions might decide to stay out of the market. Adoption of a common view of the data mining process should increase the total market for data mining services, though it might be harder to differentiate your offering. In such a world, a proprietary process model could turn from an asset into a liability; prospective customers will question why your process is different from the industry consensus. It will be up to the service provider to justify excluding a standard task or to explain the added benefit of additional tasks.
Less experienced users would probably be most eager to embrace a process model, for the guidance it would provide. At first glance, experienced modelers might view a standard process model as a threat, because it would provide greater benefit to less experienced practitioners. However, the greater the demand for data mining services overall, the greater will be the demand for experts. In addition, the existence of a widely known model should make it much easier for analysts to communicate what they are doing to a client. Moreover, the client is much less likely to question the analyst about the necessity of tasks that are described in a standard process.
Probably the most enthusiastic advocates of a standard process model. These are the poor folks who, despite limited technical understanding of data mining, must somehow sift through conflicting definitions and marketing claims and decide how their organization will use data mining. A common version of the data mining process provides them with a framework for structuring their projects and for evaluating tool and service offerings.
CRISP-DM (CRoss Industry Standard Process for Data Mining) is a collaborative effort of the CRISP-DM consortium (consisting of NCR, ISL, DaimlerBenz, and OHRA), whose goal is to develop a standard, open (publicly available and non-proprietary) process model for data mining. The CRISP-DM project is partially funded by the European Commission, with the objective of increasing the use and effectiveness of data mining by European companies.
The CRISP-DM standard is not meant to be a formal standard, in the sense of attempting to enforce a single, restricted method for performing data mining. Rather, it is intended to be a "de facto" standard, serving as a common reference for all parties to discuss data mining projects. It is intended to lean more towards describing what general tasks should be performed rather than how they should be done. The developed model is intended to be industry neutral and tool neutral.
The process model is being developed through the collaborative efforts of the consortium members, combined with feedback obtained from the data mining community through the CRISP SIG (Special Interest Group) and a series of workshops. Workshops were held in Amsterdam in November 1997 and in London in May 1998. The next workshop is scheduled to be held in the U.S. sometime in 1998. The process model will also be validated against data mining applications conducted at DaimlerBenz and OHRA. For more information on the CRISP project, including how to become a member of the CRISP SIG and obtain the latest version of the CRISP process model, see http://www.ncr.dk/CRISP.
Adoption of a common view of the data mining process would improve the likelihood of a large market for data mining products and services. Existence of a standard process increases the comfort level of cautious buyers, making it easier for companies to sell their products and services. Knowledgeable customers and users increase the number, and percentage, of successful projects, enhancing the perception of the value of data mining. The goal of the CRISP-DM project is to increase the use of data mining by creating and disseminating a standard data mining process model.
Randy Kerber is a Knowledge Discovery consultant and project leader with the Knowledge Discovery Group at NCR's Human Interface Technology Center, and a member of the CRISP-DM project team. He has used S-Plus, Clementine, SAS, Enterprise Miner, KnowledgeSeeker, Statistica, Predict, and Recon for retail (sales forecasting and promotion planning), telecommunications (target marketing), and stock market (portfolio selection) applications. While previously employed as a research scientist at Lockheed-Martin, he designed and implemented data mining algorithms for Recon, one of the first commercial data mining systems.
Jens Hejlesen (M.Sc. in Computer Science and Mathematics, 1982) is a project leader and a senior standards consultant for NCR, and the project leader of the CRISP-DM project. Mr. Hejlesen has used Clementine on a customer churn project within the insurance industry and SPSS on various applications.
Hunching, not crunching: data mining and the business user
Tom Khabaza and Jason Mallen
Integral Solutions Limited
630 Freedom Business Center, Suite 314
King of Prussia
PA 19406, USA
Data Mining and Business Knowledge
Data mining is about finding useful patterns in data. This word "useful" can be unpacked to expose many of the key properties of successful data mining.
The patterns discovered by data mining are useful because they extend existing business knowledge in useful ways. But new business knowledge is not created "in a vacuum"; it builds on existing business knowledge, and this existing knowledge is in the mind of the business expert. The business expert therefore plays a critical role in data mining, both as an essential source of input (business knowledge) and as the consumer of the results of data mining.
The business expert uses the results of data mining but also evaluates them, and this evaluation should be a continual source of guidance for the data mining process. Data mining can reveal patterns in data, but only the business expert can judge their usefulness. It is important to remember that the data is not the business, but only a dim reflection of it. This gap, between the data and the business reality it represents, we call "the chasm of representation" to emphasise the effort needed to cross it.
Patterns found in the data may fail to be useful for many different reasons. They may reflect properties of the data which do not represent reality at all, for example where an artefact of data collection, such as the time a snapshot is taken, distorts its reflection of the business. Alternatively the patterns found may be true reflections of the business, but merely describe the problem which data mining was intended to solve - for example arriving at the conclusion that "purchasers of this product have high incomes" in a project to market the product to a broader range of income groups. Finally, patterns may be a true and pertinent reflection of the business, but nevertheless merely repeat "truisms" about the business, already well known to those within it.
It is all too easy for data mining which is insufficiently informed by business knowledge to produce useless results for reasons like the above. To prevent this the business expert must be at the very heart of the data mining process, spotting "false starts" before they consume significant effort. The expert must either literally "sit with" the data miner, or actually perform the data mining. In either case, the close involvement of the business expert has far-reaching consequences for the field of data mining.
Data Mining as a Non-technical Process
Business experts are seldom also technical experts, and their deep involvement in data mining has a fundamental effect on its character. The process of data mining is one in which the business expert interprets the data, a simple extension of ordinary learning by experience. In such a framework, technology must as far as possible remain hidden, while revealing the patterns in the data.
The organisers of this workshop rightly state that successful business data mining does not come down to "hot algorithms". Equally irrelevant to the core process of data mining are database support, application integration, business templates, and scalability; data mining tools may usefully have such attributes but they are essentially technological properties. The business user must be able to approach data mining as a window on the business, and engage with the data without the distraction of technological detail.
Data Mining Tools
The requirement of data mining to be accessible to business experts also shapes the requirements for data mining tools. These end-user oriented requirements can be described in many different ways, but here we focus on three key properties: data mining tools must be interactive, incremental and iterative.
Interactive: Modern "desktop" applications are highly interactive as a matter of course, but here we focus on a deeper interpretation of "interactive": the user must be enabled to interact with the data, and not just with the technology. The user interface of data mining tools should be designed to highlight the properties of data, and play down the details of technology, whether that technology be database links, efficient indexing, visualisation display parameters or machine learning algorithms.
Incremental: The data mining process is incremental; each successive investigation builds on the results of the previous one; thus the principle of learning from experience applies not just to the data mining exercise as a whole but also to each step within it. Data mining tools must be designed to encourage this re-use of results as the data miner, in a step-by-step manner, builds up a picture of the patterns in the data. This means that data mining tools must be highly integrated; query must lead naturally to visualisation, visualisation to data transformation and modelling, and modelling to visualisation or further queries. These transitions are merely examples; overall the process must appear seamless, with the effective methods of investigation at any point being also the most obvious, and without the intervention of technological barriers or distractions.
Iterative: Data mining is seldom a simple linear process; successive steps not only build on one another's results, but also refine the approach of earlier steps. For example the results of modelling may show that the data should be further refined and the modelling repeated, or may point to areas for closer examination in an earlier data exploration phase. Any result may point to earlier steps, refining not only the data but also the process itself; each step also has the potential to open up entirely new avenues of enquiry. It should be emphasised that the process is not organised into discrete steps concerned with different types of knowledge; rather the discovery of detailed properties of the data proceeds alongside a gradual refinement of the business concepts involved, and the unfolding of key patterns to be utilised.
The iterative nature of data mining is apparent at a variety of levels. At the detailed level, a modelling process may be repeated many times (and gradually transformed), for example in the space of a day. Many models are built over this time, and each contributes a small "nugget" of knowledge to the overall process; we might call these "throw-away" models: they are formed to be read, digested and then thrown away. At the overall project level, the data mining process is also iterative, and should, for a project of significant duration, contain "planned in" iterations for the production of improved models or other results.
Data mining tools must be designed to support this iterative property of the data mining process. The requirements here are similar but not identical to those relating to the "incremental" property. Data mining operations, once configured, must not be "set in stone" - they should be designed to be refined in the light of subsequent events.
Summary and conclusions: Hunching, not crunching
In summary, the data mining process must be driven by those with expert knowledge of the domain. This has many implications for the process and for the tools which support it: the process must be thoroughly domain-oriented rather than technically oriented, and the tools must support an interactive, incremental and iterative style of work.
Data mining, because of its interactivity and domain orientation, has sometimes been described as a "hunch machine". The key to commercial success in data mining is hunching, not crunching.
Authors' note: The principles outlined in this paper are based on experience of consultancy; the tool requirements are based on the Clementine data mining system.
Data Mining with MineSet:
What Worked, What Did Not, and What Might
MineSet Engineering Manager
Silicon Graphics, Inc. M/S 8U-876
2011 N. Shoreline Blvd
Mountain View, CA 94043-1389
At Silicon Graphics, Inc., we have developed a data mining and visualization product called MineSet™ (Silicon Graphics 1998, Brunk, Kelly & Kohavi 1997). MineSet was first released in early 1996, mostly as a visualization product, and became a full data mining and visualization product late that year, with several data mining algorithms based on MLC++ (Kohavi, Sommerfield & Dougherty 1997).
The engineering effort in product development is estimated at over 55 person years, with the engineering team consisting of 17 people right now. During our development we interacted with dozens of customers, taught data mining courses, and used our product on internal databases in our company.
In this paper we detail some things that worked well, some things that did not work as well as we hoped, and some thoughts about the future.
What Worked Well
Below we detail the things that we felt worked well for MineSet.
System Architecture. MineSet was designed as a client-server architecture, as shown in Figure 1.
The visualizations are shown on the client while the CPU-heavy analytics and I/O-heavy data transformations are done on the server. The database can reside on the server or in a third tier.
The architectural design worked well because it scales to large datasets. On small datasets, both client and server modules run on the same machine. Unlike products that demo well on 10,000 records but do not scale to large datasets, a client-server architecture provides a growth path from a demo or pilot to more serious knowledge discovery scenarios. While large servers today have tens of gigabytes of memory and hundreds or thousands of disks, client machines (PCs and small workstations) lag by two orders of magnitude. Mining large databases can only be effectively done on a server machine. It also helps to have the database local to the server to avoid long communication delays.
Two Development Libraries. MineSet was built on top of two libraries: a visualization library that is based on Inventor/OpenGL and MLC++ for the analytical algorithms. Unlike products that provide isolated tools that were developed independently, MineSet feels like a consistent integrated product where behaviors are similar across the tools, the code reliability can be kept high, and maintenance costs low. The MLC++ library consists of only 102,000 lines of code and 40,000 lines of testers to ensure correctness.
Direct Visualization tools. MineSet provides users with direct visualization tools, such as scatterplots, map visualizers, hierarchy visualizers, and statistics visualizers. Figure 2 shows refinancing costs for counties in the US. A picture is worth 1,000 words or, in this case, over 3,000 counties. The human perception system can identify anomalies and patterns much faster in a representative landscape than in a spreadsheet. The visual tools also provide animation capabilities to see trends over time (or other variables).
Figure 2: Refinancing costs (mapped to height) for every county based on FIPS codes. Deviations from each state's average are colored from blue (zero deviation) to yellow (0.005) to red (0.01).
Comprehensible Models and search abilities. MineSet provides analytical algorithms that build models we have found ways to show visually. Users can interpret the models and interact with them using what-if scenarios. In many cases insight is derived from looking at the model and seeing an interesting pattern or rule. Interesting insight may be derived from a model even if it is inaccurate when used in predictions. In some cases the visualization helps users see their data problems or incorrect assumptions they made. When large datasets are used, models tend to be large. When a 50,000-node decision tree is built, no one will look at every node. MineSet therefore provides search and filtering abilities on the models. For example, users can search for nodes that have at least some percentage of a given class and at least a minimum level of support in the training set. MineSet also uses well-known and understood algorithms and not hyped proprietary algorithms that no one but the company's founder can understand.
Drill-down/drill-through. Several MineSet tools provide drill-down abilities. Figure 3 shows a drill-down to two specific states. MineSet allows drilling down to records that make up a visual object, i.e., by pointing to a state in a map or a node in a tree, the original records can be shown. Alternatively, that subset of the population can be sent to another tool.
Figure 3: Refinancing costs for Oregon (left) and NY (right). Differences between neighboring counties are on the order of hundreds of dollars: enough to deter some people from refinancing.
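The kind of model search mentioned under "Comprehensible Models and search abilities" (nodes with at least some percentage of a given class and a minimum level of support) can be pictured with a small filter over tree nodes. The sketch below is only an illustration under assumed data structures: the `TreeNode` class, the function name and the class labels are hypothetical and do not reflect MineSet's internal representation or interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TreeNode:
    node_id: int
    # Number of training records of each class that reach this node.
    class_counts: Dict[str, int]


def search_nodes(nodes: List[TreeNode], target_class: str,
                 min_fraction: float, min_support: int) -> List[int]:
    """Return ids of nodes with enough records and enough of the target class."""
    matches: List[int] = []
    for node in nodes:
        support = sum(node.class_counts.values())
        if support == 0 or support < min_support:
            continue  # too few training records to be interesting
        fraction = node.class_counts.get(target_class, 0) / support
        if fraction >= min_fraction:
            matches.append(node.node_id)
    return matches


# Made-up nodes from a hypothetical tree predicting who refinances.
tree_nodes = [
    TreeNode(1, {"refinance": 900, "no_refinance": 100}),  # pure and well supported
    TreeNode(2, {"refinance": 9, "no_refinance": 1}),      # pure but tiny
    TreeNode(3, {"refinance": 300, "no_refinance": 700}),  # large but mixed
]
interesting = search_nodes(tree_nodes, "refinance", min_fraction=0.8, min_support=50)
```

Requiring both conditions keeps small, accidentally pure nodes (node 2 here) out of the results, which matters when a tree has tens of thousands of nodes.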
Defaults everywhere. MineSet provides defaults for most options. Users can get started by choosing an algorithm and hitting the "Go!" button. As users gain experience, they may set defaults differently, but the learning curve is less steep when reasonable defaults are automatically chosen.
What Did Not Work or is Missing
Below we detail that things that we felt did not work as well as we hoped or that are missing. Many of the missing features require significant research.
Data Transformations This is the hardest part of the KDD process: getting the data into a minable form. MineSet supports basic transformations such as adding columns, removing columns, sampling, and even transposes. It is common to see users doing dozens of transformations inside MineSet. However, in many projects, especially in early stages, there still needs to be a Perl script writer or an SQL writer. This area is not well understood and commercial Query Builders do not support needed operations.
KDD project management MineSet provides session management and allows saving and restoring sessions. However, there is no system for keeping track of the overall KDD process (Brachman & Anand 1996). Users end up with dozens of models and session files.
Meta Knowledge MineSet has no meta-knowledge database. In many cases processes could be simplified by knowing basic properties of data, goals, and tasks.
Database loads MineSet has one of the best integrations with commercial database systems with native interfaces to Oracle, Sybase, and Informix. Sadly, database systems today are optimized for OLTP (On-Line Transaction Processing) and are not geared for reading a whole database quickly. Moreover, we are able to store files in a fast binary format that is faster to load (e.g., strings are already pre-hashed). For large files, reading time may be significant and our binary files are faster by a factor of three. Note that the integration with a database is provided, but users typically dump the database to a flat file and continue from there.
APIs MineSet provides the ability to launch operations in batch mode and through configuration files, but does not supply a full API (Application Programming Interface). This has limited the ability of system integrators to tailor MineSet to their needs.
Deployment There are two types of deployments: the model and the knowledge. The model is relatively easy to deploy in MineSet. Sharing knowledge is harder. How does one share discoveries with others in the organization? Does everyone in the organization need to install MineSet? We wrote some prototype visualizations in Java but they were too slow. We have some prototypes in VRML but they are not general and are also slow. VRML is not the answer for all visualizations (Akeley 1998). This remains an open problem.
SGI MineSet runs only on Silicon Graphics hardware. This was a restriction because it limited our market (but increased hardware sales). We are now porting MineSet to other platforms, starting with Windows.
Time Series All the analytical tools in MineSet are based on the standard assumption that records are independently and identically distributed (i.i.d.). Many datasets have a time attribute that is special. Can we utilize this attribute more effectively?
Attribute types: Text, pictures MineSet recognizes the basic types: strings, integers, floats, bins, dates. Long strings are better handled with text mining algorithms. Can we mine text?
Below we describe some issues that are open and highly debatable.
Scaling out of Core Most data mining algorithms (including those in MineSet) are implemented to run in memory and degrade ungracefully if the data and data structures do not fit in memory. (A notable exception is SPRINT (Shafer, Agrawal & Mehta 1996).) With the growth of operating systems and apps to 64-bit, we see little need for scaling out of core (but we do see the need to generate 64-bit versions of existing algorithms). Memory prices are dropping and with them systems are being built with large memories. Large commercial systems are now seen with dozens of gigabytes of memory.
Time-wise, the difference between running in core and out of core is about two orders of magnitude and getting worse (algorithms need to do multiple scans but disk read times are not improving as fast as memory sizes). Thus, if mining a few gigabytes in core takes a few hours today, mining that same size out of core may take five days. We believe users will not tolerate this.
Sampling The law of large numbers works. Sampling is a fine solution for initial data exploration. Significant patterns of interest will show up in reasonably sized samples. Only in rare cases will users wish to see patterns with minuscule support. In those cases it is more likely that users will start the mining on a small filtered sample (which may be the result of a previous drill-down operation).
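The sampling argument can be illustrated with a small simulation. The sketch below is illustrative only: the data are synthetic, and the helper name `support`, the 5% pattern rate, and the sample size are assumptions made for the example, not part of MineSet.

```python
import random

def support(records, predicate):
    """Fraction of records for which the predicate (the 'pattern') holds."""
    return sum(1 for r in records if predicate(r)) / len(records)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population: each record is True with probability 0.05,
# i.e. the pattern has roughly 5% support in the full data.
population = [rng.random() < 0.05 for _ in range(200_000)]

# A 5% simple random sample, drawn without replacement.
sample = rng.sample(population, 10_000)

full_support = support(population, lambda r: r)
sample_support = support(sample, lambda r: r)

print(f"full data: {full_support:.4f}  sample: {sample_support:.4f}")
```

With a sample of this size the standard error of the estimate is roughly 0.002, so a pattern with 5% support is clearly visible. A pattern with support far below that would need a larger or filtered sample, as the text notes.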
No Business Case for Software Only DM Companies There is no clear business case for a software-only data mining company today. This will change once the market grows and the question is how many of the 75 data mining companies that exist today will exist in a few years.
With all the hype surrounding data mining, it is hard to make the business case for a profitable company that sells a horizontal data mining software tool such as MineSet today. The Meta Group estimated the size of the data mining market by year 2000 will be $8.4 billion (Meta Group 1997) with 8% of the market share related to macromining tools such as MineSet ($655 million). The estimates seem to be very optimistic now that two years have passed. Recent estimates for the data mining software market in 1997 are around $50 million. MineSet exists at Silicon Graphics because it leverages a lot of hardware. Other companies are doing system integration or several types of professional services/consulting.
Anytime algorithms Sometimes a model is completely wrong and users can quickly tell that. For example, a very common case is to have a perfect or near perfect predictor for the label in a classification problem. When you look for customers that churned, the root of the decision tree may be customer number (if it is zero, they churned). When we tried to classify large versus small sales at Silicon Graphics, we found that the root of the decision tree was amount of tax paid. Significant time was spent waiting for the complete tree to be built, only to realize that we had a near perfect predictor that we needed to remove. Anytime algorithms (Boddy & Dean 1989) show some results quickly and then improve. The idea is an extension of the "effort knob" mentioned in Thearling (1998).
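One cheap first result, in the spirit of showing something quickly, is to score every attribute as a one-level rule before any full tree is built. The sketch below is not an anytime algorithm in the sense of Boddy & Dean, and it is not how MineSet works; it is a hypothetical pre-check, and the function names, the toy data, and the 0.99 threshold are all assumptions.

```python
from collections import Counter, defaultdict

def one_attribute_accuracy(rows, attr, label):
    """Training accuracy of a one-level rule: for each value of `attr`,
    predict the most common label seen with that value."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[attr]][row[label]] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(rows)

def suspicious_attributes(rows, label, threshold=0.99):
    """Attributes whose one-level rule is nearly perfect on the training data."""
    attrs = [a for a in rows[0] if a != label]
    scores = {a: one_attribute_accuracy(rows, a, label) for a in attrs}
    return {a: s for a, s in scores.items() if s >= threshold}

# Toy data (hypothetical): customer_number was zeroed for churned customers.
rows = [
    {"customer_number": 0, "plan": "basic", "churned": True},
    {"customer_number": 0, "plan": "gold", "churned": True},
    {"customer_number": 17, "plan": "basic", "churned": False},
    {"customer_number": 42, "plan": "gold", "churned": False},
    {"customer_number": 58, "plan": "basic", "churned": False},
    {"customer_number": 63, "plan": "gold", "churned": False},
]
flagged = suspicious_attributes(rows, "churned")
print(flagged)
```

Any attribute with unique values per record, such as an identifier, will also score perfectly under this check. That is acceptable here, since such attributes deserve a look before mining in any case.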
Process Wizards As data mining will be available to more business users, we need to provide more help on the processes and tasks. This can be achieved by developing wizards, templates, or building custom applications where the data mining component is hidden.
Interfaces to the rest of the process Mining is only one part of the KDD process. The data mining tools need to provide tighter integration with data cleansing tools, reporting tools (e.g., OLAP tools), and post processing (e.g., campaign management). In some cases, the data mining could serve to help these other parts. Specifically, data cleansing tools could benefit from tighter integration with data mining tools.
We have discussed what works well in MineSet, what did not work as well, and some thoughts.
Howard Frank from DARPA said that predicting the future is easy; getting it right is the hard part. We predict that in the next couple of years, many data mining companies will go bankrupt or change their business model to be more solution oriented rather than focusing on horizontal technology. Recent examples of this trend include DataMind and HyperParallel. We believe that horizontal products like MineSet can only exist in large companies where the product leverages other parts of the business (consulting, hardware, database sales).
Akeley, K. (1998), Riding the wave. http://www.sgi.com/developers/marketing/forums/akeley.html.
Boddy, M. & Dean, T. (1989), Solving time-dependent planning problems, in N. S. Sridharan, ed., `Proceedings of the Eleventh International Joint Conference on Artificial Intelligence', Vol. 2, Morgan Kaufmann Publishers, Inc., pp. 979-984.
Brachman, R. J. & Anand, T. (1996), The process of knowledge discovery in databases, in `Advances in Knowledge Discovery and Data Mining', AAAI Press and the MIT Press, chapter 2, pp. 37-57.
Brunk, C., Kelly, J. & Kohavi, R. (1997), MineSet: an integrated system for data mining, in D. Heckerman, H. Mannila, D. Pregibon & R. Uthurusamy, eds, `Proceedings of the third international conference on Knowledge Discovery and Data Mining', AAAI Press, pp. 135-138. http://www.sgi.com/Products/software/MineSet.
Kohavi, R., Sommerfield, D. & Dougherty, J. (1997), `Data mining using MLC++ : A machine learning library in C++', International Journal on Artificial Intelligence Tools 6(4), 537-566. http://www.sgi.com/Technology/mlc.
Meta Group (1997), Data mining market trends: A multiclient study.
Shafer, J., Agrawal, R. & Mehta, M. (1996), SPRINT: a scalable parallel classifier for data mining, in `Proceedings of the 22nd International Conference on Very Large Databases (VLDB)'.
Silicon Graphics (1998), MineSet User's Guide, Silicon Graphics, Inc. http://mineset.sgi.com.
Thearling, K. (1998), Some thoughts on the current state of data mining software applications. DS*, Jan 13.
Ron Kohavi is the engineering manager for MineSet, Silicon Graphics' product for data mining and visualization. He joined Silicon Graphics after getting a Ph.D. in Machine Learning from Stanford University, where he led the MLC++ project, the Machine Learning library in C++ now used in MineSet and for research at several universities. Dr. Kohavi co-edited (with Dr. Provost) the special issue of the journal Machine Learning on applications of machine learning, which appeared early 1998. He is a member of the editorial board for the Data Mining and Knowledge Discovery journal, and a member of the editorial board for the journal of Machine Learning. His interests include data mining algorithms that yield interpretable output, error estimation, feature selection, scaling algorithms to large databases, and visualization of models and data.
Beyond datamining: Influencing the business process
A case study of deploying CHAMP: A Prototype for Automated Cellular Churn Prediction.
Brij Masand and D. R. Mani
40 Sylvan Road
Waltham, MA 02254
We describe our experience in deploying CHAMP (Churn Analysis, Modeling, and Prediction), an automated system for modeling cellular customer behavior on a large scale. Using historical data from GTE's data warehouse for cellular phone customers, CHAMP is capable of developing customized churn models for over one hundred GTE cellular phone markets totaling over 4.5 million customers. Every month churn factors are automatically identified for each geographic region and predictive models are updated to generate churn scores predicting who is likely to churn in the next 60 days. We describe the various design issues addressed to create such a turnkey system, including automating data access from a remote data warehouse, preprocessing, feature selection, model validation and optimization to reflect business tradeoffs. Machine learning methods such as decision trees and genetic algorithms are used for feature selection and a neural net system is used for predicting final churn scores. In addition to producing churn scores, CHAMP also produces qualitative results in the form of rules that identify population segments with high churn and comparisons of market trends, which are disseminated through a web based interface as well as a hardcopy newsletter.
The key factors for success turned out to be not only technical factors, such as the existence of an excellent data warehouse and the automated development, updating and scoring of churn models for a large number of markets, but also *adapting the solution to the business process* and *creating champions for CHAMP in the target organization to help shift marketing culture*.
Measures for Business success:
The effectiveness of the churn prediction depends not only on correctly identifying churners, but ultimately on how it influences the business processes for customer retention. The business question is not only whether we can identify/concentrate potential churners with a high degree of confidence in a small population (the relatively easy part) but *whether we can intervene effectively to reduce churn*.
One obvious measure of CHAMP's effectiveness is an increase in return on investment (ROI) which is affected by the profitability or payoff from each customer. Other measures include increasing the number of customers who choose to renew their contracts, especially those with high possibility of churn, and an overall reduction of churn rates over time. Churn scores can also be used for adjusting concessions in order to decrease costs. This same idea can be used for prioritizing customers for mail campaigns and other points of contact.
Developing a practical data mining application requires trade-offs between different requirements. This section discusses key issues for our churn prediction task and our approaches to address these issues.
Integrating data mining tools into the business environment and influencing business processes: This issue is perhaps the most relevant to the success of a data mining tool. How can the tool be integrated into the current business process such that it can be used to influence business decisions? How can the tool gain the trust and confidence of the users and decision makers? In our case we found that giving end-users a KDD interface that allowed them to modify the solutions and "own" them made a dramatic difference in their willingness to use CHAMP. Tracking and reporting how many people with high churn scores actually churn over the next few months helped build confidence in the predictive models. Finally, a printed newsletter describing key CHAMP findings created a critical mass of interest in "management".
Using third party products versus in house development: When the churn prediction task was initially put forth to the KDD group, there were not any third party products that dealt directly with predicting churn on a turnkey basis. Since specialized churn prediction methods were not available at the time, and the available packages did not support processes such as data preprocessing and batch application of models, we chose to create an in-house prototype that uses some third party components for the underlying ML methods but relies on in-house development and expertise for other aspects of the problem.
Feature selection: The data warehouse collects over 200 fields on a monthly basis for each customer. Which fields should be used for the prediction process? Feature selection is an ongoing process. Initially we found a superset of features from the data warehouse by listing those that seemed most relevant to the task. We also inquired about which features would be most valuable to the GTEW marketing group once the models were built. We have experimented with narrowing this initial set by using a decision tree to rank each of the features independently, followed by a genetic algorithm to group subsets of features.
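Ranking features independently can be sketched by scoring each field as if it were the only split in a one-level decision tree. The example below uses information gain as the split criterion, which is an assumption: the text does not say which criterion CHAMP used. The field names and data are hypothetical, and the genetic algorithm step that groups subsets of features is not shown.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, feature, label):
    """Entropy reduction from splitting the rows on one categorical feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[label] for row in rows]) - remainder

def rank_features(rows, features, label):
    """Features sorted from most to least informative, each scored on its own."""
    gains = {f: information_gain(rows, f, label) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)

# Hypothetical fields; the real warehouse fields are not listed in the text.
rows = [
    {"contract_expiring": "yes", "region": "east", "churn": 1},
    {"contract_expiring": "yes", "region": "west", "churn": 1},
    {"contract_expiring": "no", "region": "east", "churn": 0},
    {"contract_expiring": "no", "region": "west", "churn": 0},
]
ranking = rank_features(rows, ["region", "contract_expiring"], "churn")
print(ranking)
```

Scoring each feature alone is fast but ignores interactions between fields, which is presumably why a second, subset-oriented search follows it in the approach described above.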
Quantity of customer data: Another important issue is the quantity of data necessary for prediction. This depends on how accurate the models should be to be considered useful. Should billing information from the entire lifetime of a customer be used or are the most recent months before prediction sufficient for developing models? Does using more data increase the usefulness of the models? Since GTEW has a large customer base, we decided to use the most recent months to predict a customer's churn score to reduce the learning time. We experimented with varying numbers of months and found that going two months back into customers' historical information provided reasonable efficiency of the modeling process without decreasing lift significantly.
Data selection: GTEW has customers in various geographic regions and a mix of business and residential customers. Should the prediction method focus on a particular geographical region or type of customer? Should different models be learned for differing regions or types of customers? We decided to learn models for the larger GTEW markets which are separated according to their geographic region. These are the regions that GTEW is most interested in and it allows each of the regions to tailor their campaigns according to the local trends in their market.
Explanation: Marketing people want to know why people churn, while KDD at most can describe characteristics of churners. How can we bridge this gap?
Data Semantics: One of the most difficult aspects of dealing with a large data warehouse is understanding the semantics of the fields. For instance, the definition of a churner might change depending on the context it will be used in. Learning and tracking the definitions of the fields in the database and understanding the context of the fields may require meta-data and knowledge management solutions.
Model validation: It is not only important for the model to have high lift on historical data but it needs to generalize enough so that it can be applied in the "future". Checks on distribution of predicted churners can help flag models that are inappropriate for actual production data.
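A distribution check of this kind can be sketched as follows. This is an illustration under stated assumptions rather than CHAMP's actual test: the score cutoff of 0.5, the tolerance factor of 2, and both function names are hypothetical.

```python
def predicted_churn_rate(scores, cutoff):
    """Share of customers whose churn score is at or above the cutoff."""
    return sum(1 for s in scores if s >= cutoff) / len(scores)

def distribution_shifted(validation_scores, production_scores,
                         cutoff=0.5, max_ratio=2.0):
    """Flag a model when the share of predicted churners on production data
    differs from the validation share by more than a factor of max_ratio."""
    expected = predicted_churn_rate(validation_scores, cutoff)
    observed = predicted_churn_rate(production_scores, cutoff)
    if expected == 0 or observed == 0:
        return expected != observed
    ratio = observed / expected
    return ratio > max_ratio or ratio < 1 / max_ratio

validation = [0.9, 0.1, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.1, 0.2]  # 10% flagged
stable = [0.1, 0.8, 0.2, 0.1, 0.3, 0.1, 0.2, 0.1, 0.1, 0.2]      # 10% flagged
drifted = [0.9, 0.8, 0.7, 0.6, 0.9, 0.1, 0.2, 0.1, 0.1, 0.2]     # 50% flagged

print(distribution_shifted(validation, stable))   # False
print(distribution_shifted(validation, drifted))  # True
```

A model that suddenly flags five times as many customers as it did on historical data is more likely reacting to a change in the input fields than to a real change in behavior, so it is worth holding back before the scores reach a campaign.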
Experimental evaluation criteria: What types of evaluation criteria should be used for this problem? Although classification accuracy is commonly applied to evaluate prediction problems, in this task it is not as applicable for at least three reasons. First, there is a large disparity between the class occurrences; industry wide monthly churn rates are about 2-3%. Second, a false positive is less costly than a false negative. If a non-churner is incorrectly classified as a churner (false positive), they may receive larger concessions than otherwise, but if a churner is incorrectly classified as a non-churner (false negative), GTEW may not reach the customer before they terminate service and thus lose their revenue. Third, customer profitability is a major consideration. From a business perspective, it is more important to retain highly profitable customers.
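The first reason can be made concrete with a small calculation. In the sketch below the customer counts and scores are invented for illustration; only the 2-3% churn rate comes from the text. Lift is computed in the usual way, as the churn rate among the top-scored customers divided by the overall churn rate.

```python
def accuracy(actual, predicted):
    """Fraction of customers whose churn flag was predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def lift_at(actual, scores, fraction):
    """Churn rate among the top-scored `fraction` of customers, divided by
    the overall churn rate."""
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top = ranked[:max(1, int(len(ranked) * fraction))]
    top_rate = sum(a for _, a in top) / len(top)
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate

# 1,000 hypothetical customers, 25 churners (2.5%, within the 2-3% range).
actual = [1] * 25 + [0] * 975
# Hypothetical scores: 20 of the 25 churners land in the top 100 scores.
scores = [0.9] * 20 + [0.1] * 5 + [0.8] * 80 + [0.1] * 895

always_stay = [0] * len(actual)
print(accuracy(actual, always_stay))   # 0.975, yet no churner is found
print(lift_at(actual, scores, 0.10))   # 8.0
```

A model that never predicts churn scores 97.5% accuracy and is useless for retention, while the scored model concentrates churners eightfold in the top decile. The second and third reasons (asymmetric error costs and customer profitability) would enter as weights on top of this and are not modeled here.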
Distributed Data Mining through a Centralized Solution
Jan Mrazek, Ph.D.
Chief Specialist, Data Mining Group
Global Information Technology
Bank of Montreal
4100 Gordon Baker Road
Toronto, Ontario, M1W 3E8, Canada
This paper presents a corporate Data Mining Solution which supports the current and future large scale analytical needs of most of the Bank of Montreal lines of business.
Data Mining has quickly matured from isolated, small scale, PC based, single algorithm techniques to robust analytical solutions which utilize a combination of various artificial intelligence algorithms, massively parallel technology, direct two-way access to relational databases and open systems with published Application Programming Interfaces.
In the banking industry, Data Mining techniques have been accepted by the statistical community and are utilized side by side with more traditional statistical modeling techniques.
Various groups of analysts at the Bank are employing Data Mining software ranging from proprietary solutions (Neural Networks in Stock Brokerage, Decision Trees in Credit Risk), PC based single algorithm tools (KnowledgeSeeker, 4Thought), to SAS on MVS and Intelligent Miner on SP2. Among these groups the analytical, technological and statistical skill sets and expertise vary.
Growing interest in Data Mining technology at the Bank is driven by numerous factors:
Experience with the status quo shows that satisfying the above needs locally and on an ad hoc basis is neither practical, manageable, nor cost effective. Therefore, at the Bank we are in the process of implementing a robust Centralized Data Mining Solution (CDMS) supported by the creation of a Data Mining Center of Excellence, an institution responsible for managing the new HW/SW and its utilization by various DM groups, providing data transformation, managing DM metadata and deploying pre-canned models to light users via the Intranet.
Technically, the initial HW/SW configuration is:
HW: Silver Node (4 way SMP processors) incl.
436GB Disk Storage
2 PowerPC 604e, 332MHz, 2-Way
SW: Intelligent Miner V2
DB2 UDB EE
IM4RM (Discovery Series)
Other SW is being tested on an ongoing basis for future implementation. This includes light user/project solutions (Knowledge Studio), visualization and presentation tools, Intranet access and some specific AI methods not yet covered (Genetic Algorithms, Fuzzy Logic, etc.).
Benefits of CDMS:
The CDMS provides high speed links/gateways to the major data sources deployed in the organization (Bank Information Warehouse, Customer Knowledge Data Mart, Credit Card, Risk Management, mbanx, etc.) totaling more than 3TB of data.
The likely power user groups are: Database Marketing, Credit Risk, Credit Card, mbanx analytics, Harris Bank, Transfer Pricing, and others. These groups will be freed from tedious data transformation and HW/SW maintenance responsibilities.
It is the nature of Data Mining projects to require initially large space to store data. In contrast to, say, Data Marts, these storage requirements are limited to the initial data crunching and the life of the project. Therefore most of the storage could be freed in several weeks and made available to other projects. Equipped with Massively Parallel Processing and sharing almost 0.5TB of DASD, users will benefit from unprecedented processing power allowing them to run on large data sets.
Benefits in costs:
Data Mining technology is expensive to acquire (SW & dedicated HW) and difficult to maintain (Operating System, underlying database, Data Mining SW itself, gateways, front ends, etc.). And let us emphasize the support of various high speed links and gateways to the BIW, Data Marts and other business critical sources of data.
It is hard to imagine a LOB being able to justify a purchase of top notch Data Mining Technology, a solution which could cost an initial $600,000 of investment and $300,000 a year for support and maintenance. Such a system would likely be underutilized by the LOB most of the time.
Obviously, the CDMS is highly scaleable and, in the long term, a cost efficient solution - a win-win for all groups involved and for the organization as a whole. For its growth, this joint and corporately managed Data Mining effort gives additional power in negotiations with HW/SW vendors.
Data Mining Metadata:
One important responsibility of the Data Mining Center of Excellence, which manages the CDMS, is the creation and management of Corporate Data Mining Metadata. This is a relational database outlining all data mining projects, models built, their frequency, contacts for their creators, data sources involved, variables, and model version control. Details about particular model parameter settings, treatment of NULLs, outliers, etc. will be included. The Metadata will be published on the Intranet and accessible by all Bank employees.
This solution will help in the interpretation of DM results, especially in cases where similar project objectives and data deliver discrepancies in results. It supports a unified view of patterns in data.
Data Mining and Analytics in general demand a complex set of skills: business, statistics, databases, operating systems. The CDMS, with its DM Metadata, is a natural platform for sharing those skills, and it helps to avoid duplication of effort.
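As a rough illustration of what such a metadata database might hold, the sketch below builds a minimal schema in SQLite. Every table name, column name, and sample row is hypothetical, chosen only to mirror the items listed above (projects, models, frequency, contacts, data sources, variables, versions, parameter settings, NULL handling). The Bank's actual design is not described in the text and would run on its own database platform.

```python
import sqlite3

# Table and column names are hypothetical; they follow the items the text
# lists (projects, models, frequency, creators, data sources, versions).
SCHEMA = """
CREATE TABLE project (
    project_id  INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    objective   TEXT,
    contact     TEXT
);
CREATE TABLE model (
    model_id      INTEGER PRIMARY KEY,
    project_id    INTEGER NOT NULL REFERENCES project(project_id),
    version       INTEGER NOT NULL,
    run_frequency TEXT,
    parameters    TEXT,   -- algorithm settings
    null_handling TEXT,   -- treatment of NULLs and outliers
    UNIQUE (project_id, version)
);
CREATE TABLE model_source (
    model_id    INTEGER NOT NULL REFERENCES model(model_id),
    data_source TEXT NOT NULL,
    variable    TEXT NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO project VALUES "
             "(1, 'Card attrition', 'retention', 'analyst@example.com')")
conn.execute("INSERT INTO model VALUES "
             "(1, 1, 1, 'monthly', 'tree depth 5', 'NULL -> median')")
conn.execute("INSERT INTO model VALUES "
             "(2, 1, 2, 'monthly', 'tree depth 8', 'NULL -> median')")
conn.execute("INSERT INTO model_source VALUES "
             "(2, 'Customer Knowledge Data Mart', 'balance')")

# Latest model version per project, with the contact who can explain it.
latest = conn.execute("""
    SELECT p.name, p.contact, MAX(m.version)
    FROM project p JOIN model m ON m.project_id = p.project_id
    GROUP BY p.project_id
""").fetchall()
print(latest)
```

Keeping parameter settings and NULL handling next to each model version is what would let two groups with conflicting results trace the difference back to a concrete modeling choice.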
Distributed access via Intranet:
The Data Mining Center of Excellence will deliver pre-canned predictive models for light users, i.e. departments lacking DM expertise, and will enable them to modify and run these models through the Intranet and IBM's Discovery Series (IM4RM).
Dr. Jan Mrazek is the Chief Specialist at the Data Mining Group of the Bank of Montreal's Operations Division (Emfisys). His responsibilities include implementation of Data Mining Technologies and supportive Architectures and large scale data mining projects. He represents the specific Data Mining needs in the design of the corporate Bank Information Warehouse. He is also project leader of the Customer Knowledge System, the largest Data Mart (1.7 TB) built at the Bank. Dr. Mrazek received his Ph.D. from the Hagen University (Germany), has held various teaching and research positions at universities in Germany and the Czech Republic and is the author of a book on Degeneracy Graphs. In the past 6 years he has established himself in the Data Warehouse/Data Mining/Database Marketing Industry and has his signature on some of the most technologically advanced projects in this area.
Automated Model Selection and Boosting Technologies
Practical Techniques for Automating Data Mining Modeling
Thinking Machines Corporation
16 New England Executive Park
Burlington, MA 01803
Developers of data mining software tools aspire to the "one button" approach to model development. Nevertheless, a survey of state-of-the-art data mining projects presents a far more complex and resource-intensive picture. Not only are trained experts critical to the formulation of business problems, preparation of data and practical deployment of data mining results, they are also critical to developing robust and accurate data mining models. Despite our industry's increasingly sophisticated tools, the development of "best fit" models remains a highly iterative process requiring the active intervention of highly skilled data mining experts.
This presentation explores two promising alternatives to human resource-intensive model iteration: Automated Model Selection and Model Boosting.
This presentation will compare and contrast these techniques from both a technology and end-user perspective. Specific topics to address include:
Not All In the Data
Nicholas J Radcliffe
16 Chester Street
Edinburgh EH3 7RA
Whichever of the various competing definitions of data mining is used, it is clear that it has something to do with finding patterns in data.
There would also probably be fairly broad agreement that desirable properties of patterns found would include that they be
Quadstone believes, and it will be argued in this paper, that it is important to distinguish between various different kinds of information that can be relevant to data mining. Perhaps the most important of these might be classed as:
1. information contained in the dataset being mined in a form that is reasonably accessible to automatic pattern detection, whether with traditional statistical methods, machine learning methods or other automated procedures (e.g. a strong, meaningful correlation between one or more independent variables that exist as fields in the data and a dependent variable being modelled that also exists as a field in the data);
2. information that is expressed in the dataset being mined but is in a form that is not readily available for exploitation by automated data mining methods (e.g. a relationship between the ratio of two obscure customer-aggregates that could be derived from a transaction stream and an outcome of interest in a customer table, such as a fraud tag);
3. patterns that exist in the dataset being mined in a form that is accessible to automated mining, but are either incorrect, open to misinterpretation or in some other way misleading (e.g. a strong but spurious correlation between a variable that is thought to be independent of an outcome of interest, but which is in fact causally dependent on the outcome);
4. information that is not expressed in the dataset being mined at all, but is either essential or highly relevant to producing and understanding meaningful patterns in the data (e.g. information about competitor activity, which is hard to capture in an analysis dataset, or knowledge that the basis of aggregation for a particular quantity in the dataset changed on some date during an observation window, giving rise to apparently changed behaviour, when in fact no real behavioural change occurred).
This paper will argue that the consequences of the existence of these four kinds of information have direct and concrete implications for the construction of useful data mining tools, and that while most current tools concentrate strongly on information of the type 1 listed above, most of the business value is likely to be lost if types 2-4 cannot be handled.
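The second kind of information in the list above can be illustrated with a short sketch. The transaction kinds, amounts, and the particular ratio are invented for the example and are not taken from Decisionhouse. The point is only that the ratio does not exist as a field until an analyst decides to derive it from the transaction stream.

```python
from collections import defaultdict

def customer_aggregates(transactions):
    """Roll a transaction stream up to per-customer totals."""
    totals = defaultdict(lambda: {"cash_advance": 0.0, "purchase": 0.0})
    for customer, kind, amount in transactions:
        totals[customer][kind] += amount
    return totals

def cash_to_purchase_ratio(totals):
    """Derived field: cash advances relative to purchases, per customer."""
    return {
        customer: t["cash_advance"] / t["purchase"] if t["purchase"] else float("inf")
        for customer, t in totals.items()
    }

# Hypothetical stream of (customer, transaction kind, amount) rows.
transactions = [
    ("A", "purchase", 200.0), ("A", "cash_advance", 20.0),
    ("B", "purchase", 50.0), ("B", "cash_advance", 400.0),
    ("A", "purchase", 300.0),
]
ratios = cash_to_purchase_ratio(customer_aggregates(transactions))
print(ratios)
```

Once the ratio is added to the customer table it becomes information of the first kind, and an automated method can relate it to an outcome such as a fraud tag. Choosing which aggregates and which ratio to build is the step that depends on the analyst's knowledge.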
The paper will argue that
The above suggests that a useful data mining tool should aim, first and foremost, to empower the business analyst to explore and understand the dataset in relation to his/her own knowledge, rather than aiming to replace the analyst with some automated data-discovery algorithm. Such a tool would necessarily provide an integrated set of functionality to facilitate the full range of activities necessary in a meaningful analysis, with an emphasis on interactivity, visualisation and flexibility. These characteristics are at least as important in determining the utility and power of a data mining package as the selection of a suitable set of methods from the ever-growing plethora of automated classification, discrimination and clustering algorithms.
Nicholas J Radcliffe is the Chief Technical Officer and one of the four founders of Quadstone Limited, producers of the Decisionhouse scalable customer behaviour modelling suite. In this capacity, he is responsible for all software development and analysis work carried out by the company. Quadstone has been trading since March 1995 and was ranked Europe's leading and the world's fifth largest supplier of data mining software in the recent Gartner Group/DataQuest report. Quadstone employs 70 staff, and has offices in Edinburgh, London and Boston. Radcliffe also holds a part time position as Visiting Professor in the Department of Mathematics and Statistics at the University of Edinburgh, Scotland, where his research focuses on optimisation problems with a particular emphasis on genetic algorithms and related stochastic search methods. Prior to these appointments, Radcliffe worked at the Edinburgh Parallel Computing Centre, Europe's leading centre for High Performance Computing. He holds a PhD in Theoretical Physics from University of Edinburgh and a BSc in Mathematics Physics from the University of Sussex.
The Commercial Success of Data Mining
Tuition House, St George's Road
Wimbledon London SW19 4EU
Knowledge Discovery (KD) stands at the edge of a chasm. It is clear that KD cannot stand still; to succeed it must move forward. Behind it are two decades of research into machine learning and statistical analysis, making KD predominantly the tool of specialist users. Ahead lie untold riches promised by the widespread adoption of KD in the mainstream business world. If it stands still, KD will be the next casualty in a line of promising technology casualties that have included Natural Language and Computer Aided Software Engineering (CASE) amongst others.
There exists a clear need for smarter analysis in business today. This is undisputed. Look at the way On-line Analytical Processing (OLAP) has leapt forward in the last three years, the most recent players being Oracle and Microsoft. This has been driven by the need for users to easily understand the information locked in their operational and warehouse systems.
KD promises much more, and fundamentally shifts the analysis from a control perspective to one of exploiting opportunity.
The question for those of us in the industry is not whether there exists an opportunity to exploit, but how to exploit it. Gentia Software advocates that there are two paths to delivering real benefit to the business with KD.
These goals are hardly new, but how do we achieve them? What we do not need are new algorithms. Those that exist must be made accessible and delivered in a robust fashion to whoever will benefit from KD.
Although it is not the subject of this discussion, a clearly defined process/methodology must be accepted within the industry. This will provide a common vocabulary and demystify what appears to new KD adopters as yet another technology that will be difficult to control.
Gentia Software believes that KD software has to incorporate the following three tenets for KD to achieve its full potential.
The systems must be capable of being deployed to masses of users.
The systems must be scaleable to handle vast data volumes.
The systems must be automated to maximise developer or end user productivity.
Meaningful in themselves, each of these three tenets requires multiple advances in technology from where we stand today.
To deploy a KD system to a mass of users implies that the system should be capable of addressing a wide range of user abilities and that it runs on mainstream client/server systems, and that Web use is an absolute necessity.
Ease of use is not achieved simply through well designed human interfaces, but should incorporate AI based help systems that assist the new user through the system, offering guidance and learning from user selections. Ease of use also demands performance; witness the popularity of OLAP systems. KD systems will co-exist with departmental and divisional systems, running on existing platforms. This dictates that the KD system must run efficiently on Windows NT and low specification UNIX servers. The nature of the server or the client (fat or thin) must be transparent to the user, and interchangeable.
Much has been written about the merits of sampling versus directly accessing the entire corporate data store. This is not the issue. For KD to succeed, the users' demands will drive the appropriate technology, and there will be applications that demand both. The industry cannot afford to debate this any longer. As software developers we must deliver systems capable of both sampling and handling massive data stores on an iterative basis.
The iterative nature of the KD process demands a new solution. This must be capable of handling vast volumes very quickly (100X RDBMS), not once, but repeatedly as the problem dictates, and must also be open to the other tools in the developer's and end user's tool kit.
If we accept that KD is a well-defined process, then it must be possible to automate that process. This allows business templates to be built and then tailored on demand. The automation of a process, by a visual metaphor, and its storage we could refer to as a KD/plan. This plan with the appropriate APIs can be embedded in business applications, or recalled from a library for reuse by the author or authorised colleagues.
A framework of object services that may be combined into a KD/plan can deliver the above three tenets. Such a framework must be complete, offering security, dictionary, distribution, and automation facilities, etc. Within the framework, components provide the functionality to access, transform, analyse and visualise the results. Fundamental to the success of this architecture is that it can be continually extended to meet new requirements by the addition of external components, allowing continual evolution with minimal disruption to the users.
Knowledge Discovery is developing fast. A move to a common methodology is required, and work on this is underway (CRISP-DM, for example). Alongside this, software must evolve to meet the demands of the user, whether they are end users performing a discovery process or professional developers embedding a KD process into business applications. Key to the success of both is the ability for the KD software to be deployed to large user communities, to scale to handle the entire range of business problems, and to be highly productive, ensuring a maximum return on investment through extensive automation capabilities.
Keys to the Commercial Success of Data Mining
Senior Analyst, Data Mining Group
pp34, Room 161, B81
Ipswich IP5 3RE
My group consists of 12 experienced data miners, with expertise in the areas of machine learning, statistics, data visualisation and databases: they also have a good knowledge of the telecommunications business, systems and processes. Set up at BT labs in 1994, the Data Mining Group helps BT exploit the benefits of data mining by:
The group works in many areas of commercial importance to BT, the main ones being:
Using new and existing data mining techniques to extract valuable marketing intelligence from large corporate databases.
Applying data mining technologies to detect, stop, and deter fraudulent activity on BT's telephony network.
Understanding how people navigate the web for better site design and more targeted marketing.
Investigating future data mining technologies and applications for a 1-5 year time horizon.
For our analysis work, we have used the following tools, amongst others: SAS, Splus, c4.5, Knowledge Seeker, NetMap, AVS, various other decision tree, clustering and neural net algorithms, Oracle/SQL, perl, and Excel.
For more information about us, see our website at: http://www.labs.bt.com/projects/mining/
Some points I would like to make based on our experience:
Commercial success criteria
Success criteria are much more commercially oriented than they were in the early days. It is no longer enough to produce a convincing demonstration system or model. It has to be integrated into existing systems, applied on a day-to-day basis, and be shown to meet a target in cost savings. Defining and estimating the costs and benefits of a proposed project can be difficult, but we have found that the more effort that is spent on this early on, the better.
Reasons for failure
We have carried out many data mining projects over the years. In our experience, the key to a successful data mining project is not obtaining some data and finding a useful pattern in it. Only a small number of our projects have been unsuccessful in the sense that we could not find anything useful in the data. More usual reasons for failure are:
To avoid these problems, we spend a lot of effort in the definition and planning phase of a data mining project, and have produced our own data mining project guidelines, in conjunction with Syntegra, BT's systems integration arm, for the company to follow.
A commercial data mining application takes more skills and people than just the data miners; the successful projects are those in which data mining is just seen as a component of an overall project or system. This tends to focus the objectives and deliverables of the data mining aspect, and avoids it being over-hyped, if that is possible!
Data extraction and pre-processing
We still experience delays and difficulties in obtaining the necessary data for projects, particularly if the data is not in a data warehouse. And a lot of our effort is spent pre-processing the data prior to analysis. This is not, however, a severe problem if planned for.
We have found that the role of the group has changed in recent years. Up until about a year ago, most of our non-research work was data analysis, that is, doing data mining for another part of the company. We still do a lot of this, but more and more we find ourselves acting as consultants, advising other parts of the company about data mining. For example:
This changing role is our response to the increase in demand for data mining, and also the increasing choice of suppliers and tools.
Areas for improvements in DM software
My suggestions for improvements to existing DM software concur with those often cited in the literature, for example: integration with relational databases, scalability, etc. Obviously, the onus is not just on the data mining tool developers to achieve integration, but also major database vendors. Databases should support operations often used in data mining efficiently, for example: random sampling.
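As an illustration of the kind of operation meant here, the following is a minimal sketch of single-pass random sampling (reservoir sampling) over a stream of rows. It is not taken from any BT system or database product; the function name reservoir_sample and its parameters are hypothetical, and the rows could equally come from a database cursor.

```python
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length.

    Hypothetical helper for illustration: it reads the rows once and keeps
    only k of them in memory, which is the property a database would need
    to sample a large table without materialising it.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


if __name__ == "__main__":
    # Sample 5 row ids from a stream of one million without storing the stream.
    print(reservoir_sample(range(1_000_000), 5, seed=1))
```

A sampler of this kind needs only one scan of the table, which is why it is a natural candidate for direct support inside the database engine.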
Several tools suppliers have developed data mining tools that, it is claimed, can be used by non-specialists, often via an easy-to-use interface or data visualisation. While this is certainly an improvement, I believe such tools will not find a large market of non-specialists. This is because non-specialists do not want to have data mining made easy for them - they really do not want to do data mining at all in a general context - what they want is to solve their particular problem, be it targeting their marketing, highlighting potential fraudsters, etc. The likely outcome will be the development of vertical applications for common business problems in particular industry sectors, in which the data mining element is largely hidden from the user. Examples include systems designed to highlight customers likely to leave a mobile telecoms company for the competition (churners), telephony fraud detection, and targeting personalised adverts at Internet users.
Problem formulation, along with risk assessment and project planning, is very important. We spend a lot of time upfront with the customers of our work, establishing as precisely as we can what business benefit is expected and how it will be realised (see below).
We have produced our own guidelines for managing data mining projects, called M3, for use by BT. It uses a DM lifecycle similar to those in the literature (for example, Fayyad et al. 1996) but places emphasis on the early stages of a lifecycle: particularly problem formulation, data investigation, risk assessment, feasibility check, cost-benefit analysis, project planning, role identification, and defining a clear exploitation route. This can act as a checklist for the analyst that all the relevant aspects have been covered, and provides some tips on pitfalls to watch out for. For example, the role identification part of M3 defines several roles that may have to be involved in, and kept committed to, a successful data mining project: the analyst him/herself, the domain expert, the database designer/administrator, the customer, the end-user, the legal expert, the system developer, and the data subjects themselves.
Since early 1994, Huw has been a senior analyst in BT's Data Mining Group, based at their UK laboratories. He has published several papers in the area, and has acted as a consultant, analyst, and team leader on many data mining projects, both within BT and for external customers. His previous experience includes software engineering and artificial intelligence systems. His main interests are:
For more information, see Huw's homepage.
Bringing Data Mining to the Forefront of Business Intelligence in Wholesale Banking
Ashok N. Srivastava, Ph.D.
Global Business Intelligence Solutions
IBM Almaden Research Center
650 Harry Road San Jose, CA 95120
Although the basic algorithms of data mining technology have been available for many years, data mining has not yet realized its full potential as an integral part of some business intelligence solutions. This article discusses the application of data mining in the wholesale banking industry, illustrates some of the associated challenges, and recommends the development of a domain-specific knowledge-encoding tool. Although we focus on wholesale banking, these observations have direct parallels elsewhere, including retail banking, pharmaceutical, and manufacturing industries. As in most industries, the success of a data mining implementation as part of a viable business intelligence solution depends primarily on the accessibility of the data, the level of integration of the data mining software to existing data bases, the ease of data manipulation, and the degree of "built-in" domain knowledge. Each of these issues merits careful thought and analysis, although our focus here is on the last issue.
The wholesale side of a commercial bank has a significant need to invest in business intelligence and data mining technology because wholesale banking:
Provides a large percentage of a commercial bank's revenue, thus making the potential return on investment attractive;
Needs to manage client relationships across a broad spectrum: from a relatively small corporate entity, to a large multinational corporation with potentially dozens of subsidiaries and closely aligned business partners; and
Is affected by a large number of factors, including macro-economic factors such as international economic forces, industry-specific trends, the real estate market, and micro-economic factors such as a particular client's economic health and leadership.
A business intelligence solution which summarizes this information in the form of query-based reports, augmented by the predictive power of data mining technology, can greatly enhance the corporate decision making process.
In a typical data-mining project, these data are brought together at an appropriate level of generality that describes product usage as well as client information. Although obtaining data from internal sources and external vendors is not difficult, creating an appropriate data set for mining is challenging for many reasons, a few of which are given here:
Given that this list is not nearly exhaustive, simply the creation of an appropriate mining database is time consuming and challenging. Furthermore, consideration of these factors in the data mining analysis is crucial for data mining to gain acceptance within wholesale banking marketing channels. Lines of corporate influence, product cohorts, and data quality issues are often known by domain experts but are not directly reflected in the data.
To illustrate, consider a product demand forecasting analysis, which is a typical application of data mining in this industry, where the task is to predict a corporation's demand for a banking product given other product usage information, relevant economic ratios, and other macroeconomic variables. Building a good predictor here requires a clear understanding of the relationships between product groups. A forecasting analysis might indicate that "usage of cash management products implies usage of investment banking products," but this may be information that is already implied by bank policies, or is necessary because of other economic considerations. A tool that reflects these internal relationships when delivering results would greatly improve the feasibility of data mining in this industry.
The details of such a tool fall outside the scope of this article, but a few points can be made regarding it. A simple tool may include the following characteristics:
The addition of a tool similar to the one outlined here would significantly enhance the effectiveness of data mining in this and other industries by delivering the analyses directly to decision makers. In many business settings the comprehensibility of the analyses as well as the degree of validation against known relationships help enhance the perception of the data mining activity, thus creating an environment where this new technology can take root. The success of data mining does not solely depend on the quality of new algorithms, but also on the usability, comprehensibility, and degree of domain-specific knowledge integrated in the tool.
Data Mining In Commercial Applications
Sunny Tara, Systems Technology, AutoZone Inc., firstname.lastname@example.org
Gautam Das, University of Memphis, email@example.com
King-Ip Lin, University of Memphis, firstname.lastname@example.org
There is a tremendous motivation for Data Mining applications in the commercial world. Many companies, after their initial success with their data mining research and pilot projects, are starting to move these projects into real world business applications. To date, most data mining development has come from the research (academic) world. As a result, many business issues have not been addressed, or have received only cursory attention. The purpose of this paper is to highlight some of these issues to the KDD research community so that we can bridge the gap between the research (academic) world and the business world.
Data Cleaning and Data Preparation
An important component of the KDD process is data cleaning and data preparation, which has to be done before the actual data mining can take place. The current state of the art of data cleaning is far from being as sophisticated as, say, that of data mining algorithms. Most of the data cleaning is done laboriously by humans. There are few general and automated tools that assist in this process.
Our idea is to use some of the data mining techniques in data cleaning. We envision data cleaning to be performed in several stages, ranging from application of simple techniques to increasingly more sophisticated methods. For example, one can use visualization techniques to look for missing data, imperfect data, etc. Or, one can use more sophisticated clustering tools to look for outliers which may indicate potentially erroneous data. Statistical methods that look for missing data need to be studied in this context. Of course, the entire process will be an iterative one, with periodic evaluation by humans and the reapplication of appropriate cleaning tools.
Let us take an example. Consider a database maintained by a county office. This database contains information on every piece of real estate in the county. Such information might include the address, price, owner, size, etc. Such databases may contain lots of errors, missing data, etc.
Initially, we could use visualization techniques to locate each house on a map of the county. If a high priced house appears in a low-income neighborhood, then that is possibly an error.
Secondly, we could use multidimensional clustering techniques to look for "outliers" in clusters, which might point to erroneous data. If we have much prior knowledge about the data, we can assume a prior distribution that models the database, and use Bayesian methods to compute posterior distributions. Then, we could selectively omit certain data items and re-compute posterior distributions again, thereby allowing us to detect outliers.
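To make the county real-estate example concrete, here is a minimal sketch of an automated check for "a high priced house in a low-income neighborhood". It is deliberately simpler than the clustering and Bayesian methods described above: it substitutes a robust per-neighborhood score based on the median and the median absolute deviation (MAD). The record fields (neighborhood, price), the function name and the 3.5 threshold are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import median


def flag_price_outliers(records, threshold=3.5):
    """Flag records whose price is far from the median of their neighborhood.

    Uses the modified z-score 0.6745 * |price - median| / MAD, a stand-in
    for the clustering and Bayesian approaches discussed in the text.
    Records with a missing price (None) are skipped, not flagged.
    """
    prices_by_area = defaultdict(list)
    for rec in records:
        if rec["price"] is not None:
            prices_by_area[rec["neighborhood"]].append(rec["price"])

    flagged = []
    for rec in records:
        price = rec["price"]
        if price is None:
            continue
        prices = prices_by_area[rec["neighborhood"]]
        med = median(prices)
        mad = median(abs(p - med) for p in prices)
        if mad == 0:
            # All prices (nearly) identical: no scale to judge against.
            continue
        score = 0.6745 * abs(price - med) / mad
        if score > threshold:
            flagged.append(rec)
    return flagged
```

Flagged records are only candidates for error; in keeping with the iterative process described above, a human would review them before anything is corrected or removed.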
Involving Business Users In The KDD Process
Successful data mining applications in the business world require constant interaction and feedback between different users and the data mining process.
User interaction, in different aspects of the KDD process:
Previous knowledge: - The business user may have some underlying (but incomplete) knowledge about the data to be mined. Incorporating this knowledge into the data mining process will allow the data mining process to be more efficient.
On the other hand, the data mining process can also be used as a check against the knowledge that was supplied. For instance, mining results may contradict some conventional wisdom. The system should allow for that.
Resource constraints: - The need to be able to mine information from a huge resource in real-time (or near real-time) is increasing. The demand of the real world does not allow the luxury of time. Thus we should provide methods that allow data mining algorithms to deliver reasonably good results even if there are not enough resources for a full-scale job.
Intermediate feedback: - Each step of the KDD process should provide some intermediate results. For example, results of the data cleaning process, decision of which algorithm to use, etc. The system should be able to provide appropriate feedback to appropriate personnel (users) so that they can make decisions to guide the data mining process.
Business Challenges For The KDD Process
The business issues are many, such as Scalability, Integration of Current Systems, Data Visualization, need for Database System Support, Incremental Processing, Flexibility and so on. For the purpose of this workshop we would like to concentrate on the following:
Data Cleaning & Data Preparation
The research issues are many. How does one design algorithms that can work for heterogeneous datasets? Do the algorithms scale well? Can we perform automatic error correction in addition to just error detection?
Knowledge representation and incorporation:
How to represent the underlying knowledge that is known? How to incorporate this knowledge into the appropriate KDD algorithm? How to discern whether the underlying knowledge is useful or counter-productive?
Resource constraints:
We need to devise algorithms that take resource constraints into account and that provide good intermediate results even when severe resource constraints are imposed. There has been some work on constraint algorithms and incremental algorithms. We need to devise methods to incorporate them into the KDD process.
Data visualization and feedback:
We need to provide effective methods for the data mining process to give feedback that is understandable by the user. Different kinds of users may require different kinds of visualization techniques. Also, effective means of allowing user feedback to the data mining system need to be devised, ideally coupling these to the data visualization process.
Keys to the Commercial Success of Data Mining
222 Third Street Suite 3122
Cambridge, MA 02142
Decision support is like a symphony of data. In the same way you need a variety of musical instruments functioning together for the purpose of playing a symphony, you need a variety of software tools working together for the purpose of doing decision support.
It's best to look at decision support from a functional (rather than a tool-based) perspective and to look at available technologies in terms of the functions they provide. In other words, tool features get mapped to DSS functions.
A functional perspective
Basic DSS functions
DSS is about synthesizing useful knowledge from large data sets. It's about integration, summarization and abstraction as well as ratios, trends and allocations. It's about comparing data-based generalizations with model-based assumptions and reconciling them when they're different. It's about good, data-facilitated creative thinking and the monitoring of those creative ideas that were implemented. It's about using all types of data wisely and understanding how derived data was calculated. It's about continuously learning, and modifying goals and working assumptions based on data-driven models and experience. In short, decision support should function like a virtuous cycle of decision making improvement.
Let's laundry-list these concepts to identify the minimum set of basic functions that comprise any DSS framework:
A cognitive metaphor for DSS: Beyond closed loop systems
The best metaphor that I can think of for understanding how all these decision support functions fit together is a cognitive one. In contrast, earlier metaphors focused on the uni-directional flow of information from raw data to synthesized knowledge. Second generation metaphors, currently in vogue, focus on bi-directional, closed loop systems wherein the results of DSS analysis are fed back into production systems. The hallmark of a third generation cognitive metaphor is the interplay of two separate information loops. The first is akin to the closed loop system and I would characterize it as a data-driven loop. But in addition to that loop there exists an inner loop where data driven information meets model-driven goals and beliefs at the moment of decision. Although that inner loop is frequently provided by a living, breathing person, it is a function that needs to take place and in automated systems needs to take place in the form of software within the overall decision support system. AI workers have known for a long time that it takes a combination of data-driven and model-driven information to produce high quality decisions.
Using a cognitive metaphor, the universe of DSS functions is composed of five distinct functional layers within which the two above-mentioned information loops interact: a sensory/motor layer, a primary memory layer, a data-based interpretive and understanding layer, a decision layer, and a model-driven layer of goals and beliefs.
From 20,000 feet, data-driven understanding is the process of synthesizing knowledge from large disparate data sets. Understanding is a loaded term, so let me break it down into smaller chunks. The key components of understanding are describing, explaining and predicting. The main obstacles to understanding are lack of tool integration; missing, meaningless and uncertain data; and lack of verification capabilities.
Descriptions form the basis. Examples include "The Cambridge store sold 500 pairs of shoes last week", "Our corporation did 35 million last year", "The Boston stores paid an average of 36 dollars per foot in 1997" and "Boston rent is twice as expensive as the rent in Portland, Maine". Descriptions are more than just measurements. Descriptive processes take whatever raw measurements there are and, through aggregations, ratios and other certainty-preserving operations, create a fleshed-out multilevel, multidimensional description.
In practice, there may be a variety of inferential techniques employed to arrive at a descriptive model of a business or organization. For example, inferential techniques may be used to guess what values may apply to what are otherwise missing cells. When there are a lot of blanks that need to be filled in for a descriptive model the process is akin to data archeology. Good data archeology requires the close integration of OLAP and data mining or statistics tools.
Explanatory modeling starts where descriptive modeling left off. Explanatory models are representations of relationships between descriptions. Such statements as "For every increase of 1% in the prime rate, housing sales decrease by 2%" represent explanations or relationships inferred from descriptions of both housing sales and interest rates. The functionality provided by statistics and data mining tools of all varieties belongs in this arena. Regressions (the mother of all analyses), decision trees, neural nets, association rules and clustering algorithms are examples of explanatory modeling.
Predictive modeling is just an extension of explanatory modeling. You can't make a prediction without having at least one relationship that you're banking on. And while almost all mining activities aim at building predictive models, the key algorithms are in the discovery of the patterns. Predictions are just the extension of some pattern already discovered. That's why all the data mining algorithm buzzwords you hear are about pattern discovery techniques, not pattern extension.
OLAP tools do not provide for explanatory or predictive modeling. Data mining doesn't provide for dimensional structuring. Yet, it is best to perform data mining within an OLAP (multi-level, multidimensional) environment. For example, to design a new promotional campaign based on point-of-sale (POS) and demographic data you might
all in order to support the data-based brainstorming for a new promotion campaign. In short, you needed to use a combination of OLAP, data mining (and visualization to be described below) to accomplish a single BI task: promotion development. I call this kind of integration DSS fusion.
Happily, the market as a whole is beginning to move in this direction. A number of OLAP companies are adding or claiming to add data mining capabilities although not all of them are fully integrated with their OLAP products. For example, Holos is adding mining capabilities; Cognos has a simple mining application; MIS AG has a mining application as does Pilot Software. I believe it will be easier for OLAP companies to add data mining capabilities than it will be for data mining companies to add OLAP capabilities.
Although it is good to see so many OLAP vendors offering mining capabilities, these capabilities still need to be better integrated. Mining functions should be as simple to invoke as ratios. It should be possible to perform data transformations from within an OLAP/mining environment. And mining should be fully integrated within the dimensional structure. This means that an operation like drilling down from an interface to the results of an association rule algorithm should work and, depending on how things were defined, either return a set of associations already calculated for a lower level of, say, a product hierarchy, or trigger the calculation of such associations. Thus the same thinking that goes into an OLAP design of what should be pre-calculated and what can be calculated on demand can apply to data mining as well.
Missing, meaningless and uncertain data are frequently present in data sets and pose a significant hurdle to understanding. Missing and meaningless data are logically distinct. They both need to be distinguished from the value zero (frequently the default value assigned to an empty cell), and they need to be differentially processed. For example, in an averaging function, where empty cells denote "missing", you would need to assign some kind of default value to the empty cells for the purposes of aggregation. In contrast, if the empty cells denote "meaningless", you could not assign a default value. Most OLAP and data mining tools lack good empty cell handling techniques.
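The distinction can be sketched in a few lines. The following is only an illustration, not the behavior of any particular OLAP product: the Empty markers, the average function and the default_for_missing parameter are all hypothetical names. Missing cells receive a caller-supplied default, meaningless cells are excluded from the aggregate altogether, and a genuine zero is treated as an ordinary value.

```python
from enum import Enum


class Empty(Enum):
    """Two logically distinct kinds of empty cell (hypothetical markers)."""
    MISSING = "missing"          # a value exists but was not recorded
    MEANINGLESS = "meaningless"  # no value can exist for this cell


def average(cells, default_for_missing):
    """Average a list of cells, treating the two kinds of empty cell differently.

    MISSING cells are replaced by default_for_missing; MEANINGLESS cells are
    left out of both the sum and the count. Returns None if nothing remains.
    """
    total = 0.0
    count = 0
    for cell in cells:
        if cell is Empty.MEANINGLESS:
            continue
        if cell is Empty.MISSING:
            cell = default_for_missing
        total += cell
        count += 1
    return total / count if count else None


if __name__ == "__main__":
    weekly_sales = [10, Empty.MISSING, Empty.MEANINGLESS, 0]
    # (10 + 20 + 0) / 3: the meaningless cell is ignored, the zero is not.
    print(average(weekly_sales, default_for_missing=20))
```

How the default for a missing cell is chosen (a constant, a group mean, a model-based estimate) is a separate design decision that this sketch leaves to the caller.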
Unlike missing and meaningless data, uncertain data is present as a data value in a data set. For example, a statement like "We predict that our new brand of ice cream will capture 5% of the market for ice cream products" needs to qualify the uncertainty associated with the estimate of 5%. Typical statistical measures of uncertainty include the variance and bias associated with an estimate. The overall picture of uncertainty can get a little more complicated, as derived measures follow from business rules which have their own sources of uncertainty. For example, high level sales forecasts based on aggregating lower level predicted sales data need to carry forward the uncertainties derived from the predictive models through the aggregation process. What's more, the predictive models themselves may rely on certain rules of thumb for their forecasting logic. As more assumptions become embedded in business data, OLAP tools especially will need to provide ways to process uncertainty.
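As a small worked illustration of carrying uncertainty through an aggregation, the sketch below sums lower level forecasts and their standard deviations. It rests on a strong assumption that is not made in the text above: the forecast errors are independent, so that variances add. The function name and the (mean, standard deviation) representation are likewise hypothetical.

```python
import math


def aggregate_forecasts(forecasts):
    """Sum (mean, std_dev) forecasts into one high level (mean, std_dev).

    Assumes independent forecast errors, so the variance of the sum is the
    sum of the variances. Correlated errors would require covariance terms.
    """
    total_mean = sum(mean for mean, _ in forecasts)
    total_variance = sum(std ** 2 for _, std in forecasts)
    return total_mean, math.sqrt(total_variance)


if __name__ == "__main__":
    # Two store-level sales forecasts: 100 +/- 3 and 200 +/- 4.
    print(aggregate_forecasts([(100, 3), (200, 4)]))
```

Under the independence assumption the regional forecast of 300 carries a standard deviation of 5, not 7. A tool that simply added the lower level error bars would overstate the uncertainty, and one that dropped them would hide it.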
Finally, in the same way that an astronaut's working environment is composed of fabricated living elements (temperature, pressure, oxygen, and so on), a DSS end user's or analyst's working environment is composed of fabricated data elements (daily summaries, weekly aggregates, brand reports, and so on). Given the degree to which end users are dependent on derived data as their own inputs, it is crucial that DSS vendors provide better verification capabilities.
Unless you've hitchhiked across the galaxy, understanding is more than just a number. There's no point in doing all this cool synthesis if you can't make sense of what you're doing. Seeing numbers as numbers is O.K. up to a point. But much of what you need to understand for decision making are relationships. It's not just that you sold 553 pairs of men's shoes last month, it's that you sold 15% more pairs of men's shoes last month than you did for the same month in the prior year, or that you sold 25% more men's shoes per employee in the eastern region last month than in the western region, or that you sold 20% more men's shoes than women's shoes.
Comparative analysis is a hallmark of decision support views, and graphical methods excel at portraying this kind of relationship information. Points and lines (of which maps are a special case) derive their meaning from their location on the screen. Consider the graph shown in figure 3. Without any analysis, you can easily see that the green sales line is rising at an accelerating rate while the red cost line, though still rising, is doing so at a decelerating rate. Acceleration is a second order derivative. You would never be able to see that in a large table of numbers without doing some extensive calculations.
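For a table of numbers, the calculation in question is the second difference of the series, the discrete analogue of a second derivative. The sketch below uses made-up sales and cost figures, not the data behind figure 3, and the function name is hypothetical.

```python
def second_differences(series):
    """Return the second differences of an evenly spaced series.

    Positive values mean the series is accelerating; negative values mean
    it is decelerating, even if it is still rising.
    """
    first = [b - a for a, b in zip(series, series[1:])]
    return [b - a for a, b in zip(first, first[1:])]


if __name__ == "__main__":
    sales = [10, 12, 16, 22, 30]   # rising faster each period
    costs = [10, 16, 20, 22, 23]   # still rising, but more slowly each period
    print(second_differences(sales))  # all positive: accelerating
    print(second_differences(costs))  # all negative: decelerating
```

Both series rise in every period, so a table of raw values shows only "up" for each. The sign of the second difference is what the eye reads directly from the curvature of the plotted lines.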
Often, the relationships we see in the world are not crisp like in the textbooks. This is another reason why good graphics are so useful. They allow the human eye, which is still by far the best pattern matcher around (especially in noisy situations), to look for ill- and raggedly defined patterns that would have been extremely difficult to specify quantitatively in advance.
For a while now, I've been expecting visualization to pick up marketing steam any day. It gets some press, but nowhere near the amount that OLAP and data mining get. I think one of the reasons is that there are very few, if any, general purpose dimensional visualization tools on the market. Typical business analysts of dimensional data need a visual tool/interface that can keep up with the dimensional changes inherent in a typical query session. In other words, every time you drill down on a dimension from the root or switch dimensions between row/column layout and pages, you are changing the dimensionality, not just the data, of the result set.
What I do see are a lot of components that provide basic graphics as well as some very sophisticated tools/environments that offer very highly customized visual solutions to particular data sets. (Products from companies like Visual Insights, AVS, Visible Decisions, and SGI come to mind here.) A possible exception to this is Crossgraphs from Belmont Research which seems to offer a lot of the mid range general purpose graphics such as tiling that are very useful for general business analysts.
Selfishly speaking, one really important relationship that I always like to understand is, "Where am I?". Working in a DOS mode ten years ago we tried to display summary graphic information to situate the user in a multidimensional database and identify areas of interest towards which the user might want to navigate. These days, advances in visual representation technology coming from such places as Xerox PARC are also having an impact on how we navigate through data sets. Examples include such techniques as hyperbolic trees and information walls that leverage the human eye's ability to take in lots of detail in the center of vision and smaller amounts of detail from the periphery.
Non numeric data-driven understanding
Most of what I see called DSS or BI fits into the numbers component of data-driven understanding. M/OLAP, R/OLAP, data mining and statistics, visualization and query/reporting tools are the mainstay of the numeric data-driven understanding component of business intelligence.
The biggest obstacle to the adoption of non-numeric data understanding tools is our ability to interpret non-numeric data. For all the effort that needs to go into numeric analysis, at least we know how to recognize a number. But when it comes to most non-numeric data, such as images, sounds, and text, it's precisely the question of recognition that becomes problematic. How do you recognize a happy face in a database of facial images? How do you recognize all reports that seemed pessimistic? How do you recognize stress in a telemarketer's voice? Document synthesizers are coming on line (see, for example, Northern Light, InXight, Autonomy, IBM), but the technology is still very new. Likewise, image-based pattern technologies are starting to appear that will make it possible to query for "images like x", where x is a sample image (see, for example, Chromagraphics). And I know of at least one company, Sonic Systems, whose software can be trained to identify particular sounds in noisy environments.
Deciding is the process or function of combining goals and predictive models. To decide that prices need to be lowered on certain products is the result of a goal to maximize sales and a predictive model that relates sales to product price. To decide that a certain loan applicant should be denied credit is the result of a goal to minimize loan write-offs and a predictive model that relates certain loan applicant attributes with the likelihood of loan default.
If there were no goals, it would be impossible to decide what course of action to take as any action would be as acceptable as any other. Without the goal of maximizing sales, for example, there is no right decision concerning product pricing. And without a predictive model equating product prices with product sales there is no way to know which decision will be most likely to maximize sales.
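The interplay of goal and predictive model described above can be sketched in a few lines. This is only an illustration: the linear demand curve, the candidate prices, and the function names are all assumptions made for the example, not anything taken from a real pricing system. The same predictive model yields different decisions under different goals.

```python
# Hypothetical predictive model: a linear demand curve relating price to units sold.
def predicted_units(price: float) -> float:
    return max(0.0, 1000.0 - 40.0 * price)

# Goal 1 (assumed): maximize sales revenue.
def revenue(price: float) -> float:
    return price * predicted_units(price)

# Goal 2 (assumed): maximize the number of units sold.
def units(price: float) -> float:
    return predicted_units(price)

def decide(candidate_prices, goal):
    """Pick the candidate price that scores best under the given goal."""
    return max(candidate_prices, key=goal)

candidate_prices = [5.0, 10.0, 12.5, 15.0, 20.0]
best_for_revenue = decide(candidate_prices, revenue)  # 12.5
best_for_units = decide(candidate_prices, units)      # 5.0
```

Remove the goal and `decide` has nothing to rank by; remove the model and the goal cannot be evaluated for any candidate price.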
Decision making challenges may arise from
Business rule automation tools focus on the first two challenges. Decision analysis tools focus on challenges 3 through 6. Group decision support tools focus on challenge 7. Challenges 8 and 9 lie a little further down the road.
Looking ahead, I think we will start to see self-modifying rule systems that continuously monitor the world to see whether it behaved as predicted and, when it doesn't, change the predictive models they used to make the rules. In the process, such systems could try out different scenarios or predictive models and analyze how well they would have fared under each.
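A toy version of such a self-checking rule might look like the following. Everything here is hypothetical: the one-parameter model, the relative-error tolerance, and the crude "refit to the latest observation" update are stand-ins chosen to keep the sketch short, not a description of any existing product.

```python
class SelfCheckingRule:
    """Holds a one-parameter predictive model (outcome = slope * input)
    and revises it when an observed outcome drifts from the prediction."""

    def __init__(self, slope: float, tolerance: float):
        self.slope = slope          # predicted outcome per unit of input
        self.tolerance = tolerance  # largest acceptable relative error
        self.revisions = 0          # how many times the model was changed

    def predict(self, x: float) -> float:
        return self.slope * x

    def observe(self, x: float, actual: float) -> None:
        """Compare prediction with reality; x is assumed to be non-zero."""
        predicted = self.predict(x)
        error = abs(actual - predicted) / max(abs(actual), 1e-9)
        if error > self.tolerance:
            # The world did not behave as predicted: revise the model.
            self.slope = actual / x
            self.revisions += 1

rule = SelfCheckingRule(slope=2.0, tolerance=0.10)
rule.observe(10.0, 21.0)  # predicted 20, within 10%: model kept
rule.observe(10.0, 30.0)  # predicted 20, off by a third: model revised
```

A more realistic system would refit on a window of recent observations and replay past data under alternative models before switching.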
I would also like to see rule bases connected to OLAP tools, with the rule base serving as the source of the cost allocation rules used in the OLAP system. Although OLAP tools provide a sophisticated calculation environment, they would benefit from an organized method of defining and managing rules.
You should also be able to deduce rules given goals and predictive models, which brings me to the next major category of decision making software: decision analysis software. The need for decision analysis software kicks in where decisions are based on multiple predictive models with complex measures of uncertainty and where the goals themselves are variable. Typically this appeals to people higher up in an organization. Decision analysis is closely related to operations research, where there are several mutually exclusive goals and shared scarce resources, and the trick is to maximize a global property like profit, stability or happiness.
There are quite a few companies in this area including Lumina Decision Systems, Logical Decisions, Strategic Decisions Group and Decisioneering, as well as societies such as INFORMS Decision Analysis Society. (Links to these and others can be found on our web site www.dimsys.com).
Finally, there are challenges that are more interpersonal or political. For example, a group of managers may be sitting in a room trying to arrive at a common decision, needing to vote on whether to fire 300 people or to increase sales enough to justify keeping them. Every manager has his or her own agenda, so it can be tough to know what people really think during open brainstorming, discussion and voting.
One way to overcome these challenges is to provide an anonymous electronic meeting environment where ideas can be presented, discussed and voted on based on their merit rather than on the identities (and associated interpersonal dynamics) of the persons involved. Companies like Boeing that have tested these sorts of environments found them to provide tremendous savings in time and energy as well as a significant improvement in decision quality. Terms normally associated with software tools that provide this sort of functionality include Groupware, Group Decision Support Systems and Corporate Memories. Some companies in this area include Ventana and Corporate Memories.
The Vision From a Consultant
Thierry Van de Merckt
Computer Sciences Corporation
Benelux Division, Hippokrateslaan 14, B-1932 Sint-Stevens-Woluwe
In this short contribution I would like to raise three points that seem critical to me when using data mining in commercial organisations.
1. Data Warehouse labelled "data mining ready to use"
In an engagement at a telco, I learnt the following: it's a mistake to think that just because the client has a data warehouse, it's ready to be used for data mining.
In Europe, data warehouses are just starting to be built. Hence, most data mining projects start by collecting and cleaning data from operational systems or from so-called "data warehouses" which are really just time-stamped copies of operational systems.
Our client was the largest telco in Belgium. We (CSC) had built a "real" data warehouse for them, completely focused on customer information, with an OLAP tool on top of it. Hence, we naively believed that starting our data mining engagement from it would save us the hard work usually spent on building an appropriate data structure for analysis (aggregating the data at the right level of detail and correcting errors). A data warehouse of this size (1.5 terabytes) cannot be accessed easily, so lots of aggregates and summaries are built above the raw data files. It quickly became apparent that the aggregates were useless for us because the level of aggregation was too high. We had to build churn models, and we did not have access to data at the level of the individual client. Aggregates are built according to the requirements of OLAP tools, and the client dimension was always stored at the segment level as its finest detail. We had a hard time accessing the huge raw data files to build our own "data marts", and we found lots of errors at that level of detail.
My position is this: a data warehouse built with OLAP in mind does not help data mining at the level of the aggregates.
This raises the question of addressing raw data directly from the data mining tools. I do not think that this is a good solution, because aggregates are always required: what's the point of accessing the full details of the CDRs (Call Detail Records) for building churn models? The aggregated revenue from the services used by a client in one month is much more useful than the 200,000 records that contribute to that value. In addition, building aggregates helps in understanding the data and in gaining tremendously useful domain knowledge.
Aggregates are necessary. However, the level of detail in the dimensions of the star schema should be adapted to data mining requirements. Building a data warehouse with data mining in mind will radically change the database structure and will enable data mining, OLAP and reporting. The reverse is not true.
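As a rough illustration of the kind of aggregate meant here, the sketch below rolls individual call records up to one revenue figure per client, month and service. The record layout, field order and sample values are invented for the example; a real CDR carries many more fields. Note that the client identifier is kept in the key, which is exactly the level of detail a segment-level OLAP aggregate discards.

```python
from collections import defaultdict

# Hypothetical call records: (client_id, month, service, charge_in_cents).
cdrs = [
    ("C1", "1998-05", "voice", 120),
    ("C1", "1998-05", "voice", 80),
    ("C1", "1998-05", "data", 300),
    ("C2", "1998-05", "voice", 50),
    ("C1", "1998-06", "voice", 200),
]

def monthly_revenue(records):
    """Sum charges per (client, month, service), keeping client-level detail."""
    totals = defaultdict(int)
    for client, month, service, charge in records:
        totals[(client, month, service)] += charge
    return dict(totals)

aggregates = monthly_revenue(cdrs)
# aggregates[("C1", "1998-05", "voice")] is 200 (cents)
```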
2. Data Warehouse Cleansing
I have done just two full-size projects in data mining: the first one at the telco mentioned above, and the second one for a large health care organisation. In both cases, we had a data warehouse available. Both were huge: 1.5 terabytes and 1 terabyte respectively. In both cases, we found a tremendous number of errors in the warehouse when we started to dig into the data with data mining tools.
Data warehouses are supposed to be clean. Validation tests are done when data is entered. However, if errors are still there, nobody will see them, except by chance.
Once again, this is because OLAP and reporting build aggregates at a high level. Hence, positive and negative errors may compensate for each other. Also, errors are not the majority; in many cases they represented just 2 to 7% of the result of a query. Hence, these records are discarded and the effect is not perceived by the users.
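A small made-up example shows how this compensation hides errors. The amounts are invented, and the "true" column is of course not available in a real warehouse; it is included only to make the hidden errors visible.

```python
# Hypothetical records for one segment: (segment, recorded_amount, true_amount).
records = [
    ("A", 100, 100),
    ("A", 250, 200),  # recorded 50 too high
    ("A", 150, 200),  # recorded 50 too low
    ("A", 300, 300),
]

# What an aggregate-level report compares: segment totals.
recorded_total = sum(recorded for _, recorded, _ in records)
true_total = sum(true for _, _, true in records)

# What record-level analysis runs into: individual wrong rows.
bad_rows = [row for row in records if row[1] != row[2]]
error_rate = len(bad_rows) / len(records)
```

Here the segment total is exactly right (800 in both columns) even though half of the individual rows are wrong, so a report at the segment level gives no hint of a problem.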
In the healthcare organisation, we were doing fraud detection. Hence, we were looking for the 0.5% of records that are outliers. An error rate of seven percent can destroy the whole search.
It appears to me that many errors commonly found in "clean" data warehouses are of two kinds:
Hence, my position here is the following: do not trust a data warehouse when doing data mining. The level of detail we are looking at will raise many data cleansing issues.
The second position is the following: there is a tremendous potential market for data mining in data warehouse cleansing and for semi-automatic correction systems.
Many of my clients ask about using data mining to correct their data warehouse (DWH).
3. The sampling issue
A common problem found when applying our algorithms to live data is sampling bias. I had to do a proof of concept for building a credit scoring model for a bank. The bank already had a scoring mechanism in place. Hence, the available database of the bank's clients was already biased by this system. How can I cope with this bias in my own scoring model?
If I build a scoring model without taking this problem into account and simply replace their old scoring with my brilliant new one, I clearly introduce a bias between the learning population and the population that will actually be scored. However, I cannot keep their score as an input to my system, because I would give up the possibility of recovering false positives (false bad payers).
The solution is probably to build a scoring model on the biased population, use it "as it is", and accumulate fresh information on its performance. However, this transition phase can take some time and may have undesirable effects that will distort the performance monitoring of the new system. This problem should be clearly explained to the client before putting any system into production.
However, should you have any good suggestions on how to cope efficiently with this problem, I would be happy to hear them.
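To make the bias concrete, here is a toy illustration with invented applicants, scores and a hypothetical cutoff for the old system. In practice the outcomes of rejected applicants are never observed; they appear here only to show what the learning population is missing.

```python
# Hypothetical applicants: (old_score, defaulted).
applicants = [
    (0.9, False), (0.8, False), (0.7, False), (0.7, True),
    (0.4, False), (0.3, True), (0.2, True), (0.1, False),
]
OLD_CUTOFF = 0.5  # assumed acceptance threshold of the existing scoring

def default_rate(rows):
    """Share of rows that defaulted."""
    return sum(1 for _, defaulted in rows if defaulted) / len(rows)

accepted = [a for a in applicants if a[0] >= OLD_CUTOFF]
rejected = [a for a in applicants if a[0] < OLD_CUTOFF]

observed_rate = default_rate(accepted)      # what the bank's database shows
population_rate = default_rate(applicants)  # what a new model will actually face

# Good payers the old system turned away ("false bad payers").
recoverable = [a for a in rejected if not a[1]]
```

A model trained only on `accepted` sees a default rate of 0.25, while the full applicant population defaults at 0.375, and the two good payers among the rejected never appear in the training data at all.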
Keys to the Commercial Success of Data Mining:
A Service Provider's View
Arthur Andersen, Binzmühlestrasse 14, CH-8050 Zürich, Switzerland
Arthur Andersen is a global multi-disciplinary professional services firm that helps its clients improve their business performance through assurance and business advisory services, business consulting, economic and financial consulting, and tax, legal and business advisory services. Arthur Andersen Business Consulting in Switzerland offers services and provides implementations in areas including, among others, Cost Management, Revenue Enhancement, Activity-Based Management, Performance Management, Knowledge Management, Transaction Systems, Collaborative Systems, Data Warehousing, and OLAP and Data Mining technologies.
Since we are both a user of data mining software and a provider of data mining solutions, my view will include both aspects: the benefits and application areas of data mining technology, as well as directions for improvement of data mining software.
So far we have used data mining in the area of Customer Relationship Management, for customer segmentation and customer retention analysis, as well as for credit scoring. Industries in which this technology is traditionally applied are those that collect customer data for the purpose of billing. They have always been required to store this data and, thus, can make use of it without introducing new operational systems beforehand. We have helped banks, insurance companies, and internet service providers. More and more industries are obtaining customer data while implementing information systems for marketing, sales, and customer service.
Besides that, every company could benefit from data mining technology applied to the analysis of costs and revenues, activities and processes, and its performance in general.
Currently, I'm a member of the Data Warehouse, Data Mining, and Performance Management team of Arthur Andersen Business Consulting in Switzerland. Our work includes any part of the whole process of going from a strategy to measurable performance indicators on one side and from the collection to the analysis of data on the other. We help our clients with the selection of the technology and the implementation. In other areas, we collaborate with the Cost Management and Revenue Enhancement teams.
Prior to that I did consulting in object-oriented methodologies for software development. Our group taught courses on the concepts and programming languages of software development as well as how to analyze and design software. We developed prototypes and feasibility studies and coached clients in the analysis, design, and implementation phases of a software project.
For two and a half years, I was a research assistant at the Swiss Federal Institute of Technology in Zurich. My task was the high-level part of a project in Computer Vision, whereby the interpretation of an image is the reconstruction of the scene by determining the values of a number of adjustable parameters of a model.
I studied Computer Science in Germany and at the University of California, San Diego, with a focus on Artificial Intelligence, including classical AI, Computer Vision, Neural Networks, Genetic Algorithms and Programming, Computational Linguistics, and Information Retrieval.