DIG White Paper 95/02
by Kurt Thearling
The market for data mining, if you believe the hype, will be worth billions of dollars by the turn of the century [1]. Unfortunately, much of what is now considered data mining will be irrelevant, since it is disconnected from the business world. In general, marketing analysts' predictions that the technology of data mining will be very relevant to businesses in the future are correct. The key to making a successful data mining software product, however, is to embrace the business problems that the technology is meant to solve, not to incorporate the hottest technology. In this report I will address some of the issues related to the development of data mining technology as it relates to business users.
The current state-of-the-art analysis of databases is done by high-tech analysts (typically statisticians) using sophisticated tools (e.g., SAS or S-Plus). In essence these analysts are manual data miners. In contrast, data mining software technology promises to automate that analysis, allowing business users (who don't have a Ph.D. in statistics) to develop a more accurate and sophisticated understanding of their data.
Before we go any further, it is probably a good idea to discuss the terminology found in much of the data mining literature [2]. There seems to be a multitude of terms related to the process of analyzing information contained in a database: data mining, database mining, and database marketing. Is there a difference between these terms?
Let's start with the technology. The technology is "data mining." Data mining is, in some ways, an extension of statistics, with a few artificial intelligence and machine learning twists thrown in. Like statistics, data mining is not a business solution; it is just the underlying technology. Statistics does not of itself solve business problems. Unfortunately, data mining is being touted as a business solution when it is simply the base technology upon which business solutions might be built.
Database mining, which incorporates the ability to directly access data stored in a database, is one step beyond the core technology of data mining. The distinction (database rather than data) might seem to be a trivial improvement, but like most transitions from technology to solutions, it requires a major leap for developers. For example, at a recent data mining conference [3], only one presenter discussed how their work interacted with a database. All the other presenters assumed that the data was available in flat files or that any interaction with a database was so irrelevant as to be not worth mentioning. However, anyone familiar with commercial information processing knows the critical impact of interacting with data stored in relational databases (RDBMS).
Database marketing, on the other hand, supports a variety of business processes. It involves transforming a database into business decisions. For example, consider a catalog retailer who needs to decide whom to send a new catalog to. The information incorporated into the database marketing process is the historical database of previous mailings and the features associated with the (potential) customers, such as age, zip code, their response in the past, etc. The database marketing software would use this information to build a model of customer behavior that would generate a mailing list of customers most likely to respond to the new catalog. In the end, any models of the database the data mining software might create are irrelevant - what matters is the list of potential customers who receive the catalog and the accuracy of the list.
OK, now you know that data mining, the technology, is not the solution to your problems. But what is the technology? How does it differ from statistics and other time-proven techniques? And what is the end product of the technology? In a field filled with hype, the answers to these questions can often be vague or misleading. In this section I will ground some expectations.
The phrase "discover interesting patterns" often comes up during discussions of data mining. It is a pretty vague statement, since "interesting" usually depends on a specific vertical market and "pattern" is irrelevant without a specific business problem. For most problems, a pattern is some set of measurable characteristics that can be correlated with some other characteristic. For example, a pattern that might be discovered by a data mining application could be something like this: if your age is between 16 and 20 and your zip code is 90210, then you probably drive a car costing more than $50,000. What this pattern doesn't say is that everyone matching it must drive an expensive car. Usually a pattern is associated with an "accuracy," which specifies the percentage of pattern matches where the correlated characteristic is correct. As far as "interesting" is concerned, that would depend on your business problem. If you are trying to market luxury products, this sort of pattern might very well be interesting. But if you are trying to predict medical insurance fraud, this pattern is unlikely to be useful, and therefore uninteresting.
Coverage is also an important concept. In the previous example, the discovered pattern only applies to some fraction of people living in one zip code. If your business is national, a pattern that includes only one zip code is not enough. In that case the database marketing system would need to discover many more patterns. Coverage is the fraction of records with the desired characteristic that are matched by at least one pattern. If a collection of patterns matches all records with the desired characteristic, the coverage is one hundred percent. The tradeoff is between accuracy and coverage. A pattern that matches everyone in the US would naturally match all people who buy luxury cars. The pattern would have 100% coverage but very low accuracy.
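The two measures can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the Person record, the eight sample people, and the function names are hypothetical, and only the 16-to-20, zip code 90210, $50,000 pattern comes from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Person:
    age: int
    zip_code: str
    car_price: float


Predicate = Callable[[Person], bool]


def teen_in_90210(p: Person) -> bool:
    """The example pattern: age between 16 and 20 and zip code 90210."""
    return 16 <= p.age <= 20 and p.zip_code == "90210"


def everyone(p: Person) -> bool:
    """A degenerate pattern that matches every record."""
    return True


def expensive_car(p: Person) -> bool:
    """The correlated characteristic: a car costing more than $50,000."""
    return p.car_price > 50_000


def accuracy(records: Sequence[Person], pattern: Predicate, target: Predicate) -> float:
    """Fraction of pattern matches that actually have the target characteristic."""
    matched = [p for p in records if pattern(p)]
    return sum(target(p) for p in matched) / len(matched) if matched else 0.0


def coverage(records: Sequence[Person], pattern: Predicate, target: Predicate) -> float:
    """Fraction of records with the target characteristic that the pattern matches."""
    targets = [p for p in records if target(p)]
    return sum(pattern(p) for p in targets) / len(targets) if targets else 0.0


# Made-up sample data, not taken from any real database.
PEOPLE = [
    Person(18, "90210", 62_000),  # matches pattern, expensive car
    Person(19, "90210", 55_000),  # matches pattern, expensive car
    Person(17, "90210", 18_000),  # matches pattern, inexpensive car
    Person(45, "10021", 80_000),  # no match, expensive car
    Person(52, "94027", 70_000),  # no match, expensive car
    Person(30, "60614", 12_000),  # no match, inexpensive car
    Person(41, "02139", 9_000),   # no match, inexpensive car
    Person(23, "73301", 15_000),  # no match, inexpensive car
]

if __name__ == "__main__":
    for name, pattern in [("teen_in_90210", teen_in_90210), ("everyone", everyone)]:
        acc = accuracy(PEOPLE, pattern, expensive_car)
        cov = coverage(PEOPLE, pattern, expensive_car)
        print(f"{name}: accuracy={acc:.2f} coverage={cov:.2f}")
```

On this toy data the narrow pattern is more accurate but finds only half of the expensive-car owners, while the match-everyone pattern reaches full coverage at the cost of accuracy, which is the tradeoff described above.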
Another word that often shows up in data mining is "model." A model [4] is simply a collection of patterns for some desired characteristic (models usually come in a form more complicated than a simple list of characteristics to match). For example, one common model is known as ARMA (auto-regressive moving average). Recently neural networks and other models based on biological concepts have come into vogue. There are lots of model types out there, but in the end they are irrelevant to the business problem. A model should never be confused with a solution.
Given that the model and the business solution are two different things, how can a model be turned into a business solution? To start, there are some things that apply to nearly all database marketing applications. For instance, actionable characteristics, those characteristics that your business has some control over, are usually more important than those that are non-actionable. An example of an actionable characteristic is whether or not someone is sent a catalog. A non-actionable characteristic might be the amount of their last order. A business can decide to send or not send a catalog but it cannot control the amount of a customer's last order. This is especially important when targeting new customers. A pattern that says "if someone is sent a catalog with a 10%-off coupon, they will order $100 worth of merchandise" is much more useful than the pattern "if someone ordered $100 before, they will order $100 again." In the first case the catalog retailer can take action to target potential customers while in the second they must simply wait for the order to come in.
Researchers, primarily in the fields of computer science and statistics, have been responsible for the development of most of the data mining technology currently available. From a business standpoint, this has been a problem since (academic) researchers are good at developing and evaluating data mining technologies, but they tend to get caught up in minute details of the technology. They are not interested (nor should they be) in the fact that the core technology is only a small part of delivering a business solution, and that compromises must be made in order to deliver a usable piece of software. Another group of data mining researchers are what I call "downsized data miners." These are people, primarily with research backgrounds, who worked on data mining research until cutbacks and company downsizing forced them into product development. When downsized data miners develop software, the end product is usually a complex tool (as opposed to a problem-solving application) or an intermediate software product. Lately some downsized data miners have claimed that they will be deploying business solutions; however, most of their software is currently in some form of pre-release (Beta, Alpha, even pre-alpha!). These complex data mining tools compete with other high-end analysis tools (e.g., SAS or S-Plus) that require users to have sophisticated skills. Ultimately very few of these researchers will directly impact the development of database marketing as a business solution.
On the other side of the coin from the researchers are the developers who are trying to create database marketing software applications for business users [5]. Unlike data mining tools, these applications do not require users to know how to set up statistical experiments or build data models. The developers of database marketing applications start with the business problems and try to determine if some piece of data mining technology might be useful in solving the problem. The technology associated with a data mining software application, just one small part of the overall product, will be built using techniques developed by researchers. Although current software products could be more sophisticated, the future for these software companies is the future of data mining.
The technology commonly referred to as data mining already exists in at least cursory form. Unfortunately, for business users, the data mining community is currently focusing on refining the technology, without attempting to validate it in business applications. From a practical standpoint, who cares if some algorithm is a 5% improvement over the best data mining technique if it only works from a command line interface on some supercomputer? If it isn't easily usable, it is irrelevant to most users.
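To deliver data mining technology into the hands of business users, several changes from the current state of the technology will be required. These changes can be broken down into three key areas: starting from an understanding of the business problem, presenting the process and results in a form the business user can work with, and integrating smoothly with standard relational databases.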
The first point is the most important. A database marketing software product will not succeed if it does not start with an understanding of real-world business problems. Ultimately the transition between model and business solution will require a thorough understanding of the marketplace to formulate the problem in a way that will affect a business. The ability of a database marketing application to make use of this information will determine if it is truly useful to a business. Therefore, industry-specific value-added solution providers will probably have an important place in the field of database marketing. They should be able to contribute vertical-market-specific templates and meta-data that will guide the database mining technology toward solutions to the business problems.
Once the business problem has been taken into consideration, the process and results need to be conveyed to the business person who needs to make a decision. It cannot be assumed that the person who makes the decision will understand how to work with a neural network model or how to interpret the results from such a model. User-friendly graphical user interfaces (GUIs) are a necessity. These GUIs must integrate smoothly into the business user's overall decision support (DSS) application environment. This environment is usually client/server, with a PC running Windows as the preferred client platform. Technology-related input parameters must be avoided at all costs. A decision tree database mining application shouldn't require the user to specify search width, search depth, number of training records, etc. The user won't understand what these terms mean, let alone know what to provide as input values. Instead the user should be asked for things related to his or her world. How much time can the process take? How much "effort" should be dedicated to the problem [6]? The application will need to translate between the user-specified parameters and the parameters required by the technology. A feedback process by which the application provides the user information related to their input parameters would be very useful. For example, the system might tell the user that when the "effort" knob is set to 5, the process will take about three hours and will look at 40% of the database. By increasing the setting to 7, the time might increase to five hours but 75% of the database will be analyzed. Such tradeoffs are within the scope of knowledge of business users.
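A minimal sketch of how such a translation layer might look follows. It is an assumption from start to finish: the calibration table, the EngineSettings fields, and the depth rule are hypothetical, and only the two data points for settings 5 and 7 (three hours at 40%, five hours at 75%) come from the example above. A real product would have to measure these numbers on its own hardware and data.

```python
from dataclasses import dataclass

# (effort, estimated hours, fraction of database sampled).
# The rows for effort 5 and 7 mirror the example in the text;
# the rows for 0 and 10 are invented endpoints for interpolation.
CALIBRATION = [
    (0, 0.25, 0.02),
    (5, 3.0, 0.40),
    (7, 5.0, 0.75),
    (10, 9.0, 1.00),
]


@dataclass(frozen=True)
class EngineSettings:
    """Technical parameters the user never sees (hypothetical names)."""
    sample_fraction: float
    max_tree_depth: int
    estimated_hours: float


def translate_effort(effort: float) -> EngineSettings:
    """Map the 0-10 effort knob to engine parameters by linear interpolation."""
    if not 0 <= effort <= 10:
        raise ValueError("effort must be between 0 and 10")
    for (e0, h0, f0), (e1, h1, f1) in zip(CALIBRATION, CALIBRATION[1:]):
        if e0 <= effort <= e1:
            t = (effort - e0) / (e1 - e0)
            return EngineSettings(
                sample_fraction=f0 + t * (f1 - f0),
                max_tree_depth=2 + round(effort),  # arbitrary illustrative rule
                estimated_hours=h0 + t * (h1 - h0),
            )
    raise AssertionError("calibration table does not cover the effort range")


def describe(effort: float) -> str:
    """Feedback shown to the business user, phrased in their own terms."""
    s = translate_effort(effort)
    return (f"Effort {effort}: about {s.estimated_hours:.1f} hours, "
            f"{s.sample_fraction:.0%} of the database")


if __name__ == "__main__":
    for setting in (2, 5, 7, 10):
        print(describe(setting))
```

The user only ever sees the output of describe; the search depth and sampling fraction stay inside the application.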
Finally, database marketing applications must be smoothly integrated with standard relational database products. Business users do not want to deal with dumping an RDBMS as a flat file or translating between different data formats. Database marketing applications need to work with ODBC (Open Database Connectivity) and leading relational database interfaces so that they can interact directly with the databases. When an application speaks to a database, it will probably be in SQL, the standard for the relational database industry. These things would be obvious to developers of business software, but not necessarily to those in the research-oriented field of data mining. One bad sign: at the 1995 data mining conference, not a single speaker mentioned the words client/server or ODBC.
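As a rough illustration of talking to the database directly instead of exporting a flat file, the sketch below issues an SQL query and lets the database do the aggregation. The assumptions are significant: Python's built-in sqlite3 module stands in for an ODBC connection to a production RDBMS, and the mailings table, its columns, and its rows are hypothetical.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the retailer's mailing history."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE mailings ("
        " customer_id INTEGER PRIMARY KEY,"
        " age INTEGER,"
        " zip_code TEXT,"
        " responded INTEGER)"  # 1 if the customer ordered, else 0
    )
    conn.executemany(
        "INSERT INTO mailings VALUES (?, ?, ?, ?)",
        [
            (1, 34, "02139", 1),
            (2, 51, "02139", 0),
            (3, 28, "90210", 1),
            (4, 19, "90210", 1),
            (5, 45, "60614", 0),
        ],
    )
    conn.commit()
    return conn


def response_rate_by_zip(conn: sqlite3.Connection) -> list:
    """Return (zip_code, response_rate, mailings_sent), best zip codes first.

    The GROUP BY runs inside the database, so no flat-file dump is needed.
    """
    cursor = conn.execute(
        "SELECT zip_code, AVG(responded) AS rate, COUNT(*) AS sent"
        " FROM mailings"
        " GROUP BY zip_code"
        " ORDER BY rate DESC, zip_code"
    )
    return [tuple(row) for row in cursor.fetchall()]


if __name__ == "__main__":
    connection = build_demo_db()
    for zip_code, rate, sent in response_rate_by_zip(connection):
        print(f"{zip_code}: {rate:.0%} response over {sent} mailings")
    connection.close()
```

Against a production system the connection would come from an ODBC driver or the vendor's own interface, but the application code would still be expressed in SQL.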
What does the future have in store for data mining? In the end, much of what is called data mining will likely end up as standard tools built into database or data warehouse software products. As a motivation for this statement, I would like to use the field of spell checking software as an example. Just look back ten years to the infancy of computer word processing. Many companies made spell checking software. You would usually buy a spell checker as a separate piece of software for use with whatever word processor you might have. Sometimes the spell-checker wouldn't understand a particular word processor's file format. Some spell-checkers might have even required you to dump your document as an ASCII file before it would check the spelling (on the ASCII file). In that case, you would have had to manually make corrections in the original document. Eventually the spell checkers became more user friendly and understood every possible document format. Functionality also increased. The future of spell checking probably looked pretty rosy.
So, where are the spell checking companies today? Where is the spell checking software? If you look at your local computer store you won't find much there. Instead you will find that your new word processor comes with a built-in spell checker. As word processor software increased in sophistication and functionality, it was a natural progression to include spell checking into the standard system.
The future of data mining may very well parallel the history of spell checking. The functionality of database marketing products will increase to integrate with relational database products (no more dumping an RDBMS into a flat file!) and with key DSS application environments. These products will stress the business problem rather than the technology, and present the process to the user in a friendly manner. Database marketing will start losing some of the hype and begin to provide real value to users. This will make database marketing an important business in and of itself. The larger RDBMS and data warehouse companies have already expressed an interest in integrating data mining into their database products. In the end, this new market and its business opportunities will drive mainstream database companies to database marketing. Ten years from now there may be only a few independent data mining companies left in existence. The real survivors will likely be the ones with the foresight to develop a strong relationship with the mainstream database industry.
Database marketing software applications will have a tremendous impact on how business is done in the future. Although the core data mining technology is here today, developers need to take what already exists and turn it into something that business users can work with. The successful database marketing applications will combine data mining technology with a thorough understanding of business problems and present the results in a way that the user can understand. At that point the knowledge contained in a database will be understood by people who can turn what is known into what can be done.
1. According to a recent Gartner Group report, by the year 2000 at least half of the Fortune 1000 companies worldwide will be using data mining technology.
2. For a general overview of some applications of data mining and database marketing, see "Database Marketing - A Potent New Tool for Selling," Business Week, September 5, 1994, and "Using Computers to Divine Who Might Buy a Gas Grill," The Wall Street Journal, August 16, 1994.
3. The First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, August 20-21, 1995.
4. A list of some common models: linear, auto-regressive moving average, decision tree, and neural network. Models can also be combined.
5. Some of the more interesting data mining/database marketing companies include Unica, Trajecta, and HNC.
6. "Effort" is a knob somewhere on the GUI that runs from 0 to 10. If the user turns it to 10, the process will take quite a while and will produce very good models. If they set the dial to 2, the results will pop out quicker but the quality of the model will be less.