Some Thoughts on the Current State of Data Mining Software Applications

by Kurt Thearling

Published in the January 13, 1998 edition of DS*

As a former developer of data mining software, I can understand how difficult it is to create applications that are relevant to business users. Much of the data mining community comes from an academic background and has focused on the algorithms buried deep in the bowels of the technology. But algorithms are not what business users care about. Over the past few years the technology of data mining has moved from the research lab to Fortune 500 companies, requiring a significant change in focus. The core algorithms are now a small part of the overall application: perhaps 10% of a larger component, which itself is only 10% of the whole.

That being said, the focus of this article is to point out some areas in the remaining 99% that need to be improved upon. Here’s my current top ten list:

  1. Database integration. No flat files. One more time: No flat files. Not supporting database access (reading and writing) via ODBC or native methods is just plain lazy. Companies spend millions of dollars to build data warehouses to hold their data, and data mining applications must take advantage of this. Besides saving significant manual effort and storage space, relational integration allows data mining applications to access the most up-to-date information available. I’m happy to say that many of the leading data mining vendors have heard this message, but there’s still room for improvement.
  1. Automated Model Scoring. Scoring is the unglamorous workhorse of data mining. It doesn't have the sexiness of a neural network or a genetic algorithm, but without it data mining is pretty useless. (There are some data mining applications that cannot score the models that they produce; to me this is like building a house and forgetting to put in any doors.) At the end of the day, when your data mining tools have given you a great predictive model, there's still a lot of work to be done. Scoring models against a database is currently a time-consuming, error-prone activity that hasn't been given the consideration that it is due. When someone in marketing needs to have a database scored, they usually have to call someone in IT and cross their fingers that it will be done correctly. If the marketing campaigns that rely on the scores are run on a continuous (daily) basis, this means a lot of phone calls and a lot of manual processing. Instead, the process that makes use of the scores should drive the model scoring. Scoring should be integrated with the driving applications via published APIs (a standard would be nice but it's probably too soon for this) and run-time-library scoring engines. Automation will reduce processing time, allow for the most up-to-date data to be used, and reduce error.
  1. Exporting Models to Other Applications. This is really an extension to #2. Once a model has been produced, other applications (especially applications that will drive the scoring process) need to know that it exists. Technologies such as OLE automation can make this process relatively straightforward. It's just a matter of adding an "export" button to the data mining user interface and creating a means for external applications to extend the export functionality. Exporting models will then close the loop between data mining and the applications that need to use the results (scores). Besides exporting the model itself, it would be useful to include summary statistics and other high-level pieces of information about the model so that the external application could incorporate this information into its own process.
  1. Business Templates. Solving a business problem is much more valuable to a user than is solving a statistical modeling problem. This means that an application built specifically for cross-selling is more valuable than a general modeling tool that can create cross-selling models. It might be simply a matter of changing terminology and making a few modifications to the user interface, but those changes are important. From the user’s perspective, it means that they don’t have to stretch very far in order to take their current understanding of their problem and map it to the software they are using.
  1. Effort Knob. Users do not necessarily understand the relationship between complex algorithm parameters and the performance that they will see. As a result, the user might naively change a tuning parameter in order to improve modeling accuracy, increasing processing time by an order of magnitude. This is not a relationship that the user can (or should) understand. Instead, a better solution is to provide an "effort knob" that allows a user to control global behavior. Set it to a low value and the system should produce a model quickly, doing the best it can given the limited amount of time. On the other hand, if it is set to the maximum value the system might run overnight to produce the best model possible. Because time and effort are concepts that a business user can understand, an effort knob is relevant in a way that tuning parameters are not.
  1. Incorporate Financial Information. Data mining does not operate in a vacuum. The results of the data mining process will drive efforts in areas such as marketing, risk management, and credit scoring. Each of these areas is influenced by financial considerations that need to be incorporated in the data mining modeling process. A business user is concerned with maximizing profit, not minimizing RMS error. The information necessary to make these financial decisions (costs, expected revenue, etc.) is often available and should be provided as an input to the data mining application.
  1. Computed Target Columns. In many cases the desired target variable does not exist in the database. If the database includes information about customer purchases, a business user might only be interested in customers whose purchases were more than one hundred dollars. Obviously, it would be straightforward to add a new column to the database that contained this information. But this would probably involve a database administrator and IT personnel, complicating a process that is probably complicated already. In addition, the database could become messy as more and more possible targets are added during an exploratory data analysis phase. The solution is to allow the user to interactively create a new target variable. Combining this with an application wizard (#10), it would be relatively simple to allow the user to create computed targets on the fly.
  1. Time-Series Data. Much of the data that exists in data warehouses has a time-based component. A year’s worth of monthly balance information is qualitatively different from twelve distinct non-time-series variables. Data mining applications need to understand that fact and use it to create better models. Knowing that a set of variables is a time series allows for calculations to be done that make sense only for time-series data: trends, slopes, deltas, etc. Statisticians have performed these calculations manually for years, but most data mining applications cannot perform them because they treat time-series data as a set of unrelated variables.
  1. Use vs. View. Data mining models are often complex objects. A decision tree with four hundred nodes is impossible to fit on a high-resolution video display, let alone be understood by a human viewer. Unfortunately, most data mining applications do not differentiate between the model that is used to score a database and the model representation that is presented to users. This needs to change. The model that is presented visually to the user does not necessarily have to be the full model that is used to score data. A slider on the interface that visualizes a decision tree could be used to limit the display to the first few (most important) levels of the tree. Interacting with the display would not affect the complexity of the model, but it would simplify its representation. As a result, users would be able to adjust the system so that it presents only as much information as they can comprehend.
  1. Wizards. Although not necessarily a must-have, application wizards can significantly improve the user’s experience. Besides simplifying the process, they can help prevent human error by keeping the user on track.
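
To make the "Incorporate Financial Information" item above a little more concrete, here is a minimal sketch of choosing a score cutoff by expected profit rather than by a statistical error measure. It assumes each model score is a predicted response probability; the function names, the $50 revenue per response, and the $4 cost per contact are hypothetical values chosen only for illustration.

```python
from typing import List


def expected_profit(scores: List[float], cutoff: float,
                    revenue_per_response: float,
                    cost_per_contact: float) -> float:
    """Expected profit from contacting every customer scored at or above cutoff.

    Assumption: each score is a predicted response probability.
    """
    return sum(p * revenue_per_response - cost_per_contact
               for p in scores if p >= cutoff)


def most_profitable_cutoff(scores: List[float],
                           revenue_per_response: float,
                           cost_per_contact: float) -> float:
    """Try each distinct score as a cutoff and keep the most profitable one.

    An infinite cutoff ("contact nobody", profit 0) is included so that a
    campaign that loses money at every cutoff is never recommended.
    """
    candidates = sorted(set(scores)) + [float("inf")]
    return max(candidates,
               key=lambda c: expected_profit(scores, c,
                                             revenue_per_response,
                                             cost_per_contact))


# Hypothetical campaign: $50 revenue per response, $4 cost per contact.
scores = [0.02, 0.05, 0.10, 0.30]
cutoff = most_profitable_cutoff(scores, 50.0, 4.0)
print(cutoff, expected_profit(scores, cutoff, 50.0, 4.0))
```

With these made-up numbers, contacting a customer only pays off when the predicted response probability exceeds 4/50 = 0.08, which is the kind of threshold a business user can reason about directly.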
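
The "Computed Target Columns" item above can be sketched in a few lines. This is only an illustration of the idea, assuming rows are held in memory as dictionaries; the column names (`purchase_total`, `big_spender`) and the helper `add_computed_target` are hypothetical. The point is that the derived target lives in the analysis session, not in the warehouse schema.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def add_computed_target(rows: List[Row], name: str,
                        rule: Callable[[Row], bool]) -> List[Row]:
    """Return copies of the rows with a derived 0/1 target column added.

    The source rows are left untouched, so no database change is needed.
    """
    return [{**row, name: int(rule(row))} for row in rows]


# Hypothetical customer purchase data.
customers = [
    {"customer_id": 1, "purchase_total": 42.50},
    {"customer_id": 2, "purchase_total": 180.00},
    {"customer_id": 3, "purchase_total": 100.00},
]

# Target: customers whose purchases were more than one hundred dollars.
with_target = add_computed_target(
    customers, "big_spender", lambda row: row["purchase_total"] > 100
)
print([row["big_spender"] for row in with_target])
```

Because the rule is just a function, a wizard could build it from a few user choices (column, comparison, threshold) without involving a database administrator.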
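
As an illustration of the "Time-Series Data" item above, the sketch below derives a slope and deltas from an ordered run of values, such as monthly balances. It assumes the observations are evenly spaced and ordered oldest to newest; the function name and the feature names are hypothetical, and the slope is an ordinary least-squares fit against the time index.

```python
from typing import Dict, List


def time_series_features(values: List[float]) -> Dict[str, float]:
    """Derive simple trend features from evenly spaced, time-ordered values."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")

    # Period-over-period changes (deltas).
    deltas = [later - earlier for earlier, later in zip(values, values[1:])]

    # Least-squares slope of value against time index 0..n-1.
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    covariance = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    variance = sum((t - t_mean) ** 2 for t in range(n))

    return {
        "slope": covariance / variance,
        "mean_delta": sum(deltas) / len(deltas),
        "last_delta": deltas[-1],
        "net_change": values[-1] - values[0],
    }


# Hypothetical monthly balances for one customer.
balances = [100.0, 110.0, 120.0, 130.0]
features = time_series_features(balances)
print(features)
```

None of these features can be computed if the twelve monthly columns are treated as unrelated variables, because their order carries the information.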
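
The "Use vs. View" item above separates the model that scores data from the representation shown to the user. A minimal sketch of that separation, assuming a simple hypothetical `Node` class for a decision tree: the `render` function plays the role of the slider, limiting how many levels are displayed while leaving the tree itself unchanged.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A hypothetical decision-tree node: a label plus child nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


def count_nodes(node: Node) -> int:
    """Total number of nodes in the subtree rooted at node."""
    return 1 + sum(count_nodes(child) for child in node.children)


def render(node: Node, max_depth: int, depth: int = 0) -> List[str]:
    """Return display lines for the tree, showing at most max_depth levels
    below the root. Deeper subtrees are summarized, not removed."""
    lines = ["  " * depth + node.label]
    if node.children and depth >= max_depth:
        hidden = count_nodes(node) - 1
        lines.append("  " * (depth + 1) + f"... ({hidden} nodes hidden)")
        return lines
    for child in node.children:
        lines.extend(render(child, max_depth, depth + 1))
    return lines


# A tiny hypothetical tree; a real one might have hundreds of nodes.
tree = Node("income > 50k?", [
    Node("age > 40?", [Node("respond"), Node("no response")]),
    Node("no response"),
])

view = render(tree, max_depth=1)
print("\n".join(view))
```

Changing `max_depth` changes only the view; the full tree remains available for scoring.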
