Some Thoughts on the Current State of Data Mining Software
by Kurt Thearling
Published in the January 13, 1998 edition of DS*
As a former developer of data mining software, I can understand how difficult it is to
create applications that are relevant to business users. Much of the data mining community
comes from an academic background and has focused on the algorithms buried deep in the
bowels of the technology. But algorithms are not what business users care about. Over the
past few years the technology of data mining has moved from the research lab to Fortune
500 companies, requiring a significant change in focus. The core algorithms are now a
small part of the overall application, perhaps 10% of a component that is itself only
10% of the whole.
That being said, the focus of this article is to point out some areas in the remaining
99% that need to be improved. Here's my current top ten list:
1. Database integration. No flat files. One more time: No flat files. Not
supporting database access (reading and writing) via ODBC or native methods is just plain
lazy. Companies spend millions of dollars to build data warehouses to hold their data and
data mining applications must take advantage of this. Besides saving significant manual
effort and storage space, relational integration allows data mining applications to access
the most up-to-date information available. I'm happy to say that many of the leading
data mining vendors have heard this message, but there's still room for improvement.
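As a sketch of what direct database integration looks like, the snippet below pulls mining inputs straight from a table rather than an exported flat file. It uses Python's built-in sqlite3 module as a stand-in for an ODBC connection, and the table and column names are invented for illustration:

```python
import sqlite3  # stand-in for an ODBC driver; the cursor calls look much the same

def fetch_training_rows(conn, table):
    """Read mining inputs straight from the warehouse -- no flat-file export."""
    cur = conn.execute(f"SELECT customer_id, balance, tenure FROM {table}")
    return cur.fetchall()

# demo against an in-memory database standing in for the warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INT, balance REAL, tenure INT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [(1, 1200.0, 24), (2, 80.5, 3)])
conn.commit()
rows = fetch_training_rows(conn, "customers")
```

Writing scores back through the same connection keeps results in the warehouse, alongside the most up-to-date data.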
2. Automated Model Scoring. Scoring is the unglamorous workhorse
of data mining. It doesn't have the sexiness of a neural network or a genetic algorithm
but without it, data mining is pretty useless. (There are some data mining applications
that cannot score the models that they produce - to me this is like building a house and
forgetting to put in any doors.) At the end of the day, when your data mining tools have
given you a great predictive model, there's still a lot of work to be done. Scoring models
against a database is currently a time consuming, error prone activity that hasn't been
given the consideration that it is due. When someone in marketing needs to have a database
scored, they usually have to call someone in IT and cross their fingers that it will be
done correctly. If the marketing campaigns that rely on the scores are run on a continuous
(daily) basis, this means a lot of phone calls and a lot of manual processing. Instead, the
process that makes use of the scores should drive the model scoring. Scoring should be
integrated with the driving applications via published APIs (a standard would be nice but
it's probably too soon for this) and run-time-library scoring engines. Automation will
reduce processing time, allow for the most up-to-date data to be used, and reduce error.
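A run-time scoring engine boils down to a published call that applies a stored model to one record, something the consuming application can drive itself. A minimal sketch, using a toy linear model and invented field names:

```python
def score(model, record):
    """Apply a (toy) linear model to one record; a run-time scoring engine
    would expose a call of roughly this shape through its API."""
    return sum(w * record[name] for name, w in model["weights"].items()) + model["bias"]

def score_table(model, records):
    """Score a whole table under the control of the consuming application --
    no phone call to IT, no manual processing."""
    return [(r["id"], score(model, r)) for r in records]

model = {"weights": {"balance": 0.001, "tenure": 0.05}, "bias": -1.0}
records = [{"id": 1, "balance": 1200.0, "tenure": 24},
           {"id": 2, "balance": 80.5, "tenure": 3}]
scores = score_table(model, records)
```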
3. Exporting Models to Other Applications. This is really an
extension to #2. Once a model has been produced, other applications (especially the
applications that drive the scoring process) need to know that it exists. Technologies
such as OLE automation can make this process relatively straightforward. It's just a
matter of adding the "export" button on the data mining user interface and
creating a means to extend the export functionality by external applications. Exporting
models will then close the loop between data mining and the applications that need to use
the results (scores). Besides exporting the model itself, it would be useful to include
summary statistics and other high-level pieces of information about the model so that the
external application could incorporate this information into its own process.
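One way to picture such an export is a portable bundle that carries the model together with its summary statistics. The sketch below uses JSON purely for illustration; the format and the statistic names are invented:

```python
import json

def export_model(model, stats):
    """Bundle a model with high-level summary statistics so the consuming
    application can reason about the model, not just run it."""
    return json.dumps({"model": model, "summary": stats})

bundle = export_model(
    {"type": "linear", "weights": {"balance": 0.001}, "bias": -1.0},
    {"trained_on_rows": 50000, "lift_at_top_decile": 3.2},  # illustrative figures
)
restored = json.loads(bundle)  # what the importing application would see
```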
4. Business Templates. Solving a business problem is much more
valuable to a user than solving a statistical modeling problem. This means that a
cross-selling specific application is more valuable than a general modeling tool that can
create cross-selling models. It might be simply a matter of changing terminology and a few
modifications to the user interface, but those changes are important. From the user's
perspective, it means that they don't have to stretch very far to take their
current understanding of their problem and map it to the software they are using.
5. Effort Knob. Users do not necessarily understand the
relationship between complex algorithm parameters and the performance that they will see.
As a result, the user might naively change a tuning parameter in order to improve modeling
accuracy, increasing processing time by an order of magnitude. This is not a relationship
that the user can (or should) understand. Instead, a better solution is to provide an
"effort knob" that allows a user to control global behavior. Set it to a low
value and the system should produce a model quickly, doing the best it can given the
limited amount of time. On the other hand, if it is set to the maximum value the system
might run overnight to produce the best model possible. Because time and effort are
concepts that a business user can understand, an effort knob is relevant in a way that
tuning parameters are not.
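An effort knob is essentially a translation layer: one business-facing setting fans out into the tuning parameters the algorithm understands. A sketch, where the specific mappings are invented for illustration:

```python
def effort_to_params(effort):
    """Translate a single 'effort knob' setting (0 = quick, 10 = overnight)
    into internal tuning parameters. The mappings are illustrative only."""
    if not 0 <= effort <= 10:
        raise ValueError("effort must be between 0 and 10")
    return {
        "max_tree_depth": 3 + effort,        # deeper trees take longer to build
        "candidate_models": 1 + 2 * effort,  # try more alternatives at high effort
        "search_passes": 1 if effort < 5 else 2,
    }

quick = effort_to_params(0)      # a fast, rough model
thorough = effort_to_params(10)  # the overnight run
```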
6. Incorporate Financial Information. Data mining does not
operate in a vacuum. The results of the data mining process will drive efforts in areas
such as marketing, risk management, and credit scoring. Each of these areas is influenced
by financial considerations that need to be incorporated in the data mining modeling
process. A business user is concerned with maximizing profit, not minimizing RMS error.
The information necessary to make these financial decisions (costs, expected revenue,
etc.) is often available and should be provided as an input to the data mining process.
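For example, once those financial inputs are available, the decision to contact a customer reduces to an expected-profit calculation rather than a question of model error. A sketch with illustrative figures:

```python
def expected_profit(p_respond, revenue_if_respond, contact_cost):
    """Expected profit of contacting one customer -- the quantity a business
    user actually wants maximized, rather than RMS error minimized."""
    return p_respond * revenue_if_respond - contact_cost

def worth_contacting(p_respond, revenue_if_respond, contact_cost):
    """Contact the customer only when the expected profit is positive."""
    return expected_profit(p_respond, revenue_if_respond, contact_cost) > 0

# a 3% responder covers a $1 mail piece when a response is worth $50...
keep = worth_contacting(0.03, 50.0, 1.0)
# ...but a 1% responder does not
drop = worth_contacting(0.01, 50.0, 1.0)
```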
7. Computed Target Columns. In many cases the desired target
variable does not necessarily exist in the database. If the database includes information
about customer purchases, a business user might only be interested in customers whose
purchases were more than one hundred dollars. Obviously, it would be straightforward to
add a new column to the database that contained this information. But doing so would
probably involve database administrators and IT personnel, complicating a process that is
probably complicated already. In addition, the database could become messy as more and more
possible targets are added during an exploratory data analysis phase. The solution is to
allow the user to interactively create a new target variable. Combining this with an
application wizard (#10), it would be relatively simple to allow the user to create
computed targets on the fly.
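Letting the user define the target interactively can be as simple as evaluating an expression over each row at mining time, leaving the warehouse untouched. A sketch, with invented column names and the hundred-dollar cutoff from the example above:

```python
def computed_target(rows, expression):
    """Derive a 0/1 target column on the fly instead of asking a DBA to add
    one; 'expression' is any predicate over a row."""
    return [1 if expression(row) else 0 for row in rows]

rows = [{"customer": "a", "purchases": 250.0},
        {"customer": "b", "purchases": 40.0}]
# the business user's definition: purchases of more than one hundred dollars
target = computed_target(rows, lambda row: row["purchases"] > 100.0)
```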
8. Time-Series Data. Much of the data that exists in data
warehouses has a time-based component. A year's worth of monthly balance information
is qualitatively different than twelve distinct non-time-series variables. Data mining
applications need to understand that fact and use it to create better models. Knowing that
a set of variables is a time-series allows for calculations to be done that make sense
only for time-series data: trends, slopes, deltas, etc. Statisticians have performed these
calculations manually for years, but most data mining applications cannot, because they
treat time-series data as a set of unrelated variables.
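Recognizing twelve monthly columns as one series is what makes derived features like deltas and trends possible. A sketch computing two such features, using an ordinary least-squares slope (the feature names are invented):

```python
def trend_features(balances):
    """Turn a series of monthly balances into features that only make sense
    for time-series data: the latest month-over-month delta and an overall
    least-squares slope."""
    n = len(balances)
    mean_x = (n - 1) / 2  # mean of the month indices 0, 1, ..., n-1
    mean_y = sum(balances) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(balances))
             / sum((x - mean_x) ** 2 for x in range(n)))
    return {"delta": balances[-1] - balances[-2], "slope": slope}

feats = trend_features([100.0, 110.0, 120.0, 130.0])  # a steadily rising balance
```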
9. Use vs. View. Data mining models are often complex objects. A
decision tree with four hundred nodes is impossible to fit on a high-resolution video
display, let alone be understood by a human viewer. Unfortunately most data mining
applications do not differentiate between the model that is used to score a database and
the model representation that is presented to users. This needs to be changed. The model
that is presented visually to the user does not necessarily have to be the full model that
is used to score data. A slider on the interface that visualizes a decision tree could be
used to limit the display to the first few (most important) levels of the tree.
Interacting with the display would not have an effect on the complexity of the model but
it would simplify its representation. As a result, users would be able to interact with
the system to provide only the amount of information they can comprehend.
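A depth-limited rendering of this kind keeps the full tree for scoring while the slider controls only what is drawn. A sketch over a toy tree structure (the node layout and labels are invented):

```python
def render_tree(node, max_depth, depth=0):
    """Render only the first max_depth levels of a decision tree. The full
    tree still scores the data; only the display is truncated."""
    if node is None or depth >= max_depth:
        return []
    lines = ["  " * depth + node["label"]]
    for child in node.get("children", []):
        lines += render_tree(child, max_depth, depth + 1)
    return lines

tree = {"label": "income > 50k?",
        "children": [{"label": "yes: tenure > 2y?", "children": [{"label": "buy"}]},
                     {"label": "no: decline"}]}
shallow = render_tree(tree, 1)  # slider set to the most important level only
full = render_tree(tree, 3)     # the complete (toy) tree
```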
10. Wizards. Although not necessarily a must-have, application wizards
can significantly improve the user's experience. Besides simplifying the process,
they can help prevent human error by keeping the user on track.