Understanding Data Mining: It's All in the Interaction
by Kurt Thearling
Published in the December 9, 1997 edition of DS*
Data mining is an unusual process. In most standard database operations, nearly all of the results presented to the user are something that they already knew existed in the database. A report showing the breakdown of sales by product line and region is straightforward for the user to understand because they intuitively know that this kind of information already exists in the database. If the company sells different products in different regions of the country, there is no problem translating a display of this information into a relevant understanding of the business process.
Data mining, on the other hand, extracts information from a database that the user did not know existed. Non-intuitive relationships between variables and customer behaviors are the jewels that data mining hopes to uncover. And because the user does not know beforehand what the data mining process has discovered, it is a much bigger leap to take the output of the system and translate it into a solution to a business problem.
This is where visualization comes in. The purpose of visualization is pretty simple: to let the user understand what is going on. Since data mining usually involves extracting "hidden" information from a database, the understanding process can get a bit complicated. The key is to put the user in a context they feel comfortable in and then let them poke and prod until they understand what they didn't see before.
How does someone actually use the output of a data mining analysis? The simplest way is to leave the output (the "model") in the form of a black box. If the user takes the black box and scores a database with it, they get a list of customers to target (send them a catalog, increase their credit limit, etc.). There's not much for the user to do other than sit back and watch the envelopes go out. This can be a very effective approach. Mailing costs can often be reduced by an order of magnitude without significantly reducing the response rate. But it does require that the user trusts the system, and that's where things get complicated.
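To make the black-box approach concrete, here is a minimal Python sketch of scoring a customer list and keeping only the best prospects. Everything in it is an illustrative assumption rather than any vendor's method: the `toy_model` function, its weights, the field names, and the 40% cutoff are all hypothetical stand-ins for whatever model a real tool would produce.

```python
from typing import Callable, Dict, List

Customer = Dict[str, float]


def select_targets(
    customers: List[Customer],
    model: Callable[[Customer], float],
    fraction: float,
) -> List[Customer]:
    """Score every customer with an opaque model and keep the top fraction."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Highest scores first; the caller never looks inside the model.
    ranked = sorted(customers, key=model, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


def toy_model(c: Customer) -> float:
    # Hypothetical scoring rule standing in for a trained model.
    return 0.6 * c["recent_orders"] / 10 + 0.4 * c["balance"] / 1000


customers = [
    {"id": 1, "recent_orders": 8, "balance": 900},
    {"id": 2, "recent_orders": 1, "balance": 100},
    {"id": 3, "recent_orders": 5, "balance": 500},
    {"id": 4, "recent_orders": 0, "balance": 50},
    {"id": 5, "recent_orders": 9, "balance": 200},
]

# Mail only the top 40% of the list instead of everyone.
targets = select_targets(customers, toy_model, fraction=0.4)
target_ids = [c["id"] for c in targets]
```

The user of `select_targets` sees only a ranked list, which is exactly why trust becomes the issue: nothing in the output explains why one customer outranks another.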
Then there's the more difficult way to use the results of data mining: getting the user to actually understand what is going on so that they can take action directly. For example, if the user is responsible for ordering a print advertising campaign, then understanding customer demographics is critical. A data mining analysis might determine that customers in New York City are now focused in the 30 to 35 age range while previous analyses showed that these customers were primarily in the ages 22 to 27. This change means the print campaign might move from the Village Voice to the New Yorker. There's no automated way to do this. It's all in the marketing manager's head. Unless the output of the data mining system can be understood qualitatively, it won't be of any use.
These two cases (trusting and understanding) are inextricably linked. The user needs to view the output of the data mining in a context they understand. If they can understand what has been discovered, they will trust it and put it into use. There are two parts to this problem: 1) visualization of the data mining output in a meaningful way, and 2) allowing the user to interact with the visualization so that simple questions can be answered. Creative solutions to the first part have recently been incorporated into a number of commercial data mining products. Graphing lift, response, and (probably most importantly) financial indicators (e.g., profit, cost, ROI) gives the user a sense of context that can quickly ground the results in reality. After that, simple representations of the results let the user see what the model has actually found.
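For readers who have not met lift before, the numbers behind a lift chart are easy to compute. The sketch below assumes the usual definition (the response rate among the top-scored customers divided by the response rate of the whole list); the function name, the bucket count, and the sample data are made up for illustration.

```python
from typing import List, Sequence


def cumulative_lift(
    scores: Sequence[float],
    responded: Sequence[int],
    buckets: int = 10,
) -> List[float]:
    """Cumulative lift at each of `buckets` equally spaced depths of the ranked list."""
    if len(scores) != len(responded) or not scores:
        raise ValueError("scores and responded must be non-empty and the same length")
    overall_rate = sum(responded) / len(responded)
    if overall_rate == 0:
        raise ValueError("no responders, so lift is undefined")
    # Order the actual outcomes by model score, best prospects first.
    ranked = [r for _, r in sorted(zip(scores, responded), key=lambda p: p[0], reverse=True)]
    lifts = []
    for b in range(1, buckets + 1):
        cutoff = max(1, len(ranked) * b // buckets)
        rate = sum(ranked[:cutoff]) / cutoff
        lifts.append(rate / overall_rate)
    return lifts


# Made-up example: ten customers, three of whom responded (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
responded = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lifts = cumulative_lift(scores, responded, buckets=5)
```

A lift well above 1 in the first buckets means the model concentrates responders at the top of the list; by the last bucket the whole list is included, so lift falls back to 1.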
Graphically displaying a decision tree (generated by a CART or CHAID tool) can significantly change the way the data mining software is used. (Until recently, nearly every user of Pilot's Discovery Server CART software drew the extracted decision tree by hand when they weren't able to get the system to do it.) Some algorithms pose more problems than others (e.g., neural networks), but solutions are starting to appear (look at Trajecta's neural network visualization tools).
It is the second part that has yet to be addressed fully. Interaction is, for many users, the Holy Grail of visualization in data mining. Seeing a decision tree is nice, but what the user really wants to do is drag-and-drop the best customer segments onto a map of the United States in order to see if there are sales regions that are neglected. The number of questions that can be asked is endless: How do the most likely customers break down by gender? What is the average balance for the predicted defaulters? The interaction will continue until the user understands what is going on with their customers. That's what it's all about -- understanding. By incorporating interaction into the process, a user will be able to connect the data mining results with his or her customers.
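The two sample questions above boil down to small filter-and-summarize operations over scored records, which is why an interactive front end can answer them on demand. The following Python sketch assumes a hypothetical record layout (`gender`, `balance`, `response_score`, `default_score`) and arbitrary score thresholds; none of these come from a particular product.

```python
from collections import Counter
from statistics import mean
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]

# Hypothetical scored customer records produced by a data mining model.
scored = [
    {"gender": "F", "balance": 1200.0, "response_score": 0.81, "default_score": 0.10},
    {"gender": "M", "balance": 300.0, "response_score": 0.75, "default_score": 0.70},
    {"gender": "F", "balance": 2500.0, "response_score": 0.20, "default_score": 0.65},
    {"gender": "M", "balance": 800.0, "response_score": 0.90, "default_score": 0.05},
    {"gender": "F", "balance": 150.0, "response_score": 0.72, "default_score": 0.15},
]


def breakdown(records: List[Record], field: str, keep: Callable[[Record], bool]) -> Counter:
    """Count the values of `field` among the records that pass the filter."""
    return Counter(r[field] for r in records if keep(r))


def average(records: List[Record], field: str, keep: Callable[[Record], bool]) -> Optional[float]:
    """Mean of `field` among the records that pass the filter, or None if none pass."""
    values = [r[field] for r in records if keep(r)]
    return mean(values) if values else None


# "How do the most likely customers break down by gender?"
likely_by_gender = breakdown(scored, "gender", lambda r: r["response_score"] >= 0.7)

# "What is the average balance for the predicted defaulters?"
defaulter_balance = average(scored, "balance", lambda r: r["default_score"] >= 0.6)
```

Each new question is just a different field and a different filter, so the user can keep asking until the picture makes sense.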
It is also important to realize that the background of a user can significantly change the way that an interaction will take place. Direct mail marketing people think about and interact with their problems quite differently than credit card risk managers do (both in the terminology they use to discuss the same issues and in the representations that they are used to seeing). Instead of seeing this as a problem, data mining software developers should seize the opportunity to create applications that solve specific problems well. This can be done either by creating a series of problem-specific templates that sit on top of a more general technology (e.g., Unica's Model1 software) or by creating single-problem applications (e.g., HNC's Falcon credit card fraud detection software). In either case, the user will be able to interact with the system from a position with familiar landmarks.
My hope is that data mining application developers will begin spending more time on understandability and interaction instead of tweaking the internals of the algorithm. Those who do will find that their customers are happier and that the results generated by their software are put to good use.