An Overview of Data Mining Techniques

Excerpted from the book Building Data Mining Applications for CRM
by Alex Berson, Stephen Smith, and Kurt Thearling


Introduction

This overview provides a description of some of the most common data mining algorithms in use today.   We have broken the discussion into two sections, each with a specific theme:

Each section will describe a number of data mining algorithms at a high level, focusing on the "big picture" so that the reader will be able to understand how each algorithm fits into the landscape of data mining techniques.   Overall, six broad classes of data mining algorithms are covered.  Although there are a number of other algorithms and many variations of the techniques described, one of the algorithms from this group of six is almost always used in real world deployments of data mining systems.

I. Classical Techniques: Statistics, Neighborhoods and Clustering

1.1. The Classics

These two sections have been broken up based on when the data mining technique was developed and when it became technically mature enough to be used for business, especially for aiding in the optimization of customer relationship management systems.  Thus this section contains descriptions of techniques that have classically been used for decades the next section represents techniques that have only been widely used since the early 1980s.

This section should help the user to understand the rough differences in the techniques and at least enough information to be dangerous and well armed enough to not be baffled by the vendors of  different data mining tools.

The main techniques that we will discuss here are the ones that are used 99.9% of the time on existing business problems.  There are certainly many other ones as well as proprietary techniques from particular vendors - but in general the industry is converging to those techniques that work consistently and are understandable and explainable.

1.2. Statistics

By strict definition "statistics" or statistical techniques are not data mining.  They were being used long before the term data mining was coined to apply to business applications.  However, statistical techniques are driven by the data and are used to discover patterns and build predictive models.  And from the users perspective you will be faced with a conscious choice when solving a "data mining" problem as to whether you wish to attack it with statistical methods or other data mining techniques.  For this reason it is important to have some idea of how statistical techniques work and how they can be applied.

What is different between statistics and data mining?

I flew the Boston to Newark shuttle recently and sat next to a professor from one the Boston area Universities.  He was going to discuss the  drosophila (fruit flies) genetic makeup to a pharmaceutical company in New Jersey.  He had  compiled the world's largest database on the genetic makeup of the fruit fly and had made it available to other researchers on the internet through Java applications accessing a larger relational database.

He explained to me that they not only now were storing the information on the flies but also were doing "data mining" adding as an aside "which seems to be very important these days whatever that is".  I mentioned that I had written a book on the subject and he was interested in knowing what the difference was between "data mining" and statistics.  There was no easy answer. 

The techniques used in data mining, when successful, are successful for precisely the same reasons that statistical techniques are successful (e.g. clean data, a well defined target to predict and good validation to avoid overfitting).  And for the most part the techniques are used in the same places for the same types of problems (prediction, classification discovery).  In fact some of the techniques that are classical defined as "data mining" such as CART and CHAID arose from statisticians.

So what is the difference?  Why aren't we as excited about "statistics" as we are about data mining?  There are several reasons.  The first is that the classical data mining techniques such as CART, neural networks and nearest neighbor techniques tend to be more robust to both messier real world data and also more robust to being used by less expert users.  But that is not the only reason.  The other reason is that the time is right.  Because of the use of computers for closed loop business data storage and generation there now exists large quantities of data that is available to users.  IF there were no data - there would be no interest in mining it.  Likewise the fact that computer hardware has dramatically upped the ante by several orders of magnitude in storing and processing the data makes some of the most powerful data mining techniques feasible today.

The bottom line though, from an academic standpoint at least, is that there is little practical difference between a statistical technique and a classical data mining technique.  Hence we have included a description of some of the most useful in this section.

What is statistics?

Statistics is a branch of mathematics concerning the collection and the description of data.  Usually statistics is considered to be one of those scary topics in college right up there with chemistry and physics.  However, statistics is probably a much friendlier branch of mathematics because it really can be used every day.  Statistics was in fact born from very humble beginnings of real world problems from business, biology, and gambling!

 Knowing statistics in your everyday life will help the average business person make better decisions by allowing them to figure out risk and uncertainty when all the facts either aren’t known or can’t be collected.  Even with all the data stored in the largest of data warehouses business decisions still just become more informed guesses.  The more and better the data and the better the understanding of statistics the better the decision that can be made.

Statistics has been around for a long time easily a century and arguably many centuries when the ideas of probability began to gel.  It could even be argued that the data collected by the ancient Egyptians, Babylonians, and Greeks were all statistics long before the field was officially recognized.  Today data mining has been defined independently of statistics though “mining data” for patterns and predictions is really what statistics is all about.  Some of the techniques that are classified under data mining such as CHAID and CART really grew out of the statistical profession more than anywhere else, and the basic ideas of probability, independence and causality and overfitting are the foundation on which both data mining and statistics are built.

Data, counting and probability

One thing that is always true about statistics is that  there is always data involved,  and usually enough data so that the average person cannot keep track of all the data in their heads.   This is certainly more true today than it was when the basic ideas of probability and statistics were being formulated and refined early this century.  Today people have to deal with up to terabytes of data and have to make sense of it and glean the important patterns from it.  Statistics can help greatly in this process by helping to answer several important questions about your data:

Certainly statistics can do more than answer these questions but for most people today these are the questions that statistics can help answer.  Consider for example that a large part of statistics is concerned with summarizing data, and more often than not, this summarization has to do with counting.   One of the great values of statistics is in presenting a high level view of the database that provides some useful information without requiring every record to be understood in detail.  This aspect of statistics is the part that people run into every day when they read the daily newspaper and see, for example, a pie chart reporting the number of US citizens of different eye colors, or the average number of annual doctor visits for people of different ages.   Statistics at this level is used in the reporting of important information from which people may be able to make useful decisions.   There are many different parts of statistics but the idea of collecting data and counting it is often at the base of even these more sophisticated techniques.  The first step then in understanding statistics is to understand how the data is collected into a  higher level form - one of the most notable ways of doing this is with the histogram.

Histograms

One of the best ways to summarize data is to provide a histogram of the data.  In the simple example database shown in Table 1.1 we can create a histogram of eye color by counting the number of occurrences of different colors of eyes in our database.  For this example database of 10 records this is fairly easy to do and the results are only slightly more interesting than the database itself.   However, for a database of many more records this is a very useful way of getting a high level understanding of the database.

ID

Name

Prediction

Age

Balance

Income

Eyes

Gender

1

Amy

No

62

$0

Medium

Brown

F

2

Al

No

53

$1,800

Medium

Green

M

3

Betty

No

47

$16,543

High

Brown

F

4

Bob

Yes

32

$45

Medium

Green

M

5

Carla

Yes

21

$2,300

High

Blue

F

6

Carl

No

27

$5,400

High

Brown

M

7

Donna

Yes

50

$165

Low

Blue

F

8

Don

Yes

46

$0

High

Blue

M

9

Edna

Yes

27

$500

Low

Blue

F

10

Ed

No

68

$1,200

Low

Blue

M

Table 1.1 An Example Database of Customers with Different Predictor Types

This histogram shown in figure 1.1 depicts a simple predictor (eye color) which will have only a few different values no matter if there are 100 customer records in the database or 100 million.  There are, however, other predictors that have many more distinct values and can create a much more complex histogram.  Consider, for instance, the histogram of ages of the customers in the population.  In this case the histogram can be more complex but can also be enlightening.  Consider if you found that the histogram of your customer data looked as it does in figure 1.2.

Figure 1.1 This histogram shows the number of customers with various eye colors.  This summary can quickly show important information about the database such as that blue eyes are the most frequent.

Figure 1.2  This histogram shows the number of customers of different ages and quickly tells the viewer that the majority of customers are over the age of 50.

By looking at this second histogram the viewer is in many ways looking at all of the data in the database for a particular predictor or data column.  By looking at this histogram it is also possible to build an intuition about other important factors.  Such as the average age of the population, the maximum and minimum age.  All of which are important.  These values are called summary statistics.  Some of the most frequently used summary statistics include:

When there are many values for a given predictor the histogram begins to look smoother and smoother (compare the difference between the two histograms above).  Sometimes the shape of the distribution of data can be calculated by an equation rather than just represented by the histogram.  This is what is called a data distribution.  Like a histogram a data distribution can be described by a variety of statistics.  In classical statistics the belief is that there is some “true” underlying shape to the data distribution that would be formed if all possible data was collected.  The shape of the data distribution can be calculated for some simple examples. The statistician’s job then is to take the limited data that may have been collected and from that make their best guess at what the “true” or at least most likely underlying data distribution might be.

Many data distributions are well described by just two numbers, the mean and the variance.  The mean is something most people are familiar with, the variance, however, can be problematic.  The easiest way to think about it is that it measures the average distance of each predictor value from the mean value over all the records in the database.  If the variance is high it implies that the values are all over the place and very different.  If the variance is low most of the data values are fairly close to the mean.  To be precise the actual definition of the variance uses the square of the distance rather than the actual distance from the mean and the average is taken by dividing the squared sum by one less than the total number of records.  In terms of prediction a user could make some guess at the value of a predictor without knowing anything else just by knowing the mean and also gain some basic sense of how variable the guess might be based on the variance.

Statistics for Prediction

In this book the term “prediction” is used for a variety of types of analysis that may elsewhere be more precisely called regression.  We have done so in order to simplify some of the concepts and to emphasize the common and most important aspects of predictive modeling.  Nonetheless regression is a powerful and commonly used tool in statistics and it will be discussed here.

Linear regression

In statistics prediction is usually synonymous with regression of some form.   There are a variety of different types of regression in statistics but the basic idea is that a model is created that maps values from predictors in such a way that the lowest error occurs in making a prediction.  The simplest form of regression is simple linear regression that just contains one predictor and a prediction.  The relationship between the two can be mapped on a two dimensional space and the records plotted for the prediction values along the Y axis and the predictor values along the X axis.  The simple linear regression model then could be viewed as the line that minimized the error rate between the actual prediction value and the point on the line (the prediction from the model).  Graphically this would look as it does in Figure 1.3. The simplest form of regression seeks to build a predictive model that is a line that maps between each predictor value to a prediction value.  Of the many possible lines that could be drawn through the data the one that minimizes the distance between the line and the data points is the one that is chosen for the predictive model.

On average if you guess the value on the line it should represent an acceptable compromise amongst all the data at that point giving conflicting answers.  Likewise if there is no data available for a particular input value the line will provide the best guess at a reasonable answer based on similar data.

Figure 1.3 Linear regression is similar to the task of finding the line that minimizes the total distance to a set of data.

The predictive model is the line shown in Figure 1.3.  The line will take a given value for a predictor and map it into a given value for a prediction.  The actual equation would look something like: Prediction = a + b * Predictor.  Which is just the equation for a line Y = a + bX.  As an example for a bank the predicted average consumer bank balance might equal $1,000 + 0.01 * customer’s annual income.  The trick, as always with predictive modeling, is to find the model that best minimizes the error. The most common way to calculate the error is the square of the difference between the predicted value and the actual value.  Calculated this way points that are very far from the line will have a great effect on moving the choice of line towards themselves in order to reduce the error.  The values of a and b in the regression equation that minimize this error can be calculated directly from the data relatively quickly.

What if the pattern in my data doesn't look like a straight line?

Regression can become more complicated than the simple linear regression we’ve introduced so far.  It can get more complicated in a variety of different ways in order to better model particular database problems.   There are, however, three main modifications that can be made:

1.      More predictors than just one can be used.

2.      Transformations can be applied to the predictors.

3.      Predictors can be multiplied together and used as terms in the equation.

4.      Modifications can be made to accommodate response predictions that just have yes/no or 0/1 values.

Adding more predictors to the linear equation can produce more complicated lines that take more information into account and hence make a better prediction.  This is called multiple linear regression and might have an equation like the following if 5 predictors were used (X1, X2, X3, X4, X5):

Y = a + b1(X1) + b2(X2) + b3(X3) + b4(X4) + b5(X5)

This equation still describes a line but it is now a line in a6 dimensional space rather than the two dimensional space.

By transforming the predictors by squaring, cubing or taking their square root it is possible to use the same general regression methodology and now create much more complex models that are no longer simple shaped like lines.  This is called non-linear regression.  A model of just one predictor might look like this: Y = a + b1(X1)  + b2 (X12).  In many real world cases analysts will perform a wide variety of transformations on their data just to try them out.  If they do not contribute to a useful model their coefficients in the equation will tend toward zero and then they can be removed.  The other transformation of predictor values that is often performed is multiplying them together.  For instance a new predictor created by dividing hourly wage by the minimum wage might be a much more effective predictor than hourly wage by itself.

When trying to predict a customer response that is just yes or no (e.g. they bought the product or they didn’t or they defaulted or they didn’t) the standard form of a line doesn’t work.  Since there are only two possible values to be predicted it is relatively easy to fit a line through them.  However, that model would be the same no matter what predictors were being used or what particular data was being used.  Typically in these situations a transformation of the prediction values is made in order to provide a better predictive model.  This type of regression is called logistic regression and because so many business problems are response problems, logistic regression is one of the most widely used statistical techniques for creating predictive models.

1.3. Nearest Neighbor

Clustering and the Nearest Neighbor prediction technique are among the oldest techniques used in data mining.  Most people have an intuition that they understand what clustering is - namely that like records are grouped or clustered together.  Nearest neighbor is a prediction technique that is quite similar to clustering - its essence is that in order to predict what a prediction value is in one record look for records with similar predictor values in the historical database and use the prediction value from the record that it “nearest” to the unclassified record.

A simple example of clustering

A simple example of clustering would be the clustering that most people perform when they do the laundry - grouping the permanent press, dry cleaning, whites and brightly colored clothes is important because they have similar characteristics.  And it turns out they have important attributes in common about the way they behave (and can be ruined) in the wash.  To “cluster” your laundry most of your decisions are relatively straightforward.  There are of course difficult decisions to be made about which cluster your white shirt with red stripes goes into (since it is mostly white but has some color and is permanent press).  When clustering is used in business the clusters are often much more dynamic - even changing weekly to monthly and many more of the decisions concerning which cluster a record falls into can be difficult.

A simple example of nearest neighbor

A simple example of the nearest neighbor prediction algorithm is that if you look at the people in your neighborhood (in this case those people that are in fact geographically near to you).  You may notice that, in general, you all have somewhat similar incomes.  Thus if your neighbor has an income greater than $100,000 chances are good that you too have a high income.  Certainly the chances that you have a high income are greater when all of your neighbors have incomes over $100,000 than if all of your neighbors have incomes of $20,000.  Within your neighborhood there may still be a wide variety of incomes possible among even your “closest”  neighbors but if you had to predict someone’s income based on only knowing their neighbors you’re best chance of being right would be to predict the incomes of the neighbors who live closest to the unknown person. 

The nearest neighbor prediction algorithm works in very much the same way except that “nearness” in a database may consist of a variety of factors not just where the person lives.   It may, for instance, be far more important to know which school someone attended and what degree they attained when predicting income.  The better definition of “near” might in fact be other people that you graduated from college with rather than the people that you live next to.

Nearest Neighbor techniques are among the easiest to use and understand because they work in a way similar to the way that people think - by detecting closely matching examples.  They also perform quite well in terms of automation, as many of the algorithms are robust with respect to dirty data and missing data.  Lastly they are particularly adept at performing complex ROI calculations because the predictions are made at a local level where business simulations could be performed in order to optimize ROI.   As they enjoy similar levels of accuracy compared to other techniques the measures of accuracy such as lift are as good as from any other.

How to use Nearest Neighbor for Prediction

One of the essential elements underlying the concept of clustering is that one particular object (whether they be cars, food or customers) can be closer to another object than can some third object.  It is interesting that most people have an innate sense of ordering placed on a variety of different objects.  Most people would agree that an apple is closer to an orange than it is to a tomato and that a Toyota Corolla is closer to a Honda Civic than to a Porsche.  This sense of ordering on many different objects helps us place  them in time and space and to make sense of the world.  It is what allows us to build clusters - both in databases on computers as well as in our daily lives.  This definition of nearness that seems to be ubiquitous also allows us to make predictions.

The nearest neighbor prediction algorithm simply stated is:

Objects that are “near” to each other will have similar prediction values as well.  Thus if you know the prediction value of one of the objects you can predict it for it’s nearest neighbors.

Where has the nearest neighbor technique been used in business?

One of the classical places that nearest neighbor has been used for prediction has been in text retrieval.  The problem to be solved in text retrieval is one where the end user defines a document (e.g. Wall Street Journal article, technical conference paper etc.) that is interesting to them and they solicit the system to “find more documents like this one”.  Effectively defining a target of: “this is the interesting document” or “this is not interesting”.  The prediction problem is that only a very few of the documents in the database actually have values for this prediction field (namely only the documents that the reader has had a chance to look at so far).  The nearest neighbor technique is used to find other documents that share important characteristics with those documents that have been marked as interesting.

Using nearest neighbor for stock market data

As with almost all prediction algorithms, nearest neighbor can be used in  a variety of places.  Its successful use is mostly dependent on the pre-formatting of the data so that nearness can be calculated and where individual records can be defined.  In the text retrieval example this was not too difficult - the objects were documents. This is not always as easy as it is for text retrieval. Consider what it might be like in a time series problem - say for predicting the stock market.  In this case the input data is just a long series of stock prices over time without any particular record that could be considered to be an object.   The value to be predicted is just the next value of the stock price.

The way that this problem is solved for both nearest neighbor techniques and for some other types of prediction algorithms is to create training records by taking, for instance, 10 consecutive stock prices and using the first 9 as predictor values and the 10th as the prediction value.  Doing things this way, if you had 100 data points in your time series you could create 10 different training records. 

You could create even more training records than 10 by creating a  new record starting at every data point.  For instance in the you could take the first 10 data points and create a record.  Then you could take the 10 consecutive data points starting at the second data point, then the 10 consecutive data point starting at the third data point.  Even though  some of the data points would overlap from one record to the next the prediction value would always be different.  In our example of 100 initial data points 90 different training records could be created this way as opposed to the 10 training records created via the other method.

Why voting is better - K Nearest Neighbors

One of the improvements that is usually made to the basic nearest neighbor algorithm is to take a vote from the “K” nearest neighbors rather than just relying on the sole nearest neighbor to the unclassified record.  In Figure 1.4 we can see that unclassified example C has a nearest neighbor that is a defaulter and yet is surrounded almost exclusively by records that are good credit risks.  In this case the nearest neighbor to record C is probably an outlier - which may be incorrect data or some non-repeatable idiosyncrasy.  In either case it is more than likely that C is a non-defaulter yet would be predicted to be a defaulter if the sole nearest neighbor were used for the prediction.

Figure 1.4  The nearest neighbors are shown graphically for three unclassified records: A, B, and C.

In cases like these a vote of the 9 or 15 nearest neighbors would provide a better prediction accuracy for the system than would just the single nearest neighbor.  Usually this is accomplished by simply taking the majority or plurality of predictions from the K nearest neighbors if the prediction column is a binary or categorical or taking the average value of the prediction column from the K nearest neighbors.

How can the nearest neighbor tell you how confident it is in the prediction?

Another important aspect of any system that is used to make predictions is that the user be provided with, not only the prediction, but also some sense of the confidence in that prediction (e.g. the prediction is defaulter with the chance of being correct 60% of the time).  The nearest neighbor algorithm provides this confidence information in a number of ways:

The distance to the nearest neighbor provides a level of confidence.  If the neighbor is very close or an exact match then there is much higher confidence in the prediction than if the nearest record is a great distance from the unclassified record.

The degree of homogeneity amongst the predictions within the K nearest neighbors can also be used.  If all the nearest neighbors make the same prediction then there is much higher confidence in the prediction than if half the records made one prediction and the other half made another prediction.

1.4. Clustering

Clustering for Clarity

Clustering is the method by which like records are grouped together.  Usually this is done to give the end user a high level view of what is going on in the database.  Clustering is sometimes used to mean segmentation - which most marketing people will tell you is useful for coming up with a birds eye view of the business.  Two of these clustering systems are the PRIZM™ system from Claritas corporation and MicroVision™ from Equifax corporation.  These companies have grouped the population by demographic information into segments that they believe are useful for direct marketing and sales.  To build these groupings they use information such as income, age, occupation, housing and race collect in the US Census.  Then they assign memorable “nicknames” to the clusters.  Some examples are shown in Table 1.2.

Name

Income

Age

Education       

Vendor

Blue Blood Estates

Wealthy

35-54

College

Claritas Prizm™

Shotguns and Pickups

Middle

35-64

High School

Claritas Prizm™

Southside City

Poor

Mix

Grade School

Claritas Prizm™

Living Off the Land

Middle-Poor

School Age Families

Low

Equifax MicroVision™

University USA

Very low

Young - Mix

Medium to High

Equifax MicroVision™

Sunset Years

Medium

Seniors

Medium

Equifax MicroVision™

Table 1.2 Some Commercially Available Cluster Tags

This clustering information is then used by the end user to tag the customers in their database.  Once this is done the business user can get a quick high level view of what is happening within the cluster. Once the business user has worked with these codes for some time they also begin to build intuitions about how these different customers clusters will react to the marketing offers particular to their business.   For instance some of these clusters may relate to their business and some of them may not.  But given that their competition may well be using these same clusters to structure their business and marketing offers it is important to be aware of how you customer base behaves in regard to these clusters.

Finding the ones that don't fit in - Clustering for Outliers

Sometimes clustering is performed not so much to keep records together as to make it easier to see when one record sticks out from the rest.  For instance:

Most wine distributors selling inexpensive wine in Missouri and that ship a certain volume of product produce a certain level of profit.  There is a cluster of stores that can be formed with these characteristics.  One store stands out, however, as producing significantly lower profit.   On closer examination it turns out that the distributor was delivering product to but not collecting payment from one of their customers.

A sale on men’s suits is being held in all branches of a department store for southern California .   All stores with these characteristics  have seen at least a 100% jump in revenue since the start of the sale except one.  It turns out that this store had, unlike the others,  advertised via radio rather than television.

How is clustering like the nearest neighbor technique?

The nearest neighbor algorithm is basically a refinement of clustering in the sense that they both use distance in some feature space to create either structure in the data or predictions.  The nearest neighbor algorithm is a refinement since part of the algorithm usually is a way of automatically determining the weighting of the importance of the predictors and how the distance will be measured within the feature space.  Clustering is one special case of this where the importance of each predictor is considered to be equivalent.

How to put clustering and nearest neighbor to work for prediction

To see clustering and nearest neighbor prediction in use let’s go back to our example database and now look at it in two ways.  First let’s try to create our own clusters - which if useful we could use internally to help to simplify and clarify large quantities of data (and maybe if we did a very good job sell these new codes to other business users).  Secondly let’s try to create predictions based on the nearest neighbor.

First take a look at the data.  How would you cluster the data in Table 1.3?

ID

Name

Prediction

Age

Balance

Income

Eyes

Gender

1

Amy

No

62

$0

Medium

Brown

F

2

Al

No

53

$1,800

Medium

Green

M

3

Betty

No

47

$16,543

High

Brown

F

4

Bob

Yes

32

$45

Medium

Green

M

5

Carla

Yes

21

$2,300

High

Blue

F

6

Carl

No

27

$5,400

High

Brown

M

7

Donna

Yes

50

$165

Low

Blue

F

8

Don

Yes

46

$0

High

Blue

M

9

Edna

Yes

27

$500

Low

Blue

F

10

Ed

No

68

$1,200

Low

Blue

M

Table 1.3 A Simple Example Database

If these were your friends rather than your customers (hopefully they could be both) and they were single, you might cluster them based on their compatibility with each other.  Creating your own mini dating service.  If you were a pragmatic person you might cluster your database as follows because you think that marital happiness is mostly dependent on financial compatibility and create three clusters as shown in Table 1.4.

 

ID

Name

Prediction

Age

Balance

Income

Eyes

Gender

 

3

Betty

No

47

$16,543

High

Brown

F

5

Carla

Yes

21

$2,300

High

Blue

F

6

Carl

No

27

$5,400

High

Brown

M

8

Don

Yes

46

$0

High

Blue

M

 

1

Amy

No

62