Scoring Your Customers

Once a Data Mining Model Has Been Created, the Real Work Begins

by Kurt Thearling


1. Introduction

Once a model has been created by a data mining application, the model can then be used to make predictions for new data. The process of using the model is distinct from the process that creates the model. Typically, a model is used multiple times after it is created to score different databases. For example, consider a model that has been created to predict the probability that a customer will purchase something from a catalog if it is sent to them. The model would be built by using historical data from customers and prospects who were sent catalogs, as well as information about what they bought (if anything) from the catalogs. During the model-building process, the data mining application would use information about the existing customers to build and validate the model. In the end, the result is a model that would take details about the customer (or prospect) as inputs and generate a number between 0 and 1 as the output. This process is illustrated below.

After a model has been created based on historical data, it can then be applied to new data in order to make predictions about unseen behavior. This is what data mining (and more generally, predictive modeling) is all about. The process of using a model to make predictions about behavior that has yet to happen is called "scoring." The output of the model, the prediction, is called a score. Scores can take just about any form, from numbers to strings to entire data structures, but the most common scores are numbers (for example, the probability of responding to a particular promotional offer).
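
To make this concrete, here is a minimal sketch of what a scoring function might look like. The weights, intercept, and field names are hypothetical stand-ins for whatever a model-building step actually produced; the point is simply that a handful of customer attributes go in and a probability between 0 and 1 comes out.

    import math

    # Hypothetical coefficients that a model-building step might have
    # produced; real values would come from the data mining application.
    WEIGHTS = {"age": 0.02, "balance": 0.0004, "prior_purchases": 0.31}
    INTERCEPT = -2.5

    def score(customer):
        """Predicted probability (0 to 1) that this customer responds."""
        z = INTERCEPT + sum(w * customer[f] for f, w in WEIGHTS.items())
        return 1.0 / (1.0 + math.exp(-z))  # logistic function bounds the output

    print(score({"age": 42, "balance": 1800.0, "prior_purchases": 3}))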

Scoring is the unglamorous workhorse of data mining. It doesn't have the sexiness of a neural network or a genetic algorithm, but without it, data mining is pretty useless. (There are some data mining applications that cannot score the models that they produce -- this is akin to building a house and forgetting to put in any doors.) At the end of the day, when your data mining tools have given you a great predictive model, there's still a lot of work to be done. Scoring models against a database can be a time-consuming, error-prone activity, so the key is to make it part of a smoothly flowing process.

2. The Process

Scoring usually fits somewhere inside of a much larger process. In the case of one application of data mining, database marketing, it usually goes something like this:

1.     The process begins with a database containing information about customers or prospects. This database might be part of a much larger data warehouse or it might be a smaller marketing data mart.

2.     A marketing user identifies a segment of customers of interest in the customer database. A segment might be defined as "existing customers older than 65, with a balance greater than $1000 and no overdue payments in the last three months." The records representing this customer segment might be siphoned off into a separate database table, or they might be identified by a SQL query that selects the desired customers (steps 2 through 5 are sketched in code after this list).

3.     The selected group of customers is then scored by using a predictive model. The model might have been created several months ago (at the request of the marketing department) in order to predict the customer's likelihood of switching to a premium level of service. The score, a number between 0 and 1, represents the probability that the customer will indeed switch if they receive a brochure describing the new service in the mail. The scores are to be placed in a database table, with each record containing the customer ID and that customer's numerical score.

4.     After the scoring is complete, the customers then need to be sorted by their score value. The top 25% will be chosen to receive the premium service offer. A separate database table will be created containing the records for the top-scoring 25% of customers.

5.     After the customers with the top 25% of the scores are identified, the information necessary to send them the brochure (name and address) will need to be pulled out of the data warehouse and written to a tape.

6.     Finally, the tape will be shipped to a company (sometimes referred to as a "mail house"), where the actual mailing will occur.
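
Assuming, for illustration, that the segment fits in memory rather than a data warehouse, steps 2 through 5 might be sketched as follows. The field names are hypothetical, and score() is the scoring function sketched earlier.

    # Steps 2 through 5 of the process, in pure Python for illustration.
    # "customers" stands in for the marketing data mart.
    customers = [
        {"id": 1, "name": "A. Smith", "address": "1 Elm St", "age": 71,
         "balance": 2400.0, "prior_purchases": 5, "overdue_90d": 0},
        # ... one dict per customer record ...
    ]

    # Step 2: select the segment of interest.
    segment = [c for c in customers
               if c["age"] > 65 and c["balance"] > 1000 and c["overdue_90d"] == 0]

    # Step 3: score the selected customers.
    scored = [(c, score(c)) for c in segment]

    # Step 4: sort by score and keep the top 25%.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[: max(1, len(scored) // 4)]

    # Step 5: write the name-and-address file destined for the mail house.
    with open("mailing_list.csv", "w") as f:
        f.write("id,name,address,score\n")
        for c, s in top:
            f.write(f'{c["id"]},{c["name"]},"{c["address"]}",{s:.4f}\n')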

The marketing department typically determines when and where the marketing campaigns take place. In past years, this process might be scheduled to happen once every six months, with large numbers of customers being targeted every time the marketing campaign is executed. Current thinking is to move this process into a more continuous schedule, whereby small groups of customers are targeted on a weekly or even daily basis.

When marketing campaigns are infrequent, manual selection and scoring of the data is not a significant impediment to the process. There is usually significant lead time to allow the various parties to do their work before the actual mailing takes place. When someone in marketing needs to have a segment of customers selected for the campaign, they simply call someone in IT. When the scores are needed, the statistician who created the model is asked to apply the model to the customers in the desired segment. Because the processing is performed manually, the possibility of an error being introduced into the system is considerable.

When the frequency of the marketing campaigns is increased so that they occur on a daily or weekly basis, there are two significant impacts on the campaign. First, the decreased time between mailings means that there is much less room for error when carrying out the individual steps in the process. If a mistake is found, there is less time to correct it compared to the less frequent campaigns. Second, the sheer number of scoring "events" will increase dramatically, due to both the increased frequency of the campaigns and an increase in the number of segments that need to be scored.

If the marketing campaigns that rely on the scores are run on a continuous (daily) basis, this means a lot of phone calls between marketing and IT, as well as between marketing and the modelers. The best approach to solving this problem is to use campaign management software that is integrated with the scoring engine (see section 5 for a discussion of how this integrated software might work). If integrated software is not available, care will need to be taken to minimize these difficulties.

3. Scoring Architectures and Configurations

The software systems that are used to carry out the scoring process are usually simpler than the applications used to build the models. This is because the statistical functions and optimization procedures that were used to create the model are no longer needed; all that is required is a piece of software that can evaluate mathematical functions on a set of data inputs.


Scoring involves invoking a software application (often called the "scoring engine"), which then takes a model and a dataset and produces a set of scores for the records in the dataset. There are three common approaches to scoring engines:

1.     A scoring engine that is part of the data mining application that created the model.

2.     A separate scoring engine, possibly from another software vendor, that executes models exported in a format it understands.

3.     The model itself, exported as source code and compiled into an executable that acts as its own scoring engine.

The type of model generated will depend upon the data mining system that is used. Some data mining systems can produce multiple types of models, whereas others will generate only a single type.

In the first two cases, the scoring engine is a software application that needs to be run by the user. It might have a graphical user interface, or it might be a command-line program in which the user specifies the input parameters at the console when the program is run. There are usually three inputs to the scoring engine: the model that is to be run, the data that is to be scored, and the location where the output scores should be put.
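
As a sketch of those first two cases, the skeleton below shows a command-line scoring engine with exactly those three inputs. The file formats (a JSON model file, CSV data) are assumptions made for the example, not any vendor's actual format.

    import argparse, csv, json, math

    def main():
        p = argparse.ArgumentParser(description="Minimal scoring engine")
        p.add_argument("model")   # JSON file holding weights and intercept
        p.add_argument("data")    # CSV file of records to score
        p.add_argument("output")  # CSV file to receive id,score pairs
        args = p.parse_args()

        with open(args.model) as f:
            model = json.load(f)

        with open(args.data) as fin, open(args.output, "w", newline="") as fout:
            reader = csv.DictReader(fin)
            writer = csv.writer(fout)
            writer.writerow(["id", "score"])
            for row in reader:
                z = model["intercept"] + sum(
                    w * float(row[name]) for name, w in model["weights"].items())
                writer.writerow([row["id"], 1.0 / (1.0 + math.exp(-z))])

    if __name__ == "__main__":
        main()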

In some cases, a data mining system might generate a model that can be executed by another software vendor's scoring engine. Although there are currently no standards for the specification of a predictive model, some data mining vendors have decided to use the modeling formats created by established statistical software vendors. As of the writing of this book, at least two data mining software vendors have optional model output formats that are compatible with the modeling language supported by the SAS Institute's software. Models that are written out in the SAS modeling format can then be executed by the SAS Institute's scoring engine (known as SAS/Base). 

In the last type of scoring engine, the model acts as its own scoring engine. After the model is generated by the data mining software application, it will need to be compiled into an executable form. This step is usually done manually and often requires knowledge of system and programming level details (for example, linking ODBC database drivers). The primary reason to use a compiled model is to increase performance because a compiled model will usually run significantly faster than a model that requires a separate scoring engine.

There are obvious downsides to this approach, though. First is the fact that preparing a model for execution (compiling, linking, etc.) requires expertise that might not be available. Second, if the models change on a regular basis, they will need to be recompiled whenever they change. The use of compiled models can significantly increase the complexity of model management, especially if there are large numbers of models in use and/or the models change on a frequent basis.
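
For a feel of the third case, the fragment below suggests what exported, self-contained scoring code might look like: the coefficients are frozen into the source at export time, so there is no model file to interpret at run time. Vendors typically emit C or SAS code for this purpose; Python and the coefficients shown are used here only to stay consistent with the earlier sketches.

    import math

    # Exported model: coefficients baked in at export time, so this file
    # is both the model and its own scoring engine.
    def score(age, balance, prior_purchases):
        z = -2.5 + 0.02 * age + 0.0004 * balance + 0.31 * prior_purchases
        return 1.0 / (1.0 + math.exp(-z))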

4. Preparing the Data

Before you can score a model, you need to prepare the data on which the model is going to operate. Key to this process is the concept of consistency. The customers that are to be scored by the model should be consistent with the customer data that was used to build the model. For example, if a model was built using response data from low balance customers aged 40 to 50, it should not be used on customers aged 50 to 60.

A second type of consistency involves the type of interaction that will take place with the customer or prospect. The interaction needs to be consistent with the original data, or else the results might not be correct. The historical data that was used to build the model had a context that needs to be considered. The color of the envelope, the wording used in the offer, the type of offer, and other variables will affect the results of the interaction. If your model was built from historical response data for a mailing that used a blue envelope, the results that you see when you send out a new offer in a green envelope could be different from what the model predicts. Care must be taken so that any assumptions, from both the marketing and modeling sides of the fence, are not lost when a model is put into use. Documentation (possibly part of a corporate knowledge base) should be maintained that describes customer segments, as well as the types of offers that are made to those customers and prospects.

After you are sure that the data is consistent with the historical customer data and interaction details, you need to map the individual columns (the variables) in your data set to the inputs of the model. The data that is to be scored using an existing predictive model needs to "match" the data that was used to build the model. Matching means that all of the data fields that were used as inputs to the model need to be made available to the model during the scoring process. Note that not all fields that were used to build the model are necessary when scoring it. It is likely that many of the available fields were not used as inputs because the data mining process determined that they did not provide any predictive information. Only the fields that were actually used in the model need to be included. This can improve performance because not all of the data needs to be passed to the scoring engine.
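
One way to enforce this, assuming hypothetical field names, is to project each record down to the model's input list before it is handed to the scoring engine:

    # Only the fields the model actually uses are pulled from each record;
    # the rest of the row never reaches the scoring engine.
    MODEL_INPUTS = ["age", "balance", "prior_purchases"]  # hypothetical list

    def project(record):
        return {name: record[name] for name in MODEL_INPUTS}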

When mapping the data in the database to the inputs of the model, there are two types of mapping that can take place: direct and offset.

4.1 Direct Mapping

In a direct mapping approach, a model input variable is mapped to the variable of the same name in the data being scored. For example, if the variable "Account Type" were an input to the model, it would simply map to the "Account Type" column in the database. This approach is best used for input variables that are not part of a time series.

4.2 Offset Mapping

In offset mapping, the variables that were used as inputs to the model are mapped to variables that are different from those used to build the model. This is often the case when input variables are part of a time series. For example, if a model was built using data from January, there might be inputs that are specific to that month (for example, "Outstanding_Balance_Jan"). When this model is applied to data after January, the inputs will need to be offset to match the time period for which the predictions are being made. When the model is applied to February data, the input should be mapped to "Outstanding_Balance_Feb." The easiest approach, if the data is in a database, is to use a database view to redirect the inputs to the appropriate table and column. The view would be updated whenever new monthly data became available so that it pointed to the latest outstanding balance.

In the real world, the scoring process would probably use a combination of both direct and offset mappings.
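
A simple way to express such a combined mapping is a table from model input names to database column names, with the offset entries recomputed as new monthly data arrives. The column names here are hypothetical:

    from datetime import date

    def current_mapping(today=None):
        """Map model input names to database column names."""
        month = (today or date.today()).strftime("%b")  # e.g. "Feb"
        return {
            "Account_Type": "Account_Type",                         # direct
            "Age": "Age",                                           # direct
            "Outstanding_Balance": f"Outstanding_Balance_{month}",  # offset
        }

    def model_inputs(record, mapping):
        return {model_name: record[column]
                for model_name, column in mapping.items()}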

The last step in preparing the data, if necessary, is to transform the inputs to conform to any requirements specific to the model. For example, an account type in the database that is represented as a string (for example, "checking" or "savings") might need to be transformed into a number before it can be fed to the model. The form of the transformation is usually specific to the model type and should be specified by the person who created the model. Although this functionality should be incorporated into the model itself by the data mining system, some applications require the user to perform the transformations manually.
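
A hand-written version of such a transformation might look like the following. The numeric codes are hypothetical and must match whatever encoding was used when the model was built:

    # Convert the categorical account type to the numeric code the model
    # expects. The codes must match the encoding used at model-build time.
    ACCOUNT_TYPE_CODES = {"checking": 0, "savings": 1}

    def transform(record):
        out = dict(record)
        out["Account_Type"] = ACCOUNT_TYPE_CODES[record["Account_Type"].lower()]
        return out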

5. Integrating Scoring with Other Applications

Scoring isn't something that takes place in a vacuum. After a model has been produced, other applications need to know that it exists and make use of the scores that it generates. Tight integration of data mining applications with other software systems is relatively new, but it is a trend that will continue for some time. Some of the software categories that are likely to embrace integration with data mining applications include enterprise resource planning (ERP), customer relationship management (CRM), and tools such as online analytical processing (OLAP) and data visualization.

As an example, consider how a data mining system might be integrated with a marketing campaign management system. Marketing managers are interested in using the output of a data mining model in order to further refine the customer segments that they have specified. The simplest example might involve segregating a group of customers into separate yes/no categories. The customers that fall into the "yes" category will end up receiving a marketing offer, whereas the other group will not receive the offer. The marketing department will use a campaign management software system to manage the selection of the customers and the segments they fall into.

The closer the data mining and campaign management software work together, the better the business results. In the past, the use of a model within campaign management was often a manual, time-intensive process. When someone in marketing wanted to run a campaign that used model scores, he or she usually called someone in the modeling group to get a file containing the database scores. With the file in hand, the marketer would then solicit the help of someone in the information technology group to merge the scores with the marketing database.

Integration is crucial in two areas:

5.1 Creating the Model

In the case of data mining for a marketing campaign, an analyst or user with a background in modeling creates a predictive model using the data mining application. This modeling is usually completely separate from the process of creating the marketing campaign. The complexity of the model creation typically depends on many factors, including database size, the number of variables known about each customer, the kind of data mining algorithms used, and the modeler's experience.

Interaction with the campaign management software begins when a model of sufficient quality has been found. At this point, the data mining user exports his or her model to the campaign management application, which can be as simple as dragging and dropping the model from one application to the other. This process of exporting a model tells the campaign management software that the model exists and is available for later use.

5.2 Dynamically Scoring the Data

Dynamic scoring is a type of software integration that allows the scoring process to be invoked by another software application that will use the scores for some other purpose. In our database marketing example, the campaign management system interfaces with the scoring engine so that the scores are generated exactly when the campaign manager needs them. Further, only the required records are scored, because the campaign management system determines when and what to score. Dynamic scoring avoids mundane, repetitive manual chores and eliminates the need to score an entire database. Instead, it scores only the relevant record subsets, and only when needed. Scoring only the relevant customer subset and eliminating the manual process shrinks the overall processing time significantly. Moreover, scoring record segments only when needed assures "fresh," up-to-date results.
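
Stripped to its essentials, the interface might look like the sketch below: the campaign management system passes in the segment definition, and only those records are scored, at the moment they are needed. The table, column, and function names are assumptions for illustration.

    import sqlite3

    def dynamic_score(conn, segment_sql, score_fn):
        """Score only the records in the requested segment, on demand."""
        conn.row_factory = sqlite3.Row
        rows = conn.execute(segment_sql).fetchall()  # only the relevant subset
        return {row["id"]: score_fn(dict(row)) for row in rows}

    # Called by the campaign management system at execution time, so the
    # scores are fresh rather than read from a weeks-old score table.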

After a model is in the campaign management system, a user (usually someone other than the person who created the model) can start to build marketing campaigns using the predictive models. Models are invoked by the campaign management system.

When a marketing campaign invokes a specific predictive model to perform dynamic scoring, the output is usually stored as a temporary "score" table. When the score table is available in the data warehouse, the data mining engine notifies the campaign management system, and the marketing campaign execution continues. 



