
Classification of Imbalanced Marketing Data with Balanced Random Sets


JMLR: Workshop and Conference Proceedings 7:89-100, KDD Cup 2009

Vladimir Nikulin, Department of Mathematics, University of Queensland, Australia

Geoffrey J. McLachlan, Department of Mathematics, University of Queensland, Australia

Editors: Gideon Dror, Marc Boullé, Isabelle Guyon, Vincent Lemaire, David Vogel

Abstract

With imbalanced data, a classifier built using all of the data has the tendency to ignore the minority class. To overcome this problem, we propose to use an ensemble classifier constructed on the basis of a large number of relatively small and balanced subsets, where representatives from both patterns are selected randomly. As an outcome, the system produces a matrix of linear regression coefficients whose rows represent the random subsets and whose columns represent the features. Based on this matrix, we make an assessment of how stable the influence of a particular feature is. It is proposed to keep in the model only features with stable influence. The final model represents an average of the base-learners, which is not necessarily a linear regression. Proper data pre-processing is very important for the effectiveness of the whole system, and it is proposed to reduce the original data to the most simple binary sparse format, which is particularly convenient for the construction of decision trees. As a result, any particular feature will be represented by several binary variables or bins, which are absolutely equivalent in terms of data structure. This property is very important and may be used for feature selection. The proposed method exploits not only the contributions of particular variables to the base-learners, but also the diversity of such contributions. Test results against the KDD-2009 competition datasets are presented.

Keywords: ensemble classifier, gradient-based optimisation, boosting, random forests, decision trees, matrix factorisation

1. Introduction

Ensemble (including voting and averaged) classifiers are learning algorithms that construct a set of many individual classifiers (called base-learners) and combine them to classify test data points by sample average. It is now well known that ensembles are often much more accurate than the base-learners that make them up (Biau et al., 2007). The tree ensemble called "random forests" (RF) was introduced in (Breiman, 2001) and represents an example of a successful classifier. In another example, the bagged version of the support vector machine (SVM) (Zhang et al., 2007) is very important because direct application of the SVM to the whole data set may not be possible. In the case of the SVM, the dimension of the kernel matrix is equal to the sample size, which thus needs to be limited.

Our approach was motivated by (Breiman, 1996), and represents a compromise between two major considerations. On the one hand, we would like to deal with balanced data. On the other hand, we wish to exploit all available information. We consider a large number n of balanced subsets of the available data, where any single subset includes two parts: (a) nearly all 'positive' instances (minority) and (b) randomly selected 'negative' instances. The method of balanced random sets (RS) is general and may be used in conjunction with different base-learners.
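As a rough illustration, a minimal sketch of this subset construction is given below; the default values (m = 100 subsets, about 90% of the positive cases, one negative per positive) follow the settings reported later in Section 4.2, while the function name and interface are hypothetical.

```python
import numpy as np

def balanced_random_sets(y, m=100, pos_fraction=0.9, neg_per_pos=1, seed=0):
    """Draw m balanced subsets: most of the minority ('positive') class plus an
    equally sized random sample of the majority ('negative') class."""
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == -1)[0]
    subsets = []
    for _ in range(m):
        p = rng.choice(pos, size=int(pos_fraction * len(pos)), replace=False)
        q = rng.choice(neg, size=neg_per_pos * len(p), replace=False)
        subsets.append(np.concatenate([p, q]))
    return subsets  # lists of row indices, one per base-learner
```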

Regularised linear regression (RLR) represents the simplest example of a decision function. Combined with a quadratic loss function, it has an essential advantage: using a gradient-based search procedure, we can optimise the value of the step size. Consequently, we observe a rapid decline in the target function.

By definition, regression coefficients may be regarded as natural measurements of influence of the corresponding features. In our case we have n vectors of regression coefficients, and we can use them to investigate the stability of the particular coefficients.

Proper feature selection (FS) may reduce overfitting significantly. We remove features with unstable coefficients, and recompute the classifiers. Note that the stability of the coefficients may be measured using different methods. For example, we can apply the t-statistic given by the ratio of the mean to the standard deviation (Nikulin, 2008).

Matrix factorisation, an unsupervised learning method, is widely used to study the structure of the data when no specific response variable is specified. In principle, it would be better to describe the data in terms of a small number of meta-features, derived as a result of matrix factorisation, which could reduce noise while still capturing the essential features of the data. In addition, latent factor models are generally effective at estimating overall structure that relates simultaneously to most or all items.

Note that the methods for non-negative matrix factorisation (NMF) which were introduced in (Lee and Seung, 2001) are valid only under the condition that all the elements of all input and output matrices are non-negative. In Section 3.4 we formulate a general method for matrix factorisation, which is significantly faster compared with NMF. Note also that some interesting and relevant ideas for the stochastic gradient descent algorithm were motivated by methods used in the well-known Netflix Cup; see, for example, (Paterek, 2007).

The proposed approach is flexible. We do not expect that a single specification will work optimally on all conceivable applications and, therefore, an opportunity for tuning and tailoring the algorithm is important.

Results obtained during the KDD-2009 Data Mining Competition are presented.

2. Task Description

The KDD Cup 2009 offered the opportunity to work on large marketing databases from the French Telecom company Orange to predict the propensity of customers to switch provider (churn), buy new products or services (appetency), or buy upgrades or add-ons proposed to them to make the sale more profitable (up-selling).

Churn (Wikipedia definition) is a measure of the number of individuals or items moving into or out of a collection over a specific period of time. The term is used in many contexts, but is most widely applied in business with respect to a contractual customer base. For instance, it is an important factor for any business with a subscriber-based service model, including mobile telephone networks and pay TV operators. The term is also used to refer to participant turnover in peer-to-peer networks. Appetency is the propensity to buy a service or a product. Up-selling (Wikipedia definition) is a sales technique whereby a salesman attempts to have the customer purchase more expensive items, upgrades, or other add-ons in an attempt to make a more profitable sale. Up-selling usually involves marketing more profitable services or products, but it can also be simply exposing the customer to other options that he or she may not have considered previously. Up-selling can imply selling something additional, or selling something that is more profitable or otherwise preferable for the seller instead of the original sale.

Customer Relationship Management (CRM) is a key element of modern marketing strategies. The most practical way in a CRM system to build knowledge on customers is to produce scores. A score (the output of a model) is computed using input variables which describe instances. Scores are then used by the information system (IS), for example, to personalize the customer relationship. There is also an industrial customer analysis platform able to build prediction models with a very large number of input variables (known as explanatory variables or features).

Generally, all features may be divided into two main parts: primary features (collected directly from the customer) and secondary features, which may be computed as functions of primary features or may be extracted from other databases according to the primary ones. Usually, the number of primary features is rather small (from 10 to 100). On the other hand, the number of secondary features may be huge (up to a few thousands). In most cases, proper design of the secondary features requires a deep understanding of the most important primary features.

The rapid and robust detection of the features that have contributed most to the output prediction can be a key factor in marketing applications. Time efficiency is often a crucial point, because marketing patterns have a dynamic nature, and in a few days' time it will be necessary to recompute the parameters of the model using fresh data. Therefore, part of the competition was to test the ability of the participants to deliver solutions quickly.

3. Main Models

Let $X = \{(\mathbf{x}_t, y_t),\ t = 1, \ldots, n\}$ be a training sample of observations, where $\mathbf{x}_t \in \mathbb{R}^{\ell}$ is an $\ell$-dimensional vector of features, and $y_t$ is a binary label: $y_t \in \{-1, 1\}$. Boldface letters denote vector-columns, whose components are labelled using a normal typeface.

In supervised classification algorithms, a classifier is trained with all the labelled training data and used to predict the class labels of unseen test data. In other words, the label $y_t$ may be hidden, and the task is to estimate it using the vector of features. Let us consider the most simple linear decision function
$$u_t = u(\mathbf{x}_t) = \sum_{j=1}^{\ell} w_j \cdot x_{tj} + b,$$
where the $w_j$ are weight coefficients and $b$ is a bias term.

We used the AUC as an evaluation criterion, where AUC is the area under the receiver operating characteristic (ROC) curve. By definition, the ROC curve is a graphical plot of the true positive rate against the false positive rate.
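For reference, the AUC can be computed directly from the ranks of the decision values; the following small helper (not taken from the paper) assumes labels in {-1, +1}.

```python
import numpy as np
from scipy.stats import rankdata

def auc(y, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic."""
    ranks = rankdata(scores)                      # average ranks, so ties count as 1/2
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```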



3.1 RLR Model

Let us consider the most basic quadratic minimization model with the following target function:
$$L(\mathbf{w}) = \Omega(\phi, n, \mathbf{w}) + \sum_{t=1}^{n} \xi_t \cdot (y_t - u_t)^2, \qquad (1)$$
where $\Omega(\phi, n, \mathbf{w}) = \phi \cdot n \cdot \|\mathbf{w}\|^2$ is a regularization term with ridge parameter $\phi$, and the $\xi_t$ are non-negative weight coefficients.

Remark 1 The aim of the regularization term with parameter $\phi$ is to reduce the difference between training and test results. The value of $\phi$ may be optimized using cross-validation; see Mol et al. (2009) for more details.
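A minimal sketch of such a cross-validation loop for $\phi$ is shown below; the candidate grid, the number of folds and the use of AUC as the selection criterion are illustrative assumptions, and `fit` stands for any routine returning the coefficients (w, b), such as the RLR sketch given in Section 3.1.1.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def choose_ridge_parameter(X, y, fit, phis=(1e-3, 1e-2, 1e-1, 1.0), n_folds=5, seed=0):
    """Select the ridge parameter phi by K-fold cross-validation on AUC.
    `fit(X, y, phi)` is assumed to return the pair (w, b)."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))
    mean_auc = []
    for phi in phis:
        scores = []
        for f in range(n_folds):
            train, valid = folds != f, folds == f
            w, b = fit(X[train], y[train], phi)
            scores.append(roc_auc_score(y[valid], X[valid] @ w + b))
        mean_auc.append(np.mean(scores))
    return phis[int(np.argmax(mean_auc))]
```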

3.1.1 Gradient-based Optimisation

The direction of the steepest descent is given by the vector
$$\mathbf{g}(\mathbf{w}) = \{g_j(\mathbf{w}),\ j = 1, \ldots, \ell\},$$
where
$$g_j(\mathbf{w}) = -\frac{\partial L(\mathbf{w})}{\partial w_j} = 2 \sum_{t=1}^{n} \xi_t \cdot x_{tj} \cdot (y_t - u_t) - 2 \phi \cdot n \cdot w_j.$$
The weight coefficients are updated iteratively, $\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \delta \cdot \mathbf{g}(\mathbf{w}^{(k)})$, where $k$ is the iteration number. Minimizing (1) as a function of the step size, we find the size of the step according to the formula
$$\delta = \frac{L_1 - L_2 - \phi \cdot n \sum_{j=1}^{\ell} w_j g_j}{\sum_{t=1}^{n} \xi_t s_t^2 + \phi \cdot n \sum_{j=1}^{\ell} g_j^2},$$
where $s_t = \sum_{j=1}^{\ell} g_j x_{tj}$, $L_1 = \sum_{t=1}^{n} \xi_t y_t s_t$ and $L_2 = \sum_{t=1}^{n} \xi_t u_t s_t$.
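A minimal sketch of this base-learner (not the authors' code) is given below; the function name, the fixed number of iterations and the decision to fold the bias into a constant feature, which is then also regularised, are our own illustrative choices.

```python
import numpy as np

def fit_rlr(X, y, phi=0.01, xi=None, n_iter=100):
    """Regularised linear regression with quadratic loss, fitted by steepest
    descent with the closed-form step size derived above."""
    n = X.shape[0]
    Xb = np.hstack([X, np.ones((n, 1))])              # last column carries the bias b
    xi = np.ones(n) if xi is None else np.asarray(xi, dtype=float)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        r = y - Xb @ w                                # residuals y_t - u_t
        g = 2 * Xb.T @ (xi * r) - 2 * phi * n * w     # steepest-descent direction
        s = Xb @ g                                    # s_t = sum_j g_j x_tj
        den = xi @ (s * s) + phi * n * (g @ g)
        if den < 1e-12:                               # direction vanished: converged
            break
        delta = (xi @ (r * s) - phi * n * (w @ g)) / den
        w = w + delta * g
    return w[:-1], w[-1]
```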


Figure 1: MVF, ratios (2): (a-b) Churn, (c-d) Appetency and (e-f) Upselling. In order to improve visibility, we display two fragments (out of 9586) with 500 ratios each.

This model includes two very important regulation parameters: $m$, the number of random subsets, and $q = \alpha / k$. Using gradient-based optimization (see Section 3.1.1), we can compute the matrix of linear regression coefficients $W = \{w_{ij},\ i = 1, \ldots, m,\ j = 0, \ldots, \ell\}$.

3.3 Mean-Variance Filtering

The mean-variance filtering (MVF) technique was introduced in (Nikulin, 2006), and can be used to reduce overfitting. Using the following ratios, we can measure the stability of the contributions of the particular features:
$$r_j = \frac{|\mu_j|}{s_j}, \quad j = 1, \ldots, \ell, \qquad (2)$$
where $\mu_j$ and $s_j$ are the mean and standard deviation of the coefficients $\{w_{ij},\ i = 1, \ldots, m\}$ computed over the $m$ random sets. Features whose contributions are unstable (small $r_j$) are removed from the model.
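A direct reading of this filter is sketched below; the acceptance threshold is left as an illustrative parameter.

```python
import numpy as np

def mvf_filter(W, threshold=1.0):
    """Mean-variance filtering: W holds one row of regression coefficients per
    random set and one column per feature; keep features with a stable ratio (2)."""
    mu = W.mean(axis=0)
    s = W.std(axis=0)
    r = np.abs(mu) / np.maximum(s, 1e-12)     # guard against zero variance
    return np.where(r >= threshold)[0]        # indices of the retained features
```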


3.4 Learning from the Test Data with Matrix Factorisation

Unlike classification and regression, matrix decomposition requires no response variable and thus falls into the category of unsupervised learning methods. Using this fact as a motivation, we can merge the training and test datasets into one matrix $X^{+}$. There are possible differences between the training and test sets, and we can expect that the matrix factorisation will be able to smooth such differences.

In this section, we describe the procedure for undertaking the matrix factorisation,
$$X^{+} \sim AB, \qquad (3)$$
where $A = \{a_{if},\ i = 1, \ldots, 2n,\ f = 1, \ldots, k\}$ and $B = \{b_{fj},\ f = 1, \ldots, k,\ j = 1, \ldots, \ell\}$. After the factorisation, the first $n$ rows of the matrix $A$ will be used for training and the last $n$ rows will be used for testing.

The matrix factorisation represents a gradient-based optimisation process with the objective to minimise the following squared loss function:
$$L(A, B) = \frac{1}{2n\ell} \sum_{i=1}^{2n} \sum_{j=1}^{\ell} E_{ij}^2, \qquad E_{ij} = x^{+}_{ij} - \sum_{f=1}^{k} a_{if} b_{fj}. \qquad (4)$$

Algorithm 1: Matrix factorisation.

The above target function (4) includes in total $k(2n + \ell)$ regulation parameters and may be unstable if we minimise it without taking into account the mutual dependence between the elements of the matrices $A$ and $B$.



Figure 2: Convergence of Algorithm 1: (a) Churn, $\ell = 477$; (b) Appetency, $\ell = 532$; and (c) Upselling, $\ell = 385$, where blue lines correspond to $k = 8$ and red lines correspond to $k = 20$.

As a solution to the problem, we can cycle through all the differences $E_{ij}$, minimising them as functions of the particular parameters involved in the definition of $E_{ij}$. Compared to the usual gradient-based optimisation, in our optimisation model we are dealing with two sets of parameters, and we therefore mix updates of these parameters uniformly, because the latter are dependent.

The following partial derivatives are necessary for Algorithm 1:
$$\frac{\partial E_{ij}^2}{\partial a_{if}} = -2 E_{ij} b_{fj}, \qquad (5a)$$
$$\frac{\partial E_{ij}^2}{\partial b_{fj}} = -2 E_{ij} a_{if}. \qquad (5b)$$

Similarly, as in Section 3.1, we could optimise the value of the step size here. However, taking into account the complexity of the model, it is better to maintain fixed and small values of the step size or learning rate. In all our experiments, we conducted matrix factorisation with the above Algorithm 1 using 300 global iterations and the following regulation parameters: $\lambda = 0.01$ (initial learning rate) and $\xi = 0.75$ (correction rate). We conducted experiments with Algorithm 1 against the datasets in binary format, and Figure 2 illustrates the convergence of the algorithm.
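The recovered text does not reproduce the steps of Algorithm 1 itself, so the loop below is only a sketch of an element-wise factorisation consistent with (4) and (5a)-(5b); in particular, the initialisation scale and the use of the correction rate to shrink the learning rate when the loss stops improving are assumptions on our part, and at the scale of the competition data the inner loops would have to be vectorised or sampled.

```python
import numpy as np

def factorise(X, k=20, n_iter=300, lam=0.01, corr=0.75, seed=0):
    """Gradient-based matrix factorisation X ~ A B with A of size (rows x k)
    and B of size (k x cols)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    A = 0.1 * rng.standard_normal((n_rows, k))
    B = 0.1 * rng.standard_normal((k, n_cols))
    prev_loss = np.inf
    for _ in range(n_iter):                          # global iterations
        for i in range(n_rows):                      # cycle through all differences E_ij
            for j in range(n_cols):
                e = X[i, j] - A[i, :] @ B[:, j]
                a_row = A[i, :].copy()
                A[i, :] += 2.0 * lam * e * B[:, j]   # gradient step using (5a)
                B[:, j] += 2.0 * lam * e * a_row     # gradient step using (5b)
        loss = np.mean((X - A @ B) ** 2)
        if loss > prev_loss:
            lam *= corr                              # assumed role of the correction rate
        prev_loss = loss
    return A, B
```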



4. Experiments

4.1 Pre-processing Technique

The sizes of the training and test datasets are the same and are equal to 50000. There are 14740 numerical and 260 categorical features in the large dataset. The training data are very imbalanced: the numbers of positive cases were 3672 (Churn), 890 (Appetency) and 3682 (Upselling) out of a total of 50000.

Firstly, we conducted a basic check of the categorical data. We detected 72 variables with only one value. In addition, we removed 74 variables for which the number of missing values was greater than 49500. The number of remaining variables was 184. As a next step, we considered all possible values of these 184 variables, and found that 1342 values occurred frequently enough to be considered as independent binary variables (otherwise, values were removed from any further consideration).
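The screening just described can be sketched roughly as follows; the frequency threshold deciding which values become binary indicators is an assumption (the recovered text reports only that 1342 values were kept), and the helper name is hypothetical.

```python
import pandas as pd

def binarise_categorical(df, max_missing=49500, min_count=100):
    """Drop constant or almost entirely missing categorical columns, then expand
    frequently occurring values into binary indicator columns."""
    keep = []
    for col in df.columns:
        if df[col].nunique(dropna=True) <= 1:        # only one observed value
            continue
        if df[col].isna().sum() > max_missing:       # almost entirely missing
            continue
        keep.append(col)
    binary = {}
    for col in keep:
        counts = df[col].value_counts(dropna=True)
        for value in counts[counts >= min_count].index:
            binary[f"{col}={value}"] = (df[col] == value).astype(int)
    return pd.DataFrame(binary)
```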

An effective way to link the information contained in numerical and categorical variables is to transfer the numerical variables to binary format, using a technique similar to the one applied above to the categorical variables. We removed all variables for which the number of missing and zero values was greater than 49500. The number of remaining variables was 3245. Next, we split the range of values of any particular variable into 1000 subintervals and computed the number of occurrences in each subinterval. These numbers were considered later as the weights of the bins.

Then, we split all subintervals of the particular variable into 10 consecutive bins of approximately the same size (in terms of weights). In many cases, where the weights of some subintervals were too large, the number of bins was smaller than 10.
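One possible reading of this binning step is sketched below for a single numerical variable; the subinterval and bin counts come from the text, while everything else (the treatment of missing values, the exact placement of the cut points) is assumed.

```python
import numpy as np

def binarise_numerical(x, n_sub=1000, n_bins=10):
    """Split the range into n_sub subintervals, use occurrence counts as weights,
    and group consecutive subintervals into at most n_bins equally weighted bins."""
    valid = np.isfinite(x)
    counts, edges = np.histogram(x[valid], bins=n_sub)       # subinterval weights
    cum = np.cumsum(counts) / counts.sum()
    targets = np.linspace(0, 1, n_bins + 1)[1:-1]            # desired weight quantiles
    cuts = np.unique(edges[1 + np.searchsorted(cum, targets)])
    codes = np.digitize(x, cuts)                             # bin index per observation
    onehot = np.eye(len(cuts) + 1, dtype=int)[codes]         # one binary column per bin
    onehot[~valid] = 0                                       # missing values fall in no bin
    return onehot
```

Because duplicate cut points are collapsed, heavily weighted subintervals automatically produce fewer than ten bins, in line with the behaviour described above.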

Finally, we obtained a fully binary dataset in a sparse format with 13594 variables. After secondary trimming, we were left with 9586 binary features.

Remark 3 It is a well-known fact that an exact correspondence between the small and large datasets may be found. We managed to find such a correspondence, as did some other teams (it was rather a response to the findings of other teams). However, we cannot report any significant progress (in the sense of absolute scores) achieved with the help of this additional information.

4.2 Results

The initial experiments, which were conducted against the vector of toy labels, were interesting and helpful for further studies. The system clearly detected all binary variables which are secondary to the single important original variable N5963 (see Table 1). The definition of the transformation function between two known variables is a rather technical issue, which may be solved easily using a two-step procedure: (1) sorting according to the explanatory variable, and (2) smoothing using sample averages in order to reduce noise.
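A minimal sketch of this two-step procedure (the length of the smoothing window is an arbitrary illustrative choice):

```python
import numpy as np

def estimate_transformation(x, y, window=101):
    """Recover the mapping from an explanatory variable x to a target y:
    (1) sort by x, (2) smooth y with a moving average to reduce noise."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order].astype(float)
    kernel = np.ones(window) / window
    y_smooth = np.convolve(y_sorted, kernel, mode="same")    # sample-average smoothing
    return x_sorted, y_smooth
```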

As a first step (after the labels for Churn, Appetency and Upselling were released), we applied the regularised linear regression model as described in Section 3.1. The number of random sets was 100 and the ridge parameter was $\phi = 0.01$. In order to form any random set, we used about 90% of the positive cases and $k = 1$; that is, any random set contains approximately equal numbers of positive and negative instances.



Table 1: Training and test results in terms of AUC, with averaged scores of 0.8059 (initial results), 0.8373 (fast and slow track results), 0.8407 (best results) and 0.851 (post-challenge results). The column FS indicates the number of variables used in the model, where by * we indicate the number of binary variables.

Status     Data       Method      Train    Test     FS
Initial    Churn      RLR         0.8554   0.7015   9586*
Initial    Appetency  LinearSVM   0.9253   0.8344   9586*
Initial    Upselling  RLR         0.9519   0.8819   9586*
Initial    Toy        RLR         0.7630   0.7190   645*
Slow/Best  Churn      LogitBoost  0.7666   0.7484   41
Slow       Appetency  LogitBoost  0.9345   0.8597   33
Slow       Upselling  LogitBoost  0.9226   0.9045   4
Post       Churn      Ensemble    0.8772   0.7618   278
Post       Appetency  Ensemble    0.9586   0.8835   348
Post       Upselling  Ensemble    0.9628   0.9077   285


Table 3: Results of the winners.

Track    Team          Churn    Appetency  Upselling  Score
Fast     IBM Research  0.7611   0.8830     0.9038     0.8493
Slow     Uni. Melb.    0.7542   0.8836     0.9047     0.8475
Overall  -             0.7651   0.8836     0.9092     0.8526

Table 2: Pure solutions used in the ensemble constructor, with test AUC and unnormalised weight coefficients.

Data       Model     Test     Weight
Appetency  BinaryRF  0.8784   10
Upselling  GBM       0.9072   20
Appetency  ADA       0.8742   3

5.1 Ensemble Constructor

Using the results of Table 2 separately, we can achieve a score of 0.8489. Now we shall describe a general framework showing how, using an ensemble technique, we can increase the score to 0.851 based on the results of Table 2 and without any additional information. Note that particular solutions produced using different software may have approximately the same AUC, but they are very different in a structural sense and we cannot link them directly.

Suppose we have $k$ solutions, which may be represented by the matrix $T$ with $k$ columns (one column per solution). By $S$ we shall denote the matrix derived from the original matrix of solutions $T$ by sorting each column in increasing order. In order to define the ensemble constructor, we also need a matrix $R$ of indices defining an exact correspondence between the elements of $T$ and $S$. More precisely, $s_{r_{ij} j} = t_{ij}$.

Figure 3: Histograms with 300 bins illustrating the structural similarities between training and test trajectories, where the right columns correspond to the score 0.8509: (a-b) Churn, (c-d) Appetency and (e-f) Upselling.

Let us denote by $\beta_j$ the quality measurement of the solution with index $j = 1, \ldots, k$. In order to simplify the following notation, and without loss of generality, we shall assume that (a) solution $i$ is better than solution $j$ if $\beta_i > \beta_j$, and (b) the top solution corresponds to index 1, that is, $\beta_1 \geq \beta_j,\ j = 2, \ldots, k$. Then we can accept solution N1 as the base solution, and adjust the remaining $k - 1$ solutions according to the rule
$$q_{ij} = s_{r_{ij} 1}, \quad i = 1, \ldots, n,\ j = 2, \ldots, k.$$
Note that the first column of the matrix $Q$ coincides with the first column of the original matrix $T$ (the base solution).

We shall define a vector-column of weight coefficients $w_j = \psi(\beta_j)$, $\sum_{j=1}^{k} w_j = 1$, where $\psi$ is an increasing function. In line with the proposed method, the ensemble solution will be computed according to the formula $f = QW$.
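A sketch of this constructor is given below; mapping every solution onto the rank scale of the base solution before averaging follows the rule above, while the identity weight function $\psi$ is an illustrative choice.

```python
import numpy as np

def ensemble_constructor(T, beta, psi=lambda b: b):
    """Rank-based ensemble: T is an n x k matrix with one column per solution and
    beta holds the corresponding quality measurements (e.g. validation AUC)."""
    beta = np.asarray(beta, dtype=float)
    order = np.argsort(beta)[::-1]                   # best solution becomes column 0
    T, beta = T[:, order], beta[order]
    S = np.sort(T, axis=0)                           # each column sorted in increasing order
    R = np.argsort(np.argsort(T, axis=0), axis=0)    # r_ij: rank of t_ij within its column
    Q = np.array(T, dtype=float)                     # column 0 (base solution) kept unchanged
    for j in range(1, T.shape[1]):
        Q[:, j] = S[R[:, j], 0]                      # q_ij = s_{r_ij,1}: ranks on the base scale
    w = psi(beta)
    w = w / w.sum()                                  # weights sum to one
    return Q @ w                                     # f = QW
```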

We can report significant progress with the above technique in application to the Appetency case. Our ensemble solution was constructed using three pure solutions, as displayed in Table 2, where the weight coefficients are shown without normalisation. Also, we achieved some modest progress in application to the Upselling case. However, in the case of Churn we were not able to improve the GBM solution using the above technique.

On the other hand, we tried to exploit mutual dependencies between the different cases. For example, a higher combined score for the Appetency and Upselling cases most likely indicates a lower score for the Churn case:
$$\mathrm{new.score}_C = \mathrm{old.score}_C - c_1 \cdot \mathrm{score}_A - c_2 \cdot \mathrm{score}_U, \qquad (6)$$
where $c_1$ and $c_2$ are non-negative constants. In fact, using the above method (6), we managed to achieve only a small improvement, in application to the Churn case. Possibly, some interesting results may be achieved using multinomial logistic regression (MLR) (Bohning, 1992), as this model explores hidden dependencies between the labels. In our case we have four different labels: 1) Churn, 2) Appetency, 3) Upselling and 4) Other. In the case of MLR, we would be dealing with a $3\ell$-dimensional Hessian matrix of second derivatives. We can expect that the matrix factorisation, as described in Section 3.4, will be effective in reducing the original dimensionality $\ell$ to $k$, the number of latent factors.

It is interesting to note that the histograms displayed in Figure 3 illustrate a remarkable similarity between the left (training) and right (test) columns. The structure of the histograms corresponding to the Upselling case is far from simple, and it will be very important to find explanations in terms of the particular features.

6. Concluding Remarks

The main philosophy of our method may be described as follows. We cannot apply fairly complex modelling systems to the original huge and noisy database, which contains more than 90% useless information. So we conducted the FS step with three very simple and reliable methods, namely RS, RLR, and MVF. As an outcome, we produced significantly smaller datasets, which can be used as input for more advanced studies.

In general terms, we have found our results satisfactory, particularly for the most important fast track. Based on the results of our post-challenge submission, we can conclude that significant progress may be achieved using more advanced pre-processing and feature selection techniques. Also, it is a good idea not to rely completely on any particular model or software, and to conduct cross-validation experiments with several different models. In this way, the performance of the final solution might be improved using various ensemble techniques.

References

G. Biau, L. Devroye, and G. Lugosi. Consistency of random forests and other averaging classifiers. Journal of Machine Learning Research, 9:2015-2033, 2007.

D. Bohning. Multinomial logistic regression algorithm. Ann. Inst. Statist. Math., 44(1):197-200, 1992.

L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

D. Lee and H. Seung. Algorithms for non-negative matrix factorisation. NIPS, 2001.

C. Mol, S. Mosci, M. Traskine, and A. Verri. A regularised method for selecting nested groups of relevant genes from microarray data. Journal of Computational Biology, 16(5):677-690, 2009.

V. Nikulin. Learning with mean-variance filtering, SVM and gradient-based optimization. In International Joint Conference on Neural Networks, Vancouver, BC, Canada, July 16-21, pages 4195-4202. IEEE, 2006.

V. Nikulin. Classification of imbalanced data with random sets and mean-variance filtering. International Journal of Data Warehousing and Mining, 4(2):63-78, 2008.

A. Paterek. Improving regularised singular value decomposition for collaborative filtering. Pages 39-42, KDD, San Jose, 2007.

B. Zhang, T. Pham, and Y. Zhang. Bagging support vector machine for classification of SELDI-ToF mass spectra of ovarian cancer serum samples. In LNAI, volume 4830, pages 820-826. Springer, 2007.

