搜档网
当前位置:搜档网 › Introduction to data mining-chap4

Introduction to data mining-chap4

4 Classi?cation:

Basic Concepts,

Decision Trees,and

Model Evaluation

Classi?cation,which is the task of assigning objects to one of several prede?ned categories,is a pervasive problem that encompasses many diverse applications. Examples include detecting spam email messages based upon the message header and content,categorizing cells as malignant or benign based upon the results of MRI scans,and classifying galaxies based upon their shapes(see Figure4.1).

(a)A spiral galaxy.(b)An elliptical galaxy.

Figure4.1.Classi?cation of galaxies.The images are from the NASA website.

146Chapter 4

Classi?cation

Classification

model

Input

Attribute set

(x )

Output

Class label

(y )

Figure 4.2.Classi?cation as the task of mapping an input attribute set x into its class label y .

This chapter introduces the basic concepts of classi?cation,describes some of the key issues such as model over?tting,and presents methods for evaluating and comparing the performance of a classi?cation technique.While it focuses mainly on a technique known as decision tree induction,most of the discussion in this chapter is also applicable to other classi?cation techniques,many of which are covered in Chapter 5.

4.1Preliminaries

The input data for a classi?cation task is a collection of records.Each record,also known as an instance or example,is characterized by a tuple (x ,y ),where x is the attribute set and y is a special attribute,designated as the class label (also known as category or target attribute).Table 4.1shows a sample data set used for classifying vertebrates into one of the following categories:mammal,bird,?sh,reptile,or amphibian.The attribute set includes properties of a vertebrate such as its body temperature,skin cover,method of reproduction,ability to ?y,and ability to live in water.Although the attributes presented in Table 4.1are mostly discrete,the attribute set can also contain continuous features.The class label,on the other hand,must be a discrete attribute.This is a key characteristic that distinguishes classi?cation from regression ,a predictive modeling task in which y is a continuous attribute.Regression techniques are covered in Appendix D.

De?nition 4.1(Classi?cation).Classi?cation is the task of learning a tar-get function f that maps each attribute set x to one of the prede?ned class labels y .

The target function is also known informally as a classi?cation model .A classi?cation model is useful for the following purposes.

Descriptive Modeling A classi?cation model can serve as an explanatory tool to distinguish between objects of di?erent classes.For example,it would be useful—for both biologists and others—to have a descriptive model that

4.1Preliminaries147

Table4.1.The vertebrate data set.

Name Body Skin Gives Aquatic Aerial Has Hiber-Class Temperature Cover Birth Creature Creature Legs nates Label human warm-blooded hair yes no no yes no mammal python cold-blooded scales no no no no yes reptile salmon cold-blooded scales no yes no no no?sh whale warm-blooded hair yes yes no no no mammal frog cold-blooded none no semi no yes yes amphibian cold-blooded scales no no no yes no reptile komodo

dragon

bat warm-blooded hair yes no yes yes yes mammal pigeon warm-blooded feathers no no yes yes no bird cat warm-blooded fur yes no no yes no mammal cold-blooded scales yes yes no no no?sh leopard

shark

turtle cold-blooded scales no semi no yes no reptile penguin warm-blooded feathers no semi no yes no bird porcupine warm-blooded quills yes no no yes yes mammal eel cold-blooded scales no yes no no no?sh salamander cold-blooded none no semi no yes yes amphibian

summarizes the data shown in Table4.1and explains what features de?ne a vertebrate as a mammal,reptile,bird,?sh,or amphibian.

Predictive Modeling A classi?cation model can also be used to predict the class label of unknown records.As shown in Figure4.2,a classi?cation model can be treated as a black box that automatically assigns a class label when presented with the attribute set of an unknown record.Suppose we are given the following characteristics of a creature known as a gila monster:

Name Body Skin Gives Aquatic Aerial Has Hiber-Class Temperature Cover Birth Creature Creature Legs nates Label gila monster cold-blooded scales no no no yes yes?

We can use a classi?cation model built from the data set shown in Table4.1 to determine the class to which the creature belongs.

Classi?cation techniques are most suited for predicting or describing data sets with binary or nominal categories.They are less e?ective for ordinal categories(e.g.,to classify a person as a member of high-,medium-,or low-income group)because they do not consider the implicit order among the categories.Other forms of relationships,such as the subclass–superclass re-lationships among categories(e.g.,humans and apes are primates,which in

148Chapter 4Classi?cation

turn,is a subclass of mammals)are also ignored.The remainder of this chapter focuses only on binary or nominal class labels.

4.2

General Approach to Solving a Classi?cation Problem

A classi?cation technique (or classi?er)is a systematic approach to building classi?cation models from an input data set.Examples include decision tree classi?ers,rule-based classi?ers,neural networks,support vector machines,and na¨?ve Bayes classi?ers.Each technique employs a learning algorithm to identify a model that best ?ts the relationship between the attribute set and class label of the input data.The model generated by a learning algorithm should both ?t the input data well and correctly predict the class labels of records it has never seen before.Therefore,a key objective of the learning algorithm is to build models with good generalization capability;i.e.,models that accurately predict the class labels of previously unknown records.

Figure 4.3shows a general approach for solving classi?cation problems.First,a training set consisting of records whose class labels are known

must

Figure 4.3.General approach for building a classi?cation model.

4.2General Approach to Solving a Classi?cation Problem149

Table4.2.Confusion matrix for a2-class problem.

Predicted Class

Class=1Class=0

Actual Class=1f11f10

Class Class=0f01f00

be provided.The training set is used to build a classi?cation model,which is subsequently applied to the test set,which consists of records with unknown class labels.

Evaluation of the performance of a classi?cation model is based on the counts of test records correctly and incorrectly predicted by the model.These counts are tabulated in a table known as a confusion matrix.Table4.2 depicts the confusion matrix for a binary classi?cation problem.Each entry f ij in this table denotes the number of records from class i predicted to be of class j.For instance,f01is the number of records from class0incorrectly predicted as class1.Based on the entries in the confusion matrix,the total number of correct predictions made by the model is(f11+f00)and the total number of incorrect predictions is(f10+f01).

Although a confusion matrix provides the information needed to determine how well a classi?cation model performs,summarizing this information with a single number would make it more convenient to compare the performance of di?erent models.This can be done using a performance metric such as accuracy,which is de?ned as follows:

Accuracy=Number of correct predictions

Total number of predictions

=

f11+f00

f11+f10+f01+f00

.(4.1)

Equivalently,the performance of a model can be expressed in terms of its error rate,which is given by the following equation:

Error rate=Number of wrong predictions

Total number of predictions

=

f10+f01

f11+f10+f01+f00

.(4.2)

Most classi?cation algorithms seek models that attain the highest accuracy,or equivalently,the lowest error rate when applied to the test set.We will revisit the topic of model evaluation in Section4.5.

150Chapter4Classi?cation

4.3Decision Tree Induction

This section introduces a decision tree classi?er,which is a simple yet widely used classi?cation technique.

4.3.1How a Decision Tree Works

To illustrate how classi?cation with a decision tree works,consider a simpler version of the vertebrate classi?cation problem described in the previous sec-tion.Instead of classifying the vertebrates into?ve distinct groups of species, we assign them to two categories:mammals and non-mammals.

Suppose a new species is discovered by scientists.How can we tell whether it is a mammal or a non-mammal?One approach is to pose a series of questions about the characteristics of the species.The?rst question we may ask is whether the species is cold-or warm-blooded.If it is cold-blooded,then it is de?nitely not a mammal.Otherwise,it is either a bird or a mammal.In the latter case,we need to ask a follow-up question:Do the females of the species give birth to their young?Those that do give birth are de?nitely mammals, while those that do not are likely to be non-mammals(with the exception of egg-laying mammals such as the platypus and spiny anteater).

The previous example illustrates how we can solve a classi?cation problem by asking a series of carefully crafted questions about the attributes of the test record.Each time we receive an answer,a follow-up question is asked until we reach a conclusion about the class label of the record.The series of questions and their possible answers can be organized in the form of a decision tree,which is a hierarchical structure consisting of nodes and directed edges. Figure4.4shows the decision tree for the mammal classi?cation problem.The tree has three types of nodes:

?A root node that has no incoming edges and zero or more outgoing edges.

?Internal nodes,each of which has exactly one incoming edge and two or more outgoing edges.

?Leaf or terminal nodes,each of which has exactly one incoming edge and no outgoing edges.

In a decision tree,each leaf node is assigned a class label.The non-terminal nodes,which include the root and other internal nodes,contain attribute test conditions to separate records that have di?erent characteris-tics.For example,the root node shown in Figure4.4uses the attribute Body

4.3Decision Tree Induction151

Figure4.4.A decision tree for the mammal classi?cation problem. Temperature to separate warm-blooded from cold-blooded vertebrates.Since all cold-blooded vertebrates are non-mammals,a leaf node labeled Non-mammals is created as the right child of the root node.If the vertebrate is warm-blooded, a subsequent attribute,Gives Birth,is used to distinguish mammals from other warm-blooded creatures,which are mostly birds.

Classifying a test record is straightforward once a decision tree has been constructed.Starting from the root node,we apply the test condition to the record and follow the appropriate branch based on the outcome of the test. This will lead us either to another internal node,for which a new test condition is applied,or to a leaf node.The class label associated with the leaf node is then assigned to the record.As an illustration,Figure4.5traces the path in the decision tree that is used to predict the class label of a?amingo.The path terminates at a leaf node labeled Non-mammals.

4.3.2How to Build a Decision Tree

In principle,there are exponentially many decision trees that can be con-structed from a given set of attributes.While some of the trees are more accu-rate than others,?nding the optimal tree is computationally infeasible because of the exponential size of the search space.Nevertheless,e?cient algorithms have been developed to induce a reasonably accurate,albeit suboptimal,de-cision tree in a reasonable amount of time.These algorithms usually employ a greedy strategy that grows a decision tree by making a series of locally op-

152Chapter4Classi?cation

Figure4.5.Classifying an unlabeled vertebrate.The dashed lines represent the outcomes of applying various attribute test conditions on the unlabeled vertebrate.The vertebrate is eventually assigned to the Non-mammal class.

timum decisions about which attribute to use for partitioning the data.One such algorithm is Hunt’s algorithm,which is the basis of many existing de-cision tree induction algorithms,including ID3,C4.5,and CART.This section presents a high-level discussion of Hunt’s algorithm and illustrates some of its design issues.

Hunt’s Algorithm

In Hunt’s algorithm,a decision tree is grown in a recursive fashion by parti-tioning the training records into successively purer subsets.Let D t be the set of training records that are associated with node t and y={y1,y2,...,y c}be the class labels.The following is a recursive de?nition of Hunt’s algorithm.

Step1:If all the records in D t belong to the same class y t,then t is a leaf node labeled as y t.

Step2:If D t contains records that belong to more than one class,an at-tribute test condition is selected to partition the records into smaller subsets.A child node is created for each outcome of the test condi-tion and the records in D t are distributed to the children based on the outcomes.The algorithm is then recursively applied to each child node.

4.3Decision Tree Induction153

b i n a

r y

c a

t e

g o

r i c

a l

c o

n t

i n

u o

u s

c l a

s

s

Figure4.6.Training set for predicting borrowers who will default on loan payments.

To illustrate how the algorithm works,consider the problem of predicting whether a loan applicant will repay her loan obligations or become delinquent, subsequently defaulting on her loan.A training set for this problem can be constructed by examining the records of previous borrowers.In the example shown in Figure4.6,each record contains the personal information of a bor-rower along with a class label indicating whether the borrower has defaulted on loan payments.

The initial tree for the classi?cation problem contains a single node with class label Defaulted=No(see Figure4.7(a)),which means that most of the borrowers successfully repaid their loans.The tree,however,needs to be re?ned since the root node contains records from both classes.The records are subsequently divided into smaller subsets based on the outcomes of the Home Owner test condition,as shown in Figure4.7(b).The justi?cation for choosing this attribute test condition will be discussed later.For now,we will assume that this is the best criterion for splitting the data at this point.Hunt’s algorithm is then applied recursively to each child of the root node.From the training set given in Figure4.6,notice that all borrowers who are home owners successfully repaid their loans.The left child of the root is therefore a leaf node labeled Defaulted=No(see Figure4.7(b)).For the right child,we need to continue applying the recursive step of Hunt’s algorithm until all the records belong to the same class.The trees resulting from each recursive step are shown in Figures4.7(c)and(d).

154Chapter 4Classi?cation

(a)

(b)Defaulted = No

(d)

(c)Figure 4.7.Hunt’s algorithm for inducing decision trees.

Hunt’s algorithm will work if every combination of attribute values is present in the training data and each combination has a unique class label.These assumptions are too stringent for use in most practical situations.Ad-ditional conditions are needed to handle the following cases:

1.It is possible for some of the child nodes created in Step 2to be empty;i.e.,there are no records associated with these nodes.This can happen if none of the training records have the combination of attribute values associated with such nodes.In this case the node is declared a leaf node with the same class label as the majority class of training records associated with its parent node.

2.In Step 2,if all the records associated with D t have identical attribute values (except for the class label),then it is not possible to split these records any further.In this case,the node is declared a leaf node with the same class label as the majority class of training records associated with this node.

4.3Decision Tree Induction155 Design Issues of Decision Tree Induction

A learning algorithm for inducing decision trees must address the following two issues.

1.How should the training records be split?Each recursive step

of the tree-growing process must select an attribute test condition to divide the records into smaller subsets.To implement this step,the algorithm must provide a method for specifying the test condition for di?erent attribute types as well as an objective measure for evaluating the goodness of each test condition.

2.How should the splitting procedure stop?A stopping condition is

needed to terminate the tree-growing process.A possible strategy is to continue expanding a node until either all the records belong to the same class or all the records have identical attribute values.Although both conditions are su?cient to stop any decision tree induction algorithm, other criteria can be imposed to allow the tree-growing procedure to terminate earlier.The advantages of early termination will be discussed later in Section4.4.5.

4.3.3Methods for Expressing Attribute Test Conditions Decision tree induction algorithms must provide a method for expressing an attribute test condition and its corresponding outcomes for di?erent attribute types.

Binary Attributes The test condition for a binary attribute generates two potential outcomes,as shown in Figure4.8.

Warm-blooded

Cold-blooded

Figure4.8.Test condition for binary attributes.

156Chapter 4

Classi?cation

{Married}

{Single,Divorced}

(a) Multiway split

{Single}

{Married,Divorced}(b) Binary split {by grouping attribute values}

{Single,Married}

{Divorced}

OR

OR

Figure 4.9.Test conditions for nominal attributes.

Nominal Attributes Since a nominal attribute can have many values,its test condition can be expressed in two ways,as shown in Figure 4.9.For a multiway split (Figure 4.9(a)),the number of outcomes depends on the number of distinct values for the corresponding attribute.For example,if an attribute such as marital status has three distinct values—single,married,or divorced—its test condition will produce a three-way split.On the other hand,some decision tree algorithms,such as CART,produce only binary splits by considering all 2k ?1?1ways of creating a binary partition of k attribute values.Figure 4.9(b)illustrates three di?erent ways of grouping the attribute values for marital status into two subsets.

Ordinal Attributes Ordinal attributes can also produce binary or multiway splits.Ordinal attribute values can be grouped as long as the grouping does not violate the order property of the attribute values.Figure 4.10illustrates various ways of splitting training records based on the Shirt Size attribute.The groupings shown in Figures 4.10(a)and (b)preserve the order among the attribute values,whereas the grouping shown in Figure 4.10(c)violates this property because it combines the attribute values Small and Large into

4.3

Decision Tree Induction 157

{Small,Medium}{Large,

Extra Large}(a)

{Small}

{Medium, Large,Extra Large}

(b)

{Small,Large}

{Medium,Extra Large}

(c)

Figure 4.10.Different ways of grouping ordinal attribute values.

the same partition while Medium and Extra Large are combined into another partition.

Continuous Attributes For continuous attributes,the test condition can be expressed as a comparison test (A

preserved.

(b)

(a)

{10K, 25K}{25K, 50K}{50K, 80K}

Figure 4.11.Test condition for continuous attributes.

158Chapter4Classi?cation

(a)

(b)(c) Figure4.12.Multiway versus binary splits.

4.3.4Measures for Selecting the Best Split

There are many measures that can be used to determine the best way to split the records.These measures are de?ned in terms of the class distribution of the records before and after splitting.

Let p(i|t)denote the fraction of records belonging to class i at a given node t.We sometimes omit the reference to node t and express the fraction as p i. In a two-class problem,the class distribution at any node can be written as (p0,p1),where p1=1?p0.To illustrate,consider the test conditions shown in Figure4.12.The class distribution before splitting is(0.5,0.5)because there are an equal number of records from each class.If we split the data using the Gender attribute,then the class distributions of the child nodes are (0.6,0.4)and(0.4,0.6),respectively.Although the classes are no longer evenly distributed,the child nodes still contain records from both classes.Splitting on the second attribute,Car Type,will result in purer partitions.

The measures developed for selecting the best split are often based on the degree of impurity of the child nodes.The smaller the degree of impurity,the more skewed the class distribution.For example,a node with class distribu-tion(0,1)has zero impurity,whereas a node with uniform class distribution (0.5,0.5)has the highest impurity.Examples of impurity measures include

Entropy(t)=?c?1

i=0

p(i|t)log2p(i|t),(4.3)

Gini(t)=1?c?1

i=0

[p(i|t)]2,(4.4)

Classi?cation error(t)=1?max

i

[p(i|t)],(4.5) where c is the number of classes and0log20=0in entropy calculations.

4.3

Decision Tree Induction 159

Entropy

Gini

Misclassification error 10.90.80.70.60.50.40.30.20.100

0.1

0.2

0.3

0.4

0.50.6

0.70.80.91

p

Figure https://www.sodocs.net/doc/178497935.html,parison among the impurity measures for binary classi?cation problems.

Figure 4.13compares the values of the impurity measures for binary classi-?cation problems.p refers to the fraction of records that belong to one of the two classes.Observe that all three measures attain their maximum value when the class distribution is uniform (i.e.,when p =0.5).The minimum values for the measures are attained when all the records belong to the same class (i.e.,when p equals 0or 1).We next provide several examples of computing the di?erent impurity measures.

Node N 1Count Class=00Class=16Gini =1?(0/6)2?(6/6)2=0

Entropy =?(0/6)log 2(0/6)?(6/6)log 2(6/6)=0Error =1?max[0/6,6/6]=0

Node N 2Count Class=01Class=15Gini =1?(1/6)2?(5/6)2=0.278

Entropy =?(1/6)log 2(1/6)?(5/6)log 2(5/6)=0.650Error =1?max[1/6,5/6]=0.167

Node N 3Count Class=03Class=1

3

Gini =1?(3/6)2?(3/6)2=0.5

Entropy =?(3/6)log 2(3/6)?(3/6)log 2(3/6)=1Error =1?max[3/6,3/6]=0.5

160Chapter4Classi?cation

The preceding examples,along with Figure4.13,illustrate the consistency among di?erent impurity measures.Based on these calculations,node N1has the lowest impurity value,followed by N2and N3.Despite their consistency, the attribute chosen as the test condition may vary depending on the choice of impurity measure,as will be shown in Exercise3on page198.

To determine how well a test condition performs,we need to compare the degree of impurity of the parent node(before splitting)with the degree of impurity of the child nodes(after splitting).The larger their di?erence,the better the test condition.The gain,?,is a criterion that can be used to determine the goodness of a split:

?=I(parent)?

k

j=1

N(v j)

N

I(v j),(4.6)

where I(·)is the impurity measure of a given node,N is the total number of records at the parent node,k is the number of attribute values,and N(v j) is the number of records associated with the child node,v j.Decision tree induction algorithms often choose a test condition that maximizes the gain ?.Since I(parent)is the same for all test conditions,maximizing the gain is equivalent to minimizing the weighted average impurity measures of the child nodes.Finally,when entropy is used as the impurity measure in Equation4.6, the di?erence in entropy is known as the information gain,?info. Splitting of Binary Attributes

Consider the diagram shown in Figure4.14.Suppose there are two ways to split the data into smaller subsets.Before splitting,the Gini index is0.5since there are an equal number of records from both classes.If attribute A is chosen to split the data,the Gini index for node N1is0.4898,and for node N2,it is0.480.The weighted average of the Gini index for the descendent nodes is (7/12)×0.4898+(5/12)×0.480=0.486.Similarly,we can show that the weighted average of the Gini index for attribute B is0.375.Since the subsets for attribute B have a smaller Gini index,it is preferred over attribute A. Splitting of Nominal Attributes

As previously noted,a nominal attribute can produce either binary or multi-way splits,as shown in Figure4.15.The computation of the Gini index for a binary split is similar to that shown for determining binary attributes.For the ?rst binary grouping of the Car Type attribute,the Gini index of{Sports,

4.3

Decision Tree Induction 161

Gini = 0.375

N114

52

N2C0C1

Gini = 0.500

Parent 66

C0C1

Gini = 0.486N143

23

N2C0Node N1

C1

Node N2

A

Y es No Node N1

Node N2

B

Y es No Figure 4.14.Splitting binary attributes.

(a) Binary split

(b) Multiway split

Figure 4.15.Splitting nominal attributes.

Luxury }is 0.4922and the Gini index of {Family }is 0.3750.The weighted average Gini index for the grouping is equal to

16/20×0.4922+4/20×0.3750=0.468.

Similarly,for the second binary grouping of {Sports }and {Family,Luxury },the weighted average Gini index is 0.167.The second grouping has a lower Gini index because its corresponding subsets are much purer.

162Chapter4Classi?cation

Figure4.16.Splitting continuous attributes.

For the multiway split,the Gini index is computed for every attribute value. Since Gini({Family})=0.375,Gini({Sports})=0,and Gini({Luxury})= 0.219,the overall Gini index for the multiway split is equal to

4/20×0.375+8/20×0+8/20×0.219=0.163.

The multiway split has a smaller Gini index compared to both two-way splits. This result is not surprising because the two-way split actually merges some of the outcomes of a multiway split,and thus,results in less pure subsets. Splitting of Continuous Attributes

Consider the example shown in Figure4.16,in which the test condition Annual Income≤v is used to split the training records for the loan default classi?ca-tion problem.A brute-force method for?nding v is to consider every value of the attribute in the N records as a candidate split position.For each candidate v,the data set is scanned once to count the number of records with annual income less than or greater than v.We then compute the Gini index for each candidate and choose the one that gives the lowest value.This approach is computationally expensive because it requires O(N)operations to compute the Gini index at each candidate split position.Since there are N candidates, the overall complexity of this task is O(N2).To reduce the complexity,the training records are sorted based on their annual income,a computation that requires O(N log N)time.Candidate split positions are identi?ed by taking the midpoints between two adjacent sorted values:55,65,72,and so on.How-ever,unlike the brute-force approach,we do not have to examine all N records when evaluating the Gini index of a candidate split position.

For the?rst candidate,v=55,none of the records has annual income less than$55K.As a result,the Gini index for the descendent node with Annual

4.3Decision Tree Induction163 Income<$55K is zero.On the other hand,the number of records with annual income greater than or equal to$55K is3(for class Yes)and7(for class No), respectively.Thus,the Gini index for this node is0.420.The overall Gini index for this candidate split position is equal to0×0+1×0.420=0.420.

For the second candidate,v=65,we can determine its class distribution by updating the distribution of the previous candidate.More speci?cally,the new distribution is obtained by examining the class label of the record with the lowest annual income(i.e.,$60K).Since the class label for this record is No,the count for class No is increased from0to1(for Annual Income≤$65K) and is decreased from7to6(for Annual Income>$65K).The distribution for class Yes remains unchanged.The new weighted-average Gini index for this candidate split position is0.400.

This procedure is repeated until the Gini index values for all candidates are computed,as shown in Figure4.16.The best split position corresponds to the one that produces the smallest Gini index,i.e.,v=97.This procedure is less expensive because it requires a constant amount of time to update the class distribution at each candidate split position.It can be further optimized by considering only candidate split positions located between two adjacent records with di?erent class labels.For example,because the?rst three sorted records (with annual incomes$60K,$70K,and$75K)have identical class labels,the best split position should not reside between$60K and$75K.Therefore,the candidate split positions at v=$55K,$65K,$72K,$87K,$92K,$110K,$122K, $172K,and$230K are ignored because they are located between two adjacent records with the same class labels.This approach allows us to reduce the number of candidate split positions from11to2.

Gain Ratio

Impurity measures such as entropy and Gini index tend to favor attributes that have a large number of distinct values.Figure4.12shows three alternative test conditions for partitioning the data set given in Exercise2on page198. Comparing the?rst test condition,Gender,with the second,Car Type,it is easy to see that Car Type seems to provide a better way of splitting the data since it produces purer descendent nodes.However,if we compare both conditions with Customer ID,the latter appears to produce purer partitions. Yet Customer ID is not a predictive attribute because its value is unique for each record.Even in a less extreme situation,a test condition that results in a large number of outcomes may not be desirable because the number of records associated with each partition is too small to enable us to make any reliable predictions.

164Chapter4Classi?cation

There are two strategies for overcoming this problem.The?rst strategy is to restrict the test conditions to binary splits only.This strategy is employed by decision tree algorithms such as CART.Another strategy is to modify the splitting criterion to take into account the number of outcomes produced by the attribute test condition.For example,in the C4.5decision tree algorithm, a splitting criterion known as gain ratio is used to determine the goodness of a split.This criterion is de?ned as follows:

Gain ratio=

?info

Split Info

.(4.7)

Here,Split Info=? k

i=1

P(v i)log2P(v i)and k is the total number of splits.

For example,if each attribute value has the same number of records,then ?i:P(v i)=1/k and the split information would be equal to log2k.This example suggests that if an attribute produces a large number of splits,its split information will also be large,which in turn reduces its gain ratio.

4.3.5Algorithm for Decision Tree Induction

A skeleton decision tree induction algorithm called TreeGrowth is shown in Algorithm4.1.The input to this algorithm consists of the training records E and the attribute set F.The algorithm works by recursively selecting the best attribute to split the data(Step7)and expanding the leaf nodes of the Algorithm4.1A skeleton decision tree induction algorithm.

TreeGrowth(E,F)

1:if stopping cond(E,F)=true then

2:leaf=createNode().

3:https://www.sodocs.net/doc/178497935.html,bel=Classify(E).

4:return leaf.

5:else

6:root=createNode().

7:root.test cond=find best split(E,F).

8:let V={v|v is a possible outcome of root.test cond}.

9:for each v∈V do

10:E v={e|root.test cond(e)=v and e∈E}.

11:child=TreeGrowth(E v,F).

12:add child as descendent of root and label the edge(root→child)as v. 13:end for

14:end if

15:return root.

相关主题