
Improved Feature Selection Algorithm Based on SVM and Correlation

Zong-Xia Xie, Qing-Hua Hu, and Da-Ren Yu

Harbin Institute of Technology, Harbin, P.R. China

xiezongxia@...

Abstract. As a feature selection method, support vector machine recursive feature elimination (SVM-RFE) can remove irrelevant features but does not take redundant features into consideration. In this paper, it is shown why this method cannot remove redundant features, and an improved technique is presented. The correlation coefficient is introduced to measure the redundancy in the subset selected by SVM-RFE. Features that have a high correlation coefficient with some important feature are removed. Experimental results show that there actually are several strongly redundant features in the subsets selected by SVM-RFE; the coefficients reach as high as 0.99. The proposed method can not only reduce the number of features but also keep the classification accuracy.

1 Introduction

In recent years, data has become increasingly large not only in the number of instances but also in the number of features in many applications, such as gene selection from micro-array data and automatic text categorization, where the number of features in the raw data ranges from hundreds to tens of thousands. High dimensionality brings great difficulty in understanding the data and in constructing classifier systems for pattern recognition and machine learning. Therefore, feature selection has attracted much attention from the pattern recognition and machine learning communities [1,2,3,4]. The aim of feature selection (FS) is to find a subset of the original features of a given dataset by removing irrelevant and/or redundant features [5]. FS can improve the prediction performance of the predictors, provide faster and more cost-effective predictors, and give a better insight into the underlying process that generated the data [3,4].

Traditional statistical methods for clustering and classification have been extensively used for feature selection. Support Vector Machines (SVMs), proposed in the 1990s by Vapnik, are already known as an efficient tool for discovering informative attributes [6]. SVMs minimize an upper bound of the true error by maximizing the margin with which they separate the data into distinct classes. In the framework of statistical learning theory, the upper bound on the error of a classifier depends on its VC dimension and the number of training data. Feature subset selection may reduce the VC dimension of the trained classifier and thereby improve the classification performance. A large body of empirical analysis reported in the literature supports this argument.



A series of feature selection algorithms for SVMs or using SVMs have been proposed [7,8,9]. In 2002, SVMs were first used as a feature selection method in gene selection for cancer classification, an approach called SVM recursive feature elimination (SVM-RFE) [7]. In 2003, Rakotomamonjy presented several variable selection algorithms based on different kinds of SVM criteria, such as the weight vector, the radius/margin bound, the span estimate, and so on [8]. Li et al. extended these techniques to multi-class classification in 2004 [9]. In 2005, Duan et al. generalized the SVM-RFE method to multi-class problems by computing the feature ranking score from a statistical analysis of the weight vectors of multiple linear SVMs trained on subsamples of the original training data [10].

However, all of the above techniques mainly focus on finding relevant features and deleting those features which are irrelevant to the classification task, but they do not try to delete the redundant features from the data. The weight vector represents the relation between the attribute set and the decision; by deleting the features with small weights, we can cut down the irrelevant information. Just as Yu and Liu pointed out [11], an entire feature set can be conceptually divided into four basic disjoint parts: irrelevant features (I), redundant features (II, part of the weakly relevant features), weakly relevant but non-redundant features (III), and strongly relevant features (IV). An optimal subset should contain the features in parts III and IV. Therefore, feature selection should eliminate not only the irrelevant features but also the redundant ones [12]. In this paper, we first show that the SVM-RFE algorithm cannot remove redundant features, and then present an improved feature selection method, which further deletes some redundant features by computing the correlation coefficients of the selected feature subset. Several datasets from the UCI machine learning repository are used to show that the proposed method is efficient.

2 Some Related Work

Recursive feature elimination (RFE) is a method which performs backward feature elimination: it starts with all features and then removes irrelevant features according to a ranking criterion until a stop criterion is satisfied. SVM-RFE is an application of RFE that uses SVM criteria for ranking. The stop criterion is either to find the m features which lead to the largest margin of class separation, or to stop when the empirical risk begins to go down [7]. In what follows, the latter stop criterion is used.

Consider the following two-category classification problem:

(x_1, y_1), (x_2, y_2), ..., (x_l, y_l),   x_i ∈ R^n,  y_i ∈ {−1, +1},

where x_i is a training example and y_i its class label. First, x is mapped into a high-dimensional space via a function φ; then, to construct the optimal hyperplane (w · φ(x)) + b = 0, we have to solve:

min φ(w, ε) = (1/2)(w · w) + C Σ_{i=1}^{l} ε_i                    (1)

s.t.  y_i(w · φ(x_i) + b) − 1 + ε_i ≥ 0,   ε_i ≥ 0,   i = 1, ..., l


where ε_i is a slack variable and the constant C determines the trade-off between margin maximization, governed by (1/2)(w · w), and training error minimization, governed by Σ_{i=1}^{l} ε_i. The resulting hyperplane then generalizes well according to structural risk minimization (SRM). From the Karush-Kuhn-Tucker conditions of optimization theory, optimization problem (1) is equivalent to

min (1/2) Σ_{i,j=1}^{l} a_i a_j y_i y_j K(x_i, x_j) − Σ_{i=1}^{l} a_i                    (2)

s.t.  Σ_{i=1}^{l} a_i y_i = 0,   0 ≤ a_i ≤ C,   i = 1, ..., l

where K(x_i, x_j) = φ(x_i) · φ(x_j) is an inner product in the feature space, which allows the data points to be mapped into the feature space without computing φ(x) explicitly. Through this kernel trick, SVMs can easily deal with classification in the nonlinear case. We then obtain the following decision function:

f(x, w, b) = sgn(w · φ(x) + b) = sgn( Σ_{i=1}^{l} a_i y_i K(x_i, x) + b )                    (3)

where a is the Lagrange multiplier vector and w = Σ_{SVs} y_i a_i x_i is the weight vector; ‖w‖² is a measure of predictive ability, being inversely related to the margin.
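For a linear kernel, both w and the per-feature scores w_t² used below can be recovered directly from the dual coefficients. The following minimal sketch is ours, not the authors'; it assumes scikit-learn and its breast-cancer data purely for illustration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ holds a_i * y_i for the support vectors; all other samples have
# a_i = 0, so summing over support vectors equals the sum over all samples.
w = clf.dual_coef_ @ clf.support_vectors_        # w = sum_i a_i y_i x_i
assert np.allclose(w, clf.coef_)                 # matches the fitted weight vector

ranking_scores = (w ** 2).ravel()                # the w_t^2 scores used by SVM-RFE
print(np.argsort(ranking_scores)[::-1][:5])      # indices of the five top-ranked features
```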

In order to remove the irrelevant features, the SVM-RFE algorithm uses the squared weights w_t² as the ranking criterion at each step. It starts with all the features and removes features based on this criterion until the empirical risk begins to go down [13]. This is done by the following iterative procedure (a code sketch is given after the list):

1) Train the SVM, obtain the original empirical risk, and compute the weight of every feature, w(t) = Σ_{SVs} y_i a_i x_i(t), where x_i(t) is the value of the t-th feature of the i-th sample;

2) Rank the features by the values of w_t²;

3) Remove the feature with the smallest value of w_t², retrain the SVM on the reduced feature set, and obtain the new empirical risk and the new feature weights;

4) If the new empirical risk is equal to or larger than the original one, go to step 2); otherwise stop;

5) The current feature subset is the selected subset.
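A hedged sketch of this procedure is given below. It is not the authors' implementation; it reads the stop rule as "keep eliminating while the training (empirical) error does not rise", which is one plausible interpretation of steps 1) to 5), and it assumes a linear-kernel SVC from scikit-learn:

```python
import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, C=1.0):
    """Backward elimination driven by the squared weights of a linear SVM."""
    remaining = list(range(X.shape[1]))
    clf = SVC(kernel="linear", C=C).fit(X, y)
    err = np.mean(clf.predict(X) != y)                 # empirical risk on the full feature set
    while len(remaining) > 1:
        w2 = (clf.coef_ ** 2).ravel()
        candidate = remaining.copy()
        candidate.pop(int(np.argmin(w2)))              # drop the feature with smallest w_t^2
        new_clf = SVC(kernel="linear", C=C).fit(X[:, candidate], y)
        new_err = np.mean(new_clf.predict(X[:, candidate]) != y)
        if new_err > err:                              # empirical risk rises: stop eliminating
            break
        remaining, clf, err = candidate, new_clf, new_err
    return remaining                                   # column indices of the selected subset
```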

The SVM-RFE algorithm was originally designed to solve two-class problems; multi-class versions have been proposed in [9,10].

In short, RFE computes a feature ranking using w_t², which is a measure of predictive ability; even the features with small values of w_t² still have some effect on the SVM decision.

The ranking criterion above measures the relevance between features and the decision, so by removing the attributes with small weights, some irrelevant information can be deleted. The question, however, is whether this technique can remove the redundant features from the data set.


As w = Σ_{SVs} y_i a_i x_i and the Lagrange multipliers a_i of the samples which are not support vectors are equal to 0, we have w = Σ_{SVs} y_i a_i x_i = Σ_{i=1}^{l} y_i a_i x_i, where l is the number of training samples. This can be expanded into the following matrix form:

[w_1, w_2, ..., w_f] = [a_1 y_1, a_2 y_2, ..., a_l y_l] · X,

where X is the l × f data matrix whose entry x_ij is the value of the j-th feature of the i-th sample, and f is the number of features.

Let us discuss the relation between the weights and the correlation among the attributes. Without loss of generality, consider two arbitrary weights w_m and w_n:

w_m = a_1 y_1 x_1m + a_2 y_2 x_2m + ··· + a_l y_l x_lm
w_n = a_1 y_1 x_1n + a_2 y_2 x_2n + ··· + a_l y_l x_ln

where f_m = [x_1m, x_2m, ···, x_lm] and f_n = [x_1n, x_2n, ···, x_ln] are two features of the data. If f_m = k·f_n + b, then x_im = k·x_in + b for every i, and hence w_m = k·w_n + b Σ_{i=1}^{l} a_i y_i = k·w_n, since the Lagrange multipliers satisfy Σ_{i=1}^{l} a_i y_i = 0.

The above analysis shows that the weights are proportional whenever the features are linearly correlated. In particular, for k = 1 and b = 0 we have w_m = w_n. As we know, f_m and f_n are then redundant and one of them should be removed from the training data. However, if one of the features has a large weight, the redundant feature will also get a large weight, and according to the recursive feature elimination algorithm only the features with small weights can be deleted.
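This proportionality is easy to verify numerically. The sketch below is our illustration rather than an experiment from the paper: an affine copy of one attribute is appended to a public data set, a linear SVM is trained, and the two weights are compared. The constants k and b, the data set, and the use of scikit-learn are all assumptions made only for the demonstration:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)   # normalize attributes to [-1, 1]

k, b = 2.5, 0.3                                   # arbitrary constants for the affine copy
X_aug = np.hstack([X, k * X[:, [0]] + b])         # append f_m = k * f_n + b, with f_n = feature 0

w = SVC(kernel="linear", C=1.0).fit(X_aug, y).coef_.ravel()
print(w[-1], k * w[0])                            # the two values nearly coincide
assert np.allclose(w[-1], k * w[0], rtol=1e-4, atol=1e-6)
```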

Certainly, the situation above is the ideal one. In the real world there are few data sets with exactly linearly correlated features; however, there are many data sets in which the correlation coefficients among features are above 0.9, sometimes as high as 0.99. Such features have almost the same values or are nearly proportional, with only very small differences. In this case the two features will be removed or retained simultaneously if SVM-RFE is used. Therefore, linearly redundant features remain in the feature subset selected by SVM-RFE.

3 The Improved Algorithm

From the analysis of SVM-RFE in Section 2, we know that it only considers the relation between features and the decision, while it does not take the relations among features into consideration. In this section, we introduce an improved algorithm based on the correlation coefficient matrix to further delete redundant features selected by SVM-RFE.

In order to remove the redundant features which are linearly correlated with each other, we need a measure to compute the relation between features. The correlation coefficient, information gain, and the Gini index have been used in the literature to calculate the similarity among features [14]. In this paper, the correlation coefficient is chosen. Suppose that the feature subset selected with RFE is [f_1, f_2, ···, f_r]. The similarity between features f_i and f_j is computed with the following formula:

ρ_ij = [ Σ_{m=1}^{l} (f_im − f̄_i)(f_jm − f̄_j) ] / [ Σ_{m=1}^{l} (f_im − f̄_i)² · Σ_{m=1}^{l} (f_jm − f̄_j)² ]^{1/2}                    (4)
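Formula (4) is the ordinary Pearson correlation coefficient between two feature columns. The short snippet below (our own, with arbitrary random data) checks the explicit sums against numpy's built-in computation:

```python
import numpy as np

def rho(fi, fj):
    """Correlation coefficient of formula (4) for two feature columns."""
    di, dj = fi - fi.mean(), fj - fj.mean()
    return np.sum(di * dj) / np.sqrt(np.sum(di ** 2) * np.sum(dj ** 2))

rng = np.random.default_rng(0)
fi, fj = rng.normal(size=100), rng.normal(size=100)
assert np.isclose(rho(fi, fj), np.corrcoef(fi, fj)[0, 1])
```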

where f̄_i is the mean value of the i-th feature and f_im is the value of the m-th sample on the i-th feature. The following correlation coefficient matrix of the feature subset is then obtained:

ρ = [ ρ_11  ρ_12  ...  ρ_1r ]
    [ ρ_21  ρ_22  ...  ρ_2r ]
    [  ...   ...  ...   ... ]
    [ ρ_r1  ρ_r2  ...  ρ_rr ]

We have ρ_ii = 1, ρ_ij = ρ_ji and |ρ_ij| ≤ 1.

The diagonal reflects the autocorrelation of each feature, while the off-diagonal elements reflect the degrees of cross-correlation, i.e. the overlapping information among the features. The aim is to find a feature subset which keeps the useful information and removes the superfluous part.

Using the correlation coefficients of the selected feature subset, we remove the redundant features. Considering that the features with large weights are important for the decision function, we take the following steps [11] to remove the redundant features (a code sketch follows the list):

1) Input the original feature set F = [f_1, f_2, ···, f_K];

2) Perform the SVM-RFE method on F and obtain a feature subset F_1 = [f_1, f_2, ···, f_k1], where k1 < K;

3) Sort the features of F_1 by the values of w_t² in descending order;

4) Find the features whose correlation coefficients with the first (most important) feature are larger than λ and remove them, where λ is a prespecified threshold; this yields a new feature subset F_2 = [f_1, f_2, ···, f_k2];

5) Find the features whose correlation coefficients with the 2nd feature of F_2 are larger than λ and remove them; this yields a new feature subset F_3 = [f_1, f_2, ···, f_k3];

6) Repeat the above process until the last feature is reached.
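The following is a minimal sketch of steps 3) to 6), assuming that absolute correlation values are compared with the threshold λ; the function name, the numpy-based implementation, and the default threshold are our choices, not the paper's:

```python
import numpy as np

def remove_redundant(X_sel, w2, lam=0.9):
    """X_sel: samples x selected features; w2: squared SVM weights of those features."""
    order = np.argsort(w2)[::-1]                        # most important feature first
    corr = np.abs(np.corrcoef(X_sel, rowvar=False))     # pairwise |correlation| matrix
    kept = []
    for idx in order:
        # keep idx only if it is not strongly correlated with an already-kept feature
        if all(corr[idx, j] <= lam for j in kept):
            kept.append(idx)
    return sorted(kept)                                 # positions within the SVM-RFE subset
```

In use, X_sel would hold the columns retained by SVM-RFE and w2 the corresponding squared weights of the retrained linear SVM.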

4 Empirical Analysis

In order to compare the proposed method with SVM-RFE, we choose 6 data sets from the UCI data repository [15]. The description of the data sets is shown in Table 1. For all the data sets, we first normalize all attributes into the interval [−1, 1].
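The paper does not state how the normalization is implemented; a minimal sketch under the assumption of per-attribute min-max scaling is:

```python
import numpy as np

def scale_to_pm1(X):
    """Per-attribute min-max scaling of a samples x features array onto [-1, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant columns
    return 2.0 * (X - lo) / span - 1.0
```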

First, let us study the correlation among the features, taking the data set WPBC as an example. There are 33 features in the original data.


Table 1. The description of the data sets

Data set   Number of instances   Number of attributes
WDBC       569                   30
WPBC       198                   33
Clean1     476                   166
Dermat     366                   34
Sonar      208                   60
Ionos      351                   34

Fig. 1. The relation between feature 2 and features 4 and 5

SVM-RFE removes 6 irrelevant features and selects 27 features. We find that the 2nd feature is highly correlated with features 4 and 5, as shown in Fig. 1. By computing the correlation coefficients, 7 redundant features are removed.

Table 2 shows the maximal correlation coefficients of the SVM-RFE feature subsets, the thresholds used when removing the redundant features, and the numbers of features in the original data, in the SVM-RFE subsets, and in the subsets without redundancy. We can see that several features of the data sets WDBC, WPBC, Clean1 and Dermat have been deleted by the proposed technique. The classification accuracies with the three kinds of feature sets are shown in Table 3.

Table 2. The maximal correlation coefficients, the thresholds, and the numbers of features for the original data, the SVM-RFE method and the new method

Data set   Max coefficient   Threshold   Number of features
                                         ALL    RFE    New
WDBC       0.9979            0.9         30     30     21
WPBC       0.9929            0.9         33     27     20
Clean1     0.9909            0.95        166    79     58
Dermat     0.9417            0.9         34     23     18
Sonar      0.9258            0.9         60     60     57
Ionos      0.7483            0.9         34     31     31


Table 3. Comparison of classification performance with different feature selection methods

Data set   ALL               SVM-RFE           New
WDBC       0.9877 ± 0.0275   0.9877 ± 0.0275   0.9912 ± 0.0124
WPBC       0.8744 ± 0.0711   0.8528 ± 0.0708   0.8694 ± 0.0817
Clean1     0.9625 ± 0.1186   0.9729 ± 0.0856   0.9667 ± 0.0775
Dermat     1.0000 ± 0.0000   0.9944 ± 0.0176   0.9944 ± 0.0176
Sonar      0.9762 ± 0.0753   0.9762 ± 0.0753   0.9619 ± 0.1048
Ionos      0.9088 ± 0.0877   0.9117 ± 0.0775   0.9117 ± 0.0775

The max coefficient is the maximal value in the correlation coefficient matrix of the subset selected by SVM-RFE. The max coefficients of 5 data sets are larger than 0.9; only one is about 0.75. This fact shows that there actually are some linearly redundant features left after SVM-RFE based feature selection.

Of the six selected data sets, the accuracy on two is higher than with SVM-RFE alone: the number of features is reduced from 30 to 21 and from 27 to 20, respectively, while the accuracy improves.
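For reference, the kind of comparison reported in Table 3 can be reproduced in outline as follows. This is our own evaluation sketch, not the authors' code: the breast-cancer data stands in for WDBC, the selected column indices are placeholders, and 10-fold cross-validated accuracy of a linear SVM is reported for each subset:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)               # stands in for the WDBC data set
subsets = {
    "ALL": list(range(X.shape[1])),
    "Selected": [0, 3, 7, 10, 20, 22, 27],                # placeholder indices, not the paper's subset
}
for name, cols in subsets.items():
    acc = cross_val_score(SVC(kernel="linear"), X[:, cols], y, cv=10)
    print(f"{name}: {acc.mean():.4f} +/- {acc.std():.4f}")
```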

5 Conclusion

Support vector machines are widely used to select informative feature subsets. In this paper we show that the traditional SVM-RFE algorithm for feature selection only considers the relation between the condition attributes and the decision, but does not analyze the relations among the condition attributes; therefore the SVM-RFE algorithm can only remove the irrelevant features from the data. We present a technique to delete the redundant information in the data based on the correlation coefficient. There are two main steps in the proposed technique: first, we remove the irrelevant features with the SVM-RFE technique; then we compute the correlation coefficients between the attributes selected in the first step and delete the redundant features from the selected subset. The proposed improved feature selection method can remove some linearly redundant features from the feature subset obtained by SVM-RFE. The experimental analysis shows that the number of features selected with the proposed technique is much smaller than with SVM-RFE, while the accuracy based on 10-fold cross validation is comparable.

References

1. Dash, M., Liu, H.: Feature Selection for Classification. Intelligent Data Analysis 1 (1997) 131–156
2. Kohavi, R., George, J.: Wrappers for Feature Subset Selection. Artificial Intelligence 97 (1997) 273–324
3. Guyon, I., Elisseeff, A.: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research 3 (2003) 1157–1182
4. Hu, Q., Yu, D., Xie, Z.: Information Preserving Hybrid Data Reduction Based on Fuzzy Rough Techniques. Pattern Recognition Letters (in press)
5. Liu, H., Yu, L., Dash, M., Motoda, H.: Active Feature Selection Using Classes. Pacific-Asia Conference on Knowledge Discovery and Data Mining (2003)
6. Guyon, I., Matic, N., Vapnik, V.: Discovering Informative Patterns and Data Cleaning. Advances in Knowledge Discovery and Data Mining (1996) 181–203
7. Guyon, I., Weston, J., Barnhill, S., Vapnik, V.: Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning 46 (2002) 389–422
8. Rakotomamonjy, A.: Variable Selection Using SVM-based Criteria. Journal of Machine Learning Research 3 (2003) 1357–1370
9. Li, G., Yang, J., Liu, G., Li, X.: Feature Selection for Multi-class Problems Using Support Vector Machines. Proceedings of the 8th Pacific Rim International Conference on Artificial Intelligence (PRICAI-04), LNCS 3157 (2004) 292–300
10. Duan, K., Rajapakse, J.C., Wang, H., Francisco, A.: Multiple SVM-RFE for Gene Selection in Cancer Classification with Expression Data. IEEE Transactions on Nanobioscience (2005) 228–234
11. Yu, L., Liu, H.: Efficient Feature Selection via Analysis of Relevance and Redundancy. Journal of Machine Learning Research 5 (2004) 1205–1224
12. Hsing, T., Liu, L., Brun, M., et al.: The Coefficient of Intrinsic Dependence (Feature Selection Using el CID). Pattern Recognition (2005) 623–636
13. Yao, K., Lu, W., Zhang, S., et al.: Feature Expansion and Feature Selection for General Pattern Recognition Problems. IEEE Int. Conf. Neural Networks and Signal Processing (2003) 29–32
14. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers (2000)
15. Blake, C., Keogh, E., Merz, C.: UCI Repository of Machine Learning Databases. Technical Report, Department of Information and Computer Science, University of California, Irvine, CA (1998)
