C.Lavanya, M.Nandihini, R.Niranjana, C.Gunavathi
Computer Science and Engineering, K.S.Rangasamy College of Technology, Tiruchengode, Tamilnadu, India
ABSTRACT
Genes are coding regions that form the necessary building blocks inside the cell and give rise to proteins, which carry out a variety of functions. However, some genes may become mutated, and such genes are responsible for the occurrence of cancer. Cancer can be discovered by closely examining samples taken from patients to identify faulty genes. A gene expression dataset usually comes with only dozens of tissues/samples but with thousands or even tens of thousands of genes/features. In this paper, we employ feature selection techniques for analyzing cancer microarray gene expression data. Feature selection is used to select the genes most likely to be cancer-related from huge microarray gene expression data, with the aim of achieving improved classification performance. This is achieved using the T-Test, Chi-Square Test, and Information Gain measures. Cancer classification using microarray data poses another major challenge because of the huge number of genes compared to the number of tissue samples: only a small number of genes in microarray data consisting of thousands of genes show strong correlation with the target phenotypes. This paper presents the Naive Bayes algorithm for the classification task. A comprehensive framework that incorporates feature selection and classification techniques is capable of successfully classifying new samples as infected or normal.
KEYWORDS
Gene Expression Data, Classification, Feature Selection Method, Naive Bayes Algorithm
INTRODUCTION |
Data mining is the computational process of analyzing large quantities of data. It uses information from past data to analyze the outcome of a particular problem or situation that may arise. Data classification is the form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical class labels, and such analysis can help us better understand the data at large. Because the class label of each training tuple is provided, this setting is called supervised learning: the learning of the classifier is supervised in that it is told to which class each training tuple belongs. In the learning step, training data are analyzed by a classification algorithm, after which test data are used to estimate the accuracy of the classification rules.
BASICS OF GENE EXPRESSION DATA |
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein-coding genes, such as rRNA or tRNA genes, the product is a structural or housekeeping RNA. Gene expression studies can also involve looking at profiles or patterns of expression of several genes; whether quantitating changes in expression levels or looking at overall patterns of expression, real-time PCR is used by most scientists performing gene expression studies. Based on the levels of gene expression, optimized genes are classified using different classifiers.
MICROARRAY DATA CLASSIFICATION
Microarray data are images, which have to be transformed into gene expression matrices in which rows represent genes, columns represent various samples such as tissues or experimental conditions, and the number in each cell characterizes the expression level of a particular gene in a particular sample. A microarray-based disease classification system takes labeled gene expression data samples and generates a classifier model that classifies new data samples into different predefined diseases. Microarray data classification is a supervised learning task that predicts the diagnostic category of a sample from its expression array phenotype.
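As a minimal sketch (not taken from the paper), such a gene expression matrix can be represented as a pandas DataFrame; the gene and sample names below are purely illustrative:

```python
import pandas as pd

# Toy gene expression matrix: rows are genes, columns are samples,
# and each cell holds the expression level of that gene in that sample.
expression = pd.DataFrame(
    [[7.2, 6.8, 2.1, 1.9],
     [1.4, 1.6, 5.9, 6.3],
     [3.3, 3.1, 3.4, 3.0]],
    index=["GENE_A", "GENE_B", "GENE_C"],                    # genes
    columns=["tumor_1", "tumor_2", "normal_1", "normal_2"],  # samples
)

# Class labels for the samples (the predefined disease categories).
labels = pd.Series(["tumor", "tumor", "normal", "normal"],
                   index=expression.columns)

print(expression)
```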
GENE EXPRESSION DATA SETS |
The datasets considered in the simulation are Iris, Yeast, the Spellman dataset, and Breast Cancer. All these datasets are publicly available and are two-class gene expression datasets.
GENE SELECTION |
In this study, a number of gene selection methods have been introduced to select informative genes. The genes of the different datasets are classified using classifiers such as SVM and Naive Bayes, with optimized genes obtained through feature selection methods such as T-Test, Information Gain, and Mutual Information.
FEATURE SELECTION METHODS |
The importance of feature selection methods lies in selecting informative genes prior to the classification of microarray data for cancer prediction and diagnosis. Feature selection removes irrelevant and redundant features to improve classification accuracy. Feature selection methods can be categorized into filter, wrapper, and embedded (or hybrid) approaches. The filter approach selects features without involving any data-mining algorithm; filter algorithms are evaluated against four different criteria, namely distance, information, dependency, and consistency. The wrapper approach selects a feature subset based on the classifier and ranks feature subsets using predictive accuracy or cluster goodness; it is more computationally expensive than the filter model. The sketch below contrasts the two approaches.
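The following sketch illustrates the filter/wrapper distinction on synthetic data, assuming scikit-learn; SelectKBest with an ANOVA F-score stands in for a generic filter, and recursive feature elimination (RFE) around a linear SVM stands in for a wrapper. Neither is prescribed by this paper; they simply make the distinction concrete:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.svm import SVC

# Synthetic stand-in for a gene expression dataset:
# few samples, many features, only a handful informative.
X, y = make_classification(n_samples=60, n_features=1000,
                           n_informative=10, random_state=0)

# Filter approach: rank features by a statistic (here the ANOVA F-score),
# independently of any classifier.
filt = SelectKBest(f_classif, k=50).fit(X, y)
X_filtered = filt.transform(X)

# Wrapper approach: recursively eliminate features, using the classifier
# itself to judge candidate subsets (more computationally expensive).
wrapper = RFE(SVC(kernel="linear"), n_features_to_select=50, step=100).fit(X, y)
X_wrapped = wrapper.transform(X)

print(X_filtered.shape, X_wrapped.shape)  # (60, 50) (60, 50)
```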
T-TEST
To measure the relevance of a gene, the t-test is widely used, assuming that there are two classes of samples in a gene expression dataset. When there are multiple classes of samples, the t-test is computed for one class versus all the other classes. It compares the actual difference between two means in relation to the variation in the data. T-test values are calculated using the standard two-sample statistic

t = (μ1 − μ2) / √(σ1²/n1 + σ2²/n2)

where μ1 and μ2 are the mean expression levels of the gene in the two classes, σ1² and σ2² are the corresponding variances, and n1 and n2 are the numbers of samples in each class.
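As an illustration, a per-gene t-statistic of this kind could be computed as follows; the sketch assumes SciPy and uses synthetic data in place of a real expression matrix:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy expression matrix: 200 genes (rows) x 20 samples (columns),
# first 10 samples in class 1, last 10 in class 2.
X = rng.normal(size=(200, 20))
X[:5, :10] += 2.0  # make the first 5 genes differentially expressed

class1, class2 = X[:, :10], X[:, 10:]

# Unequal-variance (Welch) t-statistic per gene, as in the formula above.
t, p = stats.ttest_ind(class1, class2, axis=1, equal_var=False)

# Rank genes by absolute t-statistic; larger |t| means more relevant.
top_genes = np.argsort(-np.abs(t))[:5]
print(top_genes)
```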
CHI-SQUARE TEST |
Chi-square is a statistical test commonly used to compare observed data with the data we would expect to obtain according to a specific hypothesis. It is designed specifically for tests having at least two discrete outcomes (such as a normal and a mutated gene). The chi-square test always tests what scientists call the null hypothesis, which states that there is no significant difference between the expected and observed results. The formula for calculating chi-square (χ²) is:

χ² = Σ (Oi − Ei)² / Ei

where Oi is the observed frequency and Ei is the expected frequency for each outcome i.
A chi-square test is often used to measure the goodness of fit between an observed and an expected distribution of values. Knowing how to perform a chi-square test can be useful for testing probable against expected outcomes, fitting points to a curve, or testing a statistical hypothesis.
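For illustration, the following sketch computes the chi-square statistic for a single hypothetical gene whose status (normal/mutated) is cross-tabulated against the class label; it assumes SciPy, and the counts are invented:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 contingency table for one gene:
# rows = gene status (normal / mutated), columns = class (healthy / cancer).
observed = np.array([[30, 10],
                     [ 5, 25]])

# chi2_contingency derives the expected counts under the null hypothesis
# of independence and evaluates  chi2 = sum((O - E)^2 / E).
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```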
INFORMATION GAIN |
Information gain of a term measures the number of bits of information obtained for category prediction by the presence or absence of the term in a document. Information gain measures the decrease in entropy when the value of the feature is known. This is the application of a more general technique, the measurement of informational entropy, to the problem of deciding how important a given feature is. Informational entropy, when measured using Shannon entropy, is notionally the number of bits of data it would take to encode a given piece of information. The more space a piece of information takes to encode, the more entropy it has. Intuitively, this makes sense because a random string has maximum entropy and cannot be compressed, while a highly ordered string can be written with a brief description of the string's information. In the context of classification, the distribution of instances among classes is the information in question. If the instances are randomly assigned among the classes, the number of bits necessary to encode this class distribution is high, because each instance would need to be enumerated.
On the other hand, if all the instances are in a single class, the entropy is lower, because the bit string would simply say "All instances save for these few are in the first class." Therefore, a function measuring entropy must increase when the class distribution gets more spread out, and it must be applicable recursively to permit finding the entropy of subsets of the data. The following formulas satisfy both of these requirements:
IG(X) = H(D) − H(D|X), where

H(D) = − Σ(i=1..l) (ni/n) log(ni/n)

H(D|X) = Σ(j) (|Xj|/n) H(D | X = Xj)

Here n is the total number of instances, ni is the number of instances in class i, and |Xj| is the number of instances for which the feature X takes its j-th value.
For a binary classification problem, the entropy of a set S reduces to H(S) = − p+(S) log p+(S) − p−(S) log p−(S), where p±(S) is the probability of a training example in the set S being of the positive/negative class. We discretized continuous features using information-theoretic binning.
For each dataset we selected the subset of features with non-zero information gain. Information gain can be used only on discrete features, and hence numeric features must be discretized prior to computing information gain; entropy-based discretization is generally used for gene expression data. Similar to the t-statistic, features are selected based on larger values of information gain.
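As a small self-contained sketch (not the paper's own implementation), information gain for an already-discretized feature could be computed as follows:

```python
import numpy as np

def entropy(labels):
    """H(D) = -sum (ni/n) log2(ni/n) over the class counts."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG(X) = H(D) - H(D|X) for a discrete feature X."""
    h_d = entropy(labels)
    h_d_given_x = 0.0
    for v in np.unique(feature):
        subset = labels[feature == v]
        h_d_given_x += (len(subset) / len(labels)) * entropy(subset)
    return h_d - h_d_given_x

# Toy example: one gene discretized into low/high expression
# (e.g. by entropy-based or median binning), and binary class labels.
gene = np.array(["high", "high", "high", "low", "low", "low"])
y    = np.array(["cancer", "cancer", "cancer", "normal", "normal", "cancer"])
print(f"IG = {information_gain(gene, y):.3f}")
```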
CLASSIFIERS |
NAIVE BAYES CLASSIFIER |
A Naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model." In simple terms, a Naive Bayes classifier assumes that the presence or absence of a particular feature is unrelated to the presence or absence of any other feature, given the class variable. An advantage of Naive Bayes is that it requires only a small amount of training data to estimate the parameters (the means and variances of the variables) necessary for classification. Because the variables are assumed independent, only the variances of the variables for each class need to be determined, and not the entire covariance matrix.
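As a sketch of how the framework described above might be assembled (assuming scikit-learn; the synthetic data merely stands in for a real microarray dataset), a filter-based gene selector can be combined with Gaussian Naive Bayes like this:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for microarray data: 60 samples, 2000 features.
X, y = make_classification(n_samples=60, n_features=2000,
                           n_informative=15, random_state=1)

# Pipeline: filter-based gene selection followed by Gaussian Naive Bayes,
# which estimates only a per-class mean and variance for each feature.
model = make_pipeline(SelectKBest(f_classif, k=30), GaussianNB())

scores = cross_val_score(model, X, y, cv=5)
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Placing the selector inside the pipeline means gene selection is re-fitted within each cross-validation fold, which avoids selection bias when estimating accuracy.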
DISCUSSION AND CONCLUSION |
We showed how combining a filtering technique for feature selection with SVM leads to substantial improvement in the generalization performance of the SVM models on the five classification datasets of the competition. Another lesson learned from our submission is that there is no single best feature selection technique across all five datasets; we experimented with different feature selection techniques and picked the best one for each dataset. Of course, an open question still remains: why exactly did these techniques work well together with Support Vector Machines? A theoretical foundation for the latter is an interesting topic for future work.