Reconstruction of Gene Regulatory Network to
Identify Prognostic Molecular Markers of the
Reactive Stroma of Breast and Prostate Cancer
Using Information Theoretic Approach

Prof. Shanthi Mahesh; Dr. Neha Mangla; Pooja V; Suhas A Bhyratae

Department of ISE, Atria Institute of Technology, Bengaluru, Karnataka, India

Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

Gene regulation refers to a number of sequential processes, the best known and understood being transcription and translation, which control the level of a gene’s expression and ultimately result in a specific quantity of a target protein. Reconstruction of gene regulatory networks is the process of analyzing the steps involved in gene regulation using computational techniques. In this paper, a cancer-specific gene regulatory network is reconstructed using an information-theoretic approach, mutual information (MI). The microarray dataset used contains 12 samples each for breast cancer and prostate cancer, covering both normal and tumor cells. The data are preprocessed, normalized and filtered using the t-test; MI is then computed on the filtered genes to determine the gene-gene interactions. Based on the interactions obtained, 10 different networks are constructed and statistical analysis is performed on them. Finally, the inferred results are validated against available biological databases and the literature.

KEYWORDS

gene regulatory network, microarray, reactive stroma of breast and prostate cancer, mutual information

I. INTRODUCTION

Malignant cancer is one of the most widespread diseases in today’s world and a major contributor to human mortality. Cancerous cells divide and grow uncontrollably, forming tumors and invading nearby parts of the body. Various significant genes are responsible for the genesis of different tumors. Radiotherapy, chemotherapy and surgery are the available ways of treating cancer. Identifying the genes that lead to cancer can therefore help to control its growth at an early stage.

Reconstruction of gene regulatory networks (GRNs) explicitly represents the causality of developmental or regulatory processes. It has become a challenging computational problem in understanding the complex regulatory mechanisms of cellular systems. Identifying and understanding GRNs is an important problem in molecular biology. Microarray technologies have produced tremendous amounts of gene expression data, which provide an opportunity to understand the underlying regulatory mechanisms.

Recently, information-theoretic approaches have increasingly been used for reconstructing GRNs. Several mutual information (MI) based methods, such as ARACNE, CLR and minet, have been successfully applied to infer GRNs. In general, these approaches start by computing the pair-wise MIs between all possible pairs of genes, resulting in an MI matrix. The MI matrix is then manipulated to identify the regulatory relationships. MI provides a natural generalization of correlation since it measures non-linear dependency, and it therefore attracts much attention. Another advantage of these methods is their ability to deal with thousands of variables (genes) in the presence of a limited number of samples. Despite these advantages, MI-based methods only work when investigating pair-wise regulations in a GRN.

The inference of gene networks from high-throughput data is a very complex and rapidly expanding field, triggered by the invention of new measurement technologies. In order to provide a systematic discussion of the underlying principles, we limit the discussion to observational steady-state gene expression data and consider correlation-based and mutual-information-based inference methods. These methods are representative of linear and non-linear methods, respectively. Principally, there are three fundamental levels of a molecular system as given by the central dogma of molecular biology (Crick, 1970), namely the DNA, mRNA and protein levels. Figure 1 shows an overview of the central dogma. The central dogma of molecular biology describes the two-step process, transcription and translation, by which the information in genes flows into proteins:

DNA->RNA->Protein

In this work, we propose a relevance network model for gene regulatory network inference which employs mutual information to determine the interactions between genes. For this purpose, we propose a mutual information estimator based on adaptive partitioning, which allows us to condition on both discrete and continuous random variables. We provide experimental results demonstrating that the proposed regulatory network inference algorithm finds the high-degree genes and predicts the genes responsible for both breast cancer and prostate cancer. The results are validated using a biological database.
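As an illustration only, the relevance-network idea can be sketched in Python. The sketch is not the estimator proposed in this work: it replaces the adaptive-partitioning estimator with a simple equal-width histogram (plug-in) estimate, and the gene names, the bin count and the threshold value are hypothetical.

```python
import math
from collections import Counter

def discretize(values, bins=3):
    """Equal-width binning of one expression profile into integer bin labels."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    # The maximum value would fall into bin index `bins`, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=3):
    """Plug-in MI estimate (in bits) between two expression profiles."""
    dx, dy = discretize(x, bins), discretize(y, bins)
    n = len(dx)
    px, py, pxy = Counter(dx), Counter(dy), Counter(zip(dx, dy))
    mi = 0.0
    for (a, b), c in pxy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi

def relevance_network(profiles, threshold, bins=3):
    """Return (gene, gene, MI) for every pair whose MI reaches the threshold."""
    genes = sorted(profiles)
    edges = []
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            mi = mutual_information(profiles[g], profiles[h], bins)
            if mi >= threshold:
                edges.append((g, h, mi))
    return edges

# Hypothetical toy profiles (4 samples per gene), not taken from the dataset.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, 4.0, 1.0, 4.0],
}
for g, h, mi in relevance_network(profiles, threshold=0.5, bins=2):
    print(g, h, round(mi, 3))
```

With so few samples a histogram estimate is strongly biased, which is one reason an adaptive partitioning scheme may be preferred in practice.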

II. LITERATURE SURVEY

The reconstruction or ‘reverse engineering’ of GRNs, which aims to find the underlying network of gene–gene interactions from measurements of gene expression, is considered one of the most important goals in systems biology [2,3]. To this end, the Dialogue for Reverse Engineering Assessments and Methods (DREAM) program was established to encourage researchers to develop new, efficient computational methods for inferring robust GRNs [4]. A variety of approaches have been proposed to infer GRNs from gene expression data [5,7], such as discrete models based on Boolean networks and Bayesian networks [8], differential equations [9-12], regression methods [13,14] and linear programming [15]. Although many popular network inference algorithms have been investigated [16,5], there is still considerable room for current models to be improved [20]. Recently, information-theoretic approaches have increasingly been used for reconstructing GRNs. Several mutual information (MI) based methods, such as ARACNE, CLR [23] and minet [21], have been successfully applied to infer GRNs. In general, these approaches start by computing the pair-wise MIs between all possible pairs of genes, resulting in an MI matrix. The MI matrix is then manipulated to identify the regulatory relationships. MI provides a natural generalization of correlation since it measures non-linear dependency (which is common in biology), and it therefore attracts much attention. Another advantage of these methods is their ability to deal with thousands of variables (genes) in the presence of a limited number of samples. Despite these advantages, MI-based methods only work when investigating pair-wise regulations in a GRN.

III. ABBREVIATIONS

DNA: Deoxyribonucleic acid
RNA: Ribonucleic acid
NCBI: National Center for Biotechnology Information
GEO: Gene Expression Omnibus
TMI: Threshold Mutual Information
Microarray: Collection of microscopic DNA spots attached to a solid surface
MI: Mutual Information
GRN: Gene Regulatory Network

IV. DATASET DESCRIPTION

The performance of our approach is evaluated experimentally on the reactive stroma of breast and prostate cancer dataset. The full dataset can be downloaded from the Gene Expression Omnibus website: http://www.ncbi.nlm.nih.gov/geo/GSE26910. The dataset has information on 54675 genes under 24 different experimental conditions.

V. METHODOLOGY

The algorithm presented in this approach is shown in Figure 2.

A. Preprocessing and Normalization

The dataset is quite large, with 54675 genes, and much of the information corresponds to genes that do not show any interesting changes during the experiment. During preprocessing, genes that do not show any changes during the experiment are removed, which reduces the size of the dataset. The gene list also contains several spots marked as ‘EMPTY’. These are empty spots on the array and can contribute noise. The function isnan() is used to identify the genes with missing data, and indexing commands are used to remove them.
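The preprocessing steps above can be sketched as follows. The text names isnan() but not the environment it was called from, so this is a Python rendering under assumptions: the function names, the list-of-rows data layout and the range-based criterion for "no change" (with its min_range parameter) are hypothetical.

```python
import math

def remove_empty_and_missing(gene_ids, expression):
    """Drop spots labelled 'EMPTY' and genes with any missing (NaN) value."""
    kept_ids, kept_rows = [], []
    for gene_id, row in zip(gene_ids, expression):
        if gene_id.strip().upper() == "EMPTY":
            continue  # empty spot on the array
        if any(math.isnan(v) for v in row):
            continue  # gene with missing data
        kept_ids.append(gene_id)
        kept_rows.append(row)
    return kept_ids, kept_rows

def remove_flat_profiles(gene_ids, expression, min_range=0.0):
    """Drop genes whose expression range (max - min) does not exceed min_range.

    The range criterion is an assumption; the text does not say how
    'no change during the experiment' was measured.
    """
    kept = [(g, row) for g, row in zip(gene_ids, expression)
            if max(row) - min(row) > min_range]
    return [g for g, _ in kept], [row for _, row in kept]

# Hypothetical toy input: 4 spots measured under 3 conditions.
ids = ["geneA", "EMPTY", "geneC", "geneD"]
data = [
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],
    [float("nan"), 1.0, 2.0],
    [5.0, 5.0, 5.0],
]
ids, data = remove_empty_and_missing(ids, data)
ids, data = remove_flat_profiles(ids, data)
print(ids)
```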

B. Filtering

A t-test is applied between the normal and tumor cell data to obtain the most significant genes. The test is computed for unpaired data, under both the equal-variance and the unequal-variance assumption.
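The formula itself is not reproduced in this text. As a hedged sketch, the standard two-sample statistics are assumed here: the pooled-variance (Student) form for equal variances and the Welch form, with Welch-Satterthwaite degrees of freedom, for unequal variances. The function name and the toy values are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(normal, tumor, equal_var=False):
    """Return (t, degrees of freedom) for two unpaired samples.

    equal_var=True  -> pooled-variance (Student) t-test
    equal_var=False -> Welch t-test with Welch-Satterthwaite degrees of freedom
    Both samples need at least two values and must not both have zero variance.
    """
    n1, n2 = len(normal), len(tumor)
    m1, m2 = mean(normal), mean(tumor)
    v1, v2 = variance(normal), variance(tumor)  # sample variances (n - 1 denominator)
    if equal_var:
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
        se = math.sqrt(pooled * (1 / n1 + 1 / n2))
        df = n1 + n2 - 2
    else:
        se = math.sqrt(v1 / n1 + v2 / n2)
        df = (v1 / n1 + v2 / n2) ** 2 / (
            (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
        )
    return (m1 - m2) / se, df

# Hypothetical expression values of one gene in normal and tumor samples.
t, df = t_statistic([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], equal_var=True)
print(round(t, 4), df)
```

Turning t into a p-value needs the t-distribution CDF, which the standard library does not provide; scipy.stats.ttest_ind with its equal_var argument is one existing option. Genes would then be kept when the p-value falls below a chosen significance level, a cut-off this text does not state.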

D. Mapping

The gene interaction matrix obtained in the previous step is mapped onto the gene names.
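A minimal sketch of this mapping step, assuming the interaction matrix is a square, symmetric 0/1 matrix whose row and column order matches the list of gene names. The function name and the toy values are hypothetical.

```python
def map_interactions(gene_names, interaction_matrix):
    """Turn a square 0/1 interaction matrix into a list of named gene pairs."""
    edges = []
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle: each undirected pair once
            if interaction_matrix[i][j]:
                edges.append((gene_names[i], gene_names[j]))
    return edges

# Hypothetical toy matrix for three placeholder genes.
names = ["geneA", "geneB", "geneC"]
matrix = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
print(map_interactions(names, matrix))
```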

E. Gene Regulatory Network

The resulting gene-gene interaction matrix is imported into the network visualization and analysis tool Cytoscape. Cytoscape is most powerful when used in conjunction with the large databases of protein-protein, protein-DNA and genetic interactions that are increasingly available for humans and model organisms. It allows the visual integration of the network with expression profiles, phenotypes and other molecular state information, and links the network to databases of functional annotations. The interacting genes are selected to obtain networks of interacting genes. This makes it easy to identify the genes with the highest degree. Such genes, called highly connected genes, are said to have a higher impact in causing cancer.
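The degree ranking and the hand-off to Cytoscape can be sketched as below. The sketch assumes the interactions are available as a list of gene pairs and writes them in Cytoscape's Simple Interaction Format (SIF), one "source, interaction type, target" line per edge. The interaction label "mi", the function names and the toy edges are hypothetical.

```python
from collections import Counter

def node_degrees(edges):
    """Count how many interactions each gene takes part in."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return degree

def to_sif_lines(edges, interaction="mi"):
    """Format each edge as a tab-separated SIF line: source, type, target."""
    return [f"{a}\t{interaction}\t{b}" for a, b in edges]

# Hypothetical toy edge list with placeholder gene names.
edges = [
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneB", "geneC"),
    ("geneA", "geneD"),
]
degrees = node_degrees(edges)
print(degrees.most_common(1))          # candidate highly connected gene
print("\n".join(to_sif_lines(edges)))  # text that could be saved as a .sif file
```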

F. Identify prognostic molecular markers

The highly connected genes are used in the identification of the prognostic molecular markers. This analysis is done using G2SBC (Genes-to-Systems Breast Cancer Database). The G2SBC is a bioinformatics resource that collects and integrates data about genes. From this analysis it is found that the genes GOLM1, CSMD2, MICAL2, TMEM167A, TBC1D2, POSTN, AEBP1, ZNF668, ZFAND3, TXNL1, VOPP1 and TRIP13 are common to both breast and prostate cancer [22].

VI. EXPERIMENTS AND RESULTS

Experiments were conducted on the reactive stroma of breast and prostate cancer dataset, with 54675 genes under 24 different experimental conditions. The regulatory network for 30 genes with 306 interactions is shown in Figure 3. The statistical analysis of the GRN for 30 genes is shown in Table 1.

1. Neighborhood connectivity: The connectivity of a node is the number of its neighbors. The neighborhood connectivity of a node n is defined as the average connectivity of all neighbors of n (Figure 4). The neighborhood connectivity distribution gives the average of the neighborhood connectivities of all nodes n with k neighbors, for k = 0, 1, ….

2. Closeness centrality: This is the degree to which a node is close to all other nodes. Figure 5 shows the closeness centrality plotted against the number of neighbors. It is calculated from the shortest paths between the node and all other nodes.
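Both measures can be illustrated with a short sketch. The equation for closeness centrality is not reproduced in this text, so a common definition is assumed: the reciprocal of the average shortest path length from a node to every node it can reach. The toy network and all function names are hypothetical.

```python
from collections import deque

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (gene, gene) pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def neighborhood_connectivity(adj, node):
    """Average number of neighbors over all neighbors of `node`."""
    neighbors = adj.get(node, set())
    if not neighbors:
        return 0.0
    return sum(len(adj[m]) for m in neighbors) / len(neighbors)

def connectivity_distribution(adj):
    """Map k to the mean neighborhood connectivity of nodes with k neighbors."""
    by_k = {}
    for node, neighbors in adj.items():
        by_k.setdefault(len(neighbors), []).append(
            neighborhood_connectivity(adj, node))
    return {k: sum(v) / len(v) for k, v in sorted(by_k.items())}

def closeness_centrality(adj, node):
    """Reciprocal of the mean shortest path length to all reachable nodes."""
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        current = queue.popleft()
        for nb in adj.get(current, ()):
            if nb not in dist:
                dist[nb] = dist[current] + 1
                queue.append(nb)
    total = sum(dist.values())
    if total == 0:
        return 0.0  # isolated node: no other node is reachable
    return (len(dist) - 1) / total

# Hypothetical toy network with placeholder gene names.
adj = build_adjacency([
    ("geneA", "geneB"),
    ("geneA", "geneC"),
    ("geneA", "geneD"),
    ("geneB", "geneC"),
])
print(neighborhood_connectivity(adj, "geneD"))
print(connectivity_distribution(adj))
print(closeness_centrality(adj, "geneA"))
```

Under this definition a node directly connected to every other node has closeness 1, and the value falls towards 0 as the node moves further from the rest of the network.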

VII. CONCLUSION

In this work, a novel approach combining a filtering function, mutual information and a gene-gene interaction function has been applied to cancer data to compute the regulatory relationships between gene pairs and to analyze the reconstructed network statistically. The microarray data considered here consist of 54675 genes, with 12 samples each for breast and prostate cancer, covering both tumor and normal cells. Our study yields six major outcomes: first, differentially expressed genes in the dataset are identified; second, the interactions between differentially expressed genes are identified; third, the genes regulating most of the other genes are identified; fourth, statistical analysis of the reconstructed network reveals a large number of interactions in the data; fifth, the highly connected genes are obtained for 10 different networks; and sixth, prognostic molecular markers in the reactive stroma of breast and prostate cancer are identified using G2SBC. From this analysis it is found that the genes GOLM1, CSMD2, MICAL2, TMEM167A, TBC1D2, POSTN, AEBP1, ZNF668, ZFAND3, TXNL1, VOPP1 and TRIP13 are common to both breast and prostate cancer. The results provide an excellent understanding of the interaction mechanisms in the breast and prostate cancer data and offer new insight for biomedical research.

References

[1] Hughes, T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
[2] Basso, K. et al. (2005) Reverse engineering of regulatory networks in human B cells. Nat. Genet., 37, 382–390.
[3] Margolin, A.A. et al. (2006a) Reverse engineering cellular networks. Nat. Protoc., 1, 663–672.
[4] Marbach, D. et al. (2010) Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl Acad. Sci. USA, 107, 6286–6291.
[5] Bansal, M. et al. (2007) How to infer gene networks from expression profiles. Mol. Syst. Biol., 3, 78.
[6] Holter, N.S. et al. (2001) Dynamic modeling of gene expression data. Proc. Natl Acad. Sci. USA, 98, 1693–1698.
[7] Tegner, J. et al. (2003) Reverse engineering gene networks: integrating genetics
[8] Kauffman, S. et al. (2003) Random Boolean network models and the yeast transcriptional network. Proc. Natl Acad. Sci. USA, 100, 14796–14799.
[9] Alter, O. et al. (2000) Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA, 97, 10101–10106.
[10] Cantone, I. et al. (2009) A yeast synthetic network for in vivo assessment of reverse engineering and modeling approaches. Cell, 137, 172–181.
[11] di Bernardo, D. et al. (2005) Chemogenomic profiling on a genome-wide scale using reverse-engineered gene networks. Nat. Biotechnol., 23, 377–383.
[12] Honkela, A. et al. (2010) Model-based method for transcription factor target identification with limited data. Proc. Natl Acad. Sci. USA, 107, 7793–7798.
[13] Gardner, T.S. et al. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, 102–105.
[14] Tibshirani, R. (1996) Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B, 58, 267–288.
[15] Wang, Y. et al. (2006) Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics, 22, 2413–2420.
[16] Altay, G. and Emmert-Streib, F. (2010) Revealing differences in gene network inference algorithms on the network level by ensemble methods. Bioinformatics, 26, 1738–1744.
[17] Brunel, H. et al. (2010) MISS: a non-linear methodology based on mutual information for genetic association studies in both population and sib-pairs analysis. Bioinformatics, 26, 1811–1818.
[18] Smet, R.D. and Marchal, K. (2010) Advantages and limitations of current network inference methods. Nat. Rev. Microbiol., 8, 717–729.
[19] Meyer, P.E. et al. (2008) minet: a R/Bioconductor package for inferring large transcriptional networks using mutual information. BMC Bioinformatics, 9, 461.
[20] Margolin, A.A. et al. (2006b) ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7, S7.
[21] Adamcsek, B. et al. (2006) CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics, 22, 1021–1023.
[22] Planche, A., Bacac, M. Identification of Prognostic Molecular Features in the Reactive Stroma of Human Breast and Prostate Cancer. Institute of Pathology, CHUV, and Faculty of Biology and Medicine, University of Lausanne.