Evolutionary Computation for Biclustering of Gene Expression Federico Divina

Jesus S. Aguilar–Ruiz University of Seville Seville, Spain

Tilburg University Tilburg, The Netherlands

[email protected]

[email protected]

Categories and Subject Descriptors I.2.1 [Applications and Expert Systems]: Medicine and Science; I.5.1 [Clustering]

1.

INTRODUCTION

Microarray data are widely used in genomic research due to the enormous potential in gene expression profiling, facilitating the prognosis and the discovering of subtypes of diseases. The gene expression data are organized in matrices, where rows represent genes and columns represent experimental conditions. Each element in the matrix refers to the expression level of a particular gene under a specific condition. Biclustering of gene expression was first proposed by Cheng and Church [2], where the residue of an element in the bicluster and the mean squared residue of a submatrix were introduced. In addition, they adjusted that measure to reject trivial biclusters by means of the row variance. For details about these measures, we refer the reader to [2]. In this work, we address the biclustering problem with evolutionary computation (EC), which has been proven to have an excellent performance on highly complex optimization problems. In this paper we follow the model of biclustering proposed by [2]. As the algorithm, called SEBI, partially uses the squared mean residue, the results have been compared to those of Cheng and Church. In expression data analysis, the most important goal may not be finding the maximum bicluster or even finding a bicluster cover for the data matrix. It is more interesting to find a set of genes showing strikingly similar up–regulation and down– regulation under a set of conditions. A low mean squared residue score plus a large variation from the constant may be a good criterion for identifying these genes and conditions. Therefore, our goal is to find biclusters of maximum size, with mean squared residue lower than a given δ, with a relatively high row variance, and with a low level of overlapping among biclusters.

2.

DESCRIPTION OF THE ALGORITHM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’05 March 13–17, 2005, Santa Fe, New Mexico, USA. Copyright 2005 ACM 1-58113-964-0/05/0003 ...$5.00.

The algorithm we propose adopts a sequential covering strategy: an Evolutionary Algorithm (EA), called EBI (for Evolutionary BIclustering), is called several times, until an end condition is met. EBI takes as input the expression matrix and the δ value and returns either a bicluster with mean squared residue lower than δ or nothing. In the former case, the returned bicluster is stored in a list called Results, and EBI is called again. The end condition is also met when EBI is called a maximum number of times. When the end condition is met, the list Results is returned. After a bicluster is returned, weights associated with the expression matrix are adjusted. This operation is performed in order to to bias the search towards biclusters that do not overlap with already found biclusters. In order to do so, we associate a weight to each element of the expression matrix. The weight of an element depends on the number of biclusters in Results containing the element. The more biclusters cover an element, the higher the weight of the element will be. The weight wp associated to an element elij of the expression matrix is:

wp (elij ) =

(

0

if |Cov(elij )| = 0; −|Cov(elnm )| n∈N,m∈M e −|Cov(elij )|

P

e

if |Cov(elij )| > 0.

where N and M are the number of rows and the number of columns of the expression matrix, respectively, and |Cov(elij )| is the number of biclusters in Results containing elij . In EBI the weights are taken into consideration when the quality of biclusters is assessed, as described in the following. The aim of EBI is to find δ–biclusters with maximum volume, with a relatively high row variance, and minimizing the effect of overlapping among biclusters. The initial population consists of biclusters containing only one element of the expression matrix. These biclusters have the property of having a mean squared residue equal to 0. Tournament selection is used for selecting parents. Selected pairs of parents are recombined with a crossover operator with a given probability pc (default value 0.9), and the resulting offspring is mutated with a probability pm (default value 0.1). Three crossover operators can be applied with equal probability: one-point, two-point and uniform crossover. Three mutation operators are used, a standard mutation operator, a mutation operator that adds a row and a mutation operator that adds a column to the bicluster. Elitism is applied with a probability pe (default value 0.75). At the end of the evolutionary process, if the best

individual, according to the fitness, encodes a δ–bicluster, then it is returned, otherwise EBI does not return anything. Each individual of the population encodes one bicluster. Biclusters are encoded by means of binary strings of length N + M , where N and M are the number of rows (genes) and of columns (conditions) of the expression matrix, respectively. Each of the first N bits of the binary string is related to the rows, in the order in which the bits appear in the string. In the same way, the remaining M bits are related to the columns. If a bit is set to 1, it means that the relative row or column belongs to the encoded bicluster; otherwise it does not. The fitness function rewards individuals encoding biclusters with low mean squared residue, with high volume and row variance and covering elements of the expression matrix that are not covered by biclusters found by previous executions of EBI. The final objective of the EBI is to minimize the fitness.

3.

EXPERIMENTAL RESULTS

In order to asses the goodness of the proposed method for finding biclusters in expression data, we conduct experiments on a well known dataset: the human B–cells expression data originated from [1]. The human dataset consists of an expression matrix of 4026 genes and 96 conditions. For the human dataset δ was set to 1200, as in [2]. 791 200 150 100

Expression Value

50

-100 -150 -200

0

10

20

30

40

50

Conditions 971 200 150 100

Expression Value

50 0 -50 -100 -150 -200 -250

Table 1: Performance comparison: SEBI and CC. Avg. residue Avg. Volume Avg. gene num. Avg. cond. num.

SEBI 1028.84 (29.19) 615.84 (278.35) 14.07 (5.39) 43.57 (6.20)

CC 850.04 (153.91) 4595.98 (3353.71) 269.22 (204.71) 24.5 (20.92)

In Table 1, we compare the performance of SEBI with that of Cheng and Church’s algorithm (here named CC), for what concerns the average residue and the average dimension of the biclusters found. We can see that CC is capable of finding biclusters characterized by a higher volume than the ones found by SEBI. This is due to the fact that after that the most trivial biclusters have been found, SEBI focuses on elements of the expression matrix that are not contained in already found biclusters. On the other hand, after that CC has found a bicluster, the covered elements of the expression matrix are substituted by randomly generated values, in the range of the original data. This may cause the biclusters to overlap much more of what happens in SEBI, where overlapping is avoided as much as possible. As far as the residue is concerned, CC is able, on average, to find biclusters with a lower residue. The standard deviation of CC is much higher than that of SEBI, meaning that SEBI has a more stable behavior.

4. CONCLUSIONS AND FUTURE WORK

0 -50

-250

The one hundred biclusters found cover 34.07% of the elements of the expression matrix, covering 38.23% of the genes and 100% of the conditions. Each call to EBI on the human dataset requires about 500 seconds on a Pentium III 900 Mhz.

0

10

20

30

40

50

60

Conditions

Figure 1: Bicluster found for the human dataset. The residue of bicluster 791 is 1123.09, and its row variance is 3207.90. The residue of bicluster 971 is 1169.85 and its row variance is 5796.25. Figure 1 shows two out of one hundred biclusters found for the human dataset. These biclusters were found with the following parameters for EBI: generation 100, population size 200, crossover probability 0.85, mutation probability 0.20 and elitism probability 0.90. The genes that are contained in these two biclusters show very similar behavior under the conditions belonging to the biclusters. In addition, these biclusters present a low residue and a very high row variance.

In this paper we have introduced an algorithm based on EC, called SEBI, for finding biclusters on expression data. The proposed algorithm adopts a sequential covering strategy, and an EA in order to find biclusters. The experimental results show that CC is capable of finding bigger clusters. This could be due to the overlapping policy adopted by CC. In SEBI overlapping is avoided by means of the weights assigned to the elements of the expression matrix, while CC replaces covered entries of the expression matrix by random values. This strategy could not prevent overlapping as efficiently as the one adopted in SEBI allowing the system to find biclusters with higher volume. We can conclude that SEBI is successful in finding set of genes that show strikingly similar up–regulations and down–regulations under a set of conditions, as shown by the presented results. Thus, EC represents a useful framework for addressing the challenges of gene expression analysis. Our future work consists in exploring a natural encoding of biclusters, where they are not encoded by binary strings, but by strings of natural numbers, using thus a phenotypic representation, allowing variable–size representations.

5. REFERENCES

[1] A. A. Alizadeh and et al. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 200. [2] Y. Cheng and G. M. Church. Biclustering of expression data. In Proceedings of the 8th International Conference on Itelligent Systems for Molecular Biology (ISMB’00), pages 93–103, 2000.

Jesus S. Aguilar–Ruiz University of Seville Seville, Spain

Tilburg University Tilburg, The Netherlands

[email protected]

[email protected]

Categories and Subject Descriptors I.2.1 [Applications and Expert Systems]: Medicine and Science; I.5.1 [Clustering]

1.

INTRODUCTION

Microarray data are widely used in genomic research due to the enormous potential in gene expression profiling, facilitating the prognosis and the discovering of subtypes of diseases. The gene expression data are organized in matrices, where rows represent genes and columns represent experimental conditions. Each element in the matrix refers to the expression level of a particular gene under a specific condition. Biclustering of gene expression was first proposed by Cheng and Church [2], where the residue of an element in the bicluster and the mean squared residue of a submatrix were introduced. In addition, they adjusted that measure to reject trivial biclusters by means of the row variance. For details about these measures, we refer the reader to [2]. In this work, we address the biclustering problem with evolutionary computation (EC), which has been proven to have an excellent performance on highly complex optimization problems. In this paper we follow the model of biclustering proposed by [2]. As the algorithm, called SEBI, partially uses the squared mean residue, the results have been compared to those of Cheng and Church. In expression data analysis, the most important goal may not be finding the maximum bicluster or even finding a bicluster cover for the data matrix. It is more interesting to find a set of genes showing strikingly similar up–regulation and down– regulation under a set of conditions. A low mean squared residue score plus a large variation from the constant may be a good criterion for identifying these genes and conditions. Therefore, our goal is to find biclusters of maximum size, with mean squared residue lower than a given δ, with a relatively high row variance, and with a low level of overlapping among biclusters.

2.

DESCRIPTION OF THE ALGORITHM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC’05 March 13–17, 2005, Santa Fe, New Mexico, USA. Copyright 2005 ACM 1-58113-964-0/05/0003 ...$5.00.

The algorithm we propose adopts a sequential covering strategy: an Evolutionary Algorithm (EA), called EBI (for Evolutionary BIclustering), is called several times, until an end condition is met. EBI takes as input the expression matrix and the δ value and returns either a bicluster with mean squared residue lower than δ or nothing. In the former case, the returned bicluster is stored in a list called Results, and EBI is called again. The end condition is also met when EBI is called a maximum number of times. When the end condition is met, the list Results is returned. After a bicluster is returned, weights associated with the expression matrix are adjusted. This operation is performed in order to to bias the search towards biclusters that do not overlap with already found biclusters. In order to do so, we associate a weight to each element of the expression matrix. The weight of an element depends on the number of biclusters in Results containing the element. The more biclusters cover an element, the higher the weight of the element will be. The weight wp associated to an element elij of the expression matrix is:

wp (elij ) =

(

0

if |Cov(elij )| = 0; −|Cov(elnm )| n∈N,m∈M e −|Cov(elij )|

P

e

if |Cov(elij )| > 0.

where N and M are the number of rows and the number of columns of the expression matrix, respectively, and |Cov(elij )| is the number of biclusters in Results containing elij . In EBI the weights are taken into consideration when the quality of biclusters is assessed, as described in the following. The aim of EBI is to find δ–biclusters with maximum volume, with a relatively high row variance, and minimizing the effect of overlapping among biclusters. The initial population consists of biclusters containing only one element of the expression matrix. These biclusters have the property of having a mean squared residue equal to 0. Tournament selection is used for selecting parents. Selected pairs of parents are recombined with a crossover operator with a given probability pc (default value 0.9), and the resulting offspring is mutated with a probability pm (default value 0.1). Three crossover operators can be applied with equal probability: one-point, two-point and uniform crossover. Three mutation operators are used, a standard mutation operator, a mutation operator that adds a row and a mutation operator that adds a column to the bicluster. Elitism is applied with a probability pe (default value 0.75). At the end of the evolutionary process, if the best

individual, according to the fitness, encodes a δ–bicluster, then it is returned, otherwise EBI does not return anything. Each individual of the population encodes one bicluster. Biclusters are encoded by means of binary strings of length N + M , where N and M are the number of rows (genes) and of columns (conditions) of the expression matrix, respectively. Each of the first N bits of the binary string is related to the rows, in the order in which the bits appear in the string. In the same way, the remaining M bits are related to the columns. If a bit is set to 1, it means that the relative row or column belongs to the encoded bicluster; otherwise it does not. The fitness function rewards individuals encoding biclusters with low mean squared residue, with high volume and row variance and covering elements of the expression matrix that are not covered by biclusters found by previous executions of EBI. The final objective of the EBI is to minimize the fitness.

3.

EXPERIMENTAL RESULTS

In order to asses the goodness of the proposed method for finding biclusters in expression data, we conduct experiments on a well known dataset: the human B–cells expression data originated from [1]. The human dataset consists of an expression matrix of 4026 genes and 96 conditions. For the human dataset δ was set to 1200, as in [2]. 791 200 150 100

Expression Value

50

-100 -150 -200

0

10

20

30

40

50

Conditions 971 200 150 100

Expression Value

50 0 -50 -100 -150 -200 -250

Table 1: Performance comparison: SEBI and CC. Avg. residue Avg. Volume Avg. gene num. Avg. cond. num.

SEBI 1028.84 (29.19) 615.84 (278.35) 14.07 (5.39) 43.57 (6.20)

CC 850.04 (153.91) 4595.98 (3353.71) 269.22 (204.71) 24.5 (20.92)

In Table 1, we compare the performance of SEBI with that of Cheng and Church’s algorithm (here named CC), for what concerns the average residue and the average dimension of the biclusters found. We can see that CC is capable of finding biclusters characterized by a higher volume than the ones found by SEBI. This is due to the fact that after that the most trivial biclusters have been found, SEBI focuses on elements of the expression matrix that are not contained in already found biclusters. On the other hand, after that CC has found a bicluster, the covered elements of the expression matrix are substituted by randomly generated values, in the range of the original data. This may cause the biclusters to overlap much more of what happens in SEBI, where overlapping is avoided as much as possible. As far as the residue is concerned, CC is able, on average, to find biclusters with a lower residue. The standard deviation of CC is much higher than that of SEBI, meaning that SEBI has a more stable behavior.

4. CONCLUSIONS AND FUTURE WORK

0 -50

-250

The one hundred biclusters found cover 34.07% of the elements of the expression matrix, covering 38.23% of the genes and 100% of the conditions. Each call to EBI on the human dataset requires about 500 seconds on a Pentium III 900 Mhz.

0

10

20

30

40

50

60

Conditions

Figure 1: Bicluster found for the human dataset. The residue of bicluster 791 is 1123.09, and its row variance is 3207.90. The residue of bicluster 971 is 1169.85 and its row variance is 5796.25. Figure 1 shows two out of one hundred biclusters found for the human dataset. These biclusters were found with the following parameters for EBI: generation 100, population size 200, crossover probability 0.85, mutation probability 0.20 and elitism probability 0.90. The genes that are contained in these two biclusters show very similar behavior under the conditions belonging to the biclusters. In addition, these biclusters present a low residue and a very high row variance.

In this paper we have introduced an algorithm based on EC, called SEBI, for finding biclusters on expression data. The proposed algorithm adopts a sequential covering strategy, and an EA in order to find biclusters. The experimental results show that CC is capable of finding bigger clusters. This could be due to the overlapping policy adopted by CC. In SEBI overlapping is avoided by means of the weights assigned to the elements of the expression matrix, while CC replaces covered entries of the expression matrix by random values. This strategy could not prevent overlapping as efficiently as the one adopted in SEBI allowing the system to find biclusters with higher volume. We can conclude that SEBI is successful in finding set of genes that show strikingly similar up–regulations and down–regulations under a set of conditions, as shown by the presented results. Thus, EC represents a useful framework for addressing the challenges of gene expression analysis. Our future work consists in exploring a natural encoding of biclusters, where they are not encoded by binary strings, but by strings of natural numbers, using thus a phenotypic representation, allowing variable–size representations.

5. REFERENCES

[1] A. A. Alizadeh and et al. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 200. [2] Y. Cheng and G. M. Church. Biclustering of expression data. In Proceedings of the 8th International Conference on Itelligent Systems for Molecular Biology (ISMB’00), pages 93–103, 2000.