Applying Ck Hybridization Arrays in Expression Profiling: a Theoretical Study


by Yoav Pinsky and Idan Rubin

July 16, 2003, under guidance of Doron Lipson and Zohar Yakhini


In any living cell that undergoes a biological process, different subsets of the total set of genes encoded in the organism’s genome are expressed at different stages of the process. The particular subset expressed at a given stage, and its quantitative composition, are of extreme importance. Being able to measure the subsets of genes expressed at different stages, in different cells, and in different organisms is instrumental in understanding biological processes. Such information can help characterize sequence-to-function relationships and determine the effects (and side effects) of experimental treatments. The most successful and most widely used techniques for measuring expression profiles utilize specifically designed surface-bound probes in an assay based on hybridization arrays. Because the probes are designed for specific targets, these techniques require the RNA to be measured to be determined in advance. One example of an existing generic method that doesn’t require such prior determination is SAGE.

In this work we study theoretical and feasibility aspects of a generic micro-array based approach to expression profiling, from the computational point of view.

We examine the following question: what is the quantitative effect of the noise variance on the hybridization array’s performance? More specifically: how large can random (Gaussian) noise in the fluorescence pattern get and still be tolerated by a generic hybridization array? (Tolerated noise here means that the array still yields the right answer with high probability, measured according to some reasonable probability measure on the input space.)


Given a mixture of many different RNA strands (with known sequences), we want to determine the expression levels of each sequence in the mixture, using a generic array based hybridization assay, and our knowledge of the hybridization signatures of each component of the mixture.

Consider a mixture of known RNA sequences. We try to determine the expression levels of each RNA molecule in the mixture by performing the following:

  1. We get the hybridization signature of the mixture, b, by performing a simple hybridization assay.
  2. We calculate the hybridization signature of each RNA sequence that might be present in the mixture, based on its sequence, and construct a hybridization matrix A. Each column in the matrix is the hybridization signature of one sequence.
  3. To find the concentration vector, we use the pseudo-inverse of the hybridization matrix, and find the vector, x, that gives us the best approximate solution to the equation system: Ax = b
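
The three steps above can be sketched in Python with NumPy. This is only an illustrative sketch, not the code used in the study. It assumes that the signature of a sequence on a Ck array is simply the count of each possible k-mer in that sequence (probe complementarity is ignored, since it only relabels the rows), and that sequences use the RNA alphabet ACGU. The function names and the toy sequences are hypothetical.

```python
from itertools import product

import numpy as np


def kmer_signature(seq, k):
    """Assumed signature model: count every k-mer over ACGU in seq."""
    index = {"".join(p): i for i, p in enumerate(product("ACGU", repeat=k))}
    sig = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip windows containing other symbols
            sig[index[kmer]] += 1
    return sig


def hybridization_matrix(sequences, k):
    """Step 2: one column per candidate sequence, 4**k rows."""
    return np.column_stack([kmer_signature(s, k) for s in sequences])


def estimate_concentrations(A, b):
    """Step 3: least-squares solution of Ax = b via the pseudo-inverse."""
    return np.linalg.pinv(A) @ b


# Toy example with made-up sequences and concentrations.
seqs = ["ACGUACGUGGA", "UUGACCAGUAC", "GGGAUCCAUUA"]
x_true = np.array([2.0, 0.5, 1.0])

A = hybridization_matrix(seqs, k=3)
b = A @ x_true  # step 1, simulated here under the linearity assumption
x_est = estimate_concentrations(A, b)
print(x_est)
```

In this noise-free toy case the columns of A are linearly independent, so the estimate matches the concentrations used to build b up to floating-point error.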

This will work if the hybridization signature is linear in the relative concentrations of the different RNA molecules in the mixture. In reality this is not the case, but we assume that it is approximately linear. If we had an "ideal" system, under the linearity assumption, and the matrix A was non-singular, the process described above would give us the exact and unique concentration vector x.

Unfortunately, several factors can introduce errors into our results:

  1. The accuracy of the calculated hybridization signature of each gene.
  2. The accuracy of our instruments - when we measure the hybridization signature of the mixture.
  3. The hybridization kinetics of each sequence in the mixture can differ slightly from its hybridization kinetics in a pure solution.

We treat all these factors as noise, and want to find out how this noise affects the accuracy of our calculated concentration vector.


We performed simulations of the proposed method to determine the relationship between the standard deviation (STD) of the noise and the accuracy of the resulting expression vector. We simulated the experiment on yeast gene sequences, with different values of the noise STD and different numbers of sequences.
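
A simulation of this kind can be sketched as follows. This is a hypothetical setup, not the one used in the study: all noise sources are modeled as additive zero-mean Gaussian noise on the measured signature b, and a random non-negative matrix stands in for the yeast hybridization matrix. The name `mean_error`, the matrix size, and the noise levels are illustrative assumptions.

```python
import numpy as np


def mean_error(A, x_true, sigma, trials=200, seed=0):
    """Average Euclidean distance between the estimated and true vectors."""
    rng = np.random.default_rng(seed)
    A_pinv = np.linalg.pinv(A)
    b = A @ x_true  # ideal, noise-free signature of the mixture
    errors = []
    for _ in range(trials):
        # Assumed noise model: i.i.d. Gaussian noise with STD sigma on b.
        noisy_b = b + rng.normal(0.0, sigma, size=b.shape)
        x_est = A_pinv @ noisy_b
        errors.append(np.linalg.norm(x_est - x_true))
    return float(np.mean(errors))


# Stand-in data: 256 probes (k = 4) and 20 sequences, not real yeast genes.
rng = np.random.default_rng(1)
A = rng.poisson(1.0, size=(256, 20)).astype(float)
x_true = rng.uniform(0.0, 1.0, size=20)

for sigma in (0.1, 0.2, 0.4):
    print(sigma, mean_error(A, x_true, sigma))
```

Replacing the stand-in matrix with one built from real sequences, and varying the number of columns, would mirror the two parameters varied in the study.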

We found a linear relation between the STD of the noise and the average distance of the resulting expression vector from the original one.
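
A linear relation of this kind is what one would expect under a simplified noise model. The derivation below is a sketch under assumptions not stated in the report: the noise is additive, zero-mean Gaussian on the measured signature b only, the matrix A is known exactly, and A has full column rank, so that A⁺A = I, where A⁺ denotes the pseudo-inverse. It does not cover errors in the calculated signatures themselves.

```latex
\begin{aligned}
b_{\text{noisy}} &= A x + n, \qquad n = \sigma z, \quad z \sim \mathcal{N}(0, I) \\
\hat{x} &= A^{+} b_{\text{noisy}} = A^{+} A x + \sigma A^{+} z = x + \sigma A^{+} z \\
\mathbb{E}\,\lVert \hat{x} - x \rVert &= \sigma \, \mathbb{E}\,\lVert A^{+} z \rVert
\end{aligned}
```

The expectation on the right-hand side depends only on A, so under these assumptions the average distance grows in direct proportion to the noise STD σ.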

We also found that the accuracy of the results improves when using a larger Ck array (k = 7 instead of k = 6, i.e. 7-mers instead of 6-mers), or when the assay is performed on a smaller number of gene sequences.

Full Report

CkHybridizationArray.doc  - (170 KB Word document)