Joint Analysis of DNA Copy Numbers
and Gene Expression Levels

 

Doron Lipson1,4, Amir Ben-Dor2, Elinor Dehan3 and Zohar Yakhini1,2

Proceedings of Algorithms in Bioinformatics: 4th International Workshop, WABI 2004, Bergen, Norway, September 17-21, 2004
Lecture Notes in Computer Science (LNCS), Vol. 3240/2004, p.135, Springer 2004

1 CS Dept, Technion, Israel; 2 Agilent Labs; 3 NYU; 4 corresponding author

 


Abstract

Genomic instabilities, amplifications, deletions and translocations are often observed in tumor cells. In the process of cancer pathogenesis cells acquire multiple genomic alterations, some of which drive the process by triggering overexpression of oncogenes and by silencing tumor suppressors and DNA repair genes.

We present data analysis methods designed to study the overall transcriptional effects of DNA copy number alterations. Alterations can be measured using several techniques, including microarray-based hybridization assays. The data have unique properties due to the strong dependence between measurement values at nearby genomic loci. To account for this dependence when studying the correlation of DNA copy number to expression levels, we develop versions of standard correlation methods that apply to genomic regions, together with methods for assessing the statistical significance of the observed results. In joint DNA copy number and expression data we define significantly altered submatrices as submatrices where a statistically significant correlation of DNA copy number to expression is observed. We develop heuristic approaches to identify these structures in data matrices. We apply all methods to several datasets, highlighting results that cannot be obtained by direct approaches or without using the regional view.

Paper

PDF, Springer LNCS

Scores

In the paper [1] we define two scores for identifying significant correlations in joint DNA copy number and gene expression data: the Regional Correlation score and significantly altered Genomic Continuous Submatrices (GCSMs). Here we present updated forms of these scores.

Regional Correlation

Given a gene with a vector of gene expression levels over a set of samples, we wish to find a genomic segment containing that gene such that the correlations of its expression vector with the DNA copy number vectors of all genes in this segment are consistently and significantly high. If this set of correlation values deviates significantly from the values expected at random, we suspect that the expression of the gene is affected by a genomic aberration in the genomic interval that corresponds to the segment. Since we do not assume a priori knowledge of the exact location of the interval with the best correlation to the gene's expression, we would like to consider all possible intervals and locate the most significantly correlating one.

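Let Corr be some correlation function (e.g. the Pearson correlation coefficient), applied to the expression vector of the gene and the DNA copy number vector of each gene in a candidate interval. The Regional Correlation of the gene is defined as: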

A high regional correlation score may arise due to significant correlation between the expression vector and the DNA copy number vectors within an interval, but it may also arise due to random (insignificant) correlation between the expression vector and DNA copy number vectors that are very highly correlated among themselves. In order to identify only significant correlation scores, we compute a p-value for the score by considering a large number of random instances of the expression vector.
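The procedure described above can be sketched in Python. The exact scoring formula is not reproduced on this page, so the sketch uses a stand-in chosen only for illustration: the sum of Pearson correlations over an interval divided by the square root of the interval length. Random instances of the expression vector are generated by permuting its entries, which is one possible choice. The function names pearson_rows, regional_score and regional_pvalue are hypothetical.

```python
import numpy as np

def pearson_rows(x, Y):
    """Pearson correlation of vector x with every row of matrix Y."""
    xc = x - x.mean()
    Yc = Y - Y.mean(axis=1, keepdims=True)
    denom = np.linalg.norm(xc) * np.linalg.norm(Yc, axis=1)
    return (Yc @ xc) / denom

def regional_score(r, i):
    """Best interval [lo, hi] containing index i.

    ASSUMPTION: an interval is scored as sum(r[lo..hi]) / sqrt(length).
    This is a stand-in, not necessarily the score defined in [1].
    """
    prefix = np.concatenate(([0.0], np.cumsum(r)))
    best = (-np.inf, i, i)
    for lo in range(i + 1):
        for hi in range(i, len(r)):
            s = (prefix[hi + 1] - prefix[lo]) / np.sqrt(hi - lo + 1)
            if s > best[0]:
                best = (s, lo, hi)
    return best

def regional_pvalue(e, C, i, n_perm=200, seed=0):
    """Empirical p-value of the best interval score for gene i.

    e: expression vector of gene i over the samples.
    C: DNA copy number matrix (genes in genomic order x samples).
    ASSUMPTION: random instances of e are obtained by permuting samples.
    """
    rng = np.random.default_rng(seed)
    observed, lo, hi = regional_score(pearson_rows(e, C), i)
    hits = sum(
        regional_score(pearson_rows(rng.permutation(e), C), i)[0] >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly positive.
    return observed, (lo, hi), (hits + 1) / (n_perm + 1)
```

Because each random instance is scored against the same copy number matrix, intervals whose copy number vectors are highly correlated among themselves also produce high scores under the null, which is the effect the p-value is meant to control for.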

Genomic Continuous Submatrices (GCSMs)

A GCSM M is defined by a continuous genomic segment G and a subset of the samples S. These determine a submatrix of the DNA copy number measurement matrix C and the corresponding submatrix of the gene expression matrix E. Below we refer to these as the DNA copy number submatrix and the expression submatrix of M, respectively.

We would like to score the degree to which the DNA copy numbers and expression levels of the genes in G support the existence of an amplification (or deletion) in M.

Given k, the number of positive entries in the DNA copy number submatrix of M, and p, the fraction of positive entries in the data, we score the overabundance of positive entries in M as:
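As a hedged illustration of what an overabundance score of this kind can look like, the sketch below uses the negative log10 of a binomial tail probability: the chance of seeing at least k positive entries among n entries when each is positive with probability p. This particular form, the function name overabundance_score and the parameter n (taken here to be the number of entries in the submatrix) are assumptions made for the example, not the definition used in this work.

```python
import math

def overabundance_score(k, n, p):
    """-log10 of P(X >= k) for X ~ Binomial(n, p).

    ASSUMPTION: n is the number of entries in the submatrix and each
    entry is positive independently with probability p under the null.
    """
    tail = math.fsum(
        math.comb(n, j) * p**j * (1.0 - p) ** (n - j)
        for j in range(k, n + 1)
    )
    if tail <= 0.0:
        # The tail underflowed to zero; report an unbounded score.
        return math.inf
    return -math.log10(tail)

# Example: a 10-entry submatrix that is entirely positive when half of
# all entries in the data are positive.
score = overabundance_score(10, 10, 0.5)
```

A larger value means that the observed number of positive entries is less likely to arise by chance under the assumed null model.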

Similarly, we would like to score the overabundance of measurements in E that suggest that M is indeed aberrant. Unlike the DNA copy number values, we do not expect the expression measurements of all genes in G to support M, since the expression level of any gene incident to the aberration may or may not be modified, depending on different factors that determine regulation. Accordingly, we define a score that reflects the overabundance of genes in G that are significantly differentially expressed, comparing the samples in S with the remaining samples, in the correct direction (for an amplification, higher in S than in the remaining samples). A TNoM (Threshold Number of Misclassifications) score may be assigned to each gene according to its performance as a classifier separating the samples in S from the remaining samples. For a more detailed description of differential expression overabundance please see [2].
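The TNoM idea can be sketched as follows for the amplification case: for a single gene, find the threshold that minimizes the number of misclassified samples under the rule "expression above the threshold means the sample is in S". The function name tnom_higher_in_s and the list-based inputs are hypothetical choices for this example, and tie handling (samples with equal values always fall on the same side of the threshold) is an assumption.

```python
def tnom_higher_in_s(values, in_s):
    """Minimum number of misclassifications over all thresholds t for
    the one-sided rule 'value > t  =>  sample is in S'.

    values: expression levels of one gene, one per sample.
    in_s:   booleans, True where the sample belongs to S.
    """
    order = sorted(zip(values, in_s))
    # Threshold below every value: all samples are predicted to be in S,
    # so the errors are exactly the samples outside S.
    errors = sum(1 for _, s in order if not s)
    best = errors
    idx = 0
    while idx < len(order):
        v = order[idx][0]
        # Raise the threshold past all samples sharing the value v.
        while idx < len(order) and order[idx][0] == v:
            # A sample now below the threshold is predicted outside S.
            errors += 1 if order[idx][1] else -1
            idx += 1
        best = min(best, errors)
    return best

# Example: perfect separation gives 0 misclassifications.
perfect = tnom_higher_in_s([1, 2, 3, 10, 11, 12],
                           [False, False, False, True, True, True])
```

A gene whose score falls below a chosen cutoff would then count as supporting the aberration, and the number of such genes in the segment is what an overabundance score would be computed from.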

A total score for an amplification in M is then defined as:

Although the DNA copy number score and the expression score both arise from statistical significance considerations, the former is expected to be supported by more data than the latter, so we use a weighting factor a to balance their contributions to the total score.

Algorithmic methods for locating high-scoring GCSMs are described in [1].

Examples

We demonstrate the use of Regional Correlation and significantly altered GCSMs on the breast cancer data of Pollack et al. [3]. The dataset contains parallel DNA copy number and gene expression measurements of 6,095 genes on 41 breast tumor and cell-line samples.

The following figure depicts the genomic locations of significant aberrations detected in the breast cancer dataset. Pink marks depict significantly altered GCSMs (S(M;C,E) > 40), where the embedded yellow marks denote the positions of resident genes that are significantly differentially expressed (TNoM < 7) with relation to the respective partition. Genes with a significant regional correlation score (R(i,*) > 1.3, pval < 10^-3) are depicted above by yellow marks, where the surrounding light blue marks depict the genomic intervals against which the maximum regional correlation was attained.
Note the high degree of agreement between the two scores, although high-scoring GCSMs may be found even when no transcriptional effect is detected.

 

A table that summarizes the information for significantly-altered GCSMs

A table that summarizes the information for genes with significant Regional Correlation

 

References

  1. D. Lipson, A. Ben-Dor, E. Dehan and Z. Yakhini, Joint Analysis of DNA Copy Numbers and Gene Expression Levels, in Proceedings of WABI '04, LNCS Vol. 3240/2004, p.135, Springer, 2004.

  2. A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini, Tissue classification with gene expression profiles, in Proceedings of RECOMB '00, pp. 54–64, 2000.

  3. J.R. Pollack, T. Sorlie, C.M. Perou, C.A. Rees, S.S. Jeffrey, P.E. Lonning, R. Tibshirani, D. Botstein, A. Borresen-Dale, and P.O. Brown. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors, PNAS, 99(20):12963–8, 2002.

Page created by Doron Lipson