Efficient calculation of interval scores

for DNA copy number data analysis

April 2007, under guidance of Zohar Yakhini

Introduction

Alterations in DNA copy number are characteristic of many cancer types and are thought to drive some cancer pathogenesis processes. These alterations include large chromosomal gains and losses as well as smaller scale amplifications and deletions. Because of their role in cancer development, regions of chromosomal instability are useful for elucidating other components of the process. For example, since genomic instability can trigger the over expression or activation of oncogenes and the silencing of tumor suppressors, mapping regions of common genomic aberrations had been used to discover cancer related genes.

In Comparative Genomic Hybridization (CGH) differentially labeled tumor and normal DNA are co-hybridized to normal metaphases. Ratios between the two labels allow the detection of chromosomal amplifications and deletions of regions that may harbor oncogenes and tumor suppressor genes. In a more advanced method termed array CGH (aCGH), tumor and normal DNA are co-hybridized to a microarray of thousands of genomic clones of BAC, cDNA or oligonucleotide probes, allowing the determination of changes in DNA copy number of relatively small chromosomal regions.

The Stepgram application is used to detect aberrations in DNA copy number data from aCGH experiments. The application is based on the interval score described in the paper "Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis"[1]. In short, the score of an interval is defined as the sum of its probes, divided by the square root of its length (). Although the paper mentioned above addresses many problems, we’ve focused on one in particular since it is the key to all others: how can we find the interval with the maximum score correctly and efficiently. The paper offers 2 different algorithms for that problem in details, we will describe them in short.

Known Algorithms

The paper suggests 2 algorithms for finding the interval with maximal score:

LookAhead algorithm

The algorithm operates by considering all possible interval starting points, in sequence. For each starting point, we check the intervals with ending point in increasing distance from . The basic idea is to try and not consider all possible ending point, but rather to skip some that will clearly not provide the optimum. Assume that we are given two parameters  and , where  is a lower bound on the optimal score  and  is an upper bound on the value of any single element .

Assume that we have just considered an interval of length  with , and .

After checking the interval , an exhaustive algorithm would continue to the next ending point, and check the interval . However, this might not always be necessary. If  is sufficiently small and  is sufficiently large then  may stand no chance of obtaining. For any x, setting  (skipping the next  intervals) an upper bound on the score of  is given by . Thus,  has a chance of surpassing  only if  . Solving for, we obtain. Thus, the next ending point to consider is, where.

The improvement in efficiency depends on the number of ending points that skipped. This number, in turn, depends on the tightness of the two bounds  and. Initially,  may be set to . As we proceed  is replaced by maximal score encountered so far, gradually improving performance.

provides an upper bound on the values of single elements in . In the description above, we used the global maximum. However, using this bound may limit the usefulness of the LookAhead approach, since even a single high value in the data will severely limit the skip size. Thus, we use a local approach, where we bound the value of the single elements within a window of size. The most efficient definition of local bounds is the slope maximum: For a window size, for all, pre-calculate, .

Although   is not an upper bound on the value of elements within the  -window following, the value of  is indeed an upper bound on  within the  -window, maintaining the correctness of the approach. Note that the skip size must now be limited by the window size:. Thus, larger values of  give better performance although preprocessing limits the practical window size to .

Geometric Family Algorithm (GFA)

The paper defines a geometric family of intervals I as: I =
where I(j) is defined as follows:
For j = 0, 1,…, log(1+ε)n, set kj=(1+ε)j and Δj= εkj. then:
I(j) = .
In simple words, in each step we increase both the length of the intervals (kj) and the space between the starting points of 2 consecutive intervals (Δj). Since both increments are exponential, in length and in space, the result is a set of intervals that is significantly smaller than all  possible sub-intervals in the range [0, n-1].

Lemma (proof in the paper): Let I* be the optimal scoring interval. Let J be the leftmost longest interval of I fully contained in I*. Then (J) ≥ (I*)/a, where .

Using the lemma, we can do as follows:

Find the maximal score of intervals in I, call it M, and let M* denote the score of the maximal interval in I. Then, for each interval for which J < M*/a ≤M/a, J can’t be the leftmost longest interval of I fully contained in I. So, when searching for the maximum scoring interval, we need not consider those for which J is the leftmost longest interval.

Project goals

Out project addresses 2 issues:

1. Trying to improve time complexity of LookAhead algorithm.
2. Find out whether the One-Side Geometric Family Algorithm (details ahead) can have better run-time than the original GFA (although theoretically it has a worse time complexity), because it is simpler and can use larger values for ε.

Improving LookAhead algorithm:

The general concept:

In the original LA algorithm, the upper bound's computation is done for all the points. For each point, this computation is set by the values from  to  only.

In the Improved LA Algorithm, the computation of the upper limit is from each point to the last point. By this method, in the max interval calculation, the skip  is not limited by any value.

Preprocessing:

From each pointall the points from this point to the end are grouped into intervals, each interval size is  (except for the last interval).

In an empiric way, we selected .

For each interval, the maximum point is computed using the Algorithm.

All the points in a specific interval are considered to be the maximum value found inside. If the maximum is a negative value, it is set to be zero. Taking these new values set for each point, the slopes are computed.

The slope of a point is equal or larger than the average of the values in any interval beginning with i.  This is guaranteed because the use of an upper bounds for the point’s value.

The complexity of the  algorithm’s preprocessing is  and the maximum value’s finding in an interval, is.

In the Improved LA algorithm’s preprocessing, for each point, no more than calculations of max values are done. Therefore the complexity of the preprocessing is.

Max intervals calculation:

Same like the original algorithm, but the slope’s values are different.

The step size, as computed in the original algorithm, is computed with the new slopes (). Note that, in contrast to the LA Algorithm,  is not limited to any size.

An alternate option is to calculate the original slopes as well, and then the step size, is computed twice. First  is set to the original slope and  is limited to, then  is set to the new slope that was calculated in the Improved LA Preprocessing. The maximum is selected, and the steps are set accordingly.

In the alternate option the performance in the preprocessing is not as good as the proposed option.

Range Maximum Query () Algorithm1

For any array A of numbers, for indices  and  between  and, query  returns the index and the value of the largest element in the sub array.

The idea is to precompute each query whose length is a power of two. That is, for every

_

Between  and  and every  between  and , maximum element is found in the block starting at  and having length , that is, we compute .  Table therefore has size, and we fill it in time by using dynamic programming. Specificaly, the maximum in a block of size  is found by computing the two maxima of its two constituent blocks of size . More formally,  if  and  otherwise.

To avoid unnecessary iterations, only the first  iteration are done.

Computation of the :

We select two overlapping blocks that entirely cover the subrange: let  be the size of the largest block that fits into the range from  to , that is let . Then can be computed by comparing the maxima of the following two blocks: to and  to . These values have already been computed, so we can find the in constant time. The complexity[2].

One-Side GFA:

OSGFA use is similar in concept to the original GFA, but while in each iteration in GFA we enlarge both interval length and spacing between intervals, in OSGFA we only enlarge the length. Formally:

We define I, a one-side geometric family of intervals:

Set e > 0.

For j = 0,1,…,n-1: I(j) = { [ j , j + ( 1 + e)­i – 1 ] : i = 0,1,…,log(1+e)(n-j) }

Lemma 1: Let I be an interval, and J – the leftmost longest interval of I contained in I. Then:

Proof:

Suppose that the length of J is (1+e)j (we consider the interval [i,i] as an interval with the length of 1). J is the leftmost longest interval contained in I, therefore . Then:

è

Lemma 2: Let I* be the interval with the optimal score, and J be the leftmost longest interval of I contained in I*. Then:

, with .

Proof:

Set . Assume by contradiction that . Denote I* = [u,v] and J = [u,w]. define A = [w+1,v] (the interval that extends J to I*).

and  therefore:

According to our assumption, we now get:

è

So our assumption was wrong, and indeed .

From lemma 2 we get that we can use OSGFA in the same manner as GFA, but we can get the same approximation (same Alpha) in OSGFA by using a bigger e: .

In addition, the cover zone of interval [x,y] in OSGFA is smaller than it is in GFA
( L-COV(x) = x and R-COV(y) = y + (y-x+1)1+e - 2 ).

Results

To test our suggestions we implemented the following:

-        Naïve algorithm (exhaustive search).

-        LookAhead algorithm.

-        Improved LookAhead.

-        GFA.

-        OSGFA.

-        Random input generator.

-        Main program that produces a random input (or receives it as a file) and runs the selected algorithms.

Using the input generator we produced the following tests:

 Size of input Number of amplifications Number of tests 3000 1 100 6000* 2 100 12000 3 50 24000 4 50 48000 5 30 96000 6 20 192000 7 10

* 4 tests of real data of about 6000 probes (Pollack) were also tested and the times were similar

The random input was produced as follows:

-        Initial values of all points were drawn from Normal(0,1).

-        Input vector divided into 1-7 zones, depending on number of amplifications implanted, for each zone an amplification was added by:

o      Drawing the amplification length from Geom(0.01), with a limitation that it must be at least 10 and no more than 20% of the zone length.

o      Start point of the amplification in the zone is drawn uniformly.

o      For each point in the amplification, we draw the amplification amplitude from Uniform(0.5,10).

-        All of the parameters for the generation mentioned above are the default values in our program, but they can be changed by the user.

All algorithms were tested with the tests of 3000-12000 points, on larger tests we tried all except for the Naïve algorithm (takes too much time…).

* All running times refer to a PC with AMD Athlon XP 2500+ processor and 1GB RAM.

Graph 1: Average run-times of LookAhead, improved LA and some key functions as reference. The start point values that were chosen for the functions:

X1 => (3000,0.5) – same as improved LookAhead.

X4/3 => (3000,0.5) – same as improved LookAhead.

X3/2 => (3000,0.3) – same as LookAhead.

Graph 2: Average run-times of GFA, OSGFA and some key functions as reference. The start point values that were chosen for the functions:

X1 (0.38)=> (3000,0.38) – same as GFA.

X1 (0.47)=> (3000,0.47) – same as OSGFA.

X4/3 => (3000,0.47) – same as improved OSGFA.

Graph 3: Average run-times for all tested algorithms and some key functions as reference, all start points for the functions were (3000,0.3) – same as LA:

Graph 4: run-time scatter for LA, improved LA, GFA and OSGFA:

 24000 (50)                                  48000 (30)              96000 (20)                192000 (10)

The code:

The main program can:

-        Create a random input file or use an existing one.

o      The input file should consist of the probes values separated by any non-digit char(s).

-        Run any algorithm (or some of them) from: naïve, LookAhead, improved LA, GFA & OSGFA.

Command line arguments for the main program:

Usage: stepgram.exe

-input <file name>  # name of file to read input from (or to create if using -produce_input)

#The input file should consist of the probes values separated by any non-digit char(s)

[-modes <comma-separated list of modes>]  # optional modes: all,naive,la,imp_la,gfa,os_gfa

[-produce_input <size of input>]  # will produce random input, can be used with -modes flag to run with these modes after generation of input

[-gfa_ep <epsilon>]  # epsilon for GFA, default is 0.1

[-os_gfa_ep <epsilon>]  # epsilon for one-side GFA, default is 0.25

[-la_window <window size>]  # window size for LookAhead, given as the exponent of input size, default is 1/2

[-imp_la_window <window size>]  # window size for improved LookAhead, default is 1/3

The following flags are for random input generation:

[-ri_vm <mean>]  # original values are generated using Normal(ri_vm,ri_vsd)

[-ri_vsd <sd>]  #  defaults are: ri_vm = 0, ri_vsd = 0

[-ri_agp <p>]  #       the amplification length is determined by Geom(ri_agp), but must be:

[-ri_aml <length>]  #  larger than ri_aml and smaller than (input_size/ri_noa)*ri_matir

[-ri_matir <ratio>]  # defaults are: ri_agp = 0.01, ri_aml = 10, ri_amatir = 0.2

[-ri_amina <min>]  # for each probe in the amplification interval, amplitude is determined

[-ri_amaxa <max>]  # using UNIFORM(ri_amina, ri_amaxa), default is (0.5,10)

[-ri_noa <num>]  # number of amplification to add.

[-h] # display this help message.

Example for an output (messages begin with “-I-“ for info, “-T-“ for time measurements, “-R-“ for results, “-W-“ for warnings and “-E-“ for errors):

F:\TECHNION\project in bioinformatics\submit>stepgram.exe -produce_input 100000 -input "F:\TECHNION\project in bioinformatics\submit\random_input_for_example.txt" -modes la,imp_la,gfa,os_gfa -ri_noa 5

Command line: stepgram.exe -produce_input 100000 -input F:\TECHNION\project in bioinformatics\submit\random_input_for_example.txt -modes la,imp_la,gfa,os_gfa -ri_noa 5

##################################################################

##################################################################

-I- stepgram: starting standart LookAhead algorithm

-I- LAPreProcess::LAPreProcess - Total number of probes: 100000

-T- LAPreProcess::LAPreProcess - LAPreProcess constructor: 20.375 seconds

-R- LookAhead::activate - max interval: (17819,18231) score: 102.348

-T- LookAhead::activate: 32.093 seconds

-R- stepgram: max interval by la algorithm: (17819,18231) score: 102.348

-T- stepgram: la calculation: 52.468 seconds

##################################################################

##################################################################

##################################################################

-I- stepgram: - starting improved LookAhead algorithm

-I- LAPreProcess::LAPreProcess - Total number of probes: 100000

-T- LAPreProcess::LAPreProcess - LAPreProcess constructor: 2.969 seconds

-I- ImpLAPreProcess::ImpLAPreProcess - Total number of probes: 100000

-T- ImpLAPreProcess::ImpLAPreProcess - ImpLAPreProcess constructor: 13.485 seconds

-R- ImpLookAhead::activate - max interval: (17819,18231) score: 102.348

-T- ImpLookAhead::activate: 15.625 seconds

-R- stepgram: max interval by improved la algorithm: (17819,18231) score: 102.348

-T- stepgram: improved la calculation: 32.079 seconds

##################################################################

##################################################################

##################################################################

-I- stepgram: starting GFA

-I- LAPreProcess::LAPreProcess - Total number of probes: 100000

-T- LAPreProcess::LAPreProcess - LAPreProcess constructor: 2.953 seconds

-I- GfaBase::activate - Found 1015 candidate intervals... reduced to 1 after merge

-T- GfaBase::activate: 7.641 seconds

-R- LookAhead::activate - max interval: (17819,18231) score: 102.348

-T- LookAhead::activate: 0.281 seconds

-R- stepgram: max interval by GFA: (17819,18231) score: 102.348

-T- stepgram: GFA calculation: 12.453 seconds

##################################################################

##################################################################

##################################################################

-I- stepgram: - starting OneSideGFA

-I- LAPreProcess::LAPreProcess - Total number of probes: 100000

-T- LAPreProcess::LAPreProcess - LAPreProcess constructor: 2.875 seconds

-I- GfaBase::activate - Found 11767 candidate intervals... reduced to 2 after merge

-T- GfaBase::activate: 11 seconds

-R- LookAhead::activate - max interval: (17819,18231) score: 102.348

-T- LookAhead::activate: 0.188 seconds

-R- stepgram: - max interval by OneSideGFA: (17819,18231) score: 102.348

-T- stepgram: - OneSideGFA calculation: 15.391 seconds

##################################################################

##################################################################

Run the Program

stepgram.exe

References:

1.     Doron Lipson, Yonatan Aumann, Amir Ben-Dor, Nathan Linial and Zohar Yakhini, "Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis". Proceedings of RECOMB '05, LNCS 3500, p. 83, Springer-Verlag, 2005.

1. M. A. Bender and M. Farach-Colton, “The LCA problem revisited”. In Proceedings of Latin American Theoretical Informatics (LATIN 2000), pages 88.94, 2000.