Efficient calculation of interval scores
for DNA copy number data analysis
by Oren
Levin and Amit Meshulam
April 2007, under guidance of Zohar Yakhini
Introduction
Alterations in DNA copy number are
characteristic of many cancer types and are thought to drive some cancer
pathogenesis processes. These alterations include large chromosomal gains and
losses as well as smaller scale amplifications and deletions. Because of their
role in cancer development, regions of chromosomal instability are useful for
elucidating other components of the process. For example, since genomic
instability can trigger the over expression or activation of oncogenes and the
silencing of tumor suppressors, mapping regions of common genomic aberrations
had been used to discover cancer related genes.
In Comparative Genomic Hybridization (CGH)
differentially labeled tumor and normal DNA are co-hybridized to normal metaphases.
Ratios between the two labels allow the detection of chromosomal amplifications
and deletions of regions that may harbor oncogenes and tumor suppressor genes.
In a more advanced method termed array CGH (aCGH), tumor and normal DNA are
co-hybridized to a microarray of thousands of genomic clones of BAC, cDNA or
oligonucleotide probes, allowing the determination of changes in DNA copy
number of relatively small chromosomal regions.
The Stepgram application is used
to detect aberrations in DNA copy number data from aCGH experiments. The
application is based on the interval score described in the paper "Efficient
Calculation of Interval Scores for DNA Copy Number Data Analysis"[1].
In short, the score of an interval is defined as the sum of its probes, divided
by the square root of its length (). Although the paper mentioned above addresses many
problems, we’ve focused on one in particular since it is the key to all others:
how can we find the interval with the maximum score correctly and efficiently.
The paper offers 2 different algorithms for that problem in details, we will
describe them in short.
Known Algorithms
The paper suggests 2 algorithms for finding the interval with maximal score:
LookAhead algorithm
The
algorithm operates by considering all possible interval starting points, in
sequence. For each starting point, we check the intervals with ending point in increasing
distance from
. The basic idea is to try and not consider all possible
ending point, but rather to skip some that will clearly not provide the
optimum. Assume that we are given two parameters
and
, where
is a lower bound on
the optimal score
and
is an upper bound on
the value of any single element
.
Assume
that we have just considered an interval of length with
, and
.
After
checking the interval , an exhaustive algorithm would continue to the next ending
point, and check the interval
. However, this might not always be necessary. If
is sufficiently small
and
is sufficiently large
then
may stand no chance of
obtaining
. For any x, setting
(skipping the next
intervals) an upper
bound on the score of
is given by
. Thus,
has a chance of
surpassing
only if
. Solving for
, we obtain
. Thus, the next ending point to consider is
, where
.
The
improvement in efficiency depends on the number of ending points that skipped.
This number, in turn, depends on the tightness of the two bounds and
. Initially,
may be set to
. As we proceed
is replaced by maximal
score encountered so far, gradually improving performance.
provides an upper bound on the values of single elements in
. In the description above, we used the global maximum
. However, using this bound may limit the usefulness of the
LookAhead approach, since even a single high value in the data will severely
limit the skip size. Thus, we use a local approach, where we bound the
value of the single elements within a window of size
. The most efficient definition of local bounds is the slope
maximum: For a window size
, for all
, pre-calculate,
.
Although
is not an upper bound
on the value of elements within the
-window following
, the value of
is indeed an upper
bound on
within the
-window, maintaining
the correctness of the approach. Note that the skip size must now be limited by
the window size:
. Thus, larger values of
give better
performance although preprocessing limits the practical window size to
.
Geometric Family Algorithm (GFA)
The paper defines a geometric family of intervals I
as: I =
where I(j) is defined as follows:
For j = 0, 1,…, log(1+ε)n, set kj=(1+ε)j
and Δj= εkj. then:
I(j) = .
In simple words, in each step we increase both the length of the intervals (kj)
and the space between the starting points of 2 consecutive intervals (Δj).
Since both increments are exponential, in length and in space, the result is a
set of intervals that is significantly smaller than all possible sub-intervals
in the range [0, n-1].
Lemma
(proof in the paper): Let I* be the optimal scoring interval. Let J be
the leftmost longest interval of I fully contained in I*. Then (J) ≥
(I*)/a,
where
.
Using the lemma, we can do as follows:
Find the maximal score of intervals in I, call it M,
and let M* denote the score of the maximal interval in for which J < M*/a ≤M/a, J can’t be the leftmost longest interval of I
fully contained in I. So, when searching for the maximum scoring interval, we
need not consider those for which J is the leftmost longest interval.
Project goals
Out project addresses 2 issues:
Improving LookAhead algorithm:
The general concept:
In the original LA algorithm, the upper bound's computation
is done for all the points. For each point, this computation is set by the values from
to
only.
In the Improved LA Algorithm, the computation of the upper
limit is from each point to the last point. By this method, in the max interval
calculation, the skip is not limited by any
value.
Preprocessing:
From
each pointall the points from this point to the end are grouped into intervals,
each interval size is
(except for the last
interval).
In an
empiric way, we selected .
For
each interval, the maximum point is computed using the Algorithm.
All the points in a specific interval are considered to be the maximum value found inside. If the maximum is a negative value, it is set to be zero. Taking these new values set for each point, the slopes are computed.
The
slope of a point is equal or larger than the average of the values in any
interval beginning with i. This is
guaranteed because the use of an upper bounds for the point’s value.
The
complexity of the algorithm’s
preprocessing is
and the maximum
value’s finding in an interval, is
.
In
the Improved LA algorithm’s preprocessing, for each point, no more than calculations of max values are done. Therefore the complexity
of the preprocessing is
.
Max
intervals calculation:
Same like the original algorithm, but the slope’s values are different.
The
step size, as computed in the original algorithm, is computed with the
new slopes (
). Note that, in contrast to the LA Algorithm,
is not limited to any
size.
An
alternate option is to calculate the original slopes as well, and then the step
size, is computed twice. First
is set to the original
slope and
is limited to
, then
is set to the new
slope that was calculated in the Improved LA Preprocessing. The maximum
is selected, and the steps are set accordingly.
In the alternate option the performance in the preprocessing is not as good as the proposed option.
Range
Maximum Query () Algorithm1
For
any array A of numbers, for indices and
between
and
, query
returns the index and
the value of the largest element in the sub array
.
The idea is to precompute each query whose length is a
power of two. That is, for every
_
Between and
and every
between
and
, maximum element is found in the block starting at
and having length
, that is, we compute
. Table
therefore has size
, and we fill it in time
by using dynamic programming. Specificaly, the maximum in a
block of size
is found by computing
the two maxima of its two constituent blocks of size
. More formally,
if
and
otherwise.
To avoid unnecessary iterations, only the first iteration are done.
Computation of the :
We select two overlapping blocks that entirely cover
the subrange: let be the size of the
largest block that fits into the range from
to
, that is let
. Then
can be computed by comparing the maxima of the following two
blocks:
to
and
to
. These values have already been computed, so we can find the
in constant time. The complexity
[2].
One-Side GFA:
OSGFA use is similar in concept to the original GFA, but while in each iteration in GFA we enlarge both interval length and spacing between intervals, in OSGFA we only enlarge the length. Formally:
We define I, a one-side geometric family of intervals:
Set e > 0.
For j = 0,1,…,n-1: I(j) = { [ j , j + ( 1 + e)i – 1 ] : i = 0,1,…,log(1+e)(n-j)
}
Lemma
1: Let I be an interval, and J – the leftmost longest interval of I
contained in I. Then:
Proof:
Suppose
that the length of J is (1+e)j (we consider the interval [i,i] as an
interval with the length of 1). J is the leftmost longest interval contained in
I, therefore . Then:
è
Lemma 2: Let I* be the interval with the optimal score, and J be the leftmost longest interval of I contained in I*. Then:
, with
.
Proof:
Set . Assume by contradiction that
. Denote I* = [u,v] and J = [u,w]. define A =
[w+1,v] (the interval that extends J to I*).
and
therefore:
According to our assumption, we now get:
è
So
our assumption was wrong, and indeed .
From
lemma 2 we get that we can use OSGFA in the same manner as GFA, but we can get
the same approximation (same Alpha) in OSGFA by using a bigger e: .
In
addition, the cover zone of interval [x,y] in OSGFA is smaller than it is in
GFA
( L-COV(x) = x and R-COV(y) = y + (y-x+1)1+e -
2 ).
Results
To test our suggestions we implemented the following:
- Naïve algorithm (exhaustive search).
- LookAhead algorithm.
- Improved LookAhead.
- GFA.
- OSGFA.
- Random input generator.
- Main program that produces a random input (or receives it as a file) and runs the selected algorithms.
Using the input generator we produced the following tests:
Size of input |
Number of amplifications |
Number of tests |
3000 |
1 |
100 |
6000* |
2 |
100 |
12000 |
3 |
50 |
24000 |
4 |
50 |
48000 |
5 |
30 |
96000 |
6 |
20 |
192000 |
7 |
10 |
* 4 tests of real data of about 6000 probes (Pollack)
were also tested and the times were similar
The random input was produced as follows:
-
Initial values of all points were drawn
from
- Input vector divided into 1-7 zones, depending on number of amplifications implanted, for each zone an amplification was added by:
o Drawing the amplification length from Geom(0.01), with a limitation that it must be at least 10 and no more than 20% of the zone length.
o Start point of the amplification in the zone is drawn uniformly.
o For each point in the amplification, we draw the amplification amplitude from Uniform(0.5,10).
- All of the parameters for the generation mentioned above are the default values in our program, but they can be changed by the user.
All algorithms were tested with the tests of 3000-12000 points, on larger tests we tried all except for the Naïve algorithm (takes too much time…).
* All running times refer to a PC with AMD Athlon XP 2500+ processor and 1GB RAM.
Graph 1: Average run-times of LookAhead, improved LA and some key functions as reference. The start point values that were chosen for the functions:
X1 => (3000,0.5) – same as improved LookAhead.
X4/3 => (3000,0.5) – same as improved LookAhead.
X3/2 => (3000,0.3) – same as LookAhead.
Graph 2: Average run-times of GFA, OSGFA and some key functions as
reference. The start point values that were chosen for the
functions:
X1 (0.38)=> (3000,0.38) – same as GFA.
X1 (0.47)=> (3000,0.47) – same as OSGFA.
X4/3 => (3000,0.47) – same as improved OSGFA.
Graph 3: Average run-times for all tested algorithms and some key functions as reference, all start points for the functions were (3000,0.3) – same as LA:
Graph 4: run-time scatter for LA, improved LA, GFA and OSGFA:
24000 (50) 48000 (30) 96000 (20) 192000 (10)
The code:
The main program can:
- Create a random input file or use an existing one.
o The input file should consist of the probes values separated by any non-digit char(s).
- Run any algorithm (or some of them) from: naïve, LookAhead, improved LA, GFA & OSGFA.
Command line arguments for the main program:
Usage: stepgram.exe
-input <file name> # name of file to read input from (or to
create if using -produce_input)
#The input file
should consist of the probes values separated by any non-digit char(s)
[-modes <comma-separated list of
modes>] # optional modes:
all,naive,la,imp_la,gfa,os_gfa
[-produce_input <size of
input>] # will produce random input,
can be used with -modes flag to run with these modes after generation of input
[-gfa_ep <epsilon>] # epsilon for GFA, default is 0.1
[-os_gfa_ep <epsilon>] # epsilon for one-side GFA, default is 0.25
[-la_window <window size>] # window size for LookAhead, given as the
exponent of input size, default is 1/2
[-imp_la_window <window
size>] # window size for improved
LookAhead, default is 1/3
The following flags are for random input
generation:
[-ri_vm <mean>] # original values are generated using
[-ri_vsd <sd>] #
defaults are: ri_vm = 0, ri_vsd = 0
[-ri_agp <p>] #
the amplification length is determined by Geom(ri_agp), but must be:
[-ri_aml <length>] #
larger than ri_aml and smaller than (input_size/ri_noa)*ri_matir
[-ri_matir <ratio>] # defaults are: ri_agp = 0.01, ri_aml = 10,
ri_amatir = 0.2
[-ri_amina <min>] # for each probe in the amplification
interval, amplitude is determined
[-ri_amaxa <max>] # using UNIFORM(ri_amina, ri_amaxa), default
is (0.5,10)
[-ri_noa <num>] # number of amplification to add.
[-h] # display this help message.
Example for an output (messages begin with “-I-“ for info, “-T-“ for time measurements, “-R-“ for results, “-W-“ for warnings and “-E-“ for errors):
F:\TECHNION\project in
bioinformatics\submit>stepgram.exe -produce_input 100000 -input
"F:\TECHNION\project in bioinformatics\submit\random_input_for_example.txt"
-modes la,imp_la,gfa,os_gfa -ri_noa 5
Command line: stepgram.exe
-produce_input 100000 -input F:\TECHNION\project in
bioinformatics\submit\random_input_for_example.txt -modes la,imp_la,gfa,os_gfa
-ri_noa 5
##################################################################
##################################################################
-I- stepgram: starting
standart LookAhead algorithm
-I- LAPreProcess::LAPreProcess
- Total number of probes: 100000
-T- LAPreProcess::LAPreProcess
- LAPreProcess constructor: 20.375 seconds
-R- LookAhead::activate - max
interval: (17819,18231) score: 102.348
-T- LookAhead::activate:
32.093 seconds
-R- stepgram: max interval by
la algorithm: (17819,18231) score: 102.348
-T-
stepgram: la calculation: 52.468 seconds
##################################################################
##################################################################
##################################################################
-I- stepgram: - starting
improved LookAhead algorithm
-I- LAPreProcess::LAPreProcess
- Total number of probes: 100000
-T- LAPreProcess::LAPreProcess
- LAPreProcess constructor: 2.969 seconds
-I-
ImpLAPreProcess::ImpLAPreProcess - Total number of probes: 100000
-T- ImpLAPreProcess::ImpLAPreProcess
- ImpLAPreProcess constructor: 13.485 seconds
-R- ImpLookAhead::activate -
max interval: (17819,18231) score: 102.348
-T- ImpLookAhead::activate:
15.625 seconds
-R- stepgram: max interval by
improved la algorithm: (17819,18231) score: 102.348
-T- stepgram: improved la
calculation: 32.079 seconds
##################################################################
##################################################################
##################################################################
-I- stepgram: starting GFA
-I- LAPreProcess::LAPreProcess
- Total number of probes: 100000
-T- LAPreProcess::LAPreProcess
- LAPreProcess constructor: 2.953 seconds
-I- GfaBase::activate - Found
1015 candidate intervals... reduced to 1 after merge
-T- GfaBase::activate: 7.641
seconds
-R- LookAhead::activate - max
interval: (17819,18231) score: 102.348
-T- LookAhead::activate: 0.281
seconds
-R- stepgram: max interval by
GFA: (17819,18231) score: 102.348
-T- stepgram: GFA calculation:
12.453 seconds
##################################################################
##################################################################
##################################################################
-I- stepgram: - starting
OneSideGFA
-I- LAPreProcess::LAPreProcess
- Total number of probes: 100000
-T- LAPreProcess::LAPreProcess
- LAPreProcess constructor: 2.875 seconds
-I- GfaBase::activate - Found
11767 candidate intervals... reduced to 2 after merge
-T- GfaBase::activate: 11
seconds
-R- LookAhead::activate - max
interval: (17819,18231) score: 102.348
-T- LookAhead::activate: 0.188
seconds
-R- stepgram: - max interval
by OneSideGFA: (17819,18231) score: 102.348
-T- stepgram: - OneSideGFA
calculation: 15.391 seconds
##################################################################
##################################################################
Run the Program
References:
1. Doron Lipson, Yonatan Aumann, Amir Ben-Dor, Nathan Linial and Zohar Yakhini, "Efficient Calculation of Interval Scores for DNA Copy Number Data Analysis". Proceedings of RECOMB '05, LNCS 3500, p. 83, Springer-Verlag, 2005.