Project in bioinformatics
Winter 2003/4 under guidance of Prof.
Dan Geiger And Ma’ayan Fishelzon
Submitted By:
Gzayil Firass email: sfgzayil@t2.technion.ac.il
Zeidan Mohammed
email: smaz@t2.technion.ac.il
Goal of the project:
Performing approximate inference via a heuristic which ignores extreme markers in likelihood computations, when those are too strenuous to be performed exactly, since, they may take a long time to finish the calculation or may cause memory overflow.
Requirements:
To code a function that would satisfy the project goal.
This function is coded within the super link code and uses its internal data structures.
The function keeps removing markers from the data structures till reaching the desired predefined complexity (which is lower than the original one), this will enable super link (after reducing the complexity coefficient below the predefined one), run on minimized data, meaning, faster calculation. As may be expected, this way we loose information that is needed to have full accurate results. Our task was to develop means to remove less valuable markers in an attempt to have results as close as possible to the accurate ones, and to implement those means in this function.
Implementation:
1. Start with the total number of loci as specified in the input files.
2. Determine an elimination order of the problem.
3. Derive the complexity coefficient of the elimination order.
4. If it is less than the predefined complexity, we run the superlink program to calculate the likelihood without any further operations.
5. Choose one marker and remove it from the data structures.
6. goto 3
This is an abstract algorithm of the solution.
The main problem was to determine the marker to be clipped without harming the accuracy of the likelihood result as possible.
How to choose a marker:
We considered few characteristics of the loci to help choosing the extreme ones:
1. Te distance of a locus from the affection.
This characteristic indicates how close a locus to the affection; hence, we avoid clipping loci located too close to the laffection.
2. The number of persons in which the phenotype of a locus is unknown.
As bigger this number, as less the information a locus provides. So it’s recommended to get rid of it.
3. The number of possible alleles a locus has.
A locus with a big number of possible alleles causes higher complexity.
4. The closeness of a locus to it’s flanking neighbors.
If two loci are too close to each other, it’s possible to give up one of them since we have the information from the other.
Data structures:
We had only one data structure, an array that holds former four characteristics for each locus.
The array size is the number of loci, and it is of type prio:
Prio is a struct and defined as follows:
typedef struct Prio{
double leftDist;
double rightDist;
int numUnknown;
double totalDist;
double totalPrio;
}prio;
leftDist- is the distance between a locus and it’s left neighbor.
rightDist- is the distance between a locus and it’s right neighbor.
numUnknown- the number of persons that their phenotype at this locus is unknown.
totalDist- the distance of this locus from the affection.
totalPrio- the total priority of this locus.
*As lower a locus` priority as higher the chance to be clipped off!
Hence, the total memory complexity is O(number of loci)
Algorithm of setting the priority of the loci:
1. Calculate and set the values of every field of each cell of the priority array corresponding with the proper locus. A locus located at index i in the array of the loci (loci), corresponds with index i in the priority array (prioLoci).
· leftDist / rightDist / totalDist - are calculated once by moving along the array that holds the recombination values between the loci è time complexity is O(number of loci)
· numUnknown- is calculated by moving all the persons in all pedigrees at each locus calculationè time complexity is O[(number of loci)*(number of people in all pedigrees)]
2. Calculate the total priority:
a. As mentioned above we have no interest in removing loci located too close to the affection; therefore, we “protect” the area surrounding the affection. The size of the area (dangerousInterval) is determined by a percentage defined (the user may change it (percentageForInterval)). For each locus in the protected area we set a high positive value (100) to its total priority (totalPrio), and zero for the others. If however, during the run of the program, no loci left out the protected are, it shrinks by one from each side.
b. Calculate the average of the number of unknown phenotype of any person at any locus, and the average of alleles at each locus outside the protected interval,è time complexity is O(number of loci)
c. For each locus (out of the protected area) that has number of unknown greater than the average, reduce 10 from its total priority; (the new total priority is -10). The same is done for each locus (out of the protected area), and has number of possible alleles greater than average.
d. We had defined a critical distance (criticalDist) so that any two adjacent loci with recombination frequency equal or less than it, reduce a defined positive value from total priority (totalPrio) (minVal 30) – the user can change it.
(c + d) in one loop over priority arrayè time complexity is O(number of loci)
e. In choosing the marker, the search selects the locus with the lowest priority, if there is more than one locus with the lowest priority, take the one with higher number of alleles, if the alleles are equal take the farthest one from the affection locus.
This is done by moving along the priority array only onceè time complexity is O(number of loci)
3. Remove the locus chosen an
adjust the data structure to get consistent, by deleting the marker chosen from
the array loci and shift the others è (time complexity
is O(number of loci) in worst case shift), also
updating the two dimensional array (maleThetaVals and
femaleThetaVals) è(time complexity
is O(number of loci) in worst case shift). Also updating the pedigree, shift
the loci for each person è(time complexity is O((number of loci)*(number of
people in all pedigrees)) in worst case shift).
prioLoci array:

This is how the priority
array looks like at running time. In the middle of the diagram, the affection
locus is located, and the rectangle describes the protected (dangerous) area
which includes the loci surrounding the affection within a fixed length, the
priority of these loci is denoted by 100 and the rest of loci receive priority
according to the data each one holds. The flanking arrows are to say that the
width of the protected area may shrink if this is needed, if no loci left out
of it (very low chance to happen).
As already been mentioned
above, we choose the locus that has lowest priority of the loci.
Results and
Conclusions:
*in all the file bellow, the
affection locus is set to be in the middle of the map
Resuls (same number of alleles for each locus)
|
File |
#people |
#loci |
#Nodes |
Lod score/likelihood Exact. |
Lod score/likelihood Approx. |
Absolute error |
Error rate % |
Run time Exact. (sec) |
Run time approx. (sec) |
#loci removed |
Result files.
(Exact) |
Result files.
(Approx) |
|
57 |
21 |
659 |
-0.078388 |
-0.078245 |
0.00014 |
0.182425882 |
3.140000 |
3.890 |
1 |
|||
|
57 |
21 |
1838 |
-1.731987 |
-1.102788 |
0.6292 |
36.32815951 |
22.516000 |
8.000 |
8 |
|||
|
57 |
23 |
2007 |
-1.232434 |
-2.253489 |
1.02106 |
82.84865559 |
21.110000 |
20.406 |
7 |
|||
|
57 |
25 |
2190 |
-2.577146 |
-2.251073 |
0.32607 |
12.65248457 |
24.671000 |
35.547 |
6 |
|||
|
57 |
27 |
2380 |
-1.356160 |
-0.723942 |
0.63222 |
46.61824563 |
1132.828000 |
293.813 |
7 |
|||
|
57 |
29 |
2531 |
-1.282169 |
-0.706973 |
0.5752 |
44.86116885 |
2371.485000 |
622.578 |
5 |
|||
|
25 |
23 |
851 |
-1.145314 |
1.614273 |
2.75959 |
240.9458891 |
76.266000 |
45.515 |
6 |
|||
|
25 |
25 |
922 |
-0.354794 |
-1.584810 |
1.23002 |
346.684555 |
95.843000 |
56.297 |
8 |
|||
|
25 |
27 |
996 |
-1.623843 |
-0.578665 |
1.04518 |
64.36447366 |
120.156000 |
49.500 |
12 |
|||
|
25 |
29 |
1062 |
-0.260679 |
-0.630194 |
0.36952 |
141.7509657 |
89.235000 |
71.937 |
13 |
This
kind of file in which we assigned the number of alleles for each locus,
reflects the effect of the number of unknown and the
recombination values between the loci, on the final results of our implementation.
As it seems from the results, our implementation
stands it
nice, since we had only one file in which we obtained relatively high error
distance.
We
have obtained the following graphs from the table above. The first graph below
describes the absolute error (error size) as a
function
of the number of loci been removed.
The
absolute error is generally low (except one file) as it seems from the table
the same as from the graph.

The
following graph describes the absolute error as a function of the number of
loci at each side of the affection locus.
In
this group, from a certain point, the more loci at each side of the affection
the lower the absolute rate is.

Result (same known)
|
Files |
#people |
#loci |
#Nodes |
Lod score/likelihood Exact. |
Lod score/likelihood Approx. |
Absolute error |
Error rate % |
Run time Exact. (sec) |
Run time approx. (sec) |
#loci removed |
Result files.
(Exact) |
Result files.
(Approx) |
|
50 |
21 |
1716 |
-2.105777 |
-1.588505 |
0.51727 |
24.5644244 |
3.2500 |
2.547000 |
5 |
|||
|
50 |
23 |
1855 |
-0.798646 |
-0.677597 |
0.12105 |
15.1567778 |
2.7960 |
2.828000 |
2 |
|||
|
50 |
25 |
2033 |
-2.237754 |
10.164054 |
12.4018 |
554.207835 |
5.6560 |
6.343000 |
14 |
|||
|
50 |
27 |
2197 |
-1.301117 |
-1.727963 |
0.42685 |
32.8061197 |
12.6250 |
10.89000 |
8 |
|||
|
50 |
29 |
2330 |
-2.227370 |
0.892379 |
3.11975 |
140.064246 |
12.1710 |
13.81200 |
15 |
|||
|
50 |
31 |
2472 |
-2.339975 |
-2.540232 |
0.20026 |
8.55808289 |
8.8590 |
8.171000 |
9 |
|||
|
50 |
33 |
2632 |
-2.024760 |
-0.525317 |
1.49944 |
74.0553448 |
10.6250 |
7.421000 |
9 |
|||
|
50 |
35 |
2779 |
-0.705142 |
1.642086 |
2.34723 |
332.873095 |
7.8120 |
6.671000 |
10 |
|||
|
50 |
37 |
2919 |
-3.120746 |
1.638174 |
4.75892 |
152.493026 |
7.3120 |
8.156000 |
9 |
|||
|
50 |
39 |
3078 |
-0.294515 |
3.999787 |
4.2943 |
1458.0928 |
18.2960 |
12.40600 |
11 |
In these files, all the loci
have the same number of known phenotype in the people given in the input,
(matter fact we assigned the loci to be all known at any person – since we take
into account the average - as been described above – this wouldn’t have any
effects). According to our implementation, in this case we remove – mainly loci
with the highest number of alleles in an attempt to
reduce the complexity coefficient. As it is obvious from the
table and the graph bellow (the first one) the more loci removed, the bigger
the absolute rate is.

Also in this case, from a
certain point, the more loci at each side of the affection the lower the
absolute rate is.

Results (same distance)
|
Files |
#people |
#loci |
#Nodes |
Lod score/likelihood Exact. |
Lod score/likelihood Approx. |
Absolute error |
Error rate % |
Run time Exact. (sec) |
Run time approx. (sec) |
#loci removed |
Result files.
(Exact) |
Result files.
(Approx) |
|
30 |
15 |
686 |
-0.386085 |
2.580186 |
2.96627 |
768.294806 |
0.4840 |
0.9220 |
8 |
|||
|
30 |
17 |
750 |
-0.231126 |
0.934130 |
1.16526 |
504.164828 |
0.6870 |
0.9840 |
4 |
|||
|
30 |
19 |
837 |
-0.456340 |
-0.586443 |
0.1301 |
28.5101021 |
0.7340 |
1.3430 |
6 |
|||
|
30 |
21 |
927 |
-0.293532 |
0.739871 |
1.0334 |
352.058038 |
0.890 |
1.140 |
2 |
|||
|
30 |
23 |
1012 |
-0.645523 |
-0.571646 |
0.07388 |
11.4445186 |
1.0150 |
1.6710 |
4 |
|||
|
30 |
25 |
1095 |
-0.586887 |
-0.288023 |
0.29886 |
50.923602 |
1.0460 |
2.5150 |
8 |
|||
|
30 |
27 |
1174 |
-0.345031 |
-1.090526 |
0.7455 |
216.066093 |
1.000 |
3.250 |
11 |
|||
|
30 |
29 |
1267 |
-0.641469 |
-0.914389 |
0.27292 |
42.5460934 |
3.2180 |
4.2030 |
13 |
|||
|
30 |
31 |
1341 |
0.275094 |
-1.297687 |
1.57278 |
571.724938 |
2.610 |
4.9060 |
14 |
|||
|
30 |
33 |
1435 |
-0.321124 |
-1.310705 |
0.98958 |
308.161645 |
2.860 |
5.8280 |
15 |
All loci in these files have
the same recombination values between them = 0.1.
In general the absolute error
is low except the first file, however, this file has relatively a low number of
loci from each side of the affection and half of the loci were removed. So this
particular file is not a representative one.