Lab In Bioinformatics

Implementation of the EM Algorithm to find the value of recombination fractions that maximizes the probability of the data.

by Liat Ramati and Galia Shlezinger.

21.2.2002

under guidance of

Prof. Dan Geiger and Ma`ayan Fishelzon

Problem

The current implementation of SuperLink calculates the likelihood of a pedigree or pedigrees given a value of theta. This value might not be the most probable value of theta for the given evidance.

The aim of the project was to find the values of theta that maximize the probability of the evidance for a given pedigree or pedigees. Our project was to find these values using the EM algoritm.

Implementation

We implemented our program as a function of SuperLink called: calcLikelihoodSIlink. Therefor our function uses the data structures in SuperLink and it's input files are in the same format as in SuperLink.

We implemented the EM algoritm by calculating the theta values based on the data given, updating the data according to the new theta values we calculated, and calculating the theta values again. This was done as long as the differance between two consecutive results for the theta values was smaller than a value we set in advance.

The differance between two consecutive result was calculated using this function:

diff = sum over all the loci [ (oldthetaval-newthetaval)/ (oldthetaval)]

There is also an option to stop the iteration according to the likelihood, in that case the differance is calculated by:

diff = (oldlikelihood-newlikelihood)/ oldlikelihood.

The calculation of the theta values at each iteration was done using the BTE Algoritm to find the probability functions for the selectors of the Loci of the pedigree now being calculated. From them we got the expected number of recombinant meioses and theta was calculated by dividing the number of expected recombinant meioses by the total number of meioses.

We implemented the BTE algoritm according to: K. Kask, R.Dechter, J. Larrosa & F. Cozeman. Unifying Tree-Decomposition Schemes for Automated Reasoning. http://www.ics.uci.edu/~dechter/publications/
 
 

Results

The format of the input files is the same as in SuperLink.

The third column of the table holds the value used as the stopping condition for the iterations of the EM algoritm , the stopping condition was calculated according to the difference in the theta values.

The values of the running time are in seconds.
 
Input files No. of people No. of Loci No. of  allels Loops
Stopping condition
Running time of original SuperLink
Running time when calculating optimat theta
No. of iterations
Time per iteration
Recombination values and convergance data.

 

pedfile1.dat,datafile1.dat 15 4 3-4 no
0.07
0.05 
94.883 
15  6.325  data1
pedfile2.dat,datafile2.dat 15 3 2-3 yes
0.005
0.07  3.725  0.745  data2
pedfile3.dat,datafile3.dat
10
10
3
no
0.07
0.26 
123.7 
24.74  data3
pedfile4.dat,datafile4.dat 15 3 2-3 no
0.005
0.04   0.971  10  0.0971  data4
pedfile5.dat,datafile5.dat 15 4 3 no
0.032
0.18  27.5  24  1.1458  data5
pedfile6.dat,datafile6.dat 10 5 2-3 no
0.032
0.03  0.2  0.02857  data6

The main problem we encountered was shortage in memory. The probability tables we calculate can become very large and the computer running the program runs out of memory to allocate for the program. When runing the program on a file contating 10 people and 25 Loci the program terminated with an error message printed as a result of the malloc function failing ,the same happens when running the program on input files containing 20 people and 3 Loci. However these restrictions on the size of the pedigree are not absolut , other factors effect the size of the probability tables such as the number of nodes eliminated by SuperLink or the elimination order (that is also set in SuperLink) that can effect the shape of the tree and so effect the merging of the buckets . For these reasons the former examples should be looked at as "ball park" figures.

Downloads

Full report (in hebrew)

Executable for Windows (zip)

Presentation.ppt