Haplotyping By MPE Query

by Sofia Grinberg and Adi Mano.

under guidance of Ma'ayan Fishelzon and Assoc. Prof. Dan Gieger.

       March 2004

 

Problem

Performing haplotyping on the input data, inferring the most likely haplotypes for the individuals in the input pedigrees. The most likely haplotypes are the ones which maximize the probability of the data.

Our goal in this project is to upgrade the SuperLink program to be able to perform haplotyping via the query of MPE (Most Probabl Explanation. The MPE task is to find the assignment   such that .

 In other words, if the pedigree is translated into a Bayesian network, we need to find an assignment to all the variables of the Bayesian network, that together with evidence maximize the probability of the data.

 

Implementation

The implementation is based on SuperLink's implementation of the SuperMLink query. One of the main differences from the SuperMLink query is that maximization is done where summation had place before. Meaning, that in the elimination variable process, we select the best values for the variable under the constrains of other variables in its function. These values are collected during the variable elimination and this is the "Backwards" stage of the algorithm presented before. In the next stage we analyze the collected information and this is the "Forward" stage.

Our program can also retrieve K different solutions to the Haplotyping problem. That is K solutions (if exist) with the same maximum probability (maximum LOD score). In order to add this ability to the project (without making major changes) we create an array in size of K that holds for each K the MPE_arg obtained in the backward phase for this Ks solution. The extension of the solution is done upon variable conditioning (variable breaking), where different sub trees are created for each value of the break node. If the maximum score can be obtained by more then one sub tree then an extension of the solution can be performed.

We also allow to calculate K results within a predefined interval from the highest score that can be obtained by the query. For example if the highest score is -1031.35 then assignments with the score of -1033.55 can be also obtained (for some given epsilon). 

When ever a new solution is found we calculate a delta between the new solution and the best one found so far. If this delta is within the boundaries of the defined interval the solutions is saved. For the solutions with lower probabilities (could be the old or new solutions) the delta is kept. At the forward phase the solution are bubble sorted in order to be printed in order of their probability.

Results

We tested our project in several ways. At first we checked the correctness of the flow on a small pedigree. That way we could compute and compare the exact calculation during the all process. In the second stage, we tested our program on 6 sets of different pedigrees examples, which appear on the SuperLink's web.

The second stage of testing consists of two groups:

bullet Comparison of the Haplotyping query vs. SuperMLink program on the performance issue.
bullet Comparison of the our Haplotyping query vs. Ma'ayan's version. In this group of tests we compared the MPE probability and  performance.

We also tested our program on the time consumption for different parameters, like the numbers of solution required (K) or the size of the interval that allowed to hold the solution (epsilon).

The Results Summary

Downloads

bullet

Full report (.doc)

bullet

Source code (.exe)