## Performing approximate inference by using a heuristic which ignores extreme markers in the likelihood

Winter 2004, under guidance of Prof. Dan Geiger and Ma'ayan Fishelzon

by

Goal

Provide the means for performing approximate inference via a heuristic which ignores extreme markers when computations become too strenuous to be performed exactly.

Currently SUPERLINK performs exact inference. Regardless of how well algorithms are being perfected, linkage analysis provides an ever-growing challenge to computing because some disease models depend on multiple loci, markers are highly polymorphic, and there are many markers are available. It is therefore crucial to provide the means for approximate computations with error bounds.

Algorithm’s Prototype:

1)     Start with the total number of markers as specified in the user input.

2)     Determine an elimination order for the problem as is.

3)     If the complexity coefficient of the best elimination order found greater than some determined threshold ,clip off one of the extreme markers and goto 2

4)     When we have found an elimination order with complexity coefficient below the desired threshold call the function to calculate likelihood .

Strategy (to determined the clipped markers)

Concisely our strategy is to keep the closest markers for the affection locus, as long as there is relatively “reasonable” number of typed persons at each one of these markers.

In order to obtain a good heuristic we have first to determine what are the parameters that influence result accuracy, then for each one of these parameters we have to evaluate how does the heuristic value depend on this parameter (monotonous, decreasing, increasing, linear etc…), for this phase we will reference later as local factor. And then we have to weigh up between all the heuristic values from the different parameters such that if parameter A is more important than parameter B then the value contributed by A should be greater and greater correspondingly, for this phase we will reference later as global factor.

Data structure

Info:

This data structure is used in order to hold the marker information in order to calculate the heuristic value.

Following is a definition of Info structure :

typedef struct Info {

double heurVal;

double distFromDisease;

int possibleAlleles;

} Info;

heurVal indicates the heuristic value for this entry, in our implementation high heuristic values were assigned for those markers who are most likely to be clipped off. I.e. each iteration we clip the marker with the highest heuristic value.

distFromDisease is the distance of the marker in this entry from the affection status locus, this value calculated by Haldane function.

possibleAlleles number of possible alleles in this markers.

|LociInformation is an array of Info, of size equal to the numLoci. Entry i holds Info for the locus (Marker) in position i on the chromosome ,Each time we clip a marker the correspond entry  is deleted and we shift to the left all the entries in right side. I.e. we maintain LociInformation updated and matching the remaining markers situation.

1.     Distance from affection status locus.

a.       Why does it matter?

Genes that lie close together on the same chromosome will tend to be inherited together, because it will be rare for a crossover to occur between the loci at meiosis.

b.      Local factor :

The closer together the loci are, the less likely crossovers will be and the fewer recombinants will be observed. If loci are far apart or on different chromosomes then recombination will occur by chance in 50% of meioses. The recombination fraction ranges from 0 (tight linkage) to 0.5 (no linkage).

So the first and simple conclusion is that the closer the marker is for the affected status locus the more high value it gets. But besides the distance should be relative rather than just an absolute distance, which means the heuristic determined by how much the marker is closer to the disease locus relatively to other markers.

This is better because it is more dynamic and attached for the markers topology on the chromosome. For example in case that all the markers are approximately at the same distance from the disease locus there is a high opportunity to clip of not the farthest marker of them but to “ignore” the influence of this factor and let other factors to decide.

This is made by updating each entry on LociInformation by the difference between the current average and the average in case that the corresponding marker is clipped off. In addition to emphasize that small differences in distances can lead to big differences in the results accuracy, we shall arise the previous expression to 3.

Algorithm prototype:

Let dis (i) = LociInformation[i].distFromDisease//distance from disease locus

D=AVG (dis (1), dis (2) dis (numLoci))

For i =1 to numLoci

End

Notice that if the distance is less than the average distance the contribution will be negative

c.       global factor:

As mentioned before this parameter is the most important one that influence the result accuracy so it crucial that it takes great factor relatively to other parameters,so in our implementation :

And in this way we ensure that the distance influence is dominant and decisive .

2.     Informativeness of the marker

(Percentage of typed individuals at this marker

a.      Why does it matter?

In order to determine if a specific marker and a disease locus are linked together, we should examine as many as possible crossbreeds where the parents and children are typed at this marker. Since if there are a little number of crossbreeds who satisfy this condition then we can’t determined definitely if the marker and disease locus are linked or not linked, because it could be that the result is just “coincidental“,

b.     Local factor :

For each marker it’s very vital that as many as possible number of individuals are typed at this marker from all available individuals on the pedigree, so what determines is the percentage of typed person at this locus rather than their number. In addition it’s reasonable that the heuristic value added to the each entry not linear to the percentage of the number of untyped individuals at the corresponding  marker, instead  the more is the percentage close to 1 the more the heuristic value is added, that to means for small percentages our heuristic is “forgiving” and the punishment for that is light and when this percentage grows up the heuristic value increases by a little, but for big percentages then every individual becomes crucial so the punishment is heavy and when this percentage grows more the heuristic value increases “crazy”.

Thus the heuristic function is from this from:

* this is a prototype function, for the precise one look at source code

Where  is the number of untyped individuals at marker i.

We can obtain more information than just count the number of untyped persons, we can investigate the “environment” of each individual (number of his children and if he has parents and the status of these individuals in the corresponding marker).

For example let’s look at specific marker, it is reasonable that the information we loose if there is untyped individual at this marker which has one typed children is less than the information we loose from untyped individual which has 4 typed children, however If one parent has not been genotyped, one may still use information from the other parent under certain conditions. The typed parent must be heterozygous, and obviously if the affected child has the same genotype then one cannot tell which allele has been transmitted. If the child is homozygous, then one can deduce which allele has been transmitted, but the pair must still be discarded from the analysis because otherwise one will introduce a bias in favour of commoner alleles. If a parent is missing, one should only incorporate the remaining parent-child pair if both are heterozygous and have different genotypes.

On the other hand an untyped individual which has no typed siblings and has two typed parents (at the same marker) cause greater information loose than untyped individual which has typed siblings (that we can inference some information from this family)  or on or more of his parents are untyped so in our implementation in the first phase we counter the number of untyped individual with  a factor that depends on the individual “environment” for each marker , and then we calculate heuristic value according to the former equation.

c.      Global factor:

in our implementation this parameter consider primary after the Distance parameter, that we keep the closer markers as long as they have “sufficient” information.

3.     number of  possible Alleles

a.      Why does it matter?

We mean by that the sum of possible alleles at all the individuals as concluded from the pedigree.

The main purpose of our algorithm is to reduce the complexity coefficient, and in the same time to maintain as possible the result accuracy .so it’s desired to clip alleles as less as possible, that means we shall clip markers that reduce the complexity coefficient as more as possible, and it’s known that the more the marker we clip has possible alleles the more the complexity coefficient reduced.

b.     Local factor

So the markers that relatively have many possible alleles have the priority to be clipped first.

c.      Global factor

It’s not clear how does the complexity coefficient acts as function of the number of possible alleles,besides by the Experimental tests we notice that it is not unusual that clipping marker with less number of alleles than another marker lead to less complexity coefficient.

So we let this parameter minimal influence at the heuristic value.

4.     distance from neighbor markers

a.      Why does it matter?

As we explained before two genes that lie close together on the same chromosome will tend to be inherited together, so if a marker is too close to one of his neighbors the will be tied together and there is minor chance for recombination between them, so in proximity can ascribe to them as one marker, and the other marker doesn’t add any information.

b.     Local factor:

In order that the approximation that two close marker behave together in nearly all the time they should be very very close together.

c.      Global factor:

In case that the distance between markers not as sufficient that we can consider two markers behave as one marker and clip one of them so we give this parameter small affection on the heuristic value and make sense just when the markers are very close and the previous parameters are close.

General notices:

In case of sex difference in order to compute the Distance from the affected locus, or the distances between the markers we use both maleThetaVals and femaleThetaVals and weigh between them by the ratio of male number and female number in the pedigrees.

The constants and factors on the equation were determined by Experimental results.

-  Executable file