# Calculating A Family Pedigree Using Monte Carlo Approximations

### Problem

Regardless of how well its algorithms are tuned, linkage analysis poses an ever-growing challenge to computing: disease models may depend on multiple loci, markers are highly polymorphic, and many markers are available. It is therefore crucial to identify automatically when a model is too large to be handled by exact inference, and to provide, at the user's request, the means for approximate computation with error bounds. Several theoretical approximation algorithms have been developed to address this challenge. Since the current implementation of SuperLink performs exact inference, the goal of this project was to provide the means for performing approximate inference.

The difficulties we faced were on two levels: finding an appropriate theoretical model and tailoring it to SuperLink.

### Implementation

The theoretical basis of our work is found in [Dagum and Luby 1997], which describes and proves a bounded approximation algorithm for Bayesian network inference. The approach, as presented in the article, is an improvement on Evidence Weighing [Fung and Chang 1989].

The Bayesian network is traversed many times. In each traversal, node values are randomized based on the probability tables.

Expect nodes are the nodes of a Bayesian network whose probability tables do not sum to 1: the sum over the values of such a node, given the values of its predecessors, is less than 1. It can be proved that the average, over traversals, of the product of the generated probabilities of these nodes converges to the exact inference value of the network.

The implementation is confined to a single module that performs the approximate inference (MonteCarlo.[ch]). A few general definitions must be set in settings.h. The code itself contains a significant amount of documentation.

Implementing the algorithm in SuperLink required us to disable some of SuperLink's features. SuperLink tries to reduce the number of nodes in the network and thereby reduce computation; this reduction increases the number of expect nodes, and with it the number of computations needed by the approximation algorithm. Another difficulty was isolating the probability table of a specific node.

A theoretical bound on the number of traversals, computed according to [Dagum and Luby 1997], can be used, but is not recommended since it requires too much time. Alternatively, the program can be supplied with the number of traversals the approximation algorithm should make.

### Results

We have managed to show that the algorithm is not suitable for the purposes of SuperLink.

In all the examples we used, exact inference was faster by far. The main problem that the SuperLink environment poses for this algorithm is the use of logical values (i.e., zero and one probabilities) or probabilities close to logical. A simple example is described in diagram 1; the lower node is the expect node. Notice that the sum over its values, given the predecessor's value, is not 1. In 99 out of 100 traversals we will randomize the value 1 for the top node, so the probability of the expect node will be zero. In the remaining traversal we will randomize the value 0 for the top node, and consequently the expect node will contribute a probability of 0.5. Dividing 0.5 by 100 (the sum of the expect-node probabilities divided by the number of traversals made) gives the end result, 0.005. But what if we only make 50 traversals, or if in the 100 traversals we never hit 0 in the top node?
The example gives an intuitive understanding of the problem: where logical or close-to-logical values are used, a very large number of traversals is needed in order to converge to the inferred value.

In addition, a theoretical bound on the number of traversals, as shown in [Dagum and Luby 1997], validates this view. We used the MakePed utility to create the input files, with the following parameters:

- Number of loci desired in the data file: 2
- Place of the disease locus: 1
- Number of null loci: 0
- Number of alleles per locus: 3
- Type of locus no. 0: alleles type locus
- Type of locus no. 1: alleles type locus
- Recombination value: 0.1 / 0.3 / 0.5 (3 possible values for every assigned size of pedigree)
- No multiple marriages
- No loops
- Size of pedigree: 3 to 18

Following are sample runs on randomly generated input files. The files were varied on two criteria: pedigree size and recombination factor. The pedigree size was expected to make the Bayesian network larger and harder to handle for both tested approaches; the recombination factor was considered a means to affect the logical-values problem.

Values are log-likelihoods; calculation times are in seconds. The number next to MonteCarlo is the number of traversals over the network.

| Pedigree Size | Recombination Factor | Deterministic Value | Deterministic Time | MonteCarlo 2000 Value | MonteCarlo 2000 Time | MonteCarlo 20000 Value | MonteCarlo 20000 Time | MonteCarlo 200000 Value | MonteCarlo 200000 Time |
|---|---|---|---|---|---|---|---|---|---|
| 3 | 0.1 | -2.747035 | 0.000000 | -2.748870 | 0.160000 | -2.743392 | 1.270000 | 12.657098 | 8.290000 |
| 4 | 0.1 | -2.580212 | 0.000000 | -2.683621 | 0.170000 | -2.620569 | 1.420000 | -2.599492 | 9.950000 |
| 5 | 0.1 | -6.111203 | 0.000000 | -6.199487 | 0.330000 | -6.085544 | 2.040000 | -6.120306 | 16.040000 |
| 5 | 0.3 | -5.706751 | 0.000000 | -5.667138 | 0.330000 | -5.697237 | 3.080000 | -5.711430 | 17.080000 |
| 5 | 0.5 | -5.365020 | 0.050000 | -5.296761 | 0.160000 | -5.426840 | 1.260000 | -5.376147 | 8.570000 |
| 6 | 0.1 | -6.698798 | 0.000000 | -6.600389 | 0.330000 | -6.667788 | 3.130000 | -6.671289 | 20.430000 |
| 6 | 0.3 | -7.671187 | 0.000000 | -7.582318 | 0.330000 | -7.645987 | 3.730000 | -7.665147 | 18.730000 |
| 6 | 0.5 | -8.625642 | 0.000000 | -8.584443 | 0.220000 | -8.640189 | 1.430000 | -8.623507 | 8.570000 |
| 7 | 0.1 | -8.974668 | 0.000000 | -9.571315 | 0.440000 | -9.015012 | 3.790000 | -8.969255 | 23.950000 |
| 7 | 0.3 | -8.046899 | 0.000000 | -8.035737 | 0.390000 | -8.039623 | 4.500000 | -8.047561 | 22.470000 |
| 7 | 0.5 | -7.172967 | 0.000000 | -7.060475 | 0.220000 | -7.106232 | 2.250000 | -7.183487 | 12.020000 |
| 8 | 0.1 | -7.172967 | 0.000000 | -7.060475 | 0.220000 | -7.106232 | 2.250000 | -7.183487 | 12.020000 |
| 8 | 0.3 | -9.690607 | 0.000000 | -9.950005 | 0.490000 | -9.620946 | 4.500000 | -9.696056 | 24.440000 |
| 8 | 0.5 | -9.186951 | 0.000000 | -8.910845 | 0.280000 | -9.308785 | 2.420000 | -9.162657 | 15.160000 |
| 9 | 0.1 | -8.755576 | 0.060000 | 0.000000 | 0.710000 | 0.000000 | 5.820000 | -8.420663 | 58.550000 |
| 9 | 0.3 | -11.232086 | 0.000000 | 0.000000 | 0.600000 | -11.305155 | 6.200000 | -11.191212 | 32.730000 |
| 9 | 0.5 | -11.040140 | 0.000000 | 0.000000 | 0.330000 | 0.000000 | 3.460000 | -11.167107 | 18.120000 |
| 10 | 0.1 | -7.919520 | 0.060000 | 0.000000 | 0.710000 | -7.921716 | 5.930000 | -7.946076 | 37.680000 |
| 10 | 0.3 | -8.837329 | 0.000000 | 0.000000 | 5.380000 | 0.000000 | 32.240000 | -9.049059 | 305.670000 |
| 10 | 0.5 | -9.956336 | 0.000000 | 0.000000 | 0.330000 | -9.815304 | 3.250000 | -10.162091 | 18.510000 |
| 11 | 0.1 | -12.230570 | 0.000000 | 0.000000 | 0.660000 | -12.781721 | 6.210000 | -12.389024 | 38.720000 |
| 11 | 0.3 | -10.367531 | 0.000000 | -10.517442 | 0.610000 | -10.341351 | 5.280000 | -10.382309 | 31.740000 |
| 11 | 0.5 | -12.204816 | 0.000000 | 0.000000 | 0.380000 | -12.151225 | 3.620000 | -12.090527 | 19.170000 |
| 12 | 0.1 | -13.708077 | 0.050000 | 0.000000 | 0.770000 | -13.529714 | 6.270000 | -13.383586 | 39.600000 |
| 12 | 0.3 | -12.149852 | 0.000000 | 0.000000 | 0.940000 | 0.000000 | 7.080000 | -12.262694 | 46.090000 |
| 12 | 0.5 | -14.045091 | 0.000000 | 0.000000 | 0.330000 | -14.335848 | 3.840000 | -13.974120 | 21.140000 |
| 14 | 0.1 | -13.246331 | 0.160000 | 0.000000 | 1.100000 | -13.029461 | 7.470000 | -13.401072 | 54.380000 |
| 14 | 0.3 | -16.064232 | 0.060000 | 0.000000 | 1.210000 | 0.000000 | 9.170000 | -16.271196 | 101.060000 |
| 14 | 0.5 | -19.663534 | 0.000000 | 0.000000 | 0.490000 | 0.000000 | 4.670000 | 0.000000 | 25.490000 |
| 16 | 0.1 | -14.958027 | 0.000000 | 0.000000 | 1.210000 | 0.000000 | 8.570000 | 0.000000 | 90.400000 |
| 16 | 0.3 | -15.982722 | 0.050000 | 0.000000 | 1.210000 | 0.000000 | 8.680000 | 0.000000 | 60.030000 |
| 16 | 0.5 | -15.482457 | 0.000000 | -15.483322 | 0.550000 | -15.484372 | 4.610000 | -15.482537 | 24.880000 |
| 18 | 0.1 | -23.364095 | 0.060000 | 0.000000 | 1.100000 | 0.000000 | 8.290000 | 0.000000 | 63.440000 |
| 18 | 0.3 | -17.241915 | 0.000000 | 0.000000 | 1.160000 | 0.000000 | 8.680000 | 0.000000 | 63.270000 |
| 18 | 0.5 | -17.795665 | 0.000000 | 0.000000 | 0.550000 | -17.911613 | 5.550000 | -17.911613 | 30.260000 |

We ran 3 of the above examples for a longer period of time (MonteCarlo with 50000 and 2000000 traversals):

| Pedigree Size | Recombination Factor | Value | Time |
|---|---|---|---|
| 14 | 0.5 | 0.000000 | 368.390000 |
| 16 | 0.3 | 0.000000 | 144.460000 |
| 18 | 0.3 | 0.000000 | 163.630000 |

### Conclusions

The test runs show that as the pedigree grows, the computation time increases. Furthermore, the number of traversals needed to get an accurate result, and especially a result different from zero, increases. The zero results are a product of the logical-values problem we showed with diagram 1. In every run, the approximation algorithm ran longer than exact inference by several orders of magnitude. Varying the recombination factor did not appear to lead to any conclusions.

This approximation algorithm for Bayesian network inference appears to have failed here. We can point to several reasons for this:

1. The exact inference algorithm performs a finite number of computations to eliminate each node; those computations depend on the number of predecessors (p) and the number of values (v) each of them has, so the order of computation per node is C1 * O(p) * O(v). For each node, the approximation algorithm merely randomizes one value, and it does so once per traversal; with t traversals its order of computation per node is t * O(1). Since p and v are bounded, there must be a number of traversals t beyond which exact inference is the faster approach, and it seems that in the networks we used, p and v were such that the deterministic inference algorithm had better time results.
2. We have noticed that as the network becomes more complicated and "harder" for the regular inference algorithm, its likelihood decreases considerably. As we have shown, the logical-values problem then makes it difficult for the approximation algorithm to compute a close enough result.
We assume that in networks where these conditions are more favorable, that is, the likelihood is not vanishingly small but the network is still large enough to challenge the exact inference algorithm, the approximation algorithm may come out ahead.

Full logs (.zip)

Executable for Windows (.zip)

### Bibliography

[Dagum and Luby 1997]
Paul Dagum and Michael Luby.
An optimal approximation algorithm for Bayesian inference.
*Artificial Intelligence*, 93:1-27, 1997.

[Fung and Chang 1989]
Robert Fung and Kuo-Chu Chang.
Weighing and integrating evidence for stochastic simulation in Bayesian networks.
In *Uncertainty in Artificial Intelligence 5*, pages 209-219, New York, NY, 1989. Elsevier Science Publishing Company, Inc.