Calculating A Family Pedigree Using Monte Carlo Approximations 

by Guy Rom and Benjamin Zander 

April 14th, 2002, under the guidance of Maayan Fishelson 

Problem 

However well exact algorithms are refined, linkage analysis poses an ever-growing computational challenge: some disease models depend on multiple loci, markers are highly polymorphic, and many markers are available. It is therefore crucial to identify automatically when a model is too large to be handled by exact inference, and to provide, at the user's request, the means for approximate computation with error bounds. Several theoretical approximation algorithms have been developed to address this challenge. Since the current implementation of SuperLink performs only exact inference, the goal of this project was to provide the means for performing approximate inference.

The difficulties we faced were on two levels: finding an appropriate theoretical model, and tailoring it to SuperLink.

Implementation 

The theoretical basis of our work is found in [1], which describes and proves a bounded approximation algorithm for inference in Bayesian networks. The approach, as presented in the article, is an improvement on Evidence Weighing [2]. 

The Bayesian network is traversed many times. In each traversal, node values are randomized based on the probability tables. 

Expect nodes are the nodes in a Bayesian network whose probability entries do not add up to 1: the sum over the values of such a node, given its predecessors' values, is less than 1. It can be proved that the product of the values generated at these nodes, averaged over the traversals, converges to the exact inference value of the network.
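The traversal scheme can be sketched as follows. The two-node network and its probability tables here are hypothetical, chosen only to illustrate how averaging the expect-node contributions converges to the exact value; this is a sketch of the idea, not SuperLink's implementation:

```python
import random

# Hypothetical two-node network A -> B, with B observed (an "expect" node).
# These tables are illustrative only; they are not taken from SuperLink.
p_a = {0: 0.5, 1: 0.5}                # P(A), a normalized table
p_b_obs_given_a = {0: 0.5, 1: 0.2}    # P(B = observed | A): the expect-node entries

def estimate_evidence(num_traversals, rng=None):
    """Average the expect-node contributions over many traversals;
    this converges to the exact value sum_a P(a) * P(obs | a)."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(num_traversals):
        # Randomize each non-evidence node from its probability table.
        a = 1 if rng.random() < p_a[1] else 0
        # Expect nodes are not randomized; their probability is multiplied in.
        total += p_b_obs_given_a[a]
    return total / num_traversals

exact = p_a[0] * p_b_obs_given_a[0] + p_a[1] * p_b_obs_given_a[1]  # 0.35
```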

The implementation is confined to a single module that performs the approximate computation (MonteCarlo.[ch]). A few general definitions must be set in settings.h. A significant amount of documentation can be found in the code itself.

Implementing the algorithm in SuperLink required us to disable some of SuperLink's features. SuperLink tries to reduce the number of nodes in the network and thereby reduce the amount of computation; this reduction increases the number of expect nodes, and thereby increases the number of computations needed by the approximation algorithm. Another difficulty was isolating the probability table of a specific node.

A theoretical bound computed according to [1] can be used, but is not recommended since it requires too much time. Instead, the program can be supplied directly with the number of traversals to be made by the approximation algorithm.
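For reference, one commonly quoted form of the sample-size bound behind [1] (the zero-one estimator theorem) can be sketched as follows; the exact constant varies between statements of the theorem, so treat this only as an illustration of why small evidence probabilities make the bound impractically large:

```python
import math

def zero_one_estimator_bound(epsilon, delta, mu):
    """Traversals sufficient for an (epsilon, delta) relative approximation
    of a probability mu, in one commonly quoted form of the zero-one
    estimator theorem; the constant differs between statements."""
    return math.ceil(4.0 * math.log(2.0 / delta) / (epsilon ** 2 * mu))

# Halving mu roughly doubles the required number of traversals, which is
# exactly what hurts when the evidence probability is tiny.
```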

Results 

We have managed to show that the algorithm is not suitable for the purposes of SuperLink.

In all the examples we used, exact inference was faster by far. The main problem that the SuperLink environment poses for this algorithm is the use of logical values (i.e., zero and one probabilities) or probabilities that are close to logical. A simple example is described in diagram 1; the lower node is the expect node. Notice that the sum over its values, given the predecessor's value, is not 1. In 99 out of 100 traversals we will randomize the value 1 for the top node, so the expect node's contribution will be zero. In the remaining 1 of 100 traversals we will randomize the value 0 for the top node, and the expect node will then contribute 0.5. Dividing 0.5 by 100 (the sum of the expect-node probabilities divided by the number of traversals) gives the end result. But what if we perform only 50 traversals, or if in the 100 traversals we never hit 0 at the top node?
The example gives an intuitive understanding of the problem: where logical or close-to-logical values are used, a very large number of traversals is needed in order to converge to the inferred value. 
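A minimal numeric sketch of this situation (the 0.99/0.01 top node and the 0.5 expect-node entry are taken from the example above; the code itself is illustrative, not SuperLink's):

```python
import random

# The top node takes value 1 with probability 0.99 and 0 with probability 0.01;
# the expect node contributes 0.5 when the top node is 0 and 0.0 otherwise,
# so the exact evidence probability is 0.01 * 0.5 = 0.005.

def estimate(num_traversals, rng):
    total = 0.0
    for _ in range(num_traversals):
        top = 0 if rng.random() < 0.01 else 1
        total += 0.5 if top == 0 else 0.0
    return total / num_traversals

# With only 50 traversals, the chance of never sampling top = 0 is
# 0.99 ** 50, about 0.61, so the estimate is usually exactly zero.
```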

In addition, a theoretical bound on the number of traversals, as shown in [1], validates this view. 


[Diagram 1: a two-node network in which the lower node is an expect node]

We used the MakePed utility to create the input files. The parameters we used were:

Number of loci desired in the data file - 2
Place of the disease locus - 1
Sex-linked - no
Number of null loci - 0
Number of alleles per locus - 3
Type of locus no. 0 - alleles-type locus
Type of locus no. 1 - alleles-type locus
Recombination value - 0.1 / 0.3 / 0.5 (3 possible values for every pedigree size)
Multiple marriages - no
Loops - no
Size of pedigree - 3 to 18


Following are sample runs on randomly generated input files. The files varied along two criteria: pedigree size and recombination factor. A larger pedigree was expected to make the Bayesian network larger and harder to handle for both tested approaches; the recombination factor was used as a means to affect the logical-values problem.

Values are log-likelihoods; calculation times are in seconds. The number next to MonteCarlo is the number of traversals over the network. A value of 0.000000 means the estimate came out as zero.
 
Size | Rec. factor | Exact value | Time | MonteCarlo 2000 value | Time | MonteCarlo 20000 value | Time | MonteCarlo 200000 value | Time
-----|-------------|-------------|------|-----------------------|------|------------------------|------|-------------------------|------
3    | 0.1 | -2.747035  | 0.00 | -2.748870  | 0.16 | -2.743392  | 1.27  | 12.657098  | 8.29
4    | 0.1 | -2.580212  | 0.00 | -2.683621  | 0.17 | -2.620569  | 1.42  | -2.599492  | 9.95
5    | 0.1 | -6.111203  | 0.00 | -6.199487  | 0.33 | -6.085544  | 2.04  | -6.120306  | 16.04
5    | 0.3 | -5.706751  | 0.00 | -5.667138  | 0.33 | -5.697237  | 3.08  | -5.711430  | 17.08
5    | 0.5 | -5.365020  | 0.05 | -5.296761  | 0.16 | -5.426840  | 1.26  | -5.376147  | 8.57
6    | 0.1 | -6.698798  | 0.00 | -6.600389  | 0.33 | -6.667788  | 3.13  | -6.671289  | 20.43
6    | 0.3 | -7.671187  | 0.00 | -7.582318  | 0.33 | -7.645987  | 3.73  | -7.665147  | 18.73
6    | 0.5 | -8.625642  | 0.00 | -8.584443  | 0.22 | -8.640189  | 1.43  | -8.623507  | 8.57
7    | 0.1 | -8.974668  | 0.00 | -9.571315  | 0.44 | -9.015012  | 3.79  | -8.969255  | 23.95
7    | 0.3 | -8.046899  | 0.00 | -8.035737  | 0.39 | -8.039623  | 4.50  | -8.047561  | 22.47
7    | 0.5 | -7.172967  | 0.00 | -7.060475  | 0.22 | -7.106232  | 2.25  | -7.183487  | 12.02
8    | 0.1 | -7.172967  | 0.00 | -7.060475  | 0.22 | -7.106232  | 2.25  | -7.183487  | 12.02
8    | 0.3 | -9.690607  | 0.00 | -9.950005  | 0.49 | -9.620946  | 4.50  | -9.696056  | 24.44
8    | 0.5 | -9.186951  | 0.00 | -8.910845  | 0.28 | -9.308785  | 2.42  | -9.162657  | 15.16
9    | 0.1 | -8.755576  | 0.06 | 0.000000   | 0.71 | 0.000000   | 5.82  | -8.420663  | 58.55
9    | 0.3 | -11.232086 | 0.00 | 0.000000   | 0.60 | -11.305155 | 6.20  | -11.191212 | 32.73
9    | 0.5 | -11.040140 | 0.00 | 0.000000   | 0.33 | 0.000000   | 3.46  | -11.167107 | 18.12
10   | 0.1 | -7.919520  | 0.06 | 0.000000   | 0.71 | -7.921716  | 5.93  | -7.946076  | 37.68
10   | 0.3 | -8.837329  | 0.00 | 0.000000   | 5.38 | 0.000000   | 32.24 | -9.049059  | 305.67
10   | 0.5 | -9.956336  | 0.00 | 0.000000   | 0.33 | -9.815304  | 3.25  | -10.162091 | 18.51
11   | 0.1 | -12.230570 | 0.00 | 0.000000   | 0.66 | -12.781721 | 6.21  | -12.389024 | 38.72
11   | 0.3 | -10.367531 | 0.00 | -10.517442 | 0.61 | -10.341351 | 5.28  | -10.382309 | 31.74
11   | 0.5 | -12.204816 | 0.00 | 0.000000   | 0.38 | -12.151225 | 3.62  | -12.090527 | 19.17
12   | 0.1 | -13.708077 | 0.05 | 0.000000   | 0.77 | -13.529714 | 6.27  | -13.383586 | 39.60
12   | 0.3 | -12.149852 | 0.00 | 0.000000   | 0.94 | 0.000000   | 7.08  | -12.262694 | 46.09
12   | 0.5 | -14.045091 | 0.00 | 0.000000   | 0.33 | -14.335848 | 3.84  | -13.974120 | 21.14
14   | 0.1 | -13.246331 | 0.16 | 0.000000   | 1.10 | -13.029461 | 7.47  | -13.401072 | 54.38
14   | 0.3 | -16.064232 | 0.06 | 0.000000   | 1.21 | 0.000000   | 9.17  | -16.271196 | 101.06
14   | 0.5 | -19.663534 | 0.00 | 0.000000   | 0.49 | 0.000000   | 4.67  | 0.000000   | 25.49
16   | 0.1 | -14.958027 | 0.00 | 0.000000   | 1.21 | 0.000000   | 8.57  | 0.000000   | 90.40
16   | 0.3 | -15.982722 | 0.05 | 0.000000   | 1.21 | 0.000000   | 8.68  | 0.000000   | 60.03
16   | 0.5 | -15.482457 | 0.00 | -15.483322 | 0.55 | -15.484372 | 4.61  | -15.482537 | 24.88
18   | 0.1 | -23.364095 | 0.06 | 0.000000   | 1.10 | 0.000000   | 8.29  | 0.000000   | 63.44
18   | 0.3 | -17.241915 | 0.00 | 0.000000   | 1.16 | 0.000000   | 8.68  | 0.000000   | 63.27
18   | 0.5 | -17.795665 | 0.00 | 0.000000   | 0.55 | -17.911613 | 5.55  | -17.911613 | 30.26


We ran 3 of the above examples for a longer period of time:

Size | Rec. factor | MonteCarlo 50000 value | Time   | MonteCarlo 2000000 value | Time
-----|-------------|------------------------|--------|--------------------------|-------
14   | 0.5         | -                      | -      | 0.000000                 | 368.39
16   | 0.3         | 0.000000               | 144.46 | -                        | -
18   | 0.3         | 0.000000               | 163.63 | -                        | -


Conclusions 

The test runs show that as the pedigree grows, the computation time increases; so does the number of traversals needed to obtain an accurate result, and especially a result different from zero. The zero results are the product of the logical-values problem illustrated by diagram 1. In every run, the approximation algorithm took longer by several orders of magnitude. Varying the recombination factor did not seem to lead to any conclusions.

This approximation algorithm for inference in Bayesian networks seems to have failed here. We can point to several reasons:

  1. The exact inference algorithm performs a finite number of computations to eliminate each node; those computations depend on the number of predecessors (p) and the number of values (v) each of them has, so the cost per node is on the order of C1 * O(p) * O(v). For each node, the approximation algorithm merely randomizes a value, once per traversal, over t traversals, so its cost per node is t * O(1). Since p and v are bounded, there must be some number of traversals t beyond which exact inference is faster, and it seems that in the networks we used, p and v were such that the deterministic algorithm had the better running times.
  2. We have noticed that as the network becomes more complicated and "harder" for the regular inference algorithm, its likelihood decreases considerably. As we have shown, the logical-values problem then makes it difficult for the approximation algorithm to compute a sufficiently close result.
We assume that in networks where these conditions are more favorable, yet the network is large and poses a challenge to the exact inference algorithm, the approximation algorithm may well come out ahead.
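The trade-off in point 1 can be sketched with a toy cost model; all constants and parameter values here are assumed for illustration, not measured:

```python
# Toy cost model (all constants assumed): exact elimination does on the
# order of c * p * v work per node, while the sampler does one cheap
# randomization per node per traversal, t traversals in all.

def exact_cost(num_nodes, p, v, c=1.0):
    # Cost of eliminating num_nodes nodes with p predecessors of v values each.
    return c * num_nodes * p * v

def sampler_cost(num_nodes, t, c=1.0):
    # Cost of t traversals, one randomization per node per traversal.
    return c * num_nodes * t

# With p and v bounded, any t much larger than p * v makes sampling the
# slower approach per node, matching the timings in the tables above.
```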

Downloads 

Full logs (.zip) 

Executable for Windows (.zip) 

Bibliography 

[1] Dagum and Luby, 1997 

Paul Dagum and Michael Luby.
An optimal approximation algorithm for Bayesian inference.
Artificial Intelligence, 93:1-27, 1997. 

An Optimal Algorithm for Monte Carlo Estimation (PS)

[2] Fung and Chang, 1989

Robert Fung and Kuo-Chu Chang.
Weighing and integrating evidence for stochastic simulation in Bayesian networks.
In Uncertainty in Artificial Intelligence 5, pages 209-219, New York, N. Y., 1989. Elsevier Science Publishing Company, Inc.