(Advanced Programming Lab ,cs 236503)

by Anna Reitman  and Dobgolbsqy Nyqolay

25.02.02, under guidance of

Prof. Dan Geiger

Ma'ayan Fishelzon

Subject of Project   

Implementation of Genehunter's algorithm.


Genetic linkage analysis is a statistical method that is used to associate functionality of genes to their location on chromosomes.  The main idea is that genes that are found in vicinity on the chromosome have a tendency to stick together when passed on to offsprings.  Thus, if some disease is often passed to offsprings along with a specific marker, then it can be concluded that the gene that is responsible for the disease is located close on the chromosome to that marker.  By studying the proportion of offsprings to which the marker has passed along with the disease, the distance between the disease gene and the marker can be estimated.

         The alleles (at different genes) that are received by an individual from one parent are called a haplotype.  The measure that is used for estimating whether the haplotype of an individual contains 2 alleles (one at each locus) that resided in the same haplotype in the individual's parent is called the recombination fraction and it is denoted by theta.  The basic computational goal in genetic linkage analysis is to compute the recombination fraction between the two genes, i.e., to compute the probability that a recombination occurs between two genes.
         By calculating the probability of recombination between two genes we can estimate the distance between the genes, and therefore if we know the location of one of the genes we can locate the position of the second one.

        One of the frequently used methods for estimating theta is the Maximum Likelihood Evaluation (MLE) approach.  The aim of this approach is to find a value of theta that maximizes the probability of the evidence about a family pedigree, given theta.


Genehunter is a program used for performing genetic linkage analysis.The Genehunter algorithm is based on hidden Markov models(HMMs).By using this approach, the computation time scales linearly with the number of loci – although it scales exponentially with the number ofmeioses in the pedigree.The goal of this project is to implement Genehunter’s algorithm in the best possible way, using the references mentioned below.


*SUPERLINK external & internal documentation.

*Leonid Kruglyak, Mark J. Daly, Mary P. Reeve-Daly, and Eric S. Lander.Rapid Multipoint Linkage Analysis of Recessive Traits in Nuclear Families, Including Homozygosity Mapping. American Journal of Human Genetics 56 (1995): 519-527..

*Leonid Kruglyak, Mark J. Daly, Mary P. Reeve-Daly, and Eric S. Lander. Parametric and Nonparametric Linkage Analysis: A Unified Multipoint Approach. American Journal of Human Genetics 58 (1996): 1347-1363.

*Eric S. Lander and Philip Green.Construction of multilocus genetic linkage maps in humans.In Proceedings of National Academy of Science 84 (1987):2363-2367.

*GENEHUNTER documentation: http://linkage.rockefeller.edu/soft/gh/ 

*"Statistics in Genetics " by Terry  Speed http://stat-www.berkeley.edu/users/terry/Classes/s260.1998/Week8a/week8a/



All Data Structures are represented by arrays - as most fastest way and once prepared "masks" that give fast reach of specific bit.

For more explanation follow the link - DATA STRUCTURE

Linkage analysis can be devided into two steps:

1. Inferring information about the inheritance pattern of a pedigree

2.Deciding whether the the inheritance information indicates the presence of a trait-causing gene


Computing the single-locus Inheritance Destribution at Codominant Loci - O(2^2n-f)

The inheritance pattern at each point x is completely describes by a binary inheritance vector  v(x)=(p1,m1 , p2,m2 ....pn,mn), whose coordinates describe the outcome of paternal and maternal meioses giving rise to the n nonfounder in pedigree.

For each locus we compute the possibility of each inheretence vector (all nonfounders except first generation).This done by computing P(Marker | v) and then normalizing the sum of all probabilities in same loci to 1.


Computing the single-locus Inheritance Destribution at Desease  Loci (without speed-up)- O(4f)

 Once again we compute P(Desease| v) when P(PhenotypeDesease| v) =Sum(P(gi | v)*Mult    (P(PhenotypeDesease | gi)) and normalize then.

P(Desease | gi) is the probability that the i-th pedigree member has phenotype gi

P(gi | v) means that gi is completly determined by the founder genotypes.

The sum is computed by direct summation over founder genotypes.


Computing Likehood- O(n*n*2n)

Is done exactly as Describes in Lander and Green algorithm(1987).


Source code



Input Files No. of People Num. of Founders No.of Loci Ln- Likelihood on first place ln likehood superlink Run 
datafileEC11.dat pedfileEC11.dat 5 3 110 -1094.408491 -1103.826326 0.02 0.4 outEC11.txt
datafileEC10.dat  pedfileEC10.dat  5 3 100 -999.084665 -1007.835833 0.01 0.3 outEC10.txt 
datafileED0.dat  pedfileED0.dat  5 2 100 -884.612059 -872.452335 0.02 0.4 outED0.txt 
datafileED5dat pedfileED5.dat  5 2 150 -1320.336439 -1305.520497 0.02 0.6 outED5.txt 
datafileED11.dat pedfileED11.dat  5 2 210 -1816.519908 -1799.752965 0.02 - outED11.txt 
datafileA0.dat pedfileA0.dat 7 3 30 -332.227408 -333.789857 0.02 0.2 outA0.txt
datafileA1.dat  pedfileA1.dat  7 3 60 -706.169015  -701.623972 0.01 0.3 outA1.txt 
datafileA2.dat pedfileA2.dat  7 3 90 -1082.211699  -1070.737231 0.01 0.5 outA2.txt 
datafileB0.dat pedfileB0.dat 10 3 100 -1723.443557 -1716.274631 76.43 26.4* outB0.txt 
datafileB1.dat pedfileB1.dat  10 3 120 -2077.838700 -2068.643312 92.15 * outB1.txt
datafileB2.dat pedfileB2.dat  10 3 140 -2416.808981 -2407.466499 108.65 * outB2.txt 
datafileB3.dat pedfileB3.dat  10 3 160 -2766.293204 -2745.365789 126.58 * outB3.txt 
datafileB4.dat pedfileB4.dat  10 3 180 -3119.061002 -3103.428414 138.96 * outB4.txt 


*    LOD score wasn't calculated due to too small numbers or/and input file size

**  The error is bound by 2%

*** Genehunter does not calculate ln-likelihood -  only LOD SCORE

**** Output includes ln-likelihood for all possible places for disease locus

***** Run time - was measured by time system  function (for more information man time) - user time was taken 

Input Files No. of People Num. of Founders No.of Loci Ln- Likelihood on first place ln likehood superlink Run 
datafile3.dat pedfile3.dat 3 2 20 -123.982425 -126.488538 0.02 0.04 out3.txt
datafile4.dat  pedfile4.dat  4 2 20 -166.879821 -165.384065 0.01 0.06 out4.txt 
datafile5.dat  pedfile5.dat  5 2 20 -182.224323 -177.032069 0.02 0.18 out5.txt 
datafile6dat pedfile6.dat  6 2 20 -176.306845 -179.544550 0.05 0.2 out6.txt 
datafile7.dat pedfile7.dat  7 2 20 -197.065780 -195.338692 0.36 0.52 out7.txt 
datafile8.dat pedfile8.dat 8 3 20 -245.468750 -248.886213 0.18 0.3 out8.txt
datafile9.dat  pedfile9.dat  9 3 20 -245.485508 -246.856814 1.23 1.09 out9.txt 
datafile10.dat pedfile10.dat  10 3 20 -361.802210 -364.037750 0.72 0.59 out10.txt 
datafile11.dat pedfile11.dat 11 4 20 -338.797211 -340.462190 5.11 2.72 out11.txt 


On these files our program is in average faster by 1.38

Average  mistake 1.37%