Project In Bioinformatics

SNP analysis

Winter 2004, under guidance of Prof. Dan Gaiger and Anna Tzemah

By Shay Gammer and Idan Ben-Harrush

           Table of contents


1.      Background

2.      General Description

3.      Subject of Project

4.      Goals

5.      Requirements

6.      Implementation of multApproxFewRecombonation()

7.      Implementation of void calcProbabilities()

8.      Conclusions

9.      References


Ø      MultApproxFewRecombonation experiments results

Ø      CalcProbabilities experiments results


1.      Background:

·                          Linkage analysis aims to extract all available inheritance from pedigrees and to test for coinheritance if chromosomal region with a trait. In principle, one can use either parametric methods, meaning that a mode of inheritance is assumed, i.e. testing whether the inheritance pattern fits a specific model for a trait-causing gene. Or nonparametric methods, which involve testing whether the inheritance pattern deviates from expectation under independent assortment.

·                          Single Nucleotide polymorphism (SNP) is a single nucleotide change in a DNA sequence. SNPs by definition do not cause health problems for people that have them, but they can be useful when studying diseases in the general population, as they are natural variations that we all have.


Back to top


2.      General Description:

·                          GH is a program that calculates the likelihood of a family pedigree(s) given the θ values.  The likelihood calculation on the given input pedigrees is performed in a couple of stages:

                                                               i.      Stage 1:  Probabilities of all inheritance vectors for each locus are calculated separately.

                                                             ii.      Stage 2:  Calculated likelihood for all points (multiplication of probability vectors with transition matrices).

·                          This algorithm is linear for number of loci, but exponential with number of people.


Back to top


3.      Subject of Project:

·                          Adjusting our program for SNP analysis and constructing approximate solutions for maps where the probability of observing several recombinants between consecutive markers is close to zero.


Back to top

4.      Goals:

·                          Implement a faster algorithm for SNP analysis and finding approximate solutions assuming few recombinations at most between consecutive markers.

·                          Implement a faster algorithm calculating probabilities of all inheritance vectors for each locus
in case of two possible alleles in each marker.


Back to top


5.      Requirements:

·                          Implementation of the following function (in file “MultApproximate.c”):
void multApproxFewRecombonation(double *result, double *input, double theta, int size)
In this function, we need to multiply vector of real numbers (size n) with a transition matrix T(θ) of size n X n. We will define a n X n transition matrix T(θ) having rows and columns indexed by elements  α,β ε (Z2 )n,  with  elements
Tα,β(θ) = { θ j (1- θ)n- j   | j = H(α,β) and j < 2
Tα,β(θ) = { 0                  | j = H(α,β) and j >= 2
where  H(α,β) denote the Hamming distance (the number of bits that differ) between two vectors  α,β  ε (Z2 )n.  The Hamming distance is a metric on (Z2)n.  

·                          Implementation of the following function (in file “probTree.c)
void calcProbabilities(TreeNode* path, int depth, int step, int locus, bool paternal)
In these functions, we need to calculate the probability of marker genotypes given an inheritance vector. Use binary tree to represent the inheritance vectors, each leaf of the tree represents inheritance vector, the idea is to consider the restrictions imposed by the observed marker genotypes.


Back to top

6.      Implementation of multApproxFewRecombonation():

·                          Remark: We have examined the efficiency of the matrix vector multiplications, and reached the conclusion that the Fast Fourier Transform reaches better result when dealing with more than one recombinations.  Therefore we have implemented two functions, one which deals with any number of recombinations (multApproxFewRecombonation) and another which is optimized for dealing with one recombination (multApproxOneRecombonation).

·                          Functions prototypes:
void multApproxOneRecombonation(double *result, double *input, double theta, int size)
void multApproxFewRecombonation(double *result, double *input, double theta, int size)

·                          Logic:

                                                               i.      The matrix involved is symmetric, without loosing generality we will describe left vector, matrix multiplication.

                                                             ii.      In order to perform efficient multiplication, one needs to find all the matrix cells, for which the hamming distance between it’s indexes row(i), column(j) is less or equal to the maximum allowed recombinations.

                                                            iii.      The Hamming distance between every two indexes i,j equals the bit difference between their binary representation.

                                                           iv.      The result of XOR operation between vector A and vector B, such that vector B has only N bits on, must be a vector with hamming distance N from vector A.

                                                             v.      The outcome of multiplication of vector 1Xn with matrix nXn is 1Xn vector. This outcome is stored in the vector “result”. In the function, for each of the matrix columns the relevant cells are found ,simply, by enumeration all vectors with at most maximum allowed recombinations bits on,  and “XORing” with the column index
(in binary representation).

                                                           vi.      Each cell in the vector “result” holds summation of all the multiplication of the vector with the relevant column, as explained in the previous bullet we sum only values which make contribution i.e it’s value is not zero.

·                          Functions parameters:

                                                               i.      result – used to store the multiplication result.

                                                             ii.      input – the input vector.

                                                            iii.      theta – the recombination probability.

                                                           iv.      size – the size of the arrays "result" and "input"

·                          Functions return value: void.

·                          Data structures and variables used:

                                                               i.      void multApproxOneRecombonation(double *result, double *input, double theta, int size)

1.      double common – holds the value:  (1-θ)^( size )

2.      double  thetaValue – holds the value: (θ  * (1-θ)^( size -1)  (

3.      int invertMask:

a.       Initialization:      1.

b.      Purpose:           used to find the matrix's cells which hold value that is not zero.

                                                             ii.      void multApproxFewRecombonation(double *result, double *input, double theta, int size)

1.      #define MAX_ALLOWED_RECOMBINATIONS  - the number of recombinations considered valid between two consecutive markers.

2.      double common – holds the value:  (1-θ)^( size - MAX_ALLOWED_RECOMBINATIONS )

3.      double  thetaValues[MAX_ALLOWED_RECOMBINATIONS + 1]
cell i holds the value: (

4.      int invertMask: enumerates all vectors with at most MAX_ALLOWED_RECOMBINATIONS bits on.

5.      int bitInvert[MAX_ALLOWED_RECOMBINATIONS]: this array helps enumerating all the relevant vectors in invertMask, each cell holds a vector with one or zero bits on, we calculate invertMask by using bitwise OR over all cells in bitInvert.

·                          Pseudo code:

                                                               i.      For each index I in the vector "input":

1.      result[I] = input[I] * (1- θ) (MAX_ALLOWED_RECOMBINATIONS )

2.      for all vectors InvertMask < size with k < MAX_ALLOWED_RECOMBINATIONS bits on

a.       J = InvertMask  Xor   I

b.      result [I] += input[J] * (θk * (1- θ) (MAX_ALLOWED_RECOMBINATIONS – k) )

                                                             ii.      The "result" vector now contains the result of the multiplication of the vector with the sparse matrix.

Remark: in the case of one allowed recombination the InvertMask enumeration is rather simple, it is done by bitwise shift left operation.


·                          Complexity:

                                                               i.      For each index in the vector "input" going over all the relevant matrix cells: O(log( MAX_ALLOWED_RECOMBINATIONS) (size))

                                                             ii.      Sum: O(log(size))*O(size) = O(size * log( MAX_ALLOWED_RECOMBINATIONS) (size))


Back to top

7.      Implementation of void calcProbabilities():

·                          Remark: in order to implement efficiently the function calcProbabilities(), we have implemented the function :

void superCalcPrbability(TreeNode* path, int locus)

Which is a wrapper to the function calcProbabilities().

This function does the following:

o                     Initialize of the data structures.
o                     Initialize the root node with all the details of the founders.
o                     Makes the first call to the recursive function calcProbabilities().

·                          Function prototype:
void calcProbabilities(TreeNode* path, int depth, int step, int locus, bool paternal)

·                          Logic:

                                                               i.      The main data structure we use is a “virtual” sparse binary tree, in which each branch (meiosis) is evaluated conditional on the outcome of the preceding meiosis. Evaluation stops early in section producing invariant outcomes, these occur, when an impossible gene flow pattern is detected. Uninformative meiosis produces symmetric nodes and further increase “sparseness”. Uninformative meiosis occurs when both paternal alleles are indistinguishable i.e. the allele donating person is homozygote.

                                                             ii.      In order to save time we store the possible founder’s alleles assignments and the corresponding likelihoods in each node.
This allows both, the recognition of impossible gene flow (probability 0) early in the tree path, and fast probabilities calculations in leaf nodes.

                                                            iii.      In order to save space we go over the tree in DFS, which means we don’t store the whole tree (that is why we called it “virtual” earlier) but, only an array of nodes in the size of the depth of the tree (which is the maximum amount of nodes we might need to store concurrently), i.e. we store the path from the root node to the current node.  

·                          Parameters:

                                                               i.      path – the tree node, each node represents meiosis outcome

(data structure of the tree node will be described later on).

                                                             ii.      depth – an integer that represents the tree depth,  used to find the person that is effected by the meiosis.

                                                            iii.      step  integer to represent left = 0 /right = 1 son of the parent node, the step is used to build the inheritance vector.

                                                           iv.      locus – an integer that represents the locus which is tested. Used to find the probabilities of the alleles.

                                                             v.      paternal – a Boolean that represents the gender of the parent which had the meiosis.

·                          Return value: void.

·                          Data structures and variables used:

                                                               i.      typedef struct TreeNode_t


int inhVector;

int depth;

int* founderAssignments;

double* probabilities;

int top;

} TreeNode;

1.      inhVector – used to store the inheritance vector so far.

2.      Depth – used to store the depth of the node from the tree root.

3.      founderAssignments

a.        An array stores possible founder’s Alleles assignments.

b.       This array is logically of two dimensions; however it is implemented as one dimension array. It’s size is:

(2* number of founders) * ( 2 ^ (2 * number of founders))/2.

c.       The array is initialized in the function superCalcprobabilities() with known founder’s alleles assignments. The other speculations are filled while searching the tree.

d.      The array may double it’s size when a new allele assignment is possible, or be divided by 2 when there is an alleles restriction.

e.       Impossible alleles assignments to the founder according to the inheritance vector, is pointed by contradicting alleles assignments, for example assigning A to allele which was restricted to B only.

4.      probabilities – an array storing the corresponding probability for each assignment in founderAssignments. The probability of a marker given an inheritance vector is the sum of all the probabilities of the possible assignments for the founder’s alleles in the leaf node representing the vector.

5.      top – indicates the first empty space of the two previous arrays (founderAssignments,probabilities).

                                                             ii.      int Person, presonsParent –used as keys for accessing the data structure  (PedigreeGraph) for details harvesting on the relevant persons.

                                                            iii.      int allele1, allele2 – used to store the alleles of the person effected by the meiosis. Can store the values: 0 – for not initialized alles. 1 / 2 for each of the alleles possible.

·                          Pseudo code: (recursive function)

                                                               i.      Initialize the root node with the details known about the founders. (done in the superCalacProbabilities() )

      We go over the “virtual” tree nodes in DFS

                                                             ii.      For visited node:

1.      Init the node data structures from it’s father node.

a.       Update the inheritance vector according to the step made (0/1).

b.      Copy the founder’s alleles assignments options from the father node.

c.       Copy the probabilities for each alleles assignment from the father node.

2.      Check whether the person alleles contribute information to the founder’s alleles assignment:

a.       If the information contributes more options for the founder’s alleles assignment – update the founderAssingments array with the new option, and update the probabilities vector accordingly.

b.      If the information contributes by narrowing the options for the founder’s alleles assignment – update the founderAssingments array with the new option.

c.       If the information contributes by contradicting the founder’s alleles information gathered so far,


3.      If a leaf was reached store the probability and return.

4.      If the parent of the person effected by the next meiosis is homozygous

a.        Visit left son.

b.      duplicate probabilities of symmetric nodes.

c.       Return.

5.      Visit left son.

6.      Visit right son. 

·                          Complexity:

                                                               i.      DFS complexity: O(|V|+|E|) . |V| = 2[(num of meiosis) + 1] ,  |E| = |V| -1

                                                             ii.      Sum of nodes update: [ 2 * founders) * 2 (2 * founders) ]/2

                                                            iii.      Total : O[(2[(num of meiosis) + 1] + 2[(num of meiosis) + 1]) * ( 2 * founders) * 2 (2 * founders) )/2 ]

                                                           iv.      This complexity bound is rather loose, using sparse tree and partially occupied founderAssignments array , the average complexity is much lower.


             Back to top


8.      Conclusions:

·                          Approximate matrix vector multiplication:
When we want to consider the possibility of at most M recombinations between consecutive markers, the NxN transition matrix will have exactly N * logM(N) cells with values greater then zero. Since every nonzero cell can effect the outcome and since the Fast Fourier Transform approach performs the multiplication in O(N*log(N)), the method of approximate multiplication can only be effective for M=1.
The experiments we performed demonstrated the benefits of approximating the vector matrix multiplication when the probability of more then one recombination between consecutive markers is very small (virtually zero).
The efficient method of enumerating all nonzero matrix cells explained earlier (
6 - logic) makes it possible to reduce runtime without the need of any special data structure and therefore without increasing memory requirements.


Ø      MultApproxFewRecombonation experiments results


             Back to top



·                          Calculating probabilities using sparse trees:

o       The runtime of the sparse tree algorithm (STA), is longer than the runtime of the GH algorithm, however when predicting the runtime of STA with founders reduction, the runtime of STA is usually much shorter than the runtime of GH.

o       The more homozygous positions over all markers the faster the algorithms run (GH and STA). This occurs due to symmetry nodes
(as explained in section 7-logic-i).

o       The ratio between the runtime of STA and GH seems to increase with the increase of the homozygous marker positions, due to the tree sparseness.

o       The more genotyped founders the faster the algorithms run (GH and STA). This occurs due to restrictions imposed on the founder’s alleles assignments, which increases the sparseness of the tree.

o       The runtime seems to be shorter at the high end and low end of the graph, i.e. when the percentage of typed people is high or low,
this can be explained by the following reasons:

§         High percentages of genotyped people impose many restrictions on the founder’s alleles assignments, i.e. many impossible gene flows and many symmetric nodes, these shorten the calculations periods. (As explained in section 7-logic-i,ii).

§         Low percentages genotyped people means low number on calculations, when searching the tree non-genotyped person means no contribution to the marker probability calculations, i.e. skipping to the next node without calculations.

o       The run time of both algorithms is linear with the number of loci.

§         For GH the ratio is about: 0.9.

§         For SPA (predicted with founder reduction) the ratio is about: 0.25.

o       The runtime ratio between GH and SPA (considering founders reduction) tend to grow with the number of loci growth.

Ø      CalcProbabilities experimnets results


             Back to top



9.      References:

·                          "Merlin—rapid analysis of dense genetic maps using sparse gene flow trees", Gonçalo R. Abecasis, Stacey S. Cherny, William O. Cookson, and Lon R. Cardon.  Nat Genet. 2002 Jan;30(1):97-101

·                          Computing the probability of marker genotypes given an inheritance vector

·                          Merlin

·                          Converter to and from Genehunter/Merlin format:


             Back to top