SNP analysis
Winter 2004, under guidance of Prof. Dan
Gaiger and Anna Tzemah
By Shay Gammer and Idan Ben-Harrush
Table of contents
1. Background
4.
Goals
5.
Requirements
6. Implementation of multApproxFewRecombonation()
7. Implementation of void calcProbabilities()
8. Conclusions
9. References
Ø
MultApproxFewRecombonation experiments results
Ø CalcProbabilities
experiments results
· Linkage analysis aims to extract all available inheritance from pedigrees and to test for coinheritance if chromosomal region with a trait. In principle, one can use either parametric methods, meaning that a mode of inheritance is assumed, i.e. testing whether the inheritance pattern fits a specific model for a trait-causing gene. Or nonparametric methods, which involve testing whether the inheritance pattern deviates from expectation under independent assortment.
· Single Nucleotide polymorphism (SNP) is a single nucleotide change in a DNA sequence. SNPs by definition do not cause health problems for people that have them, but they can be useful when studying diseases in the general population, as they are natural variations that we all have.
· GH is a program that calculates the likelihood of a family pedigree(s) given the θ values. The likelihood calculation on the given input pedigrees is performed in a couple of stages:
i. Stage 1: Probabilities of all inheritance vectors for each locus are calculated separately.
ii. Stage 2: Calculated likelihood for all points (multiplication of probability vectors with transition matrices).
· This algorithm is linear for number of loci, but exponential with number of people.
· Adjusting our program for SNP analysis and constructing approximate solutions for maps where the probability of observing several recombinants between consecutive markers is close to zero.
· Implement a faster algorithm for SNP analysis and finding approximate solutions assuming few recombinations at most between consecutive markers.
·
Implement a faster algorithm
calculating probabilities of all inheritance vectors for each locus
in case of two possible alleles in each marker.
·
Implementation of the following
function (in file “MultApproximate.c”):
void multApproxFewRecombonation(double *result, double
*input, double theta, int size)
In this function, we need to multiply vector of real numbers (size n)
with a transition matrix T(θ) of size n X n. We will define a n X n
transition matrix T(θ) having rows and columns indexed by
elements α,β
ε (Z2 )n,
with elements
Tα,β(θ) = { θ
j (1- θ)n- j | j = H(α,β)
and j < 2
Tα,β(θ) = { 0 | j = H(α,β)
and j >= 2
where H(α,β)
denote the Hamming distance (the number of bits that differ) between two
vectors α,β
ε (Z2 )n. The Hamming distance is a metric on (Z2)n.
·
Implementation of the following
function (in file “probTree.c”)
void calcProbabilities(TreeNode* path, int
depth, int step, int locus, bool paternal)
In these functions, we need to calculate the probability of marker genotypes
given an inheritance vector. Use binary tree to represent the inheritance
vectors, each leaf of the tree represents inheritance vector, the idea is to
consider the restrictions imposed by the observed marker genotypes.
6. Implementation
of multApproxFewRecombonation():
·
Remark: We have examined the efficiency of the
matrix vector multiplications, and reached the conclusion that the Fast Fourier
Transform reaches better result when dealing with more than
one recombinations. Therefore we have
implemented two functions, one which deals with any number of recombinations (multApproxFewRecombonation) and another which is optimized for
dealing with one recombination (multApproxOneRecombonation).
·
Functions prototypes:
void multApproxOneRecombonation(double *result, double
*input, double theta, int size)
void multApproxFewRecombonation(double *result, double
*input, double theta, int size)
·
Logic:
i.
The matrix involved is symmetric, without loosing generality we will
describe left vector, matrix multiplication.
ii.
In order to perform efficient multiplication, one needs to find all the
matrix cells, for which the hamming distance between it’s indexes row(i), column(j) is less or equal
to the maximum allowed recombinations.
iii.
The Hamming distance between every two indexes i,j equals the bit difference between their binary
representation.
iv.
The result of XOR operation between vector A and vector B, such that
vector B has only N bits on, must be a vector with hamming distance N from
vector A.
v.
The outcome of multiplication of vector 1Xn with matrix nXn is 1Xn vector. This outcome is stored in the vector “result”. In the function, for each of the matrix
columns the relevant cells are found ,simply, by enumeration all vectors with
at most maximum allowed recombinations bits on, and “XORing” with
the column index
(in binary representation).
vi.
Each cell in the vector “result” holds
summation of all the multiplication of the vector with the relevant column, as
explained in the previous bullet we sum only values which make contribution i.e it’s value is not zero.
·
Functions parameters:
i.
result – used to store the multiplication result.
ii.
input – the input vector.
iii.
theta – the recombination probability.
iv.
size – the size of the arrays "result" and "input"
·
Functions return value: void.
·
Data structures and variables used:
i.
void multApproxOneRecombonation(double *result, double
*input, double theta, int size)
1. double common – holds the value: (1-θ)^( size )
2. double
thetaValue – holds the value: (θ * (1-θ)^( size -1) (
3. int invertMask:
a. Initialization: 1.
b. Purpose: used to find the matrix's cells which
hold value that is not zero.
ii.
void multApproxFewRecombonation(double *result, double
*input, double theta, int size)
1.
#define MAX_ALLOWED_RECOMBINATIONS - the number of recombinations
considered valid between two consecutive markers.
2. double common – holds the value: (1-θ)^( size -
MAX_ALLOWED_RECOMBINATIONS )
3. double
thetaValues[MAX_ALLOWED_RECOMBINATIONS + 1] –
cell i holds the value: (θi * (1- θ) (MAX_ALLOWED_RECOMBINATIONS – i))
4. int
invertMask: enumerates all vectors with at most
MAX_ALLOWED_RECOMBINATIONS bits on.
5. int
bitInvert[MAX_ALLOWED_RECOMBINATIONS]: this array
helps enumerating all the relevant vectors in invertMask,
each cell holds a vector with one or zero bits on, we calculate invertMask by using bitwise OR over all cells in bitInvert.
·
Pseudo code:
i.
For each index I in the vector "input":
1. result[I] = input[I] * (1- θ) (MAX_ALLOWED_RECOMBINATIONS )
2. for all vectors InvertMask < size with k < MAX_ALLOWED_RECOMBINATIONS
bits on
a. J = InvertMask Xor I
b. result [I] += input[J] * (θk * (1- θ) (MAX_ALLOWED_RECOMBINATIONS
– k) )
ii.
The "result" vector now contains the result of the
multiplication of the vector with the sparse matrix.
Remark: in the case of one allowed
recombination the InvertMask enumeration is rather
simple, it is done by bitwise shift left operation.
·
Complexity:
i.
For each index in the vector "input" going over all the
relevant matrix cells: O(log( MAX_ALLOWED_RECOMBINATIONS) (size))
ii.
Sum: O(log(size))*O(size) = O(size * log(
MAX_ALLOWED_RECOMBINATIONS) (size))
7. Implementation
of void calcProbabilities():
·
Remark:
in order to implement efficiently the function calcProbabilities(),
we have implemented the function :
void superCalcPrbability(TreeNode* path, int locus)
Which is a wrapper to the function calcProbabilities().
This function does the following:
o Initialize of the data structures.
o Initialize the root node with all the details of the founders.
o Makes the first call to the recursive function calcProbabilities().
·
Function prototype:
void calcProbabilities(TreeNode* path, int
depth, int step, int locus, bool paternal)
·
Logic:
i.
The main data structure we use is a “virtual” sparse binary tree, in
which each branch (meiosis) is evaluated conditional on the outcome of the
preceding meiosis. Evaluation stops early in section producing invariant
outcomes, these occur, when an impossible gene flow pattern is detected. Uninformative
meiosis produces symmetric nodes and further increase “sparseness”.
Uninformative meiosis occurs when both paternal alleles are indistinguishable
i.e. the allele donating person is homozygote.
ii.
In order to save time we store the possible founder’s alleles
assignments and the corresponding likelihoods in each node.
This allows both, the recognition of impossible gene flow (probability 0) early
in the tree path, and fast probabilities calculations in leaf nodes.
iii.
In order to save space we go over the tree in DFS, which means we don’t
store the whole tree (that is why we called it “virtual” earlier) but, only an
array of nodes in the size of the depth of the tree (which is the maximum
amount of nodes we might need to store concurrently), i.e. we store the path
from the root node to the current node.
·
Parameters:
i.
path – the tree node, each node represents meiosis outcome
(data structure of
the tree node will be described later on).
ii.
depth –
an integer that represents the tree depth,
used to find the person that is effected by the meiosis.
iii.
step
– integer to represent left = 0 /right =
1 son of the parent node, the step is used to build the inheritance vector.
iv.
locus –
an integer that represents the locus which is tested. Used to find the
probabilities of the alleles.
v.
paternal
– a Boolean that represents the gender of the parent which had the meiosis.
·
Return value:
void.
·
Data structures and variables used:
i.
typedef struct TreeNode_t
{
int inhVector;
int depth;
int* founderAssignments;
double* probabilities;
int top;
} TreeNode;
1. inhVector – used to store the inheritance vector so far.
2. Depth – used to store the
depth of the node from the tree root.
3. founderAssignments
a. An array stores possible founder’s Alleles
assignments.
b. This array is logically of two dimensions;
however it is implemented as one dimension array. It’s size is:
(2* number of founders) * (
2 ^ (2 * number of founders))/2.
c. The array is initialized in
the function superCalcprobabilities() with known founder’s alleles assignments. The other
speculations are filled while searching the tree.
d. The array may double it’s size when a new allele assignment is possible, or be
divided by 2 when there is an alleles restriction.
e. Impossible alleles assignments
to the founder according to the inheritance vector, is pointed by contradicting
alleles assignments, for example assigning A to allele which was restricted to
B only.
4. probabilities – an array storing the corresponding
probability for each assignment in founderAssignments.
The probability of a marker given an inheritance vector is the sum of all the
probabilities of the possible assignments for the founder’s alleles in the leaf
node representing the vector.
5. top – indicates the first empty space of the
two previous arrays (founderAssignments,probabilities).
ii.
int Person, presonsParent
–used as keys for accessing the data structure
(PedigreeGraph) for details harvesting on the
relevant persons.
iii.
int allele1, allele2 – used to store the
alleles of the person effected by the meiosis. Can store the values: 0 – for
not initialized alles. 1 / 2 for each of the alleles
possible.
·
Pseudo code:
(recursive function)
i.
Initialize the root node with the details known about the founders. (done
in the superCalacProbabilities() )
We go over the “virtual” tree nodes in DFS
ii.
For visited node:
1. Init the node data structures
from it’s father node.
a. Update the inheritance vector
according to the step made (0/1).
b. Copy the founder’s alleles assignments options from the father node.
c. Copy the probabilities for
each alleles assignment from the father node.
2. Check whether the person
alleles contribute information to the founder’s alleles assignment:
a. If the information contributes
more options for the founder’s alleles assignment –
update the founderAssingments array with the new
option, and update the probabilities vector accordingly.
b. If the information contributes
by narrowing the options for the founder’s alleles assignment – update the founderAssingments array with the new option.
c. If the information contributes
by contradicting the founder’s alleles information gathered so far,
Return.
3. If a leaf was reached store
the probability and return.
4. If the parent of the person
effected by the next meiosis is homozygous
a. Visit left son.
b. duplicate probabilities of symmetric nodes.
c. Return.
5. Visit left son.
6. Visit right son.
·
Complexity:
i.
DFS complexity: O(|V|+|E|) . |V| = 2[(num
of meiosis) + 1] , |E| = |V| -1
ii.
Sum of nodes update: [ 2 * founders) * 2 (2 * founders) ]/2
iii.
Total : O[(2[(num of meiosis) + 1] + 2[(num
of meiosis) + 1]) * ( 2 * founders) * 2 (2 * founders) )/2
]
iv.
This complexity bound is rather loose, using sparse tree and partially
occupied founderAssignments array ,
the average complexity is much lower.
·
Approximate matrix vector multiplication:
When we want to consider the possibility of at most M recombinations between
consecutive markers, the NxN transition matrix will
have exactly N * logM(N) cells with values
greater then zero. Since every nonzero cell can effect the outcome and since
the Fast Fourier Transform approach performs the multiplication in O(N*log(N)), the method of approximate multiplication can
only be effective for M=1. The experiments we
performed demonstrated the benefits of approximating the vector matrix
multiplication when the probability of more then one recombination between
consecutive markers is very small (virtually zero).
The efficient method of enumerating all nonzero matrix cells explained earlier
(6 -
logic) makes it possible to
reduce runtime without the need of any special data structure and therefore without
increasing memory requirements.
Ø
MultApproxFewRecombonation experiments results
·
Calculating probabilities using sparse trees:
o
The
runtime of the sparse tree algorithm (STA), is longer than the runtime of the
GH algorithm, however when predicting the runtime of STA with founders
reduction, the runtime of STA is usually much shorter than the runtime of GH.
o
The more homozygous positions over
all markers the faster the algorithms run (GH and STA). This occurs due to
symmetry nodes
(as explained in section 7-logic-i).
o
The ratio between the runtime of
STA and GH seems to increase with the increase of the homozygous marker
positions, due to the tree sparseness.
o The more genotyped founders the faster the algorithms run (GH and STA). This occurs due to restrictions imposed on the founder’s alleles assignments, which increases the sparseness of the tree.
o
The runtime seems to be shorter at
the high end and low end of the graph, i.e. when the percentage of typed people
is high or low,
this can be explained by the following reasons:
§ High percentages of genotyped people impose many restrictions on the founder’s alleles assignments, i.e. many impossible gene flows and many symmetric nodes, these shorten the calculations periods. (As explained in section 7-logic-i,ii).
§ Low percentages genotyped people means low number on calculations, when searching the tree non-genotyped person means no contribution to the marker probability calculations, i.e. skipping to the next node without calculations.
o The run time of both algorithms is linear with the number of loci.
§
For GH the ratio is about: 0.9.
§
For SPA (predicted with founder
reduction) the ratio is about: 0.25.
o
The runtime ratio between GH and
SPA (considering founders reduction) tend to grow with the number of loci
growth.
Ø CalcProbabilities
experimnets results
· "Merlin—rapid analysis of dense genetic maps using sparse gene flow trees", Gonçalo R. Abecasis, Stacey S. Cherny, William O. Cookson, and Lon R. Cardon. Nat Genet. 2002 Jan;30(1):97-101 http://www.nature.com/cgitaf/DynaPage.taf?file=/ng/journal/v30/n1/full/ng786.html
· Computing the probability of marker genotypes given an inheritance vector http://stat-www.berkeley.edu/users/terry/Classes/s260.1998/Week8a/week8a/node13.html#SECTION00040000000000000000
· Merlin http://www.sph.umich.edu/csg/abecasis/Merlin/
·
Converter to and from Genehunter/Merlin format:
http://cbl-fog.cs.technion.ac.il/pedtool/