Non Parametric Linkage Analysis

Project by Avihai Dgany, 06 Feb 2006

Introduction

Linkage analysis is a statistical method, which is used to extract information from inheritance data (e.g. pedigrees), and to find correlation between certain alleles and phenotypes, thus aiding in the identification of disease causing genes.

Two approaches can be used for the analysis:

Parametric
assumes prior knowledge of the inheritance model
Non parametric
assumes no prior knowledge, and tests if the provided data deviates from the expectation under independent assortments.

GENEHUNTER is a program which performs linkage analysis for pedigrees given theta values.

Multipoint analysis of a given pedigree is separated into two steps:

  1. computation of probability distribution of inheritance patterns (which depend only on the marker genotypes)
  2. computation of a statistic that quantifies linkage (which, conditional on an inheritance pattern, depends only on the phenotypes)

The algorithm used by GENEHUNTER is linear by number of loci, but exponential by the number of people.

Description

General

The goal of the project was to implement NPL (Non Parametric Linkage) analysis in the fastest possible way, based on the methodology of GENEHUNTER. A program was provided which calculates the probability vectors and likelihood scores, into which an implementation of NPL score computation was to be integrated.

The problem

A general algorithm to calculate score:


		compute_score(inheritance_vector_space){
			score := 0
			for each inheritance_vector in inheritance_vector_space do
				s := vector_score( vector )
				p := vector_probability( vector )
				score := score + (s * p)
		}
	

In this project, I seek for the best implementation of vector_score to return NPL scores of two kinds:

Objectives of the Solution

  1. Efficient algorithms
  2. Efficient code
  3. Clear, readable, and structured code

Algorithms

Calculating NPL score has two main algorithmic bottle-necks:

Therefore, improving performance of the calculation, requires:

The first issue is dealt with by not traversing through members who have no influence on the genotypic data of affected member (Kyriacos et al 2001), in addition to the already implemented reduction of inheritance space by removing the first child of each founder (Kruglyak et al 1996, Second Speedup).

The second and third issues are dealt with by integrating them into a unified process, in which the traversal and score calculation are weaved together, and by using two special traits of gray code. Gray code is a binary system in which two consecutive numbers differ by the value of a single bit. By iterating over the inheritance vectors space using Gray code, only a single allele is changed (for the changed member and its descendants). By iterating through the possible choices of alleles of affected members using Gray code, a faster calculation of NPLall is also gained (Kyriacos et al 2001). Since Gray code is cyclic, an extra iteration returns to the initial state, thus saving the need to analyze the entire pedigree when calculating the first choice of alleles in NPLall.

NPLpairs

In NPLpairs the score of each vector is the number of pairs of alleles that are IBD, in distinct affected pedigree members. When traversing the vector space, whenever changing an allele for a member, the score changes as follows: let a1 be the allele being replaced by a2, and |x| represents the number of times allele x appears in affected members, then after the replacement, |a1|* := |a1| - 1, |a2|* := |a2| + 1, score* := score - |a1|* + |a2|. If the new allele is identical to the other allele present in the same member, decrease the score by one, and if this is the case for the old allele, increase it by one.

NPLall

In NPLall the Whittemore-Halpern statistic, is calculated using Gray code as described by Kyriacos et all (2001). To improve performance, the starting value of the statistic PI(bi(h=0), i=1, 2f) is modified during traversal in the following way: let a1 be the allele being replaced by a2 (in this case a1 is always the paternal allele), then the value of the statistic is divided by the multiplicity of a1, the multiplicity of a1 is decreased by one, the multiplicity of a2 is increased by one, and the statistic is multiplied by the multiplicity of a2.

Pseudo-code


		let SF be the scoring function, which has methods:
			onNewVector(v), onChangeAlleles(member, parent), onVectorEnd(vector)
		let R be the reduced vectors space
		let F be the relevant founders
		let N be the relevant non founders	
		let S be the set of score per vector
		Initialization:
			For each s in S, s := 0
			vector := 0
			SF.onNewVector(vector)
			For each f in F,
				f.paternalAllele := unique value
				f.maternalAllele := unique value
				SF.onChangeAlleles(f, father)
				SF.onChangeAlleles(f, mather)
			For each n in N
				n.paternalAllele := n.father.paternalAllele
				n.maternalAllele := n.mather.paternalAllele
				SF.onChangeAlleles(n, father)
				SF.onChangeAlleles(n, mather)
			SF.onVectorEnd(vector)
		Iteration
			For each vector in R-{0}, in Gray code order, starting from 1
				SF.onNewVector(vector)
				let m be the changed member
				let p be the parent in whom the allele changed
				SF.onChangeAlleles(m,p)
				For each descendant d of m, in order of generations
					If allele change affected d,
						change allele in d, for appropriate parent q
						SF.onChangeAlleles(d,q)
				SF.onVectorEnd(vector)
		Normalization:
			For each s in S
				s := ( s - Expectation(S) ) / StandardDeviation(S)
				
		Examples of SF:
		NPLpairs:
			onNewVector:
				S(vector) := S(previous_vector)
			onChangeAlleles:
				if member is not affected then exit.
				let a be the allele being changed
				let b be the new allele
				let c be the other allele in the same member
				|a| := |a| - 1
				S(vector) := S(vector) - |a| + |b|
				|b| := |b| + 1
				if a = c S(vector) := S(vector) + 1
				if b = c S(vector) := S(vector) - 1
			onVectorEnd:
				do nothing
		
		NPLall:
			let h0 be the stored value of h0 initialized as 1
			onNewVector:
				do nothing
			onChangeAlleles:
				if member is not affected, or changed allele is not paternal then exit.
				let a be the allele being changed
				let b be the new allele
				h0 := h0 / |a|
				|a| := |a| - 1
				|b| := |b| + 1
				h0 := h0 * |b|
			onVectorEnd:
				Let H be the affected vector space
				S(vector) := h0
				h := h0
				For each vector h in H-{0}, in Gray code order
					let a be the allele being changed
					let b be the new allele
					h := h / |a|
					|a| := |a| - 1
					|b| := |b| + 1
					h := h * |b|
					S(vector) := S(vector) + h
				(The next lines are used to return the multiplicities to the state they should be according to h0)
				Let x be the maternal allele of the MSB
				Let y be the paternal allele of the MSB
				|x| := |x| - 1
				|y| := |y| + 1
	

Complexity

Let n be the number of non founders, f the number of founders, and a the number of affected members.

Marking relevance of pedigree membersO(n+f)
Creating masksO(n+f)
Traversing the inheritence vector spaceO(22n-f)
Changing allele assignment for each vectorO(n)
Calculating NPLpairs for each vectorO(n)
Calculating NPLall for each vectorO(2a)
Mutiplying scores and probability vectorsO(22n-f)
TotalO(22n-f+a)

Implementation

In order to integrate the NPL score computation into the existing code, a few changes are required in the original code:

  1. Changing the structure of the test points, to include fields for the final score for each test point. This is done by adding the corresponding fields npl_score_all and npl_score_pairs to the TestPoint struct, in programSpec.h. Accordingly, the initialization and printing of these fields is implemented in programSpec.c.
  2. Creating the score vectors. This is done by using the computeNplScore function provided in the npl.h library (implemented in npl.c) and providing it with functions to compute the required NPL score (for example, those provided by the library nplScorePairs.h and nplScoreAll.h, implemented in the corresponding c files). The modification in likehood.h (adding the vectors' definitions) and in gh_main.c (calling computeNplScore) generates the score vectors to be used in the program.
  3. Calculating the scores, using the vectors generated in (2) above, into the fields defined in (1) above is accomplished by adding the multiplication of the score vector and the probability vectors generated in likehood() function, implemented in likehood.c.
The modifications to the existing code is braced by /* modification BEGIN */ and /* modification END */, where modification is INSERT, REPLACE, or DELETE and replaced or deleted code is remarked.

Results

The test data included 56 tests, which were run on t2.technion.ac.il Each test was run 5 times, and the results presented are the average runtime of all the runs.

Average Runtime Results in Seconds for 5 Runs
Data filePedigree fileResultsNPLpairs Score vector calculationNPLpairs Total RuntimeNPLpairs GENEHUNTER RuntimeNPLall Score vector calculationNPLall Total runtimeNPLall GENEHUNTER Runtime
datafile3.datpedfile30.datNPLpairs NPLall00.0060.1100.0020.07
datafile3.datpedfile32.datNPLpairs NPLall00.0060.0900.0060.08
datafile4.datpedfile42.datNPLpairs NPLall00.0040.0900.0060.07
datafile4.datpedfile43.datNPLpairs NPLall00.0020.0600.0020.03
datafile5.datpedfile50.datNPLpairs NPLall00.0360.2100.0360.19
datafile5.datpedfile51.datNPLpairs NPLall00.261.760.0520.2981.83
datafile5.datpedfile52.datNPLpairs NPLall0.0020.0540.5500.050.6
datafile5.datpedfile53.datNPLpairs NPLall00.0520.4900.0420.48
datafile5.datpedfile54.datNPLpairs NPLall0.0024.47410.80.0284.34610.76
datafile6.datpedfile60.datNPLpairs NPLall00.3182.5500.3142.57
datafile6.datpedfile61.datNPLpairs NPLall0.0024.33423.4804.34423.55
datafile6.datpedfile62.datNPLpairs NPLall00.0220.3100.0220.3
datafile6.datpedfile63.datNPLpairs NPLall00.2342.3400.2522.4
datafile6.datpedfile64.datNPLpairs NPLall00.1080.0300.0960.02
datafile6.datpedfile65.datNPLpairs NPLall00.1380.0300.130.03
datafileGB_27_0.datpedfileGB_27_0_0.datNPLpairs NPLall0.0020.52859.2400.53658.84
datafileGB_27_1.datpedfileGB_27_1_0.datNPLpairs NPLall01.0362.9201.03662.29
datafileGB_27_2.datpedfileGB_27_2_0.datNPLpairs NPLall01.55864.4901.64464.07
datafileGB_27_3.datpedfileGB_27_3_0.datNPLpairs NPLall02.03866.902.07466.62
datafileGB_67_0.datpedfileGB_67_0_0.datNPLpairs NPLall013.24658.34012.06457.87
datafileGB_67_1.datpedfileGB_67_1_0.datNPLpairs NPLall013.24661.7012.861.74
datafileGB_67_2.datpedfileGB_67_2_0.datNPLpairs NPLall014.263.51013.25662.84
datafileGB_67_3.datpedfileGB_67_3_0.datNPLpairs NPLall0.00214.73264.4015.58265.37
datafileGB_90_0.datpedfileGB_90_0_0.datNPLpairs NPLall02.54457.9702.57458.54
datafileGB_90_1.datpedfileGB_90_1_0.datNPLpairs NPLall02.94859.230.0022.92859.51
datafileGB_90_2.datpedfileGB_90_2_0.datNPLpairs NPLall0.0023.26859.9203.25660.25
datafileGB_90_3.datpedfileGB_90_3_0.datNPLpairs NPLall0.0023.64661.050.0023.62261.52
datafileLA_0.datpedfileLA_0_0.datNPLpairs NPLall0.0020.1321.170.0020.141.18
datafileLA_1.datpedfileLA_1_0.datNPLpairs NPLall00.141.2200.1441.24
datafileLA_2.datpedfileLA_2_0.datNPLpairs NPLall00.161.4900.1481.33
datafileLA_3.datpedfileLA_3_0.datNPLpairs NPLall00.171.4600.1641.5
datafileLA_4.datpedfileLA_4_0.datNPLpairs NPLall00.1761.5800.1781.59
datafileLA_5.datpedfileLA_5_0.datNPLpairs NPLall00.1921.700.2041.6
datafileLA_6.datpedfileLA_6_0.datNPLpairs NPLall00.2081.8200.1961.77
datafileLA_7.datpedfileLA_7_0.datNPLpairs NPLall00.2141.8800.2361.95
datafileLA_8.datpedfileLA_8_0.datNPLpairs NPLall00.2422.1600.2341.97
datafileLA_9.datpedfileLA_9_0.datNPLpairs NPLall00.2562.100.2282.03
datafileLB_0.datpedfileLB_0_0.datNPLpairs NPLall00.0220.1300.0260.15
datafileLB_1.datpedfileLB_1_0.datNPLpairs NPLall00.0220.1800.0220.13
datafileLB_10.datpedfileLB_10_0.datNPLpairs NPLall00.0560.400.060.41
datafileLB_11.datpedfileLB_11_0.datNPLpairs NPLall00.0680.3900.060.47
datafileLB_12.datpedfileLB_12_0.datNPLpairs NPLall00.0740.4400.0720.43
datafileLB_13.datpedfileLB_13_0.datNPLpairs NPLall00.0760.4300.0680.45
datafileLB_14.datpedfileLB_14_0.datNPLpairs NPLall00.080.5100.0840.48
datafileLB_15.datpedfileLB_15_0.datNPLpairs NPLall00.090.500.0940.58
datafileLB_16.datpedfileLB_16_0.datNPLpairs NPLall00.090.5300.0840.55
datafileLB_17.datpedfileLB_17_0.datNPLpairs NPLall00.0960.5300.0880.57
datafileLB_18.datpedfileLB_18_0.datNPLpairs NPLall00.0980.6200.0920.63
datafileLB_19.datpedfileLB_19_0.datNPLpairs NPLall00.090.5900.0920.62
datafileLB_2.datpedfileLB_2_0.datNPLpairs NPLall00.0240.1900.020.16
datafileLB_3.datpedfileLB_3_0.datNPLpairs NPLall00.0240.1800.0340.18
datafileLB_4.datpedfileLB_4_0.datNPLpairs NPLall00.040.2400.0440.24
datafileLB_5.datpedfileLB_5_0.datNPLpairs NPLall00.0380.280.0020.0420.24
datafileLB_6.datpedfileLB_6_0.datNPLpairs NPLall00.0460.3100.0480.27
datafileLB_7.datpedfileLB_7_0.datNPLpairs NPLall00.050.300.0540.33
datafileLB_8.datpedfileLB_8_0.datNPLpairs NPLall00.0540.3300.0540.35
datafileLB_9.datpedfileLB_9_0.datNPLpairs NPLall00.0520.3800.0520.34

Suggested Improvements

General

Specific

Technical

References

Resources

ResourceDescription
SuperLinkSuperlink homepage, for file structure and other data

Usage