Project in bioinformatics

 

 

Winter 2003/4, under the guidance of Prof. Dan Geiger and Maayan Fishelzon

Submitted By:

Gzayil Firass email: sfgzayil@t2.technion.ac.il

Zeidan Mohammed email: smaz@t2.technion.ac.il

 

 

 

 

Goal of the project:

Performing approximate inference via a heuristic that ignores extreme markers in likelihood computations when those computations are too strenuous to perform exactly, i.e., when they would take a very long time to finish or would cause memory overflow.

 

Requirements:

 

          To code a function that satisfies the project goal.

This function is coded within the superlink code and uses its internal data structures.

The function keeps removing markers from the data structures until a desired predefined complexity (lower than the original one) is reached. Once the complexity coefficient drops below the predefined threshold, superlink runs on the reduced data, which means faster calculation. As may be expected, this way we lose information that is needed for fully accurate results. Our task was to develop means of removing the less valuable markers, in an attempt to obtain results as close as possible to the exact ones, and to implement those means in this function.

 

Implementation:

 

 

1. Start with the total number of loci as specified in the input files.

2. Determine an elimination order for the problem.

3. Derive the complexity coefficient of the elimination order.

4. If it is less than the predefined complexity, run the superlink program to calculate the likelihood without any further operations, and stop.

5. Otherwise, choose one marker and remove it from the data structures.

6. Go to 3.

 

This is the abstract algorithm of the solution; a sketch of the loop in C is given below.
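
A minimal sketch of this loop in C, under stated assumptions: the type Order and the helpers determineEliminationOrder, complexityCoefficient, chooseMarker, removeMarker, and runSuperlink are hypothetical placeholders standing in for the corresponding superlink internals, not the actual API.

     /* Sketch only: all helper names below are assumed placeholders
        for the real superlink routines. */
     void approximateInference(int numLoci, double complexityThreshold)
     {
         while (numLoci > 0) {
             Order order  = determineEliminationOrder();   /* step 2 */
             double coeff = complexityCoefficient(order);  /* step 3 */
             if (coeff < complexityThreshold) {            /* step 4 */
                 runSuperlink();  /* likelihood on the reduced marker set */
                 return;
             }
             removeMarker(chooseMarker());                 /* step 5 */
             numLoci--;                                    /* step 6: repeat */
         }
     }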

 

The main problem was to determine which marker to clip while harming the accuracy of the likelihood result as little as possible.

 

 

 

How to choose a marker:

 

We considered a few characteristics of the loci to help choose the extreme ones:

1. The distance of a locus from the affection locus.

This characteristic indicates how close a locus is to the affection locus; hence, we avoid clipping loci located too close to it.

2. The number of persons in which the phenotype of a locus is unknown.

The bigger this number, the less information the locus provides, so it is recommended to get rid of it.

3. The number of possible alleles a locus has.

A locus with a large number of possible alleles causes higher complexity.

4. The closeness of a locus to its flanking neighbors.

If two loci are very close to each other, it is possible to give up one of them, since we have the information from the other.

 

Data structures:

 

We used only one data structure: an array that holds the former four characteristics for each locus.

The array size is the number of loci, and its element type is prio.

Prio is a struct, defined as follows:

     

typedef struct Prio{

          double leftDist;    /* distance to the left neighbor */

          double rightDist;   /* distance to the right neighbor */

          int numUnknown;     /* persons with unknown phenotype at this locus */

          double totalDist;   /* distance from the affection locus */

          double totalPrio;   /* total priority of this locus */

     }prio;

 

leftDist - the distance between a locus and its left neighbor.

rightDist - the distance between a locus and its right neighbor.

numUnknown - the number of persons whose phenotype at this locus is unknown.

totalDist - the distance of this locus from the affection locus.

totalPrio - the total priority of this locus.

*The lower a locus's priority, the higher its chance of being clipped off!

 

Hence, the total memory complexity is O(number of loci), as the allocation sketch below illustrates.
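
For illustration only, allocating this array might look as follows (a sketch; the name numLoci is an assumed variable holding the locus count taken from the input files):

     #include <stdlib.h>

     /* one prio record per locus => O(number of loci) memory;
        calloc zero-initializes the records, which matches the
        default priority of 0 for unprotected loci */
     prio *prioLoci = (prio *) calloc(numLoci, sizeof(prio));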

 

Algorithm for setting the priorities of the loci:

 

1. Calculate and set the values of every field of each cell of the priority array, matching each cell with the proper locus: a locus located at index i in the loci array (loci) corresponds to index i in the priority array (prioLoci).

        leftDist / rightDist / totalDist - calculated once by moving along the array that holds the recombination values between the loci; time complexity is O(number of loci).

        numUnknown - calculated by going over all the persons in all pedigrees at each locus; time complexity is O((number of loci) * (number of people in all pedigrees)).

2. Calculate the total priority:

a. As mentioned above, we have no interest in removing loci located too close to the affection locus; therefore, we protect the area surrounding it. The size of this area (dangerousInterval) is determined by a defined percentage, which the user may change (percentageForInterval). For each locus in the protected area we set a high positive value (100) as its total priority (totalPrio), and zero for the others. If, however, during the run of the program no loci are left outside the protected area, the area shrinks by one locus from each side.

b. Calculate the average number of persons with unknown phenotype per locus, and the average number of alleles per locus, over the loci outside the protected interval; time complexity is O(number of loci).

c. For each locus outside the protected area whose number of unknowns is greater than the average, subtract 10 from its total priority (the new total priority is -10). The same is done for each locus outside the protected area whose number of possible alleles is greater than the average.

d. We defined a critical distance (criticalDist): any two adjacent loci with a recombination frequency equal to or less than it have a defined positive value (minVal, default 30, user-changeable) subtracted from their total priority (totalPrio).

 

Steps (c) and (d) are performed in one loop over the priority array; time complexity is O(number of loci).

 

e. When choosing the marker, the search selects the locus with the lowest priority. If more than one locus has the lowest priority, take the one with the higher number of alleles; if the numbers of alleles are also equal, take the one farthest from the affection locus.

This is done by moving along the priority array only once; time complexity is O(number of loci).

3. Remove the chosen locus and adjust the data structures to stay consistent: delete the chosen marker from the loci array and shift the others (time complexity is O(number of loci) in the worst-case shift); likewise update the two-dimensional arrays (maleThetaVals and femaleThetaVals) (time complexity is O(number of loci) in the worst-case shift); and update the pedigrees, shifting the loci for each person (time complexity is O((number of loci) * (number of people in all pedigrees)) in the worst-case shift). A condensed sketch of steps 2(a)-2(e) follows below.
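
A condensed sketch of steps 2(a)-2(e) in C, under stated assumptions: inInterval and numAlleles are hypothetical helpers (membership in the protected interval, and the allele count of a locus), and the per-pair penalty of step (d) is approximated per locus via its flanking distances.

     /* Assumed helpers, standing in for superlink internals: */
     int inInterval(int i);   /* is locus i inside the protected area? */
     int numAlleles(int i);   /* number of possible alleles of locus i */

     void setPriorities(prio *prioLoci, int numLoci,
                        double criticalDist, double minVal)
     {
         int i, outside = 0;
         double avgUnknown = 0.0, avgAlleles = 0.0;

         /* (a) protect the area around the affection locus */
         for (i = 0; i < numLoci; i++)
             prioLoci[i].totalPrio = inInterval(i) ? 100.0 : 0.0;

         /* (b) averages over the loci outside the protected interval;
            assumes at least one such locus (otherwise the interval
            shrinks, which is not shown here) */
         for (i = 0; i < numLoci; i++) {
             if (inInterval(i)) continue;
             avgUnknown += prioLoci[i].numUnknown;
             avgAlleles += numAlleles(i);
             outside++;
         }
         avgUnknown /= outside;
         avgAlleles /= outside;

         /* (c)+(d) penalize uninformative, many-allele, and crowded loci */
         for (i = 0; i < numLoci; i++) {
             if (inInterval(i)) continue;
             if (prioLoci[i].numUnknown > avgUnknown)
                 prioLoci[i].totalPrio -= 10.0;
             if (numAlleles(i) > avgAlleles)
                 prioLoci[i].totalPrio -= 10.0;
             if (prioLoci[i].leftDist <= criticalDist ||
                 prioLoci[i].rightDist <= criticalDist)
                 prioLoci[i].totalPrio -= minVal;   /* default 30 */
         }
     }

     /* (e) lowest priority wins; ties broken by more alleles,
        then by greater distance from the affection locus */
     int chooseMarker(prio *prioLoci, int numLoci)
     {
         int i, best = -1;
         for (i = 0; i < numLoci; i++) {
             if (inInterval(i)) continue;
             if (best < 0
                 || prioLoci[i].totalPrio < prioLoci[best].totalPrio
                 || (prioLoci[i].totalPrio == prioLoci[best].totalPrio
                     && (numAlleles(i) > numAlleles(best)
                         || (numAlleles(i) == numAlleles(best)
                             && prioLoci[i].totalDist > prioLoci[best].totalDist))))
                 best = i;
         }
         return best;
     }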


prioLoci array:

 

[Diagram: the prioLoci priority array at run time]

This is how the priority array looks at run time. The affection locus is located in the middle of the diagram, and the rectangle marks the protected ("dangerous") area, which includes the loci surrounding the affection locus within a fixed length. The priority of these loci is set to 100, and the rest of the loci receive priorities according to the data each one holds. The flanking arrows indicate that the width of the protected area may shrink if needed, i.e., if no loci are left outside it (very unlikely to happen).

As already mentioned above, we choose the locus that has the lowest priority of all the loci.

 

Results and Conclusions:

 

*In all the files below, the affection locus is set in the middle of the map.
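
For reference when reading the tables, the error columns are consistent with the following definitions (our reading of the data; the formulas are not stated explicitly elsewhere):

     absolute error = | score(approx.) - score(exact) |

     error rate % = 100 * (absolute error) / | score(exact) |

For example, Data/ped0 in the first table gives |(-0.078245) - (-0.078388)| = 0.000143 ≈ 0.00014, and 100 * 0.000143 / 0.078388 ≈ 0.1824 %.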


Results (same number of alleles for each locus)

 

File | #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed | Result file (exact) | Result file (approx.)
Data/ped0 | 57 | 21 | 659 | -0.078388 | -0.078245 | 0.00014 | 0.182425882 | 3.140000 | 3.890 | 1 | result0 | result0
Data/ped1 | 57 | 21 | 1838 | -1.731987 | -1.102788 | 0.6292 | 36.32815951 | 22.516000 | 8.000 | 8 | result1 | result1
Data/ped2 | 57 | 23 | 2007 | -1.232434 | -2.253489 | 1.02106 | 82.84865559 | 21.110000 | 20.406 | 7 | result2 | result2
Data/ped3 | 57 | 25 | 2190 | -2.577146 | -2.251073 | 0.32607 | 12.65248457 | 24.671000 | 35.547 | 6 | result3 | result3
Data/ped4 | 57 | 27 | 2380 | -1.356160 | -0.723942 | 0.63222 | 46.61824563 | 1132.828000 | 293.813 | 7 | result4 | result4
Data/ped5 | 57 | 29 | 2531 | -1.282169 | -0.706973 | 0.5752 | 44.86116885 | 2371.485000 | 622.578 | 5 | result5 | result5
Data/ped6 | 25 | 23 | 851 | -1.145314 | 1.614273 | 2.75959 | 240.9458891 | 76.266000 | 45.515 | 6 | result6 | result6
Data/ped7 | 25 | 25 | 922 | -0.354794 | -1.584810 | 1.23002 | 346.684555 | 95.843000 | 56.297 | 8 | result7 | result7
Data/ped8 | 25 | 27 | 996 | -1.623843 | -0.578665 | 1.04518 | 64.36447366 | 120.156000 | 49.500 | 12 | result8 | result8
Data/ped9 | 25 | 29 | 1062 | -0.260679 | -0.630194 | 0.36952 | 141.7509657 | 89.235000 | 71.937 | 13 | result9 | result9


This kind of file, in which we assigned the number of alleles for each locus, reflects the effect of the number of unknowns and the recombination values between the loci on the final results of our implementation. As the results show, our implementation stands up nicely, since only one file yielded a relatively high error.

We obtained the following graphs from the table above. The first graph below describes the absolute error (error size) as a function of the number of loci removed.

The absolute error is generally low (except for one file), as can be seen from the table as well as from the graph.

[Graph: absolute error vs. number of loci removed]

The following graph describes the absolute error as a function of the number of loci on each side of the affection locus.

In this group, from a certain point on, the more loci on each side of the affection locus, the lower the absolute error.

[Graph: absolute error vs. number of loci on each side of the affection locus]

Results (same number of known phenotypes)

 

File | #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed | Result file (exact) | Result file (approx.)
Data/ped0 | 50 | 21 | 1716 | -2.105777 | -1.588505 | 0.51727 | 24.5644244 | 3.2500 | 2.547000 | 5 | result0 | result0
Data/ped1 | 50 | 23 | 1855 | -0.798646 | -0.677597 | 0.12105 | 15.1567778 | 2.7960 | 2.828000 | 2 | result1 | result1
Data/ped2 | 50 | 25 | 2033 | -2.237754 | 10.164054 | 12.4018 | 554.207835 | 5.6560 | 6.343000 | 14 | result2 | result2
Data/ped3 | 50 | 27 | 2197 | -1.301117 | -1.727963 | 0.42685 | 32.8061197 | 12.6250 | 10.89000 | 8 | result3 | result3
Data/ped4 | 50 | 29 | 2330 | -2.227370 | 0.892379 | 3.11975 | 140.064246 | 12.1710 | 13.81200 | 15 | result4 | result4
Data/ped5 | 50 | 31 | 2472 | -2.339975 | -2.540232 | 0.20026 | 8.55808289 | 8.8590 | 8.171000 | 9 | result5 | result5
Data/ped6 | 50 | 33 | 2632 | -2.024760 | -0.525317 | 1.49944 | 74.0553448 | 10.6250 | 7.421000 | 9 | result6 | result6
Data/ped7 | 50 | 35 | 2779 | -0.705142 | 1.642086 | 2.34723 | 332.873095 | 7.8120 | 6.671000 | 10 | result7 | result7
Data/ped8 | 50 | 37 | 2919 | -3.120746 | 1.638174 | 4.75892 | 152.493026 | 7.3120 | 8.156000 | 9 | result8 | result8
Data/ped9 | 50 | 39 | 3078 | -0.294515 | 3.999787 | 4.2943 | 1458.0928 | 18.2960 | 12.40600 | 11 | result9 | result9


In these files, all the loci have the same number of known phenotypes among the people given in the input (in fact, we assigned all loci to be known for every person; since we take the average into account, as described above, this has no effect). According to our implementation, in this case we remove mainly the loci with the highest number of alleles, in an attempt to reduce the complexity coefficient. As is obvious from the table and the first graph below, the more loci removed, the bigger the absolute error.

[Graph: absolute error vs. number of loci removed]

Also in this case, from a certain point on, the more loci on each side of the affection locus, the lower the absolute error.

[Graph: absolute error vs. number of loci on each side of the affection locus]

Results (same distance)

 

File | #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed | Result file (exact) | Result file (approx.)
Data/ped0 | 30 | 15 | 686 | -0.386085 | 2.580186 | 2.96627 | 768.294806 | 0.4840 | 0.9220 | 8 | result0 | result0
Data/ped1 | 30 | 17 | 750 | -0.231126 | 0.934130 | 1.16526 | 504.164828 | 0.6870 | 0.9840 | 4 | result1 | result1
Data/ped2 | 30 | 19 | 837 | -0.456340 | -0.586443 | 0.1301 | 28.5101021 | 0.7340 | 1.3430 | 6 | result2 | result2
Data/ped3 | 30 | 21 | 927 | -0.293532 | 0.739871 | 1.0334 | 352.058038 | 0.890 | 1.140 | 2 | result3 | result3
Data/ped4 | 30 | 23 | 1012 | -0.645523 | -0.571646 | 0.07388 | 11.4445186 | 1.0150 | 1.6710 | 4 | result4 | result4
Data/ped5 | 30 | 25 | 1095 | -0.586887 | -0.288023 | 0.29886 | 50.923602 | 1.0460 | 2.5150 | 8 | result5 | result5
Data/ped6 | 30 | 27 | 1174 | -0.345031 | -1.090526 | 0.7455 | 216.066093 | 1.000 | 3.250 | 11 | result6 | result6
Data/ped7 | 30 | 29 | 1267 | -0.641469 | -0.914389 | 0.27292 | 42.5460934 | 3.2180 | 4.2030 | 13 | result7 | result7
Data/ped8 | 30 | 31 | 1341 | 0.275094 | -1.297687 | 1.57278 | 571.724938 | 2.610 | 4.9060 | 14 | result8 | result8
Data/ped9 | 30 | 33 | 1435 | -0.321124 | -1.310705 | 0.98958 | 308.161645 | 2.860 | 5.8280 | 15 | result9 | result9


All loci in these files have the same recombination value between them (0.1).

In general, the absolute error is low except for the first file; however, this file has a relatively low number of loci on each side of the affection locus, and half of its loci were removed, so this particular file is not representative.

Results (general files)

File | #people | #loci | #Nodes | Lod score/likelihood (exact) | Lod score/likelihood (approx.) | Absolute error | Error rate % | Run time exact (sec) | Run time approx. (sec) | #loci removed | Result file (exact) | Result file (approx.)
Data/ped1 | | | | | | | 24.481928 | 1111.5160 | 41.7500 | 5 | result1 | result1
Data/ped2 | 20 | 25 | 1003 | -0.792989 | -0.647185 | 0.1458 | 18.386636 | 2326.140 | 50.3590 | 7 | result2 | result2
Data/ped3 | 20 | 27 | 1079 | -0.128628 | -1.262478 | 1.13385 | 881.49548 | 2688.6250 | 175.0620 | 8 | result3 | result3
Data/ped4 | 20 | 29 | 1149 | -0.734409 | -0.419229 | 0.31518 | 42.916141 | 199.2190 | 426.9530 | 2 | result4 | result4
Data/ped5 | 20 | 31 | 1238 | -0.239630 | -0.102404 | 0.13723 | 57.265785 | 748.1250 | 1628.031 | 11 | result5 | result5
Data/ped6 | 20 | 33 | 1319 | -0.417219 | -0.907606 | 0.49039 | 117.53707 | 337.5470 | 974.7180 | 14 | result6 | result6
Data/ped7 | 20 | 35 | 1406 | -0.724944 | 0.475627 | 1.20057 | 165.60879 | 5233.9380 | 233.6090 | 16 | result7 | result7
Data/ped8 | 20 | 23 | 901 | -0.475182 | -0.432396 | 0.04279 | 9.0041289 | 2897.203 | 1279.765 | 2 | result8 | result8
Data/ped9 | 57 | 29 | 2531 | -1.282169 | -1.281969 | 0.0002 | 0.0155986 | 6093.265 | 1217.25 | 2 | result9 | result9


These are general files, with random values of the variables explained above.

In general the results are fine, especially when working on long, massive files with a complexity coefficient higher than 9.

Files 8 and 9 are examples of such files (true for all the categories).

*A very significant improvement in running time is obtained (regardless of the category).

File no. 9 took about 101 minutes in the exact calculation, while the approximation took about 21 minutes, with a very low absolute error and error rate!

[Graph: absolute error vs. number of loci removed (general files)]

This graph supports our assumption that when a large number of loci is removed, the error may get higher.

Here too we can notice a tendency of the absolute error to decrease as the number of loci on each side of the affection locus grows.

The absolute error is relatively very low, so the graph still supports the assumption.


Total results:

 

These are graphs combining all of the above categories.


Download Executable for Windows

 

Note: the complexityThreshold was initialized to 8.