Experimental Results

 

The following inputs were used to test the effectiveness of the approximate algorithm in both running time and accuracy.

The tests were divided into categories, each focusing on a specific aspect.

The input files were generated randomly by PEDTOOL and were sometimes adjusted to fit their category.

 

 

  1. All people are genotyped.

The percentage of typed people in the pedigree is 100%, and the percentage of missing information for typed people is 0%.

 

| Files | #People | #Loci | Complexity from | Complexity to | Affected locus position | Lod-score/likelihood (exact alg.) | Lod-score/likelihood (approx.) | Error rate, likelihood (%) | Run time (exact) | Run time (approx.) | #Clipped markers | Clipped markers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pedfile17.dat / datafile17.dat | 50 | 50 | 9 | 8 | 25 | 0.088898 | 0.088898 | 0 | 300.625 | 152 | 1 | 45 |
| pedfile17.dat / datafile17.dat | 50 | 50 | 9 | 7 | 25 | 0.088898 | 0.088898 | 0 | 300.625 | 68.109 | 2 | 45 49 |
| pedfile17.dat / datafile17.dat | 50 | 50 | 9 | 6 | 25 | 0.088898 | 0.173503 | 95.17086999 | 300.625 | 66.625 | 3 | 45 49 4 |
| pedfile18.dat / datafile18.dat | 50 | 25 | 6 | 5 | 12 | 0.807321 | 0.807321 | 0 | 1.406 | 1.437 | 1 | 25 |
| pedfile18.dat / datafile18.dat | 50 | 25 | 6 | 4 | 12 | 0.807321 | 0.291836 | 63.85130574 | 1.406 | 3.156 | 14 | 25 1 2 24 3 4 23 22 5 21 6 20 7 19 |
| pedfile19.dat / datafile19.dat | 25 | 50 | 7 | 6 | 25 | 1.217278 | 1.217278 | 0 | 14.593 | 10.437 | 1 | 45 |
| pedfile20.dat / datafile20.dat | 40 | 70 | 7 | 6 | 35 | 3.557786 | 3.557786 | 0 | 11.234 | 24.734 | 3 | 13 48 51 |
| pedfile20.dat / datafile20.dat | 40 | 70 | 7 | 5 | 35 | 3.557786 | 3.557786 | 0 | 11.234 | 45.843 | 9 | 13 48 51 64 16 10 52 24 15 |
| pedfile21.dat / datafile21.dat | 25 | 40 | 5 | 4 | 13 | 4.999356 | 3.315356 | 33.68433854 | 1.39 | 4.781 | 7 | 1 2 25 3 24 4 23 5 |
| pedfile22.dat / datafile22.dat | 30 | 100 | 7 | 6 | 15 | 2.692149 | 2.692149 | 0 | 8.312 | 20.828 | 9 | 30 29 28 27 26 1 25 2 3 |
| pedfile22.dat / pedfile22.dat | 30 | 100 | 7 | 5 | 15 | 2.692149 | 2.692149 | 0 | 8.312 | 18.421 | 10 | 30 29 28 27 26 1 25 2 3 24 |
| pedfile23.dat / datafile23.dat | 30 | 100 | 6 | 5 | 15 | 1.38493 | -0.042575 | 103.0741626 | 11.812 | 14.156 | 3 | 2 39 38 |

Remarks:

Indeed, we can see this in the results above: each clipped marker is either an extreme marker or close to its neighbor.

In general, the approximate results are very close to the exact results, and the error rate is large only when the number of clipped markers is very large.
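
The error-rate column is not defined explicitly in the text, but the tabulated values appear consistent with the relative difference between the approximate and the exact score, expressed as a percentage. The following is a minimal sketch under that assumption; the function name `error_rate_percent` is hypothetical and is not part of SUPERLINK.

```python
def error_rate_percent(exact: float, approx: float) -> float:
    """Relative difference between the approximate and exact score, in percent.

    Assumption: error = |exact - approx| / |exact| * 100. This formula is
    inferred from the tabulated values; it is not stated in the report.
    """
    if exact == 0:
        raise ValueError("relative error is undefined when the exact score is 0")
    return abs(exact - approx) / abs(exact) * 100.0


# pedfile17.dat clipped down to complexity 6:
# exact score 0.088898, approximate score 0.173503.
print(round(error_rate_percent(0.088898, 0.173503), 2))
```

For this row the sketch gives about 95.17, which agrees with the value reported for pedfile17.dat. Note that a relative measure of this kind becomes very large whenever the exact score is close to zero, even if the absolute difference is small.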

 

 

 

 

  2. A high percentage of the people are not genotyped

(35-50% of the people are untyped, and about 30-50% of the information is missing for typed people)

 

| Files | #People | #Loci | Complexity from | Complexity to | Affected locus position | Lod-score/likelihood (exact alg.) | Lod-score/likelihood (approx.) | Error rate, likelihood (%) | Run time (exact) | Run time (approx.) | #Clipped markers | Clipped markers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pedfile24.dat / datafile24.dat | 20 | 20 | 8 | 7 | 10 | 0.136772 | 0.028404 | 79.23259147 | 29.234 | 2.968 | 2 | 1 20 |
| pedfile25.dat / datafile25.dat | 30 | 15 | 8 | 7 | 7 | 0.429287 | 0.429297 | 0.002329444 | 87.47 | 53.4 | 1 | 14 |
| pedfile25.dat / datafile25.dat | 30 | 15 | 8 | 6 | 7 | 0.429287 | 0.5173 | 20.50213494 | 87.47 | 25.921 | | |
| pedfile26.dat / datafile26.dat | 15 | 20 | 6 | 5 | 7 | 0.479571 | 0.201098 | 58.06710581 | 0.984 | 0.671 | 11 | 20 19 18 17 16 15 14 1 2 13 12 |
| pedfile27.dat / datafile27.dat | 30 | 20 | 10 | 9 | 15 | 0.782396 | 0.783114 | 0.091769385 | 5306 | 4561 | 1 | 5 |
| pedfile35.dat / datafile35.dat | 30 | 25 | 9 | 8 | 12 | 1.28700 | 1.28700 | 0 | 1611 | 255 | 6 | 28 24 23 30 26 27 |

Remarks:

and we can see in files 27:

| Marker position | Distance from disease locus | Number of untyped individuals at this marker |
|---|---|---|
| 5 | 0.4988 | 15 |
| 0 | 0.4998 | 9 |

Besides, in calculating the heuristic value we take into consideration the “environment” of these individuals (children and parents), and in the end we obtained a very good approximation.

In files 35 the markers were clipped in the order 28 24 23 30 26 27, because many people are not typed at markers 23 and 24.

of people and markers are small. Both numbers should be increased: the number of loci, in order to get a highly accurate result after clipping some of them, and the number of people, because many individuals are untyped and so contribute less information. Files 35 demonstrate this: there the numbers of loci and people were relatively large, and the accuracy is perfect. (We tried to make every input like this, but, perhaps because there are many untyped persons, the run time of the “original” SUPERLINK is long and sometimes it reports that there is not sufficient memory.)

 

 

 

 

  3. Affection status locus at the start of the genetic map.

| Files | #People | #Loci | Complexity from | Complexity to | Affected locus position | Lod-score/likelihood (exact alg.) | Lod-score/likelihood (approx.) | Error rate, likelihood (%) | Run time (exact) | Run time (approx.) | #Clipped markers | Clipped markers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pedfile8.dat / datafile8.dat | 25 | 25 | 8 | 7 | 1 | -0.00456 | -0.00456 | 0 | 156 | 165.171 | 6 | 25 18 21 19 23 24 |
| pedfile8.dat / datafile8.dat | 25 | 25 | 8 | 6 | 1 | -0.00456 | -0.004569 | 0.197368421 | 156 | 21.921 | 16 | 25 18 21 19 23 24 20 22 17 16 15 14 13 12 11 10 |
| pedfile14.dat / datafile14.dat | 40 | 15 | 6 | 5 | 1 | -0.011731 | -0.011868 | 1.167845878 | 2.609 | 2.25 | 4 | 15 14 13 12 11 |
| pedfile15.dat / datafile15.dat | 60 | 15 | 7 | 6 | 1 | -0.100067 | -0.100067 | 0 | 14.265 | 7.719 | 2 | 15 14 |
| pedfile15.dat / datafile15.dat | 60 | 15 | 7 | 5 | 1 | -0.100067 | -0.100067 | 0 | 14.265 | 6.094 | 11 | 15 14 13 12 11 10 9 8 7 6 5 |
| pedfile16.dat / datafile16.dat | 25 | 20 | 6 | 5 | 1 | -0.020113 | -0.020113 | 0 | 3.796 | 1.062 | 3 | 19 17 15 |

Remarks:

on the other side of the chromosome are very far away (both in distance and in the number of markers between them), so, as we see, it seems that these markers don’t contribute to the accuracy of the results.

In files 15 there are many clipped markers, but that doesn’t affect the accuracy of the result; besides, in these files the markers were clipped exactly according to their distance from the disease locus.

 

 

 

  4. Number of People is small (compared with the number of Loci)

| Files | #People | #Loci | Complexity from | Complexity to | Affected locus position | Lod-score/likelihood (exact alg.) | Lod-score/likelihood (approx.) | Error rate, likelihood (%) | Run time (exact) | Run time (approx.) | #Clipped markers | Clipped markers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pedfile2.dat / datafile2.dat | 15 | 20 | 6 | 5 | 6 | -0.20874 | -0.208741 | 0.000479065 | 2.171 | 1.359 | 2 | 20 19 |
| pedfile2.dat / datafile2.dat | 15 | 20 | 6 | 4 | 6 | -0.20874 | -0.208969 | 0.109705854 | 2.171 | 0.906 | 10 | 20 19 18 17 16 15 14 13 12 11 |
| pedfile2.dat / datafile2.dat | 15 | 20 | 6 | 3 | 6 | -0.20874 | -0.208969 | 0.109705854 | 2.171 | 0.906 | 10 | 20 19 18 17 16 15 14 13 12 11 |
| pedfile4.dat / datafile4.dat | 20 | 25 | 7 | 6 | 11 | -0.009586 | -0.00394 | 58.89839349 | 17.781 | 8.937 | 16 | 0 1 24 23 2 3 22 21 4 20 19 5 18 6 17 16 |
| pedfile5.dat / datafile5.dat | 20 | 25 | 6 | 5 | 11 | -0.199995 | -0.199995 | 0 | 5.046 | 1.546 | 3 | 25 24 23 |
| pedfile5.dat / datafile5.dat | 20 | 25 | 6 | 4 | 11 | -0.199995 | -0.56639 | 183.2020801 | 5.046 | 1.797 | 9 | 25 24 23 22 21 1 20 19 2 |
| pedfile6.dat / datafile6.dat | 15 | 40 | 5 | 4 | 16 | -0.094554 | 0.2 | 311.5193434 | 0.328 | 1.625 | 20 | 22 28 4 30 35 2 37 34 36 31 32 33 38 39 40 1 29 3 27 5 |
| pedfile7.dat / datafile7.dat | 20 | 35 | 8 | 7 | 21 | -0.24115 | -0.41397 | 71.66493883 | 381.453 | 77.89 | 27 | 5 1 12 3 7 2 4 33 6 34 8 35 9 32 10 31 30 11 29 13 28 14 26 15 23 16 19 |
| pedfile9.dat / datafile9.dat | 15 | 20 | 6 | 5 | 6 | -0.206032 | -0.206032 | 0 | 2.828 | 1.11 | 2 | 20 19 |
| pedfile9.dat / datafile9.dat | 15 | 20 | 6 | 4 | 6 | -0.206032 | -0.205636 | 0.192203153 | 2.828 | 1.09 | 8 | 20 19 18 17 16 15 14 13 |
| pedfile11.dat / datafile11.dat | 24 | 30 | 9 | 8 | 16 | -0.387493 | -0.388159 | 0.171874073 | 1985.9 | 112.58 | 2 | 26 8 |

Remarks:

there is a need to clip off many markers, which leads to a loss of accuracy. However, where the number of clipped markers is small, the accuracy is very good.

 

 

 

  5. Number of Loci is small

| Files | #People | #Loci | Complexity from | Complexity to | Affected locus position | Lod-score/likelihood (exact alg.) | Lod-score/likelihood (approx.) | Error rate, likelihood (%) | Run time (exact) | Run time (approx.) | #Clipped markers | Clipped markers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pedfile12.dat / datafile12.dat | 30 | 15 | 8 | 7 | 6 | 1.6083 | 1.776 | 10.42715911 | 8.82 | 3.953 | 4 | 15 14 13 1 |
| pedfile13.dat / datafile13.dat | 40 | 10 | 8 | 7 | 6 | 0.258605 | 1.987379 | 668.4998357 | 44.75 | 3.468 | 5 | 1 10 2 9 3 |
| pedfile13.dat / datafile13.dat | 40 | 10 | 8 | 6 | 6 | 0.258605 | 0.246179 | 4.805011504 | 44.75 | 2.906 | 5 | 1 10 2 9 3 4 |
| pedfile32.dat / datafile32.dat | 40 | 15 | 7 | 6 | 7 | 0.003842 | -0.29329 | 77337.84488 | 9 | 5.45 | 5 | 1 2 15 14 3 |
| pedfile33.dat / datafile33.dat | 40 | 16 | 7 | 6 | 8 | 0.129873 | -0.560718 | 531.737 | 2.39 | 4.578 | 2 | 1 2 |
| pedfile34.dat / datafile34.dat | 40 | 17 | 9 | 8 | 8 | -0.288983 | -0.288956 | 0.0093 | 490 | 122 | 4 | 17 16 15 14 |

Remarks:

and then it is very clear that the result will be different.

 

 

 

 

  6. Number of Loci is large

| Files | #People | #Loci | Complexity from | Complexity to | Affected locus position | Lod-score/likelihood (exact alg.) | Lod-score/likelihood (approx.) | Error rate, likelihood (%) | Run time (exact) | Run time (approx.) | #Clipped markers | Clipped markers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| pedfile6.dat / datafile6.dat | 15 | 40 | 5 | 4 | 16 | -0.094554 | 0.2 | 311.5193434 | 0.328 | 1.625 | 20 | 22 28 4 30 35 2 37 34 36 31 32 33 38 39 40 1 29 3 27 5 |
| pedfile28.dat / datafile28.dat | 20 | 40 | 9 | 8 | 20 | 0.06747 | 0.05929 | 12.123 | 1042.98 | 162.23 | 4 | 13 4 34 35 |
| pedfile29.dat / datafile29.dat | 15 | 50 | 8 | 7 | 25 | 0.248177 | 0.289056 | 16.471 | 117.01 | 30.015 | 8 | 4 8 5 6 12 15 41 |
| pedfile30.dat / datafile30.dat | 25 | 40 | 9 | 8 | 20 | -0.041718 | -0.041273 | 1.066685 | 445.3 | 76.453 | 2 | 36 7 |
| pedfile31.dat / datafile31.dat | 20 | 45 | 6 | 5 | 20 | 0.353116 | 0.352786 | 0.093453 | 3.49 | 2.187 | 1 | 10 |

Remarks:

 

this marker the result doesn’t change.

 

 

Graphs and analysis

 

To examine how the error rate depends on the distance from the affected locus, we built a graph of the error rate as a function of the number of markers on each side of the disease locus. Since in our tests the affected locus is not always in the middle of the markers, we take the maximum of the number of markers on the right and the number of markers on the left.

In addition, the same maximum appears more than once in our test files, so we average all the error rates that share the same “X” value.

 

| Max(number of left markers, number of right markers) | Error rate |
|---|---|
| 6 | 668.5 |
| 8 | 410.5799 |
| 10 | 79.2326 |
| 13 | 29.033 |
| 14 | 0.00016 |
| 16 | 0.1718 |
| 20 | 6.6 |
| 21 | 71.665 |
| 25 | 8.23 |
| 27 | 33.68 |
| 35 | 0 |
| 85 | 5.15 |
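
The grouping and averaging step described above can be sketched as follows. This is only an illustration: the function name is hypothetical, and the way markers are counted on each side is an assumption. Several tabulated X values (for example 10 for a file with 20 loci and the disease locus at position 10) seem consistent with X = max(position, #loci - position), so the sketch uses that convention.

```python
from collections import defaultdict


def average_error_by_max_side(results):
    """Average the error rate over all runs that share the same X value.

    `results` is an iterable of (n_loci, disease_position, error_percent).
    Assumption: X = max(disease_position, n_loci - disease_position); the
    report does not state exactly how the two sides are counted.
    """
    groups = defaultdict(list)
    for n_loci, position, error in results:
        x = max(position, n_loci - position)
        groups[x].append(error)
    # One averaged point per distinct X value, in increasing order of X.
    return {x: sum(errs) / len(errs) for x, errs in sorted(groups.items())}


# Two runs on pedfile20.dat (70 loci, disease locus at 35) and one run on
# pedfile24.dat (20 loci, disease locus at 10), taken from the result tables.
runs = [(70, 35, 0.0), (70, 35, 0.0), (20, 10, 79.23259147)]
print(average_error_by_max_side(runs))
```

Averaging gives a single point per X value, at the cost of hiding the spread between runs that share the same X.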

 

 

 

 

 

 

 

 

 

 

 

 

 

Graph: the error rate as a function of the number of clipped markers divided by the initial number of markers

 

 

Conclusion:

 

With our strategy for clipping markers we obtain a good approximation most of the time. The largest deviations occurred when the initial number of markers was small, but in that case the run time before clipping any markers is acceptable, and there is no need to reduce the number of markers.

The running time was also reduced significantly (by more than half) in most of the tests. When the initial number of markers is small, however, the run time becomes longer, because of the time spent calculating the complexity (which may be done more than once); but, as we said before, in these cases there is no need to reduce the number of markers.
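
The marker-by-marker clipping discussed in this report can be pictured roughly as the loop below. It is a sketch under stated assumptions, not the actual SUPERLINK code: `complexity` and `clip_score` are hypothetical callables supplied by the caller, and the rule that the marker with the highest heuristic value is clipped first is an assumption.

```python
from typing import Callable, List, Sequence


def clip_markers(
    markers: Sequence[int],
    complexity: Callable[[Sequence[int]], float],
    clip_score: Callable[[int, Sequence[int]], float],
    threshold: float,
) -> List[int]:
    """Clip one marker at a time until the complexity reaches the threshold.

    Returns the clipped markers in the order in which they were removed.
    """
    remaining = list(markers)
    clipped: List[int] = []
    # The complexity is re-evaluated after every clip; this is the repeated
    # cost that can dominate when the initial number of markers is small.
    while remaining and complexity(remaining) > threshold:
        # Assumption: the marker with the highest heuristic value goes first.
        candidate = max(remaining, key=lambda m: clip_score(m, remaining))
        remaining.remove(candidate)
        clipped.append(candidate)
    return clipped


def toy_complexity(ms: Sequence[int]) -> float:
    """Toy stand-in: complexity simply grows with the number of markers."""
    return float(len(ms))


def toy_score(m: int, ms: Sequence[int]) -> float:
    """Toy stand-in: markers far from a disease locus at position 5 score high."""
    return float(abs(m - 5))


print(clip_markers(range(1, 11), toy_complexity, toy_score, threshold=7))
```

Because the loop is greedy and never restores a marker, it does not guarantee that the smallest possible set of markers is clipped.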

 

 

Disadvantages of the suggested algorithm and how to improve it

 

 

1.       In deciding which marker to clip, we have not tried to clip the minimal number of markers needed to reach the desired threshold. Of course, we take into consideration the number of possible alleles, which influences the complexity, but we should investigate more deeply exactly how it influences the complexity; besides, it would be good to look for other factors that influence the complexity.

2.       In our algorithm we clip marker by marker, and we don’t return

3.       The function byRecombination, which evaluates a heuristic value according to the distances from neighbors, shouldn’t be linear; instead, it would be better if this heuristic were such that the value becomes very large when a neighbor becomes very close (as in the calculation of byUnknown).
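
To illustrate the last point, the sketch below compares a linear closeness score with a reciprocal one. Both functions are hypothetical stand-ins: the actual formulas of byRecombination and byUnknown are not given here, and the constants (`max_distance = 0.5`, `epsilon`) are illustrative choices only.

```python
def linear_closeness(distance: float, max_distance: float = 0.5) -> float:
    """Linear heuristic: grows at a constant rate as the neighbor gets closer."""
    return max(0.0, max_distance - distance)


def reciprocal_closeness(distance: float, epsilon: float = 1e-6) -> float:
    """Non-linear alternative: becomes very large as the distance approaches 0.

    `epsilon` only guards against division by zero.
    """
    return 1.0 / (distance + epsilon)


# Compare how the two scores react as a neighbor moves closer.
for d in (0.4, 0.1, 0.01, 0.001):
    print(d, linear_closeness(d), round(reciprocal_closeness(d), 1))
```

With the linear form, a neighbor at distance 0.001 scores only slightly higher than one at 0.01, whereas the reciprocal form separates them by roughly a factor of ten, which is the behavior the suggestion asks for.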