Summary of Results

 

As described in the "P-value" section, we implemented 2 versions of the weighted TNoM algorithm: the "attached" version and the "unattached" version.

This implementation covers the unidirectional case, which assumes that the correct position of the healthy (+) samples is on the left side of the vector.

 

We ran both versions on a data matrix of size 700 (genes) × 42 (samples). 29 of the samples are labeled '-' and 13 are labeled '+'.

For the data matrix, click here. 

The data was edited from the data in: http://www.rii.com/publications/2002/vantveer.html.

 

First, we wanted to check the correctness of the 2 algorithms.

For this purpose, we produced the weights matrix according to the distribution in Table 1 of the quality analysis section. In addition, we produced the noised data according to the first (naïve) noise production method.

With these two matrices we ran:

  1. The non-weighted version.
  2. The “attached” weighted version.
  3. The “unattached” weighted version.

 

For the results data  (the output of the algorithms – the TNoM score and p-value of all genes sorted by their p-value) click here.

 

Graph 1 shows how the M measure obtained in each run depends on Q (the quality of the genes in the experiment), and compares the performance of the two algorithms, attached and unattached. The attached version performs much better than the unattached version: it yields a lower M than the non-weighted version, whereas the unattached version not only fails to improve the results but is actually worse than the non-weighted version.

 

graph 1: Comparison of the performance of the attached and unattached algorithms. Both show the M measure (the effect of noise on the algorithm's results) as a function of the Q parameter (gene quality), with noise produced by the naïve method described in the "quality analysis" section.

 

 

Focusing on the attached version, which improves the results, graph 2 shows the dependency of the M measure on Q, together with its trend line.

The higher Q is, the smaller the advantage of the weighted algorithm. In general, the M of the weighted version is lower than the M of the non-weighted version.

 

graph 2: M measure of the attached version (the effect of noise on the algorithm's results) as a function of the Q parameter (gene quality), with noise produced by the naïve method described in the "quality analysis" section.

 

 

Since the attached version improves the results while the unattached version does not, we decided to continue our testing using only the attached version. From this point on, we used the second (statistical) method for producing noise. For each gene quality we ran 10 tests, re-creating the weights and noise matrices each time. The processed results can be found here.

 

Graph 3 compares the "attached" weighted algorithm with the un-weighted algorithm. For every gene quality except 0, the weighted algorithm significantly improves the M measure. The standard deviation of the 10 runs for each quality is shown on top of each bar. Qualities 3 to 6 (inclusive) show the clearest separation, since the standard-deviation ranges of the two algorithms do not overlap there.

 

graph 3: M measure of the attached version (noise effect on algorithm results) as a function of Q parameter (gene quality) including the standard deviation of each gene quality. The noise is produced according to the statistical way described in the "quality analysis" section.
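The overlap check behind the statement about qualities 3 to 6 can be sketched as below. This is only an illustration: the function name is hypothetical, the sample standard deviation is an assumption (the text does not say whether the sample or the population form was used), and the numbers in the test calls are made up rather than taken from the experiments.

```python
from statistics import mean, stdev


def error_bars_overlap(runs_a, runs_b):
    """Return True if the mean +/- one standard deviation ranges overlap.

    runs_a, runs_b: M values from the repeated runs of each algorithm
    at one gene quality. Uses the sample standard deviation (assumption).
    """
    mean_a, sd_a = mean(runs_a), stdev(runs_a)
    mean_b, sd_b = mean(runs_b), stdev(runs_b)
    low_a, high_a = mean_a - sd_a, mean_a + sd_a
    low_b, high_b = mean_b - sd_b, mean_b + sd_b
    # Two closed intervals overlap when each starts before the other ends.
    return low_a <= high_b and low_b <= high_a
```

Non-overlapping ranges are a visual indication of separation, not a formal significance test.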

 

 

Graph 4 shows the performance of the weighted algorithm relative to the non-weighted algorithm on the noised data. For each run at each gene quality, we first normalized both the weighted and the non-weighted M by subtracting the un-noised M (the baseline). We then divided the normalized weighted M by the normalized non-weighted M: (M_w - M_0) / (M_nw - M_0), where M_w is the weighted M, M_nw is the non-weighted M, and M_0 is the un-noised baseline.

Finally, we calculated the average of all 10 runs of each gene quality.

Note that the lower the resulting ratio, the better: a value below 1 means that the weighted algorithm is closer to the un-noised baseline than the non-weighted one.
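The calculation described above can be sketched in a few lines. This is a minimal illustration, not the project's code: the function and parameter names are hypothetical, and a single un-noised baseline value per gene quality is assumed.

```python
from statistics import mean


def relative_noise_effect(weighted_m, unweighted_m, baseline_m):
    """Average, over repeated runs, of the baseline-corrected ratio
    (weighted M - baseline) / (non-weighted M - baseline).

    weighted_m, unweighted_m: M values of the paired runs at one gene quality.
    baseline_m: M of the un-noised data (assumed to be one value).
    """
    ratios = []
    for m_w, m_nw in zip(weighted_m, unweighted_m):
        denominator = m_nw - baseline_m
        if denominator == 0:
            # The noise had no effect on the non-weighted run,
            # so the ratio is undefined for this run.
            raise ValueError("non-weighted M equals the baseline")
        ratios.append((m_w - baseline_m) / denominator)
    return mean(ratios)
```

A result below 1 means the weighted run stayed closer to the baseline than the non-weighted run did.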

 

graph 4: performance of the attached weighted algorithm relative to the non-weighted algorithm on the noised data, as a function of the Q parameter (gene quality), with noise produced by the statistical method.

 

 

 

We can see that the weighted algorithm performs best when the gene quality is 6. When the gene quality is higher than 6, the weighted algorithm is still almost twice as good as the non-weighted version, but the ratio rises because the un-weighted algorithm performs better at lower noise levels.

When the gene quality is lower than 6, the noise levels rise and the weighted version improves less (but it still improves: the ratio stays below 1).

The only gene quality at which the weighted algorithm does not improve on the un-weighted one is 0. As seen in table 1 in the quality analysis section, at gene quality 0, 80% of the weights are 0. The weighted algorithm ignores these samples, so it works with only 20% of the original labels. Yet a low weight does not necessarily mean that the label should be ignored, for 2 main reasons:

1) The noise generated for a sample weighted 0 can still be close to 0. This means that even though the weight is 0, the sample can still be trusted.

2) Sometimes even high noise does not damage the result of the un-weighted algorithm. If, for example, a sample that is greatly over-expressed is given positive noise, it stays on the same side of the threshold and does not change the result of the un-weighted algorithm. In the weighted algorithm, however, samples with a weight of 0 are ignored and, as explained above, not always justifiably. Since around 80% of the samples are removed by the weighted algorithm at gene quality 0, its results differ greatly from the un-noised baseline.

 

 

In addition to the results presented so far, we also tried creating weights in other ways, one of which draws them from the beta distribution.

For a description of choosing weights according to the beta distribution, click here.

 

 

Running on real weighted data

After evaluating the algorithm, we ran it on real weighted data. The data, which hasn’t been published yet, was collected at the Reynolds Cardiovascular Clinical Research Center at Stanford.

The data contains samples that were taken from normal and diseased human arteries during heart transplant operations. The data contains 7580 genes. We looked at the following classifications:

  1. Normal artery vs. diseased artery. (82 samples, 27 normal vs. 55 diseased).
  2. Diseased arteries from patients who suffer from diabetes vs. diseased arteries from patients who don't suffer from diabetes. (55 samples, 17 diabetics vs. 38 non-diabetics).
  3. Normal arteries from patients who suffer from diabetes vs. normal arteries from patients who don't suffer from diabetes. (27 samples, 8 diabetics vs. 19 non-diabetics).

 

In the data, the weights are given as p-values: a low p-value means that the sample is good (reliable), and a high p-value indicates a bad (unreliable) sample. Because our program assumes that the higher the weight, the better the sample, we needed to convert the weights.

In order to decide on the best conversion function, we first plotted a histogram of the distribution of the given weights:

We can see that over 50% of the weights are below 0.025. We therefore did not choose the first intuitive method of subtracting the weight from one, because we wanted to emphasize the differences between the lower weights. For that purpose, the natural function to use is: newWeight = -log(oldWeight). If the old weight is zero, the new weight is defined as 20.
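A minimal sketch of this conversion is shown below. The function name is hypothetical, and the logarithm base is an assumption: the text does not state it, so it is exposed as a parameter with base 10 as a placeholder default.

```python
import math

# From the text: an old weight of zero is mapped to 20,
# because -log(0) is undefined.
ZERO_WEIGHT_VALUE = 20.0


def convert_weight(old_weight, base=10.0):
    """Map a p-value weight (low = reliable) to a score where high = reliable.

    base: logarithm base; an assumption, since the source does not specify it.
    """
    if not 0.0 <= old_weight <= 1.0:
        raise ValueError("a p-value must lie in [0, 1]")
    if old_weight == 0.0:
        return ZERO_WEIGHT_VALUE
    return 0.0 - math.log(old_weight, base)
```

Unlike 1 - oldWeight, which squeezes all weights below 0.025 into the narrow range 0.975 to 1, the logarithm spreads them out: each factor-of-`base` decrease in the p-value adds 1 to the new weight.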

 

After that conversion, the distribution plot is:

Next, we want to transform these new weights into 5 discrete weights between 0 and 1 (inclusive), with which we will run our program.

We wanted the weights to be distributed with a medium-high quality, similar to gene qualities 6-7 in Table 2 (on the Quality Analysis page).

 

| Weight | Bins (inclusive) | Percentage |
|---|---|---|
| weight=0 | 0 | 8.0597% |
| weight=0.25 | 0.25 | 10.7169% |
| weight=0.5 | 0.5-0.75 | 12.9703% |
| weight=0.75 | 1-2.25 | 20.2101% |
| weight=1 | 2.5-20 | 48.0430% |
| total | | 100.0000% |

We chose the bin borders to suit the distribution we wanted to achieve. For example, we wanted around 5%-10% of the samples to have the discrete weight of 0, so all the samples in the lowest "bin" (~8%) were given the discrete value of 0.
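The mapping in the table can be sketched as a simple lookup. This is an illustration under a stated assumption: each bin label in the table is read as the lower edge of a 0.25-wide histogram bin, so that, for example, every converted weight from 1 up to (but not including) 2.5 receives the discrete weight 0.75. The names below are hypothetical.

```python
import bisect

# Lower edges (on the -log scale) at which the discrete weight steps up.
# Read off the table above; the 0.25-wide binning is an assumption.
BIN_LOWER_EDGES = [0.25, 0.5, 1.0, 2.5]
DISCRETE_WEIGHTS = [0.0, 0.25, 0.5, 0.75, 1.0]


def discretize(new_weight):
    """Return the discrete weight (0, 0.25, 0.5, 0.75 or 1) for a converted weight."""
    # bisect_right counts how many edges are <= new_weight,
    # which is exactly the index of the matching discrete weight.
    index = bisect.bisect_right(BIN_LOWER_EDGES, new_weight)
    return DISCRETE_WEIGHTS[index]
```

Keeping the edges in one list makes it easy to move a border and recheck the resulting percentages against the target distribution.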

 

 

After creating the new discrete weights for all the data, we ran our program with this data and weights. For each test, we ran both the weighted version, and the non-weighted one. Due to memory constraints, on the first test, we reduced the number of total samples from 82 to 65 (22 Normal and 43 diseased).

 

For the results data (the output of the algorithms – the TNoM score and p-value of all genes sorted by their p-value) click here.

 

Since we don't know the actual noise level in the real data, we cannot precisely measure whether the weighted algorithm improved the results or not. Still, we would like to examine genes whose position changed dramatically between the weighted and non-weighted versions.

In every test, we checked the relative position of the top ten ranked genes in the weighted version and in the non-weighted one.
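This comparison can be sketched as follows. The function name and the `top` and `cutoff` parameters are illustrative assumptions rather than the project's code; the cutoff of 20 reflects one reading of the selection rule, namely that a gene ranked in the top ten of one version is reported when it falls outside the first 20 of the other.

```python
def large_rank_shifts(weighted_rank, unweighted_rank, top=10, cutoff=20):
    """Find genes in the top `top` of one ranking that fall outside
    the first `cutoff` positions of the other ranking.

    weighted_rank, unweighted_rank: dicts mapping gene name -> position (1 = best).
    Returns (gene, weighted position, non-weighted position) tuples,
    ordered by the better of the two positions.
    """
    shifted = []
    # Only genes present in both rankings can be compared.
    for gene in weighted_rank.keys() & unweighted_rank.keys():
        w, u = weighted_rank[gene], unweighted_rank[gene]
        if (w <= top and u > cutoff) or (u <= top and w > cutoff):
            shifted.append((gene, w, u))
    return sorted(shifted, key=lambda row: (min(row[1], row[2]), row[0]))


# Positions of three genes reported for the normal vs. diseased test.
weighted = {"14N.3.A10": 7, "14N.4.G1": 9, "8R.9.C5": 65}
unweighted = {"14N.3.A10": 158, "14N.4.G1": 22, "8R.9.C5": 3}
print(large_rank_shifts(weighted, unweighted))
```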

The following is a list of genes whose position changed significantly (they did not appear among the first 20 genes in the other version):

 

A blue mark indicates the genes that were ranked in the top ten.

 

Normal vs. diseased test

| Gene Number | Gene Name | Position in weighted version | Position in non-weighted version |
|---|---|---|---|
| 4447 | 14N.3.A10 | 7 | 158 |
| 746 | 14N.4.G1 | 9 | 22 |
| 1830 | 8R.9.C5 | 65 | 3 |
| 74 | 14N.1.B4 | 25 | 7 |
| 2163 | 7R.4.E5 | 111 | 9 |

 

 

 

 

Diseased population test. Comparing Diabetic vs. Non Diabetic

In this test we could observe the highest similarity between the two runs.

| Gene Number | Gene Name | Position in weighted version | Position in non-weighted version |
|---|---|---|---|
| 4641 | 7F.4.D7 | 8 | 32 |

 

 

 

 

Normal population. Comparing Diabetic vs. Non Diabetic

| Gene Number | Gene Name | Position in weighted version | Position in non-weighted version |
|---|---|---|---|
| 5313 | 7R.7.B11 | 2 | 20 |
| 7356 | 8R.4.B12 | 4 | 26 |
| 5137 | 14N.5.B8 | 5 | 85 |
| 7156 | 7R.9.F12 | 6 | 25 |
| 6921 | 9R.7.D9 | 9 | 24 |
| 521 | 9F.10.B1 | 23 | 4 |
| 540 | 9F.3.B4 | 52 | 5 |
| 1279 | 12R.1.C5 | 119 | 8 |
| 1302 | 11R.1.F2 | 49 | 9 |
| 1943 | 12R.3.A5 | 34 | 10 |