Quality Analysis

To assess the quality of our software, we ran several tests on it.

We ran 3 types of tests:

1. The first type of test was run on the original data, without noise, using the original non-weighted algorithm.
2. The second type of test was run with “noised” data using the original non-weighted algorithm.
3. The third type of test was run with “noised” data and weights using our new weighted algorithm.

We compared the tests from the two latter types to the tests from the first type, which served as a baseline.

We wanted to check what percentage of the genes remain good classifiers after noise is added, that is, are not damaged by it.

How did we evaluate the quality of the results?

We wanted to examine the X best-scoring genes.

We defined a quality measure M to be the sum of positions of the first X genes.

In test type 1 we obtained a basic ordering of the best-scoring genes (those with low TNoM score and p-value). This is our baseline order.

Mb (M baseline = M1) is the sum of all numbers from 1 to X, that is, X(X+1)/2.

For test types 2 and 3 we set Mi (i = 2, 3) to be the sum of the new positions, in that test's results, of the X genes that were ranked best in test type 1.

For example, let X=2. Suppose that in test type 1 (non-weighted, no noise) the best-scoring gene was gene number 49 (original position 1) and the second best was gene number 7 (original position 2), while in test type 2 gene number 49 appeared only as the 3rd best-scoring gene and gene number 7 as the 10th. Then M1 = 1 + 2 = 3 and M2 = 3 + 10 = 13.
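The measure can be sketched in a few lines of Python. This is only an illustration of the definition above, not the code used in our tests: the function name `quality_measure` and all gene numbers other than 49 and 7 are made up for the example.

```python
def quality_measure(baseline_order, new_order, x):
    """Sum the 1-based positions that the top-x baseline genes occupy in new_order."""
    # Map each gene number to its 1-based rank in the new result list.
    new_position = {gene: rank for rank, gene in enumerate(new_order, start=1)}
    return sum(new_position[gene] for gene in baseline_order[:x])


# Genes 49 and 7 lead the baseline (test type 1). In the noised run they
# drop to positions 3 and 10. The other gene numbers are hypothetical.
baseline = [49, 7, 12, 3, 88, 21, 5, 60, 34, 90]
noised = [12, 3, 49, 88, 21, 5, 60, 34, 90, 7]

m1 = quality_measure(baseline, baseline, 2)  # 1 + 2
m2 = quality_measure(baseline, noised, 2)    # 3 + 10
print(m1, m2)
```

A value of Mi close to M1 means that the best baseline genes kept their high ranks despite the noise.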

After sorting the test data by TNoM score, we set X to 19, the number of genes whose TNoM score is lower than or equal to 5. The baseline is therefore M1 = 1 + 2 + ... + 19 = 190.

We expected the weighted version to be more stable under noise, so that a high percentage of the genes would remain good classifiers in that version's results; i.e., we expected M3 to be closer to M1 than M2 is.

How did we produce "noised" data?

Noise can originate from several sources:

1. The gene “quality” - how “strong” is the probe.
2. Human error – errors originating in the technician performing the tests.
3. Machine error – i.e. the machine measuring the expression levels is producing incorrect data.

We assumed that there is an inverse correlation between the noise and the weights, i.e. the higher the noise, the lower the weight, and vice versa.

We assumed that most errors are of type 1 – probe quality.

Producing Weights:

We set a parameter Q that indicates how "good" the genes are (good = not noised).

0≤Q≤10. Q=0 means that all genes are very noisy. Q=10 means that the noise in all genes is 0.

After determining Q, we chose the weights randomly. The probability of choosing each weight depends on Q: the higher Q is, the lower the probability of choosing low weights. The exact probability table is:

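| Q  | P(0) | P(0.25) | P(0.5) | P(0.75) | P(1) |
|----|------|---------|--------|---------|------|
| 0  | 80%  | 10%     | 10%    | 0%      | 0%   |
| 1  | 35%  | 30%     | 30%    | 5%      | 0%   |
| 2  | 15%  | 25%     | 35%    | 20%     | 5%   |
| 3  | 15%  | 20%     | 35%    | 20%     | 10%  |
| 4  | 15%  | 20%     | 30%    | 20%     | 15%  |
| 5  | 10%  | 15%     | 25%    | 25%     | 25%  |
| 6  | 10%  | 10%     | 20%    | 20%     | 40%  |
| 7  | 5%   | 10%     | 10%    | 20%     | 55%  |
| 8  | 0%   | 5%      | 10%    | 15%     | 70%  |
| 9  | 0%   | 0%      | 5%     | 10%     | 85%  |
| 10 | 0%   | 0%      | 0%     | 0%      | 100% |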

Table 1: Weight probabilities according to gene quality (Q).
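Drawing weights according to Table 1 could look roughly like the following Python sketch. The names `WEIGHT_PROBS` and `sample_weights`, and the assumption that the weights matrix has one row per gene and one column per sample, are ours for illustration only; the text above does not specify the matrix layout.

```python
import random

WEIGHTS = [0.0, 0.25, 0.5, 0.75, 1.0]

# Rows of Table 1: probability (in percent) of each weight, indexed by Q.
WEIGHT_PROBS = {
    0: [80, 10, 10, 0, 0],
    1: [35, 30, 30, 5, 0],
    2: [15, 25, 35, 20, 5],
    3: [15, 20, 35, 20, 10],
    4: [15, 20, 30, 20, 15],
    5: [10, 15, 25, 25, 25],
    6: [10, 10, 20, 20, 40],
    7: [5, 10, 10, 20, 55],
    8: [0, 5, 10, 15, 70],
    9: [0, 0, 5, 10, 85],
    10: [0, 0, 0, 0, 100],
}


def sample_weights(q, n_genes, n_samples, rng=random):
    """Draw an n_genes x n_samples matrix of weights for gene quality q."""
    probs = WEIGHT_PROBS[q]
    # random.choices draws with replacement using the given relative weights.
    return [rng.choices(WEIGHTS, weights=probs, k=n_samples) for _ in range(n_genes)]


weights_matrix = sample_weights(q=7, n_genes=4, n_samples=6)
for row in weights_matrix:
    print(row)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.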

# Producing Noise

After producing the weights matrix we produced the noise matrix. We used two methods of producing noise.

A naïve method of creating noise is to define a fixed noise value for each discrete weight. Looking at the gene expression data, we noticed that the majority of gene expression levels lie in the range (–0.3, 0.3). For that range, we produced the following table, which assigns each weight a noise value such that the SNR (signal-to-noise ratio) is lower than 1.

| Weight | Noise |
|--------|-------|
| 1      | 0     |
| 0.75   | 0.1   |
| 0.5    | 0.2   |
| 0.25   | 0.3   |
| 0      | 1     |

Table 2: Discrete noise as function of discrete weights.

After creating the noise matrix, we produced the “noised” data as follows: we randomly chose between adding the noise to the raw data and subtracting it from the raw data. This way, the noise matrix cannot be recovered from the noised data alone.
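A minimal Python sketch of the naïve method follows. It assumes the sign is chosen independently for every entry of the matrix, which the description above does not state explicitly; the names `NOISE_FOR_WEIGHT` and `add_discrete_noise` and the sample values are illustrative only.

```python
import random

# Table 2: fixed noise magnitude for each discrete weight.
NOISE_FOR_WEIGHT = {1.0: 0.0, 0.75: 0.1, 0.5: 0.2, 0.25: 0.3, 0.0: 1.0}


def add_discrete_noise(data, weights, rng=random):
    """Add or subtract (chosen at random) the noise that Table 2 assigns to each weight."""
    noised = []
    for data_row, weight_row in zip(data, weights):
        noised_row = []
        for value, weight in zip(data_row, weight_row):
            sign = rng.choice((-1, 1))  # assumption: one random sign per entry
            noised_row.append(value + sign * NOISE_FOR_WEIGHT[weight])
        noised.append(noised_row)
    return noised


raw = [[0.1, -0.2, 0.05]]        # made-up expression levels
weights = [[1.0, 0.5, 0.0]]
print(add_discrete_noise(raw, weights))
```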

The second method is a statistical one. We assumed that the noise is normally distributed around 0, with a different standard deviation for each weight.

For each weight, the noise is distributed ~N(0, (1-weight)), where (1-weight) is the standard deviation. The rationale for choosing this standard deviation is to allow more randomness in the noise (a larger standard deviation) when the weight is low and, conversely, to keep the noise closer to 0 (less randomness) when the weight is high. In this method, in contrast to the previous one, the noise can still be 0 even for weight 0. This makes the method closer to reality, where a weight of 0 means that we have no confidence in the sample and therefore no information about the amount of noise.

In addition, this method is statistical and randomized, so even for two identical weights the corresponding noise can be different.

After creating the noise matrix, the noised data is simply created by adding the noise to the original data. Since noise is normally distributed around 0, it has an equal probability of being positive or negative and therefore there is no need to further choose the sign of the noise.
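The statistical method can be sketched in Python as follows, reading (1-weight) as the standard deviation as described above. The function name `add_gaussian_noise` and the sample values are illustrative only.

```python
import random


def add_gaussian_noise(data, weights, rng=random):
    """Add noise drawn from a normal distribution with mean 0 and std (1 - weight)."""
    return [
        [value + rng.gauss(0.0, 1.0 - weight)
         for value, weight in zip(data_row, weight_row)]
        for data_row, weight_row in zip(data, weights)
    ]


raw = [[0.1, -0.2, 0.05]]        # made-up expression levels
weights = [[1.0, 0.5, 0.0]]
# The first entry (weight 1) is left unchanged, since its std is 0.
print(add_gaussian_noise(raw, weights))
```

Because each entry is drawn independently, two entries with the same weight generally receive different noise values.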

To see the correlation between weights and noise in this method, click here.