P value

What is p-value?

Definitions:

From mathworld:

"The probability that a variant would assume a value greater or equal to the observed value, strictly by chance".

In statistical words: Probability value (p-value) of a statistical hypothesis test is the probability of getting a value of the test statistic as extreme as or more extreme than that observed by chance alone, if the null hypothesis H0, is true.

The p-value is compared with the desired significance level of our test and, if it is smaller, the result is significant.

In our words and in our context:

The p-value is the probability of a gene scoring better than some fixed level s in randomly labeled data (with the same number of "pluses" and "minuses"). Genes with a low p-value are rare, and are likely to have biological significance.

How will we calculate the P-value?

When calculating non weighted TNoM and therefore also non weighted P-value – we will want to translate a vector of "pluses" and "minuses" to a path.

In every step of the algorithm, we will "use" one label. If the label is a plus we will move one step up, and if it is a minus we will move one step down. Every step has a certain probability – referring to the number of pluses and minuses we can "still use" or haven't used yet (no repetition).  The probability of a certain path is the multiplication of the probabilities of all the steps which the path is composed of.

The probability to reach a certain coordinate is the sum of the probabilities of all the paths leading to it.

For a certain vector, all the paths (p+n long) will end in the coordinate: (p+n, p-n) while p is the number of pluses and n is the number of minuses.

If we will sum ALL the possible paths probabilities we will reach at the final coordinate (p-n, p+n) to probability 1.

Now we would like to calculate P-value for a certain TNoM score – i.e. all the paths that pass a certain border in the graph.

It would be easier to restrict the paths to the ones that DO NOT reach a certain TNoM score. Since we will allow bidirectional TNoM (there is no significance if the correct position of the pluses is at the right side of the vector or at its left), we will set 2 borders at the paths graph. The "positive" border will be the line which value is subtracting the wanted TNoM from p and the negative border will be adding the wanted TNoM to -n.

Finally, calculating the wanted P-value is summing the probabilities of all paths that reach the "end point" and not passing or 'touching' the borders we defined and subtracting this probability from 1. (We calculated all the paths that DO NOT pass the thresholds and we want only the paths that DO pass it).

For example:

For the set – {+,+,+,-,-}

In addition, a closed form formula for calculating the P-value for the non-weighted TNoM can be found at [4].

When dealing with the weighted TNoM and P-value, things are getting little bit more complicated:

In our project we implemented two versions of the weighted algorithm.

First version – the "attached version":

In the first version we will assume every label is attached to the weight of that sample in this specific gene. We will allow no repetition of weights. (Every weight can be used only in one step) Again every vector is described as a path.

Now, in every point in the path we have to remember how many pluses/minuses we have used so far and with which weights in order to know what the next possible moves of the path are.

So, in fact, every point is defined by a list of all labels and their attached weights that its use led to that point.

Every step is determined as follows:

If the label is a plus with weight w then the corresponding step is a step up in the size w. In the same way a label minus with weight w will be considered as a step sized w down.

The probability of each step is calculated in the following way:

If the last step was a plus with weight w, then the previous probability is multiplied by the chance of receiving plus with such weight -

Then this plus, attached to its weight, is added to the "used labels" list of the current coordinate.

Similarly, if the last step was a minus with weight w, then the new probability is calculated as:

Previous probability X

and this minus with its weight is added to the list of used labels of the next coordinate.

In this version in every coordinate there is a large amount of information stored, but the full graph of all paths is sparse – many paths are not possible due to the fact we allow no repetitions of weights and labels. Therefore the probability of such a path is 0.

For a graphical example,.

Second version – the "unattached version":

In the second version we will refer to the unidirectional algorithm – i.e. assuming the correct classifying position of all '+' labels are at the left side of the vector and all the '-' labels in the right side.

In this method we will allow repetitions – every given weight can be chosen an un-limited number of times during calculation of the P-value. The probability for choosing one weight is calculated for each gene according to the weights given in the original vector.

Similarly to the un-weighted version every path is composed from connected coordinates. Now, each coordinate is defined by few parameters:

1.      Step index (first step has the index 1).

2.      K counters which imply how many more pluses from each weight haven't been "used" yet. i.e. ki will imply there are still ki pluses with weight wi to be used.

3.      "Number of mistakes" done so far. i.e. the sum of weights given to the minuses seen so far.

Every coordinate holds the probability of all paths ending in this certain point.

In different from the un-weighted version, the weighted version has few starting points. Starting points are all possibilities of dividing P pluses to k buckets of weights. (n+k-1 choose k such possibilities).

Each coordinate symbolise as if the TNoM threshold is located after the step index label.

Since we deal with the unidirectional case every '-' we have seen so far (is located to the threshold's left) is a mistake, and every '+' we haven't seen yet is also a mistake (located to threshold's right). Each coordinate has all the information to calculate the TNoM score for the threshold located in that coordinate. This score will aid us in deciding whether to continue with this path or not.

In the total graph we would like to sum all the paths that HAVEN'T reached a certain TNoM, so paths that has reached the wanted TNoM or better, will be eliminated.

Transitions from coordinate to coordinate:

If the last step was a minus:

1. Add 1 to step index.

2. K counters remain unchangeable.

3. Add the weight chosen for the last minus to "number of mistakes" so far.

4. Update P to be P + P'*Pminus*Pweight

P= the coordinate probability. Only once is initialized to 0.

P'=probability of coordinate from which path arrived

Pminus=probability to choose minus out of left labels =

Pweight= probability of the last weight chosen for the minus. This probability is calculated for each gene and constant for each weight during the whole path (enables weights repetition)

If the last step was a plus:

1. Add 1 to step index.

2. Update Ki to be Ki-1, while Wi is the weight chosen for the last plus.

3. "Number of mistakes" remains unchangeable.

4. Update P to be P + P'*Pplus*Pplus weight

P= the coordinate probability, only once initialized to 0.

P'=probability of coordinate from which path arrived

Pplus=probability to choose plus out of left labels =

Pplus weight= probability to choose plus with such weight. =

Finally we will sum the probabilities of all ending points to a common probability P.

The P value is 1-P = all the paths that HAVE reached the wanted TNoM or better.

For example:

For the very simple vector of 3 labels:  - + - with weights 0.25, 1, 0.5 respectively and description of the graph of paths click here.