P
value

What is p-value?

Definitions:

**From mathworld:**

"The probability that a variant would
assume a value greater or equal to the observed value, strictly by
chance".

**In statistical words**: Probability value (p-value) of a statistical hypothesis test
is the probability of getting a value of the test statistic as extreme as or
more extreme than that observed by chance alone, if the null hypothesis H_{0},
is true.

The p-value is compared with the desired
significance level of our test and, if it is smaller, the result is
significant.

**In our
words and in our context: **

The p-value is the probability of a gene scoring better than some fixed level s in randomly labeled data (with the same number of "pluses" and "minuses"). Genes with a low p-value are rare, and are likely to have biological significance.

**How will we
calculate the P-value?**

When calculating non
weighted TNoM and therefore also non weighted P-value – we will want to
translate a vector of "pluses" and "minuses" to a path.

In every step of the
algorithm, we will "use" one label. If the label is a plus we will
move one step up, and if it is a minus we will move one step down. Every step
has a certain probability – referring to the number of pluses and minuses we
can "still use" or haven't used yet (no repetition). The probability of a certain path is
the multiplication of the probabilities of all the steps which the path is
composed of.

The probability to
reach a certain coordinate is the sum of the probabilities of all the paths
leading to it.

For a certain vector,
all the paths (p+n long) will end in the coordinate: (p+n, p-n) while p is the
number of pluses and n is the number of minuses.

If we will sum ALL the
possible paths probabilities we will reach at the final coordinate (p-n, p+n)
to probability 1.

Now we would like to
calculate P-value for a certain TNoM score – i.e. all the paths that pass a
certain border in the graph.

It would be easier to
restrict the paths to the ones that DO NOT reach a certain TNoM score. Since we
will allow bidirectional TNoM (there is no significance if the correct position
of the pluses is at the right side of the vector or at its left), we will set 2
borders at the paths graph. The "positive" border will be the line
which value is subtracting the wanted TNoM from p and the negative border will
be adding the wanted TNoM to -n.

Finally, calculating
the wanted P-value is summing the probabilities of all paths that reach the
"end point" and not passing or 'touching' the borders we defined and
subtracting this probability from 1. (We calculated all the paths that DO NOT
pass the thresholds and we want only the paths that DO pass it).

For example:

For the set –
{+,+,+,-,-}

In addition, a closed
form formula for calculating the P-value for the non-weighted TNoM can be found
at [__ 4__].

When dealing with the ** weighted**
TNoM and P-value, things are getting little bit more complicated:

In our project we
implemented two versions of the weighted algorithm.

**First version – the
"attached version":**

In the first version we will assume every label is attached to the
weight of that sample in this specific gene. We will allow no repetition of
weights. (Every weight can be used only in one step) Again every vector is
described as a path.

Now, in every point in the path we have to remember how many
pluses/minuses we have used so far and with which weights in order to know what
the next possible moves of the path are.

So, in fact, every point is defined by a list of all labels and their
attached weights that its use led to that point.

Every step is determined as follows:

If the label is a __plus__ with weight w then the corresponding step
is a step __up__ in the size w. In the same way a label __minus__ with
weight w will be considered as a step sized w __down__.

The probability of each step is calculated in the following way:

If the last step was a plus with weight w, then the previous probability
is multiplied by the chance of receiving plus with such weight - _{}

Then this plus, attached to its weight, is added to the "used
labels" list of the current coordinate.

Similarly, if the last step was a minus with weight w, then the new
probability is calculated as:

Previous probability X_{}

and this minus with its weight is added to the list of used labels of
the next coordinate.

In this version in every coordinate there is a large amount of
information stored, but the full graph of all paths is sparse – many paths are
not possible due to the fact we allow no repetitions of weights and labels.
Therefore the probability of such a path is 0.

For a graphical example,** click here**.

**Second version – the
"unattached version":**

In the second version we
will refer to the unidirectional algorithm – i.e. assuming the correct
classifying position of all '+' labels are at the left side of the vector and
all the '-' labels in the right side.

In this method we will
allow repetitions – every given weight can be chosen an un-limited number of
times during calculation of the P-value. The probability for choosing one
weight is calculated for each gene according to the weights given in the
original vector.

Similarly to the
un-weighted version every path is composed from connected coordinates. Now,
each coordinate is defined by few parameters:

1.
Step
index (first step has the index 1).

2.
K
counters which imply how many more pluses from each weight haven't been
"used" yet. i.e. k_{i }will imply there are still k_{i }pluses
with weight w_{i} to be used.

3.
"Number
of mistakes" done so far. i.e. the sum of weights given to the minuses
seen so far.

Every coordinate holds
the probability of all paths ending in this certain point.

In different from the
un-weighted version, the weighted version has few starting points. Starting
points are all possibilities of dividing P pluses to k buckets of weights.
(n+k-1 choose k such possibilities).

Each coordinate
symbolise as if the TNoM threshold is located after the step index label.

Since we deal with the
unidirectional case every '-' we have seen so far (is located to the
threshold's left) is a mistake, and every '+' we haven't seen yet is also a
mistake (located to threshold's right). Each coordinate has all the information
to calculate the TNoM score for the threshold located in that coordinate. This
score will aid us in deciding whether to continue with this path or not.

In the total graph we
would like to sum all the paths that HAVEN'T reached a certain TNoM, so paths
that has reached the wanted TNoM or better, will be eliminated.

Transitions from
coordinate to coordinate:

If the last step was a
minus:

1. Add 1 to step index.

2. K counters remain
unchangeable.

3. Add the weight
chosen for the last minus to "number of mistakes" so far.

4. Update P to be P +
P'*P_{minus}*P_{weight}

P= the coordinate
probability. Only once is initialized to 0.

P'=probability of
coordinate from which path arrived

P_{minus}=probability
to choose minus out of left labels = _{}

P_{weight}= probability
of the last weight chosen for the minus. This probability is calculated for
each gene and constant for each weight during the whole path (enables weights
repetition)

If the last step was a
plus:

1. Add 1 to step index.

2. Update K_{i }to
be K_{i-1}, while Wi is the weight chosen for the last plus.

3. "Number of
mistakes" remains unchangeable.

4. Update P to be P +
P'*P_{plus}*P_{plus weight}

P= the coordinate
probability, only once initialized to 0.

P'=probability of
coordinate from which path arrived

P_{plus}=probability
to choose plus out of left labels = _{}

P_{plus weight}=
probability to choose plus with such weight. = _{}

Finally we will sum the
probabilities of all ending points to a common probability P.

The P value is 1-P =
all the paths that HAVE reached the wanted TNoM or better.

For example:

For the very simple
vector of 3 labels: - + - with
weights 0.25, 1, 0.5 respectively and description of the graph of paths **click here**.