Statistical models

# Equation a:

Calculate the expected number of palindromes to be found in a string of length n.

The equation includes the following parameters:

n – length of string.

l – length of one side (arm) of the palindrome.

G – maximum length of gap.

mis – maximum number of mismatches allowed.

p – number of palindromes in a string of length n.

Explanation:

First we choose x places to locate the mismatches in the second part of the palindrome (which contains l nucleotides); there are $\binom{l}{x}$ ways to do this.

Since there are four nucleotides, each of the remaining l-x positions matches with probability ¼, so we obtain $1/4^{l-x}$.

Each mismatch has probability ¾, so the x mismatches contribute $(3/4)^x$.

We sum over x from 0 up to the maximum of “mis” mismatches, then multiply the sum by the (G+1) possible gap sizes and by the (n-2*l+1) possible places to start the palindrome.

The subtraction of ((1+G)*G/2) accounts for palindromes near the end of the string, which cannot take every possible gap size.
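The explanation above can be sketched in Python. This is a minimal sketch, assuming the terms combine as the text describes: a sum over x of (ways to place the mismatches) × (match probability) × (mismatch probability), multiplied by the number of placements. The function name `expected_palindromes_uniform` is a hypothetical name chosen for illustration.

```python
from math import comb


def expected_palindromes_uniform(n, l, G, mis):
    """Expected number of palindromes under equal base frequencies.

    n: string length, l: length of one side of the palindrome,
    G: maximum gap length, mis: maximum number of mismatches allowed.
    """
    # Probability that the second side pairs with the first
    # with at most `mis` mismatches (match = 1/4, mismatch = 3/4).
    side_prob = sum(
        comb(l, x) * (1 / 4) ** (l - x) * (3 / 4) ** x
        for x in range(mis + 1)
    )
    # (G+1) gap sizes times (n-2l+1) start positions, minus the
    # placements near the end that cannot take every gap size.
    placements = (G + 1) * (n - 2 * l + 1) - (1 + G) * G // 2
    return placements * side_prob


if __name__ == "__main__":
    print(expected_palindromes_uniform(n=10, l=2, G=0, mis=0))
```

With n=10, l=2, G=0 and mis=0 there are 7 start positions and the two sides match with probability 1/16, so the sketch returns 7/16.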

# Equation b:

Calculate the same expectation as in equation a, but now taking into account the background distribution of the specific sequence. This means that the probability of a match between two bases is not ¼, as assumed in equation a, but depends on the probability of each base appearing in the sequence.

For example, if we have a sequence of 4000 bases and base A appears in it 500 times, we can assume p(A) = 500/4000 = 1/8.

We compute this probability for each of the DNA bases, so that the probability of a match between two bases is now:

$R = p(A)^2 + p(C)^2 + p(G)^2 + p(T)^2$    (and not ¼ as in equation a)

(The probability of a mismatch is now (1-R), and not ¾ as in equation a.)

The new expectation (the number of palindromes to be found in a string of length n, taking the background distribution into account) is given by:
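As a minimal sketch in Python, assuming the expectation keeps the form described for equation a with ¼ replaced by R and ¾ replaced by (1-R), the calculation could look as follows. The function names `match_probability` and `expected_palindromes_background` are hypothetical, and the sequence is assumed to contain only the letters A, C, G and T.

```python
from collections import Counter
from math import comb


def match_probability(sequence):
    """R = p(A)^2 + p(C)^2 + p(G)^2 + p(T)^2, estimated from the sequence."""
    counts = Counter(sequence)
    total = len(sequence)  # assumes every character is A, C, G or T
    return sum((counts[base] / total) ** 2 for base in "ACGT")


def expected_palindromes_background(sequence, l, G, mis):
    """Expected number of palindromes using the sequence's base frequencies."""
    n = len(sequence)
    R = match_probability(sequence)
    # Same sum as in equation a, with match = R and mismatch = 1 - R.
    side_prob = sum(
        comb(l, x) * R ** (l - x) * (1 - R) ** x
        for x in range(mis + 1)
    )
    # Same placement count as in equation a.
    placements = (G + 1) * (n - 2 * l + 1) - (1 + G) * G // 2
    return placements * side_prob


if __name__ == "__main__":
    # 500 A's in 4000 bases gives p(A) = 1/8, as in the example above.
    seq = "A" * 500 + "C" * 3500
    print(match_probability(seq))
```

When all four bases are equally frequent, R equals ¼ and the result coincides with the uniform case of equation a.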

# Equation c:

Calculate the probability of finding a specific palindrome of length l, k times in a string of length n.

· These methods help us to evaluate the significance of the palindromes we have found.