The Algorithm

Palindrome Application, implemented in C.

The program goes through the DNA sequence, searching for all palindromes in a specific length, a range of allowed mismatches and a range of allowed gaps.

The strategy of this algorithm is to read the DNA sequence as a stream.

Parameters used by the program:

1)     Input sequence, genome of different organisms, in unspecified length (The file must be in a FASTA format).

2)     Length of palindrome (one side).

3)     Maximum gap between repeated regions.

4)     Number of mismatches allowed.

Searching for palindromes done by checking all the possible places in the sequence (in order to be correct and not to miss even one palindrome).

For each place in the sequence we will check all possible sizes of gap, checking whether the pal’ has allowed number of mismatches.

The “heart” of the algorithm compare the first letter to the last letter, the second letter to the letter second form the end, etc. (A matches T and C matches G).

If we found a palindrome that correlated with our demands, we will print it to the output in the following format:

• The pal’ above starts at the 122’Th base of the sequence and ended on the 140’Th base.

Its length 8 bases (each side), containing 2 mismatches and a gap of 3 bases.

The program counts and prints the total number of palindromes founded.

It’s also count for each base the number of times it appear in the sequence, this is needed for calculating the expectation through equation b (consulting the background distribution).

The program prints the running start and end times, so we could calculate its running times.

Algorithm Testing

In order to check if the algorithm we wrote is accurate, and works according to our demands, we “plant” an approximate palindrome in different genomes and compared the results with our expectations.We also compared our formulas expectation, with the average result of a large bunch of random sequences

Running Times

We run our program on Pentium-3 with 128MB RAM.

Using the following parameters:

•  Length of Palindromes (one side): 14
•  Maximum gap allowed: 100
•  Number of mismatches allowed: 2
 Organism Sequence length (Bases) Time (h:min:sec) E.Coli 4,639,221 (0:16:06) Human genome ~ 3x109 ~ (173:31:14)

Practical usage of the Algorithm

Compare the palindrome profile of different organisms and evaluate the results:

a)    Genome from different bacteria.

b)    Same gene, for example: hemoglobin, insulin in different mammals.

c)    Gene families, for example: Histones, Immunoglobins.