Efficient motif search in ranked lists and applications to variable gap motifs
Limor Leibovich and Zohar Yakhini
DRIMUST is a software implementation of a new statistical and algorithmic approach that we developed to enable efficient motif searches. It covers a broader range of motif spaces than state-of-the-art motif searching tools. In particular, DRIMUST can efficiently search for variable gapped motifs, long motifs and motifs over large alphabets. The motif structure that DRIMUST searches for comprises two half sites separated by a gap that can be of variable length.
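To make the motif structure concrete, the following minimal Python sketch checks whether a sequence contains a variable gap motif, that is, the first half, then a spacer of one of the allowed lengths, then the second half. The function name and its arguments are hypothetical and chosen only for illustration; this is not DRIMUST's search algorithm, only a naive scan that shows what an occurrence looks like.

```python
def contains_gapped_motif(sequence, first_half, second_half, spacer_lengths):
    """Return True if first_half, a spacer of an allowed length, and
    second_half occur consecutively somewhere in sequence.

    Illustrative helper only (hypothetical name, not part of DRIMUST).
    """
    for start in range(len(sequence) - len(first_half) + 1):
        # Look for an exact occurrence of the first half at this position.
        if not sequence.startswith(first_half, start):
            continue
        after_first = start + len(first_half)
        # Try every allowed spacer length for the gap.
        for gap in spacer_lengths:
            if sequence.startswith(second_half, after_first + gap):
                return True
    return False


# Example: both halves are UAAU and the gap may be 0, 2 or 3 characters long.
print(contains_gapped_motif("GGUAAUCCUAAUGG", "UAAU", "UAAU", [0, 2, 3]))
```

In this example the two halves are separated by the two characters CC, so the call succeeds only because 2 is among the allowed spacer lengths.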
4. In the command prompt enter:
java -jar -Xmx2048m DRIMUST.jar <first half length> <second half length> <upper limit on the spacer length> <P-value stringency> <path to the input file> <path to the output file>
There are six input parameters, in the following order:
1. First half length
2. Second half length
3. Upper limit on the spacer length (it is not recommended to use values greater than 20)
4. P-value stringency
5. Path to the input file, containing the ranked list (an example of an input file can be found here)
6. Path to the output file
Running example: java -jar -Xmx2048m DRIMUST.jar 4 4 7 0.0001 input.txt output.txt
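When DRIMUST is run from a script, the six positional parameters can be assembled programmatically. The sketch below is one possible way to do this in Python; the helper names `drimust_command` and `run_drimust` are hypothetical, and the jar location and heap size simply mirror the running example above.

```python
import subprocess


def drimust_command(first_len, second_len, max_spacer, p_value,
                    input_path, output_path,
                    jar="DRIMUST.jar", heap="2048m"):
    """Build the argument list for a DRIMUST run.

    Hypothetical helper: it only reproduces the documented order of the
    six positional parameters. Pass p_value as a string (e.g. "0.0001")
    to control exactly how it appears on the command line.
    """
    return [
        "java", "-jar", f"-Xmx{heap}", jar,
        str(first_len), str(second_len), str(max_spacer),
        str(p_value), str(input_path), str(output_path),
    ]


def run_drimust(*args, **kwargs):
    """Run DRIMUST and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(drimust_command(*args, **kwargs), check=True)


# Same call as the running example above.
print(" ".join(drimust_command(4, 4, 7, "0.0001", "input.txt", "output.txt")))
```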
The output is written to the file given by the sixth parameter (an example of an output file can be found here). This file is tab-delimited and contains the motifs that are enriched at the stringency level given by the fourth input parameter. Each such motif is represented by one row in the output file, with nine columns in the following order:
1. Motif first half
2. Motif second half
3. The optimal combination of spacer lengths for the gap
4. Minimum hypergeometric (mHG) score
5. Corrected score
* It holds that mHG score ≤ p-value ≤ corrected score [Eden et al.]. These scores are used as tight bounds for the p-value, to save calculations.
6. N - the total number of sequences in the ranked list
7. B - the number of sequences in the list containing the motif
8. n - the threshold that defines the top of the list as calculated by the mHG statistics
9. b - the number of sequences among the top n sequences that contain the motif
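The nine columns can be read back with a few lines of Python. The sketch below assumes that the output file has no header row and that every motif row has exactly nine tab-separated fields; the field names in `COLUMNS` are hypothetical labels for the columns listed above, and the spacer column is kept as a string because its exact format is not specified here. The sketch also computes the hypergeometric tail for a given (N, B, n, b), which by the definition of the mHG statistic [Eden et al.] is the quantity minimized over n; it is not DRIMUST's own implementation.

```python
import csv
from math import comb

# Hypothetical labels for the nine documented columns, in order.
COLUMNS = ["first_half", "second_half", "spacers", "mhg_score",
           "corrected_score", "N", "B", "n", "b"]


def parse_row(fields):
    """Convert one row (a list of nine strings) into a typed dict."""
    record = dict(zip(COLUMNS, fields))
    for key in ("mhg_score", "corrected_score"):
        record[key] = float(record[key])
    for key in ("N", "B", "n", "b"):
        record[key] = int(record[key])
    return record


def read_drimust_output(path):
    """Read a tab-delimited output file, skipping rows without nine fields."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        return [parse_row(row) for row in reader if len(row) == len(COLUMNS)]


def hypergeometric_tail(N, B, n, b):
    """P(X >= b) when drawing n of N sequences, B of which contain the motif."""
    upper = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i) for i in range(b, upper + 1))
    return favourable / comb(N, n)


# Toy numbers: all 5 motif-containing sequences fall in the top 5 of 10.
print(hypergeometric_tail(10, 5, 5, 5))
```

For a row read from a real output file, `hypergeometric_tail(row["N"], row["B"], row["n"], row["b"])` can be compared with the reported mHG score as a sanity check, keeping in mind that the reported value may be rounded.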
Motif refinement as PWM:
Motifs reported by DRIMUST are exact words. As a refinement, we implemented software that returns PWMs for the two halves of a given variable gap motif. This extension is separate from DRIMUST and can be downloaded from here.
The extension to DRIMUST takes seven parameters as input:
1. Path to the input file, containing the ranked list of sequences (an example of an input file can be found here)
2. Motif first half
3. Motif second half
4. The combination of spacer lengths for the gap (separated with commas; no spaces are allowed)
5. n - the threshold that defines the top of the list as calculated by the mHG statistics (outputted by DRIMUST for each motif)
6. d - the maximum total number of mismatches allowed in the motif halves
7. Path to the output file
It returns a PWM for each half of the motif by considering all occurrences of variants of the motif in the top n sequences in the list, allowing up to d mismatches in total across the two motif halves (d is the sixth parameter). The output is written to the file given by the seventh parameter (an example of an output file can be found here). This file is tab-delimited and contains a PWM for each of the halves. In each PWM, the columns correspond to the positions within the half site, and each column lists the probability of each character at that position.
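The idea behind the refinement can be illustrated with a short Python sketch. It assumes that the variant occurrences of one half have already been collected as equal-length strings; the function names are hypothetical, and this is not the code of the extension itself, only the column-wise frequency calculation that a PWM represents, together with the total-mismatch test described above.

```python
from collections import Counter


def total_mismatches(first_half, second_half, cand_first, cand_second):
    """Count mismatching positions over both halves of a candidate variant."""
    pairs = list(zip(first_half, cand_first)) + list(zip(second_half, cand_second))
    return sum(1 for expected, observed in pairs if expected != observed)


def build_pwm(occurrences, alphabet=None):
    """Return one dict per position mapping each character to its frequency.

    occurrences: equal-length strings (variants of one motif half).
    alphabet: characters to report; derived from the data if omitted.
    """
    if not occurrences:
        raise ValueError("at least one occurrence is required")
    length = len(occurrences[0])
    if any(len(word) != length for word in occurrences):
        raise ValueError("all occurrences must have the same length")
    if alphabet is None:
        alphabet = sorted(set("".join(occurrences)))
    pwm = []
    for position in range(length):
        counts = Counter(word[position] for word in occurrences)
        pwm.append({ch: counts[ch] / len(occurrences) for ch in alphabet})
    return pwm


# Four hypothetical occurrences of the half UAAU, one with a mismatch.
for column in build_pwm(["UAAU", "UAAU", "UCAU", "UAAU"], alphabet="ACGU"):
    print(column)
```

A candidate pair of halves would be kept when `total_mismatches(...) <= d`, and the kept halves would then be passed to `build_pwm` separately for the first and the second half.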
Running example: java -jar refine.jar input.txt UAAU UAAU 0,2,3 397 2 pwm.txt
When using this software please cite: Limor Leibovich, Zohar Yakhini: Efficient motif search in ranked lists and applications to variable gap motifs. Nucleic Acids Research, 2012 (Full Text)
For questions and support please contact Limor Leibovich (llimor AT cs.technion.ac.il).