Mutual enrichment in ranked lists and the statistical assessment of position weight matrix motifs
Limor Leibovich and Zohar Yakhini
mmHG-Finder is a software implementation of an algorithmic approach that enables efficient motif searches in biological sequences. mmHG-Finder takes as input a ranked list of DNA or RNA sequences and returns significant motifs in PWM format. In cases where sequence ranking is not relevant or not available, it allows the use of positive and negative sets of sequences, searching for enriched motifs in the positive set using the negative set as the background.
Usage with a ranked list:
java -jar -Xmx2048m mmHG-Finder.jar <minimal length> <maximal length> <strand search mode> <speed mode> <path to the input file> <path to the output file>
Usage with target and background sets:
java -jar -Xmx2048m mmHG-Finder.jar <minimal length> <maximal length> <strand search mode> <speed mode> <path to the target set file> <path to the background set file> <path to the output file>
There are six or seven input parameters (depending on the input type), in the following order:
1. Minimal motif length (as a range of lengths is supported)
2. Maximal motif length (this parameter strongly affects computation time)
3. Strand search mode - can be 'ss' or 'ds' (representing single-strand search or double-strand search, respectively)
4. Speed mode - either 'fast' or 'standard'. The standard mode is recommended, but this parameter strongly affects running time, so you can try the fast mode first and then check whether the standard mode improves the findings.
5. Path to the input file, containing the ranked list (an example of an input file can be found here). When target and background sets are used, two parameters are expected here: first the path to the target set file, then the path to the background set file (both files should have the same format as the input example above).
6. Path to the output file
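The parameter order above can be sketched as a small Python helper that assembles the command line and rejects values the list does not allow. This is only an illustration: the function name build_command, its defaults, and the file names in the usage note are hypothetical and not part of mmHG-Finder. It assumes the jar sits in the working directory.

```python
def build_command(min_len, max_len, strand, speed, inputs, output,
                  jar="mmHG-Finder.jar", heap="2048m"):
    """Assemble the mmHG-Finder command line as a list of arguments.

    inputs holds one path (ranked list) or two paths
    (target set first, background set second).
    """
    if not (isinstance(min_len, int) and isinstance(max_len, int)):
        raise ValueError("motif lengths must be integers")
    if not 1 <= min_len <= max_len:
        raise ValueError("expected 1 <= minimal length <= maximal length")
    if strand not in ("ss", "ds"):
        raise ValueError("strand search mode must be 'ss' or 'ds'")
    if speed not in ("fast", "standard"):
        raise ValueError("speed mode must be 'fast' or 'standard'")
    if len(inputs) not in (1, 2):
        raise ValueError("give one ranked list, or a target and a background file")
    # Same argument order as the usage lines: lengths, strand mode,
    # speed mode, input path(s), output path.
    return ["java", "-jar", f"-Xmx{heap}", jar,
            str(min_len), str(max_len), strand, speed, *inputs, output]
```

For example, build_command(6, 8, "ds", "fast", ["ranked.txt"], "motifs.txt") yields the six-parameter form, and passing two input paths yields the seven-parameter form. The resulting list can be handed to subprocess.run(cmd, check=True) to launch the search.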
The output is written to the file given by the last parameter (an example of an output file can be found here). This file contains motifs that are enriched at the top of the list (or in the target set compared to the background set). Each such motif is represented by a section in the output file, and each section is divided into three parts. First, the statistical significance is reported, including the p-value: N is the total number of sequences; n is the threshold that defines the top of the list, as calculated by our statistics; B is the total number of motif occurrences; b is the number of motif occurrences in the top n sequences; score is the hypergeometric tail; and p-value is the Bonferroni bound for the p-value. Second, the position weight matrix (PWM) is given in tab-delimited form. Third, the alignment of the motif occurrences from which the PWM was calculated is shown (to obtain a sequence logo of the motif, give this list as input to WebLogo). Motifs are separated by '----'.
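To make the reported score easier to interpret, here is a minimal Python sketch of a standard hypergeometric tail computed from N, B, n and b. It assumes that B and b count sequences that contain the motif (in the whole list and among the top n, respectively). It is not the tool's own code and does not reproduce the search over thresholds or the Bonferroni bound. The function name hypergeometric_tail is hypothetical.

```python
from math import comb

def hypergeometric_tail(N, B, n, b):
    """Probability of seeing at least b motif-containing sequences among
    the top n, when B of the N sequences contain the motif and the
    ranking is random (standard hypergeometric tail)."""
    if not (0 <= B <= N and 0 <= n <= N and 0 <= b <= min(n, B)):
        raise ValueError("expected 0 <= B, n <= N and 0 <= b <= min(n, B)")
    total = comb(N, n)  # number of ways to choose the top n sequences
    # Sum over every outcome with i >= b motif-containing sequences in the top n.
    favorable = sum(comb(B, i) * comb(N - B, n - i)
                    for i in range(b, min(n, B) + 1))
    return favorable / total
```

With N = 4, B = 2, n = 2 and b = 2, only one of the six possible top-2 selections contains both motif-bearing sequences, so the tail is 1/6. Smaller values indicate stronger enrichment at the top of the list.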
When using this software please cite:
Limor Leibovich and Zohar Yakhini: Mutual Enrichment in Ranked Lists and the Statistical Assessment of Position Weight Matrix Motifs. Algorithms in Bioinformatics, Lecture Notes in Computer Science, 2013. Edited by Darling A, Stoye J: Springer Berlin Heidelberg; 273-286 (Full Text)
For questions and support please contact Limor Leibovich (llimor AT cs.technion.ac.il).