GeneSelector

version 1.0

This software is based on the M.Sc. thesis I wrote, A New Branch and Bound Feature Selection Algorithm, which was done under the supervision of Prof. Dan Geiger and Dr. Zohar Yakhini.
The files can be downloaded from these links: thesis.ps (1.31M)  or thesis.zip (351K).

Background

With the aid of microarray technology, the expression levels of thousands of genes in cells from a tissue sample can be measured simultaneously. These gene expression measurements can be used for for several purposes, one of them is tissue classification. However, most of the genes that are measured are not relevant to the classification task at hand. Using them introduces noise in to the classification task. Our objective is to find a small subset of genes which are relevant to the classification task, and which minimize the samples' classification error when using a naive Bayesian classifier.

We assume there are two classes of points in Rn. Each class is distributed according to a multivariate independent Normal distribution. We implement a Branch and Bound feature selection algorithm to make a heuristic selection of an optimal  subset of k dimensions to be used in the classification.

Input File Format

Assuming there are  g  genes and  s  samples, the input file should be formatted as follows:

Files

Command line flags

The following flags can be used:

-h Help message
-file <input file> The name of the input file (if no path is given, it looks for it in the current directory).
-g <int> The total number of genes (features) in the input file.
-s <int> The number of samples in the input file.
-n <int> The number of features from which the optimal subset is selected (if this is less than the total number, the n features with the highest Bhattacharyya distance are taken).
-k <int> The desired subset size.
-eps <e> The epsilon accuracy parameter (e>0, default e=0.1)
-del <d> The delta accuracy parameter (d>0, default d=0.1)
-test <range> Splits the dataset. Uses the indices in range as a testing set. The range should be in the following format: 1,3,5-7,38-25 or 39-72 etc. (no white spaces allowed). After a subset is chosen according to the training set, the classification is tested on the test set.
-rnd_test <int> Randomly splits the input data into a training set and a test set of int randomly selected points. After a subset is chosen according to the training set, the classification is tested on the test set.
-syn Use this flag if synthetic data is used (the best subset will not be chosen according to a LOOCV test).
-leukemia
-breast
-prostate
These flags set the parameters for the datasets (-file, -g, -s)

Examples:

genesel -leukemia -n 50 -k 3  (selects a subset of 3 from the top 50 genes in the leukemia data set). This is the same as the following command:
genesel -file leukemia.dat -g 7129 -s 72 -n 50 -k 3 -eps 0.1 -del 0.1

The test used in the leukemia data are samples 39-72 and in the breast cancer data 48-58.

 

The following commands are the same ones used to produce the results in our experiments:

genesel -breast -n 100 -k 3 -test 48-58
produces the following output.

genesel -leukemia -n 100 -k 3 -test 39-72
produces the following output.

Note that the due to the random processes involved the output may very from run to run (especially the pruning rate).


 

For comments and bug reports contact Ari Frank.