Optimized SuperLink: External Code Documentation

 

Table of Contents

    Program parameters

    The Optimization Policy

      Definition of Elimination sequence Complexity

      Optimization Modes

      The Optimization Algorithm

    Data Structures

    Program Configuration

      Turn Optimization ON/OFF

      Specify probability functions to apply

      Specify Optimization Mode

      Specify Optimization Policy

      Turn Debug messages ON/OFF

    Modifications performed in the original code of SuperLink

    Statistics Gathering and Analyzing

  1. Parameters:
    1. Input Files: Datafile and Pedfile - in the same format as in the original SuperLink.
    2. -optimize: a flag that specifies whether the optimization should be activated in the current run.
       If this flag is provided, the optimization over the variable elimination order is performed.
       Otherwise, the variable elimination order is determined by the original greedy algorithm.
  2. The Optimization Policy:


      Definition of Elimination sequence Complexity:
      We measure the complexity of a given variable elimination sequence by the size of the maximal probability table created during the elimination process and by the sum of the sizes of all the probability tables induced by this sequence. If the PERFORM_PRODUCT_BREAKING flag is set, we also take into consideration the number of break variables in the sequence and the number of their values.

      We determine the problem complexity coefficient by the following formula:
      complexityCoefficient = log10(sum of the table sizes * total complexity of the break variables)

      We strive to reduce this coefficient during the optimization process.

      We believe that this coefficient yields a better estimate of the program run time than the sum of the table sizes alone.

      [Graphs: program run time plotted against the complexity coefficient and against the sum of table sizes]
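      A minimal sketch of the formula above, assuming hypothetical parameter names (the actual code accumulates these quantities during the greedy run):

      #include <math.h>

      /* Sketch of the complexity-coefficient formula. sumTableSizes is the sum
       * of all induced probability-table sizes; breakComplexity is the total
       * complexity of the break variables (1.0 when PERFORM_PRODUCT_BREAKING
       * is not set, so the coefficient reduces to log10 of the table-size sum). */
      double computeComplexityCoefficient(double sumTableSizes, double breakComplexity)
      {
          return log10(sumTableSizes * breakComplexity);
      }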


       


      Optimization Modes
      Tests show that the sum of table sizes has a greater impact on the program run time than the maximal table size. There are, however, cases where a certain elimination sequence slightly improves the former but drastically worsens the latter. Experience shows that in most cases such a sequence is still preferable as far as run time is concerned; however, the maximal table size is a potential bottleneck on machines with little RAM. We therefore define the following optimization modes:

        Strict mode - we balance the improvement in the sum of table sizes against the growth of the maximal table size, but only when this size exceeds BREAK_PRODUCT_BOTTOM_THRESHOLD.

        Loose mode - we accept any improvement in the sum of table sizes, regardless of its impact on the maximal table size.

      The mode is selected by the STRICT_OPTIMIZATION constant. It is currently set to 0 (Loose mode), as this produces better results (at least in our running environment).
       
       

      The Optimization Algorithm:

        Run greedy heuristics
        We first apply the original greedy algorithm in order to estimate the problem complexity.

        Determine Optimization Policy
        We use the complexity of the greedy sequence to calculate the complexityCoefficient by the formula above, and access the matching entry in the optimizationPolicy array, where the optimization policy for this coefficient is defined.

        Run the optimization process according to the found optimization policy
        In each step of the algorithm we choose several variables with the least elimination complexity; they form a variables pool, whose size is defined by the current optimization policy. Among them, we choose the next variable to be eliminated by tossing a weighted coin, where the weight given to each variable is a function of its elimination complexity. We chose the following three functions: LOG10, SQRT, and a function that gives all variables equal weight. Once a variable is chosen, we update the total sum of the table sizes and, if needed, the maximal table size by its elimination complexity. Once the whole elimination sequence is determined, we compare its complexity to the current minimum: if we have found a less complex sequence, we save it and discard the previous minimum; otherwise, we discard this sequence.
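        A sketch of the weighted choice follows. The function constants mirror the definitions later in this document; the inverse weighting (lower elimination complexity gets a higher weight) and all helper names are assumptions of this sketch, not taken from the code:

        #include <math.h>
        #include <stdlib.h>

        #define SQRT 0
        #define LOG  1
        #define EQ   2

        /* Assumed weighting: smaller elimination complexity => larger weight. */
        static double weight(double elimComplexity, int func)
        {
            switch (func) {
            case SQRT: return 1.0 / sqrt(elimComplexity);
            case LOG:  return 1.0 / log10(elimComplexity + 10.0); /* keep log positive */
            default:   return 1.0;                                /* EQ: uniform */
            }
        }

        /* Pick an index in [0, poolSize) with probability proportional to its weight. */
        static int tossWeightedCoin(const double *complexity, int poolSize, int func)
        {
            double total = 0.0, r;
            int i;
            for (i = 0; i < poolSize; i++)
                total += weight(complexity[i], func);
            r = ((double)rand() / RAND_MAX) * total;
            for (i = 0; i < poolSize; i++) {
                r -= weight(complexity[i], func);
                if (r <= 0.0)
                    return i;
            }
            return poolSize - 1; /* guard against floating-point drift */
        }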

        We repeat the search for an elimination sequence a number of times; the number of iterations is determined by the current optimization policy. During the optimization process we adjust the remaining number of iterations in accordance with the reduction of the problem complexity. That is, if the problem complexity drops below a pre-defined minimum, the optimization process is stopped in order to limit the optimization overhead in the total run time.

        During the search for the elimination sequence we check that the intermediate results are better than the current record. If they are not, we stop the current iteration, as it has no chance of improving on the previous champion, and proceed to the next iteration. The check depends on the currently defined optimization mode: if STRICT_OPTIMIZATION is set, we discard any sequence that enlarges the maximal table size by more than 10% over the previous champion; otherwise, we only check the total sum of table sizes.
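        A sketch of this check (the 10% margin and the mode flag come from the text above; the function and parameter names are illustrative):

        /* Returns 1 if the partially built sequence should be abandoned.
         * champSum / champMax describe the best sequence found so far. */
        static int shouldPrune(double curSum, double curMax,
                               double champSum, double champMax,
                               int strictOptimization)
        {
            if (curSum >= champSum)
                return 1;                     /* sum already worse: cannot win */
            if (strictOptimization && curMax > 1.10 * champMax)
                return 1;                     /* strict mode: max table grew > 10% */
            return 0;
        }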

        Currently, we start the optimization using the SQRT-based weights for the coin (which generally yield the best results). However, there are cases where one of the other functions wins, so we subsequently repeat the optimization with the two other functions, using the same number of iterations.

        In order to further reduce the optimization overhead in the total run time on heavy files, we apply the following policy: every 200 iterations we check the improvement rate, and if it is too low, we abandon the currently used heuristics and proceed to the next one.
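        A sketch of this policy; the 200-iteration period comes from the text, while the threshold value and the bookkeeping names are assumptions:

        #define CHECK_PERIOD 200
        #define MIN_IMPROVEMENT_RATE 0.01   /* illustrative threshold */

        /* Called once per iteration; returns 1 when the current heuristics
         * should be dropped in favour of the next weight function. */
        static int heuristicsExhausted(int iteration, int lastImprovementIter,
                                       double improvementRate)
        {
            if (iteration % CHECK_PERIOD != 0)
                return 0;                     /* only check every 200 iterations */
            return (iteration - lastImprovementIter >= CHECK_PERIOD) ||
                   (improvementRate < MIN_IMPROVEMENT_RATE);
        }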
         


        Back to Top

    Data Structures:
    typedef struct optimizationPolicyDef {
        int numOfIterations;   /* optimization iterations to run per heuristics */
        int maxNumChoosenVars; /* maximal size of the variables pool */
        int skip;              /* step between successive pool sizes */
        int herNum;            /* total number of variable pools used */
    } optimizationPolicyDef;

    optimizationPolicyDef optimizationPolicy[MAX_COMPLEXITY];

    The data structure above defines the optimization policy. The optimizationPolicy array is indexed by the complexity coefficient; each entry specifies the policy for its index. The entries and the size of the optimizationPolicy array may be configured according to the user's preferred policy. Our current configuration is based on multiple experiments in our environment and should be revised when running on more powerful machines.
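    Assuming the coefficient is truncated to an integer index (the bounds clamp is also an assumption of this sketch), the lookup might look as follows:

    optimizationPolicyDef *lookupPolicy(double complexityCoefficient)
    {
        int idx = (int)complexityCoefficient;   /* truncate to an array index */
        if (idx < 0) idx = 0;
        if (idx >= MAX_COMPLEXITY) idx = MAX_COMPLEXITY - 1;
        return &optimizationPolicy[idx];
    }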

     numOfIterations specifies the number of optimization iterations to run for each heuristics.
    The fields maxNumChoosenVars and skip define the sizes of the variables pools that will be used. The smallest pool size is hard-coded as 3 (which is usually sufficient), but on heavy files with a large number of variables it is worth expanding the variables pool. The pool sizes used sequentially start at 3 and increase in steps of skip, up to maxNumChoosenVars.
    herNum specifies the total number of variable pools to be used; this field is used to estimate the time that will be dedicated to the optimization process.
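    For illustration, the pool sizes implied by one policy entry can be enumerated as follows; the count is expected to match herNum:

    /* Enumerate the pool sizes implied by a policy entry: 3, 3 + skip, ...
     * up to maxNumChoosenVars, and count them. */
    int countVariablePools(const optimizationPolicyDef *p)
    {
        int size, count = 0;
        for (size = 3; size <= p->maxNumChoosenVars; size += p->skip)
            count++;
        return count;   /* expected to equal p->herNum */
    }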
     

    typedef struct Heuristics {
        int optimize;                 /* optimization process is active */
        int func;                     /* weight function currently applied */
        int bestFunc;                 /* weight function with the best results so far */
        int numBestVars;              /* size of the current variables pool */
        int iteration;                /* current iteration number */
        double origProblemComplexity; /* problem complexity after the greedy run */
        double newProblemComplexity;  /* best complexity achieved so far */
        BestResults *bestResults[3];  /* per-function best results (statistics) */
    } Heuristics;

     The data structure above contains the details of the currently applied heuristics. The optimize flag indicates that the optimization process is activated. func indicates the weight function applied. The functions are defined as follows:
    #define SQRT 0
    #define LOG 1
    #define EQ 2

    bestFunc is the weight function that has achieved the best results so far.
    numBestVars is the size of the current variables pool. iteration is the current iteration number.
    origProblemComplexity is the original problem complexity calculated after running the greedy algorithm; newProblemComplexity is the best complexity achieved so far.
    BestResults *bestResults[3] points to the structures that hold the best results achieved by each heuristics. This information is needed only for gathering statistics and analyzing the performance of the different heuristics.
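    For example, since the weight functions are numbered by the constants above, the per-function statistics can be accessed directly (an illustrative accessor, assuming the BestResults definition that follows):

    double bestSumForFunc(const Heuristics *h, int func)
    {
        return h->bestResults[func]->sumTable;   /* e.g. func == SQRT */
    }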

    typedef struct BestResults {
        double maxTable;            /* maximal table size of the best sequence */
        double sumTable;            /* sum of table sizes of the best sequence */
        double maxTableImprovement; /* improvement rate in the maximal table size */
        double sumTableImprovement; /* improvement rate in the sum of table sizes */
        int last_improvement_sum;   /* last iteration that improved the sum */
        int last_improvement_max;   /* last iteration that improved the max */
        boolean used;               /* this heuristics is in use */
    } BestResults;

    maxTable holds the maximal table size of the best sequence found by the matching heuristics.
    sumTable holds the sum of the table sizes of the best sequence found by the matching heuristics.
    The used flag indicates that this heuristics is in use.

    The following fields are used for the dynamic evaluation of the heuristics' performance:
    last_improvement_sum is the last iteration that improved the sum of table sizes for the matching heuristics.
    last_improvement_max is the last iteration that improved the maximal table size for the matching heuristics.
    maxTableImprovement is the improvement rate in the maximal table size.
    sumTableImprovement is the improvement rate in the sum of table sizes.

    Variables Pool

    During the search for the next variable to be eliminated we create the variables pool, implemented as a doubly linked sorted list with entries of type struct scoreData (below). The size of the pool depends on the current optimization policy. We iterate over all the variables that have not yet been eliminated and insert them into the pool until it reaches its capacity. After that, each new variable's elimination complexity is compared against the variables in the pool; if it is better than that of one of the pool variables, the new variable is added to the pool and the worst variable in the pool is removed.
     

    typedef struct scoreData {
        elimOrderVar *bestVar;         /* the variable itself */
        double score;                  /* its elimination complexity */
        struct scoreData *prev, *next; /* links for traversing the pool */
        double relProb;                /* probability of the variable being chosen */
    } scoreData;

    The data structure above holds the information for a single variable in the variables pool. bestVar is a pointer to the variable; score is its elimination complexity; relProb is the probability of the variable being chosen; prev and next are used for iterating over the variables pool.
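    An insertion into the pool might be sketched as follows, assuming the list is kept sorted by ascending score and that the caller tracks the pool size (all names besides scoreData are illustrative):

    /* Insert cand into the pool, keeping the list sorted by ascending score
     * (lower elimination complexity first). If the pool is full, the worst
     * (last) entry is evicted. head points to the best entry. */
    void poolInsert(scoreData **head, scoreData **tail, int *poolSize,
                    int maxPoolSize, scoreData *cand)
    {
        scoreData *cur;

        if (*poolSize == maxPoolSize) {
            if (cand->score >= (*tail)->score)
                return;                   /* not better than the worst entry */
            cur = *tail;                  /* evict the worst entry */
            *tail = cur->prev;
            if (*tail) (*tail)->next = NULL; else *head = NULL;
            (*poolSize)--;
        }

        /* Find the first entry with a larger score and link cand before it. */
        for (cur = *head; cur && cur->score <= cand->score; cur = cur->next)
            ;
        cand->next = cur;
        cand->prev = cur ? cur->prev : *tail;
        if (cand->prev) cand->prev->next = cand; else *head = cand;
        if (cur) cur->prev = cand; else *tail = cand;
        (*poolSize)++;
    }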

    Back to Top

    Program Configuration
      Turn Optimization ON/OFF
      Default : Optimization OFF.
      Configuration Switch: -optimize program parameter
      To turn on the optimization process, provide the -optimize parameter.

      Specify probability functions to apply
      Default: Apply all existing functions sequentially.
      Configuration Switch: RUN macro in Optimization.h
      Currently the following functions are available : LOG10, SQRT and the function that gives equal weight to all the variables.
      Use the RUN macro in Optimization.h to turn specific functions ON/OFF:
      #define RUN(LOG) 1
      #define RUN(SQRT) 1
      #define RUN(EQ) 1

      Specify Optimization Mode (See explanations about various modes)
      Default: Loose mode.
      Configuration Switch: STRICT_OPTIMIZATION constant in Optimization.h
      To turn on Strict mode, set STRICT_OPTIMIZATION to 1.

      Specify Optimization Policy (based on current h/w parameters)
      Default: Defined in function configureOptimizationPolicy() in EliminationOrder.c
      Configuration Switch: OptimizationPolicy array
      View Policy Configuration details

      Turn Debug messages ON/OFF
      Default: Print all Debug Messages
      Configuration Switch: DEBUG_OPTIMIZATION constant in EliminationOrder.c
      To turn debug messages OFF, set DEBUG_OPTIMIZATION to 0.
       
       

    Back to Top

     

     

    Modifications performed in the original code of SuperLink:

    Note: all the modifications are marked by // NGMS

      added 1 header file "Optimization.h".

      added 3 functions: printBestResults, configureOptimizationPolicy and getEliminateVarElimOrderProb to EliminationOrder.c

      added the following fields to struct elimOrderPed in "EliminationOrderStructures.h" :
      double maxTableSize;      /* maximal table size of the found sequence */
      double sumTableSizes;     /* sum of the table sizes of the found sequence */
      double problemComplexity; /* complexity of the found sequence */

      The fields are used for storing the elimination complexity parameters for an elimination sequence found by the optimization procedure.

      added struct scoreData to EliminationOrder.h

      modified function main in main.c - added support for the -optimize parameter and statistics gathering.

      modified function calcLikelihoodSMlink in Prog.c - added support for the -optimize parameter and statistics gathering.

      modified function determineEliminationOrder in EliminationOrder.c - added the implementation of the optimization algorithm.

      modified function determineEliminationOrderPed in EliminationOrder.c - moved the elimOrderArr array allocation to determineEliminationOrder().

      modified function createElimOrderArr in EliminationOrder.c - added the implementation of the probabilistic search for the elimination sequence.

    Back to Top

    Statistics Gathering and Analyzing
    The program creates 3 files in CSV (Excel-compatible) format in the current directory. If the files already exist, the new data is appended to them. Each run of the program appends one or several rows to each of the files. Each row contains the pedfile/datafile names and the number of people/loci. In addition, each file contains the following information:
    BestResultsMax.csv - contains the best size of the maximal table that was found by each of the heuristics. This information is used to determine which heuristics is preferable for given pedigree characteristics.

    BestResultsSum.csv - contains the best sum of table sizes that was found by each of the heuristics. This information is used to determine which heuristics is preferable for given pedigree characteristics.

    TotalResults.csv - contains various data about the performance of the optimization process: the optimization policy that was applied to the current file, the time consumed by the optimization process, the complexity of the best elimination sequence found by the optimization procedure, and the total run time. We use this file to compare the performance of the optimized program with that of the original program. The file is created in the following way: first invoke the program on the chosen input files with the -optimize flag; after it finishes, invoke it on the same files without the -optimize flag (this runs the original greedy algorithm). The second run only adds its total run time to TotalResults.csv. Thus the two runs create one row in this file that contains all the data necessary for analyzing the optimization performance.
     

      Back to Top