Ideas for improvement.


The general idea of the implementation of NPL analysis in this project is to iterate over inheritance vectors in the outermost loop and over alleles in the inner loop: for each new inheritance vector possibly there are O(F+N) alleles to rearrange, computation of Spairs takes O(A) and computation of Sall – O(2^A) [ F – number of founders, N – number of nonfounders, A – number of affected ]. Overall number of inheritance vectors is O(2^(2*N-F)), so the computation of Spairs for all vectors takes O((F+N+A)*2^(2*N-F)) and the computation of Sall – O((F+N+2^A)*2^(2*N-F)), as it was already mentioned in complexity analysis.

But this may be reduced to O(2^(2*N-F)) for Spairs and to O(2^(A+2*N-F)) for Sall, if we will use another principle: not to iterate over vectors in the outermost loop, but to “build” vector bit by bit and for each new bit of vector to compute new score using score (either Spairs or Sall) of the previous “shorter” vector (and without taking into consideration the rest of the bits that will be added to the vector later), when we start with vector of length 1 and compute initial score assuming only 1 bit. This principle avoids duplicated calculations for vectors that differ only for branches in the pedigree.

If the previous vector was v, we added some bit to v and got v’, then we can calculate new scores in the following manner (assuming that the previous scores were Spairs(v) and Sall(v)):

Spairs(v’): let’s assume we have array of 2*F alleles “f_alleles” in which for each founder allele j we store number of times allele j appears in pedigree in different affected individuals assuming vector v; if the added bit corresponds to allele i in some affected individual then: Spairs(v’) <– Spairs(v) + f_alleles[i] ;  f_alleles[i]++;

                   this simple calculation takes O(1).

Sall(v’): this case is much more complicated, because in order to use the previous score we need the scores for all collections (for each collection h we need to store ∏bi(h)! – multiplication of factorials of number of times each allele i appear in h), and not only that, for each collection h we need also array “f_alleles” (for each allele j storing the number of times j appears in h), that means memory complexity increases up to O(2*F*2^A); if we have all this and the added bit corresponds to allele i in some affected individual then we loop over all collections and for each collection h’ we use the score of corresponding collection h (from vector v) and array f_alleles of that collection and get: f_alleles[i]++;                ∏bi(h’)! <- (∏bi(h)!)*f_alleles[i];

we take sum over all new scores of the collections (dividing of sum by 2^A is performed only at the last step);

for each collection O(1) operations are performed, so computing the new Sall score takes O(2^A) time.

         The number of inheritance vectors is O(2^(2*N-F)), so we get the desired time complexity.

         It is not the exact implementation, its only the main idea and perhaps can be improved further (especially memory complexity in the calculation of new Sall score).