NPL Analysis: finding NPL scores Spairs and Sall.

Algorithms and complexity.

The main idea of the implementation is to go over all possible inheritance vectors in special order and for each vector to find Spairs and Sall scores. We use “gray code” to define the order of looping (over inheritance vectors):

Example of “gray code” for vector containing 3 bits: 000, 001, 011, 010, 110, 111, 101, 100. This order is very useful, because only 1 bit in vector is changed per iteration, and in the case when the inheritance bit is changed in the person which is not affected or has no affected children, NPL score (Spairs or Sall) doesn’t depend on this bit, so we can just copy the scores from the previous iteration.

This order is also used in the computation of Sall(v) for some vector v: we collect alleles from the affected individuals, one allele from each individual, get some collection and compute it’s score; when we want to get the next collection, we change 1 bit for some individual (in the order of “gray code”) and use the saved score of the previous collection to compute in O(1) score for current collection.

Here is more detailed description of the algorithm.

The main steps in the computation of Spairs and Sall scores:

Let’s assume:

F – number of founders in the pedigree

N – number of non-founders in the pedigree

A – number of affected individuals in the pedigree

SizeV = 2*N-F– the size of inheritance vector

PossibilitiesV =  2^SizeV – number of inheritance vectors

1)  first of all, we need to initiate all the relevant data structures for the further computation:

- the array “ped” of the individuals in current pedigree, where indexes of parents are smaller than indexes of their children (parents appear to the left of their children in “ped” – topological sorting of pedigree graph), and for each person is stored the next information: indexes of parents in “ped” (mother and father), affected(true/false), allele1 and allele2 (out of 2*F founders alleles), useful (if the person has no affected children and is not affected himself => not useful, else useful). In this step, we initiate “ped” by using arrays PedigreeGraph, Founders and NonFounders (they already exist in the program), and each person in “ped” gets indexes of parents and affected status : time and memory complexity is O(F+N);

- we go over “ped” and create the linked list of affected persons which is then converted to array of pointers to affected persons in “ped”(Affected array): time complexity O(F+N), memory O(A);

- we go over all founders in “ped” and for each founder with index i we store alleles (2*i = allele1, 2*i+1 = allele2): time complexity O(F);

- we allocate array Falleles of size 2*F, which will be used for storing the number of times each founder allele appears in affected individuals during the computation of NPL scores, initiated to 0 at each cell: memory and time O(F);

- we also allocate array Aff of size A (for Sall computation), each cell with index i in Aff will reflect the status of collecting allele from appropriate affected individual (with index i in Affected): 1 if allele1 is collected, 2 if allele2 is collected (all cells are initiated to 1): memory O(A), time O(A);

- in order to initiate “useful” field of each person in “ped”, we iteratively go over “ped” from the last index (F+N-1) to first and perform for each person: if he is affected => update for him and recursively for all his ancestors “useful”=true, else go to the next (previous in order of “ped”) person; time complexity O(F+N);

- we allocate array Ivector (inheritance vector) of size SizeV, in each cell store: inheritance status (0 or 1; initiated to 0), index of person in “ped” for which this cell is responsible; parent status for this person (mother or father), for which this cell is responsible; last 2 fields we initiate by going through all nonfounders in “ped” and only for those nonfounders that are not first children of their parents, we store their bits in Ivector; all the alleles in nonfounders bits which are first children of their parents are stored once (in “ped”, allele1 and allele2) and will be never changed (assuming inheritance pattern 0 at corresponding bits), for the rest of nonfounders – their alleles are stored using the inheritance pattern defined by Ivector (and will be changed at each iteration of the main loop, for each inheritance vector): time complexity O(N), memory O(SizeV);

Overall time and memory complexity of step 1: O(F+N).

Now we are ready to perform the computation of NPL scores, step 2.

2) As it was mentioned above, in the main loop of this step we go over all inheritance vectors in the “gray code” order, in each iteration we also remember the values of Spairs and Sall from the previous iteration. Before computing new scores we check if the current bit of Ivector to be changed corresponds to useful person or not: if not => we change this bit and copy the previous values of Sall and Spairs to the current scores, and go to the next vector. If the person is useful => we change the bit, rearrange alleles in “ped” according to new inheritance vector (this may take O(F+N) time) and compute Spairs(v) and Sall(v) separately (for the current inheritance vector v):

Computing Spairs(v): intiation: Spairs(v) = 0; we go through all affected persons using Affected array and for each allele (1 and 2) of person i we add to Spairs(v) Falleles[allele] and then increase Falleles[allele]++; this yields exactly the number of pairs of persons that share some founder alleles (IBD, because each founder allele is designated with unique number).

Time complexity: O(A).

Computing Sall(v):  initiation: Sall(v) = 1; Sum =0;

For each affected person i from Affected we collect allele1 (by setting Aff[i] = 1 and increasing Falleles[i.allele1]++), and multiply Sall(v) <- Sall(v)*Falleles[i.person];

Time complexity of initiation: O(A);

This yelds initial Sall(v) score ( = Pi (Falleles[i]!); Sum = Sall(v);

Now we iterate through all possible collections exactly the same way as we did with vectors in main loop using “gray code” order, and for each collection bit i to be changed we perform (assuming that the score from the previous iteration is Sall(v)):

If Aff[i] = 1 (then now we need to pick up allele2 of person i instead of allele1) :

Aff[i] = 2;

We divide Sall(v) by the number of times that allele1 of person i appears in Falleles (by Falleles[Affected[i].allele1]);

Decrease Falleles[Affected[i].allele1]-- (because now we pick up allele2 instead of allele1);

Increase Falleles[Affected[i].allele2]++;

We multiply Sall(v) by the number of times that allele2 of person i appears in Falleles (Falleles[Affected[i].allele2]);

Else we do exactly the same but allele1 and allele2 are now changed in their places.

Each iteration takes O(1) time.

At the end of loop : Sall(v) = Sum/(2^A);

The overall time complexity of computing Sall(v) is O(A) + (number of iterations)*O(1) = O(A)+O(2^A) = O(2^A);

The overall time complexity of step 2:

(number of inheritance vectors)*( O(F+N) + O(A) + O(2^A) ) = O(PossibilitiesV*(F+N+A+2^A)). Memory complexity of this step is equal to the memory complexity of step 1 = O(F+N) , we just use structures which were initialized in step 1.

So the memory complexity of the whole algorithm is O(F+N) and time complexity is O(PossibilitiesV*(F+N+A+2^A)) = O( 2^(A+2*N-F) + (F+N+A)*2^(2*N-F) ). By the definition of “O” it can be said that time complexity is O( 2^(A+2*N-F) ) only, but in practice O( (F+N+A)*2^(2*N-F) ) can be also meaningful.