Geppetto - Population Synthesis Software

by Amit Tzur (sendwithchibo AT gmail DOT com) and Daniel Anvar (danielanvar AT gmail DOT com)

January 2nd, 2008, under the guidance of Prof. Geiger Dan (dang AT cs DOT technion DOT ac DOT il) and Bercovici Sivan (sberco AT cs DOT technion DOT ac DOT il)

Problem

Geppetto is population synthesis software. It produces genomes of individuals from genotype data provided by the user, creating each genome by assigning an allele to every marker in the genotype data. The genotype data defines a probability for each allele at each marker, and Geppetto assigns alleles according to those probabilities.
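As an illustration of assigning alleles according to their probabilities, here is a minimal Java sketch. It is not taken from Geppetto's source: the class name AlleleSampler, the methods sampleAllele and sampleGenome, and the representation of a marker as a map from allele to probability are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration only; these names do not come from Geppetto itself.
class AlleleSampler {

    /** Picks one allele from a marker's allele-to-probability map (probabilities should sum to 1). */
    static String sampleAllele(Map<String, Double> alleleProbabilities, Random rng) {
        double u = rng.nextDouble(); // uniform in [0, 1)
        double cumulative = 0.0;
        String last = null;
        for (Map.Entry<String, Double> entry : alleleProbabilities.entrySet()) {
            cumulative += entry.getValue();
            last = entry.getKey();
            if (u < cumulative) {
                return last;
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("marker has no alleles");
        }
        // Reached only if rounding leaves the probabilities summing to slightly less than 1.
        return last;
    }

    /** Builds a genome by sampling one allele for every marker, in order. */
    static List<String> sampleGenome(List<Map<String, Double>> markers, Random rng) {
        List<String> genome = new ArrayList<>();
        for (Map<String, Double> marker : markers) {
            genome.add(sampleAllele(marker, rng));
        }
        return genome;
    }

    public static void main(String[] args) {
        Map<String, Double> marker1 = new LinkedHashMap<>();
        marker1.put("A", 0.7);
        marker1.put("G", 0.3);
        Map<String, Double> marker2 = new LinkedHashMap<>();
        marker2.put("C", 0.5);
        marker2.put("T", 0.5);
        // Prints one randomly sampled allele per marker.
        System.out.println(sampleGenome(Arrays.asList(marker1, marker2), new Random()));
    }
}
```

The sketch walks the cumulative distribution of each marker, so an allele with probability 0.7 is chosen in roughly 70% of the sampled genomes.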

Geppetto aims to provide an extensible tool that can create diverse populations under different scenarios, with sick and healthy individuals under a certain disease model.

Implementation

Geppetto is written in Java. Because of the size of the genotype data, the size of the output (each individual genome is a collection of text files), and the variety of population scenarios, Geppetto uses large amounts of CPU time and memory, as well as the hard drive, and its execution time may range from a few seconds to several minutes (depending on the type of population creation and the genotype data). To keep track of a Geppetto run, watch the console logs during execution (Tip: use the trace verbosity level).

The admixed population creation scenarios that require sick people are the most prone to lengthy execution times. Geppetto creates a candidate person and then uses the disease model to determine whether that person is actually sick. If the person is not sick, Geppetto tries again, which can result in a long execution time. Execution time therefore depends on the disease model and on the complexity of the population being created (the admixed population creation methods are the most complex).
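The retry loop described above is a form of rejection sampling. If each candidate turns out sick with probability p, the number of attempts follows a geometric distribution with mean 1/p, so a disease model that makes 1% of candidates sick needs about 100 attempts per sick individual on average. The Java sketch below illustrates the idea under stated assumptions: the class RejectionSampler, the method sampleUntil, the maxAttempts cap, and the toy disease model in main are all hypothetical and are not Geppetto's actual code.

```java
import java.util.Random;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical illustration only; these names do not come from Geppetto itself.
class RejectionSampler {

    /**
     * Draws candidates until one is accepted, giving up after maxAttempts draws.
     * The cap is an assumption of this sketch, added to avoid an endless loop.
     */
    static <T> T sampleUntil(Supplier<T> candidateSource, Predicate<T> accept, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            T candidate = candidateSource.get();
            if (accept.test(candidate)) {
                return candidate;
            }
        }
        throw new IllegalStateException("no accepted candidate after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        Random rng = new Random();
        // Toy stand-ins: a "person" is just a count of risk alleles (0, 1 or 2),
        // and the toy disease model makes a person sick with probability 0.05 * count.
        Supplier<Integer> createPerson =
                () -> (rng.nextBoolean() ? 1 : 0) + (rng.nextBoolean() ? 1 : 0);
        Predicate<Integer> isSick = riskAlleles -> rng.nextDouble() < 0.05 * riskAlleles;

        int sickPerson = sampleUntil(createPerson, isSick, 100_000);
        System.out.println("Sick individual carries " + sickPerson + " risk allele(s)");
    }
}
```

In this toy setup a candidate is sick with probability 0.05 on average, so roughly 20 candidates are created for every sick individual kept. This is why a rarer disease translates directly into a longer run.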

Geppetto is highly configurable - the user can define an extensive disease model, the rate of genetic recombination in the pedigree, the number of sick and healthy people to produce, the number of generations of admixture that the created population underwent (see the admixed population creation scenarios for more info: HI, CGF and Straight Admixed), and more.

Downloads

Full report (PDF)

Source code (.zip)


A detailed description of the project is available.