Uncovering the Molecular Biological Principles that Govern Cellular Systems
Orly Alter, Assistant Professor
Department of Biomedical Engineering, and Fellow, Institute for Cellular and Molecular Biology, University of Texas at Austin
December 27th

DNA microarrays make it possible, for the first time, to record the complete genomic signals that guide the progression of cellular processes. Future discovery in biology and medicine will come from the mathematical modeling of these data, which hold the key to fundamental understanding of life on the molecular level, as well as answers to questions regarding diagnosis, treatment and drug development.

I will discuss the first data-driven models that were created from these genome-scale data, through adaptations and generalizations of mathematical frameworks from matrix algebra that have proven successful in describing the physical world, in such diverse areas as mechanics and perception: the singular value decomposition (SVD) model, the generalized SVD (GSVD) comparative model and the pseudoinverse projection integrative model. These models provide mathematical descriptions of the genetic networks that generate and sense the measured data, where the mathematical variables and operations represent biological reality: The variables, patterns uncovered in the data, correlate with activities of cellular elements, such as regulators or transcription factors, that drive the measured signal, and cellular states where these elements are active. The operations, such as data reconstruction, rotation and classification in subspaces of selected patterns, simulate experimental observation of only the cellular programs that these patterns represent.
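
To make the first of these concrete: the SVD expresses a genes-by-arrays expression matrix as a sum of rank-one patterns, and "reconstruction in a subspace of selected patterns" is a simple matrix operation. A minimal, illustrative Python sketch on synthetic data (not the talk's actual analysis pipeline):

    import numpy as np

    # Toy expression matrix: rows are genes, columns are arrays (e.g. time points).
    rng = np.random.default_rng(0)
    E = rng.normal(size=(100, 12))

    # SVD: E = U S Vt.  Rows of Vt are "eigengenes" and columns of U are
    # "eigenarrays" in the terminology of reference 1 below.
    U, s, Vt = np.linalg.svd(E, full_matrices=False)

    # Fraction of the overall expression captured by each pattern.
    frac = s**2 / np.sum(s**2)

    # Data reconstruction in a subspace of selected patterns: keep the k
    # strongest eigengenes and project out the rest.
    k = 2
    E_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]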

I will illustrate these models in the analyses of RNA expression data from yeast and human during their cell cycle programs and DNA-binding data from yeast cell cycle transcription factors and replication initiation proteins. Two alternative pictures of RNA expression oscillations during the cell cycle that emerge from these analyses, which parallel well-known designs of physical oscillators, convey the capacity of the models to elucidate the design principles of cellular systems as well as guide the design of synthetic ones. In these analyses, the power of the models to predict previously unknown biological principles is demonstrated with a prediction of a novel mechanism of regulation that correlates DNA replication initiation with cell cycle-regulated RNA transcription in yeast.

These models may become the foundation of a future in which biological systems are modeled as physical systems are today.

References: 1) Alter, Brown & Botstein, PNAS 97, 10101 (2000).
2) Alter, Brown & Botstein, PNAS 100, 3351 (2003).
3) Alter & Golub, PNAS 101, 16577 (2004).
4) Alter & Golub, PNAS 102, 17559 (2005).

Functional and Evolutionary Inference in Gene Networks: Does Topology Matter? (An Evolutionary Approach to Systems Biology)
Prof. Aviv Bergman
Albert Einstein College of Medicine, Bronx, NY
December 22nd

The relationship between the topology of a biological network and its functional or evolutionary properties has attracted much recent interest. It has been suggested that most, if not all, biological networks are 'scale free.' That is, their connections follow power-law distributions, such that there are very few nodes with very many connections, and very many nodes with few connections. The number of target genes of known transcriptional regulators in the yeast Saccharomyces cerevisiae appears to follow such a distribution, as do other networks, such as the yeast network of protein-protein interactions. These findings have inspired attempts to draw biological inferences from general properties associated with scale-free network topology. One often cited general property is that, when compromised, highly connected nodes will tend to have a larger effect on network function than sparsely connected nodes. That is, robustness is connectivity dependent. For example, more highly connected proteins are more likely to be lethal when knocked out. However, the correlation between lethality and connectivity is relatively weak, and some highly connected proteins can be removed without noticeable phenotypic effect. Similarly, network topology only weakly predicts the response of gene expression to environmental perturbations. Evolutionary simulations of gene networks suggest that such weak or nonexistent correlations are to be expected, and are likely not due to inadequacy of experimental data. We argue that 'top-down' inferences of biological properties based on simple measures of network topology are of limited utility, and our simulation results suggest that much more detailed information about a gene's location in a regulatory network, as well as dynamic gene-expression data, are needed to make more meaningful functional and evolutionary predictions.
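
For readers unfamiliar with the term: a scale-free degree distribution is what preferential-attachment growth produces. A small illustrative sketch (a generic Barabasi-Albert-style construction, not tied to the talk's data):

    import numpy as np

    # Preferential attachment: each new node adds m edges to existing nodes
    # chosen with probability proportional to their current degree.
    rng = np.random.default_rng(1)
    n, m = 2000, 2
    edges = [(0, 1)]
    deg_pool = [0, 1]          # each node appears once per incident edge end
    for v in range(2, n):
        chosen = set()
        while len(chosen) < m:
            chosen.add(deg_pool[rng.integers(len(deg_pool))])
        for u in chosen:
            edges.append((v, u))
            deg_pool.extend([v, u])

    deg = np.bincount(np.array(edges).ravel())
    # A power-law distribution looks roughly linear on log-log axes:
    # many nodes of small degree, a few heavily connected hubs.
    vals, counts = np.unique(deg, return_counts=True)
    print(list(zip(vals[:5], counts[:5])), "... max degree:", deg.max())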

The Superlink-online System for Faster Multipoint Linkage Analysis via Parallel Execution on Thousands of Personal Computers
Mark Silberstein
Department of Computer Science, Technion
December 15th

Computation of LOD scores is a valuable tool in mapping disease-susceptibility genes in the study of Mendelian and complex diseases. However, computation of exact multipoint likelihoods of large inbred pedigrees with extensive missing data is often beyond the capabilities of a single computer. We present a distributed system called Superlink-online for computing multipoint LOD scores of large inbred pedigrees. It achieves high performance via efficient parallelization of the algorithms in Superlink, a state-of-the-art serial program for these tasks, and through utilization of idle cycles of thousands of PCs. The main algorithmic challenge has been to efficiently split a large task for distributed execution in a highly dynamic, non-dedicated running environment. Notably, the system is available online, which allows computationally intensive analyses to be performed with no need for either installation of software or maintenance of a complicated distributed environment. As the system was developed, it was extensively tested by collaborating medical centers worldwide on a variety of real data sets, some of which will be presented here.

Understanding genomes: predicting function with sequence and structure
Prof. Steven Brenner
Department of Plant and Microbial Biology
University of California, Berkeley
Thursday, November 3rd, 13:30

No abstract available.

Biological context networks: viewing biology via multiple functionalities and milieux
Prof. Simon Kasif
Department of Bioengineering, Boston University
Thursday, August 18th, 11:30

In this talk we will focus on the need to build new methodologies for studying the vast diversity of molecular mechanisms found in nature. For various historical reasons, much of bioinformatics research, originating from sequence alignment, has focused on identifying conserved domains, conserved motifs and conserved networks. While SNPs and polymorphisms have been a major focus for biological and clinical research, this research is primarily aimed at understanding micro-evolution and its profound implications. It is much more challenging to identify relationships among fast-evolving genes and systems. We describe our preliminary research aimed at elucidating or modeling such mechanisms, their biological implications and the challenges involved in this study. Speculating, and using the analogy between biology and language, both evolutionary artifacts, we hypothesize a heavy-tailed (Zipf) distribution over the frequencies of basic components found in nature. Consequently, if we focus only on the highly conserved, we will learn the relatively significant but incomplete parts of biology that are, by some measure, already reasonably "well" understood.

Homology mapping using associative Markov Networks
Anat Caspi

UCSF/UCB Joint Graduate Group in Bioengineering

Thursday, August 11th

We introduce a novel method for biological sequence comparison that expands the model of mutation to include operators for chromosomal duplication, inversion and rearrangement. The method uses a graphical model (specifically, a Markov random field) with an underlying statistical model to express the relationships among components of the sequence. In this sense, the approach is probabilistic (like Hidden Markov Models or profile approaches, as in HMMER, SAM, etc.). Utilizing the unique underlying structure of the sequence comparison problem, we find the maximum a posteriori assignment of homologs. The model prefers assignments with no mismatches, gaps, duplications and inversions, but allows them. In this sense, this is a score optimization approach (like MSA or SAGA). This approach allows us to calculate the statistical significance of a particular homology assignment against another. Notably, the method is not restricted to pairwise sequence comparisons (whereas traditional scoring matrices and gap costs are not directly applicable in multiple sequence alignment). Our method can be directly extended to multiple sequence homology assignments. The extension to multiple sequences is also guaranteed to infer the optimal-scoring homology assignment and does not proceed iteratively or progressively.

Information Theoretic Approaches to Whole Genome Phylogenies
Benny Chor

Tel-Aviv University

Thursday, June 14th

We describe a novel method for efficient reconstruction of phylogenetic trees, based on sequences of whole genomes or proteomes. The core of our proposal is an algorithm for efficiently computing pairwise distances. This is a simple string algorithm, grounded in information-theoretic measures.

The algorithm is fast enough to enable constructing the tree of two hundred species, and the forest of almost two thousand viruses. An initial analysis of the results exhibits a remarkable agreement with the "acceptable phylogenetic truth". We will compare it to other known whole-genome reconstruction techniques. This comparison shows that our method significantly improves on the existing approaches, in terms of both performance and accuracy.
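
The flavor of such distances can be conveyed in a few lines. The sketch below uses the normalized compression distance with zlib as the compressor; it is a generic information-theoretic proxy for sequence relatedness, not necessarily the exact measure of the talk:

    import random
    import zlib

    def c_len(s: bytes) -> int:
        """Length of the zlib-compressed string, a rough complexity estimate."""
        return len(zlib.compress(s, 9))

    def ncd(x: bytes, y: bytes) -> float:
        # Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y)).
        cx, cy, cxy = c_len(x), c_len(y), c_len(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    rng = random.Random(0)
    s1 = b"ATGGCGTGCA" * 50                  # two nearly identical "genomes"
    s2 = s1[:490] + b"ACGTACGTAC"
    s3 = bytes(rng.choices(b"ACGT", k=500))  # an unrelated random sequence
    print(ncd(s1, s2), ncd(s1, s3))          # the related pair scores lower

A matrix of such pairwise distances can then be fed to any standard distance-based tree reconstruction method, such as neighbor joining.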

Joint work with David Burstein, Igor Ulitsky, and Tamir Tuller

Predicting Domain Binding Sites through a Combination of Protein Interaction Data
Doron Betel

Samuel Lunenfeld Research Institute, Mt. Sinai Hospital, Toronto
and Department of Biochemistry, University of Toronto, Toronto

Thursday, April 20

Protein interactions play a crucial role in driving many cellular processes such as intra-cellular signaling, transcription control, cell cycle regulation and metabolic activities. The interactions are often mediated by the binding of conserved regions (i.e. conserved domains) to short sequence motifs known as binding recognition modules. Despite the increased coverage and sensitivity of experimental and computational techniques for detecting protein interactions, few domains have known binding modules. This talk will present a two-step approach for learning the binding modules of interacting domains through a combination of structural and experimental interaction data. In the first step, protein complexes with structural information are used to extract the short sequence motifs that bind to a domain of interest. These short peptide sequences represent a first approximation of the domain's binding modules. In the second step, the short motifs are converted to position-specific scoring matrices (PSSMs) using Gibbs sampling on the subset of experimental pairwise interactions that contain the domain of interest. The resulting PSSMs were then used to predict new interactions between yeast proteins. The predicted interactions were compared to a new and comprehensive list of interactions extracted from the literature, as well as to a new set of unpublished experimental data. In addition, a number of the predicted interactions were confirmed by directed experiments.
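
The central object of the second step, a PSSM built from short binding peptides and slid along candidate partners, can be sketched as follows. This is a toy log-odds construction with pseudocounts on invented peptides; the talk's actual procedure uses Gibbs sampling over experimental interaction data:

    import numpy as np

    AA = "ACDEFGHIKLMNPQRSTVWY"
    IDX = {a: i for i, a in enumerate(AA)}

    def build_pssm(motifs, pseudo=1.0):
        """Position-specific scoring matrix (log-odds vs. a uniform
        background) from aligned, equal-length binding peptides."""
        L = len(motifs[0])
        counts = np.full((L, len(AA)), pseudo)
        for m in motifs:
            for pos, a in enumerate(m):
                counts[pos, IDX[a]] += 1
        freqs = counts / counts.sum(axis=1, keepdims=True)
        return np.log2(freqs / (1.0 / len(AA)))

    def best_site(pssm, seq):
        """Slide the PSSM along a protein; return the best score and offset."""
        L = pssm.shape[0]
        scores = [sum(pssm[i, IDX[seq[j + i]]] for i in range(L))
                  for j in range(len(seq) - L + 1)]
        j = int(np.argmax(scores))
        return scores[j], j

    pssm = build_pssm(["PPSYP", "PPAYP", "PPGYP"])   # toy binding peptides
    print(best_site(pssm, "MKTAYPPSYPQLV"))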

Joint analysis and visualization of SNP genotype and mRNA expression data.
Anya Tsalenko, Ph.D.
Scientist at Agilent Laboratories, Palo Alto, CA USA

Thursday, March 17th

High-throughput expression profiling and genotyping technologies drive recent interest in the genetics of normal variation in the expression levels of transcripts and in the genetic basis of variation observed in disease. Several studies in model animals as well as in humans have collected and analyzed parallel genotype and expression data. I will describe statistical approaches to determine genetic association between SNP sites and the expression levels of transcripts. Using these methods we compute an association matrix, whose entries represent the p-values of the association for every SNP/transcript pair. SNPs whose corresponding rows have significantly many entries with low p-values are potentially related to the normal functioning of transcription factors or other regulatory molecules. In a collaboration with the Institute for Cancer Research in Oslo, we studied such associations in breast cancer patients; the genotype data for this study was collected by Dr. Vessela Kristensen's group. We adapted algorithms from the literature to search the association matrix for biclusters - subsets of SNPs that have significantly many associations in common with a set of transcripts. We analyze the identified gene sets using Gene Ontology (GO) annotations. Methods for visualizing the results of such analyses will be presented.
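
The association matrix itself is straightforward to compute once a per-pair test is chosen. An illustrative sketch on synthetic data, using a Kruskal-Wallis test across the three genotype groups (the study's own statistic may well differ):

    import numpy as np
    from scipy.stats import kruskal

    rng = np.random.default_rng(2)
    n_samples, n_snps, n_transcripts = 60, 30, 40

    geno = rng.integers(0, 3, size=(n_samples, n_snps))   # 0/1/2 allele counts
    expr = rng.normal(size=(n_samples, n_transcripts))    # log expression
    expr[:, 0] += 0.8 * geno[:, 0]                        # one planted association

    # Association matrix: a p-value for every SNP/transcript pair.
    P = np.ones((n_snps, n_transcripts))
    for i in range(n_snps):
        groups = [expr[geno[:, i] == g] for g in range(3)]
        for j in range(n_transcripts):
            samples = [g[:, j] for g in groups if len(g) >= 2]
            if len(samples) >= 2:
                P[i, j] = kruskal(*samples).pvalue

    # SNP rows with many small entries are candidate regulatory variants.
    print((P < 0.01).sum(axis=1)[:5])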

The cell cycle expression program: From yeast to humans
Ziv Bar-Joseph
Center for Automated Learning and Discovery and Computer Science Department, School of Computer Science, Carnegie Mellon University

Tuesday, 8th March

Cyclic systems, such as the cell cycle, play an important role in many biological processes. Recent advances in high-throughput experimental methods in molecular biology are enabling researchers to obtain a global view of the expression program of such systems. By developing algorithms that combine diverse high-throughput biological datasets we were able to model some of these dynamic systems in yeast. However, when moving from model organisms to humans we face many new computational challenges. Human systems are more complex, their temporal duration is longer and the data is often noisier. In this talk I will present algorithms that combine ideas from computer science, statistics and signal processing to address these issues. Using these algorithms we were able to determine, for the first time, the set of cycling human genes. This set can be used to identify cancer-related genes and to determine the core set of cell cycle genes using a method we call comparative systems biology.
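
A common ingredient in calling cycling genes is scoring each expression profile for power at the cell cycle frequency. A minimal illustrative version on synthetic profiles (the talk's algorithms combine considerably more information than this single statistic):

    import numpy as np

    def periodicity_score(x, period, dt=1.0):
        """Power of the profile x at the cell cycle frequency, as a
        fraction of total power: one simple way to rank cycling genes."""
        x = np.asarray(x, float) - np.mean(x)
        freqs = np.fft.rfftfreq(len(x), d=dt)
        power = np.abs(np.fft.rfft(x)) ** 2
        k = np.argmin(np.abs(freqs - 1.0 / period))
        return power[k] / power.sum()

    t = np.arange(0, 40, 2.0)                    # samples every 2 time units
    rng = np.random.default_rng(3)
    cycling = np.sin(2 * np.pi * t / 20) + 0.3 * rng.normal(size=t.size)
    flat = rng.normal(size=t.size)
    print(periodicity_score(cycling, period=20, dt=2.0))  # high score
    print(periodicity_score(flat, period=20, dt=2.0))     # low score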

A 1.375-Approximation Algorithm for Sorting by Transpositions
Isaac Elias, Dept. of Numerical Analysis and Computer Science, Royal Institute of Technology, Stockholm, Sweden
13th January 2005
Sorting permutations by transpositions is an important problem in genome rearrangements. A transposition is a rearrangement operation in which a segment is cut out of the permutation and pasted in a different location. The complexity of this problem is still open, and for ten years improving on the best known 1.5-approximation algorithm has been an open problem. We provide a 1.375-approximation algorithm for sorting by transpositions. The algorithm is based on new results regarding the diameter of three subsets of the symmetric group: we determine the exact transposition diameter of 2-permutations and simple permutations, and find an upper bound for the diameter of 3-permutations.
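
To fix terminology: a transposition exchanges two adjacent segments of the permutation. For tiny permutations the exact transposition distance can even be found by brute-force search, which is what approximation algorithms avoid for realistic sizes. An illustrative sketch:

    from collections import deque

    def transpose(p, i, j, k):
        """Cut segment p[i:j] and paste it before index k (i < j < k)."""
        return p[:i] + p[j:k] + p[i:j] + p[k:]

    def transposition_distance(p):
        """Exact minimum number of transpositions sorting p, by BFS over
        the symmetric group; feasible only for very small permutations."""
        goal = tuple(sorted(p))
        start = tuple(p)
        seen, frontier = {start}, deque([(start, 0)])
        while frontier:
            q, d = frontier.popleft()
            if q == goal:
                return d
            n = len(q)
            for i in range(n):
                for j in range(i + 1, n + 1):
                    for k in range(j + 1, n + 1):
                        r = transpose(q, i, j, k)
                        if r not in seen:
                            seen.add(r)
                            frontier.append((r, d + 1))

    print(transposition_distance((3, 1, 4, 2)))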

Joint work with Tzvika Hartman, Dept. of Molecular Genetics, Weizmann Institute of Science

Structure-based computational characterization of protein-protein interactions
Ora Schueler-Furman, Laboratory of David Baker, Dept of Biochemistry, University of Washington, Seattle
23rd December 2004
Protein-protein interactions play a key role in cellular processes. While the structure of a protein complex is an excellent starting point for the characterization of an interaction, only a few protein complexes have been solved, compared to the number of structures of single proteins. This calls for the development of docking methods that can predict the complex structure starting from the unbound proteins. We have developed a high-resolution docking method (ROSETTADOCK) that allows for side chain flexibility during its search for the optimal rigid-body orientation, and is thus able to account for a part of the conformational changes that occur upon binding. This method has been shown to perform remarkably well on a benchmark of protein complexes, as well as in the community-wide blind docking experiment (CAPRI).

Accurate models of protein complexes are indispensable for the structure-based design of new interfaces and prediction of binding affinities (specificity). For a set of well characterized protein families, we have compared the experimental relative binding affinity of different sequence pairs to the predicted relative affinity that was calculated from structural models. Binders could always be identified for a protein partner if structures of the individual proteins were available. However, when sequences are modeled onto the backbone of a structural template, the distinction of binders and non-binders is impaired, indicating that conformational changes in the protein backbone cannot be neglected.

Faster, low-resolution methods can be used for the structural characterization of protein-protein interactions on a large scale. I have previously demonstrated that evolutionarily conserved residues tend to cluster in space, and that this feature can be used to select native-like protein models from a set of structural models created by a de novo structure prediction protocol. The same measure can be applied to select native-like models of protein complexes from a set of models created by a docking protocol.

A Hitchhiker's Guide to Molecular Evolution
Guy Sella, Center for the Study of Rationality, Hebrew University
2nd December 2004
Perhaps the most basic evolutionary question to emerge from the genome projects concerns the causes of molecular evolution. What are the properties of the mutations that cause nucleotide and amino acid substitutions, and by what process do these substitutions take place? The current view attributes these changes at the molecular level to two distinct basic processes: adaptation, whereby a mutant is swept to fixation by virtue of its selective advantage, and nearly neutral evolution, where, by a much slower process, a mutation that is selectively neutral, or nearly neutral, drifts to fixation.

The work I will present, done in collaboration with Dmitri Petrov and Aaron Hirsh, suggests that a third process may account for as many changes at the molecular level as adaptation and drift, and perhaps more. When an adaptation sweeps to fixation, it brings along a significant number of sites in its genetic vicinity that do not have enough time to separate from the adaptation through recombination ("hitchhikers"). We have found that this process can account for a significant portion of the substitutions at the molecular level. Moreover, taking into account the rate at which adaptations occur and the size of the genetic region that hitchhikes, it appears that there would never be enough time for a nearly neutral mutant to drift all the way to fixation. In other words, it appears that the vast majority of molecular evolution consists of adaptations and hitchhikers that are driven to fixation by these adaptations.

In my talk I will review a few of the far-reaching implications of this discovery, which we are currently studying. A significant amount of the theory and analysis of genetic data is based on the premise that adaptation and nearly neutral evolution are two separate processes and that nearly neutral evolution is governed by genetic drift. One example is the method used to evaluate the portions of molecular changes attributed to adaptations and to nearly neutral substitutions. The new picture suggests that these methods need to be modified, and our results suggest that the changes are very significant. We also found that genetic hitchhiking allows the fixation of deleterious mutations of a surprisingly large selective effect. This observation changes our picture of the mutations that take part in the non-adaptive portion of molecular evolution. It also suggests a new explanation for one of the greatest mysteries of evolutionary biology: the evolution and maintenance of sex. Because the radius of hitchhiking becomes larger as the rate of recombination decreases, and because hitchhiking allows the fixation of deleterious alleles, an organism without recombination will suffer from a genetic load - a harmful dose of deleterious alleles - that is much greater than that of a sexual organism.

Ultra-Conservation in the Human Genome
Dr Gill Bejerano, University of California, Santa Cruz
4th November 2004
The sequence of functional DNA is often observed to be under selective pressure to maintain its function. Indeed, sequence conservation between diverged species is a hallmark of functional importance. However, nearly all types of functional sequences acquire some changes between diverged species, without harming their function. For example, homologous genes in the human and mouse genomes, thought to have diverged over 80 million years ago, are only 85% identical in sequence, on average.

And yet, in a systematic evaluation of the complete human, mouse and rat genomes, we find hundreds of syntenic regions that are perfectly conserved among all three. These unexpected, so-called ultra-conserved regions challenge our current understanding of the human genome. Moreover, the majority of these regions do not overlap protein-coding exons.
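
Given aligned genomes, finding such regions reduces to scanning for maximal runs of identical, ungapped alignment columns. A toy sketch (the actual study scans whole-genome alignments with a much longer length threshold):

    def perfectly_conserved_runs(alignment, min_len=200):
        """Scan a multiple alignment (equal-length strings, one per
        genome) for runs of columns identical in every species."""
        human, mouse, rat = alignment
        runs, start = [], None
        for i, cols in enumerate(zip(human, mouse, rat)):
            ok = cols[0] == cols[1] == cols[2] and cols[0] != "-"
            if ok and start is None:
                start = i
            elif not ok and start is not None:
                if i - start >= min_len:
                    runs.append((start, i))
                start = None
        if start is not None and len(human) - start >= min_len:
            runs.append((start, len(human)))
        return runs

    # Toy example with a short threshold:
    aln = ("ACGTACGTAAAC", "ACGTACGTTAAC", "ACGTACGTAAAC")
    print(perfectly_conserved_runs(aln, min_len=5))   # -> [(0, 8)]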

Initial analysis has shown these regions to be preferentially located in exons of genes involved in RNA processing, or in introns of, or near, genes involved in the regulation of transcription and development. And while nearly all of these regions are found in chicken, and most are also found in frog and jawed fish, virtually none could be traced using direct sequence similarity beyond the vertebrates.

Deciphering the functions, evolutionary origins and molecular mechanisms that maintain these regions poses exciting computational and experimental challenges. I will present the work that has led to this discovery, and to that of a broader collection of human putative functional elements, and discuss our current computational and experimental efforts to better understand this phenomenon.

Joint work with David Haussler, Mathieu Blanchette and Sofie Salama.

See http://www.soe.ucsc.edu/~jill/ for additional information.

Digital Oscillations of p53 in individual living cells
Galit Lahav, Department of Systems Biology, Harvard Medical School, Boston, MA
9th September 2004
A major goal of systems biology is to understand complex protein networks in cells. Great simplification would occur if networks could be broken down into basic recurring circuits, such as the recently defined 'network motifs'. We focused on a common network motif, in which a transcription factor is negatively regulated at the protein level by one of its downstream gene products. We employ the well-studied p53-Mdm2 feedback loop to experimentally study the dynamics of this motif in single living cells. We constructed human cell lines expressing functional p53-CFP and Mdm2-YFP fusions. Accurate measurements of protein levels, localization and interaction were obtained at high temporal resolution by fluorescence microscopy. We also studied the variability between the dynamics of individual cells, which cannot be seen in assays on cell populations. Detailed oscillatory kinetics following DNA damage were found, suggesting that the p53-Mdm2 feedback loop generates a novel 'digital' clock that releases well-timed quanta of p53 until damage is repaired or the cell dies. The results allowed the construction of a mathematical model that captures the behavior of this regulatory module. We discuss the design principles of biological feedback that were found in this and other systems.
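
The qualitative behavior of such a negative feedback loop can be illustrated with a minimal Goodwin-type ODE caricature, in which sustained pulses require sufficiently steep (or delayed) feedback. This is an illustrative sketch with invented parameters, not the authors' fitted model:

    from scipy.integrate import solve_ivp

    # p53 activates Mdm2 transcription; Mdm2 protein represses p53.
    # With a steep Hill term (n = 20) this three-stage loop has an
    # unstable steady state and settles into sustained oscillations;
    # weaker nonlinearity gives only damped ringing.
    a, b, c, d, e, f, n = 0.25, 0.5, 1.0, 0.5, 1.0, 0.5, 20

    def loop(t, u):
        p, m_rna, m = u                    # p53, Mdm2 mRNA, Mdm2 protein
        return [a / (1 + m**n) - b * p,    # Mdm2 represses p53 accumulation
                c * p - d * m_rna,         # p53 activates Mdm2 transcription
                e * m_rna - f * m]         # mRNA is translated; protein decays

    sol = solve_ivp(loop, (0, 400), [0.1, 0.0, 0.0], max_step=0.5)
    late = sol.y[0][sol.t > 200]           # discard the initial transient
    print("late-time p53 min/max:", late.min(), late.max())  # far apart -> pulses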

Joint work with Uri Alon, Nitzan Rosenfeld, Naama Geva-Zatorski, and Alex Sigal.

Generalizing basic information theory to take distance into account
Samuel Sattath, Department of Molecular Genetics and Biotechnology, Faculty of Medicine, The Hebrew University of Jerusalem
28th June 2004
Entropy is an effective measure of randomness among states that are implicitly assumed to have no material relations between themselves. Measuring randomness in contexts where the states do have an underlying, non-uniform distance structure requires a distance-sensitive measure, and there is no clear way of constructing one. Here we present a novel generalization of entropy that does take an ultrametric distance into account. To do so, we generalize information theory to accommodate distances, and show that this can be done while preserving the basic theorems, assuming the distances are ultrametric. The formal findings have implications for the foundations of information theory and might have applications in the numerous disciplines utilizing it. One such application is in protein sequence analysis, where distance-sensitive entropy can resolve a long-standing biological problem of measuring positional amino acid conservation.

Transcription factor binding sites: Selection forces caught in the action
Ron Shamir, School of Computer Science, Tel Aviv University
24th June 2004
The interaction between transcription factors and their DNA binding sites is key to understanding gene regulation. By performing a genome-wide study of the evolutionary dynamics in yeast promoters, we provide a first global view of the network of selection forces in the evolution of transcription factor binding sites. This analysis gives rise to new models for binding site activity, identifies families of related binding sites and characterizes the functional similarities among them.

We discover rich and highly optimized selective pressures operating inside and around these families. In several cases, this organization reveals that a single transcription factor has multiple functional modes. For example, Reb1 binding sites are subdivided into two distinct clusters separated by strong selective forces. This pattern reflects a dual mode of Reb1 operation and is corroborated by the existence of two auto-regulatory Reb1 sites, one from each motif cluster. To study the mechanisms shaping the evolutionary patterns and functional roles of binding sites, we have analyzed direct binding site affinity (Kd) for 100 variants of the Leu3 consensus site. We discovered a striking grouping of sites into high- and low-affinity clusters, separated by significant selective forces. We conclude that the dual functional mode of Leu3 is mechanistically induced by distinct site affinities.

Our global network of selective pressures between TF binding site motifs greatly extends the scope of existing methodologies for comparative analysis of non-coding regions. It underlines the synergistic power of integrated evolutionary and functional genomics approaches.

In this talk I will focus mainly on the biological results. No computational background is assumed.

Joint work with Amos Tanay and Irit Gat-Viks

Dense Genotyping in Linkage Analysis
Benjamin Yakir, Department of Statistics, The Hebrew University
22nd June 2004
New technologies are emerging that promise to allow economical genotyping of individuals over a dense set of genetic markers. Originally, these technologies were designed for population-based genetic association studies. However, some merit may be found in applying them in family-based linkage studies.

In this informal talk we will share some thoughts about the potential impact of these methodologies on linkage analysis, discuss the need for appropriate statistical methodology and computational tools, present some preliminary results, and, hopefully, initiate some discussion and collaborations.

Improved Algorithms for the Random Cluster Graph Model
Dekel Tsur, University of Haifa
15th June 2004
We model noisy clustering data using random graphs: clusters correspond to disjoint sets of vertices, and two vertices from the same set (resp., different sets) share an edge with probability p (resp., r), where p > r. We give algorithms that reconstruct the clusters from the graph with high probability. Compared to previous studies, our algorithms have lower time complexity and apply under a wider range of parameters.
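
Generating an instance of this model takes only a few lines, which also makes the algorithmic task concrete (an illustrative sketch, with p and r as above):

    import numpy as np

    def random_cluster_graph(sizes, p, r, seed=0):
        """Adjacency matrix of the random cluster graph model: same-cluster
        vertex pairs joined with probability p, other pairs with r < p."""
        n = sum(sizes)
        labels = np.repeat(np.arange(len(sizes)), sizes)
        rng = np.random.default_rng(seed)
        same = labels[:, None] == labels[None, :]
        prob = np.where(same, p, r)
        upper = np.triu(rng.random((n, n)) < prob, k=1)
        return (upper | upper.T).astype(int), labels

    A, labels = random_cluster_graph([30, 30, 40], p=0.7, r=0.1)
    # A clustering algorithm is given A and must recover `labels`
    # with high probability.
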
What's important in life: Connections between measures of genome evolution and biological quantities
Eugene V. Koonin, National Center for Biotechnology Information, NLM, NIH, Bethesda, MD, USA
27th May 2004
In the last few years, we witnessed the advent of multiple complete genome sequences and, in parallel, of genome-wide data on biological properties of gene ensembles, such as knockout effect, expression levels, protein-protein interactions, and more. Inevitably, numerous attempts were made by several groups, including ours, to examine connections between these properties and quantitative measures of gene evolution. The questions asked are quite fundamental and interesting, such as: What determines the effect of the knockout of a given gene on the phenotype (in particular, is it essential or not) and the rate of a gene's evolution? And how are the phenotypic effect and evolutionary rate connected? And more questions like that.

The best succinct description of the results of the genome-wide studies addressing these questions goes like this: it is a mess. Many tantalizing correlations have been detected and made into a big deal (or not). For example, positive correlations were detected between the tendency of a gene to be lost during evolution and its sequence evolution rate, and negative correlations were noticed between each of the above measures of evolutionary variability and phenotypic effect or expression level. Simply put, genes associated with a major phenotypic effect tend to be highly expressed, form many protein-protein interactions and evolve slowly, however that is measured. These connections seem to meet the expectations based on common sense. What is quite unexpected, however, is that, although many of them are statistically highly significant (mostly thanks to the large number of data points), they are typically rather weak. Linear correlation coefficients in the range of 0.1-0.2, explaining only a small fraction of the scatter in the data, are typical. This makes detecting real causal relationships a serious challenge.

In this talk, I will give an overview of the reported genomic correlations; present in some detail a few examples explored in our group; describe some preliminary attempts to separate the wheat from the chaff using multivariate statistical analysis; and make an attempt, also very preliminary, to outline a generalized approach. I will introduce the notion of the "social status" (or, simply, "importance") of a gene in the genome-wide community and classify the variables with respect to their positive or negative correlation with this status. I will also argue that, despite the complexity of the emerging web of relationships, there is a clearly discernible pattern, and deviations from it point either to problems with the data or to biologically interesting phenomena.
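
The statistical point, highly significant yet weak correlations, is easy to reproduce on synthetic data of genome-scale size:

    import numpy as np
    from scipy.stats import pearsonr

    # With thousands of data points, even a correlation explaining ~2% of
    # the variance is overwhelmingly "significant" (synthetic illustration).
    rng = np.random.default_rng(5)
    n = 5000                                  # roughly genome-scale
    x = rng.normal(size=n)                    # e.g. an expression-level proxy
    y = 0.15 * x + rng.normal(size=n)         # e.g. an evolutionary-rate proxy

    r, pval = pearsonr(x, y)
    print(f"r = {r:.2f}, r^2 = {r*r:.3f}, p = {pval:.1e}")
    # Typical output: r near 0.15, r^2 near 0.02, p far below 1e-10.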

Joint work with Yuri Wolf, with thanks to Liran Carmel, I. King Jordan, Fyodor Kondrashov, Dmitry Krylov, and Igor Rogozin.

Genetic Analysis of the Claudin Family of Tight Junction Proteins
Tamar Ben-Yosef, Department of Genetics, the Bruce Rappaport Faculty of Medicine, Technion
20th May 2004
Tight junctions (TJs) are circumferential strands around cells that selectively modulate paracellular permeability between extracellular compartments and are also involved in maintaining cellular polarity. Recent studies suggest that TJs also play a role in the regulation of cell growth and differentiation, and may make a significant contribution to human health and disease, including injury repair, immune response and cancer. The main components of TJ strands are the more than 20 members of the claudin family. Many tissues express several different claudins, which can interact with each other in both homotypic and heterotypic manners. Mutations of different claudins are associated with various phenotypes in both humans and animal models. For example, recessive mutations of the human gene encoding claudin 14 cause profound congenital deafness, DFNB29, thus demonstrating the importance of claudin 14 for hearing. To study the pathophysiology associated with CLDN14 mutations, I generated a mouse model by targeted deletion of the Cldn14 gene. Cldn14-null homozygous mice have profound hearing loss, due to cochlear hair cell degeneration during the first weeks of life.

Cldn14 is highly expressed in inner ear, kidney and liver. In these tissues other claudin genes are expressed as well. Yet, while claudin 14 does not seem to be necessary for normal kidney and liver functions, it is indispensable in the inner ear. Currently I am in the process of generating Cldn14/Cldn11 double mutants. Such mutants are the first step toward better understanding of the interactions between different claudins and their contributions to various physiological systems.

The Role of Conserved Sequence Motifs in Protein Interactions
Doron Betel, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto and Department of Biochemistry, University of Toronto
22nd April 2004
A growing body of research has concentrated on the definition and identification of conserved sequence motifs. It is widely recognized that these conserved sequence and structural domains mediate protein functions and interactions. Domains that facilitate similar or related functions may appear in interacting proteins such as those that constitute a protein complex. This talk presents a new method for identification of statistically significant domain pairs that co-occur in protein complexes. The method was used to analyze four datasets containing yeast protein complexes from curated and high-throughput experimental sources. The resulting correlations are represented as domain networks that form the basis of comparison between the datasets, as well as to binary protein interactions. The results show that the curated datasets produce domain networks that map to known biological assemblies such as the ribosome, RNA polymerase, and transcription factors. In contrast, the high-throughput datasets contain large networks of domains with high connectivity to RNA processing domains. This work has recently been extended to identify potential short binding motifs that mediate domain interactions by a combination of structural and experimental protein interactions.
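
The core significance computation can be illustrated with a standard one-sided Fisher's exact test on domain co-occurrence counts (the talk's exact statistic may differ; the domain names below are invented):

    from scipy.stats import fisher_exact

    def domain_pair_significance(complexes, dom_a, dom_b):
        """One-sided Fisher's exact test for whether two domains co-occur
        in protein complexes more often than expected by chance."""
        a_and_b = a_only = b_only = neither = 0
        for domains in complexes:               # each complex: a set of domains
            has_a, has_b = dom_a in domains, dom_b in domains
            if has_a and has_b:
                a_and_b += 1
            elif has_a:
                a_only += 1
            elif has_b:
                b_only += 1
            else:
                neither += 1
        table = [[a_and_b, a_only], [b_only, neither]]
        return fisher_exact(table, alternative="greater")

    complexes = [{"SH3", "PolyPro"}, {"SH3", "PolyPro", "Kinase"},
                 {"Kinase"}, {"WD40"}, {"SH3", "PolyPro"}, {"WD40", "Kinase"}]
    print(domain_pair_significance(complexes, "SH3", "PolyPro"))
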
A Simpler and Faster 1.5-Approximation Algorithm for Sorting by Transpositions
Tzvika Hartman, Department of Computer Science and Applied Mathematics, Weizmann Institute
22nd January 2004
An important problem in genome rearrangements is sorting permutations by transpositions. Its complexity is still open, and two rather complicated 1.5-approximation algorithms for sorting linear permutations are known (Bafna and Pevzner, 98 and Christie, 99). The fastest known algorithm is the quadratic algorithm of Bafna and Pevzner. In this talk, we observe that the problem of sorting circular permutations by transpositions is equivalent to the problem of sorting linear permutations by transpositions. Hence, all algorithms for sorting linear permutations by transpositions can be used to sort circular permutations. Our main result is a new O(n^(3/2) sqrt(log n)) 1.5-approximation algorithm, which is considerably simpler than the previous ones, and achieves better running time. Moreover, the analysis of the algorithm is significantly less involved, and provides a good starting point for studying related open problems.

Joint work with Ron Shamir.

Prediction of Interaction Sites from Sequence
Yanay Ofran, Department of Biochemistry and Molecular Biophysics, Columbia University
1st January 2004
All biological processes are realized through interactions between proteins. Hence, to understand and be able to manipulate biological systems, one needs to decipher the mechanisms of protein interactions. In recent years, researchers have introduced various high-throughput methods, both experimental and computational, for the identification of interacting pairs of proteins. However, this is only a first step: to fully understand these interactions, and to be able to enhance or inhibit them, one needs to identify the interaction sites on the surface of each of the interacting proteins. Based on an analysis of a large dataset of protein interactions, we were able to characterize protein-protein interfaces in great detail. The results of this analysis enabled us to devise a method for the prediction of protein-protein interaction sites from sequence. We demonstrate the power of this method by mutating some of our predicted interacting residues, and showing how these mutations hampered, and sometimes even completely abrogated, the interactions.

Modeling the Cellular Response to Metabolic Perturbations
Daniel Segrè, Department of Genetics, Harvard Medical School
4th December 2003
We introduce a new computational approach for predicting the metabolic response to genetic and environmental perturbations in a cell. Our method is based on flux balance analysis (FBA), which computes whole-cell steady state metabolic fluxes associated with optimal performance (e.g. maximal growth, in bacteria). Here we suggest that the immediate cellular response to a metabolic gene deletion deviates from optimality, and that the phenotype of a perturbed cell is better approximated by a minimization of metabolic adjustment (MOMA) with respect to the unperturbed state (wild type), rather than by maximization of growth rate. Comparing MOMA and FBA predictions to experimental flux data for E. coli mutants, we find that MOMA displays a significantly higher correlation than FBA. Our method can therefore be used for predicting the suboptimal phenotype of perturbed metabolic networks. Despite the increased complexity of eukaryotic cells, MOMA also provides new insight in a large-scale S. cerevisiae mutant analysis. This approach can help understand the regulatory and evolutionary adaptation of metabolism, and may be generalized to multicellular systems, such as bacterial ecosystems and eukaryotic tissues.
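
The contrast between the two predictions can be written down directly: FBA is a linear program maximizing biomass flux, while MOMA is a quadratic program minimizing the distance to the wild-type flux distribution under the same steady-state constraints. A toy sketch with an invented three-metabolite network (illustrative numbers only, not a real model):

    import numpy as np
    from scipy.optimize import linprog, minimize

    # Stoichiometric matrix S (metabolites x reactions); reaction 4 is "biomass".
    S = np.array([[ 1, -1, -1,  0,  0],
                  [ 0,  1,  0, -1,  0],
                  [ 0,  0,  1,  1, -1]], dtype=float)
    n_rxn = S.shape[1]
    bounds = [(0, 10)] * n_rxn

    # FBA: maximize biomass flux v[4] subject to steady state S v = 0.
    c = np.zeros(n_rxn)
    c[4] = -1.0                                   # linprog minimizes
    wt = linprog(c, A_eq=S, b_eq=np.zeros(3), bounds=bounds)
    v_wt = wt.x

    # Knock out reaction 1, then MOMA: minimize ||v - v_wt||^2 under the
    # same steady-state constraint and the knockout bound.
    ko_bounds = list(bounds)
    ko_bounds[1] = (0, 0)
    moma = minimize(lambda v: np.sum((v - v_wt) ** 2), x0=np.zeros(n_rxn),
                    constraints=[{"type": "eq", "fun": lambda v: S @ v}],
                    bounds=ko_bounds, method="SLSQP")
    print("wild-type biomass:", v_wt[4], " MOMA mutant biomass:", moma.x[4])
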
An Eigendecomposition Method for the Analysis of RNA Structure, Biomolecular Simulations, and Numerical Geometry of Images for Use in Computational Biology
Danny Barash
27th November 2003
The ability of certain RNA molecules to perform as conformational switches, alternating between two states, has been known for some time to participate in a variety of biological processes. Recently, small RNA constructs called riboswitches, which regulate gene expression without the participation of proteins, have been discovered. In bacteria, they control transcription termination and translation initiation by switching between two states. Their secondary structure is indicative of their function.

RNA secondary structure can be represented as tree-graphs at several levels: at the level of paired and unpaired bases, which can be advantageous for motif finding and structure alignment, and in a coarse-grained representation at the level of loops, bulges, and stems. The latter makes it possible to predict mutations that will cause RNA conformational rearrangements, by using properties of the algebraic connectivity of trees (Fiedler, 1973; Merris, 1987) combined with folding predictions by dynamic programming. A tree-graph corresponding to RNA secondary structure can be represented as a Laplacian matrix, and the second eigenvalue of the Laplacian matrix is a measure of the tree-graph's compactness. This makes it possible to quantify structural changes between the wildtype fold and the mutant fold, thereby predicting which mutations in the wildtype sequence will disrupt stable RNA constructs. I will present a recipe for predicting deleterious mutations, with examples given on riboswitches, by the generation of eigenvalue tables.
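
The compactness measure is simply the Fiedler value, the second-smallest eigenvalue of the graph Laplacian. A short sketch on two toy tree shapes standing in for coarse-grained folds:

    import numpy as np

    def algebraic_connectivity(edges, n):
        """Second-smallest eigenvalue of the Laplacian of a tree-graph,
        the compactness measure compared between wildtype and mutant folds."""
        L = np.zeros((n, n))
        for i, j in edges:
            L[i, i] += 1
            L[j, j] += 1
            L[i, j] -= 1
            L[j, i] -= 1
        return np.linalg.eigvalsh(L)[1]   # eigenvalues sorted ascending

    # Coarse-grained secondary structures as tree-graphs (nodes = loops,
    # bulges, stems).  A path-like fold is less compact than a star-like one:
    path = [(0, 1), (1, 2), (2, 3), (3, 4)]
    star = [(0, 1), (0, 2), (0, 3), (0, 4)]
    print(algebraic_connectivity(path, 5))   # ~0.38
    print(algebraic_connectivity(star, 5))   # 1.0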

Finally, two additional topics will be mentioned: biomolecular simulations using the numerical solution of the equations of motion, and the nonlinear diffusion of digital images with extension to non-parametric clustering methods for biological data.

Reconstructing Regulation from Gene Expression Patterns
Nir Friedman, Hebrew University
1st April 2003
Microarray methods allow us to simultaneously measure the mRNA expression level for thousands of genes. It is clear that such measurements contain information about many different aspects of gene regulation and function. Thus, a major challenge is how to extract new biological understanding from this wealth of data. In particular, can we understand the regulatory mechanisms that control the expression of genes?

In this talk we will overview an ongoing research project that attempts to learn about regulatory mechanisms from gene expression data and other high-throughput biological data. This project builds on the representation language of probabilistic graphical models and methods for learning them from data. I will start with a brief background on the relevant biology, and a short survey of the challenges in the field. Then, I will describe several recent results on learning regulatory information from large-scale microarray studies, and how these results are validated against the known literature and through validation experiments.

Intensity-based statistical scorer for tandem mass spectrometry
Moshe Havilio, Compugen Ltd
23rd January 2003
We describe a new statistical scorer for tandem mass spectrometry. The scorer is based on the probability that fragments with given chemical properties create measured intensity levels in the experimental spectrum. The score's parameters are computed using a fully automated procedure. Benchmarking the new scorer on a large set of experimental spectra, we show that it performs significantly better than the widely used cross-correlation scoring algorithm of Eng et al.

Joint work with Yariv Haddad and Zeev Smilansky.

A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Cost Matrices
Michal Ziv-Ukelson, Haifa University
16th January 2003
The classical algorithm for computing the similarity between two sequences uses a dynamic programming matrix, and compares two strings of size n in O(n^2) time. We address the challenge of computing the similarity of two strings in sub-quadratic time, for metrics which use a scoring matrix of unrestricted weights. Our algorithm applies to both local and global similarity computations.

The speed-up is achieved by dividing the dynamic programming matrix into variable-sized blocks, as induced by Lempel-Ziv parsing of both strings, and by utilizing the inherent periodic nature of the strings. This leads to an O(n^2/log n) algorithm for an input of constant alphabet size. For most texts, the time complexity is actually O(h*n^2/log n), where h <= 1 is the entropy of the text.
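
For reference, the classical quadratic-time baseline being accelerated, global similarity under an arbitrary scoring matrix, looks like this (a textbook sketch, not the talk's algorithm):

    def global_similarity(s, t, score, gap):
        """Classical O(n^2) dynamic programming for global similarity
        with an unrestricted scoring matrix and linear gap costs."""
        n, m = len(s), len(t)
        D = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            D[i][0] = D[i - 1][0] + gap
        for j in range(1, m + 1):
            D[0][j] = D[0][j - 1] + gap
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i][j] = max(D[i - 1][j - 1] + score[s[i - 1]][t[j - 1]],
                              D[i - 1][j] + gap,
                              D[i][j - 1] + gap)
        return D[n][m]

    score = {a: {b: (2 if a == b else -1) for b in "ACGT"} for a in "ACGT"}
    print(global_similarity("ACGTACGT", "ACGGACT", score, gap=-2))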

Joint work with Gad M. Landau and Maxime Crochemore.

Algorithms for Nucleic Acid Folding and Hybridization
Michael Zuker, Rensselaer Polytechnic Institute, Troy, New York
9th January 2003
No abstract available.

Large Scale Reconstruction of Haplotypes from Genotype Data using Imperfect Phylogeny
Eleazar Eskin, Hebrew University
21st November 2002
Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs), which are mutations at a single nucleotide position. To characterize an individual's variation, we must determine an individual's haplotype, i.e., which nucleotide base occurs at each of these common SNP positions on each chromosome.

In this project, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes, which shows that SNPs are organized in highly correlated "blocks". The majority of individuals have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks, and for each block we predict the common haplotypes and each individual's haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (0.47%) when taking into account the predictions for the uncommon haplotypes. Our method is extremely efficient compared to previous methods (a matter of seconds where previous methods needed hours). Its efficiency allows us to find the block partition of the haplotypes, to cope with missing data, and to work with large data sets such as genotypes of thousands of SNPs for hundreds of individuals.

The algorithm is available via webserver at http://www.cs.columbia.edu/compbio/hap/.

Joint work with Eran Halperin and Richard Karp of UC Berkeley.

Multiple Timescale Methods in Biomolecular Simulations
Danny Barash, Department of Chemistry and the Courant Institute of Mathematical Sciences, New York University
24th October 2002
Molecular dynamics (MD) simulations of biomolecules play an important role in elucidating essential functional properties in key biological processes. Because of the crucial link between structure and function, structural information derived from experimental studies can be refined and interpreted at atomic resolution using all-atom computer simulations. I will present a novel multiple-timestep (MTS) integrator, designed to speed up the calculation in modeling protein/DNA complexes by marching with large timesteps, and describe its advantages and limitations.
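
The general idea of impulse multiple-timestep integration can be sketched generically: apply the slowly varying force as a "kick" at the outer timestep, and integrate the stiff fast force with small inner steps. An r-RESPA-style sketch on a toy one-particle system (not the speaker's integrator):

    import numpy as np

    def respa_step(x, v, m, f_fast, f_slow, dt_outer, n_inner):
        """One impulse multiple-timestep step: outer half-kicks from the
        slow force bracket an inner velocity-Verlet loop for the fast force."""
        dt = dt_outer / n_inner
        v = v + 0.5 * dt_outer * f_slow(x) / m    # outer half-kick
        for _ in range(n_inner):                  # inner velocity Verlet
            v = v + 0.5 * dt * f_fast(x) / m
            x = x + dt * v
            v = v + 0.5 * dt * f_fast(x) / m
        v = v + 0.5 * dt_outer * f_slow(x) / m    # outer half-kick
        return x, v

    # Toy system: stiff "bond" (fast) plus a weak external spring (slow).
    f_fast = lambda x: -100.0 * x
    f_slow = lambda x: -0.1 * x
    x, v = np.array([1.0]), np.array([0.0])
    for _ in range(1000):
        x, v = respa_step(x, v, 1.0, f_fast, f_slow, dt_outer=0.1, n_inner=10)
    print(x, v)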

Next, I will mention how MTS methods may be used to target specific scales within a digital image. I will show how marching with large timesteps in the nonlinear diffusion equation where the variable is an image leads to nonlinear filtering approaches, culminating with bilateral and mean-shift filtering.

Finally, I will present examples and future applications in structural biology for these methods.

Joint work with Tamar Schlick, Hin-Hark Gan, Moshe Israeli, Ron Kimmel and other collaborators.

Analyzing time series expression data: From individual gene expression to genetic regulatory networks
Ziv Bar-Joseph, Laboratory for Computer Science, Massachusetts Institute of Technology
14th October 2002
Recent advances in DNA microarray technologies are enabling researchers to measure the expression levels of thousands of genes simultaneously. Time series expression data offer a particularly rich opportunity for understanding the dynamics of biological processes. These, and other high-throughput data sources, hold the promise of revolutionizing molecular biology by providing a large-scale view of genetic regulatory networks. Unfortunately, these high-throughput sources are often noisy and contain many missing data points. Time series expression data introduce additional complications, including sampling rate differences between experiments and variations in the timing of biological processes. Thus, in addition to providing specific solutions for individual data sources, an important computational goal is to integrate information from multiple data sources, so that each type of data can compensate for the limitations of the others.

In this talk I will present algorithms for two different levels of time series expression data analysis. For the individual gene level, we present algorithms that permit the principled estimation of unobserved time-points, clustering, dataset alignment and the determination of differentially expressed genes. For the network (or data fusion) level, we introduce a new algorithm for efficiently combining complementary large-scale expression and transcription factor protein-DNA binding data to discover co-regulated modules of genes. The discovered modules are used to build a regulatory network of transcription factors and modules. We also use modules to label transcription factors as activators or repressors and identify several patterns of combinatorial regulation.

Finally, we present an algorithm that combines these methods to automatically infer genetic regulatory sub-networks for specific biological processes. In particular, we show how this algorithm accurately reconstructs key elements of the cell cycle sub-network in yeast.

Finding Motifs in the Twilight Zone
Uri Keich, Department of Computer Science and Engineering, UCSD
9th June 2002 (cancelled)
Gene activity is often affected by the binding of transcription factors to short fragments in DNA sequences, called motifs. Identification of subtle regulatory motifs in a DNA sequence is a difficult pattern recognition problem.

In our talk we will briefly review the computational problem of motif finding and introduce a new motif finding algorithm that can detect very subtle motifs: relying on a generalization of positional profiles to multiprofiles, our MULTIPROFILER algorithm outperforms other leading motif finding algorithms in a number of synthetic models. Moreover, it can be shown that in some previously studied motif models, MULTIPROFILER is capable of pushing the performance envelope to its theoretical limits.

Joint work with Pavel A. Pevzner.

Using Haplotype Trees to Examine Genotype-phenotype Associations
Alan R. Templeton, Washington University, St. Louis, MO, USA, currently visiting the Bruce Rappaport Faculty of Medicine
19th March 2002
In complex diseases, such as coronary artery disease, the relationship of genotype to phenotype is not one-to-one, but rather can vary as a function of interactions with other genes and environments. Even when a known gene affects the phenotypic variation of interest, it explains only a small portion of the total phenotypic variation. Accordingly, in searching for associations between genetic variation at candidate loci for coronary artery disease, it is critical to augment the signal over the noise as much as possible. One method for increasing the genetic signal to noise ratio is to use evolutionary history. Mutations at a candidate gene occur in the context of prior mutations. At its time of creation, a new mutation will exist on a specific haplotype background containing some, but not all, of the mutational variants that had occurred earlier. If recombination is rare or is concentrated into hotspots, large segments of the candidate gene sequence will accumulate mutational change in a manner that reflects their evolutionary origin. Any mutation that causes a phenotypic change will therefore be embedded in this evolutionary historical structure. It will be shown how this evolutionary history can be used to increase the power to detect associations between DNA variants and phenotypes. Two basic methods will be outlined: nested clade analysis and tree scanning.

Evolution of Protein Function
Dr Tal Pupko, The Institute of Statistical Mathematics, Tokyo, Japan
14th February 2002
Molecular sequencing efforts have resulted in an exponential growth of data in the public domain. This deluge has allowed us, for the first time, to address biological questions that were previously unapproachable. In my talk I will show some examples of how to utilize this bioinformation for studying the structure and function of genomes and proteins.

I will present three topics:

  1. A new maximum-likelihood based approach for identifying functional regions on a protein surface.
  2. Detection of primate-specific adaptive evolution of mitochondrial genomes.
  3. The evolution of the lysozyme c protein in ruminants.

These research topics required the development of new tools, such as reconstruction of ancestral sequences. These algorithmic aspects will also be discussed.

Modeling Biological Processes Using Workflow and Petri Net Models
Mor Peleg, Stanford Medical Informatics
6th January 2002
With the increasing volume of genomic data available, it has become clear that biologists need computational methods for data organization and analysis. To develop computer applications that aid in this task, we first need a knowledge model that can represent biological systems. In this talk, I will present a knowledge-based environment for organizing biological data into computer-interpretable and human-browsable formats, which I developed during my post-doc at Stanford. This model bridges the gap between high-level physiological processes and molecular-level functions. My model enables verification of safety and soundness, which, in the context of biological systems, may aid in predicting system behavior in the presence of dysfunctional processes or structural components. The model also supports queries that can assist in discovering relationships among processes and the structural components that participate in them.

I tested the knowledge model by representing the process of host cell invasion by malaria parasites. I assessed thirteen distinct process models that were developed in different fields with respect to their appropriateness for representing biological systems. In developing my framework, I combined the best aspects of two of these models: (1) Transparent Access To Multiple Biological Information Sources (TAMBIS), a biological concept model, and (2) a workflow model that can represent the ordering of processes, the structural components that participate in them, and the roles that they play. The workflow model maps to Petri Nets, allowing verification of properties such as boundedness and soundness, and determination of reachability. I composed queries that can aid in discovering relationships among processes and structural components, and used reachability analysis to answer queries that relate to dynamic aspects of the model.
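
The payoff of the Petri-net mapping is that dynamic questions become token-game computations; for bounded nets, reachability can be enumerated by breadth-first search. A toy sketch with an invented two-transition binding net (not taken from the talk's malaria model):

    from collections import deque

    # A Petri net as pre/post incidence dicts: tokens each transition
    # consumes and produces per place.
    pre  = {"bind":    {"ligand": 1, "receptor": 1},
            "release": {"complex": 1}}
    post = {"bind":    {"complex": 1},
            "release": {"ligand": 1, "receptor": 1}}
    places = ["ligand", "receptor", "complex"]

    def enabled(marking, t):
        return all(marking[p] >= k for p, k in pre[t].items())

    def fire(marking, t):
        m = dict(marking)
        for p, k in pre[t].items():
            m[p] -= k
        for p, k in post[t].items():
            m[p] = m.get(p, 0) + k
        return m

    def reachable(m0):
        """Breadth-first enumeration of the reachability set (finite here,
        since tokens are conserved)."""
        key = lambda m: tuple(m[p] for p in places)
        seen, queue = {key(m0)}, deque([m0])
        while queue:
            m = queue.popleft()
            for t in pre:
                if enabled(m, t):
                    m2 = fire(m, t)
                    if key(m2) not in seen:
                        seen.add(key(m2))
                        queue.append(m2)
        return seen

    print(reachable({"ligand": 2, "receptor": 1, "complex": 0}))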

Identifying regulatory networks by combinatorial analysis of promoter elements
Tzachi Pilpel, Department of Genetics, Harvard University Medical School
20th December 2001
Several computational methods based on microarray data are currently used to study genome-wide transcriptional regulation. Few studies, however, address the combinatorial nature of transcription, a well-established phenomenon in eukaryotes. Here we describe a new approach using microarray data to uncover novel functional motif combinations in the promoters of Saccharomyces cerevisiae. In addition to identifying novel motif combinations that affect expression patterns during the cell cycle, sporulation and various stress responses, we observed regulatory cross-talk among several of these processes. We have also generated motif-association maps that provide a global view of transcription networks. The maps are highly connected, suggesting that a small number of transcription factors are responsible for a complex set of expression patterns in diverse conditions. This approach may be useful for modeling transcriptional regulatory networks in more complex eukaryotes.

Exploiting Redundancy in the Genetic Code
Steven Skiena, Department of Computer Science, State University of New York, Stony Brook (Visiting Scholar at the Haifa University and the Technion)
29th November 2001
Genes are DNA sequences which describe how to build proteins. DNA sequences are strings of four bases (A,C,G,T), while proteins are strings made of 20 different amino acids. The triplet code maps the 4^3 = 64 different sequences of three consecutive bases (codons) to one of the 20 different residues. With more codons than residues, there is an inherent redundancy in this encoding.
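
The size of this redundancy multiplies quickly along a protein: the number of synonymous DNA encodings is the product of the codon degeneracies of its residues. A short sketch (the example peptide is arbitrary):

    # Degeneracy of the standard genetic code: number of synonymous codons
    # per amino acid (20 residues, 61 sense codons in total).
    DEGENERACY = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2,
                  "I": 3, "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2,
                  "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2}

    def num_encodings(protein: str) -> int:
        """Number of distinct DNA sequences encoding a protein -- the size
        of the search space that encoding-optimization algorithms explore."""
        n = 1
        for residue in protein:
            n *= DEGENERACY[residue]
        return n

    print(num_encodings("MSKGEELFT"))   # even 9 residues: 9216 encodings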

In this talk, which requires no background in biology, we study two distinct ways to exploit this redundancy:

(1) Optimizing RNA Secondary Structure for a Given Protein
RNA is the intermediary step between gene and protein. Each RNA sequence folds to a characteristic shape. Through extensive computational experiments, we have established that microbial genes disproportionately favor both high and low energy structures over random encodings.
Motivated by this finding, we present algorithms to find minimum and maximum energy RNA encodings over the space of possibilities for a given protein. Our designed sequences exhibit far more extreme energies than those built by evolution.

(2) Designing Better Bacteriophages
Bacteriophages are viruses which kill bacteria by injecting DNA. There is increasing interest in medical applications of phages, motivated by the growing problem of antibiotic resistance.
We propose increasing the effectiveness of a given therapeutic phage by engineering it to avoid a primary bacterial defense mechanism. Restriction enzymes cut unmethylated DNA at specific patterns. Bacteria race to deploy these enzymes to cut the injected phage DNA into fragments before it can become active.
We exploit the redundancy of the triplet code to minimize the number of possible cut patterns in a given phage and have developed algorithms to find the optimal encoding. Our computational results are very encouraging - we demonstrate that there is sufficient flexibility to remove the overwhelming majority of restriction sites.
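
A greedy sketch of the idea (not the optimal algorithm developed with Cohen): walk over each occurrence of a restriction pattern and recode one of the overlapping codons synonymously, so the protein is untouched while the site disappears. EcoRI's GAATTC is used as the example pattern; Biopython supplies the codon table.

    from Bio.Data.CodonTable import standard_dna_table
    from Bio.Seq import Seq

    SYN = {}
    for codon, aa in standard_dna_table.forward_table.items():
        SYN.setdefault(aa, []).append(codon)

    def remove_sites(cds, site):
        """Greedily recode codons overlapping `site`; assumes an in-frame
        coding sequence with no internal stop codons."""
        codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
        while True:
            seq = "".join(codons)
            pos = seq.find(site)
            if pos == -1:
                return seq                       # no sites left
            fixed = False
            first, last = pos // 3, (pos + len(site) - 1) // 3
            for ci in range(first, last + 1):    # codons overlapping the site
                aa = standard_dna_table.forward_table[codons[ci]]
                for alt in SYN[aa]:
                    trial = codons[:ci] + [alt] + codons[ci + 1:]
                    if "".join(trial).count(site) < seq.count(site):
                        codons, fixed = trial, True
                        break
                if fixed:
                    break
            if not fixed:
                return seq                       # this site cannot be recoded

    cds = "GAATTCCTGGAATTC"                      # two in-frame EcoRI sites
    out = remove_sites(cds, "GAATTC")
    assert str(Seq(out).translate()) == str(Seq(cds).translate())
    print(out, out.count("GAATTC"))              # both sites removed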

(Joint work with Barry Cohen.)

BIND - The Biomolecular Interaction Network Database
Doron Betel, Samuel Lunenfeld Research Institute, Mount Sinai Hospital, Toronto and Department of Biochemistry, University of Toronto, Canada
25th October 2001
As proteomics technologies such as DNA microarrays, mass spectrometry and yeast two-hybrid screens become more sensitive and robust, they are becoming more automated and high-throughput. These experimental systems are currently providing a wealth of data on gene expression, molecular interactions and post-translational protein modifications. The Biomolecular Interaction Network Database (BIND - http://bind.ca) has been designed to store details about molecular interactions, complexes and pathways, and thus captures proteomics data in a computer-readable format.

Chemical reactions, photochemical activation and conformational changes can be described down to the atomic level of detail. Everything from small-molecule biochemistry to signal transduction is abstracted in such a way that graph theory methods may be applied for data mining. The database can be used to study networks of interactions, to map pathways across taxonomic branches and to generate information for full pathway kinetic simulations. Recently, BIND has been used to manage large protein-protein and genetic interaction networks in Saccharomyces cerevisiae and to automatically discover new knowledge residing in them.
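
As a taste of that graph-theoretic view, a small sketch (assuming the networkx package; the interaction pairs are illustrative, not actual BIND records):

    import networkx as nx

    # Hypothetical interaction records, as (molecule A, molecule B) pairs.
    interactions = [("Cdc28", "Cln2"), ("Cdc28", "Clb5"), ("Cdc28", "Sic1"),
                    ("Cln2", "Far1"), ("Clb5", "Sic1"), ("Ste5", "Fus3")]
    G = nx.Graph(interactions)

    # Degree is the simplest mining handle: the hubs of the network first.
    print(sorted(G.degree, key=lambda kv: -kv[1]))

    # A k-core keeps only nodes with >= k neighbours inside the core -- one
    # crude stand-in for pulling dense complexes out of interaction data.
    print(sorted(nx.k_core(G, k=2).nodes))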

Currently available tools for BIND include a visual navigation Java applet and a BLAST-against-BIND service, both part of the web-based interface. BIND is an open community effort. All BIND records are in the public domain, and the source code for the project is made freely available under the GNU General Public License.

Joint work with Gary D. Bader, Ian Donaldson, Cheryl Wolting, B.F. Francis Ouellette, Tony Pawson, and Christopher W.V. Hogue.

Multiple Maxima of Likelihood in Phylogenetic Trees: An Analytic Approach
Benny Chor, Computer Science Faculty, Technion
21st November 2000
Maximum likelihood (ML) is a widely used criterion for selecting optimal evolutionary trees. However, the nature of the likelihood surface for trees is still not sufficiently understood, especially with respect to the frequency of multiple optima. We initiate an analytic study for identifying sequences that generate multiple optima. We concentrate on the problem of optimizing edge weights for a given tree or trees (as opposed to searching through the space of all trees). We report large families of sequences that have multiple optima, including sequences with a continuum of optimal points.

Such data sets are best supported by two or more different phylogenies that vary significantly in their timings of evolutionary events. Our results imply that hill-climbing techniques, as currently implemented in various software packages, cannot be guaranteed to find the global ML point, even if it is unique.
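
The edge-weight optimization problem can be made concrete with a toy numeric scan, which is emphatically not the paper's analytic approach: for a fixed three-leaf tree under the two-state symmetric (CFN) model, evaluate the likelihood over a grid of edge change probabilities. The site-pattern counts below are hypothetical.

    from itertools import product
    import numpy as np

    def site_lik(pattern, p):
        """CFN likelihood of one site pattern on the 3-leaf star tree;
        pattern has leaf states in {0, 1}, p the three edge change probs."""
        total = 0.0
        for y in (0, 1):                         # sum over the internal state
            term = 0.5                           # uniform root distribution
            for x, pe in zip(pattern, p):
                term *= pe if x != y else 1 - pe
            total += term
        return total

    def log_lik(counts, p):
        return sum(n * np.log(site_lik(s, p)) for s, n in counts.items())

    # Hypothetical aligned-site pattern counts (not the paper's data sets).
    counts = {(0, 0, 0): 40, (1, 1, 0): 10, (1, 0, 1): 10, (0, 1, 1): 10,
              (1, 0, 0): 10, (0, 1, 0): 10, (0, 0, 1): 10}

    # Scan the edge-weight surface for its best point; inspecting the full
    # grid for points beating all neighbours reveals any multiple optima.
    grid = np.linspace(0.01, 0.49, 25)
    best = max(product(grid, repeat=3), key=lambda p: log_lik(counts, p))
    print(best, log_lik(counts, best))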

Joint work with M. Hendy, B. Holland, and D. Penny (Massey University, New Zealand).

Brute Force Minimization in an Antlion Town: The Generation of Decoy Sets for Protein Structure Prediction
Dr Chen Keasar, Physics of Complex Systems Department, Weizmann Institute
20th November 2000
Abstract currently unavailable.

Robustness in Simple Biochemical Networks
N. Barkai and S. Leibler, Departments of Physics and Molecular Biology, Princeton University
11th June 2000, hosted by Ze'ev Lev
Cells use complex networks of interacting molecular components to transfer and process information. These "computational devices of living cells" are responsible for many important cellular processes, including cell-cycle regulation and signal transduction. Here we address the issue of the sensitivity of the networks to variations in their biochemical parameters. We propose a mechanism for robust adaptation in simple signal transduction networks. We show that this mechanism applies in particular to bacterial chemotaxis. This is demonstrated within a quantitative model which explains, in a unified way, many aspects of chemotaxis, including proper responses to chemical gradients. The adaptation property is a consequence of the network's connectivity and does not require the 'fine-tuning' of parameters. We argue that the key properties of biochemical networks should be robust in order to ensure their proper functioning.

Reference: Barkai & Leibler, Nature 387, 913-917 (1997).
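
The flavor of the result can be reproduced with a toy integral-feedback loop, a deliberately stripped-down stand-in for the chemotaxis network rather than the paper's model: the output returns exactly to its set point after a step in the input, for any input level and any feedback gain.

    import numpy as np

    def simulate(u0=1.0, u1=3.0, y_set=1.0, eps=0.5,
                 step_time=10.0, dt=1e-3, T=40.0):
        """Toy robust adaptation via integral feedback:
            dy/dt = u - x - y           activity y, driven by ligand u
            dx/dt = eps * (y - y_set)   x integrates the deviation from y_set
        At steady state dx/dt = 0 forces y = y_set for ANY constant u and
        ANY eps > 0 -- exact adaptation without fine-tuning of parameters."""
        n = int(T / dt)
        y, x = y_set, u0 - y_set                  # start at steady state
        trace = np.empty(n)
        for i in range(n):
            u = u0 if i * dt < step_time else u1  # step in the ligand level
            y += dt * (u - x - y)
            x += dt * eps * (y - y_set)
            trace[i] = y
        return trace

    y = simulate()
    print(y[0], y.max(), y[-1])   # transient response, then back to y_set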

Molecular Genetic History of Modern Humans
Prof. Marcus W. Feldman, Stanford University
7th May 2000, hosted by Dr Michael Glickman, Department of Biology
One of the most variable classes of DNA in humans consists of repeated motifs of DNA pairs, triplets, or quadruplets. Mathematical models for the evolution of such loci in finite populations subject to mutation are analyzed. Statistics for within- and between-population analysis are described and applied to problems of estimating dates of divergence of African and non-African modern humans. Methods for detecting signals of population growth or decline have been derived and applied to large sets of di-, tri-, and tetra-nucleotide repeats. Genealogical methods provide a powerful way to analyze samples of alleles at non-recombining loci. The non-recombining part of the Y chromosome has yielded a number of polymorphisms, both short tandem repeats and single nucleotide polymorphisms. We have made a statistical analysis of these and find that the mean time since the most recent common ancestor of human Y chromosomes is much shorter than previously thought.
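
A toy Wright-Fisher simulation of the stepwise mutation model conveys the idea behind such dating statistics (all parameter values below are arbitrary): the squared difference in mean repeat count between two diverging populations grows linearly with the time since the split.

    import numpy as np

    rng = np.random.default_rng(0)

    def evolve(pop, gens, mu=1e-3):
        """Wright-Fisher resampling plus stepwise mutation: each generation
        every allele gains or loses one repeat unit with probability mu."""
        for _ in range(gens):
            pop = rng.choice(pop, size=pop.size, replace=True)   # drift
            steps = rng.choice([-1, 0, 1], size=pop.size,
                               p=[mu / 2, 1 - mu, mu / 2])       # mutation
            pop = pop + steps
        return pop

    ancestor = np.full(500, 20)       # 500 chromosomes, 20 repeats each
    pop_a = evolve(ancestor.copy(), gens=2000)
    pop_b = evolve(ancestor.copy(), gens=2000)
    print((pop_a.mean() - pop_b.mean()) ** 2)   # grows like 2 * mu * t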

Marcus W. Feldman received his Ph.D. from Stanford University. He has made important contributions to evolutionary theory and population genetics, including the mathematical analysis of evolution in linked sets of genes and of the means by which cultural evolution, considered alone and in combination with biological evolution, influences human behavior. Professor Feldman is managing editor of Theoretical Population Biology and associate editor of Genetics and of Complexity. He was a Guggenheim Fellow in 1976 and a fellow of the Center for Advanced Study in the Behavioral Sciences in 1983-1984. He is a member of the American Academy of Arts and Sciences and the California Academy of Sciences, and a member of the board of trustees and co-chair of the science board of the Santa Fe Institute.

Algorithmic Challenges from Genomics and Molecular Biology
Prof. Richard Karp, UC Berkeley
4th May 2000
Abstract currently unavailable.

A Whole-Genome Analysis of the Combinatorics of Expression Regulation
Tzachi Pilpel, Department of Genetics, Harvard Medical School
17th April 2000
Genetic control circuits of multi-gene systems often display intricate logical design, as gene expression may depend on the combinatorial action of several transcription factors. Yet, even in the best-characterized genomes, only a few such networks are fully understood. In my talk I will present a whole-genome approach to identifying combinatorial sets of transcription factors that regulate the expression of multi-gene networks, and show some specific examples of such sets. To identify these sets, two complementary methods were developed. The first is a sequence analysis approach based on the premise that the DNA binding sites of interacting factors should co-occur in the promoter regions of multiple genes. The second is based on the notion that interacting transcription factors may display correlated mRNA abundance profiles across different conditions or time courses. By clustering transcription factors and other potential cis-regulatory regions on the basis of these two properties, we have identified many potential combinatorial networks of expression regulators in the yeast genome. These predictions are now being tested experimentally.
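
The first, co-occurrence-based method can be sketched as a simple enrichment test (assuming scipy; the counts are hypothetical): given how often each of two binding sites occurs across promoters, ask whether their joint occurrences exceed what independence predicts.

    from scipy.stats import hypergeom

    def cooccurrence_pvalue(n_promoters, n_a, n_b, n_both):
        """P(A and B co-occur in >= n_both promoters by chance), where A
        occurs in n_a and B in n_b of n_promoters, assuming independence."""
        return hypergeom.sf(n_both - 1, n_promoters, n_a, n_b)

    # Expected overlap under independence: 300 * 200 / 6000 = 10 promoters.
    print(cooccurrence_pvalue(n_promoters=6000, n_a=300, n_b=200, n_both=40))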

I will show preliminary indications of the capacity of our approach to predict genome-wide expression profiles as well as functions of uncharacterized ORFs.

Analysis of Gene Expression Data: Clustering and Beyond
Dr Zohar Yakhini, Agilent Laboratories
16th March 2000
Data obtained from multiconditional gene expression assays contain information that is instrumental in understanding related biological processes at a variety of levels and in a variety of contexts. The nature of studies of multiconditional gene expression patterns may vary widely. Accordingly, it is important to have flexible analysis tools that are useful in as many contexts as possible. Clustering techniques are applicable, as they cluster sets of genes that 'behave similarly' under any appropriate definition of similarity. A common theme in the literature on clustering methods is the need to fit the approach to the problem at hand and the necessity of assessing the quality of solutions by the subjective impression of experts in the field.

We have developed an efficient and effective clustering algorithm well suited to the analysis of gene expression data. In contrast to prior approaches to clustering gene expression patterns, which utilize hierarchical methods (constructing phylogenetic trees) or methods that work for only a single class of similarity metrics (Euclidean distance), our algorithm is based on an abstract graph-theoretic approach. There is no need to assume a particular similarity function or number of clusters sought. The cluster structure is produced directly, without involving an intermediate tree stage. On an appropriate stochastic model of the input, an O(n log n)-time version of the algorithm recovers cluster structures with high probability, where n is the number of genes. Practical heuristic improvements of the algorithm were implemented in BioClust, a software package we routinely use to analyze actual data. In the talk we demonstrate the algorithm's performance on simulated data as well as on gene expression data from humans and other organisms.

It is expected that expression profile differences between cancer and non-cancer samples can be exploited to classify unknown samples. Recently, we have used this algorithm, as well as other analysis methods, to study such issues of sample classification in gene expression data. As a by-product of a classification or class discovery process, we also gain a deeper understanding of the roles that characterized genes play in the underlying pathomechanism. In the talk we will also present classification-related results.
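
A skeleton of an affinity-threshold clusterer in the spirit of the graph-theoretic approach described (modeled on the CAST heuristic of Ben-Dor and colleagues; whether BioClust implements exactly this is an assumption): genes join a growing cluster while their mean similarity to it clears a threshold t, and drop out when it does not.

    import numpy as np

    def cast_like(S, t, max_rounds=1000):
        """S: symmetric gene-by-gene similarity matrix; t: affinity threshold."""
        unassigned, clusters = set(range(S.shape[0])), []
        while unassigned:
            c = [max(unassigned, key=lambda g: S[g].sum())]   # high-affinity seed
            for _ in range(max_rounds):
                changed = False
                outside = [g for g in unassigned if g not in c]
                if outside:                        # add the best outside gene
                    g = max(outside, key=lambda g: S[g, c].mean())
                    if S[g, c].mean() >= t:
                        c.append(g); changed = True
                worst = min(c, key=lambda g: S[g, c].mean())
                if len(c) > 1 and S[worst, c].mean() < t:      # drop the worst
                    c.remove(worst); changed = True
                if not changed:
                    break
            clusters.append(sorted(c))
            unassigned -= set(c)
        return clusters

    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 10))                  # 30 genes, 10 conditions
    X[:15] += np.linspace(-2, 2, 10)               # plant one coherent pattern
    print([len(c) for c in cast_like(np.corrcoef(X), t=0.3)])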

Joint work with Amir Ben-Dor (University of Washington) and other collaborators.

Dr Zohar Yakhini of Agilent Laboratories is a visiting scientist at the Computer Science Department, Technion, working in the area of computational biology. Zohar sits in Taub 625 and you are welcome to drop by and say hello.

Sequence Elongation, Expression, and Function: The New Paradigm of Lab on the Web
Rave Gilmore, Compugen
24th February 2000
Compugen LEADS is an advanced computational platform focused on extracting meaningful biological information from genomic, expressed-sequence (EST) and protein databases.

The LEADS analysis and data-mining programs correctly model complex biological phenomena such as alternative splicing, repeats and SNPs, as well as artifacts such as sequencing errors, intron contamination, vector contamination and chimeric sequences. In the past two years LEADS has routinely been run on every new release of GenBank, and has provided a wealth of information, including thousands of new full-length genes and new splice variants of known genes. The predictions derived from the computational processes were biologically verified at Compugen's molecular biology lab with a success rate of over 90%.

Starting in December 1999, elements of the LEADS technology, along with Compugen's expertise in sequence analysis, have been made available to the research community as a service over the Internet.

LabOnWeb.com allows any scientist to submit a gene fragment and receive in return an elongated sequence, often a full-length gene, with a detailed analysis including homologous protein and DNA sequences, domain analysis, and open reading frames. In addition, the recently announced collaboration with Genzyme Molecular Oncology provides LabOnWeb with access to over 2 million proprietary SAGE tags, enabling it to provide a very accurate quantitative tissue-distribution profile for a researcher's genes.

The talk will describe and demonstrate these capabilities in detail, along with future plans for expanding the service. Emphasis will be placed on advanced applications of the technology, such as DNA-chip design.