GDV calculator for PPI networks

Project in Bioinformatics

Technion – Israel Institute of Technology

Computer Science Faculty

Samer Mattar & Reuven Shirazi

under the supervision of Prof. Ron Pinter

 

Downloads and input/output examples are provided in the Downloads and Examples sections below.

 

Abstract

As part of proteomics research, PPI networks are used to model the interactions between proteins. One suggested method for understanding the function of a protein is to analyze its connectivity properties in the PPI network.

Evaluating connectivity properties requires an appropriate, agreed-upon method for characterizing topologies in a network. For that purpose, the graphlet concept was introduced to define a specific topology, or pattern, inside a network.

A common approach to forming a hypothesis about the function of a specific protein is to count how many times the protein occurs at each specific position in a graphlet.

 

• Graphlet - A small connected graph pattern. Each node in a graphlet is identified by a specific role, which is defined by its position in the graphlet.

 

• Graphlet Degree Vector (GDV) - A vector indexed by all the roles in all the graphlets up to size n. Given a graph, the GDV of each node gives the number of occurrences of that node in each specific role.

 

• GDV types

Q: number of occurrences in each role of each graphlet of each size up to n.

R: number of occurrences in each graphlet of each size up to n.

S: number of occurrences in graphlets of each size up to n.

 

• PPI Network - The protein-protein interactions in a specific organism or tissue, represented as a graph.
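To make the definitions above concrete, here is a minimal Python sketch that computes the Q vector of a node for graphlets up to size 3 and then reduces it to the R and S types. It is an illustration only, not the tool's implementation: the role labels in `ROLES`, the function names, and the toy graph are all hypothetical, and graphlets are assumed to be counted as induced subgraphs, which the text above does not state.

```python
from itertools import combinations

# Hypothetical role labels: (graphlet size, graphlet name, role name).
ROLES = [
    (2, "edge", "endpoint"),
    (3, "path", "end"),
    (3, "path", "middle"),
    (3, "triangle", "corner"),
]

def build_adjacency(edges):
    """Build an undirected adjacency map from a list of node pairs."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def gdv_q(adj, v):
    """Q type: one count per role, in the order given by ROLES."""
    nbrs = adj[v]
    # Pairs of neighbours that are themselves connected close a triangle.
    triangles = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    # The remaining neighbour pairs make v the middle of an induced path.
    path_middle = len(nbrs) * (len(nbrs) - 1) // 2 - triangles
    # v is a path end for every v-u-w where w is not adjacent to v.
    path_end = sum(1 for u in nbrs for w in adj[u] if w != v and w not in nbrs)
    return [len(nbrs), path_end, path_middle, triangles]

def reduce_q(q, key):
    """Sum the Q counts over all roles that share the same key."""
    totals = {}
    for role, count in zip(ROLES, q):
        totals[key(role)] = totals.get(key(role), 0) + count
    return totals

def gdv_r(q):
    """R type: one count per graphlet (roles merged)."""
    return reduce_q(q, lambda role: (role[0], role[1]))

def gdv_s(q):
    """S type: one count per graphlet size (graphlets merged)."""
    return reduce_q(q, lambda role: role[0])

# Toy network: a triangle A-B-C with a pendant node D attached to C.
EDGES = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
adj = build_adjacency(EDGES)
q_c = gdv_q(adj, "C")
print(q_c)         # [3, 0, 2, 1]
print(gdv_r(q_c))  # {(2, 'edge'): 3, (3, 'path'): 2, (3, 'triangle'): 1}
print(gdv_s(q_c))  # {2: 3, 3: 3}
```

In this reading, R and S are plain sums over Q, so only the Q vector needs to be computed from the graph.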

 

Project Definition

Objectives:

1. Develop a tool for calculating the GDVs for all proteins in a PPI network.

2. Support common PPI network formats.

3. Provide the option to apply parameters for modifying the output data.

 

Major Challenges:

• Efficiency – Essential for traversing large PPI networks

• Scalability – Required so that the tool can be extended

• Duplication – Identifying symmetry is essential for correct results

 

Project Description

• Input - A PPI file, which is an XML file representing a PPI network. The tool supports two well-known formats: MIF and XIN.

• Configuration - The algorithm of the tool is configured using an XML file: GraphletDesc.xml. That file contains data on all the supported graphlets which can be found and counted by the tool in a PPI network. In effect, this file defines the way the main algorithm traverses a given input PPI network.

• Workflow

1. Traverse the graph, searching for graphlets.

2. When a graphlet is found, update the GDV of each of its nodes at the corresponding role.

3. Correct duplicated counts in the GDVs.

4. Export the calculated GDVs to files.
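The four workflow steps can be sketched in Python as follows. This is a deliberately naive illustration under stated assumptions, not the tool's algorithm: the `GRAPHLETS` table is a hypothetical stand-in for GraphletDesc.xml (whose real schema is not shown here), matching is done by brute force over all ordered node tuples, graphlets are assumed to be induced subgraphs, and the duplication correction assumes that each occurrence is found once per symmetry of its pattern.

```python
from itertools import permutations

# Hypothetical graphlet descriptions: edges between positions, the role of
# each position, and the number of symmetries (automorphisms) of the pattern.
GRAPHLETS = [
    {"size": 2, "edges": {(0, 1)},
     "roles": ["edge.endpoint", "edge.endpoint"], "symmetries": 2},
    {"size": 3, "edges": {(0, 1), (1, 2)},
     "roles": ["path.end", "path.middle", "path.end"], "symmetries": 2},
    {"size": 3, "edges": {(0, 1), (1, 2), (0, 2)},
     "roles": ["triangle.corner"] * 3, "symmetries": 6},
]

def matches(adj, nodes, pattern_edges):
    """True if the nodes, in this order, induce exactly the pattern."""
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if (nodes[j] in adj[nodes[i]]) != ((i, j) in pattern_edges):
                return False
    return True

def calculate_gdvs(adj):
    role_names = sorted({r for g in GRAPHLETS for r in g["roles"]})
    gdvs = {v: dict.fromkeys(role_names, 0) for v in adj}
    symmetries = {}
    for g in GRAPHLETS:
        for role in g["roles"]:
            symmetries[role] = g["symmetries"]
        # Step 1: traverse the graph, searching for this graphlet.
        for nodes in permutations(sorted(adj), g["size"]):
            if matches(adj, nodes, g["edges"]):
                # Step 2: update each node's GDV at its role.
                for node, role in zip(nodes, g["roles"]):
                    gdvs[node][role] += 1
    # Step 3: every occurrence was found once per symmetry, so divide.
    for counts in gdvs.values():
        for role in counts:
            counts[role] //= symmetries[role]
    return gdvs

def export_rows(gdvs):
    """Step 4: format the GDVs as tab-delimited text lines (assumed layout)."""
    roles = sorted(next(iter(gdvs.values())))
    rows = ["\t".join(["node"] + roles)]
    for node in sorted(gdvs):
        rows.append("\t".join([node] + [str(gdvs[node][r]) for r in roles]))
    return rows

# Toy network: a triangle A-B-C with a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
gdvs = calculate_gdvs(adj)
print("\n".join(export_rows(gdvs)))
```

Without step 3, the triangle A-B-C would be counted six times for each of its corners, once per ordering of its nodes, which is why the symmetry of each pattern has to be known.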

 

• Output files

GDVout.txt - The data, reduced according to the user parameters

GDVoutFull.txt - The full calculated data with no reductions

Note: The output files are tab-delimited, so they can be opened and manipulated in any application for editing or viewing data tables, such as MS Excel.
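Because the output is tab-delimited, it can also be loaded by a short script. The Python sketch below assumes a layout that is not confirmed by this page: a header row, a node identifier in the first column, and integer counts in the remaining columns. The sample text and column names are made up for illustration; adjust the parsing to the actual contents of GDVout.txt.

```python
import csv
import io

def read_gdv_table(handle):
    """Read a tab-delimited GDV table into (column names, {node: counts}).

    Assumed layout: a header row, then one row per node, with the node
    identifier first and integer counts after it.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        table[row[0]] = [int(value) for value in row[1:]]
    return header[1:], table

# Made-up sample standing in for the contents of an output file.
SAMPLE = "protein\trole_0\trole_1\nP1\t3\t0\nP2\t1\t2\n"
columns, table = read_gdv_table(io.StringIO(SAMPLE))
print(columns)
print(table)
```

For a real file, the same function can be called with `open("GDVout.txt", newline="")` in place of the in-memory sample.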

 

Specifications

Software Technology:

• Unmanaged (native) code - Selected for the “heavy” calculations, to achieve better performance.

Apps: GDVCalculator.exe, BuildDependencies.exe

• Managed .NET - Selected for the UI, to provide a friendly, intuitive interface.

Apps: GDV.UI.exe

 

System Requirements:

• Windows OS

• .NET Framework 2.0

 

Usage

1. Execute GDV.UI.exe in

2. Define the following parameters to modify the output:

• Max graphlet size - by selecting the right tab

• Graphlet filtering - by checking the corresponding checkboxes

• GDV type - by selecting a radio button

3. Select a PPI network file

4. Press “Calculate GDVs”.

 


 

Downloads

Getting the Source:

• Download GDV_Source.zip (click here) and extract the archive to any location on the host.

• Open the solution GDV.sln in Visual Studio 2008 and build all.

 

Getting Binary Files:

• Download GDV_Bin.zip (click here) and extract the archive to any location on the host.

 

Getting the Project Documentation:

The full project documentation is available in two formats: PDF and DOCX.

 

Examples

Example 1:

• input file: Li2004a.mif
A set of C. elegans protein-protein interactions identified using a yeast two-hybrid approach. See Li et al. (2004, ScienceExpress, 2 January 2004) for experimental details. Note that only the original (experimental) results described in the paper are included in the data set.

• output file: Li2004a_mif_5_out.txt
the GDV out file with Max_graphlet_size=5, GDV_type=Q

Example 2:

• input file: Gavin2002a.xin
A set of S. cerevisiae protein-protein interactions identified by TAP purification of protein complexes followed by mass-spectrometric identification of individual components. The data is provided as a list of binary 'bait-prey'-type interactions. See Gavin et al. [PMID:11805826] for experimental details.

• output file: Gavin2002a_xin_4_out.txt
the GDV out file with Max_graphlet_size=4, GDV_type=Q

Example 3:

• input file: Giot2003a.mif
A set of D. melanogaster binary protein-protein interactions identified using a yeast two-hybrid approach. See Giot et al. [PMID:14605208] for experimental details.

• output file: Giot2003a_mif_3_out.txt
the GDV out file with Max_graphlet_size=3, GDV_type=Q

 

Acknowledgement for Third Party Components

XMLParser

A free, small, simple, cross-platform and fast C++ XML Parser by Dr. Ir. Frank Vanden Berghen.

 

Boost Unordered Map

The free Boost C++ libraries extend the STL and provide a hash table container (unordered_map), which serves as a simple string-keyed hash table.