First Annual Report of GeneFun

Workpackage 1

Partner 04 (T2) has developed and is maintaining an exhaustive set of
orthologous families of the genes in 179 completely sequenced genomes. The
procedure is similar to the COG approach, but it is able to split some
inclusive COGs (those that include several paralogs from one species) and
is based on many more genomes. The NOG approach (non-supervised orthologous
groups) yielded more than 2000 orthologous groups of which 6000 overlap
with the manually annotated COGs. The set is publicly available.
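
As a rough illustration of the best-hit relations from which COG-like
procedures start, the sketch below finds reciprocal best hits between two
genomes from a table of similarity scores. The function name, the score
table and the gene identifiers are hypothetical, and the sketch does not
reproduce the NOG procedure itself, whose details (including how inclusive
groups are split) are not given in this report.

```python
def reciprocal_best_hits(scores):
    """Return gene pairs that are each other's best-scoring hit.

    `scores` maps (gene_in_genome_a, gene_in_genome_b) -> similarity score.
    All identifiers and scores used here are illustrative only.
    """
    best_for_a = {}  # gene in A -> (best partner in B, score)
    best_for_b = {}  # gene in B -> (best partner in A, score)
    for (a, b), s in scores.items():
        if a not in best_for_a or s > best_for_a[a][1]:
            best_for_a[a] = (b, s)
        if b not in best_for_b or s > best_for_b[b][1]:
            best_for_b[b] = (a, s)
    # Keep a pair only when the best hit is mutual.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Toy data: a2 also hits b1 best, but b1 prefers a1, so only a1/b1 is mutual.
toy_scores = {
    ("a1", "b1"): 250.0,
    ("a1", "b2"): 90.0,
    ("a2", "b1"): 120.0,
    ("a2", "b2"): 80.0,
}
pairs = reciprocal_best_hits(toy_scores)
```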

Workpackage 2


WP2. Reliability scores for functional annotations

 

The objective of the WP is to produce a robust metric indicating
the reliability of the transfer of functional annotations, similar
to the one commonly used to indicate the confidence of the similarities
identified in sequence searches in large databases.



Specific goals are the revision of the work carried out in this area,
the update of the best available approaches, their application to the
test sets prepared in WP1, the inclusion of information derived from
multiple sequence alignments and protein structures, and finally the
delivery of the scores in the proper integrated technical framework.



During this first period we have revisited the literature and the
technical details underlying previously reported annotation errors
(Valencia, Curr. Op. Struc. Biol., 2005 submitted).



As a consequence of that study we have carried out a completely new
implementation using the CE database as the guide for the extraction
of pairwise alignments, and we have incorporated a new procedure to
reduce the redundancy at the level of sequence similarity and
functional classes (in this case codes indicating biochemical
functions, EC numbers).

The current estimate of the discrepancy of the EC numbers between
pairs of proteins at different levels of sequence similarity is
intermediate between our previous work (Devos and Valencia,
Proteins 2001) and that of Todd et al. (J Mol Biol. 2001). This
updated calibration will be the basis of the rest of the work (del
Pozo and Valencia, 2005 in preparation).
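
The kind of calibration described above can be sketched as follows: pairs
of proteins are binned by sequence identity, and for each bin the fraction
of pairs whose EC numbers agree is reported. This is a minimal sketch
under stated assumptions. The function name, the bin width, the choice of
comparing the first `digits` EC fields and the toy pairs are all
hypothetical and are not taken from the implementation reported here.

```python
from collections import defaultdict


def ec_agreement_by_identity(pairs, bin_width=10, digits=4):
    """Fraction of pairs sharing the first `digits` EC fields, per identity bin.

    `pairs` is an iterable of (percent_identity, ec_a, ec_b), with EC numbers
    written as strings such as "1.1.1.1". Bins are keyed by their lower bound.
    """
    same = defaultdict(int)
    total = defaultdict(int)
    for identity, ec_a, ec_b in pairs:
        # Clamp 100% identity into the top bin.
        lower = min(int(identity // bin_width) * bin_width, 100 - bin_width)
        total[lower] += 1
        if ec_a.split(".")[:digits] == ec_b.split(".")[:digits]:
            same[lower] += 1
    return {lower: same[lower] / total[lower] for lower in sorted(total)}


# Toy pairs (illustrative values only, not results from this project).
toy_pairs = [
    (95.0, "1.1.1.1", "1.1.1.1"),
    (92.0, "1.1.1.1", "1.1.1.2"),
    (35.0, "2.7.1.1", "3.1.3.5"),
    (38.0, "2.7.1.1", "2.7.1.2"),
]
full = ec_agreement_by_identity(toy_pairs, digits=4)
three = ec_agreement_by_identity(toy_pairs, digits=3)
```

Comparing fewer EC fields (for example `digits=3`) gives a more permissive
definition of functional agreement, which is one way to explore how the
estimated discrepancy depends on the level of functional detail.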



For the next reporting period we propose to update the calculations
using the same basic dataset of pairs of proteins, and to analyze other
definitions of protein function complementary to the definition of
enzymatic function.



In parallel we will develop the algorithms for extending this
evaluation to full alignments, taking into account not only the
pairwise relations but the full family structure.



Finally, we will incorporate the method for estimating error levels
in a web server able to serve XML annotations to be used by the other
partners, and in a DAS server that can be integrated into other
genome annotation pipelines.




Partner 04 is developing a phylogenetic tree for all the sequenced
genomes which includes branch lengths. The branch length between
species indicates the average rate of evolution. When analysing
orthologs between different species, normalisation to the species
tree will therefore be an important prerequisite for function
transfer between species.
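
One simple reading of this normalisation is to divide the distance between
two orthologs by the species-tree distance between their genomes. The
sketch below illustrates that reading only. The function name, the ratio
itself and the species distances are assumptions, since the report does
not specify how the normalisation will be computed.

```python
def normalised_divergence(ortholog_distance, species_distance):
    """Ratio of a gene pair's distance to the species-tree distance.

    Values near 1 suggest the family evolves at the genome-average rate;
    larger values suggest faster-than-average divergence.
    """
    if species_distance <= 0:
        raise ValueError("species_distance must be positive")
    return ortholog_distance / species_distance


# Hypothetical species-tree path lengths (sum of branch lengths between species).
species_tree_distance = {("human", "mouse"): 0.5, ("human", "yeast"): 2.0}

slow = normalised_divergence(0.25, species_tree_distance[("human", "mouse")])
fast = normalised_divergence(4.0, species_tree_distance[("human", "yeast")])
```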



Deliverables

- Reliability score for function transfer based on pairwise
similarities:  Month 6



The work has been carried out, and the draft of the corresponding
publication (del Pozo, Valencia 2005) will serve as report.





Workpackage 3

Objective

The objective of this WP is to undertake a systematic analysis of the
correlation between domain architecture and protein function, and to
apply the derived rules and criteria to infer functional features for
as yet non-annotated regions of proteins in eukaryotic genomes.
Specific goals include implementing and benchmarking functional
prediction methods based on domain information, as well as developing
methods to assign function to less well characterized sections of
eukaryotic genomes. Three partners (SBC, INBGU and BioAlma) have
contributed significantly to fulfilling these goals, while other
partners have also been involved in discussions.

Specific goals are revision of previous work carried out in this area,
and the production of a protocol to use domain architecture to predict
gene functions. The relation between domain architecture and different
context-based functional features (e.g. cellular localization,
cellular abundance, functional classes etc.) has systematically been
explored, and methods for using the discovered relation for inferring
function will be developed. Available methods for predicting domain
and protein interactions from domain architecture will be benchmarked
and improved in the progress of this workpackage.

Deliverable 1. Methods for the reliable delineation of domain architecture (Month 6)

The partners have implemented four systems to assign domains in
eukaryotic genomes. SBC has developed a method that applies either the
SCOP or Pfam databases to complete genomes, and has extended these
databases by assigning less well characterized domains to Pfam-B or our
novel MAS database. This has enabled us to predict the number of domains
in each protein more accurately (Ekman, 2005 in print). A web server
implementing the algorithm has been made publicly available at
http://sbcweb.pdc.kth.se/cgi-bin/diaek/domsearch.cgi. It was
discovered that a significant fraction of Pfam-B and MAS domain
families are homologous to Pfam-A or SCOP domains. Structural
features of regions matching different domain types and of different
classes of unassigned regions, such as disorder, low complexity,
transmembrane regions and secondary structure, were studied in detail.
It was found that all regions that do not match a Pfam-A or SCOP domain
contain a significantly higher fraction of disordered structure. These
unstructured regions may be contained within orphan domains or
function as linkers between structured domains. Identification of
domains in these regions will be studied later in this
workpackage. Structural predictions and domain architectures for 21
genomes are available for download at
http://www.sbc.su.se/~arne/domains.

Further, a meta-predictor of domain boundaries that uses ten different
domain prediction methods plus a consensus prediction has been
developed by the INBGU partner. It is available at:
http://meta-dp.bioinformatics.buffalo.edu/.
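
The idea of a consensus over several boundary predictors can be
illustrated with a generic voting scheme. The sketch below is not the
scheme used by the INBGU meta-predictor, which is not described in this
report. The function name, the clustering window and the vote threshold
are hypothetical choices made for illustration.

```python
def consensus_boundaries(predictions, window=10, min_votes=2):
    """Merge boundary predictions from several methods into consensus positions.

    `predictions` holds one list of boundary positions per method. Sorted
    positions are chained into a cluster while each lies within `window`
    residues of the previous one; a cluster is kept when at least
    `min_votes` distinct methods contribute to it.
    """
    tagged = sorted(
        (pos, method)
        for method, positions in enumerate(predictions)
        for pos in positions
    )
    clusters = []
    for pos, method in tagged:
        if clusters and pos - clusters[-1][-1][0] <= window:
            clusters[-1].append((pos, method))
        else:
            clusters.append([(pos, method)])
    consensus = []
    for cluster in clusters:
        if len({method for _, method in cluster}) >= min_votes:
            positions = sorted(pos for pos, _ in cluster)
            # Report the middle position of the cluster.
            consensus.append(positions[len(positions) // 2])
    return consensus


# Three hypothetical methods; only the boundary near residue 100 is supported
# by more than one of them.
toy_predictions = [[102, 250], [98, 400], [105]]
result = consensus_boundaries(toy_predictions)
```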

EMBL has adapted SMART, its domain analysis tool, to the
Ensembl annotation of metazoan genomes. This allows fast and
unambiguous retrieval of the domain architectures of the genes
annotated by Ensembl. SMART also captures Pfam domains and thus covers
the HMM-based approaches. It is available at
http://smart.embl-heidelberg.de/.

Finally, BioAlma has developed a method to detect protein domains
automatically from sequence information, with little human
intervention and to an ever higher degree of certainty.

Deliverable 2. Methods for prediction of functional features from DA (Month 12)

During this first period the partners have revisited the literature
and the technical details underlying previous efforts in this area. As
a result of this study SBC has chosen to implement the GO-graph
method (Lord, 2003) as the basis for further studies. It was shown
that our protein distance measure based on domain architecture (DA)
correlates well with the GO-graph measure of semantic similarity based
on Gene Ontology (GO) annotation (http://www.geneontology.org),
especially for molecular function but also for biological process and
cellular component. The distance measure was used to build
evolutionary trees, which have been the basis for studying protein
evolution and the origin of new functionalities (Björklund,
manuscript).

SBC has also used the GO annotations of all SWISS-PROT proteins to
calculate how well the GO terms for a protein can be described by the
domain architecture using Pfam domains, similar to the Pfam2go
annotations found at
(http://www.geneontology.org/external2go/pfam2go), but also including
statistics on SCOP domains. For each domain, a score was computed that
measures how well each GO term is described by the domain. A method
to predict function from DA was implemented based on these
scores. A web server has been made publicly available at
http://sbcweb.pdc.kth.se/cgi-bin/diaek/domfunction.cgi
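
A per-domain score of this kind could, for instance, be estimated as the
fraction of proteins containing a domain that also carry a given GO term.
The sketch below shows that estimate and a threshold-based prediction
built on it. This is an assumption made for illustration: the report does
not give the scoring formula used by SBC, and the function names, the
threshold and the toy training set are hypothetical.

```python
from collections import defaultdict


def domain_go_scores(proteins):
    """Estimate P(GO term | domain) from annotated proteins.

    `proteins` is a list of (set_of_domains, set_of_go_terms) tuples.
    """
    with_domain = defaultdict(int)
    with_both = defaultdict(int)
    for domains, terms in proteins:
        for d in domains:
            with_domain[d] += 1
            for t in terms:
                with_both[(d, t)] += 1
    return {(d, t): n / with_domain[d] for (d, t), n in with_both.items()}


def predict_terms(domains, scores, threshold=0.6):
    """Return GO terms whose best score over the given domains reaches `threshold`."""
    best = defaultdict(float)
    for (d, t), s in scores.items():
        if d in domains:
            best[t] = max(best[t], s)
    return sorted(t for t, s in best.items() if s >= threshold)


# Toy training set (illustrative annotations, not project data).
training = [
    ({"Pkinase"}, {"GO:0004672"}),
    ({"Pkinase", "SH2"}, {"GO:0004672", "GO:0005515"}),
    ({"SH2"}, {"GO:0005515"}),
]
scores = domain_go_scores(training)
predicted = predict_terms({"Pkinase"}, scores)
```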

An alternative approach has been taken at BioAlma, where the objective
is to speed up the curation process by automatically extracting from
text the information relevant to the functional description of protein
domains. This will make it easier for curators to find similarities
between different domains and to pick out the most relevant
publications for further detailed curation. In the first reporting
period BioAlma investigated the following scheme. Given a
(multi-domain) protein family, the proteins in this group (set A) are
composed of a number of domains, some of which they share with one set
of proteins (set B) and others with another group (set C). If we now
analyze the literature corresponding to these sets we will find:

- things that are shared between all three of them (non-specific
features)
- things that are only shared between sets A and B (potentially
specific for the domains that sets A and B share)
- things that are only shared between sets A and C (potentially
specific for the domains that sets A and C share)

This way it is possible to separate what corresponds to a set of
proteins from what specifically refers to a domain.
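
The three-way comparison described above maps directly onto set
operations. The following is a minimal sketch, assuming that the
literature for each protein set has already been reduced to a set of
extracted terms. The function name, the dictionary keys and the example
terms are hypothetical and do not reflect the BioAlma system.

```python
def partition_features(set_a, set_b, set_c):
    """Split the features of family A by which related sets share them."""
    return {
        # Found in all three literature sets: not specific to any domain.
        "non_specific": set_a & set_b & set_c,
        # Shared with B only: candidate features of the domains A and B share.
        "specific_to_AB_domains": (set_a & set_b) - set_c,
        # Shared with C only: candidate features of the domains A and C share.
        "specific_to_AC_domains": (set_a & set_c) - set_b,
    }


# Toy term sets standing in for text-mined literature features.
terms_a = {"binding", "phosphorylation", "phosphotyrosine", "membrane"}
terms_b = {"binding", "phosphorylation", "ATP"}
terms_c = {"binding", "phosphotyrosine", "adaptor"}
groups = partition_features(terms_a, terms_b, terms_c)
```

Terms that occur only in set A (here "membrane") fall into none of the
three groups, which matches the scheme: they describe the family itself
rather than any shared domain.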

Based on the text mining technology used in the BioAlma system we set
up a number of training sets and developed additional algorithms to
perform the multi-directional comparison of document sets explained
above.

Future work

For the next reporting period we will benchmark these methods on their
ability to predict protein interactions using the DIP database
(deliverable 3.3). In parallel we will extend our functional
predictions so that the less well characterized domains (Pfam-B and
MAS) can also be utilized, so that deliverables 3.4 and 3.5 can be
fulfilled.

Workpackage 4

Workpackage report

Workpackage 5

-------------------------------------------------

In the course of the GENEFUN project, a unique scoring system for the
prediction of functional associations has been developed and implemented.
In order to make the heterogeneous data comparable, we devised a single
benchmark and scored all the different sets (regardless of whether they
were predicted or experimentally derived). We also devised a scoring scheme
for the transfer of interactions between species (do expression data in
mice apply to human and, if so, under which circumstances and to what
extent?). Several factors for interaction transfer were considered: for
example, the more distant two species are, the less confident we are in
the function transfer, and the more inparalogs a gene has, the less
confidence we have in the transfer. The existing STRING tool was entirely
redesigned to cope with novel data formats and an expected increase in the
number of interaction databases.
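
The two transfer factors mentioned above can be illustrated with a toy
down-weighting function. The functional form below (exponential decay
with species distance, division by the number of inparalogs) and all
parameter names are assumptions chosen for illustration. They are not the
scoring scheme implemented in STRING, which this report does not spell
out.

```python
import math


def transferred_score(score, species_distance, n_inparalogs, decay=1.0):
    """Down-weight an interaction score when transferring it to another species.

    The score decays exponentially with species distance and is divided by
    the number of inparalogs (at least 1) among which the evidence is shared.
    """
    if n_inparalogs < 1:
        raise ValueError("n_inparalogs must be at least 1")
    return score * math.exp(-decay * species_distance) / n_inparalogs


# Hypothetical cases: a close species, a distant species, and a close species
# in which the target gene has four inparalogs.
close = transferred_score(0.8, species_distance=0.1, n_inparalogs=1)
distant = transferred_score(0.8, species_distance=1.5, n_inparalogs=1)
duplicated = transferred_score(0.8, species_distance=0.1, n_inparalogs=4)
```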

Furthermore, we devised a number of filters for several of the raw data
types (e.g. y2h, complex purifications or arrays), some of which led to
independent publications. We have bundled the various data types into a
number of distinct channels, and for each channel visualisation tools are
being developed.

Another major development concentrated on improving the maintainability
of the tool and server. So far, we have information on 179 species covering
more than 730,000 proteins and more than 23 million interactions with
various degrees of confidence. These currently come from 11 different
resources and predictions.

The development of the STRING resource has involved considerable human
resources far beyond the man-months allocated by GENEFUN. The result is a
framework that is getting heavily used by the scientific community.

In order to make a metaserver a success, each of the method implementations
has to be of high accuracy. The partners in this work package worked on
the improvement of different methods. For example, considerable progress
has been made in text mining (improvement of precision and recall of
protein names and inclusion of a large number of organisms) but also in
the extension of genomic context methods.

We have continued to apply the evolving tools for the prediction of 
functional features and have successfully combined homology and context 
analysis in a number of projects.




D5.1. Our existing web server, STRING, which predicts and integrates
protein interaction data, has been entirely redesigned to be able to cope
with the challenges of this EU project. With version 5.0 in spring 2004 we
started to incorporate experimental data, enabled by a unifying benchmark
and scoring scheme. In February 2005, version 6 was released with a variety
of predicted and experimentally derived interactions. The respective
documents and data are deposited on the STRING web server.


Workpackage 6

This WP relies on the progress in the other WPs. In a pilot study we have
already developed a metaserver that combines homology-based predictions
(SMART) and predictions of protein interactions/functional associations
between proteins (STRING).

Workpackage 7

This WP relies on the progress in the other WPs. In a pilot study we have
already developed a metaserver that combines homology-based predictions
(SMART) and predictions of protein interactions/functional associations
between proteins (STRING).

Workpackage 8

We have contributed to a number of European courses and workshops to
introduce and disseminate the procedures developed. Examples are the
European Bioinformatics School (Nijmegen, Netherlands, Jan 2005) and the
annual courses held in Bertinoro, Italy (Mar 2004, 2005).

Arne Elofsson
Last modified: Mon Mar 28 22:27:07 CEST 2005