Bioinformatics Demonstration
A video of the demonstration is available in the "Watch the demonstration video" section below.
Web search tools have become ubiquitous, with both generic and domain-specific search services providing users with rapid and selective access to data from potentially huge repositories. However, individual search tools are often ineffective for use in applications in which the answer to a request involves combining results from more than one search engine. In particular, web search services typically seek individual documents that meet the criteria specified in a request, whereas in practice information relevant to a requirement may be spread over several resources. 
Search computing provides a platform for expressing requests over multiple search services, such that the results of the integrated requests take account of the rankings of individual search results.
In the life sciences, many resources provide vertical search capabilities, in that they are focused on a single domain. In practice, many life science services provide ranked data as results, where the ranking may reflect a property of an algorithm (e.g. a similarity score) or of an experimental result (e.g. an expression level). Furthermore, it is often essential to combine multiple vertical search services to create multi-domain searches, where the different domain searches either refine or augment previous results.
This demo explores the application of a search computing platform in a bioinformatics use case, with a view to identifying the extent to which the existing platform for multi-domain search provides useful facilities for representing and integrating bioinformatics search services.
Demonstration Goal
In the life sciences, numerous questions can be addressed only by comprehensively searching different types of data that are inherently ordered, or are associated with ranked confidence values. By using available web services for searching bioinformatics data and taking advantage of the attributes they define for providing a ranking, search computing techniques can be applied to efficiently search for globally ranked answers to such complex questions.
This online prototype shows a case study of the use of a domain-independent search computing platform for describing well-known bioinformatics resources as search services, and for carrying out integrated analyses over the resulting services.
In particular, it makes explicit how ranked data from sequence comparisons and from gene expression results can be integrated in a way that takes account of the ranked results from the different types of data. In so doing, the demonstration illustrates the use of ranking as a first-class citizen for data integration in the life sciences, and identifies open issues for further investigation.
Running Example
The demonstration answers the following query: "Which genes encode proteins in different organisms with the highest sequence similarity to a given protein and are co-expressed (e.g. over-expressed) in the same given biological tissue/condition?". This multi-domain question can be decomposed into the following three single-domain sub-queries:
- "Which proteins in different organisms have the highest sequence similarity to a given protein?";
- "Which genes encode which proteins?";
- "Which genes are co-expressed (e.g. over-expressed) in the same given biological tissue/condition?".
Each of these sub-queries is mapped to an available search service:
- WU-BLAST (http://www.ebi.ac.uk/blast2/), an implementation of BLAST, a well-known sequence similarity search program;
- a query service on our GFINDer GPDW (http://www.bioinformatics.polimi.it/GFINDer/), an integrated data warehouse of genomic and proteomic information;
- a search engine over ArrayExpress Gene Expression Atlas (http://www.ebi.ac.uk/gxa/), a repository of gene expression data.
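The chaining of the three sub-queries can be sketched as a simple join, shown here in Python. This is only an illustration under stated assumptions: the three dictionaries are in-memory stand-ins for the services listed above (no real service is called), and all identifiers such as "QUERY", "PROT_A" and "GENE_A", as well as the function name `answer`, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimilarityHit:
    """One result of the sequence similarity sub-query."""
    protein_id: str
    organism: str
    e_value: float  # expectation value: lower means more similar


@dataclass(frozen=True)
class ExpressionHit:
    """One result of the gene expression sub-query."""
    gene: str
    p_value: float  # significance of the differential expression


# Hypothetical stand-ins for the three search services (mock data only).
SIMILARITY = {
    "QUERY": [
        SimilarityHit("PROT_A", "organism 1", 1e-50),
        SimilarityHit("PROT_B", "organism 2", 1e-30),
        SimilarityHit("PROT_C", "organism 3", 1e-5),
    ]
}
ENCODING = {"PROT_A": ["GENE_A"], "PROT_B": ["GENE_B"]}  # PROT_C: no known gene
EXPRESSION = {
    ("brain", "up"): [
        ExpressionHit("GENE_A", 0.001),
        ExpressionHit("GENE_B", 0.04),
        ExpressionHit("GENE_X", 0.01),
    ]
}


def answer(protein_id, direction, condition):
    """Join the three sub-query results on protein and gene identifiers."""
    expressed = {h.gene: h for h in EXPRESSION.get((condition, direction), [])}
    rows = []
    for hit in SIMILARITY.get(protein_id, []):          # sub-query 1
        for gene in ENCODING.get(hit.protein_id, []):   # sub-query 2
            if gene in expressed:                       # sub-query 3
                rows.append((hit, gene, expressed[gene]))
    return rows


for hit, gene, expr in answer("QUERY", "up", "brain"):
    print(hit.protein_id, gene, hit.e_value, expr.p_value)
```

In this mock data, PROT_C is dropped because no encoding gene is known for it, and GENE_X is dropped because no similar protein maps to it; only combinations that satisfy all three sub-queries survive the join.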
The user has to submit the ID of a protein (e.g. "O75462") and the type of that ID (e.g. "uniprot"), the type of differential gene expression looked for (e.g. "up" for over-expression, "down" for under-expression, or "updown" for both), and the biological tissue/condition (e.g. "brain" as biological tissue, or "carcinoma" as biological condition) in which the gene expression is evaluated. In response, the system provides the list of proteins similar in sequence to the given protein, their similarity score (expectation), the list of genes that encode these proteins and their up- or down-expression in the given biological tissue/condition, together with their significance (p-value) based on the experimental gene expression data available in the ArrayExpress Gene Expression Atlas.
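One possible way to turn the two per-service scores into a single global ranking is sketched below in Python. The scoring function is an assumption made purely for illustration (the sum of the negative base-10 logarithms of the expectation value and the p-value); it is not claimed to be the ranking composition used by the platform. The row layout, the names `combined_score` and `rank_answers`, and the sample values are likewise hypothetical.

```python
import math

# Expression types accepted by the demonstration, as described above.
VALID_DIRECTIONS = {"up", "down", "updown"}


def combined_score(e_value, p_value, floor=1e-300):
    """Assumed global score: higher is better.

    Both inputs are "lower is better", so each is mapped to -log10(value);
    the floor avoids taking the logarithm of zero.
    """
    return -math.log10(max(e_value, floor)) - math.log10(max(p_value, floor))


def rank_answers(rows, direction):
    """Sort joined result rows by the assumed combined score, best first."""
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"unknown expression type: {direction!r}")
    return sorted(
        rows,
        key=lambda r: combined_score(r["e_value"], r["p_value"]),
        reverse=True,
    )


# Hypothetical joined rows: one per (similar protein, encoding gene) pair.
rows = [
    {"protein": "PROT_A", "gene": "GENE_A", "e_value": 1e-20, "p_value": 0.04},
    {"protein": "PROT_B", "gene": "GENE_B", "e_value": 1e-18, "p_value": 1e-6},
    {"protein": "PROT_C", "gene": "GENE_C", "e_value": 1e-5, "p_value": 1e-3},
]

for row in rank_answers(rows, "up"):
    print(row["protein"], row["gene"])
```

With these sample values, PROT_B ranks ahead of PROT_A even though PROT_A has the better expectation value, because its expression result is far more significant. This illustrates why a global ranking can differ from the ranking returned by any single service.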
Watch the demonstration video
Download Video: HD Quality (mov 1280x720 ~ 105Mb); High Resolution (m4v 950x540 ~ 81Mb); Mobile (m4v 480x272 ~ 42Mb)
