Grid Enabled BLAST (GEB) with Distributed Objects and Components

Sequence comparison algorithms and tools, and especially BLAST (Basic Local Alignment Search Tool), is one of the corner stones of bioinformatics. However, sequence databases are exploding in size, growing at an exponential rate --- currently doubling in about 14 months, exceeding Moore's Law for hardware which is about 18 months. Grid computing together with smart parallel BLAST, is a crucial technique to maintain and improve the effectiveness of sequence comparison.

The objective of this research Master is to define models, techniques, and tools for Grid Enabled BLAST (GEB).^1.

There are various software methods by which BLAST can be parallelized to achieve linear or even super-linear speedups:

Database Segmentation
In this model, a large sized database is splitted as several nearly equal sized databases. Each node stores one part of the database. The independent segments of the database are searched on each processor or node, and results are collated into a single output file. Several implementations of database segmentation exist, the first of which was within NCBI's BLAST itself. NCBI-BLAST implements database segmentation by multi-threading the search such that each processor in an SMP system is assigned a distinct portion of the database
Database segmentation has the big advantage that it can eliminate the high overhead of disk I/O. The size of bioinformatics databases are now larger than core memory on most computers, forcing BLAST searches to page to disk. Database segmentation permits each node to search a smaller portion of the database, thus reducing (or even eliminating) extraneous disk I/O, and hence, vastly improving BLAST performance.With sequence databases roughly doubling in size each year, the problem of extraneous disk I/O is expected to persist. The adverse effects of disk I/O are so significant that BLAST searches using database segmentation can exhibit super-linear speedup versus searches on a single node.
Query Segmentation
Query segmentation splits up a set of query sequences such that each node in a cluster or CPU on an SMP system searches a fraction of the query sequences. By doing so, several BLAST searches can execute in parallel on different queries. BLAST searches using query segmentation on a cluster typically replicate the entire database on each node's local storage system. If the database is larger than core memory, query-segmented searches suffer the same adverse effects of disk I/O as traditional BLAST.When the database fits in core memory, however, query segmentation can achieve near linear scalability for all BLAST search types, even on SMP architectures.

A first constraint is to work with NCBI standard for BLAST. One cannot afford to become incompatible with the constant flux of NCBI updates.

In order to achieve maximum flexibility and efficiency, we shall aimed at an hybrid approach, using both Database and Query Segmentation, depending on the problem to BLAST, and the available resources. Load-balancing and fault-tolerance are clearly two other issues to take into account. Also, the model and infrastructure being developed must be as flexible, as configurable, and as generic as possible. They must use modern distributed techniques, such as components.

The OASIS INRIA Sophia Antipolis project has been designing and implementing ProActive, a Java library for Parallel, Distributed, and Concurrent computing and GRID. In the framework of a MIMD model (active object model), from a reduced set of rather simple primitives, a comprehensive and versatile library is defined for parallel, distributed, and concurrent programming (middleware).
In the absence of any syntactical extension, ProActive programmers write standard code. The library is itself extensible by the programmers, making the system open for adaptations and optimizations.
The prototype implementation and experimentations will take advantage of the ProActive platform, including the component capability, in order to benefit from asynchronous and group communicationd, high-level synchronizationd (wait-by-necessity with Future pool), and graphical composition.

Overall, the work will include conceptual models and practical experiments. The following steps could be undertaken:

Analysis of existing models, techniques, and tools.
Analysis of modern tools for Grid computing, active object, component models, infrastructure of ProActive.
Proposition of a generic hybrid model for distributed BLAST.
Definition of an architecture to achieve this model within the ProActive platform.
Program a prototype implementation. Experiment and benchmark with use-case BLAST tests.
A simple performance prediction model, in order, for a given triplet (database(s), query, set of parameters), to authorize:
- rough estimate of time-to-result in a given configuration,
- suggested configuration for a desired time-to-result.

Note:
It might be considered to design an infrastructure that is valid to encapsulate other alignment tools, such as FASTA, FASTX, TFASTX, for instance. In that case, any specific concatenation would have to be achieved outside any alignment tool, but rather probably in a alignment-tool-specific part of the GEB system.

This research should lead to a PhD program (Thèse de Doctorat).

Advisor : Denis Caromel
Téléphone : 04 92 38 76 31 Email : Denis.Caromel@inria.fr
Laboratoire ou équipe : INRIA Sophia Antipolis -- I3S -- CNRS

In collaboration with :

Richard Christen, Claude Pasquier
Laboratoire de Biologie Virtuelle, UMR6543 CNRS - UNSA

Pascal Barbry
IPMC, Institut de Pharmacologie Moléculaire et Cellulaire, UMR 6097 CNRS/UNSA

Prerequisite : Some knowledge in Object-oriented languages, Distributed Programming, Bioinformatics.

Hardware and software to be used : Networks of PCs, Intranet and Internet P2P machines

Internship place: Sophia Antipolis, between Nice and Cannes, France

Bibliography:

S. Altschul, W. Gish, W. Miller, E. Myers and D.J. Lipman,
Basic local alignment search tool
Journal of Molecular Biology, 215:403-410, 1990.

S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Res., 25:3389-3402, 1997.

NCBI BLAST

FASTA

R. Bjornson, A. Sherman, S. Weston, N. Willard and J. Wing.
Turboblast: A parallel implementation of blast based on the turbohub process integration architecture
In IPDPS 2002 Workshops, April 2002.
Note: TurboBLAST is a closed-source commercial parallelization of NCBI BLAST by TurboWorx, Inc.

R. Braun, K. Pedretti, T. Casavant, T. Scheetz, C. Birkett, and C. Roberts:
Parallelization of local BLAST service on workstation clusters
Future Generation Computer Systems
17(6)}:745-754, April 2001.
Note: mpiBLAST is an open-source parallelization of BLAST using Message Passing Interface (MPI)

J. D. Grant, R. L. Dunbrack, F. J. Manion and M. F. Ochs:
BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf cluster
Bioinformatics, 18(5)}:765-766, 2002.
Note:BeoBLAST is an integrated software package for Linux Beowulf cluster using Perl script and daemon.

M. Dumontier, and Christopher WV Hogue
NBLAST: a cluster variant of BLAST for NxN comparisons
BMC Bioinformatics, 3:13, 2002,
Note: NBLAST is a cluster variant of BLAST for N X N comparisons, using C/C++ and Perl.

Grid-Aware GridBLAST System
Note: a web-based system for NCSA Linux Cluster or NCSA SGI origin 2000.

GridBlast
Note: a grid-enabled, high-throughput ``task farming'' application

1. if you wanna see a GEB, god of the earth, father of Osiris and Isis in mythological Egypt.