Grid Enabled BLAST (GEB) with Distributed Objects and Components
Sequence comparison algorithms and tools, and especially
BLAST
(Basic Local Alignment Search Tool),
is one of the corner stones of bioinformatics. However, sequence databases are exploding
in size, growing at an exponential rate --- currently doubling in about 14 months, exceeding
Moore's Law for hardware which is about 18 months.
Grid computing together with smart parallel BLAST,
is a crucial technique to maintain and improve the effectiveness of sequence comparison.
The objective of this research Master is to define models, techniques, and tools for
Grid Enabled BLAST (GEB).1.
There are various software methods by which BLAST can be parallelized to achieve linear or
even super-linear speedups:
- Database Segmentation
In this model, a large sized database is splitted as several nearly equal sized
databases. Each node stores one part of the database.
The independent segments of the database are searched
on each processor or node, and results are collated into a single output file.
Several implementations of database segmentation exist, the first of which was
within NCBI's BLAST itself. NCBI-BLAST implements database segmentation
by multi-threading the search such that each processor in an SMP system is
assigned a distinct portion of the database
Database segmentation has the big advantage that it can eliminate the high
overhead of disk I/O. The size of bioinformatics databases are now larger than
core memory on most computers, forcing BLAST searches to page to disk.
Database segmentation permits each node to search a smaller portion of the
database, thus reducing (or even eliminating) extraneous disk I/O, and hence,
vastly improving BLAST performance.With sequence databases roughly doubling in size
each year, the problem of extraneous disk I/O is expected to persist. The adverse
effects of disk I/O are so significant that BLAST searches using database
segmentation can exhibit super-linear speedup versus searches on a single node.
- Query Segmentation
Query segmentation splits up a set of query sequences such that each node in a
cluster or CPU on an SMP system searches a fraction of the query sequences.
By doing so, several BLAST searches can execute in parallel on different queries.
BLAST searches using query segmentation on a cluster typically replicate the
entire database on each node's local storage system. If the database is larger
than core memory, query-segmented searches suffer the same adverse effects of
disk I/O as traditional BLAST.When the database fits in core memory, however,
query segmentation can achieve near linear scalability for all BLAST search
types, even on SMP architectures.
A first constraint is to work with NCBI standard for BLAST.
One cannot afford to become incompatible with the constant flux of
NCBI updates.
In order to achieve maximum flexibility and efficiency, we shall aimed at
an hybrid approach, using both Database and Query Segmentation, depending
on the problem to BLAST, and the available resources.
Load-balancing and fault-tolerance are clearly two other issues to take into account.
Also, the model and infrastructure being developed must be as flexible,
as configurable, and as generic as possible. They must use modern
distributed techniques, such as components.
The OASIS INRIA Sophia Antipolis project has been designing and
implementing ProActive,
a Java library for Parallel, Distributed, and Concurrent computing and
GRID. In the framework of a MIMD model (active object model),
from a reduced set of rather simple primitives, a comprehensive and versatile
library is defined for parallel, distributed, and concurrent programming
(middleware).
In the absence of any syntactical extension, ProActive programmers
write standard code. The library is itself extensible by the programmers,
making the system open for adaptations and optimizations.
The prototype implementation and experimentations
will take advantage of the ProActive
platform, including the component capability,
in order to benefit from asynchronous and group communicationd, high-level synchronizationd (wait-by-necessity with
Future pool), and graphical composition.
Overall, the work will include conceptual models and practical experiments.
The following steps could be undertaken:
- Analysis of existing models, techniques, and tools.
- Analysis of modern tools for Grid computing, active object, component models, infrastructure of ProActive.
- Proposition of a generic hybrid model for distributed BLAST.
- Definition of an architecture to achieve this model within the ProActive platform.
-
Program a prototype implementation. Experiment and benchmark with use-case BLAST tests.
-
A simple performance prediction model, in order, for a given triplet (database(s), query, set of parameters), to authorize:
- rough estimate of time-to-result in a given configuration,
- suggested configuration for a desired time-to-result.
Note:
It might be considered to design an infrastructure that is valid to encapsulate other alignment tools, such as
FASTA, FASTX, TFASTX, for instance.
In that case, any specific concatenation would have to be achieved outside any
alignment tool, but rather probably in a alignment-tool-specific part of the GEB system.
This research should lead to a PhD program (Thèse de Doctorat).
Advisor : Denis Caromel
Téléphone : 04 92 38 76 31 Email :
Denis.Caromel@inria.fr
Laboratoire ou équipe : INRIA Sophia Antipolis -- I3S -- CNRS
In collaboration with :
Richard Christen, Claude Pasquier
Laboratoire de Biologie Virtuelle, UMR6543 CNRS - UNSA
Pascal Barbry
IPMC, Institut de Pharmacologie Moléculaire et Cellulaire, UMR 6097 CNRS/UNSA
Prerequisite : Some knowledge in Object-oriented languages, Distributed Programming, Bioinformatics.
Hardware and software to be used : Networks of PCs, Intranet and Internet P2P machines
Internship place: Sophia Antipolis, between Nice and Cannes, France
Bibliography:
S. Altschul, W. Gish, W. Miller, E. Myers and D.J. Lipman,
Basic local alignment search tool
Journal of Molecular Biology, 215:403-410, 1990.
S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller and D. J. Lipman:
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Res., 25:3389-3402, 1997.
NCBI BLAST
FASTA
R. Bjornson, A. Sherman, S. Weston, N. Willard and J. Wing.
Turboblast: A parallel implementation of blast based on the
turbohub process integration architecture
In IPDPS 2002 Workshops, April 2002.
Note: TurboBLAST is a closed-source commercial parallelization of NCBI BLAST by TurboWorx, Inc.
R. Braun, K. Pedretti, T. Casavant, T. Scheetz, C. Birkett, and C. Roberts:
Parallelization of local BLAST service on workstation clusters
Future Generation Computer Systems
17(6)}:745-754, April 2001.
Note: mpiBLAST is an open-source parallelization of BLAST using Message Passing Interface (MPI)
J. D. Grant, R. L. Dunbrack, F. J. Manion and M. F. Ochs:
BeoBLAST: distributed BLAST and PSI-BLAST on a Beowulf cluster
Bioinformatics, 18(5)}:765-766, 2002.
Note:BeoBLAST is an integrated software package for Linux Beowulf cluster using Perl script and daemon.
M. Dumontier, and Christopher WV Hogue
NBLAST: a cluster variant of BLAST for NxN comparisons
BMC Bioinformatics, 3:13, 2002,
Note: NBLAST is a cluster variant of BLAST for N X N comparisons, using C/C++ and Perl.
Grid-Aware GridBLAST System
Note: a web-based system for NCSA Linux Cluster or NCSA SGI origin 2000.
GridBlast
Note: a grid-enabled, high-throughput ``task farming'' application
1. if you wanna see a GEB, god of the earth, father of Osiris and Isis in mythological Egypt.