Saravá Results

We worked on three parallel tasks: P2P data management architecture, P2P query processing with uncertain data, and P2P workflow management. Each task produced significant results, summarized below.

P2P data management architecture. We analyzed the requirements of the Saravá project (data model, query processing, workflow management) using two selected applications. The first is a collaborative scientific application from bioinformatics in Brazil: the ProtozoaDB project (http://protozoadb.biowebdb.org) at Fiocruz, which addresses the so-called "neglected diseases", typical of tropical countries, which attract little investment from international laboratories. ProtozoaDB integrates different bioinformatics applications, heterogeneous databases, annotation systems, analysis tools and distributed computing. The increasing data volume in ProtozoaDB, combined with its workflow manipulation tools, demands high-performance data sharing. The second is a typical social network application such as Facebook, MySpace or Flickr, for which a P2P approach would give participants more control over their data and programs. These two applications share common requirements, such as high-level data access with some degree of uncertainty, but also differ: for instance, collaborative bioinformatics research may be quite demanding in terms of the quantity of data exchanged within workflows, while social networks may involve very large numbers of participants, some with small quantities of shared data. These differences are reflected in the P2P architecture, for which we proposed several instances of the generic design.

The first instance is P2PFlow [ODO+10], a P2P middleware for scientific workflow components with techniques that allow parallel workflow execution using the participants' computing resources. The second is FlowerCDN [DPA+11], a P2P content distribution network for social networks that enables any under-provisioned participant website to efficiently distribute its content with the help of the community interested in that content. FlowerCDN combines the efficiency of DHTs with the robustness of gossiping. In [APT+10], we also extended the DHT part of our design to support continuous timestamping, as a way to manage reconciliation of replicated data. The third instance, developed in 2010, is P2Prec [DPV10, DPK11], a recommendation service for P2P content sharing systems that exploits users' social data. The key idea is to recommend to a user high-quality documents on a specific topic, using the ratings of friends (or friends of friends) who are experts in that topic. To manage users' social data, we rely on Friend-Of-A-Friend (FOAF) descriptions. P2Prec has a hybrid P2P architecture so that it can work on top of any P2P content sharing system: it combines efficient DHT indexing, to manage the users' FOAF files, with the robustness of gossiping, to disseminate topics of expertise between friends.
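
To make the P2Prec idea concrete, here is a minimal, hypothetical sketch in Python of topic-based recommendation over a FOAF friend graph. All data structures, names and the scoring rule are illustrative assumptions for the example; the actual service runs on a DHT combined with gossiping, not on in-memory dictionaries.

```python
# Hypothetical sketch of the P2Prec idea: recommend to a user highly rated
# documents on a topic, using ratings of friends (or friends of friends)
# who are experts in that topic. All data below is made up for illustration.
from collections import defaultdict

# friends[u] = direct friends of user u (as listed in FOAF files)
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

# expertise[u] = topics user u is expert in
expertise = {
    "bob": {"genomics"},
    "carol": {"social-networks"},
    "dave": {"genomics"},
}

# ratings[u] = list of (document, topic, score in [0, 1]) given by user u
ratings = {
    "bob": [("doc-42", "genomics", 0.9)],
    "dave": [("doc-7", "genomics", 0.8), ("doc-42", "genomics", 0.7)],
    "carol": [("doc-3", "social-networks", 0.95)],
}

def recommend(user, topic, min_score=0.5):
    """Return documents on `topic` rated by expert friends or friends of
    friends of `user`, ranked by average expert rating."""
    # Collect friends and friends of friends (two hops in the FOAF graph).
    one_hop = friends.get(user, set())
    two_hop = set().union(*(friends.get(f, set()) for f in one_hop)) - {user}
    candidates = one_hop | two_hop
    # Keep only the candidates who are experts in the requested topic.
    experts = {u for u in candidates if topic in expertise.get(u, set())}
    # Aggregate their ratings per document.
    scores = defaultdict(list)
    for u in experts:
        for doc, t, score in ratings.get(u, []):
            if t == topic:
                scores[doc].append(score)
    avg = {doc: sum(s) / len(s) for doc, s in scores.items()}
    return sorted(((d, s) for d, s in avg.items() if s >= min_score),
                  key=lambda x: -x[1])

print(recommend("alice", "genomics"))   # bob and dave are the experts here
```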

P2P query processing with uncertain data. For the data model and query language with uncertainty, as anticipated in the initial proposal, we use a probabilistic data model for structured and unstructured data, described in [AVV11]. To address P2P query processing with data uncertainty, we first had to define the semantics of important queries, i.e. aggregate queries, in our probabilistic data model. In previous work, these queries were defined using expected-value semantics, i.e. returning the expected value of aggregate attributes over uncertain tuples, which has been shown to be insufficient for many applications. In [AVV11], we proposed new semantics that are very useful for uncertain applications such as social networks, together with algorithms that execute aggregate queries in polynomial time, the first polynomial-time algorithms proposed for these queries in uncertain databases. We also started to address the problem of estimating data confidence in P2P social networks, where the data shared by users is not 100% certain because it may be incomplete, inaccurate or out-of-date. Thus, we need to estimate the certainty degree (i.e. confidence) of the data. In our work, we rely on the knowledge of all participating users and use their feedback to estimate data confidence. We proposed a new data model, called the feedback graph, that captures the relations between users, their data and their feedback. Based on this model, we proposed a distributed approach for managing the feedback graph and computing the data confidence using a recursive formula. This distributed approach avoids any centralized control over confidences by a single participant.
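
As an illustration of the feedback-graph idea, the following sketch computes confidences by simple fixed-point iteration. The actual recursive formula is not reproduced here; the two update rules below (a datum's confidence as the trust-weighted average of the feedback it received, and a user's trust as their agreement with the current confidence estimates) are assumptions made for the example, and a distributed version would exchange these values between peers rather than iterate centrally.

```python
# Hypothetical sketch of confidence estimation over a feedback graph.
# feedback[(user, datum)] = score in [0, 1] that `user` gave about `datum`.
feedback = {
    ("u1", "d1"): 1.0, ("u2", "d1"): 0.9, ("u3", "d1"): 0.2,
    ("u1", "d2"): 0.8, ("u3", "d2"): 0.7,
}

users = {u for u, _ in feedback}
data = {d for _, d in feedback}

trust = {u: 1.0 for u in users}          # initial trust in each user
conf = {d: 0.5 for d in data}            # initial confidence of each datum

for _ in range(20):                      # iterate to an approximate fixed point
    # Confidence: trust-weighted average of the feedback on each datum.
    for d in data:
        num = sum(trust[u] * s for (u, dd), s in feedback.items() if dd == d)
        den = sum(trust[u] for (u, dd) in feedback if dd == d)
        conf[d] = num / den
    # Trust: 1 minus the mean disagreement between a user's feedback
    # and the current confidence estimates.
    for u in users:
        errs = [abs(s - conf[dd]) for (uu, dd), s in feedback.items() if uu == u]
        trust[u] = 1.0 - sum(errs) / len(errs)

# u3's outlier rating of d1 lowers u3's trust, so conf[d1] stays high.
print({d: round(c, 2) for d, c in conf.items()})
```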

P2P workflow management. We designed workflow management techniques that exploit parallel execution and fed them back into Task 1, in the form of a workflow management service of the P2PFlow middleware [ODO+10]. To reduce execution time, typical scientific workflow management systems exploit workflow parallelization in homogeneous computing environments, such as multiprocessor or cluster systems, with centralized control. Although successful, this solution no longer applies to heterogeneous computing environments, such as hybrid clouds, which may combine users' own computing resources with multiple edge clouds. Fortunately, the Many-Task Computing (MTC) paradigm has recently emerged to support the parallelization of computational tasks in heterogeneous computing environments. The main challenge of applying MTC to scientific workflows is to scale up to very large numbers of computational resources with varying levels of performance, reliability and dynamic behavior. In Saravá, we addressed this challenge with the P2PFlow middleware, using techniques that allow transparent parallelization of workflow tasks and dynamic scheduling of MTC (illustrated by the sketch below). To validate our approach, we developed a P2PFlow simulator called SciMule [DRO+10] and performed various experiments under typical scientific workflow scenarios. The results show that our P2P approach scales up very well to large numbers of computing nodes. In [ODO+11], we proposed an algebraic approach that enables automatic optimization of scientific workflows that manipulate huge amounts of data through specific programs and files. We conducted a thorough validation of this approach using both a real oil exploitation application and synthetic data scenarios. The experiments were run in Chiron, a data-centric scientific workflow engine implemented to support our algebraic approach, and demonstrate performance improvements of up to 226% compared to an ad-hoc workflow implementation.
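
The following is a small, hypothetical simulation of the dynamic, pull-based scheduling principle behind MTC: independent workflow activations sit in a shared bag of tasks, and each worker pulls a new one as soon as it becomes free, so faster nodes naturally execute more activations. Threads and simulated speeds stand in for the heterogeneous P2P nodes that P2PFlow and SciMule actually manage; all names here are illustrative.

```python
# Illustrative simulation of dynamic (self-scheduling) task distribution:
# heterogeneous workers pull independent activations from a shared queue.
import queue
import random
import threading
import time

tasks = queue.Queue()
for i in range(20):                      # 20 independent workflow activations
    tasks.put(f"activation-{i}")

done = []                                # (task, worker) completion log
lock = threading.Lock()

def worker(name, speed):
    """Pull tasks until the bag is empty; `speed` models node heterogeneity."""
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(random.uniform(0.01, 0.05) / speed)   # simulated work
        with lock:
            done.append((task, name))

# Three nodes with different performance levels.
threads = [threading.Thread(target=worker, args=(f"node-{i}", s))
           for i, s in enumerate([1.0, 2.0, 4.0])]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The fastest node should have completed the most activations.
for i in range(3):
    n = sum(1 for _, w in done if w == f"node-{i}")
    print(f"node-{i}: {n} activations")
```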

