Scientific Data Management

Latest News
Jan. 1, 2011: Creation of Zenith

Contact:
Patrick Valduriez (Firstname.Lastname@inria.fr)



Zenith seminar, March 30, 2011, 10:30, Salle des séminaires, LIRMM

Data Management for the Masses. Sihem Amer-Yahia, Yahoo! Research, New York, USA

Abstract: The fast-increasing content on collaborative tagging sites, such as Delicious, and collaborative rating sites, such as MovieLens, requires the development of scalable and efficient search and recommendation techniques. I will describe two concrete applications to illustrate the challenges behind data management for personalized content discovery on those sites. In the first application, network-aware search, I will argue that obvious adaptations of well-known top-k algorithms require maintaining per-(seeker, keyword) indexes, due to the dependence of scores on the seeker’s interest network. I will therefore investigate two space-saving solutions for network-aware search: network clustering and behavior clustering. In the second application, group recommendation, the quality of a recommendation is a function of disagreement among group members. This calls for maintaining pairwise user-disagreement indexes. Therefore, I will explore two space-saving strategies for group recommendation: behavior factoring and partial materialization. I will show that scalable and efficient search and recommendation techniques rely on exploring the balance between storage volume and response time in both applications.
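As a toy illustration of the disagreement idea, here is a minimal Python sketch of a group score that trades off average relevance against pairwise disagreement. The linear combination, the weight `w`, and the function name are illustrative assumptions, not the method presented in the talk:

```python
from itertools import combinations

def group_score(ratings, w=0.8):
    """Toy group recommendation score for one item.

    ratings maps each group member to a predicted rating for the item.
    The weight w (an assumption here) balances relevance vs. disagreement.
    """
    users = list(ratings)
    relevance = sum(ratings[u] for u in users) / len(users)
    pairs = list(combinations(users, 2))
    # Disagreement: mean absolute pairwise rating difference.
    disagreement = (
        sum(abs(ratings[a] - ratings[b]) for a, b in pairs) / len(pairs)
        if pairs else 0.0
    )
    return w * relevance - (1 - w) * disagreement
```

Note that the disagreement term ranges over all user pairs; materializing such pairwise terms for every pair is exactly what makes a naive index space-hungry, which is what motivates the factoring and partial-materialization strategies above.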

Acknowledgements: Network-aware search is joint work with Michael Benedikt (Oxford), Laks Lakshmanan (UBC) and Julia Stoyanovich (UPenn). Group recommendation is joint work with Gautam Das (UT Arlington), Senjuti Basu Roy (UT Arlington) and Cong Yu (Google Inc.).

Bio: Sihem has been a Senior Research Scientist at Yahoo! Labs since May 2006, after 7 years at AT&T Labs. She received her Ph.D. in CS from U. Paris-Orsay and INRIA, France. Sihem focuses on data management, query processing and relevance models that leverage social behavior for online content serving. Her professional activities include chairing the SIGMOD 2009 Tutorials, the Social Networks and Personal Information track at ICDE 2010, the Structured and Unstructured Data track at WWW 2010, the SIGMOD 2011 Information Retrieval and Extraction track, and the EDBT/ICDT 2011 Industrial track. She is a member of the Board of Trustees of the VLDB Endowment and of the ACM SIGMOD Executive Committee. Sihem serves as an Area Editor for TODS, the VLDB Journal, and the Information Systems Journal. She is currently visiting the Yahoo! Barcelona Lab.


Zenith seminar, September 16, 2011, 10:30, Salle des séminaires, LIRMM

Distributed Web Search. Vincent Leroy, Yahoo! Research, Barcelona, Spain

Abstract: An appealing solution for scaling Web search with the growth of the Internet is the use of distributed architectures. Distributed search engines rely on multiple sites deployed in distant regions across the world, where each site is specialized to serve queries issued by the users of its region. Distributed search raises several challenges. In order to preserve the quality of the results, all documents should be taken into account during the evaluation of a query. However, for scalability reasons, each search site can only index a subset of the documents. When a user query requests a document that is not indexed locally, the search site has to contact the other sites in order to compute an exact result. As sites are distributed in different regions, this generates additional latency and reduces the satisfaction of the user. In this presentation, I will describe the general architecture of a distributed search engine. Then, I will focus on the problem of assigning new documents to a search site.
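The local-versus-remote trade-off can be illustrated with a deliberately simplified Python sketch. The class name and the miss-only forwarding policy are my assumptions (a real engine would have to contact remote sites whenever the local result may be incomplete, not only on a total miss):

```python
class SearchSite:
    """Toy model of one regional search site in a distributed engine."""

    def __init__(self, name, local_index):
        self.name = name
        self.local_index = local_index  # term -> set of document ids
        self.peers = []                 # remote sites, contacted on a miss

    def search(self, term):
        hits = set(self.local_index.get(term, set()))
        if not hits:
            # Local miss: ask remote sites for their postings.
            # Each remote round trip adds cross-region latency.
            for peer in self.peers:
                hits |= peer.local_index.get(term, set())
        return hits
```

In this sketch, the document-assignment problem from the talk amounts to deciding which site's `local_index` a new document should enter so that most queries are answered without the remote loop.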


Colloquium on Cloud Computing, October 18, 2011, 10:30-12:00, Amphi Saint Priest. Organized by Zenith

Data Management in the Cloud. Amr El Abbadi, University of California, Santa Barbara

Abstract: Over the past two decades, database and systems researchers have made significant advances in the development of algorithms and techniques to provide data management solutions that carefully balance the three major requirements when dealing with critical data: high availability, reliability, and data consistency. However, over the past few years, the data requirements, in terms of data availability and system scalability, from Internet-scale enterprises that provide services and cater to millions of users have been unprecedented. Cloud computing has emerged as an extremely successful paradigm for deploying Internet and Web-based applications. Scalability, elasticity, pay-per-use pricing, and autonomic control of large-scale operations are the major reasons for the successful widespread adoption of cloud infrastructures. Current proposed solutions to scalable data management, driven primarily by prevalent application requirements, significantly downplay the data consistency requirements and instead focus on high scalability and resource elasticity to support data-rich applications for millions to tens of millions of users. However, the growing popularity of cloud computing, the resulting shift of a large number of Internet applications to the cloud, and the quest towards providing data management services in the cloud have opened up the challenge of designing data management systems that provide consistency guarantees at a granularity that goes beyond single rows and keys. In this talk, we analyze the design choices that allowed modern scalable data management systems to achieve orders of magnitude higher levels of scalability compared to traditional databases. With this understanding, we highlight some design principles for data management systems that can be used to augment existing databases with new cloud features such as scalability, elasticity, and autonomy. We then present two systems that leverage these principles.
The first system, G-Store, provides transactional guarantees on data granules formed on-demand while being efficient and scalable. The second system, ElasTraS, provides elastically scalable transaction processing using logically contained database partitions. Finally, we will present two techniques for on-demand live database migration, a primitive operation critical to provide lightweight elasticity as a first class notion in the next generation of database systems. The first technique, Albatross, supports live migration in a multitenant database serving OLTP style workloads where the persistent database image is stored in network attached storage. The second technique, Zephyr, efficiently migrates live databases in a shared nothing transactional database architecture.

Bio: Amr El Abbadi is currently Professor and Chair of the Computer Science Department at the University of California, Santa Barbara. He received his B.Eng. in Computer Science from Alexandria University, Egypt, and received his Ph.D. in Computer Science from Cornell University in August 1987. Prof. El Abbadi is an ACM Fellow. He has served as a journal editor for several database journals, including, currently, The VLDB Journal. He has been Program Chair for multiple database and distributed systems conferences, most recently SIGSPATIAL GIS 2010 and the ACM Symposium on Cloud Computing (SoCC) 2011. He also served as a board member of the VLDB Endowment from 2002 to 2008. In 2007, Prof. El Abbadi received the UCSB Senate Outstanding Mentorship Award for his excellence in mentoring graduate students. He has published over 250 articles in databases and distributed systems.


Zenith Ph.D. thesis defense, March 9, 2012, 10:30, Salle des séminaires, LIRMM

P2P Recommendation for Large-scale Online Communities, by Fady Draidi

Ph.D. committee:
- Mohand-Said Hacid, Professor, Université Lyon I (President)
- Sihem Amer-Yahia, DR CNRS, LIG, Grenoble (Reviewer)
- Abdelkader Hameurlain, Professor, Université Toulouse III (Reviewer)
- Marie-Laure Mugnier, Professor, Université Montpellier II (Examiner)
- Esther Pacitti, Professor, Université Montpellier II (Advisor)
- Patrick Valduriez, DR INRIA, LIRMM, Montpellier (Co-advisor)

Abstract: Recommendation systems (RS) and P2P are complementary in easing large-scale data sharing: RS to filter and personalize users’ demands, and P2P to build decentralized large-scale data sharing systems. However, many challenges need to be overcome when building a scalable, reliable and efficient RS atop P2P. In this work, we focus on large-scale communities, where users rate the contents they explore and store in their local workspace high-quality content related to their topics of interest. Our goal is then to provide a novel and efficient P2P-RS for this context. We exploit users’ topics of interest (automatically extracted from users’ contents and ratings) and social data (friendship and trust) as parameters to construct and maintain a social P2P overlay and generate recommendations. The thesis addresses several related issues. First, we focus on the design of a scalable P2P-RS, called P2Prec, by leveraging collaborative- and content-based filtering recommendation approaches. We then propose the construction and maintenance of a dynamic P2P overlay using different gossip protocols. Our performance experimentation results show that P2Prec is able to achieve good recall with acceptable query processing load and network traffic. Second, we consider a more complex infrastructure in order to build and maintain a social P2P overlay, called F2Frec, which exploits social relationships between users. In this new infrastructure, we leverage content- and social-based filtering in order to get a scalable P2P-RS that yields high-quality and reliable recommendation results. Based on our extensive performance evaluation, we show that F2Frec increases recall, as well as the trust and confidence of the results, with acceptable overhead. Finally, we describe our P2P-RS prototype, which we developed to validate our proposal based on P2Prec and F2Frec.
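To give a flavor of the gossip-based overlay maintenance mentioned above, here is a minimal Python sketch of one round of a peer-sampling gossip exchange. The view size, the exchange policy, and the truncation rule are illustrative assumptions; this is not the actual P2Prec or F2Frec protocol:

```python
import random

def gossip_round(views, view_size=3):
    """One toy gossip round over a peer-sampling overlay.

    views maps each peer to its current set of neighbors (its "view").
    Each peer contacts one random neighbor, the two exchange view
    entries, and both truncate their views back to view_size.
    """
    for peer, view in views.items():
        if not view:
            continue
        partner = random.choice(sorted(view))
        sample = set(random.sample(sorted(view), min(2, len(view))))
        # Partner learns about this peer and a sample of its neighbors.
        views[partner] = (views[partner] | sample | {peer}) - {partner}
        # Peer keeps the partner in its own view.
        views[peer] = (view | {partner}) - {peer}
        # Bound both views (deterministic truncation keeps the sketch simple).
        views[peer] = set(sorted(views[peer])[:view_size])
        views[partner] = set(sorted(views[partner])[:view_size])
```

Repeating such rounds keeps every peer's view small and fresh, which is the usual building block for constructing and maintaining decentralized overlays of this kind; a semantic or social variant would bias the choice of `partner` and `sample` by shared topics of interest or trust rather than picking uniformly at random.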

Keywords: P2P systems, recommendation systems (RS), online communities, social networks, information retrieval, large-scale data management.


Zenith meeting, April 4, 2012, 10:30, La Galera, LIRMM

Principles of Distributed Data Management in 2020? Patrick Valduriez (based on his DEXA 2011 keynote talk)

Abstract: With the advent of high-speed networks, fast commodity hardware, and the web, distributed data sources have become ubiquitous. The third edition of the Özsu-Valduriez textbook Principles of Distributed Database Systems reflects the evolution of distributed data management and distributed database systems. In this new edition, the fundamental principles of distributed data management can still be presented along the three dimensions of earlier editions: distribution, heterogeneity and autonomy of the data sources. In retrospect, the focus on fundamental principles and generic techniques has been useful not only to understand and teach the material, but also to enable an infinite number of variations. The primary application of these generic techniques has obviously been distributed and parallel DBMS versions. Today, to support the requirements of important data-intensive applications (e.g. social networks, web data analytics, scientific applications), new distributed data management techniques and systems (e.g. MapReduce, Hadoop, SciDB, PNUTS, Pig Latin) are emerging and receiving much attention from the research community. Although they do well in terms of consistency/flexibility/performance trade-offs for specific applications, they seem ad hoc and might hurt data interoperability. The key questions I discuss are: What are the fundamental principles behind the emerging solutions? Is there any generic architectural model to explain those principles? Do we need new foundations to look at data distribution?

