Modern sciences such as agronomy, bioinformatics, and environmental science must deal with overwhelming amounts of experimental data. Such data must be processed (cleaned, transformed, analyzed) in many ways in order to draw new conclusions, prove scientific theories, and produce knowledge. However, constant progress in scientific observational instruments and simulation tools creates a huge data overload. For example, climate modeling data are growing so fast that collections are expected to reach hundreds of exabytes by 2020. Scientific data are also very complex, in particular because of the heterogeneous methods used to produce data, the uncertainty of captured data, the inherently multi-scale nature of many sciences, and the growing use of imaging, which results in data with hundreds of attributes, dimensions, or descriptors. Processing and analyzing such massive sets of complex data is therefore a major challenge, since solutions must combine new data management techniques with large-scale parallelism in cluster, grid, or cloud environments.
The three main challenges of scientific data management can be summarized as follows:
- scale (big data, big applications);
- complexity (uncertain, multi-scale data with many dimensions);
- heterogeneity (in particular, data semantics heterogeneity).
The overall goal of Zenith is to address these challenges by proposing innovative solutions with significant advantages in terms of scalability, functionality, ease of use, and performance. We plan to design and validate our solutions by working closely with scientific application partners. To further validate our solutions and extend the scope of our results, we also want to foster industrial collaborations, including in non-scientific applications, provided that they exhibit similar challenges.