Foundations of Geometric Methods in Data Analysis

Academic year: 2017-2018

Instructors:
  • Frédéric Cazals, Inria ABS
  • Fred Chazal, Inria DataShape

    Data analysis is the process of cleaning, transforming, modelling or comparing data, in order to infer useful information and gain insights into complex phenomena. From a geometric perspective, when an instance (a physical phenomenon, an individual, etc.) is given as a fixed-size collection of real-valued observations, it is naturally identified with a geometric point having these observations as coordinates. Any collection of such instances is then seen as a point cloud sampled in some metric or normed space.

    This course reviews fundamental constructions related to the manipulation of such point clouds, mixing ideas from computational geometry and topology, statistics, and machine learning. The emphasis is on methods that not only come with theoretical guarantees, but also work well in practice. In particular, software references and example datasets will be provided to illustrate the constructions.

    Course outline. The course consists of the following eight lectures (3h each):

    1. Nearest neighbor search in Euclidean and metric spaces: data structures and algorithms

    Summary. For data modeled as a point cloud, reporting the nearest neighbors (NN) of a given point is a fundamental operation used throughout data analysis. Efficient NN search methods will be reviewed, both for Euclidean and metric spaces.

    Euclidean spaces:

    Metric spaces:

    Selected important metrics, with applications:
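    To make the lecture's central operation concrete, here is a minimal sketch of a k-d tree supporting exact nearest-neighbor queries in Euclidean space. It is an illustrative toy, not a production structure (no balancing heuristics, no approximate queries); the sample points are arbitrary.

```python
# Minimal k-d tree with an exact nearest-neighbor query (illustrative sketch).
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_kdtree(points, depth=0):
    """Recursively split points along coordinate axes, cycling the axis by depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nn_query(node, q, best=None):
    """Return the tree point closest to q, pruning subtrees that cannot contain it."""
    if node is None:
        return best
    p, axis = node["point"], node["axis"]
    if best is None or dist(q, p) < dist(q, best):
        best = p
    near, far = (node["left"], node["right"]) if q[axis] < p[axis] \
        else (node["right"], node["left"])
    best = nn_query(near, q, best)
    # Visit the far side only if the splitting plane is closer than the best so far.
    if abs(q[axis] - p[axis]) < dist(q, best):
        best = nn_query(far, q, best)
    return best

pts = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(pts)
nearest = nn_query(tree, (9.0, 2.0))  # (8.0, 1.0) is the closest sample point
```

    Pruning by comparing the distance to the splitting plane with the current best distance is what makes the query sublinear on well-spread low-dimensional data; in high dimension the pruning degrades, which motivates the next lecture.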

    Bibliography

    2. Nearest neighbors in Euclidean and metric spaces: analysis

    Summary. This lecture will develop selected mathematical insights into the data structures and algorithms seen in the first lecture.

    High-dimensional spaces and measure concentration phenomena: introduction and implications for NN algorithms:
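    The concentration phenomenon can be observed with a small experiment: for random points in the unit cube, the relative contrast between the farthest and nearest distances from a query shrinks as the dimension grows, so nearest-neighbor ranking carries less and less information. The setup below (uniform points, sample sizes, seed) is purely illustrative.

```python
# Toy illustration of distance concentration in high dimension.
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(dmax - dmin) / dmin between a random query and n_points random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(query, p))))
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(dim=2)      # nearest point much closer than farthest
high = relative_contrast(dim=1000)  # all distances look nearly the same
```

    The ratio in dimension 1000 is an order of magnitude smaller than in dimension 2, which explains why pruning-based NN structures tend toward brute force in high dimension.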

    Bibliography

    3. Dimensionality reduction algorithms

    Summary. Dimensionality reduction methods aim at embedding high-dimensional data in lower-dimensional spaces, while preserving specific properties (pairwise distances, data spread, etc.).
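    As a first taste of distance-preserving embeddings, here is a sketch of random projection in the spirit of the Johnson-Lindenstrauss lemma: projecting onto a random lower-dimensional subspace approximately preserves pairwise distances. The dimensions, scaling, and data below are illustrative choices, not prescriptions.

```python
# Random projection sketch: pairwise distances are roughly preserved.
import math
import random

rng = random.Random(1)
d, k, n = 500, 100, 10  # ambient dimension, target dimension, number of points

points = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
# A Gaussian matrix scaled by 1/sqrt(k) approximately preserves Euclidean norms.
proj = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]

def project(p):
    return [sum(row[j] * p[j] for j in range(d)) for row in proj]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

low_points = [project(p) for p in points]
distortions = []  # ratio of projected distance to original distance, per pair
for i in range(n):
    for j in range(i + 1, n):
        distortions.append(dist(low_points[i], low_points[j]) /
                           dist(points[i], points[j]))
```

    All pairwise distance ratios stay close to 1 even though the dimension dropped from 500 to 100; methods such as PCA or multidimensional scaling instead optimize data-dependent criteria.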

    Bibliography

    4. Covers and nerves: geometric inference and the Mapper algorithm

    Summary. A simple but powerful idea is that the global structure (or shape) of data can be understood by first locally grouping together some observations (through local clustering or some other procedure) and then providing a combinatorial description of the intersection patterns of the obtained groups. This class will introduce the notions of simplicial complexes and nerves of covers, which formalize this idea. It will describe some resulting useful tools, such as the so-called Mapper algorithm for exploratory data analysis, as well as mathematically well-founded results for inferring relevant topological information about the global structure of data.
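    The nerve construction is easy to state in code: each cover set becomes a vertex, and a simplex is added whenever the corresponding sets have a common point. The sketch below builds the nerve of a hand-written cover; Mapper applies the same construction to clusters of preimages of a filter function. The cover itself is an illustrative example.

```python
# Nerve of a cover: vertices are cover sets, simplices record common intersections.
from itertools import combinations

# A cover of the points {0,...,7} by four overlapping sets (illustrative).
cover = [
    {0, 1, 2},
    {2, 3, 4},
    {4, 5, 6},
    {6, 7, 0},
]

def nerve(cover, max_dim=2):
    """Return all simplices (tuples of cover indices) up to dimension max_dim."""
    simplices = []
    for k in range(1, max_dim + 2):  # simplices with k vertices
        for idxs in combinations(range(len(cover)), k):
            common = set.intersection(*(cover[i] for i in idxs))
            if common:
                simplices.append(idxs)
    return simplices

complex_ = nerve(cover)
edges = [s for s in complex_ if len(s) == 2]
# The four sets overlap in a cyclic pattern, so the nerve is a 4-cycle:
# it captures the "loop" shape of the data with just four vertices.
```

    The Nerve Theorem gives conditions (e.g. a good cover) under which this tiny combinatorial object has the same homotopy type as the union of the cover.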

    Bibliography

    5. Clustering algorithms and introduction to persistent homology

    Summary. Clustering is a fundamental problem in data analysis and machine learning that has led to a wide variety of techniques and algorithms. This class will focus on a few clustering methods with a geometric flavor. It will also give us the opportunity to first introduce the notion of persistent homology as a useful tool to improve the performance of some clustering algorithms.
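    A geometric clustering method with a direct link to persistence is single-linkage: sort pairwise distances and merge components in increasing order, stopping at a scale threshold. The merge scales are exactly the kind of information that persistent homology organizes into birth/death pairs. A minimal union-find sketch, with illustrative data:

```python
# Single-linkage clustering via union-find over edges sorted by length.
import math

def single_linkage(points, threshold):
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(len(points)) for j in range(i + 1, len(points))
    )
    for d, i, j in edges:
        if d > threshold:
            break  # edges are sorted, so all remaining ones are too long
        parent[find(i)] = find(j)
    return {find(i) for i in range(len(points))}  # one root per cluster

# Two well-separated groups on the line.
pts = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
clusters = single_linkage(pts, threshold=1.0)  # two clusters at this scale
```

    Choosing the threshold is the hard part; persistence-based methods instead keep the clusters whose merge scale (death) is far from their appearance scale (birth).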

    Bibliography

    6. Statistical hypothesis testing and two-sample tests (TST)

    Summary. Testing statistical hypotheses is routine in data analysis. Of particular interest for multivariate data are two-sample tests (TST), which aim at checking whether two populations, modeled as point clouds in some Euclidean space, come from the same distribution (the null hypothesis) or not. This lecture presents the basics of hypothesis testing and TST.
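    The permutation scheme underlying many two-sample tests fits in a few lines: under the null hypothesis the pooled sample is exchangeable, so reshuffling the group labels yields the null distribution of any test statistic. The sketch below uses the difference of means on univariate data for simplicity; real TSTs for point clouds use richer statistics (e.g. kernel-based ones). Data and seeds are illustrative.

```python
# Permutation two-sample test with the difference of means as statistic.
import random

def perm_test(x, y, n_perm=2000, seed=0):
    """Return a permutation p-value for H0: x and y share a distribution."""
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = x + y
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the pooled sample at random
        xs, ys = pooled[:len(x)], pooled[len(x):]
        if abs(sum(xs) / len(xs) - sum(ys) / len(ys)) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # standard +1 finite-sample correction

rng = random.Random(42)
a = [rng.gauss(0.0, 1.0) for _ in range(40)]
b = [rng.gauss(2.0, 1.0) for _ in range(40)]  # clearly shifted population
p = perm_test(a, b)  # small p-value: the null should be rejected
```

    The same skeleton works unchanged for multivariate statistics: only the test statistic needs to be replaced.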

    Bibliography

    7. Comparing high-dimensional distributions, with applications to feature selection

    Summary. Upon performing a two-sample test (lecture 6), if the null is rejected, one is left with a binary piece of information. This class will develop techniques to expand this information, in two directions. First, we shall address the problem of identifying regions in parameter space accounting for the differences revealed by the TST. Second, we shall address the question of selecting the features which account for the observed differences.
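    A naive baseline for the second question is to score each coordinate separately by a standardized mean difference between the two samples and rank features by that score. This is only a hypothetical illustration of the idea (univariate scores miss feature interactions); the synthetic data differing in a single coordinate is made up for the example.

```python
# Discrepancy-based feature ranking via standardized mean differences.
import math
import random

rng = random.Random(3)
d, n = 5, 200
# Two synthetic populations that differ only along coordinate 2.
pop_a = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
pop_b = [[rng.gauss(3 if j == 2 else 0, 1) for j in range(d)] for _ in range(n)]

def score(j):
    """Absolute standardized difference of means along coordinate j."""
    xa = [p[j] for p in pop_a]
    xb = [p[j] for p in pop_b]
    ma, mb = sum(xa) / n, sum(xb) / n
    va = sum((x - ma) ** 2 for x in xa) / (n - 1)
    vb = sum((x - mb) ** 2 for x in xb) / (n - 1)
    return abs(ma - mb) / math.sqrt((va + vb) / 2)

ranking = sorted(range(d), key=score, reverse=True)  # coordinate 2 ranks first
```

    Methods covered in the lecture refine this baseline, in particular to capture differences that no single coordinate reveals.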

    Bibliography

    8. Shape signatures: stability and statistical aspects

    Summary. This class will extend the notion of persistent homology introduced in the class on clustering. It will be shown how the theory can be used to extract relevant topological information from data and convert it into features for further data analysis and machine learning tasks.
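    In dimension 0 the persistence computation is short enough to sketch: sweeping the sublevel sets of a function on a path graph, a component is born at each local minimum and, at each merge, the younger component dies (the elder rule). This minimal sketch handles a hand-written function; libraries such as GUDHI implement the general theory.

```python
# 0-dimensional persistence of sublevel sets of a function on a path graph.
def sublevel_persistence(values):
    parent = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    birth = {}   # component root -> birth value
    pairs = []   # finite (birth, death) pairs
    for i in sorted(range(len(values)), key=lambda i: values[i]):
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):  # neighbors on the path graph
            if j in parent:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Elder rule: the component born later dies at this merge.
                    young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                    if birth[young] < values[i]:  # discard zero-persistence pairs
                        pairs.append((birth[young], values[i]))
                    parent[young] = old
    essential = min(values)  # one component never dies
    return pairs, essential

# f has local minima at values 0 and 1; they merge when the sweep reaches 4.
pairs, essential = sublevel_persistence([3, 1, 4, 0, 2, 5])
```

    The resulting (birth, death) pairs form the persistence diagram, which is the raw object that stability theorems protect and that vectorization schemes turn into machine-learning features.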

    Bibliography