Foundations of Geometric Methods in Data Analysis


Academic year: 2016-2017

  • Frédéric Cazals, Inria ABS
  • Steve Oudot, Inria Geometrica

    Data analysis is the process of cleaning, transforming, modelling or comparing data, in order to infer useful information and gain insights into complex phenomena. From a geometric perspective, when an instance (a physical phenomenon, an individual, etc.) is given as a fixed-size collection of real-valued observations, it is naturally identified with a geometric point having these observations as coordinates. Any collection of such instances is then seen as a point cloud sampled in some metric or normed space.

    This course reviews fundamental constructions related to the manipulation of such point clouds, mixing ideas from computational geometry and topology, statistics, and machine learning. The emphasis is on methods that not only come with theoretical guarantees, but also work well in practice. In particular, software references and example datasets will be provided to illustrate the constructions.

    Course outline. The course consists of the following eight lectures (3h each):

    1. Nearest neighbors in Euclidean and metric spaces: search data structures and algorithms

    Summary. For data modeled as a point cloud, reporting the nearest neighbors (NN) of a given point is a fundamental operation used pervasively in data analysis. Efficient NN search methods will be reviewed, both for Euclidean and metric spaces.

    Euclidean spaces:

    Metric spaces:

    Selected important metrics, with applications:
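
    As an illustration of Euclidean NN search, the sketch below queries a k-d tree, one classical data structure for this problem; the random point cloud, the query point, and the choice k=5 are arbitrary assumptions made for the example, and scipy's cKDTree merely stands in for the structures covered in the lecture.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((1000, 3))        # synthetic point cloud in R^3
tree = cKDTree(points)                # k-d tree built once, queried many times

query = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(query, k=5)    # 5 nearest neighbors of `query`
print(idx)                            # indices into `points`, nearest first
```

    For metric (non-Euclidean) spaces, structures such as cover trees or vp-trees play the analogous role, relying only on the triangle inequality rather than on coordinates.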


    2. Nearest neighbors in Euclidean and metric spaces: analysis

    Summary. This lecture will develop selected insights and mathematical analyses related to the data structures and algorithms seen in the first lecture.

    High-dimensional spaces and measure concentration phenomena: introduction and implications for NN algorithms:
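
    The concentration phenomenon can be observed empirically. The sketch below (uniform data and sample sizes chosen purely for illustration) measures how the spread of distances to the origin shrinks relative to the nearest distance as the dimension grows, which is one reason NN queries lose discriminating power in high dimension.

```python
import numpy as np

rng = np.random.default_rng(1)

def contrast(dim, n=2000):
    """Relative contrast (d_max - d_min) / d_min of the distances from
    the origin to n uniform points in [0,1]^dim; shrinks as dim grows."""
    pts = rng.random((n, dim))
    d = np.linalg.norm(pts, axis=1)
    return (d.max() - d.min()) / d.min()

for dim in (2, 10, 100, 1000):
    print(dim, contrast(dim))   # contrast decreases with the dimension
```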


    3. Dimensionality reduction algorithms: basic methods

    Summary. Dimensionality reduction methods aim at embedding high-dimensional data into lower-dimensional spaces, while preserving specific properties (pairwise distances, data spread, etc.).
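
    As a concrete baseline among such methods, principal component analysis (PCA) preserves data spread by projecting onto the top singular directions of the centered data matrix. The sketch below is a minimal NumPy implementation; the synthetic near-planar dataset is an assumption made for the example.

```python
import numpy as np

def pca(X, k):
    """Project X (n x d) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                    # center the data
    # SVD of the centered matrix; rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                       # n x k embedding

rng = np.random.default_rng(2)
# 3-D points lying near a 2-D plane, plus a little noise
X = rng.random((200, 2)) @ rng.random((2, 3)) + 0.01 * rng.random((200, 3))
Y = pca(X, 2)
print(Y.shape)   # (200, 2)
```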


    4. Dimensionality reduction algorithms: advanced methods

    Summary. Following the previous class, this class will dive into more advanced dimensionality reduction techniques, in particular spectral techniques. Along the way, connections with other important problems will be studied, in particular graph partitioning.
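
    To hint at the link between spectral methods and graph partitioning, the sketch below bisects a small graph using the sign of the Fiedler vector, i.e. the eigenvector of the unnormalized graph Laplacian L = D - A associated with the second-smallest eigenvalue. The two-triangles-plus-a-bridge graph is a toy example chosen for illustration.

```python
import numpy as np

def fiedler_partition(A):
    """2-way spectral partition from the sign of the Fiedler vector
    of the unnormalized Laplacian L = D - A."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    return vecs[:, 1] >= 0           # boolean cluster labels

# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3)
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
labels = fiedler_partition(A)
print(labels)   # one triangle per side of the cut
```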


    5. Clustering algorithms

    Summary. Clustering is a ubiquitous problem in the sciences. Here we focus on methods that have a highly geometric flavor. This is also an opportunity to introduce the basics of Morse theory and size theory.
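
    As a point of reference among geometric clustering methods, the sketch below implements Lloyd's algorithm for k-means on two well-separated synthetic blobs; the dataset and parameters are illustrative assumptions, and the Morse-theoretic methods of the lecture are not represented here.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: alternate point assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(
            np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(3)
# two tight, well-separated Gaussian blobs in the plane
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(3, 0.1, (50, 2))])
labels, centers = kmeans(X, 2)
```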


    6. Statistical hypothesis testing and two-sample tests (TST)

    Summary. Testing statistical hypotheses is routine in data analysis. Of particular interest for (multi-)variate data are two-sample tests (TST), which aim at checking whether two populations, modeled as point clouds in some Euclidean space, come from the same distribution (the null hypothesis) or not. This lecture presents the basics of hypothesis testing and TST.
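
    A simple nonparametric TST can be sketched as a permutation test: under the null, the pooled points are exchangeable, so the observed statistic is compared against its distribution over random relabelings. The statistic used here (distance between sample means) and the Gaussian data are assumptions made for the example; kernel statistics such as MMD are common, more powerful alternatives.

```python
import numpy as np

def permutation_tst(X, Y, n_perm=2000, seed=0):
    """Permutation two-sample test with the distance between sample
    means as the statistic; returns an estimated p-value."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = len(X)
    stat = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(Z))
        s = np.linalg.norm(Z[perm[:n]].mean(axis=0)
                           - Z[perm[n:]].mean(axis=0))
        count += s >= stat
    return (count + 1) / (n_perm + 1)   # add-one to avoid a zero p-value

rng = np.random.default_rng(4)
X = rng.normal(0.0, 1.0, (100, 3))
Y = rng.normal(0.8, 1.0, (100, 3))   # clearly shifted distribution
p = permutation_tst(X, Y)
print(p)   # small p-value: evidence against the null
```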


    7. Comparing high-dimensional distributions, with applications to feature selection

    Summary. Upon performing a two-sample test (lecture 6), if the null is rejected, one is left with a binary piece of information. This class will develop techniques to expand this information in two directions. First, we shall address the problem of identifying the regions in parameter space that account for the differences revealed by the TST. Second, we shall address the question of selecting the features that account for the observed differences.
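
    A naive way to go from a rejected null to candidate features is to run one univariate two-sample test per coordinate and rank the p-values. The sketch below does this with the Kolmogorov-Smirnov test on synthetic data where, by construction, only one feature differs; corrections for multiple testing (e.g. Bonferroni) are deliberately omitted for brevity.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(5)
d = 10
X = rng.normal(0, 1, (200, d))
Y = rng.normal(0, 1, (200, d))
Y[:, 3] += 1.5        # the populations differ only in feature 3

# One univariate two-sample test per coordinate; the features with the
# smallest p-values are candidates for explaining the rejected null.
pvals = np.array([ks_2samp(X[:, j], Y[:, j]).pvalue for j in range(d)])
best = int(np.argmin(pvals))
print(best)   # -> 3
```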


    8. Reeb graphs and Mapper

    Summary. A powerful data analysis paradigm currently emerging is that of data shape analysis, as critical information on discrete data (e.g. a point cloud) may be obtained from geometric and topological descriptors of the data. This class will describe the extraction of such descriptors, using the Mapper, a construction which may be seen as a pixelized version of the Reeb graph, a classical object encountered in computational topology.
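
    A minimal version of the Mapper construction can be sketched as follows: cover the range of a filter function with overlapping intervals, cluster the points of each preimage, and connect two clusters whenever they share data points. All parameters below (x-coordinate filter, cover size, clustering scale) are illustrative choices; on a densely sampled circle, the resulting graph should form a loop, reflecting the shape of the data.

```python
import numpy as np

def mapper_graph(X, filt, n_intervals=8, overlap=0.3, eps=0.3):
    """Minimal Mapper sketch: overlapping interval cover of the filter
    range, per-preimage clustering as connected components of the
    eps-neighborhood graph (via union-find), edges between clusters
    that share points."""
    lo, hi = filt.min(), filt.max()
    length = (hi - lo) / n_intervals
    node_members = []
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        members = np.where((filt >= a) & (filt <= b))[0]
        if len(members) == 0:
            continue
        parent = {m: m for m in members}
        def find(u):
            while parent[u] != u:
                parent[u] = parent[parent[u]]   # path halving
                u = parent[u]
            return u
        for p in members:                        # single-linkage at scale eps
            for q in members:
                if p < q and np.linalg.norm(X[p] - X[q]) <= eps:
                    parent[find(p)] = find(q)
        clusters = {}
        for m in members:
            clusters.setdefault(find(m), set()).add(m)
        node_members.extend(clusters.values())
    edges = {(i, j)
             for i in range(len(node_members))
             for j in range(i + 1, len(node_members))
             if node_members[i] & node_members[j]}
    return node_members, edges

# Sampled circle; with the x-coordinate as filter, Mapper yields a loop
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(t), np.sin(t)]
nodes, edges = mapper_graph(X, X[:, 0])
print(len(nodes), len(edges))   # equal counts: the graph is a cycle
```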