## Description

Data analysis is the process of cleaning, transforming, modelling or comparing data, in order to infer useful information and gain insights into complex phenomena. From a geometric perspective, when an instance (a physical phenomenon, an individual, etc.) is given as a fixed-sized collection of real-valued observations, it is naturally identified with a geometric point having these observations as coordinates. Any collection of such instances is then seen as a point cloud sampled in some metric or normed space.

This course reviews fundamental constructions related to the manipulation of such point clouds, mixing ideas from computational geometry and topology, statistics, and machine learning. The emphasis is on methods that not only come with theoretical guarantees, but also work well in practice. In particular, software references and example datasets will be provided to illustrate the constructions.

For any questions or concerns please contact Mathieu Carrière at mathieu.carriere[at]inria.fr, or Frédéric Cazals at frederic.cazals[at]inria.fr

Close

## Outline

**The course consists of the following eight lectures (3h each):**

1. Nearest neighbors in Euclidean and metric spaces: data structures and algorithms

2. Nearest neighbors in Euclidean and metric spaces: analysis

3. Computational Topology (I): Simplicial Complexes

4. Computational Topology (II): Persistent Homology

5. Comparing samplings, distributions, clusterings

6. Dimensionality reduction algorithms

7. Topological Machine Learning (I)

8. Topological Machine Learning (II)

NB: course material (slides, notes) will be provided after the lectures.
Close

## Class 1

**Nearest neighbors in Euclidean and metric spaces: data structures and algorithms**

**Summary**

For data modeled as a point cloud, reporting the nearest neighbors (NN) of a given point is a fundamental operation consistently reused in data analysis. Efficient NN search methods will be reviewed, both for Euclidean and metric spaces.

- Euclidean spaces:
- Exact nearest neighbors via Voronoi diagrams
- Randomized k-d trees and random projection trees

- Metric spaces:
- Metric trees
- Linear approximating and eliminating search algorithm

- Selected important metrics, with applications:
- Hausdorff and Frechet distances
- Earth mover distance and optimal transport
- Earth mover distance with connectivity constraints
- The Mallow distance and its application to compare clusterings

**Slides**

**Bibliography**

- (book) Har-Peled, S. Geometric Approximation Algorithms. Mathematical Surveys and Monographs, AMS, 2011 (chapters 17 to 20).
- (book) Shakhnarovich, G., Darrell, T., and Indyk, P. Nearest-Neighbors Methods in Learning and Vision. Theory and Practice, MIT press, 2005.
- (book) Hanan, S. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
- Arya, S., Mount, D.M., Netanyahu, N.S., Silverman, R. and Wu, A.Y. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. Journal of the ACM, 45, 1998.
- Marius, M., and Lowe, D. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. VISAPP (1), 2009.
- Indyk, P., and Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998.
- Umeyama, S. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence, 13(4), 376-380, 1991.
- Yossi, R., Tomasi, C., and Guibas, L. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40.2: 99-121, 2000
- Elizaveta, L., and Bickel, P. The earth mover's distance is the Mallows distance: some insights from statistics. Proceedings Eighth IEEE International Conference on Computer Vision, IEEE, 2001.
- Ding, Z., Li, J., and Zha, H. A new mallows distance based metric for comparing clusterings. Proceedings of the 22nd international conference on Machine learning, ACM, 2005.

Close

## Class 2

**Nearest neighbors in Euclidean and metric spaces: analysis**

**Summary**

This lecture will develop selected insights / mathematical analysis related to the data structures and algorithms seen in the first lecture.

High dimensional spaces and measure concentration phenomena: introduction and implications for NN algorithms:

- The notion of intrinsic dimension
- Experiments on nearest neighbor searches, regression, dimension estimation
- Random projection trees (RPT): search performance analysis
- RPT: adaptation to the intrinsic dimension
- Concentration phenomena: application to nearest neighbor searches
- Concentration phenomena: key properties

**Slides**

**Bibliography**

- Verma, N., Kpotufe, S., and Dasgupta, S. Which spatial partitions are adaptive to intrinsic dimension? UAI, 2009.
- Dasgupta, S., and Sinha, K. Randomized partition trees for exact nearest neighbor search. JMLR: Workshop and Conference Proceedings, 30:1--21, 2013.
- Vempala, S. Randomly-oriented kd Trees Adapt to Intrinsic Dimension. FSTTCS, 2012.
- Francois, D., Wertz, V., and Verleysen, M. The concentration of fractional distances. Knowledge and Data Engineering, IEEE Transactions on, 19(7), 873-886, 2007.
- Charu, A., Hinneburg, A., and Keim, D. On the surprising behavior of distance metrics in high dimensional space. Springer Berlin Heidelberg, 2001.
- Beyer, K., et al. When is nearest neighbor meaningful? Database Theory, Springer Berlin Heidelberg, 217-235, 1999.

Close

## Class 3

**Computational Topology (I): Simplicial Complexes**

**Summary**

The purpose of this class is to introduce the most basic concepts of
computational topology. We will present the definitions and
computations of various types of deformations of topological spaces, as well as data structures to work with a specific
family of such spaces called the family of simplicial complexes.

**Keywords**

Homotopy equivalences, simplicial complexes, triangulations, Cech and Vietoris-Rips complexes.

**Slides**

**Gudhi tutorial on simplex trees**

**Bibliography**

Munkres, J. Elements of Algebraic Topology, CRC Press, 1984.
Hatcher, A. Algebraic Topology, Cambridge University Press, 2002.
Edelsbrunner, H., Harer, J. Computational Topology: an introduction, AMS, 2010.
Close

## Class 4

**Computational Topology (II): Persistent Homology**

**Summary**

The purpose of this class is to present one of the main tools of topological
data analysis, the (persistent) homology groups. We will present their
definitions and computation algorithms, as well as their theoretical
properties and relationship with clustering through the ToMATo algorithm.

**Keywords**

Persistent homology, boundary reduction algorithm, stability theorem, ToMATo algorithm.

**Slides**

**Gudhi tutorials on persistence and ToMATo:** tutorial 1, tutorial 2

**Bibliography**

Cohen-Steiner, D., Edelsbrunner, H. and Harer, J. Stability of persistence diagrams. Discrete and Computational Geometry, Volume 37, pp. 103-120, 2007.
Chazal, F., Cohen-Steiner, D., Guibas, L., Mémoli, F. and Oudot, S. Gromov-Hausdorff Stable Signatures for Shapes using Persistence. Computer Graphics Forum (proc. SGP), pp. 1393-1403, 2009.
Chazal, F., Guibas, L., Oudot, S. and Skraba, P. Persistence-Based Clustering in Riemannian Manifolds. Journal of the ACM, Volume 60, Issue 6, pp. 1-38, 2013.
Close

## Class 5

**Comparing samplings, distributions, clusterings**

**Summary**

Scenarii where one needs to compare two point clouds representing two populations in a high dimensional space, or two clusterings of such point clouds, are commonplace. This lecture will review techniques to do so, in three steps. First, we will recall basics of statistical hypothesis testing and two-sample tests (TST). Second, upon rejecting the null hypothesis, we will study techniques to single out those regions in parametric space where the distributions differ. Finally, in a complementary realm, we will study the problem of comparing two clusterings of two point clouds.

- Statistical hypothesis testing and p-values
- Two-sample tests: from univariate to multivariate (MMD)
- Information theoretic measures to compare distributions
- Localizing data discrepancies upon performing two-sample test
- Comparing clusterings

**Slides (main)**

**Slides (additional)**

**Bibliography**

- (book) G. Casella and R. Berger, Statistical inference, Duxbury Press, 2001.
- (book) Mittelhammer, Ron C., and Ron C. Mittelhammer. Mathematical statistics for economics and business. Vol. 78. Berlin: Springer, 1996.
- (book) Lehmann, Erich L., and Joseph P. Romano. Testing statistical hypotheses. springer, 2006.
- Gretton, Arthur, et al. A kernel two-sample test. The Journal of Machine Learning Research 13 (2012): 723-773.
- Samory Kpotufe. Escaping the curse of dimensionality with a tree-based regressor.ra Conference on Learning Theory (COLT) 2009.
- Murdoch, Duncan J., Yu-Ling Tsai, and James Adcock. P-values are random variables. The American Statistician 62.3 (2008).
- Sackrowitz, Harold, and Ester Samuel-Cahn. P values as random variables-expected P values. The American Statistician 53.4 (1999): 326-331.
- B. Phipson and G. Smyth, Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn, Statistical Applications in Genetics and Molecular Biology 9(10), 2010.
- J., Wittawat, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A Linear-Time Kernel Goodness-of-Fit Test. NIPS 2017.
- (book) Cover and Thomas, Elements of Information Theory, Wiley, 2006.
- F. Cazals and A. Lheritier, Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces, IEEE/ACM International Conference on Data Science and Advanced Analytics, 2015.
- F. Cazals and D. Mazauric and R. Tetley and R. Watrigant, Comparing two clusterings using matchings between clusters of clusters.
- Mueller and Jaakkola, Principal Differences Analysis: Interpretable Characterization of Differences between Distributions, NeurIPS, 2015.

Close

## Class 6

**Dimensionality reduction algorithms**

**Summary**

Dimensionality reduction methods aim at embedding high-dimensional data in lower-dimensional spaces, while preserving specific properties (pairwise distance, data spread, etc).

- Dimensionality reduction algorithms: criteria and classification
- The Johnson-Lindenstrauss lemma
- Multi-dimensional scaling and principal components analysis
- Isomap, LLE
- t-SNE
- Diffusion maps

**Slides**

**Bibliography**

- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
- (book) Sariel Har-Peled. Geometric approximation algorithms. No. 173. American Mathematical Soc., 2011.
- (book) Cox, Trevor F., and Michael AA Cox. Multidimensional scaling. CRC Press, 2000.
- Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science 290.5500 (2000): 2319-2323.
- Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
- Coifman, Ronald R., et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS (102), 2005.
- Lafon, Stephane, and Ann B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.9 (2006): 1393-1403.
- Dasgupta, Sanjoy, and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22.1 (2003): 60-65.
- Frédéric Cazals, Frédéric Chazal, and Joachim Giesen. Spectral Techniques to Explore Point Clouds in Euclidean Space, with Applications to Collective Coordinates in Structural Biology. Nonlinear Computational Geometry. Springer New York, 2010. 1-34.
- Diaconis, Persi, Sharad Goel, and Susan Holmes. Horseshoes in multidimensional scaling and local kernel methods. The Annals of Applied Statistics (2008): 777-807.
- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
- R.R. Coifman and S. Lafon and A.B. Lee and M. Maggioni and B. Nadler and F. Warner and S.W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 205.
- S. Lafon and Y. Keller and R.R. Coifman, Data fusion and multi-cue data matching by diffusion maps, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
- S. Lafon and A.B. Lee, Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning and Data Set Parameterization, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
- Kannan, Ravi and Vempala, Santosh and Vetta, Adrian, On clusterings: Good, bad and spectral, J. of the ACM 51, 2004.

Close

## Class 7

**Topological Machine Learning (I)**

**Summary**

The purpose of this class is to explain how the persistence diagram
can be incorporated in standard machine learning and deep learning
methods. We will present some of the most common vectorizations and
kernels that have been defined in the literature in recent years, as
well as topological layers for neural nets.

**Keywords**

Vectorizations & kernels for PDs, neural net architectures
for PDs, optimization and gradient descent for PDs.

**Slides**

**Tutorial**

**Bibliography**

- Chazal, F., Cohen-Steiner, D., Guibas, L., Mémoli, F. and Oudot, S. Gromov-Hausdorff Stable Signatures for Shapes using Persistence. Computer Graphics Forum (proc. SGP), pp. 1393-1403, 2009.
- Li, C., Ovsjanikov, M. and Chazal, F. Persistence-based Structural Recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
- Carrière, M., Cuturi, M. and Oudot, S. Sliced Wasserstein kernel for persistence diagrams. In Proc. International Conference on Machine Learning (ICML), 2017.
- Carrière, M., Chazal, F., Ike, Y., Lacombe, T., Royer, M. and Umeda, Y. PersLay: a neural network layer for persistence diagrams and new graph topological signatures. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

Close

## Class 8

**Topological Machine Learning (II)**

**Summary**

Visualizing and exploring data sets can be a difficult task, especially when dealing with very high-dimensional data. A simple yet convenient approach for data visualization is to process data sets with topological methods, which use intrinsic information (such as pairwise distances) and thus depend only on the low-dimensional subspace/manifold from which data is sampled. A very usual technique in this field is to group together similar observations into patches and provide a description of their intersection patterns. This class will formalize such ideas and describe the corresponding tools, called Reeb spaces and Mappers, as well as well-founded corresponding statistical results.

**Keywords**

Nerves and covers, Mappers and Reeb spaces, Nerve theorems and geometric inference.

**Slides on differentiation**,
**Optimization tutorial** (to run it, you have to install the optimization branch of the Gudhi library!)

**Slides on Mapper**, **Mapper tutorial** (to run it, you have to install the Mapper branch of the Gudhi library!)

**Bibliography**

- Singh, G., Mémoli, F. and Carlsson, G. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. SPBG (pp. 91-100), 2007.
- Carrière, M. and Oudot, S. Structure and stability of the one-dimensional Mapper. Foundations of Computational Mathematics 18 (6), 1333-1396, 2017.
- Carrière, M., Michel, B. and Oudot, S. Statistical analysis and parameter selection for Mapper. Journal of Machine Learning Research 19 (12), 1-39, 2018.
- Chazal, F., Huang, R., and Sun, J. Gromov-Hausdorff Approximation of Filament Structure Using Reeb-type Graph. Discrete & Computational Geometry 53 (3), 621-649, 2015.
- Chazal, F., Cohen-Steiner, D., Lieutier, A. A Sampling Theory for Compact Sets in Euclidean spaces. Discrete & Computational Geometry 41 (3), 461-479.

Close

## Location

**Classes are happening online this year due to the COVID-19 crisis.**

Close

## Validation

This course is validated with data science projects. Instructions can be found there.

Close