# Foundations of Geometric Methods in Data Analysis

## Academic year: 2017-2018

Instructors:
• Frédéric Cazals, Inria ABS
• Fred Chazal, Inria DataShape

Course description - summary. Data analysis is the process of cleaning, transforming, modelling or comparing data, in order to infer useful information and gain insights into complex phenomena. From a geometric perspective, when an instance (a physical phenomenon, an individual, etc.) is given as a fixed-sized collection of real-valued observations, it is naturally identified with a geometric point having these observations as coordinates. Any collection of such instances is then seen as a point cloud sampled in some metric or normed space.

This course reviews fundamental constructions related to the manipulation of such point clouds, mixing ideas from computational geometry and topology, statistics, and machine learning. The emphasis is on methods that not only come with theoretical guarantees, but also work well in practice. In particular, software references and example datasets will be provided to illustrate the constructions.

Course outline. The course consists of the following eight lectures (3h each):

## 1. Nearest neighbors in Euclidean and metric spaces: search: data structures and algorithms

Summary. For data modeled as a point cloud, reporting the nearest neighbors (NN) of a given point is a fundamental operation consistently reused in data analysis. Efficient NN search methods will be reviewed, both for Euclidean and metric spaces.

Euclidean spaces:

• Exact nearest neighbors via Voronoi diagrams
• Randomized k-d trees and random projection trees

Metric spaces:

• Metric trees
• Linear approximating and eliminating search algorithm

Selected important metrics, with applications:

• Hausdorff and Frechet distances
• Earth mover distance and optimal transportation
• Earth mover distance with connectivity constraints
• The Mallow distance and its application to compare clusterings

Bibliography

• (book) S. Har-Peled; Geometric Approximation Algorithms; Mathematical Surveys and Monographs, AMS, 2011 --- chapters 17 to 20
• (book) G. Shakhnarovich and T. Darrell and P. Indyk (Eds), Nearest-Neighbors Methods in Learning and Vision. Theory and Practice, MIT press, 2005.
• (book) Samet, Hanan. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
• Arya, S. and Mount, D.M. and Netanyahu, N.S. and Silverman, R. and Wu, A.Y.; An optimal algorithm for approximate nearest neighbor searching fixed dimensions; Journal of the ACM, 45, 1998.
• Muja, Marius, and David G. Lowe. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. VISAPP (1). 2009.
• P. Indyk, R. Motwani; Approximate nearest neighbors: towards removing the curse of dimensionality; Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998
• Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence, 13(4), 376-380.
• Rubner, Yossi, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40.2 (2000): 99-121.
• Levina, Elizaveta, and Peter Bickel. The earth mover's distance is the Mallows distance: some insights from statistics. Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. Vol. 2. IEEE, 2001.
• Zhou, Ding, Jia Li, and Hongyuan Zha. A new mallows distance based metric for comparing clusterings. Proceedings of the 22nd international conference on Machine learning. ACM, 2005.

## 2.Nearest neighbors in Euclidean and metric spaces: analysis

Summary. This lecture will develop selected insights / mathematical analysis related to the data structures and algorithms seen in the first lecture.

High dimensional spaces and measure concentration phenomena: introduction and implications for NN algorithms:

• The notion of intrinsic dimension
• Experiments on nearest neighbor searches, regression, dimension estimation
• Random projection trees (RPT): search performance analysis
• RPT: adaptation to the intrinsic dimension
• Concentration phenomena: application to nearest neighbor searches
• Concentration phenomena: key properties

Bibliography

• N. Verma, S. Kpotufe, S. Dasgupta, Which spatial partitions are adaptive to intrinsic dimension? UAI 2009.
• S. Dasgupta and K. Sinha. Randomized partition trees for exact nearest neighbor search. JMLR: Workshop and Conference Proceedings, 30:1--21, 2013.
• S. Vempala, Randomly-oriented kd Trees Adapt to Intrinsic Dimension. FSTTCS, 2012.
• Francois, D., Wertz, V., and Verleysen, M. (2007). The concentration of fractional distances. Knowledge and Data Engineering, IEEE Transactions on, 19(7), 873-886.
• Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. Springer Berlin Heidelberg, 2001.
• Beyer, Kevin, et al. When is nearest neighbor meaningful?. Database Theory—ICDT’99. Springer Berlin Heidelberg, 1999. 217-235.

## 3. Dimensionality reduction algorithms

Summary. Dimensionality reduction methods aim at embedding high-dimensional data in lower-dimensional spaces, while preserving specific properties (pairwise distance, data spread, etc).

• Dimensionality reduction algorithms: criteria and classification
• The Johnson-Lindenstrauss lemma
• Multi-dimensional scaling and principal components analysis
• Isomap
• Diffusion maps

Bibliography

• (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
• (book) Sariel Har-Peled. Geometric approximation algorithms. No. 173. American Mathematical Soc., 2011.
• (book) Cox, Trevor F., and Michael AA Cox. Multidimensional scaling. CRC Press, 2000.
• Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science 290.5500 (2000): 2319-2323.
• Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
• Coifman, Ronald R., et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS (102), 2005.
• Lafon, Stephane, and Ann B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.9 (2006): 1393-1403.
• Dasgupta, Sanjoy, and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22.1 (2003): 60-65.
• Frédéric Cazals, Frédéric Chazal, and Joachim Giesen. Spectral Techniques to Explore Point Clouds in Euclidean Space, with Applications to Collective Coordinates in Structural Biology. Nonlinear Computational Geometry. Springer New York, 2010. 1-34.
• Diaconis, Persi, Sharad Goel, and Susan Holmes. Horseshoes in multidimensional scaling and local kernel methods. The Annals of Applied Statistics (2008): 777-807.
• (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
• R.R. Coifman and S. Lafon and A.B. Lee and M. Maggioni and B. Nadler and F. Warner and S.W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 205.
• S. Lafon and Y. Keller and R.R. Coifman, Data fusion and multi-cue data matching by diffusion maps, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
• S. Lafon and A.B. Lee, Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning and Data Set Parameterization, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
• Kannan, Ravi and Vempala, Santosh and Vetta, Adrian, On clusterings: Good, bad and spectral, J. of the ACM 51, 2004.

## 4. Covers and nerves: geometric inference and the Mapper algorithm

Summary. A simple and basic but powerful idea is that the global structure (or shape) of data can be understood by first locally grouping together some observations (through local clustering or some other procedure) and then providing a combinatorial description of the intersection patterns of the obtained groups. This class will introduce the notions of simplicial complexe and nerve of covers that allows to formalize such an idea. It will describe some resulting useful tools, such that the the so-called Mapper algorithm for exploratory data analysis, and well-founded mathematical results to infer relevant topological information about the global structure of data.

• Simplicial complexes
• Nerves and covers of spaces and data sets
• Mapper algorithms and Reeb graphs
• Union of balls, nerve theorems and geometric inference.

Bibliography.

• Carriere, M. and Oudot, S., 2015. Structure and stability of the 1-dimensional mapper. Proc. 32nd International Symposium on Computational Geometry (SoCG), 2016.
• F. Chazal, R. Huang, J. Sun. Gromov-Hausdorff Approximation of Filament Structure Using Reeb-type Graph. Discrete & Computational Geometry 53 (3), 621-649, 2015.
• F Chazal, D Cohen-Steiner, A Lieutier, A Sampling Theory for Compact Sets in Euclidean spaces, Discrete & Computational Geometry 41 (3), 461-479.
• Hatcher, A. (2001). Algebraic Topology. Cambridge Univ. Press.
• Singh, G., Memoli, F. and Carlsson, G.E., 2007, September. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In SPBG (pp. 91-100).

## 5. Clustering algorithms and introduction to persistent homology

Summary. Clustering is a fundamental problem in data analysis and machine learning that has led to a wide variety of techniques and algorithms. This class will focus on a few clustering methods with a geometric flavor. This will also give us the opportunity to first introduce the notion of persistent homology as a useful tool to improve the performances of some clustering algorithms.

• Quantization and k-means clustering algorithm
• Hierarchical clustering
• Mode-seeking clustering
• 0-dimensional persistent homology, persistence-based clustering.

Bibliography.

• (book) S. Har-Peled; Geometric Approximation Algorithms; Mathematical Surveys and Monographs, AMS, 2011 --- chapter 4
• G. Carlsson, F. Mémoli; Characterization, Stability and Convergence of Hierarchical Clustering Methods; Journal of Machine Learning Research, volume 11, pages 1425--1470, 2010
• F. Chazal, L. J. Guibas, S. Y. Oudot, P. Skraba; Persistence-Based Clustering in Riemannian Manifolds; Journal of the ACM, 60(6):41, 2013.
• L. Wasserman, Topological Data Analysis, 2016. https://arxiv.org/pdf/1609.08227.pdf

## 6. Statistical hypothesis testing and two-sample tests (TST)

Summary. Testing statistical hypothesis is routine in data analysis. Of particular interest for (multi-)variate data are two-sample tests (TST), which aim at checking whether two populations modeled as point clouds in some Euclidean space come from the same distribution (the null hypothesis) or not. This lecture presents basics on hypothesis testing and TST.

• Statistical hypothesis testing and p-values
• One dimensional TST based on permutations
• Permutation tests
• Multi-dimensional tests based on the Maximum Mean Discrepancy
• Goodness of fit tests

Bibliography

• (book) G. Casella and R. Berger, Statistical inference, Duxbury Press, 2001.
• (book) Mittelhammer, Ron C., and Ron C. Mittelhammer. Mathematical statistics for economics and business. Vol. 78. Berlin: Springer, 1996.
• (book) Lehmann, Erich L., and Joseph P. Romano. Testing statistical hypotheses. springer, 2006.
• Gretton, Arthur, et al. A kernel two-sample test. The Journal of Machine Learning Research 13 (2012): 723-773.
• Samory Kpotufe. Escaping the curse of dimensionality with a tree-based regressor.ra Conference on Learning Theory (COLT) 2009.
• Murdoch, Duncan J., Yu-Ling Tsai, and James Adcock. P-values are random variables. The American Statistician 62.3 (2008).
• Sackrowitz, Harold, and Ester Samuel-Cahn. P values as random variables-expected P values. The American Statistician 53.4 (1999): 326-331.
• B. Phipson and G. Smyth, Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn, Statistical Applications in Genetics and Molecular Biology 9(10), 2010.
• J., Wittawat, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A Linear-Time Kernel Goodness-of-Fit Test. NIPS 2017.

## 7. Comparing high-dimensional distributions, comparing clusterings

Summary. Upon performing a two-sample test (lesson 7), if the null is rejected, one if left with a binary piece of information. This class will develop techniques to expand this information, in two directions. First, we shall address the problem of identifying regions in parameter space accounting for the differences revealed by the TST. Second, we shall address the question of selecting the features which account for the differences observed.

• Information theoretic measures
• Localizing data discrepancies upon performing two-sample test
• Comparing clusterings
• Feature selection accounting for differences observed in two-sample tests

Bibliography

• (book) Cover and Thomas, Elements of Information Theory, Wiley, 2006.
• F. Cazals and A. Lheritier, Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces, IEEE/ACM International Conference on Data Science and Advanced Analytics, 2015.
• F. Cazals and D. Mazauric and R. Tetley and R. Watrigant, Comparing two clusterings using matchings between clusters of clusters
• Mueller and Jaakkola, Principal Differences Analysis: Interpretable Characterization of Differences between Distributions, NIPS 2015

## 8. Shape signatures: stability and statistical aspects

Summary. This class will extend the notion of persistent homology introduced in the class on clustering. It will be shown how the theory can be used to build relevant topological information from data that can be converted into features for further data analysis and machine learning tasks.

• Persistent homology for point cloud
• Stable multiscale topological signatures
• Statistical properties of persistence-based signatures
• Feature extraction from persistence diagrams and applications
• Introduction to the Gudhi library

Bibliography.

• (book) H. Edelsbrunner, J. Harer, Computational Topology: an introduction, AMS, 2010.
• (software) The GUDHI library
• F. Chazal, High-Dimensional Topological Data Analysis. To appear in the 3rd edition of the Handbook of Discrete and Computational Geometry. Available here
• F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. M´emoli, S. Oudot, Gromov-Hausdorff Stable Signatures for Shapes using Persistence, Computer Graphics Forum (proc. SGP 2009), pp. 1393-1403,
• 2009.
• C. Li, M. Ovsjanikov, F. Chazal. Persistence-based Structural Recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014).