Data analysis is the process of cleaning, transforming, modelling or comparing data, in order to infer useful information and gain insights into complex phenomena. From a geometric perspective, when an instance (a physical phenomenon, an individual, etc.) is given as a fixed-sized collection of real-valued observations, it is naturally indentified with a geometric point having these observations as coordinates. Any collection of such instances is then seen as a point cloud sampled in some metric or normed space.

This course reviews fundamental constructions related to the manipulation of such point clouds, mixing ideas from computational geometry and topology, statistics, and machine learning. The emphasis is on methods that not only come with theoretical guarantees, but also work well in practice. In particular, software references and example datasets will be provided to illustrate the constructions.

**Course outline.** The course consists of the following eight lectures (3h each):

- 1. Nearest neighbors in Euclidean and metric spaces: search data structures and algorithms [Slides]
- 2. Nearest neighbors in Euclidean and metric spaces: analysis [Slides]
- 3. Dimensionality reduction algorithms [Course notes]
- 4. Covers and nerves: geometric inference and the Mapper algorithm
- 5. Clustering algorithms and introduction to persistent homology
- 6. Statistical hypothesis testing and two-sample tests (TST) [Lecture notes, [slides]
- 7. Comparing high-dimensional distributions, with applications to feature selection
- 8. Shape signatures: stability and statistical aspects

**Summary.** For data modeled as a point cloud, reporting the
nearest neighbors (NN) of a given point is a fundamental operation
consistently reused in data analysis. Efficient NN search methods will
be reviewed, both for Euclidean and metric spaces.

*Euclidean spaces:*

- Exact nearest neighbors via Voronoi diagrams
- Randomized k-d trees and random projection trees

*Metric spaces:*

- Metric trees
- Linear approximating and eliminating search algorithm

*Selected important metrics, with applications:*

- Hausdorff and Frechet distances
- Earth mover distance and optimal transportation
- Earth mover distance with connectivity constraints
- The Mallow distance and its application to compare clusterings

**Bibliography**

- (book) S. Har-Peled; Geometric Approximation Algorithms; Mathematical Surveys and Monographs, AMS, 2011 --- chapters 17 to 20
- (book) G. Shakhnarovich and T. Darrell and P. Indyk (Eds), Nearest-Neighbors Methods in Learning and Vision. Theory and Practice, MIT press, 2005.
- (book) Samet, Hanan. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
- Arya, S. and Mount, D.M. and Netanyahu, N.S. and Silverman, R. and Wu, A.Y.; An optimal algorithm for approximate nearest neighbor searching fixed dimensions; Journal of the ACM, 45, 1998.
- Muja, Marius, and David G. Lowe. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. VISAPP (1). 2009.
- P. Indyk, R. Motwani; Approximate nearest neighbors: towards removing the curse of dimensionality; Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998
- Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence, 13(4), 376-380.
- Rubner, Yossi, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40.2 (2000): 99-121.
- Levina, Elizaveta, and Peter Bickel. The earth mover's distance is the Mallows distance: some insights from statistics. Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on. Vol. 2. IEEE, 2001.
- Zhou, Ding, Jia Li, and Hongyuan Zha. A new mallows distance based metric for comparing clusterings. Proceedings of the 22nd international conference on Machine learning. ACM, 2005.
- F. Cazals and D. Mazauric, Mass Transportation Problems with Connectivity Constraints, with Applications to Energy Landscape Comparison, Inria Tech repot 8611, 2004.

**Summary.** This lecture will develop selected
insights / mathematical analysis related to the data
structures and algorithms seen in the first lecture.

* High dimensional spaces and measure concentration phenomena:
introduction and implications for NN algorithms:*

- The notion of intrinsic dimension
- Experiments on nearest neighbor searches, regression, dimension estimation
- Random projection trees (RPT): search performance analysis
- RPT: adaptation to the intrinsic dimension
- Concentration phenomena: application to nearest neighbor searches
- Concentration phenomena: key properties

**Bibliography**

- N. Verma, S. Kpotufe, S. Dasgupta, Which spatial partitions are adaptive to intrinsic dimension? UAI 2009.
- S. Dasgupta and K. Sinha. Randomized partition trees for exact nearest neighbor search. JMLR: Workshop and Conference Proceedings, 30:1--21, 2013.
- S. Vempala, Randomly-oriented kd Trees Adapt to Intrinsic Dimension. FSTTCS, 2012.
- Francois, D., Wertz, V., and Verleysen, M. (2007). The concentration of fractional distances. Knowledge and Data Engineering, IEEE Transactions on, 19(7), 873-886.
- Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. Springer Berlin Heidelberg, 2001.
- Beyer, Kevin, et al. When is nearest neighbor meaningful?. Database Theory—ICDT’99. Springer Berlin Heidelberg, 1999. 217-235.

**Summary.** Dimensionality reduction methods aim at embedding
high-dimensional data in lower-dimensional spaces, while preserving
specific properties (pairwise distance, data spread, etc).

- Dimensionality reduction algorithms: criteria and classification
- The Johnson-Lindenstrauss lemma
- Multi-dimensional scaling and principal components analysis
- Isomap
- Diffusion maps

** Bibliography**

- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
- (book) Sariel Har-Peled. Geometric approximation algorithms. No. 173. American Mathematical Soc., 2011.
- (book) Cox, Trevor F., and Michael AA Cox. Multidimensional scaling. CRC Press, 2000.
- Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science 290.5500 (2000): 2319-2323.
- Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
- Coifman, Ronald R., et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS (102), 2005.
- Lafon, Stephane, and Ann B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28.9 (2006): 1393-1403.
- Dasgupta, Sanjoy, and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22.1 (2003): 60-65.
- Frédéric Cazals, Frédéric Chazal, and Joachim Giesen. Spectral Techniques to Explore Point Clouds in Euclidean Space, with Applications to Collective Coordinates in Structural Biology. Nonlinear Computational Geometry. Springer New York, 2010. 1-34.
- Diaconis, Persi, Sharad Goel, and Susan Holmes. Horseshoes in multidimensional scaling and local kernel methods. The Annals of Applied Statistics (2008): 777-807.
- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
- R.R. Coifman and S. Lafon and A.B. Lee and M. Maggioni and B. Nadler and F. Warner and S.W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 205.
- S. Lafon and Y. Keller and R.R. Coifman, Data fusion and multi-cue data matching by diffusion maps, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
- S. Lafon and A.B. Lee, Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning and Data Set Parameterization, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
- Kannan, Ravi and Vempala, Santosh and Vetta, Adrian, On clusterings: Good, bad and spectral, J. of the ACM 51, 2004.

** Summary.** A simple and basic but powerful idea is that the
global structure (or shape) of data can be understood by first locally
grouping together some observations (through local clustering or some other
procedure) and then providing a combinatorial description of the
intersection patterns of the obtained groups. This class will
introduce the notions of simplicial complexe and nerve of covers that
allows to formalize such an idea. It will describe some resulting
useful tools, such that the the so-called Mapper algorithm for
exploratory data analysis, and well-founded mathematical results to
infer relevant topological information about the global structure of
data.

- Simplicial complexes
- Nerves and covers of spaces and data sets
- Mapper algorithms and Reeb graphs
- Union of balls, nerve theorems and geometric inference.

** Bibliography.**

- Carriere, M. and Oudot, S., 2015. Structure and stability of the 1-dimensional mapper. Proc. 32nd International Symposium on Computational Geometry (SoCG), 2016.
- F. Chazal, R. Huang, J. Sun. Gromov-Hausdorff Approximation of Filament Structure Using Reeb-type Graph. Discrete & Computational Geometry 53 (3), 621-649, 2015.
- F Chazal, D Cohen-Steiner, A Lieutier, A Sampling Theory for Compact Sets in Euclidean spaces, Discrete & Computational Geometry 41 (3), 461-479.
- Hatcher, A. (2001). Algebraic Topology. Cambridge Univ. Press.
- Singh, G., Memoli, F. and Carlsson, G.E., 2007, September. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In SPBG (pp. 91-100).

** Summary.** Clustering is a fundamental problem in data
analysis and machine learning that has led to a wide variety of
techniques and algorithms. This class will focus on a few clustering
methods with a geometric flavor. This will also give us the
opportunity to first introduce the notion of persistent homology as a
useful tool to improve the performances of some clustering algorithms.

- Quantization and k-means clustering algorithm
- Hierarchical clustering
- Mode-seeking clustering
- 0-dimensional persistent homology, persistence-based clustering.

** Bibliography.**

- (book) S. Har-Peled; Geometric Approximation Algorithms; Mathematical Surveys and Monographs, AMS, 2011 --- chapter 4
- G. Carlsson, F. Mémoli; Characterization, Stability and Convergence of Hierarchical Clustering Methods; Journal of Machine Learning Research, volume 11, pages 1425--1470, 2010
- F. Chazal, L. J. Guibas, S. Y. Oudot, P. Skraba; Persistence-Based Clustering in Riemannian Manifolds; Journal of the ACM, 60(6):41, 2013.
- L. Wasserman, Topological Data Analysis, 2016. https://arxiv.org/pdf/1609.08227.pdf

**Summary.** Testing statistical hypothesis is routine in data
analysis. Of particular interest for (multi-)variate data are
two-sample tests (TST), which aim at checking whether two populations
modeled as point clouds in some Euclidean space come from the same
distribution (the null hypothesis) or not. This lecture presents
basics on hypothesis testing and TST.

- Statistical hypothesis testing and p-values
- One dimensional TST based on permutations
- Permutation tests
- Multi-dimensional tests based on the Maximum Mean Discrepancy

** Bibliography**

- (book) G. Casella and R. Berger, Statistical inference, Duxbury Press, 2001.
- (book) Mittelhammer, Ron C., and Ron C. Mittelhammer. Mathematical statistics for economics and business. Vol. 78. Berlin: Springer, 1996.
- (book) Lehmann, Erich L., and Joseph P. Romano. Testing statistical hypotheses. springer, 2006.
- Gretton, Arthur, et al. A kernel two-sample test. The Journal of Machine Learning Research 13 (2012): 723-773.
- Samory Kpotufe. Escaping the curse of dimensionality with a tree-based regressor.ra Conference on Learning Theory (COLT) 2009.
- Murdoch, Duncan J., Yu-Ling Tsai, and James Adcock. P-values are random variables. The American Statistician 62.3 (2008).
- Sackrowitz, Harold, and Ester Samuel-Cahn. P values as random variables-expected P values. The American Statistician 53.4 (1999): 326-331.
- B. Phipson and G. Smyth, Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn, Statistical Applications in Genetics and Molecular Biology 9(10), 2010.

**Summary.** Upon performing a two-sample test (lesson 7), if
the null is rejected, one if left with a binary piece of information.
This class will develop techniques to expand this information, in two
directions. First, we shall address the problem of identifying regions
in parameter space accounting for the differences revealed by the
TST. Second, we shall address the question of selecting the features
which account for the differences observed.

- Information theoretic measures
- Localizing data discrepancies upon performing two-sample test
- Feature selection accounting for differences observed in two-sample tests

** Bibliography**

- (book) Cover and Thomas, Elements of Information Theory, Wiley, 2006.
- F. Cazals and A. Lheritier, Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces, IEEE/ACM International Conference on Data Science and Advanced Analytics, 2015.
- Mueller and Jaakkola, Principal Differences Analysis: Interpretable Characterization of Differences between Distributions, NIPS 2015

** Summary.** This class will extend the notion of persistent homology
introduced in the class on clustering. It will be shown how the theory
can be used to build relevant topological information from data that
can be converted into features for further data analysis and machine
learning tasks.

- Persistent homology for point cloud
- Stable multiscale topological signatures
- Statistical properties of persistence-based signatures
- Feature extraction from persistence diagrams and applications
- Introduction to the Gudhi library

** Bibliography.**

- (book) H. Edelsbrunner, J. Harer, Computational Topology: an introduction, AMS, 2010.
- (software) The GUDHI library
- F. Chazal, High-Dimensional Topological Data Analysis. To appear in the 3rd edition of the Handbook of Discrete and Computational Geometry. Available here
- F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. M´emoli, S. Oudot, Gromov-Hausdorff Stable Signatures for Shapes using Persistence, Computer Graphics Forum (proc. SGP 2009), pp. 1393-1403, 2009.
- C. Li, M. Ovsjanikov, F. Chazal. Persistence-based Structural Recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014).