Fred Chazal, Inria DataShape
Data analysis is the process of cleaning, transforming, modelling or
comparing data, in order to infer useful information and gain insights
into complex phenomena. From a geometric perspective, when an instance
(a physical phenomenon, an individual, etc.) is given as a fixed-sized
collection of real-valued observations, it is naturally indentified
with a geometric point having these observations as coordinates. Any
collection of such instances is then seen as a point cloud sampled in
some metric or normed space.
This course reviews fundamental constructions related to the
manipulation of such point clouds, mixing ideas from computational
geometry and topology, statistics, and machine learning. The emphasis
is on methods that not only come with theoretical guarantees, but also
work well in practice. In particular, software references and example
datasets will be provided to illustrate the constructions.
Course outline. The course consists of the following eight lectures (3h each):
1. Nearest neighbors in Euclidean and metric spaces: search: data structures and algorithms
Summary. For data modeled as a point cloud, reporting the
nearest neighbors (NN) of a given point is a fundamental operation
consistently reused in data analysis. Efficient NN search methods will
be reviewed, both for Euclidean and metric spaces.
- Exact nearest neighbors via Voronoi diagrams
- Randomized k-d trees and random projection trees
- Metric trees
- Linear approximating and eliminating search algorithm
Selected important metrics, with applications:
- Hausdorff and Frechet distances
- Earth mover distance and optimal transportation
- Earth mover distance with connectivity constraints
- The Mallow distance and its application to compare clusterings
- (book) S. Har-Peled; Geometric Approximation Algorithms;
Mathematical Surveys and Monographs, AMS, 2011 --- chapters 17 to 20
- (book) G. Shakhnarovich and T. Darrell and P. Indyk (Eds),
Nearest-Neighbors Methods in Learning and Vision. Theory and Practice,
MIT press, 2005.
- (book) Samet, Hanan. Foundations of multidimensional and metric data
structures. Morgan Kaufmann, 2006.
- Arya, S. and Mount, D.M. and Netanyahu, N.S. and Silverman, R. and Wu,
A.Y.; An optimal algorithm for approximate nearest neighbor
searching fixed dimensions; Journal of the ACM, 45, 1998.
- Muja, Marius, and David G. Lowe. Fast Approximate Nearest
Neighbors with Automatic Algorithm Configuration. VISAPP (1). 2009.
- P. Indyk, R. Motwani; Approximate nearest neighbors: towards
removing the curse of dimensionality; Proceedings of the thirtieth
annual ACM symposium on Theory of computing, pages 604-613, 1998
Umeyama, S. (1991). Least-squares estimation of transformation
parameters between two point patterns. IEEE Transactions on pattern
analysis and machine intelligence, 13(4), 376-380.
- Rubner, Yossi, Carlo Tomasi, and Leonidas J. Guibas. The earth
mover's distance as a metric for image retrieval. International
Journal of Computer Vision 40.2 (2000): 99-121.
- Levina, Elizaveta, and Peter Bickel. The earth mover's distance
is the Mallows distance: some insights from statistics. Computer
Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International
Conference on. Vol. 2. IEEE, 2001.
- Zhou, Ding, Jia Li, and Hongyuan Zha. A new mallows distance
based metric for comparing clusterings. Proceedings of the 22nd
international conference on Machine learning. ACM, 2005.
2.Nearest neighbors in Euclidean and metric spaces: analysis
Summary. This lecture will develop selected
insights / mathematical analysis related to the data
structures and algorithms seen in the first lecture.
High dimensional spaces and measure concentration phenomena:
introduction and implications for NN algorithms:
- The notion of intrinsic dimension
- Experiments on nearest neighbor searches, regression, dimension estimation
- Random projection trees (RPT): search performance analysis
- RPT: adaptation to the intrinsic dimension
- Concentration phenomena: application to nearest neighbor searches
- Concentration phenomena: key properties
- N. Verma, S. Kpotufe, S. Dasgupta, Which spatial partitions are adaptive to intrinsic dimension? UAI 2009.
S. Dasgupta and K. Sinha. Randomized partition trees for exact nearest neighbor search. JMLR:
Workshop and Conference Proceedings, 30:1--21, 2013.
S. Vempala, Randomly-oriented kd Trees Adapt to Intrinsic Dimension. FSTTCS, 2012.
Francois, D., Wertz, V., and Verleysen, M. (2007). The concentration of
fractional distances. Knowledge and Data Engineering, IEEE
Transactions on, 19(7), 873-886.
Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. On the
surprising behavior of distance metrics in high dimensional
space. Springer Berlin Heidelberg, 2001.
- Beyer, Kevin, et al. When is nearest neighbor
meaningful?. Database Theory—ICDT’99. Springer Berlin Heidelberg,
3. Dimensionality reduction algorithms
Summary. Dimensionality reduction methods aim at embedding
high-dimensional data in lower-dimensional spaces, while preserving
specific properties (pairwise distance, data spread, etc).
- Dimensionality reduction algorithms: criteria and classification
- The Johnson-Lindenstrauss lemma
- Multi-dimensional scaling and principal components analysis
- Diffusion maps
- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality
reduction. Springer, 2007.
- (book) Sariel Har-Peled. Geometric approximation
algorithms. No. 173. American Mathematical Soc., 2011.
(book) Cox, Trevor F., and Michael AA Cox. Multidimensional
scaling. CRC Press, 2000.
- Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. A global
geometric framework for nonlinear dimensionality reduction. Science
290.5500 (2000): 2319-2323.
- Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality
reduction by locally linear embedding." Science 290.5500 (2000):
- Coifman, Ronald R., et al. Geometric diffusions as a tool for
harmonic analysis and structure definition of data: Diffusion
maps. PNAS (102), 2005.
- Lafon, Stephane, and Ann B. Lee. Diffusion maps and
coarse-graining: A unified framework for dimensionality reduction,
graph partitioning, and data set parameterization. Pattern Analysis
and Machine Intelligence, IEEE Transactions on 28.9 (2006): 1393-1403.
Dasgupta, Sanjoy, and Anupam Gupta. An elementary proof of a theorem
of Johnson and Lindenstrauss. Random Structures and Algorithms 22.1
- Frédéric Cazals, Frédéric Chazal, and
Joachim Giesen. Spectral Techniques to Explore Point Clouds in
Euclidean Space, with Applications to Collective Coordinates in
Structural Biology. Nonlinear Computational Geometry. Springer New
York, 2010. 1-34.
- Diaconis, Persi, Sharad Goel, and Susan Holmes. Horseshoes in
multidimensional scaling and local kernel methods. The Annals of
Applied Statistics (2008): 777-807.
- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality
reduction. Springer, 2007.
R.R. Coifman and S. Lafon and A.B. Lee and M. Maggioni and B. Nadler
and F. Warner and S.W. Zucker, Geometric diffusions as a tool
for harmonic analysis and structure definition of data:
Diffusion maps, PNAS 102, 205.
- S. Lafon and Y. Keller and R.R. Coifman, Data fusion and
multi-cue data matching by diffusion maps, IEEE Trans. on Pattern
Analysis and Machine Intelligence 28, 2006.
- S. Lafon and A.B. Lee,
Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality
Reduction, Graph Partitioning and Data Set Parameterization,
IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
- Kannan, Ravi and Vempala, Santosh and Vetta, Adrian,
On clusterings: Good, bad and spectral, J. of the ACM 51, 2004.
4. Covers and nerves: geometric inference and the Mapper algorithm
Summary. A simple and basic but powerful idea is that the
global structure (or shape) of data can be understood by first locally
grouping together some observations (through local clustering or some other
procedure) and then providing a combinatorial description of the
intersection patterns of the obtained groups. This class will
introduce the notions of simplicial complexe and nerve of covers that
allows to formalize such an idea. It will describe some resulting
useful tools, such that the the so-called Mapper algorithm for
exploratory data analysis, and well-founded mathematical results to
infer relevant topological information about the global structure of
- Simplicial complexes
- Nerves and covers of spaces and data sets
- Mapper algorithms and Reeb graphs
- Union of balls, nerve theorems and geometric inference.
- Carriere, M. and Oudot, S., 2015. Structure and stability of the 1-dimensional mapper. Proc. 32nd International Symposium on Computational Geometry (SoCG), 2016.
- F. Chazal, R. Huang, J. Sun. Gromov-Hausdorff Approximation of Filament Structure Using Reeb-type Graph. Discrete & Computational Geometry 53 (3), 621-649, 2015.
- F Chazal, D Cohen-Steiner, A Lieutier, A Sampling Theory for Compact Sets in Euclidean spaces, Discrete & Computational Geometry 41 (3), 461-479.
- Hatcher, A. (2001). Algebraic Topology. Cambridge Univ. Press.
- Singh, G., Memoli, F. and Carlsson, G.E., 2007, September. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In SPBG (pp. 91-100).
5. Clustering algorithms and introduction to persistent homology
Summary. Clustering is a fundamental problem in data
analysis and machine learning that has led to a wide variety of
techniques and algorithms. This class will focus on a few clustering
methods with a geometric flavor. This will also give us the
opportunity to first introduce the notion of persistent homology as a
useful tool to improve the performances of some clustering algorithms.
- Quantization and k-means clustering algorithm
- Hierarchical clustering
- Mode-seeking clustering
- 0-dimensional persistent homology, persistence-based clustering.
- (book) S. Har-Peled; Geometric Approximation Algorithms; Mathematical Surveys and Monographs, AMS, 2011 --- chapter 4
- G. Carlsson, F. Mémoli; Characterization, Stability and Convergence of Hierarchical Clustering Methods; Journal of Machine Learning Research, volume 11, pages 1425--1470, 2010
- F. Chazal, L. J. Guibas, S. Y. Oudot, P. Skraba; Persistence-Based Clustering in Riemannian Manifolds; Journal of the ACM, 60(6):41, 2013.
- L. Wasserman, Topological Data Analysis, 2016. https://arxiv.org/pdf/1609.08227.pdf
6. Statistical hypothesis testing and two-sample tests (TST)
Summary. Testing statistical hypothesis is routine in data
analysis. Of particular interest for (multi-)variate data are
two-sample tests (TST), which aim at checking whether two populations
modeled as point clouds in some Euclidean space come from the same
distribution (the null hypothesis) or not. This lecture presents
basics on hypothesis testing and TST.
- Statistical hypothesis testing and p-values
- One dimensional TST based on permutations
- Permutation tests
- Multi-dimensional tests based on the Maximum Mean Discrepancy
- Goodness of fit tests
(book) G. Casella and R. Berger, Statistical inference, Duxbury Press, 2001.
(book) Mittelhammer, Ron C., and Ron C. Mittelhammer. Mathematical statistics
for economics and business. Vol. 78. Berlin: Springer, 1996.
(book) Lehmann, Erich L., and Joseph P. Romano. Testing statistical
hypotheses. springer, 2006.
Gretton, Arthur, et al. A kernel two-sample test. The Journal of
Machine Learning Research 13 (2012): 723-773.
Samory Kpotufe. Escaping the curse of dimensionality with a tree-based regressor.ra
Conference on Learning Theory (COLT) 2009.
Murdoch, Duncan J., Yu-Ling Tsai, and James Adcock. P-values are
random variables. The American Statistician 62.3 (2008).
- Sackrowitz, Harold, and Ester Samuel-Cahn. P values as random
variables-expected P values. The American Statistician 53.4 (1999):
B. Phipson and G. Smyth, Permutation P-values should never be zero:
calculating exact P-values when permutations are randomly drawn,
Statistical Applications in Genetics and Molecular Biology 9(10),
- J., Wittawat, W. Xu, Z. Szabo, K. Fukumizu, and A. Gretton. A
Linear-Time Kernel Goodness-of-Fit Test. NIPS 2017.
7. Comparing high-dimensional distributions, comparing clusterings
Summary. Upon performing a two-sample test (lesson 7), if
the null is rejected, one if left with a binary piece of information.
This class will develop techniques to expand this information, in two
directions. First, we shall address the problem of identifying regions
in parameter space accounting for the differences revealed by the
TST. Second, we shall address the question of selecting the features
which account for the differences observed.
- Information theoretic measures
- Localizing data discrepancies upon performing two-sample test
- Comparing clusterings
- Feature selection accounting for differences observed in two-sample tests
- (book) Cover and Thomas, Elements of Information Theory, Wiley, 2006.
- F. Cazals and A. Lheritier,
Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces,
IEEE/ACM International Conference on Data Science and Advanced Analytics, 2015.
F. Cazals and D. Mazauric and R. Tetley and R. Watrigant,
Comparing two clusterings using matchings between clusters of clusters
- Mueller and Jaakkola, Principal Differences Analysis: Interpretable
Characterization of Differences between Distributions, NIPS 2015
8. Shape signatures: stability and statistical aspects
Summary. This class will extend the notion of persistent homology
introduced in the class on clustering. It will be shown how the theory
can be used to build relevant topological information from data that
can be converted into features for further data analysis and machine
- Persistent homology for point cloud
- Stable multiscale topological signatures
- Statistical properties of persistence-based signatures
- Feature extraction from persistence diagrams and applications
- Introduction to the Gudhi library
- (book) H. Edelsbrunner, J. Harer, Computational Topology: an introduction, AMS, 2010.
The GUDHI library
- F. Chazal, High-Dimensional Topological Data Analysis. To appear in the 3rd edition of the Handbook of Discrete and Computational Geometry. Available
- F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. M´emoli, S. Oudot, Gromov-Hausdorff Stable Signatures for Shapes using Persistence, Computer Graphics Forum (proc. SGP 2009), pp. 1393-1403,
- C. Li, M. Ovsjanikov, F. Chazal. Persistence-based Structural Recognition. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2014).