
Data analysis is the process of cleaning, transforming, modelling, or comparing data in order to infer useful information and gain insights into complex phenomena. From a geometric perspective, when an instance (a physical phenomenon, an individual, etc.) is given as a fixed-size collection of real-valued observations, it is naturally identified with a geometric point having these observations as coordinates. Any collection of such instances is then seen as a point cloud sampled in some metric or normed space.

This course reviews fundamental constructions related to the manipulation of such point clouds, mixing ideas from computational geometry and topology, statistics, and machine learning. The emphasis is on methods that not only come with theoretical guarantees, but also work well in practice. In particular, software references and example datasets will be provided to illustrate the constructions.

**Course outline.** The course consists of the following eight lectures (3h each):

1. Nearest neighbors in Euclidean and metric spaces: search data structures and algorithms [Slides]
2. Nearest neighbors in Euclidean and metric spaces: analysis [Slides]
3. Dimensionality reduction algorithms: basic methods [Course notes]
4. Dimensionality reduction algorithms: advanced methods
5. Clustering algorithms [Slides] [Lecture notes] [ToMATo]
6. Statistical hypothesis testing and two-sample tests (TST) [Lecture notes] [Slides]
7. Comparing high-dimensional distributions, with applications to feature selection
8. Reeb graphs and the Mapper [Slides] [Lecture notes]

**Summary.** For data modeled as a point cloud, reporting the
nearest neighbors (NN) of a given point is a fundamental operation,
used throughout data analysis. Efficient NN search methods will
be reviewed, both for Euclidean and metric spaces.

*Euclidean spaces:*

- Exact nearest neighbors via Voronoi diagrams
- Randomized k-d trees and random projection trees
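
As a concrete illustration of tree-based NN search, a k-d tree query can be cross-checked against brute force. This sketch uses SciPy's `cKDTree`, not any specific implementation discussed in the lecture:

```python
# Exact k-NN search with a k-d tree, checked against brute force.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
cloud = rng.standard_normal((1000, 3))   # point cloud in R^3
tree = cKDTree(cloud)

query = np.zeros(3)
dist, idx = tree.query(query, k=5)       # 5 nearest neighbors of the origin

# brute-force verification: sort all points by distance to the query
brute = np.argsort(np.linalg.norm(cloud - query, axis=1))[:5]
assert set(idx) == set(brute)
```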

*Metric spaces:*

- Metric trees
- Linear approximating and eliminating search algorithm
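
The triangle-inequality pruning behind the (linear) AESA family can be sketched as follows; the number of pivots and the metric are illustrative choices, and `laesa_nn` is a hypothetical helper name:

```python
# Pivot-based pruning for exact NN search in a general metric space:
# precompute distances from a few pivots to all points, and skip any
# candidate whose triangle-inequality lower bound exceeds the current best.
import numpy as np

def laesa_nn(points, query, metric, n_pivots=3):
    """Return (index, distance) of the nearest neighbor of `query`."""
    pivots = points[:n_pivots]
    piv_to_pts = np.array([[metric(p, x) for x in points] for p in pivots])
    piv_to_q = np.array([metric(p, query) for p in pivots])
    best_i, best_d = None, float("inf")
    for i, x in enumerate(points):
        # lower bound: d(q, x) >= max_j |d(q, p_j) - d(x, p_j)|
        if np.max(np.abs(piv_to_q - piv_to_pts[:, i])) >= best_d:
            continue  # x cannot beat the current best: skip the distance call
        d = metric(query, x)
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

rng = np.random.default_rng(1)
pts = rng.standard_normal((200, 4))
q = rng.standard_normal(4)
euclid = lambda a, b: float(np.linalg.norm(a - b))
i, d = laesa_nn(pts, q, euclid)
assert i == int(np.argmin(np.linalg.norm(pts - q, axis=1)))
```

The search stays exact because the skipped candidates are provably farther than the current best; only the number of expensive metric evaluations shrinks.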

*Selected important metrics, with applications:*

- Hausdorff and Fréchet distances
- Earth mover's distance and optimal transportation
- Earth mover's distance with connectivity constraints
- The Mallows distance and its application to comparing clusterings
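
In one dimension the earth mover's (equivalently, Mallows or 1-Wasserstein) distance has a closed form based on sorted samples, exposed directly by SciPy; a minimal sketch:

```python
# 1-D earth mover's distance between two empirical distributions.
import numpy as np
from scipy.stats import wasserstein_distance

u = np.array([0.0, 1.0, 3.0])
v = np.array([5.0, 6.0, 8.0])   # v is u shifted by 5
d = wasserstein_distance(u, v)  # every unit of mass travels 5
assert np.isclose(d, 5.0)
```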

**Bibliography**

- (book) S. Har-Peled; Geometric Approximation Algorithms; Mathematical Surveys and Monographs, AMS, 2011 --- chapters 17 to 20
- (book) G. Shakhnarovich, T. Darrell and P. Indyk (Eds), Nearest-Neighbor Methods in Learning and Vision: Theory and Practice, MIT Press, 2005.
- (book) Samet, Hanan. Foundations of multidimensional and metric data structures. Morgan Kaufmann, 2006.
- Arya, S. and Mount, D.M. and Netanyahu, N.S. and Silverman, R. and Wu, A.Y.; An optimal algorithm for approximate nearest neighbor searching fixed dimensions; Journal of the ACM, 45, 1998.
- Muja, Marius, and David G. Lowe. Fast Approximate Nearest Neighbors with Automatic Algorithm Configuration. VISAPP (1). 2009.
- P. Indyk, R. Motwani; Approximate nearest neighbors: towards removing the curse of dimensionality; Proceedings of the thirtieth annual ACM symposium on Theory of computing, pages 604-613, 1998
- Umeyama, S. (1991). Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on pattern analysis and machine intelligence, 13(4), 376-380.
- Rubner, Yossi, Carlo Tomasi, and Leonidas J. Guibas. The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40.2 (2000): 99-121.
- Levina, Elizaveta, and Peter Bickel. The earth mover's distance is the Mallows distance: some insights from statistics. Proceedings of the Eighth IEEE International Conference on Computer Vision (ICCV 2001), vol. 2, 2001.
- Zhou, Ding, Jia Li, and Hongyuan Zha. A new mallows distance based metric for comparing clusterings. Proceedings of the 22nd international conference on Machine learning. ACM, 2005.
- F. Cazals and D. Mazauric, Mass Transportation Problems with Connectivity Constraints, with Applications to Energy Landscape Comparison, Inria Tech report 8611, 2004.

**Summary.** This lecture develops selected mathematical insights
and analyses related to the data structures and algorithms seen in
the first lecture.

*High-dimensional spaces and measure concentration phenomena:
introduction and implications for NN algorithms:*

- The notion of intrinsic dimension
- Experiments on nearest neighbor searches, regression, dimension estimation
- Random projection trees (RPT): search performance analysis
- RPT: adaptation to the intrinsic dimension
- Concentration phenomena: application to nearest neighbor searches
- Concentration phenomena: key properties
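
The concentration phenomenon can be observed empirically: the relative gap between the nearest and farthest neighbor of a query point shrinks as the dimension grows, which is what makes "nearest neighbor" less discriminative in high dimension. A minimal sketch (sample sizes and dimensions are illustrative):

```python
# Distance concentration: relative contrast (d_max - d_min) / d_min
# between a query and the rest of a uniform sample, as dimension grows.
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=2000):
    pts = rng.random((n, dim))                       # uniform in [0, 1]^dim
    dists = np.linalg.norm(pts - pts[0], axis=1)[1:]  # distances to pts[0]
    return (dists.max() - dists.min()) / dists.min()

low, high = relative_contrast(2), relative_contrast(1000)
assert high < low   # contrast degrades markedly in high dimension
```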

**Bibliography**

- N. Verma, S. Kpotufe, S. Dasgupta, Which spatial partitions are adaptive to intrinsic dimension? UAI 2009.
- S. Dasgupta and K. Sinha. Randomized partition trees for exact nearest neighbor search. JMLR: Workshop and Conference Proceedings, 30:1--21, 2013.
- S. Vempala, Randomly-oriented kd Trees Adapt to Intrinsic Dimension. FSTTCS, 2012.
- Francois, D., Wertz, V., and Verleysen, M. The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 19(7), 873-886, 2007.
- Aggarwal, Charu C., Alexander Hinneburg, and Daniel A. Keim. On the surprising behavior of distance metrics in high dimensional space. Database Theory (ICDT 2001), Springer, 2001.
- Beyer, Kevin, et al. When is nearest neighbor meaningful? Database Theory (ICDT '99), Springer, 1999, 217-235.

**Summary.** Dimensionality reduction methods aim at embedding
high-dimensional data in lower-dimensional spaces, while preserving
specific properties (pairwise distances, data spread, etc.).

- Dimensionality reduction algorithms: criteria and classification
- The Johnson-Lindenstrauss lemma
- Multi-dimensional scaling and principal components analysis
- Isomap
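
The Johnson-Lindenstrauss lemma can be illustrated numerically: a scaled random Gaussian projection to k dimensions approximately preserves pairwise distances with high probability. A sketch under illustrative dimensions:

```python
# Random projection in the spirit of Johnson-Lindenstrauss.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 10_000, 1_000
X = rng.standard_normal((n, d))
P = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled Gaussian projection
Y = X @ P

orig = np.linalg.norm(X[0] - X[1])   # distance before projection
proj = np.linalg.norm(Y[0] - Y[1])   # distance after projection
assert abs(proj / orig - 1.0) < 0.2  # small distortion at this k
```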

**Bibliography**

- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
- (book) Sariel Har-Peled. Geometric approximation algorithms. No. 173. American Mathematical Soc., 2011.
- (book) Cox, Trevor F., and Michael A. A. Cox. Multidimensional scaling. CRC Press, 2000.
- Tenenbaum, Joshua B., Vin De Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science 290.5500 (2000): 2319-2323.
- Roweis, Sam T., and Lawrence K. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science 290.5500 (2000): 2323-2326.
- Coifman, Ronald R., et al. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. PNAS (102), 2005.
- Lafon, Stephane, and Ann B. Lee. Diffusion maps and coarse-graining: A unified framework for dimensionality reduction, graph partitioning, and data set parameterization. IEEE Transactions on Pattern Analysis and Machine Intelligence 28.9 (2006): 1393-1403.
- Dasgupta, Sanjoy, and Anupam Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms 22.1 (2003): 60-65.
- Frédéric Cazals, Frédéric Chazal, and Joachim Giesen. Spectral Techniques to Explore Point Clouds in Euclidean Space, with Applications to Collective Coordinates in Structural Biology. Nonlinear Computational Geometry. Springer New York, 2010. 1-34.
- Diaconis, Persi, Sharad Goel, and Susan Holmes. Horseshoes in multidimensional scaling and local kernel methods. The Annals of Applied Statistics (2008): 777-807.

**Summary.** Following the previous class, this class dives
into more advanced dimensionality reduction techniques, in particular
spectral techniques. Along the way, connections with other important
problems will be studied, in particular graph partitioning problems.

- Diffusion maps
- Applications to data partitioning
- Applications in biophysics
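
A minimal diffusion-maps sketch on a toy dataset (two nearby Gaussian blobs): build a Gaussian kernel, row-normalize it into a random-walk matrix, and embed with the first nontrivial eigenvector, whose sign pattern recovers the two groups. The bandwidth and data are illustrative choices:

```python
# Diffusion-maps sketch: kernel -> Markov matrix -> spectral embedding.
import numpy as np

rng = np.random.default_rng(0)
# two Gaussian blobs in the plane, 30 points each
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(1.5, 0.1, (30, 2))])

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / 0.5)                    # Gaussian kernel, bandwidth 0.5
P = K / K.sum(axis=1, keepdims=True)     # random-walk (Markov) matrix

w, V = np.linalg.eig(P)
order = np.argsort(-w.real)
psi = V[:, order[1]].real                # first nontrivial eigenvector
labels = psi > np.median(psi)            # its sign pattern splits the blobs
assert labels[:30].all() != labels[30:].all()
```

The same eigenvectors, weighted by powers of the eigenvalues, give the diffusion coordinates used for partitioning in the references above.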

**Bibliography**

- (book) Lee, John A., and Michel Verleysen. Nonlinear dimensionality reduction. Springer, 2007.
- R.R. Coifman and S. Lafon and A.B. Lee and M. Maggioni and B. Nadler and F. Warner and S.W. Zucker, Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 2005.
- S. Lafon and Y. Keller and R.R. Coifman, Data fusion and multi-cue data matching by diffusion maps, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
- S. Lafon and A.B. Lee, Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning and Data Set Parameterization, IEEE Trans. on Pattern Analysis and Machine Intelligence 28, 2006.
- Kannan, Ravi and Vempala, Santosh and Vetta, Adrian, On clusterings: Good, bad and spectral, J. of the ACM 51, 2004.

**Summary.** Clustering is a ubiquitous problem in the sciences. Here we focus on
methods that have a highly geometric flavor. This is also an opportunity
to introduce the basics of Morse and size theories.

- Clustering based on energy minimization: k-means and its variants, worst-case exponential convergence, k-means++
- Hierarchical clustering: single-linkage, maximum-linkage, average-linkage, chaining effect
- Mode-seeking: basics of Morse theory (critical points/peaks, stable manifolds/basins of attraction), Mean-Shift, graph-based hill climbing
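
The k-means++ seeding rule (sample each new center with probability proportional to the squared distance to the current centers), followed by Lloyd iterations, can be sketched as follows; `kmeanspp_seeds` and `lloyd` are hypothetical helper names and the toy data are illustrative:

```python
# k-means++ seeding followed by Lloyd's algorithm on two toy blobs.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(4, 0.2, (50, 2))])

def kmeanspp_seeds(X, k, rng):
    """Pick seeds with probability proportional to squared distance."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def lloyd(X, centers, iters=10):
    """Alternate assignment and centroid update."""
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        centers = np.array([X[lab == j].mean(0) for j in range(len(centers))])
    return centers, lab

centers, lab = lloyd(X, kmeanspp_seeds(X, 2, rng))
assert len(set(lab[:50])) == 1 and len(set(lab[50:])) == 1
```

With plain uniform seeding, both initial centers may land in the same blob; the squared-distance bias is precisely what k-means++ adds to avoid this.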

**Bibliography**

- (book) S. Har-Peled; Geometric Approximation Algorithms; Mathematical Surveys and Monographs, AMS, 2011 --- chapter 4
- D. Arthur, S. Vassilvitskii; k-means++: the advantages of careful seeding; Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pages 1027--1035, 2007
- D. Arthur and S. Vassilvitskii; How slow is the k-means method?; Proceedings of the twenty-second annual symposium on Computational geometry, pages 144--153, 2006
- G. Carlsson, F. Mémoli; Characterization, Stability and Convergence of Hierarchical Clustering Methods; Journal of Machine Learning Research, volume 11, pages 1425--1470, 2010
- F. Chazal, L. J. Guibas, S. Y. Oudot, P. Skraba; Persistence-Based Clustering in Riemannian Manifolds; Journal of the ACM, 60(6):41, 2013

**Summary.** Testing statistical hypotheses is routine in data
analysis. Of particular interest for (multi-)variate data are
two-sample tests (TST), which aim at checking whether two populations
modeled as point clouds in some Euclidean space come from the same
distribution (the null hypothesis) or not. This lecture presents
basics on hypothesis testing and TST.

- Statistical hypothesis testing and p-values
- One dimensional TST based on permutations
- Permutation tests
- Multi-dimensional tests based on the Maximum Mean Discrepancy
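
A permutation test on the difference of means illustrates the logic of a two-sample test, including the add-one correction advocated by Phipson and Smyth so that a randomized permutation p-value is never zero. Sample sizes, the shift, and the test statistic are illustrative choices:

```python
# Two-sample permutation test on the difference of means.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 100)
y = rng.normal(1.0, 1.0, 100)   # shifted sample: the null should be rejected

def perm_test(x, y, n_perm=2000, rng=rng):
    obs = abs(x.mean() - y.mean())
    z = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(z)   # random relabeling under the null
        count += abs(z[:len(x)].mean() - z[len(x):].mean()) >= obs
    # add-one correction: a randomized permutation p-value is never zero
    return (count + 1) / (n_perm + 1)

p = perm_test(x, y)
assert p < 0.01   # the shift is detected
```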

**Bibliography**

- (book) G. Casella and R. Berger, Statistical inference, Duxbury Press, 2001.
- (book) Mittelhammer, Ron C. Mathematical statistics for economics and business. Vol. 78. Berlin: Springer, 1996.
- (book) Lehmann, Erich L., and Joseph P. Romano. Testing statistical hypotheses. Springer, 2006.
- Gretton, Arthur, et al. A kernel two-sample test. The Journal of Machine Learning Research 13 (2012): 723-773.
- Samory Kpotufe. Escaping the curse of dimensionality with a tree-based regressor. Conference on Learning Theory (COLT), 2009.
- Murdoch, Duncan J., Yu-Ling Tsai, and James Adcock. P-values are random variables. The American Statistician 62.3 (2008).
- Sackrowitz, Harold, and Ester Samuel-Cahn. P values as random variables-expected P values. The American Statistician 53.4 (1999): 326-331.
- B. Phipson and G. Smyth, Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn, Statistical Applications in Genetics and Molecular Biology 9(10), 2010.

**Summary.** Upon performing a two-sample test (lecture 6), if
the null is rejected, one is left with a binary piece of information.
This class will develop techniques to expand this information in two
directions. First, we shall address the problem of identifying regions
in parameter space accounting for the differences revealed by the
TST. Second, we shall address the question of selecting the features
which account for the differences observed.

- Information theoretic measures
- Localizing data discrepancies upon performing a two-sample test
- Feature selection accounting for differences observed in two-sample tests
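
A drastically simplified view of feature selection in this setting is to screen features with univariate two-sample tests: here only the first coordinate differs between the two samples, and the Kolmogorov-Smirnov statistic singles it out. (The references address the genuinely multivariate problem, which this sketch does not.)

```python
# Per-feature two-sample screening with the Kolmogorov-Smirnov statistic.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
B = rng.standard_normal((200, 5))
B[:, 0] += 2.0   # discrepancy confined to feature 0

scores = [ks_2samp(A[:, j], B[:, j]).statistic for j in range(5)]
assert int(np.argmax(scores)) == 0   # feature 0 ranks first
```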

**Bibliography**

- (book) Cover and Thomas, Elements of Information Theory, Wiley, 2006.
- F. Cazals and A. Lheritier, Beyond Two-sample-tests: Localizing Data Discrepancies in High-dimensional Spaces, IEEE/ACM International Conference on Data Science and Advanced Analytics, 2015.
- Mueller and Jaakkola, Principal Differences Analysis: Interpretable Characterization of Differences between Distributions, NIPS 2015

**Summary.** A powerful data analysis paradigm currently
emerging is that of *data shape analysis*, as critical information on
discrete data (e.g. a point cloud) may be obtained from geometric and
topological descriptors of the data. This class will describe the
extraction of such descriptors, using the Mapper, a construction which
may be seen as a pixelized version of the Reeb graph, a classical
object encountered in computational topology.

- Functions on manifolds and their level sets
- Reeb graphs
- The Mapper construction
- Structure and Stability of the 1-Dimensional Mapper
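
The Mapper construction itself fits in a few lines: cover the range of a filter function with overlapping intervals, cluster each preimage (here with single linkage at a fixed radius), and connect clusters that share points. On a sampled circle filtered by the x-coordinate, the output graph is a cycle, mirroring the Reeb graph of the circle. All parameters below are illustrative:

```python
# A minimal 1-dimensional Mapper on a sampled circle.
import numpy as np

def connected_components(pts, radius):
    """Single-linkage components of a point set at a fixed radius."""
    adj = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1) <= radius
    seen, comps = set(), []
    for s in range(len(pts)):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(int(u) for u in np.where(adj[v])[0])
        comps.append(comp)
    return comps

def mapper_1d(X, f, n_intervals=8, overlap=0.3, radius=0.5):
    """Nodes = clusters of interval preimages; edges = shared points."""
    lo, hi = float(f.min()), float(f.max())
    length = (hi - lo) / n_intervals
    nodes = []
    for i in range(n_intervals):
        a = lo + i * length - overlap * length
        b = lo + (i + 1) * length + overlap * length
        idx = np.where((f >= a) & (f <= b))[0]
        for comp in connected_components(X[idx], radius):
            nodes.append(set(idx[comp].tolist()))
    edges = {(j, i) for i, u in enumerate(nodes)
             for j, v in enumerate(nodes[:i]) if u & v}   # nerve of the cover
    return nodes, edges

theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])
nodes, edges = mapper_1d(circle, circle[:, 0])   # filter = x-coordinate
assert len(nodes) == len(edges)   # a single cycle: as many edges as nodes
```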

**Bibliography**

- (book) J. W. Milnor; Morse Theory; Princeton University Press, 1963
- (book) Fomenko, Anatolij T., and Tosiyasu L. Kunii. Topological modeling for visualization. Springer, 1997.
- Singh, G., Memoli, F. and Carlsson, G.E., 2007, September. Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition. In SPBG (pp. 91-100).
- Carriere, M. and Oudot, S., 2015. Structure and stability of the 1-dimensional mapper. arXiv preprint arXiv:1511.05823.