
Foundations of Data Analytics

École Normale Supérieure de Lyon
Computer Science Department
Master 2 – 2024 – 5 ECTS

Course Description

The course “Foundations of Data Analytics” covers a selection of topics from measure theory, information theory, statistics, and game theory, with the aim of building a mathematical foundation for data analytics and studying classical problems in data analytics and statistical machine learning.

The course starts by reviewing foundational aspects of measure theory and information theory to introduce the Radon-Nikodym derivative (RND). The RND is used to build the notion of relative entropy, from which all the other information measures are derived. This provides a unified view of information theory, statistics, and all the mathematical tools used in the course. Equipped with these tools, the problem of statistical hypothesis testing is formulated without particular conditions on the random variables taking part in the problem. This paves the way for studying the problem of empirical risk minimization and its variants obtained via information-theoretic regularizations.
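
For reference, relative entropy is defined directly in terms of the RND: for two probability measures P and Q on the same measurable space, with P absolutely continuous with respect to Q,

\[
D(P \,\|\, Q) \;=\; \int \log\!\left( \frac{\mathrm{d}P}{\mathrm{d}Q}(x) \right) \mathrm{d}P(x),
\]

where dP/dQ denotes the RND of P with respect to Q. All the other information measures used in the course are obtained as special cases of this quantity.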

Within this framework, the generalization capabilities of machine learning algorithms are studied. In particular, explicit expressions are introduced for the sensitivity of the expected empirical risk to deviations of the probability measure from which the data is sampled. A similar analysis is conducted for the generalization error, which leads to insightful connections with mutual information and lautum information. Finally, the robustness of machine learning algorithms is studied using the worst-case data-generating probability measure and elements of zero-sum games with noisy observations.
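
To give a flavor of such connections (in a standard form from the literature, not necessarily the exact one derived in the lectures): with mutual information I(W; S) = D(P_{W,S} \| P_W P_S) and lautum information L(W; S) = D(P_W P_S \| P_{W,S}), a well-known result of Xu and Raginsky (2017) states that if the loss is σ-sub-Gaussian, the expected generalization error of an algorithm returning a hypothesis W from an n-sample dataset S satisfies

\[
\bigl| \mathbb{E}\bigl[ \mathrm{gen}(W, S) \bigr] \bigr| \;\le\; \sqrt{ \frac{2 \sigma^{2}}{n} \, I(W; S) },
\]

so that algorithms whose outputs depend only weakly on the data generalize well.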

Finally, data integrity is studied within this mathematical framework, with a focus on two topics: data injection attacks and covert information. Both problems are shown to exhibit interesting properties in terms of the trade-off between the probability of detection and the data distortion. These trade-offs are characterized in terms of information measures.
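
As an illustration of such characterizations (a standard bound from binary hypothesis testing, not necessarily the exact form derived in the lectures): if P_0 denotes the probability measure of the observations in the absence of an attack and P_1 the measure induced by the attack, then every detector satisfies, via Pinsker's inequality,

\[
P_{\mathrm{FA}} + P_{\mathrm{MD}} \;\ge\; 1 - \sqrt{ \tfrac{1}{2} D(P_1 \,\|\, P_0) },
\]

where P_FA and P_MD are the false-alarm and missed-detection probabilities. Hence, an attack inducing a small relative entropy, typically at the price of a small distortion, is essentially undetectable.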

Evaluation

  • Weekly homework assignments (40%)

  • In-class work (10%) – in the form of oral questions

  • Final Exam (50%) – two options: (a) a blackboard presentation of a given topic; or (b) submission of a research paper to the Information Theory Workshop (ITW) or the International Symposium on Information Theory (ISIT).

Part I: Theoretical Foundations

Lecture Notes

The lecture notes are available here.

  • Lecture 1: Elements of Measure Theory - Part 1/2

    • Measures and Measurable Spaces

    • Measurable Functions and Integration

    • The Radon-Nikodym Derivative

  • Lecture 2: Elements of Measure Theory - Part 2/2

    • Exponential Families

    • The Gibbs Probability Measure

    • Divergences

Homework 1: Deadline TBD

  • Lecture 3: Elements of Information Theory - Part 1/3

    • Relative Entropy

    • Relative Entropy Extended to σ-Finite Measures

    • Entropy, Mutual Information, and Lautum Information

  • Lecture 4: Elements of Information Theory - Part 2/3

    • Fenchel Transform

    • Variational Formulation of Relative Entropy

    • Sensitivity of the Expectation to Variations of the Measure

Homework 2: Deadline TBD

  • Lecture 5: Elements of Information Theory - Part 3/3

    • Optimization in the Space of Probability Measures

    • Regularization by f-Divergences

    • Regularization by Relative Entropy

  • Lecture 6: Elements of Game Theory

    • Zero-Sum Games

    • Games with Incomplete Information

    • Noisy Observations

Homework 3: Deadline TBD

  • Lecture 7: Hypothesis Testing - Part 1/3

    • Bayesian Methods

    • Minimax Methods

    • The Neyman-Pearson Method

  • Lecture 8: Hypothesis Testing - Part 2/3

    • Method of Types

    • Sanov's Theorem

    • Chernoff-Stein Lemma (see the statement after this outline)

Homework 4: Deadline TBD

  • Lecture 9: Hypothesis Testing - Part 3/3

    • Hoeffding Hypothesis Tests

    • Decentralized Hypothesis Tests

    • Hypothesis Testing with Mismatched Statistics
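
For reference, the Chernoff-Stein lemma covered in Lecture 8 can be stated as follows: when testing H_0 : P_0 against H_1 : P_1 from n i.i.d. observations, with the type-I error probability constrained to be at most ε ∈ (0, 1), the smallest achievable type-II error probability β_n(ε) satisfies

\[
\lim_{n \to \infty} \frac{1}{n} \log \beta_{n}(\varepsilon) \;=\; - D(P_0 \,\|\, P_1),
\]

that is, the optimal type-II error decays exponentially at a rate given by the relative entropy.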

Part II: Applications

  • Lecture 10: Statistical Machine Learning

    • Priors and the Role of Relative Entropy

    • The Notion of Probably Approximately Correct (PAC)

    • Generalization Guarantees

  • Lecture 11: Empirical Risk Minimization (ERM) - Part 1/3

    • The ERM Problem in the Space of Probability Measures

    • The Gibbs Algorithm (see the sketch after this outline)

    • The Worst-Case Data-Generating Probability Measure

Homework 5: Deadline TBD

  • Lecture 12: Empirical Risk Minimization (ERM) - Part 2/3

    • Moments of the Empirical Risk

    • Sub-Gaussianity and Sample Complexity

    • Generalization Capabilities

  • Lecture 13: Empirical Risk Minimization (ERM) - Part 3/3

    • Generalization Error

    • Generalization Guarantees in Expectation

    • Generalization Guarantees in Probability

Homework 6: Deadline TBD

  • Lecture 14: Data Integrity

    • Data Injection and Attack Detection

    • Covert Information

    • Data Injection Attacks in Machine Learning
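
For reference, the closed-form solution that makes the Gibbs algorithm central to Lectures 11-13 can be sketched as follows (the notation Q, \hat{L}, and λ is illustrative; the lectures develop the result in full generality): given a reference probability measure Q on the set of models, an empirical risk θ ↦ \hat{L}(θ), and a regularization parameter λ > 0, the relative-entropy-regularized ERM problem

\[
\min_{P} \; \mathbb{E}_{\theta \sim P}\bigl[ \hat{L}(\theta) \bigr] + \lambda \, D(P \,\|\, Q)
\]

is solved by the Gibbs probability measure P^\star, whose RND with respect to Q is

\[
\frac{\mathrm{d}P^\star}{\mathrm{d}Q}(\theta) \;=\; \frac{ \exp\bigl( -\hat{L}(\theta) / \lambda \bigr) }{ \mathbb{E}_{\theta' \sim Q}\bigl[ \exp\bigl( -\hat{L}(\theta') / \lambda \bigr) \bigr] }.
\]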

Student Contest (Final Exams)

  • Lecture 15: Student Contest (Final Exams) – TBD

    • Students’ Project Presentations

    • Final proofreading of papers to be submitted to ITW/ISIT

  • Lecture 16: Student Contest (Final Exams) – TBD

    • Students’ Project Presentations

    • Final proofreading of papers to be submitted to ITW/ISIT