D I S T R I B U T E D O P T I M I Z A T I O N A N D G A M E S

Distributed Optimization and Games, 2019-2020

The purpose of this course is to introduce students to large scale machine learning. The focus is as much on optimization algorithms as on distributed systems.

Teachers: Giovanni Neglia, Chuan Xu, Tareq Si-Salem, Othmane Marfoq.

Main references:

Léon Bottou, Frank E. Curtis, Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning, available here
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, available here
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, see dedicated page
Mu Li, David G. Andersen, Alexander Smola, and Kai Yu, Communication Efficient Distributed Machine Learning with the Parameter Server, NIPS 2014 available here
Abadi et al, TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016, [pdf]
Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright, Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, available here
Eric P. Xing, Qirong Ho, Pengtao Xie, Wei Dai, Strategies and Principles of Distributed Machine Learning on Big Data, available here
Giovanni Neglia, Gianmarco Calbi, Don Towsley, and Gayane Vardoyan, The Role of Network Topology for Distributed Machine Learning, Infocom 2019, available here
Giovanni Neglia, Chuan Xu, Don Towsley, and Gianmarco Calbi, Decentralized gradient methods: does topology matter?, AISTATS 2020, available here
Chuan Xu, Giovanni Neglia, Nicola Sebastianelly, Dynamic Backup Workers, under submission, available here
Dan Alistarh's keynote talk at DISC 2019 [slides].

Other resources:

Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, available here
Aston Zhang, Zack C. Lipton, Mu Li, Alex J. Smola, Dive into Deep Learning, available here
Joseph E. Gonzalez, Emerging Systems for Large-Scale Machine Learning, invited tutorial at ICML 2014, slides [pdf], [pptx]
Yuxin Chen's course on Large-Scale Optimization for Data Science at Princeton
Alex Smola's course on Scalable Machine Learning at Berkeley

Evaluation
- For Data Science students: 25% classwork (a 10-minute test at every lesson, only 5 best marks will be considered), 25% individual/pair project to be delivered by March 8th, 40% theoretical exam, 10% lab evaluation.
Lessons
Lessons will be from 9.00 to 12.15.
- First lesson (January 8, room TD8, Lucioles): introduction to the course, math refresher (gradient, hessian, convex sets and functions), introduction to ML optimization (empirical risk vs expected risk, training/validation/test sets), presentation of full batch gradient and stochastic gradient methods. Sections 1-3.2 of Bottou et al, section 4.3 of Goodfellow et al.
- Second lesson (January 15, O+317, Templiers): why stochastic gradient descent (SGD) may outperform batch gradient (qualitative explanation, time to minimize the empirical error), overview of noise reduction and second order methods, mini-batch methods, convergence results (expected decrease after one iteration). Sections 3.3 and 4.1 of Bottou et al.
- Third lesson (January 22, O+307, Templiers): definition of strongly convexity, convergence results of stochastic gradient methods for strongly convex functions (both constant and decreasing learning rates). Section 4.2 of Bottou et al.
- Fourth lesson (January 29, E+142): the role of the condition number, trade-offs of mini-batch, convergence results of stochastic gradient methods for non-convex functions, noise reduction methods (dynamic sample size, gradient aggregation, iterate averaging), overview of second-order methods. Sections 4.2, 4.3, 4.5, 5, 6 of Bottou et al.
- Fifth lesson (February 5, O+309, Templiers): other optimization methods (momentum, Nesterov, coordinate descent method), optimization for neural networks (back-propagation, critical points, learning rate tuning, AdaGrad, RMSProp, Adam, weight initialization, batch normalization), minibatch size and communication delay, data vs model parallelism, synchronous parameter server, ring all-reduce. Section 7 of Bottou et al, sections 6.5.1-6.5.8, 8.1-8.5 of Goodfellow et al, and paper Li et al.
- Sixth lesson (February 12, C4, Lucioles): Parameter server architecture, Apache Hadoop/Spark for distributed ML?, characteristics of ML optimization-centric programs (first part), Hogwild's example, Backup workers, consensus approaches. Niu et al, Xing et al, Neglia et al papers, Xu et al.
- Seventh lesson (February 19, O+229, Templiers): data parallelism vs model parallelism, characteristic of ML optimization-centric programs (second part), four decisions for distributed systems: 1) scheduling and balancing workloads (who does what), 2) bridging computation and communication (a time for talk, a time for action), 3)compressing communications (what to say), 4) managing communication topologies (whom to talk to), Xing et al, Alistarh's talk.
Labs
During these practical sessions students will have the opportunity to train ML models in a distributed way on Inria scientific cluster.
Sessions are organized by Chuan Xu and Othmane Marfoq.
Students need to carry out some administrative/configuration steps before the start of the labs.
Information is at DOG labs
Labs will be from 9.00 to 12.15 on February 13 and February 20 in room Lucioles 281.
Participation to the labs will be graded by the teachers.

Individual/Pair project
The project is an opportunity for the student to actively use the material taught in the course. Students are free to choose the goal of their project, but are invited to discuss it with the teacher. Possible goals are
- reproduce an experimental result in a paper,
- design or perform an experiment to support/confute a statement in a paper,
- apply some of the optimization algorithms described in the course to a specific problem the student is interested in (e.g. for another course, his/her final project, etc.),
- compare different algorithms,
- implement an algorithm in a distributed system (PyTorch, Spark, TensorFlow, …),
- …
It is possible for a pair of students to carry out together the project. A list of possible projects is provided below.
Submission rules
- The student(s) will provide 1) a 4-page report formatted according to ICLR template, with unlimited additional pages for bibliography and eventual unlimited appendices to contain proofs, description of code or additional experiments, 2) code developed, 3) a readme file containing instructions to run the code and reproduce the experiments in the report
- The report must clearly describe and motivate the goal of the project, provide any relevant background and explain the original contribution of the student. What is explained in the course can be considered of general knowledge and should not be repeated in the report.
- The code must be well commented.
- The student will made the material above available online in a zipped folder named with his/her name, and will send the link to the teacher.
- Each student carrying out the project together with another one must send an email to the teacher indicating a percentage estimation of his/her contribution to the project (e.g. "I estimate my contribution to have been 60%").
Grading
- The mark will take into account: originality of the project, presentation quality, technical correctness, and task difficulty. More will be expected by projects carried out by a pair of students.
- For projects carried out by a pair of students up to 20% of the final mark will depend on the self-evaluated contribution of the student.
- Any form of plagiarism will lead to reduction of the final mark.
- Failure to respect the submission rules indicated above (e.g.~report exceeding the maximum length, report sent as attachment) will lead to reduction to the final mark.
Ideas for possible projects
In random order, the list is extended during the course.
1. Compare stochastic gradient descent and gradient aggregation methods in different regimes of condition-number versus dataset-size. See Bottou et al, section 5.3.3.
2. Survey and compare iterage averaging methods. See Bottou et al, section 5.4.
3. Survey the dynamic sampling techniques described in references [28,73] of Bottou et al (section 5.2.1). Implement and test at least one of them (including the basic one described in section 5.2.1).
4. Implement the (inexact) Newton method without inverting the Hessian, but using the conjugate gradient method. Evaluate the effect of stopping the conjugate method after i ≤ d steps, where d is the dimension of the parameter vector. See Bottou et al section 6.1 and Bubeck section 2.4.
5. Second-order techniques for neural network training. See Bottou et al section 6.1.2 and references [12,100] as a starting point. Better avoid this project if you are not familiar with backpropagation method for neural networks.
6. Back to the classics: understand backpropagation from the original sources. Start from [134,135] in Bottou et al.
7. Survey, implement and test L-BFGS. See Bottou et al section 6.2 and references [97,113].
8. Asynchronicity as a momentum: ''Asynchrony begets Momentum, with an Application to Deep Learning'' by Mitliagkas et al, arxiv.
9. ''Deep learning with Elastic Averaging SGD'' by Zhang et al, arxiv.
10. Simulate a parameter server (PS) with backup workers introducing some communication constraints between the workers and the PS (e.g. in the form of queue).
11. Consider workers are nodes in a graph with communication delays. Compare the convergence speed of consensus methods with a parameter server (PS) approach where the PS is the root of a tree.
12. Francis Bach's blog post on acceleration can be a project inspiration. See also the blog post on Chebyshev acceleration.
13. Francis Bach's blog post on averaging can be a project inspiration.
14. Compare Adam, AdamW, and amsgrad. A good starting point is this fast.ai blog post.
Exam
The exam will be in room E+132, Templiers.

Last modified: February 18, 2020

Distributed Optimization and Games, 2019-2020

Lessons

Labs

Individual/Pair project

Submission rules

Grading

Ideas for possible projects

Exam