Léon Bottou, Frank E. Curtis, Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning, available here
Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning, available here
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, see dedicated page
Mu Li, David G. Andersen, Alexander Smola, and Kai Yu, Communication Efficient Distributed Machine Learning with the Parameter Server, NIPS 2014, available here
Abadi et al., TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016, [pdf]
Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright, Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, available here
Eric P. Xing, Qirong Ho, Pengtao Xie, Wei Dai, Strategies and Principles of Distributed Machine Learning on Big Data, available here
Giovanni Neglia, Gianmarco Calbi, Don Towsley, and Gayane Vardoyan, The Role of Network Topology for Distributed Machine Learning, Infocom 2019, available here
Giovanni Neglia, Chuan Xu, Don Towsley, and Gianmarco Calbi, Decentralized gradient methods: does topology matter?, AISTATS 2020, available here
Chuan Xu, Giovanni Neglia, Nicola Sebastianelly, Dynamic Backup Workers, under submission, available here
Dan Alistarh's keynote talk at DISC 2019 [slides].
Other resources:
Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, available here
Aston Zhang, Zack C. Lipton, Mu Li, Alex J. Smola, Dive into Deep Learning, available here
Joseph E. Gonzalez, Emerging Systems for Large-Scale Machine Learning, invited tutorial at ICML 2014, slides [pdf], [pptx]
For Data Science students: 20% classwork (a 10-minute test at every lesson, only the 4 best marks will be considered), 25% individual/pair project to be delivered by March 14th, 40% theoretical exam, 15% lab evaluation.
Lessons
Lessons will run from 9.00 to 12.15.
First lesson (Giovanni Neglia, January 19, online): introduction to the course,
introduction to ML optimization (empirical risk vs expected risk, training/validation/test sets). Sections 1-3.1 of Bottou et al.
Second lesson (Giovanni Neglia, January 26, online):
math refresher (gradient, Hessian, convex sets and functions),
presentation of full batch gradient and stochastic gradient methods,
why stochastic gradient descent (SGD) may outperform batch gradient descent (qualitative explanation, time to minimize the empirical error), overview of noise reduction and second-order methods, mini-batch methods (a short illustrative code sketch comparing batch and stochastic gradients follows the lesson list). Section 4.3 of Goodfellow et al.; Sections 3.2, 3.3 and the introduction to Section 4 of Bottou et al.
Third lesson (Tareq Si Salem, February 2, online): convergence results (expected decrease after one iteration), definition of strong convexity, convergence results of stochastic gradient methods for strongly convex functions (both constant and decreasing learning rates). Sections 4.1 and 4.2 of Bottou et al.
Fourth lesson (Giovanni Neglia, February 9, online): the role of the condition number, trade-offs of mini-batch, convergence results of stochastic gradient methods for non-convex functions, noise reduction methods (dynamic sample size, gradient aggregation, iterate averaging), overview of second-order methods. Sections 4.2, 4.3, 4.5, 6 of Bottou et al.
Fifth lesson (Giovanni Neglia, February 16, online): trade-offs of mini-batch, noise reduction methods (dynamic sample size, gradient aggregation, iterate averaging), other optimization methods (momentum, Nesterov, the coordinate descent method). Sections 5 and 7 of Bottou et al.
Sixth lesson (Tareq Si Salem, February 23, online): optimization for neural networks (back-propagation, critical points, learning rate tuning, AdaGrad, RMSProp, Adam, weight initialization, batch normalization), mini-batch size and communication delay, data vs model parallelism, synchronous parameter server, ring all-reduce. Sections 5 and 7 of Bottou et al., Sections 6.5.1-6.5.8 and 8.1-8.5 of Goodfellow et al., and the paper by Li et al.
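As a small companion to the second and fourth lessons, the following NumPy sketch contrasts full-batch gradient descent with mini-batch SGD on a synthetic least-squares problem. It is only an illustration: the data, step sizes, and batch size are arbitrary choices and are not prescribed by the course material.

# Minimal sketch (illustrative only): full-batch gradient descent vs.
# mini-batch SGD on a synthetic least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10                                   # samples, parameter dimension
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_gradient(w):
    # gradient of the empirical risk (1/(2n)) * ||Xw - y||^2
    return X.T @ (X @ w - y) / n

def minibatch_gradient(w, batch_size=32):
    # unbiased stochastic estimate of the full gradient
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

w_gd, w_sgd = np.zeros(d), np.zeros(d)
for k in range(200):
    w_gd -= 0.1 * full_gradient(w_gd)                           # constant step size
    w_sgd -= 0.1 / (1 + 0.01 * k) * minibatch_gradient(w_sgd)   # decreasing step size

print("batch GD error      :", np.linalg.norm(w_gd - w_true))
print("mini-batch SGD error:", np.linalg.norm(w_sgd - w_true))

Note that each SGD step touches only 32 samples instead of all 1000, which is the source of the computational advantage discussed qualitatively in the second lesson.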
Labs
During these practical sessions students will have the opportunity to train ML models in a distributed way on the Inria scientific cluster.
Sessions are organized by Chuan Xu and Othmane Marfoq.
Students need to carry out some administrative/configuration steps before the start of the labs. Labs website.
Participation in the labs will be graded by the teachers.
Individual/Pair project
The project is an opportunity for the student to actively use the material taught in the course.
Students are free to choose the goal of their project, but are invited to discuss it with the teacher.
Possible goals are
reproduce an experimental result in a paper,
design or perform an experiment to support/refute a statement in a paper,
apply some of the optimization algorithms described in the course to a specific problem the student is interested in (e.g. for another course, his/her final project, etc.),
compare different algorithms,
implement an algorithm in a distributed system (PyTorch, Spark, TensorFlow, …),
…
It is possible for a pair of students to carry out the project together.
A list of possible projects is provided below.
Submission rules
The student(s) will provide
1) a 4-page report formatted according to the ICLR template, with unlimited additional pages for the bibliography and for optional appendices containing proofs, a description of the code, or additional experiments,
2) the code developed,
3) a readme file with instructions to run the code and reproduce the experiments in the report.
The report must clearly describe and motivate the goal of the project, provide any relevant background and explain the original contribution of the student.
What is explained in the course can be considered general knowledge and should not be repeated in the report.
The code must be well commented.
The student will make the material above available online in a zipped folder named with his/her name, and will send the link to the teacher.
Each student carrying out the project with a partner must send an email to the teacher indicating an estimate of his/her percentage contribution to the project (e.g. "I estimate my contribution to have been 60%").
Grading
The mark will take into account: originality of the project, presentation quality, technical correctness, and task difficulty. More will be expected from projects carried out by a pair of students.
For projects carried out by a pair of students up to 20% of the final mark will depend on the self-evaluated contribution of the student.
Any form of plagiarism will lead to reduction of the final mark.
Failure to respect the submission rules indicated above (e.g. a report exceeding the maximum length, or a report sent as an attachment) will lead to a reduction of the final mark.
Ideas for possible projects
The list is in random order and will be extended during the course.
Compare stochastic gradient descent and gradient aggregation methods in different regimes of condition-number versus dataset-size. See Bottou et al, section 5.3.3.
Survey and compare iterate averaging methods. See Bottou et al, section 5.4.
Survey the dynamic sampling techniques described in references [28,73] of Bottou et al (section 5.2.1). Implement and test at least one of them (including the basic one described in section 5.2.1).
Implement the (inexact) Newton method without inverting the Hessian, but using the conjugate gradient method. Evaluate the effect of stopping the conjugate gradient method after i ≤ d steps, where d is the dimension of the parameter vector. See Bottou et al section 6.1 and Bubeck section 2.4.
Second-order techniques for neural network training. See Bottou et al section 6.1.2 and references [12,100] as a starting point. Better to avoid this project if you are not familiar with the backpropagation method for neural networks.
Back to the classics: understand backpropagation from the original sources. Start from [134,135] in Bottou et al.
Survey, implement and test L-BFGS. See Bottou et al section 6.2 and references [97,113].
Asynchronicity as momentum: "Asynchrony begets Momentum, with an Application to Deep Learning" by Mitliagkas et al., arXiv.
"Deep learning with Elastic Averaging SGD" by Zhang et al., arXiv.
Simulate a parameter server (PS) with backup workers, introducing some communication constraints between the workers and the PS (e.g. in the form of a queue).
Consider workers that are nodes of a graph with communication delays. Compare the convergence speed of consensus methods with a parameter server (PS) approach where the PS is the root of a tree.
Francis Bach's blog post on averaging can be a project inspiration.
Compare Adam, AdamW, and AMSGrad. A good starting point is this fast.ai blog post.
Compare Nesterov's accelerated gradient method with the classical momentum method on specific toy examples and for neural network training (a minimal toy sketch is given after this list).
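For the last idea in the list, here is a minimal NumPy sketch of the kind of toy comparison intended; the quadratic objective and the hyper-parameters are arbitrary illustrative choices, not part of the course material.

# Illustrative sketch: classical (heavy-ball) momentum vs. Nesterov's
# accelerated gradient on the ill-conditioned quadratic f(w) = 0.5 * w^T A w,
# whose minimizer is w = 0.
import numpy as np

A = np.diag([1.0, 100.0])            # condition number 100
grad = lambda w: A @ w               # gradient of f

def heavy_ball(w0, lr=0.009, beta=0.9, iters=100):
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(iters):
        v = beta * v - lr * grad(w)             # gradient evaluated at the current iterate
        w = w + v
    return w

def nesterov(w0, lr=0.009, beta=0.9, iters=100):
    w, v = w0.copy(), np.zeros_like(w0)
    for _ in range(iters):
        v = beta * v - lr * grad(w + beta * v)  # gradient evaluated at the look-ahead point
        w = w + v
    return w

w0 = np.array([1.0, 1.0])
print("classical momentum, distance to optimum:", np.linalg.norm(heavy_ball(w0)))
print("Nesterov momentum,  distance to optimum:", np.linalg.norm(nesterov(w0)))

The only difference between the two updates is where the gradient is evaluated; plotting the distance to the optimum along the iterations for different condition numbers is a natural way to extend the comparison.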
Exam
The final exam will be an oral interview with the teachers Giovanni Neglia and Tareq Si Salem on Friday March 19th between 14.30 and 19.30.
Last modified: March 18, 2021