The purpose of this course is to introduce students to large-scale machine learning. The focus is as much on optimization algorithms as on distributed systems.
Teacher: Giovanni Neglia
Main references:
Léon Bottou, Frank E. Curtis, Jorge Nocedal, Optimization Methods for Large-Scale Machine Learning, available here
Joseph E. Gonzalez, Emerging Systems for Large-Scale Machine Learning, invited tutorial at ICML 2014, slides [pdf], [pptx]
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers, see dedicated page
Mu Li, David G. Andersen, Alexander Smola, and Kai Yu, Communication Efficient Distributed Machine Learning with the Parameter Server, NIPS 2014, available here
Abadi et al, TensorFlow: A System for Large-Scale Machine Learning, OSDI 2016, [pdf]
Feng Niu, Benjamin Recht, Christopher Ré and Stephen J. Wright, Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, available here
Eric P. Xing, Qirong Ho, Pengtao Xie, Wei Dai, Strategies and Principles of Distributed Machine Learning on Big Data, available here
Giovanni Neglia, Gianmarco Calbi, Don Towsley, and Gayane Vardoyan, The Role of Network Topology for Distributed Machine Learning, Infocom 2019
Other resources:
Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, available here
Evaluation: 30% classwork (a 10-minute test at every lesson; only the 5 best marks will be considered), 30% individual project to be delivered in week 7, 40% final exam.
Lessons
You can freely use the slides below for your own presentations, but please let me know and acknowledge the source in your presentation. Any comments are welcome.
All lessons will be in Templiers room O+309 from 9.00 to 12.15.
First lesson (December 19): introduction to the course, math refresher (gradient, Hessian, convex sets and functions), introduction to ML optimization (empirical risk vs expected risk, training/validation/test sets), presentation of full-batch gradient and stochastic gradient methods. Sections 1-3.2 of Bottou et al.
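A minimal Python sketch (not part of the course material; the least-squares objective, data sizes, and step sizes are illustrative assumptions) contrasting the two methods introduced in this lesson: full-batch gradient descent, which touches all n samples per update, and SGD, which uses one sample per update.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5                                  # samples, parameter dimension
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def full_gradient(w):
    # gradient of the empirical risk F(w) = (1/2n) ||Xw - y||^2
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w, i):
    # gradient of the loss on sample i: an unbiased estimate of the full gradient
    return X[i] * (X[i] @ w - y[i])

# full-batch gradient descent: one pass over all n samples per update
w_gd = np.zeros(d)
for k in range(100):
    w_gd -= 0.1 * full_gradient(w_gd)

# stochastic gradient descent: one sample per update, decreasing step size,
# same total number of per-sample gradient evaluations as above
w_sgd = np.zeros(d)
for k in range(100 * n):
    i = rng.integers(n)
    w_sgd -= 0.1 / (1 + k / n) * stochastic_gradient(w_sgd, i)

print("full-batch empirical risk:", 0.5 * np.mean((X @ w_gd - y) ** 2))
print("SGD        empirical risk:", 0.5 * np.mean((X @ w_sgd - y) ** 2))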
Second lesson (January 9): why stochastic gradient descent (SGD) may outperform batch gradient (qualitative explanation, time to minimize the empirical error), overview of noise reduction and second order methods, mini-batch methods, convergence results (expected decrease after one iteration). Sections 3.3 and 4.1 of Bottou et al.
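The "expected decrease after one iteration" result has, roughly, the following form (paraphrased from Section 4.1 of Bottou et al., assuming F has L-Lipschitz gradients; g(w_k, ξ_k) denotes the stochastic gradient and α_k the step size, see the paper for the exact statement and assumptions):

E_{ξ_k}[ F(w_{k+1}) ] - F(w_k) ≤ -α_k ∇F(w_k)ᵀ E_{ξ_k}[ g(w_k, ξ_k) ] + (α_k² L / 2) E_{ξ_k}[ ‖g(w_k, ξ_k)‖² ]

The first term is the expected descent obtained from the gradient direction, the second the price paid for the gradient noise and the step size; balancing the two terms drives the choice between constant and decreasing learning rates.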
Third lesson (January 16): definition of strong convexity, convergence results of stochastic gradient methods for strongly convex functions (both constant and decreasing learning rates), the role of the condition number. Section 4.2 of Bottou et al.
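For reference (standard definitions; notation may differ slightly from Bottou et al.): F is c-strongly convex if

F(v) ≥ F(w) + ∇F(w)ᵀ (v - w) + (c/2) ‖v - w‖²   for all v, w,

and, when ∇F is also L-Lipschitz, the condition number is κ = L/c ≥ 1; the convergence bounds seen in this lesson degrade as κ grows.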
Fourth lesson (January 23): the role of the condition number (practical considerations), trade-offs of mini-batch methods, convergence results of stochastic gradient methods for non-convex functions, noise reduction methods (dynamic sample size, gradient aggregation, iterate averaging), overview of second-order methods. Sections 4.2, 4.3, 4.5, 5, 6 of Bottou et al.
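As an illustration of the gradient aggregation idea, a minimal sketch of an SVRG-style loop in Python (the least-squares objective, step size, and epoch length are illustrative assumptions, not the course's reference implementation): the full gradient is recomputed at a snapshot point and used to reduce the variance of the per-sample gradients.

import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad_i(w, i):
    # per-sample gradient of the least-squares loss
    return X[i] * (X[i] @ w - y[i])

w_tilde = np.zeros(d)                        # snapshot point
alpha, n_epochs, m = 0.05, 10, n             # illustrative step size and inner-loop length
for s in range(n_epochs):
    mu = X.T @ (X @ w_tilde - y) / n         # full gradient at the snapshot
    w = w_tilde.copy()
    for k in range(m):
        i = rng.integers(n)
        g = grad_i(w, i) - grad_i(w_tilde, i) + mu   # variance-reduced gradient
        w -= alpha * g
    w_tilde = w                              # take the last iterate as the new snapshot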
Fifth lesson (January 30): popular optimization methods, minibatch size and communication delay, data vs model parallelism, synchronous parameter server, ring all-reduce, backup workers. Section 7 of Bottou et al and paper Li et al.
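A minimal single-process sketch of the synchronous parameter-server logic discussed in this lesson (workers are simulated sequentially; data sizes, batch size, and step size are illustrative assumptions: in a real deployment each worker runs on its own machine and the server aggregates gradients over the network):

import numpy as np

rng = np.random.default_rng(2)
n, d, n_workers = 600, 5, 4
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)
shards = np.array_split(np.arange(n), n_workers)    # data parallelism: one shard per worker

w = np.zeros(d)                                     # parameters kept by the server
alpha, batch_size = 0.1, 32
for step in range(200):
    grads = []
    for shard in shards:
        # each worker computes a mini-batch gradient on its own shard at the current w
        batch = rng.choice(shard, size=batch_size, replace=False)
        grads.append(X[batch].T @ (X[batch] @ w - y[batch]) / batch_size)
    # synchronous step: the server waits for all workers, averages, then updates
    w -= alpha * np.mean(grads, axis=0)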
Sixth lesson (February 6): Apache Hadoop/Spark for distributed ML?, characteristics of optimization-centric ML programs, the Hogwild! example, four decisions for distributed ML systems: 1) scheduling and balancing workloads (who does what), 2) bridging computation and communication (a time for talk, a time for action), 3) managing communication topologies (whom to talk to), 4) compressing communications (what to say). Niu et al, Xing et al.
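A minimal sketch of Hogwild!-style lock-free SGD with Python threads and a shared parameter vector (illustrative only: the data and step size are assumptions, the original paper targets sparse problems, and Python's GIL only approximates true shared-memory parallelism):

import threading
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

w = np.zeros(d)                                  # shared parameters, no lock protects them

def worker(seed, n_updates=2000, alpha=0.01):
    local_rng = np.random.default_rng(seed)
    for _ in range(n_updates):
        i = local_rng.integers(n)
        g = X[i] * (X[i] @ w - y[i])             # gradient at possibly stale parameters
        np.subtract(w, alpha * g, out=w)         # lock-free in-place update of w

threads = [threading.Thread(target=worker, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("final empirical risk:", 0.5 * np.mean((X @ w - y) ** 2))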
Seventh lesson (February 13):
Individual project
The individual project is an opportunity for the student to actively use the material taught in the course.
The student is free to choose the goal of his/her project, but is invited to discuss it with the teacher.
Possible goals are
reproduce an experimental result in a paper,
design or perform an experiment to support or refute a statement in a paper,
apply some of the optimization algorithms described in the course to a specific problem the student is interested in (e.g. for another course, his/her final project, etc.),
compare different algorithms,
implement an algorithm in a distributed system (Spark, TensorFlow, …),
…
The mark will take into account: originality of the project, presentation quality, technical correctness, and task difficulty. Any form of plagiarism will lead to a reduction of the final mark.
A list of possible projects is provided below.
Submission rules
The student will provide
1) a 3-page report formatted according to the ICLR template, with unlimited additional pages for the bibliography and optional appendices (also without page limits) containing proofs, a description of the code, or additional experiments,
2) the code developed,
3) a readme file containing instructions to run the code and reproduce the experiments in the report.
The report must clearly describe and motivate the goal of the project, provide any relevant background and explain the original contribution of the student.
What is explained in the course can be considered common knowledge and should not be repeated in the report.
The code must be well commented.
The student will make the material above available online in a zipped folder named after his/her name, and will send the link to the teacher.
Ideas for possible projects
In no particular order; the list will be extended during the course.
Compare stochastic gradient descent and gradient aggregation methods in different regimes of condition-number versus dataset-size. See Bottou et al, section 5.3.3.
Survey and compare iterate averaging methods. See Bottou et al, section 5.4.
Survey the dynamic sampling techniques described in references [28,73] of Bottou et al (section 5.2.1). Implement and test at least one of them (including the basic one described in section 5.2.1).
Implement the (inexact) Newton method without inverting the Hessian, but using the conjugate gradient method. Evaluate the effect of stopping the conjugate gradient method after i ≤ d steps, where d is the dimension of the parameter vector (a minimal sketch is given after this list). See Bottou et al section 6.1 and Bubeck section 2.4.
Second-order techniques for neural network training. See Bottou et al section 6.1.2 and references [12,100] as a starting point. Better to avoid this project if you are not familiar with the backpropagation method for neural networks.
Back to the classics: understand backpropagation from the original sources. Start from [134,135] in Bottou et al.
Survey, implement and test L-BFGS. See Bottou et al section 6.2 and references [97,113].
Asynchrony as momentum: "Asynchrony begets Momentum, with an Application to Deep Learning" by Mitliagkas et al., arXiv.
"Deep learning with Elastic Averaging SGD" by Zhang et al., arXiv.
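For the inexact Newton project above, a minimal sketch of the idea (the least-squares test objective and all parameter values are illustrative assumptions): the Newton system ∇²F(w) p = -∇F(w) is solved approximately with a conjugate gradient method truncated after at most i ≤ d steps, using only Hessian-vector products and never inverting the Hessian.

import numpy as np

def truncated_cg(hess_vec, b, max_iter):
    # approximately solve H p = b with at most max_iter conjugate gradient steps
    p = np.zeros_like(b)
    r = b.copy()                                 # residual b - H p (p = 0 initially)
    if r @ r == 0.0:
        return p                                 # zero right-hand side: nothing to solve
    q = r.copy()                                 # search direction
    rs_old = r @ r
    for _ in range(max_iter):
        Hq = hess_vec(q)
        step = rs_old / (q @ Hq)
        p += step * q
        r -= step * Hq
        rs_new = r @ r
        if np.sqrt(rs_new) < 1e-10:              # early exit if the system is solved
            break
        q = r + (rs_new / rs_old) * q
        rs_old = rs_new
    return p

rng = np.random.default_rng(4)
n, d = 200, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)

def grad(w):
    return X.T @ (X @ w - y) / n

def hess_vec(v):
    # Hessian-vector product: the Hessian X.T X / n is never formed explicitly
    return X.T @ (X @ v) / n

w = np.zeros(d)
for _ in range(5):                               # a few inexact Newton iterations
    w += truncated_cg(hess_vec, -grad(w), max_iter=d // 2)   # CG stopped after i ≤ d steps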