For Data Science students: 20% classwork (a 10-minute test at every lesson; only the 4 best marks will be considered), 25% individual/pair project to be delivered by March 14th, 40% theoretical exam, 15% lab evaluation.
Lessons will be from 9.00 to 12.15.
First lesson (Giovanni Neglia, January 19, online): introduction to the course,
introduction to ML optimization (empirical risk vs expected risk, training/validation/test sets). Sections 1-3.1 of Bottou et al.
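The distinction between empirical and expected risk, and the role of held-out data, can be illustrated with a small least-squares sketch (the data, sizes, and split are illustrative assumptions, not course material):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy regression data; the expected risk is an average over this distribution.
X = rng.normal(size=(300, 2))
y = X @ np.array([1.0, -2.0]) + 0.3 * rng.normal(size=300)

# Training/validation/test split (sizes are illustrative).
X_tr, y_tr = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_te, y_te = X[250:], y[250:]

def empirical_risk(Xs, ys, w):
    # Average squared loss over a finite sample.
    return float(np.mean((Xs @ w - ys) ** 2))

# Empirical risk minimizer for the squared loss (ordinary least squares).
w_hat = np.linalg.lstsq(X_tr, y_tr, rcond=None)[0]

train_risk = empirical_risk(X_tr, y_tr, w_hat)  # optimistically biased
test_risk = empirical_risk(X_te, y_te, w_hat)   # estimates the expected risk
```

The training risk is what the optimizer minimizes; the test risk on held-out data is the honest estimate of the expected risk.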
Second lesson (Giovanni Neglia, January 26, online):
math refresher (gradient, Hessian, convex sets and functions),
presentation of full batch gradient and stochastic gradient methods,
why stochastic gradient descent (SGD) may outperform batch gradient (qualitative explanation, time to minimize the empirical error), overview of noise reduction and second order methods, mini-batch methods. Section 4.3 of Goodfellow et al, Sections 3.2, 3.3, introduction to Section 4 of Bottou et al.
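A minimal sketch of the per-iteration cost difference between full-batch gradient descent and SGD, on an illustrative least-squares problem (all names, sizes, and constants are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def full_gradient(w):
    # One iteration touches all n samples: cost O(n * d).
    return X.T @ (X @ w - y) / n

def stochastic_gradient(w):
    # One iteration touches a single random sample: cost O(d).
    i = rng.integers(n)
    return X[i] * (X[i] @ w - y[i])

# For the same budget measured in sample accesses, SGD performs
# n times more parameter updates than full-batch descent.
w = np.zeros(d)
for _ in range(500):
    w -= 0.01 * stochastic_gradient(w)
error = float(np.linalg.norm(w - w_true))
```

Here 500 single-sample updates cost fewer sample accesses than a single full-batch iteration, yet already bring the iterate close to the minimizer.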
Third lesson (Tareq Si Salem, February 2, online): convergence results (expected decrease after one iteration), definition of strong convexity, convergence results for stochastic gradient methods on strongly convex functions (with both constant and decreasing learning rates). Sections 4.1 and 4.2 of Bottou et al.
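The qualitative difference between constant and decreasing learning rates on a strongly convex objective can be seen in a one-dimensional toy experiment (all constants and the noise model are illustrative):

```python
import numpy as np

def sgd_1d(lr_schedule, steps=20000, noise=1.0, seed=1):
    """SGD on the strongly convex f(w) = w^2/2 with noisy
    gradient estimates g_k = w_k + noise * z_k."""
    rng = np.random.default_rng(seed)
    w, sq_errors = 5.0, []
    for k in range(steps):
        g = w + noise * rng.normal()   # unbiased gradient estimate
        w -= lr_schedule(k) * g
        sq_errors.append(w * w)
    # Average squared distance to the optimum over the last quarter of the run.
    return float(np.mean(sq_errors[-steps // 4:]))

err_const = sgd_1d(lambda k: 0.1)             # constant step: stalls at a noise floor
err_decay = sgd_1d(lambda k: 1.0 / (k + 10))  # O(1/k) step: error keeps decreasing
```

With a constant step the iterates plateau at a noise-dependent floor, while the O(1/k) schedule keeps driving the expected squared error down, matching the theory covered in this lesson.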
Fourth lesson (Giovanni Neglia, February 9, online): the role of the condition number, trade-offs of mini-batch, convergence results of stochastic gradient methods for non-convex functions, noise reduction methods (dynamic sample size, gradient aggregation, iterate averaging), overview of second-order methods. Sections 4.2, 4.3, 4.5, 6 of Bottou et al.
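As one well-known example of a gradient aggregation method, here is a sketch of SVRG-style variance reduction, where a full gradient is periodically recomputed at a snapshot point (this is one possible variant; all parameters are illustrative):

```python
import numpy as np

def svrg(grads, w0, lr=0.02, epochs=30, inner=None, seed=0):
    """SVRG sketch: use a full gradient at a periodic snapshot to
    reduce the variance of single-sample stochastic gradients."""
    rng = np.random.default_rng(seed)
    n = len(grads)
    inner = inner or n
    w = np.asarray(w0, dtype=float)
    for _ in range(epochs):
        w_snap = w.copy()
        full = sum(g(w_snap) for g in grads) / n   # full gradient at snapshot
        for _ in range(inner):
            i = rng.integers(n)
            # Variance-reduced estimate: unbiased, with variance that
            # vanishes as w and w_snap approach the optimum.
            v = grads[i](w) - grads[i](w_snap) + full
            w = w - lr * v
    return w
```

Unlike plain SGD with a constant step, this estimator's variance shrinks near the optimum, which is what enables linear convergence on strongly convex problems.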
Fifth lesson (Giovanni Neglia, February 16, online): trade-offs of mini-batch, noise reduction methods (dynamic sample size, gradient aggregation, iterate averaging), other optimization methods (momentum, Nesterov acceleration, coordinate descent). Sections 5 and 7 of Bottou et al.
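The momentum and Nesterov updates mentioned above can be sketched as follows (parameter values are illustrative defaults, not prescriptions):

```python
import numpy as np

def heavy_ball(grad, w0, lr=0.1, beta=0.9, steps=100):
    # Polyak (heavy-ball) momentum: v <- beta*v - lr*grad(w); w <- w + v.
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w = w + v
    return w

def nesterov(grad, w0, lr=0.1, beta=0.9, steps=100):
    # Nesterov: same recursion, but the gradient is evaluated
    # at the look-ahead point w + beta*v.
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w + beta * v)
        w = w + v
    return w
```

The only difference is where the gradient is evaluated; the look-ahead evaluation is what gives Nesterov's method its better worst-case rate on convex problems.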
Sixth lesson (Tareq Si Salem, February 23, online): optimization for neural networks (back-propagation, critical points, learning rate tuning, AdaGrad, RMSProp, Adam, weight initialization, batch normalization), minibatch size and communication delay, data vs model parallelism, synchronous parameter server, ring all-reduce. Sections 5 and 7 of Bottou et al, Sections 6.5.1-6.5.8 and 8.1-8.5 of Goodfellow et al, and the paper by Li et al.
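The Adam update covered in this lesson combines momentum-style first-moment estimates with RMSProp-style second-moment scaling; a sketch with the commonly published default hyperparameters (the test objective below is illustrative):

```python
import numpy as np

def adam(grad, w0, lr=0.01, b1=0.9, b2=0.999, eps=1e-8, steps=1000):
    # Adam: bias-corrected exponential moving averages of the
    # gradient (m) and of its elementwise square (v).
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)
    v = np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g         # first moment estimate
        v = b2 * v + (1 - b2) * g * g     # second moment estimate
        m_hat = m / (1 - b1 ** t)         # bias correction
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```

The elementwise scaling by sqrt(v_hat) gives each coordinate its own effective learning rate, which is the common thread running through AdaGrad, RMSProp, and Adam.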
During these practical sessions students will have the opportunity to train ML models in a distributed way on the Inria scientific cluster.
Sessions are organized by Chuan Xu and Othmane Marfoq.
Students need to carry out some administrative/configuration steps before the start of the labs. Labs website.
Participation in the labs will be graded by the teachers.
The project is an opportunity for the student to actively use the material taught in the course.
Students are free to choose the goal of their project, but are invited to discuss it with the teacher.
Possible goals are
reproduce an experimental result in a paper,
design or perform an experiment to support/refute a statement in a paper,
apply some of the optimization algorithms described in the course to a specific problem the student is interested in (e.g. for another course, his/her final project, etc.),
compare different algorithms,
implement an algorithm in a distributed system (PyTorch, Spark, TensorFlow, …).
It is possible for a pair of students to carry out the project together.
A list of possible projects is provided below.
The student(s) will provide
1) a 4-page report formatted according to the ICLR template, with unlimited additional pages for the bibliography and optional appendices (of unlimited length) containing proofs, descriptions of code, or additional experiments,
2) code developed,
3) a readme file containing instructions to run the code and reproduce the experiments in the report.
The report must clearly describe and motivate the goal of the project, provide any relevant background and explain the original contribution of the student.
What is explained in the course can be considered general knowledge and should not be repeated in the report.
The code must be well commented.
The student will make the material above available online in a zipped folder named after the student, and will send the link to the teacher.
Each student carrying out the project together with another one must send an email to the teacher indicating a percentage estimate of his/her contribution to the project (e.g. "I estimate my contribution to have been 60%").
The mark will take into account: originality of the project, presentation quality, technical correctness, and task difficulty. More will be expected from projects carried out by a pair of students.
For projects carried out by a pair of students up to 20% of the final mark will depend on the self-evaluated contribution of the student.
Any form of plagiarism will lead to a reduction of the final mark.
Failure to respect the submission rules indicated above (e.g. report exceeding the maximum length, report sent as an attachment) will lead to a reduction of the final mark.
Ideas for possible projects
The list is in random order and will be extended during the course.
Compare stochastic gradient descent and gradient aggregation methods in different regimes of condition-number versus dataset-size. See Bottou et al, section 5.3.3.
Survey and compare iterate averaging methods. See Bottou et al, section 5.4.
Survey the dynamic sampling techniques described in references [28,73] of Bottou et al (section 5.2.1). Implement and test at least one of them (including the basic one described in section 5.2.1).
Implement the (inexact) Newton method without inverting the Hessian, but using the conjugate gradient method. Evaluate the effect of stopping the conjugate method after i ≤ d steps, where d is the dimension of the parameter vector. See Bottou et al section 6.1 and Bubeck section 2.4.
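A possible starting point for this project: the Newton system H p = -g can be solved with the conjugate gradient method using only Hessian-vector products, and truncated after a chosen number of steps (names and tolerances below are illustrative):

```python
import numpy as np

def conjugate_gradient(hess_vec, b, max_iter, tol=1e-10):
    """Approximately solve H x = b using only Hessian-vector products."""
    x = np.zeros_like(b)
    r = b.copy()          # residual b - H x (x starts at zero)
    p = r.copy()          # search direction
    for _ in range(max_iter):
        Hp = hess_vec(p)
        alpha = (r @ r) / (p @ Hp)
        x = x + alpha * p
        r_new = r - alpha * Hp
        if np.linalg.norm(r_new) < tol:
            break
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
    return x

def newton_cg_step(grad, hess_vec, w, cg_iters):
    # Inexact Newton direction: H p = -g, with at most cg_iters CG steps.
    p = conjugate_gradient(hess_vec, -grad(w), cg_iters)
    return w + p
```

With i = d CG steps the direction is (up to round-off) the exact Newton step; stopping earlier trades accuracy of the direction for computation, which is exactly the effect the project asks you to evaluate.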
Second-order techniques for neural network training. See Bottou et al section 6.1.2 and references [12,100] as a starting point. It is better to avoid this project if you are not familiar with the backpropagation method for neural networks.
Back to the classics: understand backpropagation from the original sources. Start from [134,135] in Bottou et al.
Survey, implement and test L-BFGS. See Bottou et al section 6.2 and references [97,113].
Asynchronicity as momentum: "Asynchrony begets Momentum, with an Application to Deep Learning" by Mitliagkas et al, arXiv.
"Deep learning with Elastic Averaging SGD" by Zhang et al, arXiv.
Simulate a parameter server (PS) with backup workers, introducing some communication constraints between the workers and the PS (e.g. in the form of a queue).
Consider workers that are nodes in a graph with communication delays. Compare the convergence speed of consensus methods with that of a parameter server (PS) approach where the PS is the root of a tree.
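As a starting point for the parameter-server ideas above, here is a toy simulation of a synchronous PS with backup workers; the exponential delay model and every parameter are illustrative assumptions, not part of the project specification:

```python
import numpy as np

def simulate_ps(n_workers=10, backups=2, rounds=100, seed=0):
    """Synchronous PS that, each round, waits only for the fastest
    n_workers - backups gradients and drops the stragglers."""
    rng = np.random.default_rng(seed)
    wait = n_workers - backups
    total_time, used_grads = 0.0, 0
    for _ in range(rounds):
        # Per-worker compute + communication delay (exponential, illustrative).
        delays = rng.exponential(1.0, size=n_workers)
        # The round ends when the wait-th fastest gradient arrives.
        total_time += float(np.sort(delays)[wait - 1])
        used_grads += wait
    return total_time, used_grads
```

Comparing runs with and without backups exposes the core trade-off: dropping stragglers shortens each round but discards gradient work, and the project could quantify this under different delay or queueing models.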