Description
Broadly speaking, Machine learning (ML) is the scientific field aiming at building models and inferring knowledge by applying algorithms to data. Therefore, the process involves the (statistical) analysis of data, and the design of models, possibly predictive. These tasks are fundamental ones in modern science in general, biology and medicine in particular. This course will develop an introduction to ML by reviewing the fundamental principles and methods.
In order to guarantee the same theoretical backgroud, frontal lectures will be the same for DS4H, SPECTRUM and LIFE students. Each lecture will be accompanied by a handson practical (in Python3), during which LIFE students will work with datasets of biological and/or medical importance. The hybrid nature of this course provides a unique opportunity to assess the performances of the various methods studied (running time, stability, sensitivity to noise/outliers, etc), and to think critically about the quality of models in biology and medicine.
The datasets used during the practicals will cover the main classes of data used in modern biology, at all scales (individual molecules, cells, organs, individuals), and will reference current, realworld problems taken directly from scientific studies.
Close
Outline
The course consists of the following eight lectures (2h + 2h practicals each):
1. Machine Learning: Models and Data
2. Regression with linear models
3. Classification and Logistic Regression
4. Support Vector Machines
5. Bayesian Methods: Naive Bayes and Linear Discriminant Analysis
6. Decision Trees and Random Forests
7. Clustering
8. Dimensionality reduction
Close
Class 1
Machine Learning: Models and Data
Summary
Artificial Intelligence is now part of our daytoday life, and its range of applications does not seem to stop extending. But how can a computer program learn, and how is machine learning useful for solving vast classes of computational problems? This lecture introduces the fundamental actors in Machine Learning: data and learning algorithms.
 Data:
 Features
 Input and output
 Quantitative vs. qualitative data
 Structured vs. nonstructured data
 Algorithms:
 Supervised vs. unsupervised
 Regression vs. classification
 Reduction vs. clustering
 Pipelines
Although formally presented as distinct subjects, one should view these two concept as deeply intertwined, each one strongly affecting the other
An introduction to the main Python packages and Jupyter notebooks is given, along with an exhaustive list of bibliographic references
Practical
The first practical is a Python3 refresher and an introduction to the use of Jupyter notebooks.
Assignment
Close
Class 2
Regression with linear models
Summary
Regression is used every time we want to estimate the relationship between a dependent variable and one or more independent variables (or features). From a ML point of view, we are thus interested in methodologies for predicting a response value from a certain number of input variables.
This lesson will cover linar regression models, and in particular:
 Simple linear prediction and least squares
 Linear regression model
 Least squares criterion
 Model adequacy for simple linear regression
 Mean squared error
 Variance analysis
 Relative criterion
 Residuals analysis
 Finding and handling outliers with leverage analysis
 Multivariate linear regression
 Polynomial trends
 Finding optimal parameters
 Hyperplanes
 Regression in interaction models
 Model adequacy for multivariate linear regression
 MSE and RMSE in multiple dimensions
 Colinearity test
Practical
We will apply the acquired knowledge on linear regression for investigating the relationship between genome length, number of coded proteins, and extension of coding/noncoding parts in a very diverse set of organisms, from animals to bacteria, viruses and considering even mitochondrial genetic material. The student will be guided through progressively more complex analyses, and will have to learn to use the fundamental linear regression methods and efficiently visualize the results.
Jupyter notebook  LIFE
Close
Class 3
Classification and Logistic Regression
Summary
Logistic regression is a class of supervised classification algorithms used to model the probability of an observation to belong to one among a given set of classes. We will explore the following subjects:
 Linear classifiers
 Binary linear classifiers
 Multiclass linear classifiers
 Logistic regression
 The logistic function and its parameters
 Learning and predicting with logistic classifiers
 Nonlinear transformations
 Linear separability
 Kernel methods
 Model adequacy
 Accuracy
 The confusion matrix
 Regularized logistic regression
Practical
During this lesson we will work with one of the traditional AI dataset for medicine applications: the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. 30 features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass describe characteristics of the cell nuclei present in the image. The aim is to predict whether the cells have benign or malign cancer. The students will have to employ the logistic classificators studied in the course in order to maximize the accuracy of the prediction.
Jupyter notebook  LIFE
Data  LIFE
Close
Class 4
Support Vector Machines
Summary
SVMs are a popular and robust class of models for supervised classification that optimizes the position of hyperplanes in the space of input data so as to divide them in subspaces corresponding to the input classes. The following subjects will be detailed:
 Hardmargin SVM for binary prediction
 Formulation as constrained minimization
 Formulation as convex optimization
 Dual formulation
 Support vector calculation
 Softmargin SVM for binary prediction
 Convex optimization and dual formulation
 Study of the parameter C
 Nonlinear SVM
 Feature transformation
 Kernels
 Multiclass SVM
Practical
As the techniques studies in the course become more effective, realworld problems become more challenging. In this lesson we will work with data from single cell transcriptomics (SCT), where a set of cells from a blood sample have been analyzed, and are described using a certain number of feactures. Do SVMs manage to tell apart different types of cells?
Jupyter notebook  LIFE
Data  LIFE
Close
Class 5
Bayesian Methods: Naive Bayes and Linear Discriminant Analysis
Summary
Bayesian statistics can be employed for classification purposes as well. The most trivial application is called Naive Bayes classifier, and will be useful for introducing all the main characteristics of Bayesian analysis. Possibly the most fundamental concept to retain is the one of likelihood, both for its importance and its ubiquitous use. We will then also study a more advanced Bayesian algorithm called Linear Discriminiant Analysis, and we will see its points in common with other techniques we encountered.
 Introduction: probabilistic classification
 Naive Bayes
 Conditional probabilities and Bayes theorem
 Maximum a priori rule
 Naive Bayes algorithm
 Examples
 Discrete case: weather conditions for playing tennis
 Continuous case: monitoring the temperature of a system
 Linear Discriminant Analysis
 Case n=1: LDA proof
 Case n=1, L=2
 Case n>1: an example
 Model validation
 Crossvalidation
 Hyperparameter optimization
Practical
This lession will use synthetic data generated by the students. The practical will indeed focus not only on applying the two prediction algorithms, rather also on getting familiar with the use of crossvalidation testing and hyperparameter optimization. We thus need the data to be as flexible as possible, in order to try these validation techniques in different situations
Jupyter notebook
Close
Class 6
Decision Trees and Random Forests
Summary
Decision tree learning uses a tree data structure in order to classify sample data by probing their features with a sequence of threshold values. It is a very intelligible and simple class of methods, but a single decision tree usually does not provide an accurate enough classification. This lecture will thus also treat random forests, which are collections of differently initialized random trees.
 Decision trees: an example
 Hunt's algorithm
 Tree induction
 Best split
 Split based on nominal attributes
 Split based on ordinal attributes
 Split based on continuous attributes
 How to determine the best split
 Measuring impurity with Gini index
 Stopping criteria
 Practical issues and overfitting
 Random forest
Practicals
During this lesson we will work with another classical dataset for ML applied to biology: the Iris dataset, where different iris specimens of three different iris species must be correctly classified given their sepal and petal length and width. We will train and test a decision tree and a random forest, and calculate the accuracy of the two methods.
Jupyter notebook
Close
Class 7
Clustering
Summary
In a non supervised context, clustering aims at grouping the data in homogeneous groups by minimizing the intragroup variance. This fundamental task is surprisingly challenging due to several difficulties: the (generally) unknown number of clusters, clusters whose boundaries have a complex geometry, dealing with overlapping clusters (due to noise), dealing with high dimensional data, etc. This class will present two main clustering techniques: kmeans and hierarchical clustering
 Introduction to unsupervised learning
 Kmeans
 Kmeans algorithm
 Elbow method
 Silhouette method
 Advantages and drawbacks

 Kmedoids
 Hierarchical clustering
 Agglomerative algorithm
 Dendrograms
 Group average
 Ward's method
 Complexity

Practical
We will apply clustering methodologies for analyzing the structural differences of membrane proteins, and see whether we can infer something about their classification and their evolutionary history.
Jupyter notebook  LIFE
Data  LIFE
Close
Class 8
Dimensionality reduction
Summary
Dimensionality reduction methods aim at embedding highdimensional data into a lowerdimensional space, while preserving specific properties such as pairwise distances, the data spread, etc. Originating with the celebrated Principal Components Analysis method, recent methods have focused on data located on non linear spaces.
 The curse of dimensionality
 Dimensionality reduction
 Feature extraction
 Principal component analysis
 tSNE
 LDA for dimensionality reduction
Practical
We will try to do image classification with the help of dimentionality reduction techniques
Jupyter notebook
Close
Location
Close
Validation
Instructions for validating the class will be available later.
Close