Introduction to Artificial Intelligence

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Description

Broadly speaking, Machine learning (ML) is the scientific field aiming at building models and inferring knowledge by applying algorithms to data. Therefore, the process involves the (statistical) analysis of data, and the design of models, possibly predictive. These tasks are fundamental ones in modern science in general, biology and medicine in particular. This course will develop an introduction to ML by reviewing the fundamental principles and methods.

In order to guarantee the same theoretical backgroud, frontal lectures will be the same for DS4H, SPECTRUM and LIFE students. Each lecture will be accompanied by a hands-on practical (in Python3), during which LIFE students will work with datasets of biological and/or medical importance. The hybrid nature of this course provides a unique opportunity to assess the performances of the various methods studied (running time, stability, sensitivity to noise/outliers, etc), and to think critically about the quality of models in biology and medicine.

The datasets used during the practicals will cover the main classes of data used in modern biology, at all scales (individual molecules, cells, organs, individuals), and will reference current, real-world problems taken directly from scientific studies.

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Outline

The course consists of the following eight lectures (2h + 2h practicals each):
1. Machine Learning: Models and Data
2. Regression with linear models
3. Classification and Logistic Regression
4. Support Vector Machines
5. Bayesian Methods: Naive Bayes and Linear Discriminant Analysis
6. Decision Trees and Random Forests
7. Clustering
8. Dimensionality reduction

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 1

Machine Learning: Models and Data

Summary

Artificial Intelligence is now part of our day-to-day life, and its range of applications does not seem to stop extending. But how can a computer program learn, and how is machine learning useful for solving vast classes of computational problems? This lecture introduces the fundamental actors in Machine Learning: data and learning algorithms.

Data:
- Features
- Input and output
- Quantitative vs. qualitative data
- Structured vs. non-structured data
Algorithms:
- Supervised vs. unsupervised
- Regression vs. classification
- Reduction vs. clustering
- Pipelines

Although formally presented as distinct subjects, one should view these two concept as deeply intertwined, each one strongly affecting the other

An introduction to the main Python packages and Jupyter notebooks is given, along with an exhaustive list of bibliographic references

Practical

The first practical is a Python3 refresher and an introduction to the use of Jupyter notebooks.

Assignment

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 2

Regression with linear models

Summary

Regression is used every time we want to estimate the relationship between a dependent variable and one or more independent variables (or features). From a ML point of view, we are thus interested in methodologies for predicting a response value from a certain number of input variables.

This lesson will cover linar regression models, and in particular:

Simple linear prediction and least squares

Linear regression model
Least squares criterion

Model adequacy for simple linear regression

Mean squared error
Variance analysis
Relative criterion
Residuals analysis
Finding and handling outliers with leverage analysis

Multivariate linear regression

Polynomial trends
Finding optimal parameters
Hyperplanes
Regression in interaction models

Model adequacy for multivariate linear regression

MSE and RMSE in multiple dimensions
Colinearity test

Practical

We will apply the acquired knowledge on linear regression for investigating the relationship between genome length, number of coded proteins, and extension of coding/noncoding parts in a very diverse set of organisms, from animals to bacteria, viruses and considering even mitochondrial genetic material. The student will be guided through progressively more complex analyses, and will have to learn to use the fundamental linear regression methods and efficiently visualize the results.

Jupyter notebook - LIFE

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 3

Classification and Logistic Regression

Summary

Logistic regression is a class of supervised classification algorithms used to model the probability of an observation to belong to one among a given set of classes. We will explore the following subjects:

Linear classifiers

Binary linear classifiers
Multi-class linear classifiers

Logistic regression

The logistic function and its parameters
Learning and predicting with logistic classifiers

Nonlinear transformations

Linear separability
Kernel methods

Model adequacy

Accuracy
The confusion matrix

Regularized logistic regression

Practical

During this lesson we will work with one of the traditional AI dataset for medicine applications: the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset. 30 features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass describe characteristics of the cell nuclei present in the image. The aim is to predict whether the cells have benign or malign cancer. The students will have to employ the logistic classificators studied in the course in order to maximize the accuracy of the prediction.

Jupyter notebook - LIFE

Data - LIFE

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 4

Support Vector Machines

Summary

SVMs are a popular and robust class of models for supervised classification that optimizes the position of hyperplanes in the space of input data so as to divide them in subspaces corresponding to the input classes. The following subjects will be detailed:

Hard-margin SVM for binary prediction

Formulation as constrained minimization
Formulation as convex optimization
Dual formulation
Support vector calculation

Soft-margin SVM for binary prediction

Convex optimization and dual formulation
Study of the parameter C

Nonlinear SVM

Feature transformation
Kernels

Multi-class SVM

One vs one
One vs rest

Practical

As the techniques studies in the course become more effective, real-world problems become more challenging. In this lesson we will work with data from single cell transcriptomics (SCT), where a set of cells from a blood sample have been analyzed, and are described using a certain number of feactures. Do SVMs manage to tell apart different types of cells?

Jupyter notebook - LIFE

Data - LIFE

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 5

Bayesian Methods: Naive Bayes and Linear Discriminant Analysis

Summary

Bayesian statistics can be employed for classification purposes as well. The most trivial application is called Naive Bayes classifier, and will be useful for introducing all the main characteristics of Bayesian analysis. Possibly the most fundamental concept to retain is the one of likelihood, both for its importance and its ubiquitous use. We will then also study a more advanced Bayesian algorithm called Linear Discriminiant Analysis, and we will see its points in common with other techniques we encountered.

Introduction: probabilistic classification
Naive Bayes

Conditional probabilities and Bayes theorem
Maximum a priori rule
Naive Bayes algorithm

Examples

Discrete case: weather conditions for playing tennis
Continuous case: monitoring the temperature of a system

Linear Discriminant Analysis

Case n=1: LDA proof
Case n=1, L=2
Case n>1: an example

Model validation

Cross-validation
Hyperparameter optimization

Practical

This lession will use synthetic data generated by the students. The practical will indeed focus not only on applying the two prediction algorithms, rather also on getting familiar with the use of cross-validation testing and hyperparameter optimization. We thus need the data to be as flexible as possible, in order to try these validation techniques in different situations

Jupyter notebook

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 6

Decision Trees and Random Forests

Summary

Decision tree learning uses a tree data structure in order to classify sample data by probing their features with a sequence of threshold values. It is a very intelligible and simple class of methods, but a single decision tree usually does not provide an accurate enough classification. This lecture will thus also treat random forests, which are collections of differently initialized random trees.

Decision trees: an example
Hunt's algorithm
Tree induction
Best split

Split based on nominal attributes
Split based on ordinal attributes
Split based on continuous attributes
How to determine the best split

Measuring impurity with Gini index
Stopping criteria
Practical issues and overfitting
Random forest

Practicals

During this lesson we will work with another classical dataset for ML applied to biology: the Iris dataset, where different iris specimens of three different iris species must be correctly classified given their sepal and petal length and width. We will train and test a decision tree and a random forest, and calculate the accuracy of the two methods.

Jupyter notebook

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 7

Clustering

Summary

In a non supervised context, clustering aims at grouping the data in homogeneous groups by minimizing the intra-group variance. This fundamental task is surprisingly challenging due to several difficulties: the (generally) unknown number of clusters, clusters whose boundaries have a complex geometry, dealing with overlapping clusters (due to noise), dealing with high dimensional data, etc. This class will present two main clustering techniques: k-means and hierarchical clustering

Introduction to unsupervised learning
K-means

K-means algorithm
Elbow method
Silhouette method
Advantages and drawbacks

K-medoids
Hierarchical clustering
Agglomerative algorithm

Dendrograms
Group average
Ward's method
Complexity

Practical

We will apply clustering methodologies for analyzing the structural differences of membrane proteins, and see whether we can infer something about their classification and their evolutionary history.

Jupyter notebook - LIFE

Data - LIFE

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Class 8

Dimensionality reduction

Summary

Dimensionality reduction methods aim at embedding high-dimensional data into a lower-dimensional space, while preserving specific properties such as pairwise distances, the data spread, etc. Originating with the celebrated Principal Components Analysis method, recent methods have focused on data located on non linear spaces.

The curse of dimensionality
Dimensionality reduction
Feature extraction
Principal component analysis
t-SNE
LDA for dimensionality reduction

Practical

We will try to do image classification with the help of dimentionality reduction techniques

Jupyter notebook

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Location

Description
Outline
- Class 1
- Class 2
- Class 3
- Class 4
- Class 5
- Class 6
- Class 7
- Class 8
Validation
Location

Validation

Instructions for validating the class will be available later.

Introduction to Artificial Intelligence - EUR LIFE

Academic year: 2021-2022

Instructors: Michel Riveill, Rodrigo Cabral Farias, Lionel Fillatre, Ali Ballout, Hugo Schmutz, Edoardo Sarti

LIFE instructor: Edoardo Sarti

Description

Outline

Class 1

Class 2

Class 3

Class 4

Class 5

Class 6

Class 7

Class 8

Location

Validation