{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to AI: assignment 4 - SVM classifiers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Support vector machines have promoted one of the most radical advances in machine learning classification and prediction. In this assignment you are going to explore several types of SVMs to analyze the various types of mononuclear cells in the blood sample of a healthy person. Easier said than done!\n", "\n", "### Motivation\n", "For decades, gene expression measurements had been done by averaging over a whole population of cells, thus revealing average activities but hiding minority behaviors. Single-cell transcriptomics (SCT) is a recent technique able to examine the gene expression level of individual cells in a given population by simultaneously measuring the messenger RNA (mRNA) concentration of hundreds to thousands of genes. For each of the thousands of cells analyzed, SCT tells how frequently some hundreds of its genes get transcribed in mRNA, thus revealing which chemical pathways are active and which others are dormant or inactive. Such very detailed pedigree can be used for example to distinguish different types of cells in a blood sample: **is it possible to tell apart different types of cells in a blood sample using SCT?**\n", "\n", "### Resources\n", "The database has been adapted from an open source data collection by 10x Genomics 1: it is a sample of peripheral blood mononuclear cells (PBMC) from a healthy patient. PBMCs consist of lymphocytes and monocytes, and are thus essential for research in immunology, especially in autoimmune studies and vaccine development. The dataset consists of 2638 cells for which the expression of 207 genes has been monitored. The dataset also includes a column (\"louvain\") where the cell type is specified.\n", "\n", "### Aim\n", "Our aim is to correctly classify the different types of cells.\n", "\n", "1Download link: [PBMC3k](https://support.10xgenomics.com/single-cell-gene-expression/datasets/1.1.0/pbmc3k?)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A. Dimensional reduction and preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each cell is described by the relative expression of 207 of its genes, which makes for a 207-dimensional representation of this dataset. First of all, we want to see whether a smart 2-dimensional representation can suffice to have a pretty good idea of how the data is structured. In the following, we will apply the UMAP2 method. \n", "\n", "2 Leland McInnes, John Healy, James Melville, *UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction* (2018) [arxiv.org/abs/1802.03426](arxiv.org/abs/1802.03426)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import pandas, read the .csv dataset and display it " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last column of the dataset contains the correct classification for each one of the cells. Print a statistics on their types" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before using UMAP, let's scale the data so that the ranges of each feature look similar (this is always a very useful thing to do when using any machine learning algorithm!). Create a matrix $\\mathbf{X}$ containing all input features of each data point, then use the `fit_transform` method of the `StandardScaler` class in `sklearn.preprocessing` in order to obtain a new scaled matrix $\\mathbf{X_{s}}$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "****IMPORTANT****\n", "You will probably have to download the `umap` package via `pip` or `conda`. In case you get installation errors, run this notebook on *Google Colab*: `https://colab.research.google.com/`.\n", "\n", "If you do, uncomment the first line of the next codeblock in order to tell Colab to install the `umap` package." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "#%pip install umap\n", "import umap\n", "from sklearn.preprocessing import StandardScaler\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's use UMAP: import the `umap` package and define an instance of the `UMAP` class. Use its `fit_transform` method for generating a 2D embedding." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Represent the data points in the new 2D embedding space: make a scatterplot with `matplotlib`. Color each type of cell differently and show an appropriate legend. Does it look like a viable representation? Is the resulting 207->2 dimensionality reduction appropriate for our purposes?" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Hard margin linear SVM " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some pairs of clusters can be easily told apart: look for example at `NK cells` and `B cells`. What would be a good hard threshold for distinguishing between them?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Stack the feature matrices in $\\mathbf{X}$ and outputs in $\\mathbf{y}$ for cell types `NK cells` and `B cells`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the `svm` model from `sklearn`" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "from sklearn import svm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define hard margin linear SVM model. For now, we will only need a `linear` kernel. All SVM methods are actually soft-margin: this means we have to force the method to reproduce a hard margin SVM by tweaking the `C` parameter. What would be a good value for `C` in this case? " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit on training data, evaluate the corresponding predictions, calculate and print the accuracy (you can use the `accuracy_score` function from `sklearn.metrics`)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the method `decision_function` from the SVM model and pyplot’s command `contour`, draw the border line between the two predicted classes and 2 dashed lines corresponding to the 2 borders of the margin. \n", "\n", "Draw the scatter plot with the features on top of the contours and indicate with circles which data points are sup- port vectors. The support vectors can be found in the attribute `support_vectors_` of the fitted SVM model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A hard margin SVM would not be suited for non-linearly separable clusters: even very slight deviations from linear separability can drastically compromise its predictiveness. Take for example the `FCGRA3+ Monocytes` and `CD8 T cells` and repeat the analysis. How is the accuracy? Is there any specific data point leading the predicted threshold? In this case, is the accuracy enough to conclude the quality of the training?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Soft margin linear SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's try to apply a softer margin to the case `FCGR3A+ Monocytes` - `CD8 T cells` seen above. Define an SVM model, this time with `C=1` (allowing for a much softer margin), and repeat the steps you have taken in the previous section. Is the accuracy substantially better? Does the threshold look more convincing to you?" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 4. Effect of $C$ in soft margin linear SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A yet more difficult case is telling `CD4 T cells` and `CD8 T cells` apart. Repeat the steps for this pair and allow yourself to change the magnitude of the `C` parameter. What do you observe?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Overall, as $C$ increases the margin reduces and the accuracy increases." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 5. Kernel (soft margin) SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now consider the full dataset and we want to have a good predictor for discriminating `CD8 T cells` from *any* other cell type. We soon realize no linear threshold is going to work in this case: we need a nonlinear approach.\n", "\n", "For this, we will use a *kernel SVM*. Let's stack in the input matrix the `CD8 T cells` and `non-CD8 T cells` classes, change the kernel of the method to `rbf` and repeat the usual steps. The interesting parameters are now two: $C$ and $\\gamma$. Test different values of $\\gamma$: what happens? Which one gives the best accuracy? Which one gives the most appropriate model?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multi-class classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Application of one vs. one and one vs. rest multi-class classification of three different clusters. Linear SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now work with more than two clusters and try to correctly classify three types of cells that appear as contiguous point clouds in our 2D embedding space: `CD4 T cells`, `CD8 T cells`, and `NK cells`. Stack the feature matrices in $\\mathbf{X}$ and outputs in $\\mathbf{y}$." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Data visualization" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will test one vs. one (OVO) and one vs. rest (OVR) strategies using a soft margin linear SVM for the underlying binary classification. Look into `sklearn` documentation to see how to apply these strategies with SVM. For the binary soft SVM classifier you can use parameter value `C = 1`.\n", "Evaluate the accuracy of both strategies and draw with different colors the corresponding predicted classes regions behind a scatterplot of the data. To draw the different predicted classes regions as different background colors you can use the function `pcolormesh` from `matplotlib.pyplot`, in a similar way you've already done in the previous labworks.\n", "Is there a difference in performance between the strategies? Why?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One vs. one linear SVM classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the one vs. one classifier and set up the correct parameters" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "from sklearn.multiclass import OneVsOneClassifier\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit on training data, evaluate the corresponding predictions, calculate and print the accuracy " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the predicted class regions" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One vs. rest linear SVM classifier" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeat the same steps with the OVR classifier" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.multiclass import OneVsRestClassifier\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OVO strategy has the best performance. For the OVR strategy to work well with a linear classifier, each class has \n", "to be linearly separable from the union of the other classes. This is not the case here for the observations `CD8 T cells`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Application of one vs. one and one vs. rest multi-class classification. Kernel SVM" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's repeat the analysis, this time with a kernel SVM. Is there any noticeable difference?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One vs. one linear SVM classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One vs. rest linear SVM classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is no significant difference in performance between the two strategies.\n", "\n", "In this case, the observations of `CD8 T cells` can be easily separated from the observations of the union of the two other classes with a non linear classifier, therefore the OVR strategy does not suffer from the issue observed for the linear classifiers." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.8" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 1 }