{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to machine learning: assignment 3 - Logistic regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this assignment you are going to apply logistic regression methods to analyze a famous dataset on breast cancer prediction1: *breast_cancer.tsv*.\n", "\n", "### Motivation\n", "After skin cancer, breast cancer is the most common cancer diagnosed in women in western countries. Telling benign and malignant breast cancer apart is thus of paramount importance, yet very challenging: although there are combinations of cellular traits that characterize malignant cancer, there are no simple criteria for discriminating them exactly.\n", "**Is there a way to successfully predict whether a skin cancer is benign or malignant?**\n", "\n", "### Resources\n", "The database proposed in this labwork collects 30 features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The last column contains the corresponding diagnosis (benign=0, malignant=1).\n", "\n", "### Aim\n", "Our aim is to spatially separate (as much as possible) the distributions of benign and malignant breast cancers, and then fit a logistic regression able to predict if a cancer cell is benign or malign with maximum efficiency.\n", "\n", "1This is a copy of [UCI ML Breast Cancer Wisconsin (Diagnostic) datasets](https://goo.gl/U2Uwz2)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A. Loading and visualizing the dataset for two given classes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import numpy and pandas, read in the dataset and print it. \n", "Be careful: this time the file is tab-separated, not comma-separated. Find the appropriate function argument to correctly parse the file." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Verify the sizes of the features and outputs, check the feature names and their ranges (min and max of each feature)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Load the feature matrices $\\mathbf{X}_0$, $\\mathbf{X}_1$ and outputs $\\mathbf{y}_0$, $\\mathbf{y}_1$ for class 0 (benign) and class 1 (malignant) separately, and check their sizes" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now create a single feature matrix $\\mathbf{X}$ and output vector $\\mathbf{y}$ by concatenating the two matrices/vectors. Why is such a procedure useful?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's choose any two features - for example feature 1 and 4 - and try to plot them, coloring benign and malignant entries in two different colors. To do so, import pyplot (don't forget the magic \"inline\" command), and then use *scatter* to plot the two clouds. How do they look?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Working with so many different features is hard: we need a way to visualize the data. Let's try with the most standard dimensionality reduction technique: principal component analysis (PCA).\n", "Load the PCA package from sklearn, initialize a PCA object (set the dimensionality to 2!), then apply the *fit* and *transform* methods to obtain a matrix with only 2 columns (check it)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scatter plot the feature matrix (one feature for each axis), and color in two different colors the benign and malignant entries. Is it getting any better?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we will see in the next labwork, machine learning techniques largely benefit from range standardization (i.e. transform your feature variables in order to have roughly the same range for each variable). The *preprocessing* module in sklearn contains the class *StandardScaler*. Use the *fit* method from this class for scaling the feature matrix. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Repeat the 2D PCA analysis and plot it. What do you find? Have we eventually managed to have two distributions that can be easily told apart?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## B. Logistic regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import logistic regression from Scikit-learn (*linear_model*)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a logistic regression model. There are two important attributes to consider: \"*penalty*\" defines the type of regularization of the model. We will keep the default l2 regularization here. \"*C*\" (positive float) defines the strength of the regularization. The smaller C, the stronger the regularization. Let's try with C=1e6 (a fairly small C = strong regularization).\n", "\n", "*The logistic regression model implemented in Scikit-learn includes a regularization term on the l2-norm of β′ in order to avoid diverging solutions when solving the optimization problem corresponding to parameter learning. The parameter C divides this regularization term, so if you want to approximately test the standard logistic regression method (without regularization), you should set C to a high value (default is C=1.0).*" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit the logistic regression model to the 2D features, predict on the training data and compare with the training output labels." 
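, "\n", "\n", "A minimal sketch of this step is given below. The names are assumptions: `log_reg` is the logistic regression model defined above, `X_2d` the scaled 2D PCA feature matrix, and `y` the label vector.\n", "\n", "```python\n", "# Sketch: fit on the training data, predict, and compare with the true labels.\n", "# Assumes log_reg = LogisticRegression(C=1e6), X_2d the (n_samples, 2) PCA features, y the 0/1 labels.\n", "log_reg.fit(X_2d, y)\n", "y_pred = log_reg.predict(X_2d)\n", "print(y_pred[:10])  # first few predicted labels\n", "print(y[:10])       # corresponding true labels\n", "print('misclassified:', int((y_pred != y).sum()), 'out of', len(y))\n", "```"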
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the accuracy of the prediction on the training data with the fitted logistic regression model. Print the result. Does it correspond to the expected result?\n", "\n", "To evaluate the accuracy, use the *accuracy_score* function from *sklearn.metrics* (from now on)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By using the method *predict_proba* from logistic regression model and pyplot’s command *contour*, draw the separating line. \n", "\n", "The way to do it is simple: take the extrema of the domain of X (the min and max values of each feature), and make a grid covering all the domain. Evaluate the probability of each point (coordinate vector) of this grid with *predict_proba*. Once you have a grid of probabilities, you can use the *contour* command to plot the p = 0.5 separator.\n", "\n", "Draw the scatter plot with the observations on the top of the separating line." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Display the confusion matrix for this classifier when it is applied to the training data. A function for evaluating the confusion matrix\n", "confusion matrix can be found in the module metrics from sklearn ." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## C. Different probability thresholds" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Draw in the same picture the scatter plot of the dataset,the separating line from the previous exercise and the separating line obtained with threshold p = 0.2.\n", "Generate the confusion matrix for the classifier obtained with p = 0.2 and comment on the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the confusion matrix for p = 0.2" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Redo the previous item for p = 0.8 et p = 0.99." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the confusion matrices for p = 0.8 and p = 0.99" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## D. Polynomial logistic regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we didn't have that much information? Redefine the matrix X by only considering the first 3 features, redo the PCA as you did before (don't forget the scaling), and plot the scatter plot. How does it look? Make a linear regression and print the accuracy." 
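, "\n", "\n", "One possible sketch of the whole pipeline on the reduced feature set is given below. The names `X` and `y` are assumptions for the full feature matrix and label vector from section A, and the value C=1e6 simply mirrors the one used earlier.\n", "\n", "```python\n", "# Sketch: redo scaling + 2D PCA + logistic regression using only the first 3 features.\n", "import matplotlib.pyplot as plt\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.decomposition import PCA\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.metrics import accuracy_score\n", "\n", "X3 = X[:, :3]  # keep only the first 3 original features\n", "X3_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X3))\n", "\n", "plt.scatter(X3_2d[y == 0, 0], X3_2d[y == 0, 1], label='benign', alpha=0.6)\n", "plt.scatter(X3_2d[y == 1, 0], X3_2d[y == 1, 1], label='malignant', alpha=0.6)\n", "plt.legend()\n", "plt.show()\n", "\n", "clf = LogisticRegression(C=1e6).fit(X3_2d, y)\n", "print('training accuracy:', accuracy_score(y, clf.predict(X3_2d)))\n", "```"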
] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can increase the accuracy by fitting a polynomial regression instead of a linear one. We will perform a trick that will be of great use in the following labworks: instead of switching from a linear to a polynomial regression, we will keep performing a linear regression, but applied on a transformed feature matrix. In order to pretend to fit an Nth-degree polynomial function, the first thing you have to do is thus defining a transformed matrix that contains all interaction terms between features up to the Nth degree. For example, for 2 features a and b, the 2nd degree interaction matrix will have the columns [a, b, a^2, ab, b^2]. Here, let's consider N=3.\n", "\n", "Import the *PolynomialFeatures* class from *sklearn.preprocessing*, and define a polynomial transformation of degree 3. Make sure the new feature matrix has the right number of columns!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now define the usual linear model. Limit the number of iterations to 1000. Then, use it to fit the new matrix, and calculate the accuracy. Is it better than the one of the simple linear regression?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Lastly, plot the separator p = 0.5. How is this line expected to look?\n", "In order to plot this line, we replicate the trick: this time, instead of transforming the matrix X, we will transform the usual grid before using *predict_proba*, then we will generate the contour as usual." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Display the confusion matrix" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Do higher-degree polynomials fit even better? Make an analysis until degree = 20 and plot a degree vs. accuracy graph. What do you observe? What could be a good explanation for this effect? (Try to plot a high-degree case to confirm your hypothesis)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" }, "toc": { "colors": { "hover_highlight": "#DAA520", "running_highlight": "#FF0000", "selected_highlight": "#FFD700" }, "moveMenuLeft": true, "nav_menu": { "height": "12px", "width": "252px" }, "navigate_menu": true, "number_sections": true, "sideBar": true, "threshold": 4, "toc_cell": false, "toc_section_display": "block", "toc_window_display": false, "widenNotebook": false } }, "nbformat": 4, "nbformat_minor": 1 }