{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Your name" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "If your training error is low but your validation error is high, what is a possible cause? What could you do to improve your model?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2\n", "Why do we use gradient descent methods? How does stochastic gradient descent work?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Binary classifier for the Iris dataset\n", "Create a binary classifier to recognize the species Iris-Virginica." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn import datasets\n", "\n", "iris = datasets.load_iris()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Iris dataset contains the sepal and petal length and width of iris flowers from three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica.\n", "We check this below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.keys()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.target_names" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "iris.feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "iris['data'] contains the input features, and iris['target'] contains their class labels.\n", "\n", "How many samples are in the dataset?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We want to build a binary classifier that distinguishes Iris-Virginica from the other two species. We start by splitting the data into the input features X and the binary labels y (Virginica or not)." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "X = np.array(iris[\"data\"])\n", "# 1 for Iris-Virginica (class 2), 0 otherwise, as a single column\n", "y = (iris[\"target\"] == 2).astype(int).reshape(-1, 1)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fulldata = np.concatenate((X, y),axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now fulldata contains the input features in the first 4 columns and the label (0 or 1) in the 5th column." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Shuffle the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Split 'fulldata' into a training set (80%) and a test set (20%). 
Call the two parts 'train_set' and 'test_set'." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now separate the input features from the labels again." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "X_train = train_set[:,[0,1,2,3]]\n", "y_train = train_set[:,[4]].ravel()\n", "X_test = test_set[:,[0,1,2,3]]\n", "y_test = test_set[:,[4]].ravel()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Build a stochastic gradient descent classifier." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the classifier to predict the labels for samples 50 to 55. Compare the predictions with the actual values." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the accuracy of the classifier using cross-validation with 4 folds." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute the confusion matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute precision and recall directly from the matrix." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compute precision and recall using the dedicated functions precision_score and recall_score." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n", "\n" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "An SGDClassifier has many possible parameters like \n", "- 'penalty' may have values ‘none’, ‘l2’, ‘l1’, or ‘elasticnet’\n", "- 'alpha' is a constant that multiplies the regularization term. Defaults to 0.0001 \n", "- 'learning_rate' may have values \"constant\", \"adaptive\", \"optimal\" and \"invscaling\"\n", "- 'eta0' is the value of the initial learning rate \n", "- ...\n", "\n", "Use GridSearchCV to find a good parameter setting\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given your choice of the parameters to test, how many models have been tested in total?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are the best parameters found for the classifier?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "check accuracy, precision and recall of the best classifier" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Evaluate the same metrics for the best classifier on the test set" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to fill\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Are there significative differences in comparison to the same quantities evaluated on the train dataset? Can you explain the difference?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }