{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "63bd6c3d",
   "metadata": {},
   "source": [
    "# Clustering"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a36fbf12",
   "metadata": {},
   "source": [
    "### Motivation\n",
    "Membrane proteins are very mysterious: their shape and folding is rigidly constrained by the geometry of the lipid bilayer they are inserted in, and yet they carry out 50% of the functions in any type of cell. Slight structural changes, coupled with just the right changes in the biochemistry of the amino acid sequence, can give rise to extremely diverse behaviors. This is for sure the case with the very large group of [G protein-coupled receptors][https://en.wikipedia.org/wiki/G_protein-coupled_receptor], which couple with Guanine nucleotide-binding (G) proteins and have a distinct structural trademark: they all have exactly 7 transmembrane helices. Among the many protein families in this group, [Rhodopsin-like receptors][https://en.wikipedia.org/wiki/Rhodopsin-like_receptors] also share a similar active site. Their functions remain nonetheless very diverse: the targets of these receptors can be neuropeptides, neurotransmitters, and even light (like the Rhodopsin itself).\n",
    "The function and evolution of many of these proteins remain not well ascertained: in this labwork we will try to **analyze the proteins' structural differences and see whether we can infer something about their classification and their evolutionary history**.\n",
    "\n",
    "### Data\n",
    "Protein structures can be aligned (i.e. carefully and somehow flexibly superposed) by means of the many available structure alignment algorithms. Moreover, in order to appreciate the slight structural differences in this group, we will need a good measure for calculating the similarity between the aligned structures. We will work with data taken from [EncoMPASS - the Encyclopedia of Membrane Proteins Analyzed by Structure and Symmetry][https://encompass.ninds.nih.gov/]. The metric we will use is the TM-score, and the TM-score of a target protein aligned to a template structure is defined as\n",
    "\n",
    "$\\mathrm{TM-score} = \\max\\left[\\frac{1}{L_{\\mathrm{target}}}\\sum^{L_{\\mathrm{common}}}_i\\frac{1}{1 + \\left(\\frac{d_i}{d_0(L_{\\mathrm{target}})}\\right)^2}\\right]$\n",
    "\n",
    "Where $L_{\\mathrm{target}}$ is the length of the sequence of the target protein, and $L_{\\mathrm{common}}$ is the number of amino acids the two proteins have in common. A TM-score of 1 indicates a perfect alignment, whereas a TM-score of 0 a complete misalignment. A TM-score of >0.5 is a good indicator that the two structures are related to each other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1383968e",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:14:30.652382Z",
     "start_time": "2021-11-13T15:14:25.774122Z"
    }
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc15a068",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:14:38.398191Z",
     "start_time": "2021-11-13T15:14:30.660517Z"
    }
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e18d82e",
   "metadata": {},
   "source": [
    "## Learn about the data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "74669ec9",
   "metadata": {},
   "source": [
    "The file *rhodopsins.txt* contains all the TM-scores associated with any pair of rhodopsin-like proteins whose structure has been experimentally determined.\n",
    "The first technical problem we encounter is that the TM-score is not a distance: can you identify all the reasons why it isn't?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46c195d4",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "9d1ec6d0",
   "metadata": {},
   "source": [
    "Among the reasons why the TM-score cannot be used as a distance, there is one which is even more fundamental than the others: state it and add here the code for respecting this condition."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca6fc5b4",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9aa461f5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Reads the file\n",
    "dist = {}\n",
    "keys = set()\n",
    "with open('rhodopsinlike.txt') as f:\n",
    "    for line in f:\n",
    "        fields = line.split()\n",
    "        dist[(fields[0]+'_'+fields[1], fields[2]+'_'+fields[3])] = float(fields[8])\n",
    "        keys.add(fields[0]+'_'+fields[1])\n",
    "\n",
    "# Creates an ordered list of labels\n",
    "lkeys = sorted(list(keys))\n",
    "\n",
    "# Creates the TM-score matrix\n",
    "X = np.zeros((len(lkeys), len(lkeys)))\n",
    "for i1, k1 in enumerate(lkeys):\n",
    "    for q, k2 in enumerate(lkeys[i1+1:]):\n",
    "        i2 = q + i1 + 1\n",
    "        X[i1, i2] = dist[(k1, k2)]\n",
    "        X[i2, i1] = dist[(k1, k2)]\n",
    "\n",
    "'''Add your code here'''\n",
    "\n",
    "'''End of your code'''\n",
    "\n",
    "# Plot the \"distance\" matrix you have obtained\n",
    "import seaborn as sns\n",
    "\n",
    "plt.figure(figsize=(11,8.5))\n",
    "sns.heatmap(X)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66bc3c81",
   "metadata": {},
   "source": [
    "## Dimensionality reduction"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06309fa2",
   "metadata": {},
   "source": [
    "A large set of pairwise distances generally lies in a high-dimensional space <sup>1</sup>. In order to visualize it, we will need as usual to reduce the dimensionality, this time with an algorithm that can accept distances instead of coordinates: [Multidimensional scaling][https://en.wikipedia.org/wiki/Multidimensional_scaling].\n",
    "Remember: all the consideration we will make will be exact in the original, N-dimensional space! In 2D, they hopefully will not be too off, but they will for sure be an approximation.\n",
    "\n",
    "<sup>1</sup> *As an example, think about the simple case where you have three points, each of which is distant 1 from all the others. The only possible arrangement is when the points form an equilateral triangle. If now you add a fourth point which has to respect the same condition, 2 dimensions will not be enough! The only possible arrangement in 3 dimensions is the regular tetrahedron.*"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c8d049e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.manifold import MDS\n",
    "\n",
    "embedding = MDS(n_components=2, dissimilarity='precomputed', metric=True)\n",
    "X_2D = embedding.fit_transform(X)\n",
    "plt.scatter(X_2D[:,0], X_2D[:,1])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcbd3bcc",
   "metadata": {},
   "source": [
    "## Clustering with K-means\n",
    "\n",
    "Kmeans is a classical, workhorse clustering algorithm, and a common place to start. It assumes there are K centers and, starting from random guesses, algorithmically improves its guess about where the centers must be."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4193c157",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "Using the KMeans function of sklearn, propose a first clustering in 3 clusters with 25 runs\n",
    "\n",
    "To help you, you must use the parameters:\n",
    "    \n",
    "* `n_clusters`: The number of clusters to form as well as the number of centroids to generate.\n",
    "* `n_init`: Number of time the k-means algorithm will be run with different centroid seeds. The final results will be the best output of n_init consecutive runs in terms of inertia.\n",
    "* `random_state`: Determines random number generation for centroid initialization. Use an int to make the randomness deterministic.\n",
    "    \n",
    "Using the `cluster_centers` attribute, print the coodinate of each cluster centers.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f448bb3",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:19:57.346081Z",
     "start_time": "2021-11-13T15:19:57.279411Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "colors = ['red', 'blue', 'green', 'yellow', 'purple', 'orange', 'brown', 'black', '0.5', 'pink', 'cyan', 'magenta', 'beige', 'lightblue', 'lightgreen']\n",
    "\n",
    "''' TO DO '''\n",
    "struct_km = KMeans(n_clusters=,  # Complete here\n",
    "                   n_init=,    # Complete here\n",
    "                   random_state= # Complete here\n",
    "                  ).fit(X_2D)\n",
    "\n",
    "#struct_km.cluster_centers_\n",
    "color_labels = [colors[i] for i in struct_km.labels_]\n",
    "plt.scatter(X_2D[:,0], X_2D[:,1], c=color_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e79444f8",
   "metadata": {},
   "source": [
    "We now calculate the average of the points of each cluster to obtain the centroid found by K-means."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9991fce2",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:19:58.979798Z",
     "start_time": "2021-11-13T15:19:58.965679Z"
    }
   },
   "outputs": [],
   "source": [
    "# Calculate and print the centroids of the clusters\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7de1ec5b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:12:18.376369Z",
     "start_time": "2021-11-13T15:12:18.366326Z"
    }
   },
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "Do we get the same value as when looking at the cluster_centers_ attribute ?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3f177d3b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:19:59.997885Z",
     "start_time": "2021-11-13T15:19:59.989356Z"
    }
   },
   "outputs": [],
   "source": [
    "# Compare the centroids calculated previously with the ones of struct_km.cluster_centers_\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9eed4066",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T16:53:55.979093Z",
     "start_time": "2021-11-12T16:53:55.962526Z"
    }
   },
   "source": [
    "Plot the dataset with the cluster attributes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea00148d",
   "metadata": {},
   "outputs": [],
   "source": [
    "color_labels = [colors[i] for i in struct_km.labels_]\n",
    "plt.scatter(X_2D[:,0], X_2D[:,1], c=color_labels)\n",
    "for i, c in enumerate(centroids):\n",
    "    plt.scatter(c[0], c[1], marker='$%d$' % int(i), alpha=1,\n",
    "                    s=50, edgecolor='k')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ff8176bb",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T16:56:18.642444Z",
     "start_time": "2021-11-12T16:56:18.638926Z"
    }
   },
   "source": [
    "### Plot the distortions of K-means"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c890e44b",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T17:00:44.589115Z",
     "start_time": "2021-11-12T17:00:44.581598Z"
    }
   },
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "You can easily run K-Means with several run for a range of clusters using a for loop and collecting the distortions into a list.\n",
    "    \n",
    "You can collect the distortions using the `inertia_`attribute. `inertia_`is the sum of squared distances of samples to their closest cluster center, weighted by the sample weights if provided.\n",
    "    \n",
    "Then plots the distortions of K-means.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fda063e8",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:18:54.365189Z",
     "start_time": "2021-11-13T15:18:53.887647Z"
    }
   },
   "outputs": [],
   "source": [
    "distortions = []\n",
    "for k in range(1,11):\n",
    "    km = KMeans().fit(X_2D)  # Complete here\n",
    "    distortions.append()  # Complete here\n",
    "    \n",
    "plt.figure(figsize=(16,8))\n",
    "plt.plot(range(1,11), distortions, 'bx-')\n",
    "plt.xlabel('k')\n",
    "plt.ylabel('Distortion')\n",
    "plt.title('The Elbow Method showing the optimal k')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b7a28a93",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "What is the best number of clusters when using the Elbow method?\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37abcb60",
   "metadata": {},
   "source": [
    "### Silhouette Plots\n",
    "Silhouette plots give rich information on the quality of a clustering\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a4279b2",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:18:54.664495Z",
     "start_time": "2021-11-13T15:18:54.646323Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import silhouette_samples, silhouette_score\n",
    "import matplotlib.cm as cm\n",
    "#modified code from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html\n",
    "\n",
    "def silplot(X, cluster_labels, clusterer, pointlabels=None):\n",
    "    n_clusters = clusterer.n_clusters\n",
    "\n",
    "    # Create a subplot with 1 row and 2 columns\n",
    "    fig, (ax1, ax2) = plt.subplots(1, 2)\n",
    "    fig.set_size_inches(11,8.5)\n",
    "\n",
    "    # The 1st subplot is the silhouette plot\n",
    "    # The silhouette coefficient can range from -1, 1 but in this example all\n",
    "    # lie within [-0.1, 1]\n",
    "    ax1.set_xlim([-0.1, 1])\n",
    "\n",
    "    # The (n_clusters+1)*10 is for inserting blank space between silhouette\n",
    "    # plots of individual clusters, to demarcate them clearly.\n",
    "    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])\n",
    "\n",
    "    # The silhouette_score gives the average value for all the samples.\n",
    "    # This gives a perspective into the density and separation of the formed\n",
    "    # clusters\n",
    "    silhouette_avg = silhouette_score(X, cluster_labels)\n",
    "    print(\"For n_clusters = \", n_clusters,\n",
    "          \", the average silhouette_score is \", silhouette_avg,\".\",sep=\"\")\n",
    "\n",
    "    # Compute the silhouette scores for each sample\n",
    "    sample_silhouette_values = silhouette_samples(X, cluster_labels)\n",
    "\n",
    "    y_lower = 10\n",
    "    for i in range(0,n_clusters+1):\n",
    "        # Aggregate the silhouette scores for samples belonging to\n",
    "        # cluster i, and sort them\n",
    "        ith_cluster_silhouette_values = \\\n",
    "            sample_silhouette_values[cluster_labels == i]\n",
    "\n",
    "        ith_cluster_silhouette_values.sort()\n",
    "\n",
    "        size_cluster_i = ith_cluster_silhouette_values.shape[0]\n",
    "        y_upper = y_lower + size_cluster_i\n",
    "\n",
    "        color = cm.nipy_spectral(float(i) / n_clusters)\n",
    "        ax1.fill_betweenx(np.arange(y_lower, y_upper),\n",
    "                          0, ith_cluster_silhouette_values,\n",
    "                          facecolor=color, edgecolor=color, alpha=0.7)\n",
    "\n",
    "        # Label the silhouette plots with their cluster numbers at the middle\n",
    "        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))\n",
    "\n",
    "        # Compute the new y_lower for next plot\n",
    "        y_lower = y_upper + 10  # 10 for the 0 samples\n",
    "\n",
    "    ax1.set_title(\"The silhouette plot for the various clusters.\")\n",
    "    ax1.set_xlabel(\"The silhouette coefficient values\")\n",
    "    ax1.set_ylabel(\"Cluster label\")\n",
    "\n",
    "    # The vertical line for average silhouette score of all the values\n",
    "    ax1.axvline(x=silhouette_avg, color=\"red\", linestyle=\"--\")\n",
    "\n",
    "    ax1.set_yticks([])  # Clear the yaxis labels / ticks\n",
    "    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])\n",
    "\n",
    "    # 2nd Plot showing the actual clusters formed\n",
    "    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)\n",
    "\n",
    "    # axes \n",
    "    ax2.scatter(X_2D[:, 0], X_2D[:, 1], marker='.', s=200, lw=0, alpha=0.7,\n",
    "                c=colors, edgecolor='k')\n",
    "    xs = X_2D[:, 0]\n",
    "    ys = X_2D[:, 1]\n",
    "\n",
    "\n",
    "    if pointlabels is not None:\n",
    "        for i in range(len(xs)):\n",
    "            plt.text(xs[i],ys[i],pointlabels[i])\n",
    "\n",
    "    # Labeling the clusters (transform to PCA space for plotting)\n",
    "    centers = clusterer.cluster_centers_\n",
    "    # Draw white circles at cluster centers\n",
    "    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',\n",
    "                c=\"white\", alpha=1, s=200, edgecolor='k')\n",
    "\n",
    "    for i, c in enumerate(centers):\n",
    "        ax2.scatter(c[0], c[1], marker='$%d$' % int(i), alpha=1,\n",
    "                    s=50, edgecolor='k')\n",
    "\n",
    "    ax2.set_title(\"The visualization of the clustered data.\")\n",
    "    ax2.set_xlabel(\"PC1\")\n",
    "    ax2.set_ylabel(\"PC2\")\n",
    "\n",
    "    plt.suptitle((\"Silhouette analysis for KMeans clustering on sample data \"\n",
    "                  \"with n_clusters = %d\" % n_clusters),\n",
    "                 fontsize=14, fontweight='bold')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a39a393a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:18:55.373437Z",
     "start_time": "2021-11-13T15:18:54.857755Z"
    }
   },
   "outputs": [],
   "source": [
    "fitted_km = KMeans().fit(X_2D)  # Complete here\n",
    "cluster_labels = # Complete here\n",
    "\n",
    "silplot(X_2D, cluster_labels, fitted_km)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0615e2ce",
   "metadata": {},
   "source": [
    "### Silhouette Score"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2900cd9e",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T17:12:26.475324Z",
     "start_time": "2021-11-12T17:12:26.467696Z"
    }
   },
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "Now plot the score of the silhouette as the number of clusters varies.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e7975445",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:18:56.818670Z",
     "start_time": "2021-11-13T15:18:56.210885Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.metrics import silhouette_score\n",
    "\n",
    "scores = [0]  # silhouette score for 1 cluster\n",
    "for i in range(2,11):\n",
    "    km = KMeans().fit(X_2D)  # Complete here\n",
    "    sil = # Complete here\n",
    "    scores.append(sil)\n",
    "\n",
    "print(\"Optimized at\", max(range(len(scores)), key=scores.__getitem__)+1, \"clusters\")\n",
    "\n",
    "plt.figure(figsize=(16,8))\n",
    "plt.plot(range(1,11), scores, 'bx-')\n",
    "plt.xlabel('Number of clusters $k$')\n",
    "plt.ylabel('Average Silhouette')\n",
    "plt.title('The Silhouette Method showing the optimal $k$')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c999ea7f",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-12T17:12:34.169433Z",
     "start_time": "2021-11-12T17:12:34.155932Z"
    }
   },
   "source": [
    "<div class=\"alert alert-block alert-danger\">\n",
    "What is the best number of clusters when using the Silhouet method?\n",
    "\n",
    "Is it the same as with the Elbow method?\n",
    "    \n",
    "Cluster the dataset with the best number of clusters and try to explain each cluster.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8fcb892",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:22:27.369676Z",
     "start_time": "2021-11-13T15:22:26.894326Z"
    }
   },
   "outputs": [],
   "source": [
    "struct_km = KMeans().fit(X_2D)  # Complete here\n",
    "color_labels = [colors[i] for i in struct_km.labels_]\n",
    "plt.scatter(X_2D[:,0], X_2D[:,1], c=color_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d2449348",
   "metadata": {},
   "source": [
    "#### Gap Statistic\n",
    "\n",
    "The gap statistic is a method for estimating the number of clusters in a dataset. The techniques uses the output of any clustering algorithm, comparing the change in within-cluster dispersion with that expected under an appropriate reference null distribution. The original paper is [here](https://web.stanford.edu/~hastie/Papers/gap.pdf).\n",
    "\n",
    "* Githup page: [https://github.com/milesgranger/gap_statistic]\n",
    "* [K-Means Clustering and the Gap-Statistics](https://towardsdatascience.com/k-means-clustering-and-the-gap-statistics-4c5d414acd29)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4a88086d",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:35:32.672869Z",
     "start_time": "2021-11-13T15:35:32.669417Z"
    }
   },
   "outputs": [],
   "source": [
    "# need to install library 'gap-stat'\n",
    "#!pip install gap-stat"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "59a54757",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:35:34.292336Z",
     "start_time": "2021-11-13T15:35:33.300687Z"
    }
   },
   "outputs": [],
   "source": [
    "from gap_statistic import OptimalK\n",
    "\n",
    "gs_obj = OptimalK()\n",
    "\n",
    "n_clusters = gs_obj(X_2D,                                # The dataset\n",
    "                    n_refs=100,                          # Number of runs\n",
    "                    cluster_array=np.arange(1, 15))      # Range of clusterization\n",
    "print('Optimal clusters: ', n_clusters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5bb8f864",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:35:34.350495Z",
     "start_time": "2021-11-13T15:35:34.307155Z"
    }
   },
   "outputs": [],
   "source": [
    "gs_obj.gap_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51d59587",
   "metadata": {},
   "source": [
    "The columns of the dataframe are:\n",
    "\n",
    "* n_clusters - The number of clusters for which the statistics in this row were calculated.\n",
    "* gap_value - The Gap value for this n.\n",
    "* gap* - The Gap* value for this n.\n",
    "* ref_dispersion_std - The standard deviation of the reference distributions for this n.\n",
    "* sk - The standard error of the Gap statistic for this n.\n",
    "* sk* - The standard error of the Gap* statistic for this n.\n",
    "* diff - The diff value for this n (see the methodology section for details).\n",
    "* diff* - The diff* value for this n (corresponding to the diff value for Gap*).\n",
    "\n",
    "Additionally, the relation between the above measures and the number of clusters can be plotted by calling the OptimalK.plot_results() method (meant to be used inside a Jupyter Notebook or a similar IPython-based notebook), which prints four plots:\n",
    "\n",
    "* A plot of the Gap value versus n, the number of clusters.\n",
    "* A plot of diff versus n.\n",
    "* A plot of the Gap* value versus n, the number of clusters.\n",
    "* A plot of the diff* value versus n."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9000f0b0",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:35:35.283871Z",
     "start_time": "2021-11-13T15:35:34.353529Z"
    }
   },
   "outputs": [],
   "source": [
    "gs_obj.plot_results()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2a556ab",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:34:33.210437Z",
     "start_time": "2021-11-13T15:34:33.202451Z"
    }
   },
   "source": [
    "<div class=\"alert alert-block alert-danger\">    \n",
    "Cluster the dataset with the best number of clusters givent by gap-stat and try to explain each cluster.\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6426f47a",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:35:35.788834Z",
     "start_time": "2021-11-13T15:35:35.290129Z"
    }
   },
   "outputs": [],
   "source": [
    "''' TO DO '''\n",
    "struct_km = KMeans().fit(X_2D) # Complete here\n",
    "color_labels = [colors[i] for i in struct_km.labels_]\n",
    "plt.scatter(X_2D[:,0], X_2D[:,1], c=color_labels)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47dd3746",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T15:33:09.636279Z",
     "start_time": "2021-11-13T15:33:09.629159Z"
    }
   },
   "source": [
    "<div class=\"alert alert-block alert-danger\">    \n",
    "Try to explain each cluster\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cfbb756d",
   "metadata": {},
   "source": [
    "## Hierarchical clustering\n",
    "\n",
    "K-means is a very 'hard' clustering: points belong to exactly one cluster, no matter what. A hierarchical clustering creates a nesting of clusters as existing clusters are merged or split.\n",
    "\n",
    "Dendograms (literally: branch graphs) can show the pattern of splits/merges. Unfortunately dendograms are not implemented in sklearn. We have to use the scipy library."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "143249bd",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T17:58:07.376851Z",
     "start_time": "2021-11-13T17:58:07.372501Z"
    }
   },
   "outputs": [],
   "source": [
    "import scipy.cluster.hierarchy as hac\n",
    "from scipy.spatial.distance import pdist"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8a2c8757",
   "metadata": {},
   "source": [
    "The following code shows different dendograms build with different algorithms:\n",
    "\n",
    "1. Min (or single linkage) algorithm with euclidian distance: the distance of the new cluster, is the min.\n",
    "1. Max (or complete linkage) algorithm with euclidian distance: : the distance of the new cluster, is the max.\n",
    "1. Ward algorithm with euclidean distance: the distance of the new cluster, is the average of the sum of the square of the distance.\n",
    "\n",
    "![three linkage type](https://www.researchgate.net/profile/Pamela-Guevara/publication/281014334/figure/fig57/AS:418517879934980@1476793847581/The-three-linkage-types-of-hierarchical-clustering-single-link-complete-link-and.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "af3e9dda",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T18:01:29.575608Z",
     "start_time": "2021-11-13T18:01:28.915189Z"
    }
   },
   "outputs": [],
   "source": [
    "# Min (or single linkage) algorithm with euclidian distance\n",
    "plt.figure(figsize=(11,8.5))\n",
    "dist_mat = pdist(X_2D, metric=\"minkowski\", p=2) # Euclidian\n",
    "ward_data = hac.single(dist_mat) # min\n",
    "\n",
    "hac.dendrogram(ward_data)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ce645418",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T18:01:30.194875Z",
     "start_time": "2021-11-13T18:01:29.579512Z"
    }
   },
   "outputs": [],
   "source": [
    "# Max (or complete linkage) algorithm with euclidian distance\n",
    "plt.figure(figsize=(11,8.5))\n",
    "dist_mat = pdist(X_2D, metric=\"minkowski\", p=2) # Euclidian\n",
    "ward_data = hac.complete(dist_mat) # max\n",
    "\n",
    "hac.dendrogram(ward_data)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdd2bce3",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T18:01:31.450023Z",
     "start_time": "2021-11-13T18:01:30.197720Z"
    }
   },
   "outputs": [],
   "source": [
    "# Ward algorithm with euclidean distance\n",
    "plt.figure(figsize=(11,8.5))\n",
    "dist_mat = pdist(X_2D, metric=\"minkowski\", p=2) # Euclidian\n",
    "ward_data = hac.ward(dist_mat) # average\n",
    "\n",
    "hac.dendrogram(ward_data)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d4e6505",
   "metadata": {},
   "source": [
    "If sklearn can't draw dendogrames directly. It is nevertheless possible :\n",
    "\n",
    "* to do agglomerative clustering with sklearn ([sklearn.cluster.AgglomerativeClustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html))\n",
    "* to draw dendograms with sklearn ([doc](https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html#sphx-glr-auto-examples-cluster-plot-agglomerative-dendrogram-py))\n",
    "\n",
    "In the following example, we search for 3 clusters using a Euclidean distance and ward as a linkage criterion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "29471770",
   "metadata": {
    "ExecuteTime": {
     "end_time": "2021-11-13T18:11:56.121554Z",
     "start_time": "2021-11-13T18:11:55.781741Z"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.cluster import AgglomerativeClustering\n",
    "\n",
    "struct_ac = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')  \n",
    "struct_ac.fit_predict(X_2D)\n",
    "color_labels = [colors[i] for i in struct_ac.labels_]\n",
    "plt.scatter(X_2D[:,0], X_2D[:,1], c=color_labels)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}