{ "cells": [ { "cell_type": "markdown", "id": "63bd6c3d", "metadata": {}, "source": [ "# Clustering" ] }, { "cell_type": "markdown", "id": "a36fbf12", "metadata": {}, "source": [ "### Motivation\n", "Membrane proteins are very mysterious: their shape and folding are rigidly constrained by the geometry of the lipid bilayer they are inserted in, and yet they carry out 50% of the functions in any type of cell. Slight structural changes, coupled with just the right changes in the biochemistry of the amino acid sequence, can give rise to extremely diverse behaviors. This is certainly the case with the very large group of [G protein-coupled receptors](https://en.wikipedia.org/wiki/G_protein-coupled_receptor), which couple with guanine nucleotide-binding (G) proteins and have a distinct structural trademark: they all have exactly 7 transmembrane helices. Among the many protein families in this group, [Rhodopsin-like receptors](https://en.wikipedia.org/wiki/Rhodopsin-like_receptors) also share a similar active site. Their functions nonetheless remain very diverse: the targets of these receptors can be neuropeptides, neurotransmitters, and even light (as for rhodopsin itself).\n", "The function and evolution of many of these proteins remain poorly understood: in this lab we will try to **analyze the proteins' structural differences and see whether we can infer something about their classification and their evolutionary history**.\n", "\n", "### Data\n", "Protein structures can be aligned (i.e. carefully and somewhat flexibly superposed) by means of the many available structure alignment algorithms. Moreover, in order to appreciate the slight structural differences in this group, we will need a good measure of the similarity between the aligned structures. We will work with data taken from [EncoMPASS - the Encyclopedia of Membrane Proteins Analyzed by Structure and Symmetry](https://encompass.ninds.nih.gov/). 
The metric we will use is the TM-score. The TM-score of a target protein aligned to a template structure is defined as\n", "\n", "$\\mathrm{TM\\text{-}score} = \\max\\left[\\frac{1}{L_{\\mathrm{target}}}\\sum_{i=1}^{L_{\\mathrm{common}}}\\frac{1}{1 + \\left(\\frac{d_i}{d_0(L_{\\mathrm{target}})}\\right)^2}\\right]$\n", "\n", "where $L_{\\mathrm{target}}$ is the length of the sequence of the target protein, $L_{\\mathrm{common}}$ is the number of amino acids the two proteins have in common (i.e. the aligned residue pairs), $d_i$ is the distance between the $i$-th pair of aligned residues, and $d_0(L_{\\mathrm{target}})$ is a length-dependent scale that normalizes the distances. The maximum is taken over the possible superpositions of the two structures. A TM-score of 1 indicates a perfect alignment, whereas a TM-score of 0 indicates a complete misalignment. A TM-score above 0.5 is a good indicator that the two structures are related to each other." ] }, { "cell_type": "code", "execution_count": null, "id": "1383968e", "metadata": { "ExecuteTime": { "end_time": "2021-11-13T15:14:30.652382Z", "start_time": "2021-11-13T15:14:25.774122Z" } }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": null, "id": "bc15a068", "metadata": { "ExecuteTime": { "end_time": "2021-11-13T15:14:38.398191Z", "start_time": "2021-11-13T15:14:30.660517Z" } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "id": "2e18d82e", "metadata": {}, "source": [ "## Learn about the data" ] }, { "cell_type": "markdown", "id": "74669ec9", "metadata": {}, "source": [ "The file *rhodopsins.txt* contains the TM-scores associated with every pair of rhodopsin-like proteins whose structure has been experimentally determined.\n", "The first technical problem we encounter is that the TM-score is not a distance: can you identify all the reasons why it isn't?" 
] }, { "cell_type": "markdown", "id": "46c195d4", "metadata": {}, "source": [] }, { "cell_type": "markdown", "id": "9d1ec6d0", "metadata": {}, "source": [ "Among the reasons why the TM-score cannot be used as a distance, there is one which is even more fundamental than the others: state it, and add the code that enforces this condition in the cell below." ] }, { "cell_type": "markdown", "id": "ca6fc5b4", "metadata": {}, "source": [] }, { "cell_type": "code", "execution_count": null, "id": "9aa461f5", "metadata": {}, "outputs": [], "source": [ "# Reads the file\n", "dist = {}\n", "keys = set()\n", "with open('rhodopsinlike.txt') as f:\n", "    for line in f:\n", "        fields = line.split()\n", "        dist[(fields[0]+'_'+fields[1], fields[2]+'_'+fields[3])] = float(fields[8])\n", "        keys.add(fields[0]+'_'+fields[1])\n", "\n", "# Creates an ordered list of labels\n", "lkeys = sorted(list(keys))\n", "\n", "# Creates the TM-score matrix\n", "X = np.zeros((len(lkeys), len(lkeys)))\n", "for i1, k1 in enumerate(lkeys):\n", "    for q, k2 in enumerate(lkeys[i1+1:]):\n", "        i2 = q + i1 + 1\n", "        X[i1, i2] = dist[(k1, k2)]\n", "        X[i2, i1] = dist[(k1, k2)]\n", "\n", "'''Add your code here'''\n", "\n", "'''End of your code'''\n", "\n", "# Plot the \"distance\" matrix you have obtained\n", "import seaborn as sns\n", "\n", "plt.figure(figsize=(11,8.5))\n", "sns.heatmap(X)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "66bc3c81", "metadata": {}, "source": [ "## Dimensionality reduction" ] }, { "cell_type": "markdown", "id": "06309fa2", "metadata": {}, "source": [ "A large set of pairwise distances generally lies in a high-dimensional space 1. In order to visualize it, we will need, as usual, to reduce the dimensionality, this time with an algorithm that can accept distances instead of coordinates: [Multidimensional scaling](https://en.wikipedia.org/wiki/Multidimensional_scaling).\n", "Remember: all the considerations we will make will be exact in the original, N-dimensional space! 
In 2D, they will hopefully not be too far off, but they will certainly be an approximation.\n", "\n", "1 *As an example, think about the simple case where you have three points, each at distance 1 from all the others. The only possible arrangement is when the points form an equilateral triangle. If you now add a fourth point which has to respect the same condition, 2 dimensions will not be enough! The only possible arrangement in 3 dimensions is the regular tetrahedron.*" ] }, { "cell_type": "code", "execution_count": null, "id": "c8d049e6", "metadata": {}, "outputs": [], "source": [ "from sklearn.manifold import MDS\n", "\n", "embedding = MDS(n_components=2, dissimilarity='precomputed', metric=True)\n", "X_2D = embedding.fit_transform(X)\n", "plt.scatter(X_2D[:,0], X_2D[:,1])" ] }, { "cell_type": "markdown", "id": "bcbd3bcc", "metadata": {}, "source": [ "## Clustering with K-means\n", "\n", "K-means is a classical, workhorse clustering algorithm, and a common place to start. It assumes there are K centers and, starting from random guesses, iteratively improves its guess about where the centers must be." ] }, { "cell_type": "markdown", "id": "4193c157", "metadata": {}, "source": [ "