{"cells": [{"cell_type": "markdown", "metadata": {}, "source": ["# Exercise: Molecular Dynamics Analysis"]}, {"cell_type": "markdown", "metadata": {}, "source": ["In this exercise we will work on simulation data for the *solution NMR structure of a small peptide* (PDB-ID 6a5j). Refer to the last exercise for details on the preparation of the simulation. You can download the trajectory file `simulation.dcd` (~150 ns total length, 20 ps time step) [here](md_analysis/simulation.dcd). A topology of the simulated system can be constructed from `simulation.pdb`, available [here](md_analysis/simulation.pdb).\n", "\n", "<div class='alert alert-info'>\n", "    \n", "**Info:** The system was simulated in water but the trajectory only contains atom coordinates of the protein to save disk space. We can not perform any analysis involving solvent on our system anymore.\n", "</div>\n", "\n", "Similar to MD simulation setups and runs, we have many choices for how to analyse a molecular trajectory. Of the many programs and packages out there to perform various \n", "types of analyses, we will use [MDTraj](http://mdtraj.org/1.9.3/index.html) which you can install via conda: \n", "\n", "```bash\n", "conda install -c conda-forge mdtraj \n", "```\n", "\n", "We will also need the universal [scikit-learn](https://scikit-learn.org/stable/) package:\n", "\n", "```bash\n", "conda install scikit-learn \n", "```"]}, {"cell_type": "code", "execution_count": 1, "metadata": {"ExecuteTime": {"end_time": "2020-07-01T08:58:57.243178Z", "start_time": "2020-07-01T08:58:56.426214Z"}}, "outputs": [], "source": ["# You will need the following modules\n", "import matplotlib\n", "import mdtraj\n", "import numpy as np\n", "from sklearn.cluster import DBSCAN\n", "from sklearn.decomposition import PCA"]}, {"cell_type": "markdown", "metadata": {}, "source": ["__1__ Open VMD and load the structure file `simulation.pdb`. The trajectory file `simulation.dcd` is in a compressed binary format. It can be read by VMD, but it can not be visualized on its own because it stores only coordinates and no information about atom types and alike. Load `simulation.dcd` into the same molecule as `simulation.pdb` (e.g. via `File > Load Data Into Molecule`). Choose a representation of the molecule that emphasises the secondary structure and watch the trajectory as a movie.\n", "\n", "  - __a)__ Does the helical structure of the molecule remain stable over the simulation?"]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __b)__ Include a figure of the last frame of the simulation in your report."]}, {"cell_type": "markdown", "metadata": {}, "source": ["__2__ You may have noticed that the molecule *jumps* across the simulation box when you\n", "visualise the trajectory in VMD. This is due to the periodic-boundary-conditions applied. For the visualisation and also for certain types of analyses, we should remove those jumps. We should also remove translational and rotational degrees of freedom from the simulation by re-centering and re-rotating the molecule at every simulation step. Those motions normally do not contain much information, despite how the molecule floats through the box."]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __a)__ Load the trajectory using `mdtraj.load_dcd`. You will need to specify the path to the trajectory file as first argument. Use the keyword argument `top` to specify `simulation.pdb` as supporting topology file. Print the trajectory object for an overview and print the atomic coordinates (`Trajectory.xyz`) of the first atom in the second simulation frame."]}, {"cell_type": "markdown", "metadata": {"ExecuteTime": {"end_time": "2020-06-30T08:40:59.258608Z", "start_time": "2020-06-30T08:40:59.247384Z"}}, "source": ["  - __b)__ Now we can use the method `superpose` of our trajectory object to fit every frame of the trajectory to a reference structure. As we have only one molecule in the system, we will get away with that. If we would have also solvent in it, we would need to tinker with the periodic image as well. Use `superpose` with the trajectory itself as first argument and use `frame=0` to specify that the first simulation frame should be our reference. It is often a good idea to use a rigid reference group of atoms, so lets use only the backbone atoms of the protein and pass them with the keyword argument `atom_indices`. You can get the atom indices with `Trajectory.topology.select(\"protein and backbone\")`. MDTraj understands (almost) the same selections as VMD. You can also use the `center_coordinates` method of the trajectory afterwards to center the molecule in the simulation box."]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __c)__ Save the fitted trajectory to disk with the `save_dcd` method of the     trajectory object. If you now visualise the trajectory in VMD you will notice that the jumps are gone and the trajectory snapshots are well aligned. Include a figure in your report that shows that the fit was successful, e.g. show 10 equidistant frames from the simulation and choose a different representation for the rigid backbone and the flexible sidechains (in the *Graphical Representation window* in VMD, there is a *Trajectory* tab in which you can control which frames are drawn)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["__3 RMSF__ Looking at the MD trajectory as a movie is often not very telling. Let's collect a few quantitative measures from the data instead to see what is going on during the simulation. First, we can calculate the root-mean-square *fluctuation* of atomic positions averaged over the course of the simulation to see which atoms move the most:\n", "\n", "$$\\sigma_i = \\sqrt{\\frac{1}{t_n}\\sum_{t=0}^{t_n}(x_i(t)-\\mu(x_i))^2}$$\n", "\n", "with $\\sigma_i$ being the RMSF of an atom, $x_i(t)$ being a current set of coordinates at time $t$ and $\\mu(x_i)$ denoting the average atomic coordinates (coordinates in a reference structure). The RMSF is basically the standard deviation of atomic positions."]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __a)__ Calculate the RMSF for the trajectory using `mdtraj.rmsf` with the trajectory both as target and reference. You can use `precentered=True` as we fitted the structures before ourselves. The output should be a 1D NumPy Array of length 260 (one RMSF value per atom in the system). "]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __b)__ Plot the RMSF vs. the atom indices as a barplot. Plot the RMSF for sidechain atoms in a different color than the backbone atoms."]}, {"cell_type": "markdown", "metadata": {}, "source": ["__4 RMSD__ Next, we can calculate the root-mean-square *deviation* of atomic positions (RMSD) to see how much the atomic coordinates change over time:\n", "\n", "$$\\tilde{r}(t) = \\sqrt{\\frac{1}{n}\\sum_{i=1}^{n}(x_i(t)-\\mu(x_i))^2}$$\n", "\n", "with $\\tilde{r}$ being the RMSD of a group of atoms at a given point in time."]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __a)__ Calculate the RMSD over the trajectory using `mdtraj.rmsd` with the trajectory both as target and reference. You can use `precentered=True` as we fitted the structures before ourselves. The output should be a 1D NumPy Array of the same length as the trajectory (one RMSD value per simulation frame). Calculate the RMSD separately for the whole molecule and backbone atoms only."]}, {"cell_type": "markdown", "metadata": {}, "source": [" - __b)__ Plot the RMSDs vs. time (in ns) as a time series."]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __c)__ Compute a histogram (`numpy.histogram`) from the backbone RMSD and plot the distribution of RMSD values. Choose a suitable data range (e.g. 0.1 to 0.4 nm) and an appropriate number of bins (e.g. 60) and normalise the histogram (`density=True`)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["__5 Hydrogen bonds__ are often a very important feature for the stability of protein structures. The presence of hydrogen bonds can be determined by using a distance and an angle cutoff for pairwise hydrogen\u2013polar atom interactions (e.g. hydrogen\u2013oxygen). If a hydrogen atom (attached to a polar atom) is close enough to another polar atom and if the angles of the partaking bonds meet a criterion, one can assume that a hydrogen bond is formed. "]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __a)__ Collect hydrogen bond information from the trajectory data using the Wernet-Nilsson scheme (`mdtraj.wernet_nilsson`). You can use `periodic=False`. The result will be a list of the same length as the trajectory (one element per frame). The list contains 2D NumPy arrays, representing the bonds found, of shape `(#bonds, 3)`. Each bond will occupy a row of the array with three elements that are the donor, hydrogen, and acceptor atom indices. For example the entry `[103, 109, 46]` describes a bond between the amino acid residues SER6 and LYS3. Atom indices can be mapped to atom names with `Trajectory.topology.atom`."]}, {"cell_type": "code", "execution_count": 2, "metadata": {"ExecuteTime": {"end_time": "2020-07-01T08:59:07.383913Z", "start_time": "2020-07-01T08:59:07.371322Z"}}, "outputs": [], "source": ["def hb_index_to_name(trajectory, hbond):\n", "    \"\"\"Convert hydrogen bond index tuple to name\"\"\"\n", "    return f\"{trajectory.topology.atom(hbond[0])}\u2013{trajectory.topology.atom(hbond[2])}\""]}, {"cell_type": "markdown", "metadata": {"ExecuteTime": {"end_time": "2020-07-01T08:59:08.643018Z", "start_time": "2020-07-01T08:59:08.472244Z"}}, "source": ["```python\n", ">>> hb_index_to_name(trajectory, [103, 109, 46])\n", "'SER6-N\u2013LYS3-O'\n", "```"]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __b)__ Calculate the frequency with which the bonds are found over the simulation. You can use a dictionary using the hydrogen bond index tuples as keys and an integer counter as values. Loop through the found hydrogen bonds and increase the counter if a bond is present in the frame. Which bonds are present in at least 10 % of the simulation?"]}, {"cell_type": "markdown", "metadata": {"ExecuteTime": {"end_time": "2020-06-30T11:30:59.230144Z", "start_time": "2020-06-30T11:30:59.218661Z"}}, "source": ["  - __c)__ Include an image in your report showing one or more of these frequent hydrogen bonds. In VMD there is a representation style *HBonds*. Which of these bonds would you consider most important for the stability of the helix?"]}, {"cell_type": "markdown", "metadata": {}, "source": ["__6 PCA__ The Cartesian space in which our molecule lives is already quite high dimensional (260 * 3 coordinates) and it can be hard to identify the different conformations, i.e. energy minima, it can be in. Principle component analysis gives us the opportunity to project the trajectory from the complex Cartesian space into a transformed low dimensional space. We can use the `sklearn.decomposition.PCA` class from the scikit-learn package for this. This will automise the task of constructing a covariance matrix from the trajectory and diagonalising it. Let's represent our trajectory in the 2D space obtained by PCA, which will entail the molecular motion of the highest spatial variance, i.e. the largest collective conformational change."]}, {"cell_type": "code", "execution_count": 3, "metadata": {"ExecuteTime": {"end_time": "2020-07-01T09:00:55.154047Z", "start_time": "2020-07-01T09:00:55.141562Z"}}, "outputs": [], "source": ["# Initialise sklearn PCA\n", "pca = PCA(n_components=2)"]}, {"cell_type": "markdown", "metadata": {}, "source": ["The scikit-learn analysis is agnostic of the nature of our data. As we saw before the coordinates in our trajectory are stored in a 3D array with shape `(#frames, #atoms, #coordinates)`. The analysis, however, expects a 2D array of shape `(#frames, #coordinates)` which means we need to forget about the atoms in the array structure and treat every coordinate the same (in the following it is assumed that our trajectory object was named `trajectory`)."]}, {"cell_type": "code", "execution_count": 4, "metadata": {"ExecuteTime": {"end_time": "2020-06-30T14:28:47.667446Z", "start_time": "2020-06-30T14:28:47.652129Z"}}, "outputs": [{"data": {"text/plain": ["(7789, 260, 3)"]}, "execution_count": 22, "metadata": {}, "output_type": "execute_result"}], "source": ["trajectory.xyz.shape"]}, {"cell_type": "code", "execution_count": 5, "metadata": {"ExecuteTime": {"end_time": "2020-06-30T14:28:53.869088Z", "start_time": "2020-06-30T14:28:53.864883Z"}}, "outputs": [{"data": {"text/plain": ["(7789, 780)"]}, "execution_count": 23, "metadata": {}, "output_type": "execute_result"}], "source": ["pca_input = trajectory.xyz.reshape(trajectory.n_frames, trajectory.n_atoms * 3)\n", "pca_input.shape"]}, {"cell_type": "markdown", "metadata": {}, "source": ["Now we can use the trajectory as input to the PCA estimator."]}, {"cell_type": "code", "execution_count": 6, "metadata": {"ExecuteTime": {"end_time": "2020-06-30T14:29:36.452851Z", "start_time": "2020-06-30T14:29:36.303428Z"}}, "outputs": [{"data": {"text/plain": ["(7789, 2)"]}, "execution_count": 24, "metadata": {}, "output_type": "execute_result"}], "source": ["trajectory_pca = pca.fit_transform(pca_input)\n", "trajectory_pca.shape"]}, {"cell_type": "markdown", "metadata": {}, "source": ["As you can recognise by the shape of the transformed trajectory, we now have a trajectory with the same number of frames as before but in only two dimensions."]}, {"cell_type": "markdown", "metadata": {}, "source": ["  - __a)__ Plot the dimensionality reduced trajectory in 2D as a scatter plot. Color the markers by the time step and add a colorbar. Can you spot different states in the projection? Optionally, convert the projected trajectory to a probability density (2D histogram) and plot it as a contour plot as well."]}, {"cell_type": "markdown", "metadata": {"ExecuteTime": {"end_time": "2020-06-30T13:16:17.871142Z", "start_time": "2020-06-30T13:16:17.861089Z"}}, "source": ["  - __b)__ In low dimensional spaces, conformational states can be spotted as regions in space where conformational snapshots accumulate, that is in low energy regions where the observation probability is high. Separating structures according to these states by hand can be, however, very tedious and error prone. This is what we can use cluster algorithms for. Cluster algorithms put a *label* on every point in a data set that indicates the membership of this point to a cluster. How a cluster \u2013 so a group of related points with similar properties \u2013 is defined, depends on the used algorithm. Here we could use the DBSCAN (density based clustering for applications with noise) method (available in the scikit-learn package as `sklearn.cluster.DBSCAN` ) to identify clusters as regions of high point density."]}, {"cell_type": "code", "execution_count": 7, "metadata": {"ExecuteTime": {"end_time": "2020-06-30T14:34:02.653265Z", "start_time": "2020-06-30T14:34:02.487259Z"}}, "outputs": [], "source": ["# Initialise scikit-learn clustering\n", "# DBSCAN - Points are in a dense region in space if they have a minimum\n", "#     number of neighbouring points\n", "\n", "# Cluster parameters: \n", "#    eps: Neighbour search radius\n", "#    min_samples: Minimum number of neighbours\n", "clustering = DBSCAN(eps=0.4, min_samples=18)\n", "labels = clustering.fit(trajectory_pca).labels_"]}, {"cell_type": "markdown", "metadata": {}, "source": ["  This will give us an array of labels with the same length as our trajectory (one    label/cluster assignment for every time step). Plot the trajectory in the PCA space again as a scatter plot but this time color the markers by cluster membership (hint: Use a discrete colormap like `tab20`). Note, that the label `-1` is used for noise points (not assigned to any cluster)."]}, {"cell_type": "markdown", "metadata": {}, "source": ["__c)__ For the two biggest identified clusters, find the point that is closest to the arithmetic mean of the cluster. Include an image in your report that shows the two representative cluster structures in comparison."]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.10"}, "toc": {"base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false}}, "nbformat": 4, "nbformat_minor": 4}