3DM advanced: 3DM applied to GPCRs

Index

Before doing this exercise it is best to first do the https://bioprodict.atlassian.net/wiki/spaces/DOC/pages/2992209927course.

General

The complete course consists out of three parts:

To follow these courses, you need a Bio-Prodict account. If you do not have one yet, please start by filling out the form on the registration page. Are you eligible for an academic license? Make sure to register with your academic email address. Once you submit the registration form, you will receive a confirmation email with further instructions on how to activate your account.

For some exercises in this section, you need either PyMOL or YASARA installed which has the 3DM plugin. If you do not have YASARA or PyMOL or you are missing the 3DM functionality, please consult the https://bioprodict.atlassian.net/wiki/spaces/DOC/pages/8323108. Make sure you have the latest version installed before you start this exercise.

In case you have any questions about this course, please get in touch with our support team via support@bio-prodict.com.

Introduction

After entering the login details on the login page, you land on the 3DM dashboard page. Here, you can find an overview of all available systems. In the 3DM COURSE tab, click on the GPCR (2016) system. This is the 3DM system we will be working with during this course. 

By default, the GPCR 3DM database shows data of the GPCR family class A sequences. There are other classes of GPCRs too which you will investigate later. Unless specifically requested otherwise, always use the GPCR class A protein family to answer the questions in this course.

Correlated mutations

Open 3DMs correlated mutation analysis tool CorNet by clicking on Correlated mutations in the left menu. You can see two big networks and a few smaller ones. Start thinking about whether the two bigger networks really are two separate networks, or if they are simply disconnected because of the fact that the default cut-off for the number of positions in the network is 25.

When two networks are independent, the residues in one of the networks mutate in sync with each other. Residues from the other network mutate in a different rate.

You can explore this idea by making an alignment with two groups of proteins. Each group should contain functionally similar proteins, but the function differs between the groups. For instance, a group can be made with enzymes that all convert a specific substrate, and a second group that converts another substrate. The correlated mutations will likely reveal hotspots that determine the substrate specificity change.

A very important note is that correlated mutations reflect functional changes and not environmental changes. This can easily be explained, because you can not use correlated mutations to find hotspots for thermostability. You could make an alignment of two groups: one composed of proteins from thermophilic organisms and another made from non-thermophilic organisms. Then, the differences between these groups should show up as correlated mutations, right? This is not the case, because in fact these two groups cover all sequences of the complete superfamily alignment. You can investigate the differences between these groups just by comparing the amino acid contents between the groups. This concept will be discussed later.

Another important note: The problem with functional grouping is that the functional annotations of proteins is often based only on sequence similarity and can therefore be incorrect. You need to make sure that at least a high percentage of the sequences share the same function to make functional based hotspot detection work. Examples are known in which the alignment was successfully grouped simply by using keyword searches in the protein descriptions and was used to find novel enzymes. The group of professor Uwe Bornscheuer has found R-selective amine transferases (only S-selective enzymes where known) by deleting groups of proteins from the protein family that had other activities based on motifs that were associated with the different functions. This exercise was repeated with 3DM. The protein family was first grouped by giving keywords that were related to enzymes with unwanted reaction mechanism. The keywords “lyase”, for instance, was used to select enzymes that likely had the unwanted lyase activity. Then, this set of sequences was used to find a sequence motif specific for lyase activity. All sequences that had this lyase specific sequence motif were subsequently deleted from the alignment. This step ensures that all lyases that were not annotated as such still got deleted from the alignment. This procedure was repeated for all different unwanted enzyme activities. This exercise resulted in 42 sequences that are highly likely all R-selective amine transferases.

There are multiple ways of doing this. One easy way would be to look at the amino acids distributions in each network. If the number of different motifs/different groups in the networks differ (e.g. the positions of the first network all have only two different residue types and the positions of the other network all have five different residues), you know the two networks mutate independently. We will come back to this in question 6. Structural separation of the two networks might also provide a hint.

Yes, only one network clearly contains hotspots for specificity. This means that changes in specificity were the driving evolutionary force behind the correlated mutation behavior of this sub-network. For the other network, there is no literature in the 3DM database describing mutations effecting specificity. As this database contains huge amounts of mutation data, it is unlikely that these positions are hotspots for specificity.

Yes, it is likely they are also important for specificity because they are in the same network. If you change the default Keyword cut-off setting from 2 to 1, you see that more nodes are coloured light blue including position 93. This means that one reported mutation at position 93 is related to specificity. You could have a look in the 3D structure to see if they are close to the other members of this network (and/or close to the ligand binding pocket). However, the only way to really make sure is to make mutations in the lab and test the binding affinity of the ligand. In enzymes, such mutations would have an effect on the Km but not so much on the Kcat. Such mutations typically indicate a hotspot for specificity. The Km on other substrates might improve at such positions.

Correlated mutations often divide the superfamily into several groups based on sequence motifs. Sometimes this is just an evolutionary separation, for example in the case of bacterial sequences versus those of eukaryotes. However, because 3DM uses superfamily alignments containing proteins with different functions, the correlations are often a result of the changes between the different functional groups.

Usually, 3DM alignments are composed of sequences from organisms from the entire tree of life. If you would make a phylogenic tree of the whole alignment and you see that it nicely overlays the tree of life, then correlated mutations can reflect the different domains (evolutionary grouping). However, multiple sequences from the same organism are generally in a 3DM alignment and these sequences can sequentially be very different. They almost always have different functions. Because the alignment holds proteins with different functions, the different groups of the corresponding phylogenetic tree are function based. Proteins with the same function form the large groups/clusters in the tree, instead of the different domains or kingdoms of the tree of life. The different groups in such alignments may result in correlated mutations that lead to functional changes during evolution. Note that you can have both types of differentiation in a single alignment.

Click on the main nodes of the two big networks (position 88 and 74). Clicking on a node gives information about this position, such as the amino acid distributions.

You can investigate the correlation between residues by selecting them simultaneously. This can be done by making a selection box around the positions, or by selecting them while holding the Ctrl or command key.

Again, enter the keyword “specificity” in the keyword search box of the Literature & Mutations section. Now, look at the enrichment-score (overall enrichment) below the search box.

In the Literature & Mutations section you can change the Keyword cut-off. With the default value of 2, 3DM will only show positions in the network for which at least two different mutations have been published that had an effect on specificity. This is to ensure no false positives are included. In protein families for which a lot of mutations are available, it is usually smart to leave this at at least two. If there is not so much data available you can change it to 1 to get reliable E-scores.

Different protein features (e.g. activity, dimerisation) can be the underlying cause of an evolutionary pressure that has resulted in correlated mutations. In 3DM we pre-calculate the E-scores of the different protein features at different correlated mutation cut-offs. The resulting plot shows quickly which evolutionary pressure is behind a correlated mutation network. At high correlated mutation cut-offs specificity mutations are clearly over-represented in the GPCR CorNet network (Figure 1).

image-20241023-053454.png
Figure 1. E-scores of different keywords.

With the keyword cut-off set back to the default value of 2, select the other big network (with position 74 in its center) and color the nodes blue with the node coloring option that appears on the right when you select positions in the window. Select positions 247 and 248 and make them yellow. Select positions 42 and 32 and make them Magenta. Finally, select positions 60, 84, and 98 and make them red. Click on ALL NODES in the Visualize alignment position in structure menu. Keep the default settings on the Visualize page and create your scene by clicking on VISUALIZE. Open the downloaded file in YASARA or PyMol.

In case you are using YASARA, load the 1F88A co-crystalized compound using via 3DM → Structures → Load structure from 3DM.

If you use Pymol, start the 3DM plugin using plugin → initialize plugin system. Then, plugin → legacy plugins → 3DM. A new window should start. Use your 3DM credentials to login and click in this window on 3DM menu → structures → load structures from 3DM, select 1F88A and only select Download drug.

Families and numbering schemes

There are different families in the GPCR database. An alignment has been generated for each distinct GPCR class. These alignments are structure based and all have their own structural conserved core. Because these cores differ in length, the different alignments have their own numbering schemes. Via the Alignment menu at the top of the 3DM website you can switch between the different families.

In the left menu, click on System → System info to open a nice overview of the different family alignments.

Now using the Alignment scroll down menu select Superfamily 3DM and look at where the conserved residues are (which 3D numbers) using the Alignment statistics page. Now do the same for the group A sequences (this you can do best in another tab so you can switch between the tabs).

Select from the Numbering scheme scroll-down menu at the top of 3DM the gpcr_A numbering in both alignments.

Realise that between different 3D alignments, the 3D numbers can not be compared (you can noy select gprc_a numbers if you have the GPCR class B alignment selected). Because the size of the structural conserved cores differ between the different families, different numbering schemes have been developed for the different subclasses. Each of these numbering schemes indicates for which subclass the numbering is. Although it would have been even better if they had used a common numbering for the transmembrane parts that are shared by all classes, such that the conserved Cys would have had the same number in all classes. Although this is not the case, the commonly used numbering schemes designed for the GPCRs is a luxury that is only available in the GPCR protein family and has not been developed for any other protein family (as far as we know).

Making a structural model with 3DM

3DM contains an automated homology-modeling module. It uses the 3DM alignment between a sequence and a structure to generate models. At the protein detail page of a sequence you can find a MODELS tab. Here you can select a suitable template to generate a homology model of the protein. 3DM already makes a pre-selection of suitable templates based on sequence similarity. You can choose any of the pre-selected templates. Which of these templates is best is for you to decide. This decision should be based on what you want to do with your model. For instance, enzymes can be in the open or closed conformation. If you want your model in the closed conformation, then start with a template that is also in the closed conformation. Other factors can be e.g. with or without ligand, with an activating or inhibiting ligand bound in the ligand binding pocket, etc. So, always first investigate the different starting templates to see which one fits best to your needs.

Hirschsprung disease is a disease that is caused by mutations in a GPCR.

 

Select a template and make a model. Make sure that you have the GPCR class A alignment selected. Use YASARA or PyMol to open it. Creating the model may take a few minutes. If 3DM was not able to model parts of the sequence, these parts will be missing from your model. These missing positions are indicated as purple dots in YASARA and as lines in PyMol. Usually, this is because the alignment between the sequence and the template can not reliably be made (often the parts outside the core) due to very low sequence similarities. Realise that those parts can not be modeled reliably using the selected template, because the sequence similarity is so low that the two proteins will likely fold differently in those parts. Sometimes it helps to make a model choosing a different template (if available), but usually this means that those parts can simply not be modeled reliably.

Visualize data in structures

To visualize data in a structure you can use the Visualize tool in 3DM. Here you can select multiple structures that are all superimposed. You can select one or more of the templates that are used to generate the alignment if you select the Show only templates option from the quick filter menu. If you activate the advanced setting, a PDB type can be selected. The Aligned setting gives any of the other structures that could be superimposed on the templates. The Non-aligned setting contains structures that were detected by BLAST but which 3DM could not superimpose on the templates. These structures usually do not belong to the superfamily, or are structurally too distantly related. Models that you have generated with 3DM can be selected from the Model PDB type setting. The name of a model indicates which protein is modeled, and between brackets you see which structure was used to make the model. In the Compounds section you can select any co-crystalized compound. These are also superimposed, so usually they will be in the pocket of the structures. You can select any combination of protein structure and ligand, so you can insert any ligand into any structure.

Navigate to Alignment statistics in 3DM. Scroll down to the Keyword mutations plot. Here, you can enter a keyword to see how many mutations related to this specific keyword are found for a given position.

Make sure you have the GPCR class A family selected under Alignment at the top of the page and consult the Human variation/Position histogram. Compare this histogram with the Amino acid conservation histogram using the compare with option above the histogram. Set the cut-off for conservation on 50% and the SNP cut-off on 0,3.

Note that the  button can be found at many places in 3DM for direct visualization of data in.

There are other export buttons too, such as the Export to hotspot basket button. Let's see how this works. Put the cut-off of the SNP graph again on 1 and open the hotspot basket tool by opening the Hotspots menu in the top right corner of 3DM. At the SNP histogram this sign will appear (possibly with a different numnber): . This icon is for the insertion of selected positions in a hotspot basket. Click this button to upload the SNP positions into a new hotspot basket, give the basket a name, and save the basket.

A hotspot basket is nothing more than a selection of alignment positions. You can generate hotspots for different protein features (e.g. correlated mutations, specificity hotspots, thermostability hotspots, etc.) and those can be found using 3DM. At a later stage you can open the basket in different 3DM tools. Let's see how this works.

A good trick for increasing the thermostability of proteins is to make mutations at flexible regions. Go to the Alignment Statistics module of 3DM. Here you will find two histograms that display flexible positions, the RMSD (a measure for how tight the structures superimposed) and the average B-factor. The B-factor is a measure for how sharp the X-ray diffraction was for the atoms of a residue. A very sharp X-ray indicates that the amino acid is tightly positioned in the structure. The average B-factor is the average over all amino acids from all structures at an alignment position. Often these two plots show a similar pattern. If at a position both plots are high usually this is a good hotspot for changing thermostability.

Panel design

The idea of the panel design tool is to select sequences from the alignment such that the selected sequences are maximally distributed over the superfamily. This is done in two steps: First, the sequences of the superfamily are grouped. This can simply be based on sequence similarity (similar sequences are within the same group). Groups can also be based on sequence motifs found at user selected positions. The last option is used to group sequences based on a protein feature. For instance, the user can pick positions important for specificity. The idea is that sequences that have the exact same residues (the same motif) at those positions they are likely to have the same specificity. Both methods can be combined. In the second step a user defined number of sequences (usually one or two) are selected from each group. The selection step contains all kinds of options to maximise the chance that these proteins are likely to express. Let's see how it works. 

Select the Panel design option in 3DM. First, we will divide the super-family based on sequence motifs. Because we want maximise the specificity range in the panel, we will use the "specificity hotspot" basket you have generated in question 30. This way, all sequences with the same motif at our specificity hotspots (thus with likely the same specificity) will be in one group. Use the Add hotspots option to select the hotspot basket that you made in question 30.

Once they are selected, click on SHOW GROUPS.

You can also divide the alignment phylogenetically. This separation can be combined to the motif grouping. In the box under Phylogenetic groups you can give a number. The superfamily will then be divided into this number of groups purely based on phylogenetic distances. Reload the panel design tool (refresh the page), reload the hotspot basket, type 10 in the Phylogenetic groups size box and click again on SHOW GROUPS.

After defining the groups, you have to select from each group the sequences that you would like to have in the panel. There are several selection options that can be used to pick sequences from each group. First, you select the number of sequences per group with the Proteins per group option in the expanded Protein selection panel. Then, several options can be used to determine which sequences are selected. These options are there to maximize the chance that selected sequences can be expressed. If literature is available for a sequence for instance, or if there is a structure available, the chance that this protein can be expressed is higher since someone else has done it before. The different selection options can be combined using Must have or Prefer options. Must have will result in a smaller number of groups, because not all groups will have a structure available. Usually, it is a good idea to start with a larger number of groups that you want to have in your panel and delete some groups with these different options. Now play with the different options and see if you understand the result. To see the effect that the different options have, you can click on SELECT PROTEINS. Make a panel of approximately 96 sequences for which you require SWISS-PROT proteins, literature- and structure data.

Sometimes you need to make a panel of a subset of sequences. For instance, say you want to find the most active enzyme with a certain specificity. In this case you should first make a subset that contains only sequences that have this specificity. This sounds simple, but due to wrong notation of proteins it is actually quite tricky. The best way to do this is to first do a keyword search to find enzymes that are likely to have the correct specificity and make a subset of this set of sequences. Use this subset to make a motif with 4 to 7 amino acids that is specific for this subset. It is best to use the Subset specific residues plot (consult the OAH questions for more information). Make a new subset that contains all sequences that have this motif. With this approach you will not only find sequences with the correct specificity that are annotated as hypothetical protein, but you will also delete the sequences that are wrongly annotated. This approach does not make sure you have all sequences with the correct specificity, but it does maximise the chance that the ones that are in your subset all indeed have the correct specificity.

Patent analysis tool

3DM contains an advanced tool for the analysis of patent data for complete protein families. How the tool works is explained in the Patent analysis presentation. Go through the slides and then try out the tool yourself in the GPCR protein family via the Patents option in the 3DM menu. 

3DM also contains a tool for analysing patent data in a target sequence. You upload a target sequence to 3DM via System → Custom proteins → Add custom protein.

Designing mutation experiments with the de-convolution tool

The de-convolution tool is a protein-engineering tool that is made for designing mutation experiments for a specific target sequence. With this tool, many positions can be targeted at once without having to screen a large number of clones. The results of the measurements can be uploaded to the system and the system will calculate which mutations have the biggest effect on the feature(s) (e.g. activity or thermostability). The positive mutations can be combined to make the best performing sequence that can subsequently be used for a next round of engineering. Let's see how the tool works.

Go to Genomes (genomes.bio-prodict.nl) and login with these details:

  • Login: course_demo@bio-prodict.nl

  • Password: K7XlQAK23QvE8Bl%1i^L5#Ax460lC3

Once logged in you will see that we have set-up the de-convolution tool for one GPCR protein (P07550). At the right top you see an Erlenmeyer flask.

  • Click on the Erlenmeyer to open the tool.

  • Click on Create new project.

  • Give the project a name.

  • Select the target protein as the template of this protein engineering experiment.

  • Create the project

  • Click on this project once it is available (sometimes the page needs a refresh to make it visible).

  • Click on Add round to start designing the first round of mutations.

  • Give it the name Round 1 and save it.

  • Select Use the round 0 starting sequence (wild-type protein). Note that if you already have uploaded experiment data in a previous round, you can use a sequence from that round.

  • Click on Save base sequence.

  • Now we can select mutations to introduce to the sequence. Click on Mutations on the left.

  • We will use the hotspot basket generated previously. Select the Hotspots tab and select the thermostable basket.

  • Click on Add all with most frequent as variant type. This option ensures that each hotspot is mutated to the most common amino acid according to the alignment. If the target sequence already has the most common amino acid it selects the second most common residue.

  • Click Save.

Note that the left box is a tool for the selection of mutations. Once you have the desired mutations you can drag them into the box on the right. Now the system automatically generated the mutations for all of the hotspots based on amino acid occurrences, but you can also manually add mutations.

  • Select edit.

  • Select the "Manual" tab in the left box, type “V117C” and hit enter. Click Add all.

  • Save the mutations by clicking Save.

  • Click on Sequences on the left side of the page.

  • At Maximum # of mutations per sequence enter 4 and select Fill up each sequence to contain maximum # of mutations. This will ensure that all sequences will have 4 mutations. If you do not use this option, sequences can have single, double, and triple mutations. Be sure to check the option Create demo measurement data.

  • Select 96 sequences to generate and set the minimum number of observations to 2.

  • Click on Convolute mutations.

This convolution step calculates the best sequences to measure. The tool maximally distributes the mutations over the 96 sequences. In real life the 96 sequences are ordered at a gene synthesis company and the thermostability of these 96 clones are then measured in the lab. These measurements must be stored in an excel file which can be uploaded to the system. You can download the template of this file at the bottom of the sequence list (export CSV). In this file, you should enter the measured values and then upload the file via the "measurements" option on the left.

Because you are doing the course and we, of course, can not ask you to make the measurements in the lab we generate random measurement values for your 96 constructs and you can right away jump to the "analysis" option on the left.

  • Click on Analysis in the menu on the left.

To calculate the contributions of each mutation to the observed thermostability, click on Deconvolute measurements. When done, the results will appear in a chart. There is a lot of information in these charts, but for now, let's keep it simple. In the top chart, deselect all coloured boxes except the lm: single mutations option. This will leave only the contributions calculated from a simple linear model visible. For more details about the various models available, see box 1.

Clicking on the point in the chart will load a box with details about that mutation on the right side of the chart. Select the top-performing mutations. You can do this by dragging a box around them in the chart. Note that the selected mutations appear below the chart as Selected mutations. Click on Add to selection. This makes the selected mutations available for further use. After clicking, they end up in a box on the right of the chart. Here you can visualise them in the structure. Select a structure and make a visualisation. Review it in YASARA or PyMol.

Structure visualisations combined with detailed information about individual variants allow you to get a very thorough understanding about what might be happening in your protein.

Now, let’s build a novel round.

  • In the left bar, add a novel round by clicking on the Add round link.

  • Give your round a nice name and click on Create round.

Now you would like to use a different base sequence than what you used in the first round. It makes sense to pick the best performing sequence from the previous round as the starting point for the new round. Let's do that.

  • Click Select the best performing sequence from a previous Round based on a metric. Select the first round and pick the thermostability metric.

  • Hit the Save base sequence button.

Now it is time to pick the mutations that you want to test in the next round.

  • Click on Mutations in the bar on the left.

  • Open the Selections tab.

Here you can see the mutations that you selected in the previous round. These were the best performing mutations from your previous round. It makes sense to include them in this round.

  • Click on Add all to move them to the right.

Your list of mutations for the next round is not complete yet. What you would do now in a real life scenario is to use all the insights you gained by analysing the mutations you previously made to select novel ones. Which positions worked, which did not, etc. When your set of mutations is complete, you would generate novel sequences, measure them in the lab, analyse the individual contributions and build novel rounds. Iterate you way towards a great protein!

 

Box 1. Detailed descriptions of the statistical models used in ‘classic’ deconvolution.

 

Linear model: single mutations

The simplest model available is the linear model that takes into account only single mutations. This is a very straightforward model, basically just one formula in the form of:

EffectSize = a * mutation1 + b * mutation2 + c * mutation3 + ...

Given all the measurements (effect sizes) the numbers a,b,care optimised so that the final formula approaches the actual measured values best. The numbers a,b,c. are what you see in the deconvolution chart.


Glmnet: single mutations

This is basically a little smarter version of the linear model described above. What this model tries to do is the same as the linear model, but it tries to throw out all the parameters (or, mutations) that minimally affect the models performance. This model will usually consist of less mutations, and that is because it has thrown away mutations it deemed unnecessary. So, without those mutations, the model is still able to predict the actual measurements.

Same as with the linear model, the numbers a,b,c ... are what you see in the deconvolution chart.


Random forest regressor: single mutations

This is the least interpretable of the models I guess. The predictive model that tries to predict the measured values based on the mutations present in the sequences is a random forest model. This model consists of a number of 'decision trees' that are optimised to predict the measure value. A tree is basically a series of decisions, like "if mutation1 is present, but not mutation2 , predict measurement value x". The good thing about this is that interactions between mutations are inherently taken into account. The values that you see in the deconvolution chart are derived from the number of times a mutation is used in a "split" in the tree, together with the position of the split in the tree, where higher-up splits are more important.

Then there are a number of models that include only interactions, or interactions combined with single mutations. Interactions are terms in the models that basically 'group' multiple mutations as one term. So if you have a sequence where mutation1 and mutation2 are present, their interaction will be one term (e.g. mutation1+2 ). So in the case of a linear model you get something like

EffectSize = a * mutation1 + b * mutation2 + c * mutation3 + d * mutation1+2

This is an example of a model with both single mutations and interactions, but you can also do this of course with just the interaction terms.
The downside of including interactions is that you get a very large number of parameters and you tend to overfit very easily, which is why there is no linear model that includes interaction terms, but only glmnet ones.