/
3DM advanced: 3DM applied to GPCRs

3DM advanced: 3DM applied to GPCRs

Index

Before doing this exercise it is best to first do the Introduction: 3DM applied to the nuclear receptorscourse.

General

The complete course consists out of three parts:

To follow these courses, you need a Bio-Prodict account. If you do not have one yet, please start by filling out the form on the registration page. Are you eligible for an academic license? Make sure to register with your academic email address. Once you submit the registration form, you will receive a confirmation email with further instructions on how to activate your account.

For some exercises in this section, you need either PyMOL or YASARA installed which has the 3DM plugin. If you do not have YASARA or PyMOL or you are missing the 3DM functionality, please consult the Installation instructions 3DM. Make sure you have the latest version installed before you start this exercise.

In case you have any questions about this course, please get in touch with our support team via support@bio-prodict.com.

Introduction

After entering the login details on the login page, you land on the 3DM dashboard page. Here, you can find an overview of all available systems. In the 3DM COURSE tab, click on the GPCR (2016) system. This is the 3DM system we will be working with during this course. 

By default, the GPCR 3DM database shows data of the GPCR family class A sequences. There are other classes of GPCRs too which you will investigate later. Unless specifically requested otherwise, always use the GPCR class A protein family to answer the questions in this course.

Correlated mutations

Open 3DMs correlated mutation analysis tool CorNet by clicking on Correlated mutations in the left menu. You can see two big networks and a few smaller ones. Start thinking about whether the two bigger networks really are two separate networks, or if they are simply disconnected because of the fact that the default cut-off for the number of positions in the network is 25.

Question 1: What does "independent networks" mean? When are two networks independent?

When two networks are independent, the residues in one of the networks mutate in sync with each other. Residues from the other network mutate in a different rate.

You can explore this idea by making an alignment with two groups of proteins. Each group should contain functionally similar proteins, but the function differs between the groups. For instance, a group can be made with enzymes that all convert a specific substrate, and a second group that converts another substrate. The correlated mutations will likely reveal hotspots that determine the substrate specificity change.

A very important note is that correlated mutations reflect functional changes and not environmental changes. This can easily be explained, because you can not use correlated mutations to find hotspots for thermostability. You could make an alignment of two groups: one composed of proteins from thermophilic organisms and another made from non-thermophilic organisms. Then, the differences between these groups should show up as correlated mutations, right? This is not the case, because in fact these two groups cover all sequences of the complete superfamily alignment. You can investigate the differences between these groups just by comparing the amino acid contents between the groups. This concept will be discussed later.

Another important note: The problem with functional grouping is that the functional annotations of proteins is often based only on sequence similarity and can therefore be incorrect. You need to make sure that at least a high percentage of the sequences share the same function to make functional based hotspot detection work. Examples are known in which the alignment was successfully grouped simply by using keyword searches in the protein descriptions and was used to find novel enzymes. The group of professor Uwe Bornscheuer has found R-selective amine transferases (only S-selective enzymes where known) by deleting groups of proteins from the protein family that had other activities based on motifs that were associated with the different functions. This exercise was repeated with 3DM. The protein family was first grouped by giving keywords that were related to enzymes with unwanted reaction mechanism. The keywords “lyase”, for instance, was used to select enzymes that likely had the unwanted lyase activity. Then, this set of sequences was used to find a sequence motif specific for lyase activity. All sequences that had this lyase specific sequence motif were subsequently deleted from the alignment. This step ensures that all lyases that were not annotated as such still got deleted from the alignment. This procedure was repeated for all different unwanted enzyme activities. This exercise resulted in 42 sequences that are highly likely all R-selective amine transferases.

Question 2: How could you investigate if two networks truly are two independently co-evolving networks?

There are multiple ways of doing this. One easy way would be to look at the amino acids distributions in each network. If the number of different motifs/different groups in the networks differ (e.g. the positions of the first network all have only two different residue types and the positions of the other network all have five different residues), you know the two networks mutate independently. We will come back to this in question 6. Structural separation of the two networks might also provide a hint.

Question 3: Enter the keyword "specificity" in the searchbox of the literature & mutations window on the right. What do you find? Does this provide a clue about the separation of the two networks?

Yes, only one network clearly contains hotspots for specificity. This means that changes in specificity were the driving evolutionary force behind the correlated mutation behavior of this sub-network. For the other network, there is no literature in the 3DM database describing mutations effecting specificity. As this database contains huge amounts of mutation data, it is unlikely that these positions are hotspots for specificity.

Question 4: Do you think it is likely that position 92 or 93 have an important role for specificity? How would you find out?

Yes, it is likely they are also important for specificity because they are in the same network. If you change the default Keyword cut-off setting from 2 to 1, you see that more nodes are coloured light blue including position 93. This means that one reported mutation at position 93 is related to specificity. You could have a look in the 3D structure to see if they are close to the other members of this network (and/or close to the ligand binding pocket). However, the only way to really make sure is to make mutations in the lab and test the binding affinity of the ligand. In enzymes, such mutations would have an effect on the Km but not so much on the Kcat. Such mutations typically indicate a hotspot for specificity. The Km on other substrates might improve at such positions.

Correlated mutations often divide the superfamily into several groups based on sequence motifs. Sometimes this is just an evolutionary separation, for example in the case of bacterial sequences versus those of eukaryotes. However, because 3DM uses superfamily alignments containing proteins with different functions, the correlations are often a result of the changes between the different functional groups.

Question 5: What is the difference between evolutionary grouping and grouping based on functional differences?

Usually, 3DM alignments are composed of sequences from organisms from the entire tree of life. If you would make a phylogenic tree of the whole alignment and you see that it nicely overlays the tree of life, then correlated mutations can reflect the different domains (evolutionary grouping). However, multiple sequences from the same organism are generally in a 3DM alignment and these sequences can sequentially be very different. They almost always have different functions. Because the alignment holds proteins with different functions, the different groups of the corresponding phylogenetic tree are function based. Proteins with the same function form the large groups/clusters in the tree, instead of the different domains or kingdoms of the tree of life. The different groups in such alignments may result in correlated mutations that lead to functional changes during evolution. Note that you can have both types of differentiation in a single alignment.

Click on the main nodes of the two big networks (position 88 and 74). Clicking on a node gives information about this position, such as the amino acid distributions.

Question 6: How many different residues do you see in the distribution pie if you click on position 88? And what about 74?

The pie chart of position 74 shows that two residues (W and F) are very abundant at this position. Note that the * in the pie chart indicates all residues that are less than 5% present in the alignment. The variability at position 88 is much higher: 8 different residue types are available in more than 5% of the sequences. The two networks clearly show different mutation rates. To get the exact distributions, you can click on OPEN POSITION INFO PAGE below the pie chart.

Question 7: Click on the other members of each network. Do you see a pattern? What does this tell you about the separate co-evolving networks?

There is indeed a pattern: most positions in the network of 74 have only two types of residues. The positions of the other network all have many more. This clearly shows that these two networks are independently co-evolving.

You can investigate the correlation between residues by selecting them simultaneously. This can be done by making a selection box around the positions, or by selecting them while holding the Ctrl or command key.

Question 8: Select position 105 and 197 simultaneously. Looking at the amino acid distribution pie chart, do you think the residues on these positions might form a salt-bridge? You can visualize this by clicking on SELECTED NODES in the Visualize alignment position in structure section below the pie chart.

The amino acid distribution of these two positions indicates that it is indeed possible that they form a salt-bridge, because in many sequences from the alignment they have an opposite charge which indicates that they might structurally be close to each other and be able to form a salt-bridge. On the other hand, there are quite some sequences that have E-E. In those cases, no salt-bridge would be formed.

Again, enter the keyword “specificity” in the keyword search box of the Literature & Mutations section. Now, look at the enrichment-score (overall enrichment) below the search box.

Question 9: What does this value tell you? Is this a high/meaningful value? Is it what you expect after seeing the network or do you expect something to be going wrong? You hover over the information icon if you do not know what the enrichment-score (E-score) is.

The E-score of 1.17 is quite low. An E-score of 1 means that there is no enrichment for the specificity related mutations in the network (just as likely to find specificity mutations at a position in the network as a position outside the network). An enrichment of 3 means that there are three times more mutations reported inside the network compared to outside the network. However, there is nothing wrong, because not all positions of the complete network are hotspots for specificity. Only one small sub-network seems to be specificity related, thus the E-score for the complete network is low.

In the Literature & Mutations section you can change the Keyword cut-off. With the default value of 2, 3DM will only show positions in the network for which at least two different mutations have been published that had an effect on specificity. This is to ensure no false positives are included. In protein families for which a lot of mutations are available, it is usually smart to leave this at at least two. If there is not so much data available you can change it to 1 to get reliable E-scores.

Question 10: Now select all the nodes (both the green and the light blue ones) of the sub-network with the light blue position (e.g. specificity position). What is the enrichment score of this selection?

The E-score for the sub-network is 3.4. An E-score of 3.4 indicates that the chance that the positions in this sub-network are specificity hotspots is 3.4 times higher than random. Or better, the chance they are specificity hotspots is 3.4 times higher than other positions. Because these positions are important for specificity, there has been an evolutionary pressure at these positions that has resulted in the alignment of this correlated mutational behavior. This is why correlated mutations are often referred to as "co-evolution".

Different protein features (e.g. activity, dimerisation) can be the underlying cause of an evolutionary pressure that has resulted in correlated mutations. In 3DM we pre-calculate the E-scores of the different protein features at different correlated mutation cut-offs. The resulting plot shows quickly which evolutionary pressure is behind a correlated mutation network. At high correlated mutation cut-offs specificity mutations are clearly over-represented in the GPCR CorNet network (Figure 1).

image-20241023-053454.png
Figure 1. E-scores of different keywords.

Question 11: With the keyword cut-off set to 1, you can see that there is only one article describing a mutation that has effects on specificity at position 93. Do think that it is likely to be a false positive or is it more likely that this really is a specificity related position? And what about position 248?

As position 93 is part of the sub-network that has a high E-score for specificity, it is likely that 93 also is a hotspot for specificity. For position 248, on the other hand, this is less clear. Note that 3DM "just" gives you the data. To make sure, always read the article, especially if there is only one article found for a certain position.

With the keyword cut-off set back to the default value of 2, select the other big network (with position 74 in its center) and color the nodes blue with the node coloring option that appears on the right when you select positions in the window. Select positions 247 and 248 and make them yellow. Select positions 42 and 32 and make them Magenta. Finally, select positions 60, 84, and 98 and make them red. Click on ALL NODES in the Visualize alignment position in structure menu. Keep the default settings on the Visualize page and create your scene by clicking on VISUALIZE. Open the downloaded file in YASARA or PyMol.

Question 12: Do the nodes of the specificity related network cluster together in the structure? How about the other big network from the correlated mutation (coloured in blue) ?

The specificity related network does cluster together in the structure. The other big network does not cluster structurally, and the function remains unclear based on structural data.

Question 13: Do you think that position 60 really belongs to the specificity related network?

Considering the 3D location (near the ligand binding pocket and close to the specificity related network) position 60 probably indeed belongs to the specificity related network.

Question 14: What about the small separate network 247 and 248 (in yellow) for which we found one article describing effects on specificity? Might they actually belong to the specificity related network? And what about the other two small networks (e.g 32-42 and 153-251). Do you think they are related to the bigger network too?

Just like position 60, positions 247 and 248 are located close to the light blue positions of the specificity related network and are therefore likely specificity related positions. The roles of the other residues are not clear based on this exercise.

In case you are using YASARA, load the 1F88A co-crystalized compound using via 3DM → Structures → Load structure from 3DM.

If you use Pymol, start the 3DM plugin using plugin → initialize plugin system. Then, plugin → legacy plugins → 3DM. A new window should start. Use your 3DM credentials to login and click in this window on 3DM menu → structures → load structures from 3DM, select 1F88A and only select Download drug.

Question 16: Do you think it makes sense to find the ligand there?

There are several compounds crystalized in 1F88A, one of which is located inside the GPCR. This ligand is located exactly in between the specificity related positions. This is a good example of how to use 3DM. With 3DM you can look for correlations between different data types. In this case, there is a good correlation between the CorNet network and the structural location. If you find something like this, your alarm bells should start ringing. In this case, this screams specificity. If correlated mutations are located around the ligand (or substrate in case of enzymes) they are almost always specificity hotspots. Other examples of correlation between different data are: 

  1. High B-factor & High RMSD → Hotspots for thermostability

  2. Motifs and sequence groups. For example: A sequence motif that is highly conserved in a subset composed of sequences with the same function and a different, again conserved, motif is found in a subgroup of sequences with another function. Often these positions determine the functional swap between the two subgroups (e.g. specificity hotspots).

  3. Ligand contacts and activity data of ligands. Some positions are bound only by inhibiting- or activating ligands → hotspots for inhibition and/or activation. A nice example was shown in the NR assignment.

Families and numbering schemes

There are different families in the GPCR database. An alignment has been generated for each distinct GPCR class. These alignments are structure based and all have their own structural conserved core. Because these cores differ in length, the different alignments have their own numbering schemes. Via the Alignment menu at the top of the 3DM website you can switch between the different families.

In the left menu, click on System → System info to open a nice overview of the different family alignments.

Question 17: How many aligned sequences do you see in the class A family?

38,324 aligned sequences.

Question 18: How many sequences are in the other alignments?

  • 45,796 for the full alignment (Superfamily 3DM)

  • 2,940 for class B

  • 3,978 for class C

  • 4,008 for class F

Note that these numbers might be outdated when the database has been updated.

Question 19: Group A is by far the largest group of GPCRs. Superfamily 3DM is a combination of all the classes. Why do you think the superfamily 3DM is not as big as all groups in total?

There can be different reasons for this. 3DM sometimes deletes sequences during the alignment generation procedure if they ca not reliably be aligned. It might be that some sequences can reliably be aligned in the smaller family alignments because they have a bigger core. This means that there are more residues that can be used to make the alignment, resulting in more reliably aligned sequences. In the complete superfamily 3DM alignment only the trans-membrane helices are aligned (as those are the only parts that are structurally conserved between the different families). This means that there can be more uncertainty on the quality of the alignments. Another reason might be that some sequences are aligned in more then one class.

Now using the Alignment scroll down menu select Superfamily 3DM and look at where the conserved residues are (which 3D numbers) using the Alignment statistics page. Now do the same for the group A sequences (this you can do best in another tab so you can switch between the tabs).

Question 20: What is the first residue that is conserved > 95% in the full alignment? What is it's 3D number in the full alignment? And in the class A alignment? Can you explain this?

The first conserved residue (>95%) is Cys at position 68 in the full 3DM alignment. The same conserved Cys has number 81 in the class A family. There are less structural conserved positions if a structural conserved core is made for all the GPCRs. There are more positions structural conserved if only Class A GPCRs are used to determine the structural conserved core.

Select from the Numbering scheme scroll-down menu at the top of 3DM the gpcr_A numbering in both alignments.

Question 21: Compare the 3D numbers of the two conserved Cys residues again. What do you see? What is the number when using the class B numbering scheme?

They now have the same number using the class A numbering scheme: 3.25a. In the class B alignment this residue has number 3.29b. Obviously, there are more structural conserved positions before this residue in the class B alignment compared to the class A alignment. When this common numbering scheme was designed, the makers did not realise that they could have made a numbering scheme that could be applied to all classes. As everyone uses this numbering scheme now, it is too late to synchronise the numbering schemes of the different families.

Realise that between different 3D alignments, the 3D numbers can not be compared (you can noy select gprc_a numbers if you have the GPCR class B alignment selected). Because the size of the structural conserved cores differ between the different families, different numbering schemes have been developed for the different subclasses. Each of these numbering schemes indicates for which subclass the numbering is. Although it would have been even better if they had used a common numbering for the transmembrane parts that are shared by all classes, such that the conserved Cys would have had the same number in all classes. Although this is not the case, the commonly used numbering schemes designed for the GPCRs is a luxury that is only available in the GPCR protein family and has not been developed for any other protein family (as far as we know).

Making a structural model with 3DM

3DM contains an automated homology-modeling module. It uses the 3DM alignment between a sequence and a structure to generate models. At the protein detail page of a sequence you can find a MODELS tab. Here you can select a suitable template to generate a homology model of the protein. 3DM already makes a pre-selection of suitable templates based on sequence similarity. You can choose any of the pre-selected templates. Which of these templates is best is for you to decide. This decision should be based on what you want to do with your model. For instance, enzymes can be in the open or closed conformation. If you want your model in the closed conformation, then start with a template that is also in the closed conformation. Other factors can be e.g. with or without ligand, with an activating or inhibiting ligand bound in the ligand binding pocket, etc. So, always first investigate the different starting templates to see which one fits best to your needs.

Hirschsprung disease is a disease that is caused by mutations in a GPCR.

Question 22: Using the Literature option of the Search module, try to find a protein that is known to be involved in Hirschsprung disease.

 

If you have searched for “Hirschsprung disease” in the titles of the articles, you have found a list of articles that. The first hit is about P24530.

Question 23: Go to the protein detail page of P24530. In 3DM you can simply click on the protein name. On this page you see MODELS tab where you can create a structural model of a protein. Explain the different numbering options that are available under Download models.

  • Alignment numbering: This option uses the 3D numbers. Note that the active numbering scheme is used. This can also be a user defined numbering scheme.

  • Residue numbering: This option uses the numbering of the sequence that is being modeled. 

  • Template numbering: This option uses the numbers of the template PDB structure.

Select a template and make a model. Make sure that you have the GPCR class A alignment selected. Use YASARA or PyMol to open it. Creating the model may take a few minutes. If 3DM was not able to model parts of the sequence, these parts will be missing from your model. These missing positions are indicated as purple dots in YASARA and as lines in PyMol. Usually, this is because the alignment between the sequence and the template can not reliably be made (often the parts outside the core) due to very low sequence similarities. Realise that those parts can not be modeled reliably using the selected template, because the sequence similarity is so low that the two proteins will likely fold differently in those parts. Sometimes it helps to make a model choosing a different template (if available), but usually this means that those parts can simply not be modeled reliably.

Question 24: Find positions where mutations can cause Hirschsprung disease using the Literature hotspots option of the 3DM menu in YASARA or PyMol. How many positions can you find?

If you search with the keyword "Hirschsprung" with the literature cut-off set on 2, you will find that at 7 positions the literate reports Hirschsprung related mutations. In PyMol you can use the "Show scene content details" to get a list of the search results. Note that using "Hirschsprung disease" as keyword does not change anything. However, sometimes you have to think about which keyword exactly to use to find what you need. For instance, if you search for mutations that resulted in a more active enzyme, you can use "increase activity" but also "high activity". The search works such that the words do not necessary have to be adjacent. For example a sentence like: "the A345Q mutant resulted in a enzyme of which the activity was increased", will also be found with the keyword "increase activity". If you put the literature cut-off on 1, you will see that there are many more positions mentioned in the literature in relation to Hirschsprung disease. 

Visualize data in structures

To visualize data in a structure you can use the Visualize tool in 3DM. Here you can select multiple structures that are all superimposed. You can select one or more of the templates that are used to generate the alignment if you select the Show only templates option from the quick filter menu. If you activate the advanced setting, a PDB type can be selected. The Aligned setting gives any of the other structures that could be superimposed on the templates. The Non-aligned setting contains structures that were detected by BLAST but which 3DM could not superimpose on the templates. These structures usually do not belong to the superfamily, or are structurally too distantly related. Models that you have generated with 3DM can be selected from the Model PDB type setting. The name of a model indicates which protein is modeled, and between brackets you see which structure was used to make the model. In the Compounds section you can select any co-crystalized compound. These are also superimposed, so usually they will be in the pocket of the structures. You can select any combination of protein structure and ligand, so you can insert any ligand into any structure.

Question 25: Select a couple of structures and ligands. Click on, ADD POSITIONS in the Positions section, go to the CORRELATED MUTATIONS tab and select the top five correlated mutations. From the CONVERSATION tab, select all residues that are >90% conserved. From the CONTACTS tab, select all positions make at least 100 contacts with a ligand (note the different contact types). Click on VISUALIZE and open the resulting file. Did you get back what you expected?

To create this picture the first three templates were selected, the top five correlating positions, the first three ligands and all positions that are >90% conserved. You can see that the correlated mutation nicely surround the ligands located inside the GPCRs.

Navigate to Alignment statistics in 3DM. Scroll down to the Keyword mutations plot. Here, you can enter a keyword to see how many mutations related to this specific keyword are found for a given position.

Make sure you have the GPCR class A family selected under Alignment at the top of the page and consult the Human variation/Position histogram. Compare this histogram with the Amino acid conservation histogram using the compare with option above the histogram. Set the cut-off for conservation on 50% and the SNP cut-off on 0,3.

Question 26: Do you find more SNPs at Conserved regions? Does this finding make any sense?

SNPs are less common at conserved positions. With these cut-offs you see that the enrichment is 0,55 which means that SNPs are found less at conserved positions than at other positions. Note that it is important to set the cut-offs in this exercise because otherwise too many positions fall within the cut-offs to calculate a meaningful enrichment score.

Question 27: Now compare the Human variation/Position histogram with the histogram of Ligand contacts. Do you see a common SNPs at a ligand contacting position?

Not really. Maybe position 84. As SNPs are from genome sequencing projects of healthy humans, you don't expect many SNPs at important positions as they might be damaging to the functioning of the GPCR, which in turn can result in a disease. This information can also be used the other way around: at positions where you find many SNPs it is likely that mutations are allowed. This means that they might be good candidates if you are interesting in changing the thermostability of a protein, because you can be quite sure that mutations will not harm its functioning.

Question 28: Reset the SNP graph by clicking Reset all. Move the cut-off to 1% using the slider bar and export the result to YASARA or PyMol using the Visualize structure button left to the slider bar. Which positions were exported to the structure? Which of these very common SNPs is located near the ligand binding pocket?

There are 7 positions: 29, 85, 130, 134, 145, 195, and 205. Position 85 is a common SNP closest to the ligand binding pocket.

Note that the  button can be found at many places in 3DM for direct visualization of data in.

There are other export buttons too, such as the Export to hotspot basket button. Let's see how this works. Put the cut-off of the SNP graph again on 1 and open the hotspot basket tool by opening the Hotspots menu in the top right corner of 3DM. At the SNP histogram this sign will appear (possibly with a different numnber): . This icon is for the insertion of selected positions in a hotspot basket. Click this button to upload the SNP positions into a new hotspot basket, give the basket a name, and save the basket.

Question 29: Go back to YASARA or PyMol and use the 3DM → Hotspot Baskets → Load a Hotspot Basket to open the basket that you just created. In PyMol you need to define which hotspots you want to visualize from the basket. Simply select all of them. Is the result the same as you had before?

The result should be the same.

A hotspot basket is nothing more than a selection of alignment positions. You can generate hotspots for different protein features (e.g. correlated mutations, specificity hotspots, thermostability hotspots, etc.) and those can be found using 3DM. At a later stage you can open the basket in different 3DM tools. Let's see how this works.

Question 30: Go to the Correlated mutations page in 3DM and use the keyword “specificity” in the Literature & Mutation window with a keyword cut-off of 2. Make sure you have the GPCR class A selected. Open the hotspot basket tool and select a NEW hotspot basket. Select the 9 positions of the sub-network that contains the mutations reported to effect specificity upon mutation. Click on the Add to hotspots button at the left top corner of the network visualisation to store these specificity related mutations in a NEW basket. Give it a logical name and save it. We will use it later on.

If all OK, you should have these positions in your basket: 60 84 88 92 93 98 109 218 245.

A good trick for increasing the thermostability of proteins is to make mutations at flexible regions. Go to the Alignment Statistics module of 3DM. Here you will find two histograms that display flexible positions, the RMSD (a measure for how tight the structures superimposed) and the average B-factor. The B-factor is a measure for how sharp the X-ray diffraction was for the atoms of a residue. A very sharp X-ray indicates that the amino acid is tightly positioned in the structure. The average B-factor is the average over all amino acids from all structures at an alignment position. Often these two plots show a similar pattern. If at a position both plots are high usually this is a good hotspot for changing thermostability.

Question 31:

  • A: Make sure you have the hotspot basket window open. Increase the B-factor cut-off using the slider bar until you have <20 positions selected and store these in a new basket. Open YASARA or PyMol and open this basket in any GPCR you prefer. Does the result make any sense?

  • B: Do you think these are positions that are likely good candidates to improve thermostability of GPCRs, or is there an artifact in the B-factor method in GPCRs? Can you think of a way to solve this?

  • C: Use the Literature hotspot option with the keyword “thermostability” in YASARA or PyMol to find mutations from the literature that have an effect on thermostability. Do these overlap with the B-factor hotspots?

  • D: Do you think this is a better way of finding hotspots for thermostability in this protein family?

  • A: Let’s see!

  • B: The B-factor method does not seem to be a good method for the GPCRs. Likely this is because the transmembrane helices are so tightly bound that automatically the rest of the protein is much more flexible. It is known that making mutations in the transmembrane helices can make GPCRs more stable. This is nicely demonstrated by the "thermostability" literature search.

  • C: Clearly the two methods do not overlap, indicating that the B-factor method does not apply to GPCRs.

  • D: It is in this protein family probably much better to first try the positions that are already published to have an effect on the stability. The best way to do this is to find amino acids described in literature that are known to stabilise GPCRs. If the stabilising residue is not in your target GPCR, it might be a good idea to try that residue. If your target sequence has a different residue than the consensus, it might be smart to try the consensus residue as there are several papers demonstrating that making the consensus often has a beneficial effect on the stability. Note that if it is easy to screen your protein for thermostability, randomising each hotspot might be smarter. There are many other tricks too. Here are some examples: 

  1. Introducing prolines at positions where a proline is common and your target does not have a proline.

  2. Inserting negative charges at the N side of a helix (if there is not any), or a positive residue at the C-side. This is known as helix capping because the N-terminus of a helix is slightly positive and the C-terminus is slightly negative.

  3. Creating salt-bridges by inserting positive or negative residues at positions on the outside of the protein. Always check if the residue that you want to use is actually common in the alignment at that position.

  4. Replacing glycines where, according to the alignment, glycines can be replaced by something else.

Panel design

The idea of the panel design tool is to select sequences from the alignment such that the selected sequences are maximally distributed over the superfamily. This is done in two steps: First, the sequences of the superfamily are grouped. This can simply be based on sequence similarity (similar sequences are within the same group). Groups can also be based on sequence motifs found at user selected positions. The last option is used to group sequences based on a protein feature. For instance, the user can pick positions important for specificity. The idea is that sequences that have the exact same residues (the same motif) at those positions they are likely to have the same specificity. Both methods can be combined. In the second step a user defined number of sequences (usually one or two) are selected from each group. The selection step contains all kinds of options to maximise the chance that these proteins are likely to express. Let's see how it works. 

Select the Panel design option in 3DM. First, we will divide the super-family based on sequence motifs. Because we want maximise the specificity range in the panel, we will use the "specificity hotspot" basket you have generated in question 30. This way, all sequences with the same motif at our specificity hotspots (thus with likely the same specificity) will be in one group. Use the Add hotspots option to select the hotspot basket that you made in question 30.

Once they are selected, click on SHOW GROUPS.

Question 32: Based on these criteria, into how many protein groups is the alignment divided?

The alignment is divided into 636 protein groups.

Question 33: By default, the tool only selects groups with at least two members and deletes sequences that have a gap in one of the selected positions. What is the difference between "unique motifs" and "total protein groups"?

Motifs for which only one sequence is available are not put in a group. The hypothesis behind this is that those might be sequence errors. Also, sequences that have a gap at a selected position are not grouped. If you put Minimum protein group size on 1 and deselect Hide motifs with gaps, the number of unique motifs is equal to the number of groups.

You can also divide the alignment phylogenetically. This separation can be combined to the motif grouping. In the box under Phylogenetic groups you can give a number. The superfamily will then be divided into this number of groups purely based on phylogenetic distances. Reload the panel design tool (refresh the page), reload the hotspot basket, type 10 in the Phylogenetic groups size box and click again on SHOW GROUPS.

Question 34: How many groups are added by this exercise?

There are now 733 protein groups in total, so 97 were added.

After defining the groups, you have to select from each group the sequences that you would like to have in the panel. There are several selection options that can be used to pick sequences from each group. First, you select the number of sequences per group with the Proteins per group option in the expanded Protein selection panel. Then, several options can be used to determine which sequences are selected. These options are there to maximize the chance that selected sequences can be expressed. If literature is available for a sequence for instance, or if there is a structure available, the chance that this protein can be expressed is higher since someone else has done it before. The different selection options can be combined using Must have or Prefer options. Must have will result in a smaller number of groups, because not all groups will have a structure available. Usually, it is a good idea to start with a larger number of groups that you want to have in your panel and delete some groups with these different options. Now play with the different options and see if you understand the result. To see the effect that the different options have, you can click on SELECT PROTEINS. Make a panel of approximately 96 sequences for which you require SWISS-PROT proteins, literature- and structure data.

Note that you can make the total panel smaller by clicking the Panel size option. This will remove sequences by ensuring maximum diversity in the remaining panel sequences.

Sometimes you need to make a panel of a subset of sequences. For instance, say you want to find the most active enzyme with a certain specificity. In this case you should first make a subset that contains only sequences that have this specificity. This sounds simple, but due to wrong notation of proteins it is actually quite tricky. The best way to do this is to first do a keyword search to find enzymes that are likely to have the correct specificity and make a subset of this set of sequences. Use this subset to make a motif with 4 to 7 amino acids that is specific for this subset. It is best to use the Subset specific residues plot (consult the OAH questions for more information). Make a new subset that contains all sequences that have this motif. With this approach you will not only find sequences with the correct specificity that are annotated as hypothetical protein, but you will also delete the sequences that are wrongly annotated. This approach does not make sure you have all sequences with the correct specificity, but it does maximise the chance that the ones that are in your subset all indeed have the correct specificity.

Patent analysis tool

3DM contains an advanced tool for the analysis of patent data for complete protein families. How the tool works is explained in the Patent analysis presentation. Go through the slides and then try out the tool yourself in the GPCR protein family via the Patents option in the 3DM menu. 

3DM also contains a tool for analysing patent data in a target sequence. You upload a target sequence to 3DM via System → Custom proteins → Add custom protein.

Designing mutation experiments with the de-convolution tool

The de-convolution tool is a protein-engineering tool that is made for designing mutation experiments for a specific target sequence. With this tool, many positions can be targeted at once without having to screen a large number of clones. The results of the measurements can be uploaded to the system and the system will calculate which mutations have the biggest effect on the feature(s) (e.g. activity or thermostability). The positive mutations can be combined to make the best performing sequence that can subsequently be used for a next round of engineering. Let's see how the tool works.

Go to Genomes (genomes.bio-prodict.nl) and login with these details:

  • Login: course_demo@bio-prodict.nl

  • Password: K7XlQAK23QvE8Bl%1i^L5#Ax460lC3

Once logged in you will see that we have set-up the de-convolution tool for one GPCR protein (P07550). At the right top you see an Erlenmeyer flask.

  • Click on the Erlenmeyer to open the tool.

  • Click on Create new project.

  • Give the project a name.

  • Select the target protein as the template of this protein engineering experiment.

  • Create the project

  • Click on this project once it is available (sometimes the page needs a refresh to make it visible).

  • Click on Add round to start designing the first round of mutations.

  • Give it the name Round 1 and save it.

  • Select Use the round 0 starting sequence (wild-type protein). Note that if you already have uploaded experiment data in a previous round, you can use a sequence from that round.

  • Click on Save base sequence.

  • Now we can select mutations to introduce to the sequence. Click on Mutations on the left.

  • We will use the hotspot basket generated previously. Select the Hotspots tab and select the thermostable basket.

  • Click on Add all with most frequent as variant type. This option ensures that each hotspot is mutated to the most common amino acid according to the alignment. If the target sequence already has the most common amino acid it selects the second most common residue.

  • Click Save.

Note that the left box is a tool for the selection of mutations. Once you have the desired mutations you can drag them into the box on the right. Now the system automatically generated the mutations for all of the hotspots based on amino acid occurrences, but you can also manually add mutations.

  • Select edit.

  • Select the "Manual" tab in the left box, type “V117C” and hit enter. Click Add all.

  • Save the mutations by clicking Save.

  • Click on Sequences on the left side of the page.

  • At Maximum # of mutations per sequence enter 4 and select Fill up each sequence to contain maximum # of mutations. This will ensure that all sequences will have 4 mutations. If you do not use this option, sequences can have single, double, and triple mutations. Be sure to check the option Create demo measurement data.

  • Select 96 sequences to generate and set the minimum number of observations to 2.

  • Click on Convolute mutations.

This convolution step calculates the best sequences to measure. The tool maximally distributes the mutations over the 96 sequences. In real life the 96 sequences are ordered at a gene synthesis company and the thermostability of these 96 clones are then measured in the lab. These measurements must be stored in an excel file which can be uploaded to the system. You can download the template of this file at the bottom of the sequence list (export CSV). In this file, you should enter the measured values and then upload the file via the "measurements" option on the left.

Because you are doing the course and we, of course, can not ask you to make the measurements in the lab we generate random measurement values for your 96 constructs and you can right away jump to the "analysis" option on the left.

  • Click on Analysis in the menu on the left.

To calculate the contributions of each mutation to the observed thermostability, click on Deconvolute measurements. When done, the results will appear in a chart. There is a lot of information in these charts, but for now, let's keep it simple. In the top chart, deselect all coloured boxes except the lm: single mutations option. This will leave only the contributions calculated from a simple linear model visible. For more details about the various models available, see box 1.

Question 35: What is the best performing mutation? Are there mutations that have a destabilising effect?

The 'measured data' is randomly generated, so the best and worst performing mutations will be slightly different for everybody.

Clicking on the point in the chart will load a box with details about that mutation on the right side of the chart. Select the top-performing mutations. You can do this by dragging a box around them in the chart. Note that the selected mutations appear below the chart as Selected mutations. Click on Add to selection. This makes the selected mutations available for further use. After clicking, they end up in a box on the right of the chart. Here you can visualise them in the structure. Select a structure and make a visualisation. Review it in YASARA or PyMol.

Structure visualisations combined with detailed information about individual variants allow you to get a very thorough understanding about what might be happening in your protein.

Now, let’s build a novel round.

  • In the left bar, add a novel round by clicking on the Add round link.

  • Give your round a nice name and click on Create round.

Now you would like to use a different base sequence than what you used in the first round. It makes sense to pick the best performing sequence from the previous round as the starting point for the new round. Let's do that.

  • Click Select the best performing sequence from a previous Round based on a metric. Select the first round and pick the thermostability metric.

  • Hit the Save base sequence button.

Now it is time to pick the mutations that you want to test in the next round.

  • Click on Mutations in the bar on the left.

  • Open the Selections tab.

Here you can see the mutations that you selected in the previous round. These were the best performing mutations from your previous round. It makes sense to include them in this round.

  • Click on Add all to move them to the right.

Question 39: You might see warnings like this: 'wildtype of mutation M279V did not match round base sequence'. How can this happen?

Your base sequence will contain some mutations from previous rounds and in this case, there's a 'V' at position 279.

Your list of mutations for the next round is not complete yet. What you would do now in a real life scenario is to use all the insights you gained by analysing the mutations you previously made to select novel ones. Which positions worked, which did not, etc. When your set of mutations is complete, you would generate novel sequences, measure them in the lab, analyse the individual contributions and build novel rounds. Iterate you way towards a great protein!

 

Box 1. Detailed descriptions of the statistical models used in ‘classic’ deconvolution.

 

Linear model: single mutations

The simplest model available is the linear model that takes into account only single mutations. This is a very straightforward model, basically just one formula in the form of:

EffectSize = a * mutation1 + b * mutation2 + c * mutation3 + ...

Given all the measurements (effect sizes) the numbers a,b,care optimised so that the final formula approaches the actual measured values best. The numbers a,b,c. are what you see in the deconvolution chart.


Glmnet: single mutations

This is basically a little smarter version of the linear model described above. What this model tries to do is the same as the linear model, but it tries to throw out all the parameters (or, mutations) that minimally affect the models performance. This model will usually consist of less mutations, and that is because it has thrown away mutations it deemed unnecessary. So, without those mutations, the model is still able to predict the actual measurements.

Same as with the linear model, the numbers a,b,c ... are what you see in the deconvolution chart.


Random forest regressor: single mutations

This is the least interpretable of the models I guess. The predictive model that tries to predict the measured values based on the mutations present in the sequences is a random forest model. This model consists of a number of 'decision trees' that are optimised to predict the measure value. A tree is basically a series of decisions, like "if mutation1 is present, but not mutation2 , predict measurement value x". The good thing about this is that interactions between mutations are inherently taken into account. The values that you see in the deconvolution chart are derived from the number of times a mutation is used in a "split" in the tree, together with the position of the split in the tree, where higher-up splits are more important.

Then there are a number of models that include only interactions, or interactions combined with single mutations. Interactions are terms in the models that basically 'group' multiple mutations as one term. So if you have a sequence where mutation1 and mutation2 are present, their interaction will be one term (e.g. mutation1+2 ). So in the case of a linear model you get something like

EffectSize = a * mutation1 + b * mutation2 + c * mutation3 + d * mutation1+2

This is an example of a model with both single mutations and interactions, but you can also do this of course with just the interaction terms.
The downside of including interactions is that you get a very large number of parameters and you tend to overfit very easily, which is why there is no linear model that includes interaction terms, but only glmnet ones.

 

 

Related content