Login at 3DM (app3dm.bio-prodict.nl) with your 3DM account. If you don't have a 3DM account you can request one via the "get 3DM" tab. To be able to do this course you need at least a course login. After you have requested an account you can request a course login by sending an email to email@example.com
Open the GPCR database at 3DM (app3dm.bio-prodict.nl).
By default the GPCR 3DM database shows data of the GPCR family class A sequences. There are other classes of GPCRs too. You will investigate these later. Unless specifically requested otherwise, always use the GPCR class A protein family to answer the questions.
Open 3DMs correlated mutation analysis tool CorNet by clicking on 'correlated mutations' in the left menu. You can see two big networks and a couple of small ones. You wonder if the two bigger networks really are two separate networks or if they are simply disconnected because of the fact that the default cut-off for the number of positions in the network is 25. Now answer the following questions:
Correlated mutations often divide the superfamily into several groups based on sequence motifs. Sometimes this is just a evolutionary separation (e.g bacterial sequences vs eukaryotes), but because 3DM uses superfamily alignments containing proteins with different functions the correlation often are a result of the changes between these different functional groups.
Give again the keyword specificity in the search box and look at the enrichment-score (overall enrichment) below the search box.
Now select the other big network and color the nodes blue with the "node coloring" option that appears when you select positions in the window on the right. Select 247,248 and make them yellow. Select 42 and 32 and make them Magenta. Select position 60 (the only position of the sub-network that is not light bleu yet) and make it red. Click on "all nodes" in the "Visualize alignment position in structure" option. A new window will open where you can select structures to visualize the data. 3DM selects by default one structure. Just leave it as it is and click on either "visualize selection in Yasara" or "visualize selection in Pymol" depending on which program you have installed on your computer.
If you use Yasara load the 1F88A co-crystalized compound using: 3DM → Structures → Load structure from 3DM and load the 1F88A drug. If you use Pymol start the 3DM plugin using "plugin → "initialize plugin system". Then "plugin" → "legacy plugins" → "3DM". A new window should start. Use your 3DM credentials to login and click in this window on 3DM menu → structures → load structures from 3DM, select 1F88A and only select "download drug".
C: Families and numbering schemes
There are different families in the GPCR database. For each different GPCR class an alignment has been generated. These alignments are structure based and all have their own structural conserved core. Because these cores differ in length the different alignments have their own numbering schemes. Via the "Alignment" scroll down menu at the top of the 3DM website you can switch between the different families.
In the left menu, click on "System" → " System info" to open a nice overview of the different family alignments.
Now using the "alignment" scroll down menu select the superfamily 3DM and look at where the conserved residues are (which 3D numbers) using the "Alignment statistics" option from the left menu. Now do the same for the group A sequences (this you can do best in the other tab so you can switch between the tabs).
Select from the "Numbering scheme" scroll-down menu at the top of 3DM the gpcr-A numbering in both alignments.
Realize that between different 3D alignments the 3D numbers cannot be compared (do you understand why you can't select gprc_a numbers if you have the GPCR class B alignment selected?). Because the size of the structural conserved cores differ between the different families different numbering schemes have been developed for the different subclasses. Each of these numbering schemes indicates for which subclass the numbering is. Although it could have been even better if they had used a common numbering for the transmembrane parts that is shared by all classes such that the conserved Cys would have had the same number in all classes. Even though this is not the case the commonly used numbering schemes designed for the GPCRs is a luxury that is only available in the GPCR protein family and has not been developed for any other protein family (as far as we know).
D: Making a structural model with 3DM
3DM contains an automated homology-modeling module. It uses the 3DM alignment between a sequence and a structure to generate models. At the protein detail page of a sequence you can find a model tab. Here you can select a suitable template to generate a homology model of the protein. 3DM already makes a pre-selection of suitable templates based on sequence similarity. You can choose any of the pre-selected ones. Which of these templates is best is for you to decide. This decision should be based on what you want to do with your model. For instance, enzymes can be in the open or closed conformation. If you want your model in the closed conformation, then start with a template that is in the closed conformation. Other factors can be: with or without ligand, with an activating/inhibiting ligand bound in the ligand binding pocket, etc. So, always first investigate the different starting templates to see which one fits best to your needs.
Hirschsprung disease is a disease that is caused by mutations in a GPCR.
Select a template and make a model. Make sure you have the GPCR class A alignment selected. Use Yasara or Pymol to open it. Making the model may take a few minutes. If 3DM wasn't able to model parts of the sequence, these parts will be missing in your model. These missing positions are indicated as purple dots in Yasara and as lines in Pymol. Usually this is because the alignment between the sequence and the template cannot reliably be made (often the parts outside the core) due to very low sequence similarities. Realize that those parts cannot be modeled reliably using the selected template, because the sequence similarity is so low that the two proteins will likely fold differently in those parts. Sometimes it helps making a model choosing a different template (if available), but usually this means that those parts can simply not be modeled reliably.
If 3DM wasn't able to model parts of the sequence, you will see purple dots in Yasara. Usually this is because the alignment between the sequence and the template cannot reliably be made (often the parts outside the core) due to very low sequence similarities. Realize that those parts cannot be modeled using this template because if the sequence similarity is so low the two proteins will likely fold differently in those parts. Sometimes it helps making a model choosing a different template (if available), but usually this means that those parts can simply not be modeled reliably.
E: Visualize data in structures
To visualize data in a structure you can use the "visualize" option. Here you can select multiple structures. All structures are superimposed. You can select one or more of the templates used to generate the alignment from, when "Show only templates" is selected from the quick filter menu. When activating the advanced setting, a PDB type can be selected. The "aligned" setting gives any of the other structures that could be superimposed on the templates can be selected. The "non-aligned" setting contains structures that were detected by BLAST but 3DM could not superimpose them on the templates. These structures usually are not belonging to the superfamily or are structurally too distantly related. Models you have generated with 3DM can be selected from the "model" setting. The name of a model indicates which protein is modeled and between brackets you see which structure was used to make the model. Do you see the one you made? From the "compounds" menu you can select any co-crystalized compound. These are also superimposed, so usually they will be in the pocket of the structures. You can select any combination of protein structure and ligand, so you can insert any ligand into any structure.
Click on "alignment statistics" in 3DM . Make sure you have the GPCR class A family selected and consult the "Human variation/Position" histogram. You can compare this histogram with other data types using the "compare with" option above the histogram. Choose here "Amino acid conservcation". Set the cut-off for conservation on 50% and the SNP cut-off on 0,3.
There are other export buttons too, such as the "export to hotspot basket" button. Let's see how this works. Put the cut-off of the SNP graph again on 1 and open the hotspot basket tool by clicking on "HOTSPOTS" at the right corner of 3DM. At the SNP histogram this sign will appear: . This icon is for the insertion of selected positions in a hotspot basket. Then upload the 7 SNP positions into a new hotspot basket by clicking on the icon, give the basket a name, and save the basket.
A hotspot basket is nothing more than a selection of alignment positions. You can generate hotspots for different protein features (e.g. correlated mutations, specificity hotspots, thermostability hotspots, etc) and those can be found using 3DM. At later stage you can open the basket in different 3DM tools. Let's see how this works.
A good trick for increasing thermostability of proteins is to make mutations at flexible regions. Go to the "Alignment Statistics" module of 3DM. Here you will find two histograms that display flexible positions. The RMSD (a measure for how tight the structures superimposed) and the average B-factor. The B-factor is a measure for how sharp the X-ray diffraction was for the atoms of a residue. A very sharp X-ray indicates that the amino acid is tightly positioned in the structure. The average B-factor is the average over all amino acids from all structures at an alignment position. Often these two plots show a similar pattern. If at a position both plots are high usually this is a good hotspot for changing thermostability.
G: Panel design
The idea of the panel design tool is to select sequences from the alignment such that the selected sequences are maximally distributed over the superfamily. This is done in two steps: First the sequences of the superfamily are grouped. This can simply be based on sequence similarity (similar sequences are within the same group), but groups can also be based on sequence motifs found at user selected positions. The last option is used to group sequences based on a protein feature. For instance, the user can pick positions important for specificity. The idea is that sequences that have the exact same residues (the same motif) at those positions they are likely to have the same specificity. Both methods can be combined. In the second step a user defined number of sequences (usually one or two) are selected from each group. The selection step contains all kinds of options to maximize the chance that these proteins are likely to express. Lets see how it works.
Select the "panel design" option in 3DM. First we will divide the super-family based on sequence motifs. Because we want maximize the specificity range in the panel, we will use the "specificity hotspot" basket you have generated in question 31. This way all sequences with the same motif at our specificity hotspots (thus will likely have the same specificity) will be in one group. Use the "add hotspots" button to select the hotspot basket you made in Q30 (it should contain 9 positions).
Once they are selected click on "show groups".
You can also divide the alignment phylogenetically. This separation can be combined to the motif grouping. In the box under "phylogenetic groups" you can give a number. The superfamily will then be divided into this number of groups purely based on phylogenetic distances. Reload the panel design tool (refresh the page) and type 10 in the box and click again on "show groups".
After defining the groups you have to select from each group the sequences you want in the panel. There are several selection options that can be used to pick sequences from each group. First you select to number of sequences per group with the "proteins per group" option. Then several options can be used to determine which sequences are selected. These options are there to maximize the chance that selected sequences can be expressed. If there is literature available for a sequence, for instance, or if there is a structure available, then the chance that this protein can be expressed is higher since someone else has done it before. The different selection options can be combined using a "must have" or a "prefer" options. "must have" will result in a smaller number of groups, because not all groups will have a structure available, for instance. Usually it is a good idea to start with a larger number of groups that you want to have in your panel and delete some groups with these different options. Now play with the different options and see if you understand the result. To see the effect that the different options have you have to click on "select proteins". Any surprises? Can you make a panel of approximately 96 sequences for which you require swiss-prott, literature- and structure available.
H: Literature on thermostability
Select "Alignment statistics" on the left of 3DM and scroll down to the "keyword mutation" histogram. Use the keyword "thermostable". You will find 24 positions that are related to this keyword. Open the hotspot basket tool and insert these 24 positions in a new hotspot basket, give it the name "thermostable" and save it. You will use it later.
At position 30 there is one thermostable mutation found in the literature. You can click on the bar at position 30 in the histogram. This will link to a page showing the corresponding paper. From here you can download the paper if you have access. If you don't have access you can also find the paper here.
The R124Y mutation is in the human Endothelin type B receptor (EDNRB_HUMAN).
3DM contains an advanced tool for the analysis of patent data for complete protein families. How the tool works is explained in the Patent analysis presentation. Just go through the slides and then try out the tool yourself in the GPCR protein family via the "patents" option in the 3DM menu.
3DM also contains a tool for analyzing patent data in a target sequence. You upload a target sequence to 3DM via system → custom proteins → add custom protein
The de-convolution tool is a protein-engineering tool that is made for designing mutation experiments for a specific target sequence. With this tool many positions can be targeted at once without having to screen a large number of clones. The results of the measurements can be uploaded to the system and the system will calculate which mutations have the biggest effect on the feature(s) (e.g. activity or thermostability). The positive mutations can be combined to make the best performing sequence that can subsequently be used for a next round of engineering. Let's see how the tool works.
Go to Genomes (genomes.bio-prodict.nl) and login with these details:
- Login: firstname.lastname@example.org
- Password: K7XlQAK23QvE8Bl%1i^L5#Ax460lC3
Once logged in you will see that we have set-up the de-convolution tool for one GPCR protein (P07550). At the right top you see an Erlenmeyer flask.
- Click on the Erlenmeyer to open the tool
- Click on "create new project"
- Give the project a name
- Select the target protein as the template of this protein engineering experiment
- Create the project
- Click on this project once it is available (sometimes the page needs a refresh to make it visible)
- Click on "add round" to start designing the first round of mutations
- Give it the name "round 1" and save it
- Select "use the round 0 starting sequence (Note that if you already have uploaded experiment data in a previous round, you can use a sequence from that round)
- Click on "save base sequence"
- Now we can select mutations to introduce to the sequence. Click on "Mutations" on the left
- We will use the hotspot basket generated previously. Select the "hotspots" tab and select the "thermostable" basket.
- Click on "select all" to select all positions and select "Add all with most frequent as variant type" This option ensures that each hotspot is mutated to the most common amino acid according to the alignment. If the target sequence already has the most common amino acid it selects the second most common residue.
Note that the left box is a tool for the selection of mutations. Once you have the desired mutations you can drag them into the box on the right. Now the system automatically generated the mutations for all of the hotspots based on amino acid occurrences, but you can also manually add mutations.
- Select the "Manual" tab in the left box and type V117C. To add this mutation to the experiment, simply drag this mutation into the right box.
- Save the mutations by clicking "save" in the right box.
- Click on "Sequences" on the left.
- At "Maximum # of mutations per sequence" select 4 and select "fill up each sequence to contain maximum # of mutations". This will ensure that all sequences will have 4 mutations. If you don't use this option then sequences can have single, double, and triple mutations. Be sure to check the option “create demo measurement data”.
- Select 96 sequences to generate and set the minimum number of observations on 2 and click on "convolute mutations".
This convolution step calculates the best sequences to measure. The tool maximally distributes the mutations over the 96 sequences. In real life the 96 sequences are ordered at a gene synthesis company and the thermostability of these 96 clones are then measured in the lab. These measurements must be stored in an excel file which can be uploaded to the system. You can download the template of this file at the bottom of the sequence list (export CSV). In this file you should enter the measured values and then upload the file via the "measurements" option on the left.
Because you are doing the course and we, of course, cannot ask you to make the measurements in the lab we generate random measurement values for your 96 constructs and you can right away jump to the "analysis" option on the left.
To calculate the contributions of each mutation to the observed thermostability, click on 'Deconvolute measurements'. When done, the results will appear in a chart. There is a lot of information in these charts, but for now, let's keep it simple. In the top chart, unselect all colored boxes except the 'lm: single mutations" option. This will leave only the contributions calculated from a simple linear model visible. For more details about the various models available, see box 1.
- clicking on the point in the chart will cause a box with details about that mutation to load on the right side of the chart.
- select the top-performing mutations. You can do this by dragging a box around them in the chart. Note that the selected mutations appear below the chart as 'selected mutations'. Click on 'add to selection'. This makes the selected mutations available for further use. After clicking, they end up in a box on the right of the chart. Here you can visualise them in the structure. Select a structure and make a visualisation and review it in yasara.
- structure visualisations combined with detailed information about individual variants allow you to get a very thorough understanding about what might be happening in your protein.
Let's build a novel round.
- in the left bar, add a novel round by clicking on the 'add round' link.
- give your round a nice name and click on 'Create round'.
Now you would like to use a different base sequence than what you used in the first round. It makes sense to pick the best performing sequence from the previous round as the starting point for the new round. Let's do that.
- click 'Select the best performing sequence from a previous Round based on a metric', select the first round and pick the thermostability metric.
- then, hit the 'save base sequence' button below
Now it is time to pick the mutations that you want to test in the next round.
- click on 'mutations' in the bar on the left.
- click on the 'selections' tab
Here you can see the mutations that you selected in the previous round. These were the best performing mutations from your previous round. It makes sense to include them in this round.
- click on 'add all' to move them to the right.
Your list of mutations for the next round is not complete yet. What you would do now in a real life scenario is to use all the insights you gained by analysing the mutations you previously made to select novel ones. Which positions worked, which didn't, etc. When your set of mutations is complete, you'd generate novel sequences, measure them in the lab, analyse the individual contributions and build novel rounds. Iterate you way towards a great protein!
Box 1. Detailed descriptions of the statistical models used in ‘classic’ deconvolution
Linear model: single mutations
The simplest model available is the linear model that takes into account only single mutations. This is a very straightforward model, basically just one formula in the form of:
EffectSize = a * mutation1 + b * mutation2 + c * mutation3 + ...
Given all the measurements (effect sizes) the numbers a,b,care optimised so that the final formula approaches the actual measured values best. The numbers a,b,c. are what you see in the deconvolution chart.
Glmnet: single mutations
This is basically a little smarter version of the linear model described above. What this model tries to do is the same as the linear model, but it tries to throw out all the parameters (or, mutations) that minimally affect the models performance. This model will usually consist of less mutations, and that is because it has thrown away mutations it deemed unnecessary. So, without those mutations, the model is still able to predict the actual measurements.
Same as with the linear model, the numbers a,b,c ... are what you see in the deconvolution chart.
Random forest regressor: single mutations
This is the least interpretable of the models I guess. The predictive model that tries to predict the measured values based on the mutations present in the sequences is a random forest model. This model consists of a number of 'decision trees' that are optimised to predict the measure value. A tree is basically a series of decisions, like "if mutation1 is present, but not mutation2 , predict measurement value x". The good thing about this is that interactions between mutations are inherently taken into account. The values that you see in the deconvolution chart are derived from the number of times a
mutation is used in a "split" in the tree, together with the position of the split in the tree, where higher-up splits are more important.
Then there are a number of models that include only interactions, or interactions combined with single mutations. Interactions are terms in the models that basically 'group' multiple mutations as one term. So if you have a sequence where mutation1 and mutation2 are present, their interaction will be one term (e.g. mutation1+2 ). So in the case of a linear model you get something like
EffectSize = a * mutation1 + b * mutation2 + c * mutation3 + d * mutation1+2
this is an example of a model with both single mutations and interactions, but you can also do this of course with just the interaction terms.
The downside of including interactions is that you get a very large number of parameters and you tend to overfit very easily, which is why there is no linear model that includes interaction terms, but only glmnet ones.