Index
Before doing this practice it is best to first do the Introduction: 3DM applied to the nuclear receptors course.
General
The complete course consists out of three parts:
To follow these courses, you need a Bio-Prodict account. If you do not have one yet, please start by filling out the form on the registration page. Are you eligible for an academic license? Make sure to register with your academic email address. Once you submit the registration form, you will receive a confirmation email with further instructions on how to activate your account.
For some exercises in this section, you need either PyMOL or YASARA installed which has the 3DM plugin. If you do not have YASARA or PyMOL or you are missing the 3DM functionality, please consult the Installation instructions 3DM. Make sure you have the latest version installed before you start this exercise.
Introduction
First, start by logging into 3DM with your Bio-Prodict account. After logging in, you will land on the dashboard page where you can select the system that you want to work on. Open the GPCR (2016) database.
By default, the GPCR 3DM database shows data of the GPCR family class A sequences. There are other classes of GPCRs too. You will investigate these later. Unless specifically requested otherwise, always use the GPCR class A protein family to answer the questions.
CorNet
Open 3DMs correlated mutation analysis tool CorNet by clicking on 'correlated mutations' in the left menu. You can see two big networks and a couple of small ones. You wonder if the two bigger networks really are two separate networks or if they are simply disconnected because of the fact that the default cut-off for the number of positions in the network is 25. Now answer the following questions:
Question 1
What does "independent networks" mean? When are two networks independent?
Question 2
How can you investigate if these two networks truly are two independently co-evolving networks? Don't do it yourself yet. Just think about how you would do this.
Question 3
Give the Keyword "specificity" in the searchbox of the literature & mutations window on the right. What do you find? Gives this a clue about the separation of the two networks?
Question 4
Do you think it is likely that position 92 or 93 have an important role for specificity? How would you find out?
Correlated mutations often divide the superfamily into several groups based on sequence motifs. Sometimes this is just a evolutionary separation (e.g bacterial sequences vs eukaryotes), but because 3DM uses superfamily alignments containing proteins with different functions the correlation often are a result of the changes between these different functional groups.
Question 5
Do you understand the difference between evolutionary grouping and grouping based on functional differences?
Question 6
Click on the main node of the two networks (e.g. position 88 and 74). Clicking on a node gives information about this position, such as the amino acid distributions. How many different residues do you see in the distribution pie if you click on position 88? And what about 74?
Question 7
Click on the other members of each network. Do you see a pattern? Why does this tell you that these are truly separate co-evolving networks?
Question 8
Select at the same time 105 and 197. You can do this by making a selection box around the two positions (or selecting them while holding the Ctrl key). Looking at the amino acid distribution pie chart, do you think they might form a salt-bridge? Check this by clicking on "selected nodes" in the "Visualize alignment position in structure" option.
Give again the keyword specificity in the search box and look at the enrichment-score (overall enrichment) below the search box.
Question 9
Is this a high/meaningful value? Is it what you expect thinking about what you see in the network or is something going wrong? You can click on the information icon if you don't know what the enrichment-score (E-score) is.
Question 10
Now select all the nodes (so both the green and the light blue nodes) of the sub-network containing the light blue positions (e.g. specificity positions). What is the enrichment score of this selection?
Question 11
Just below the search box there is the Keyword cut-off option. Set this on 1 and repeat the "specificity" keyword search. What does this option do? Why do you think 3DM has this value on 2 as default?
Question 12
You can see that there is only one article describing a mutation having effects on specificity at position 93. Do think that it is likely to be a false positive or is it more likely that this really is a specificity related position? And what about position 248?
Now select the other big network and color the nodes blue with the "node coloring" option that appears when you select positions in the window on the right. Select 247,248 and make them yellow. Select 42 and 32 and make them Magenta. Select position 60 (the only position of the sub-network that is not light bleu yet) and make it red. Click on "all nodes" in the "Visualize alignment position in structure" option. A new window will open where you can select structures to visualize the data. 3DM selects by default one structure. Just leave it as it is and click on either "visualize selection in Yasara" or "visualize selection in Pymol" depending on which program you have installed on your computer.
Question 13
Do the nodes of the specificity related network cluster together in the structure? How about the other big cluster?
Question 14
Do you think that position 60 really belongs to the specificity related network?
Question 15
What about the small separate network 247 and 248 for which we found one article describing effects on specificity. Might they actually belong to the specificity related network? And what about the other two small networks (e.g 32-42 and 153-251). Do you think they are related to the bigger network too?
If you use Yasara load the 1F88A co-crystalized compound using: 3DM → Structures → Load structure from 3DM and load the 1F88A drug. If you use Pymol start the 3DM plugin using "plugin → "initialize plugin system". Then "plugin" → "legacy plugins" → "3DM". A new window should start. Use your 3DM credentials to login and click in this window on 3DM menu → structures → load structures from 3DM, select 1F88A and only select "download drug".
Question 16
Do you think it makes sense to find the ligand there?
Families and numbering schemes
There are different families in the GPCR database. For each different GPCR class an alignment has been generated. These alignments are structure based and all have their own structural conserved core. Because these cores differ in length the different alignments have their own numbering schemes. Via the "Alignment" scroll down menu at the top of the 3DM website you can switch between the different families.
In the left menu, click on "System" → " System info" to open a nice overview of the different family alignments.
Question 17
How many aligned sequences do you see in the class A family?
Question 18
How many sequences are in the different alignments?
Question 19
Group A is by far the largest group of GPCRs. The "superfamily 3DM" is a combination of all the classes. Why do you think the superfamily 3DM is not as big as all groups in total?
Now using the "alignment" scroll down menu select the superfamily 3DM and look at where the conserved residues are (which 3D numbers) using the "Alignment statistics" option from the left menu. Now do the same for the group A sequences (this you can do best in the other tab so you can switch between the tabs).
Question 20
What is the first residue that is conserved > 95% in the full alignment? What is it's 3D number in the full alignment? And in the class A alignment? Can you explain this?
Select from the "Numbering scheme" scroll-down menu at the top of 3DM the gpcr-A numbering in both alignments.
Question 21
Now compare the 3D numbers of the two conserved Cys residues again. What do you see? What is the number when using the class B numbering scheme?
Realize that between different 3D alignments the 3D numbers cannot be compared (do you understand why you can't select gprc_a numbers if you have the GPCR class B alignment selected?). Because the size of the structural conserved cores differ between the different families different numbering schemes have been developed for the different subclasses. Each of these numbering schemes indicates for which subclass the numbering is. Although it could have been even better if they had used a common numbering for the transmembrane parts that is shared by all classes such that the conserved Cys would have had the same number in all classes. Even though this is not the case the commonly used numbering schemes designed for the GPCRs is a luxury that is only available in the GPCR protein family and has not been developed for any other protein family (as far as we know).
Making a structural model with 3DM
3DM contains an automated homology-modeling module. It uses the 3DM alignment between a sequence and a structure to generate models. At the protein detail page of a sequence you can find a model tab. Here you can select a suitable template to generate a homology model of the protein. 3DM already makes a pre-selection of suitable templates based on sequence similarity. You can choose any of the pre-selected ones. Which of these templates is best is for you to decide. This decision should be based on what you want to do with your model. For instance, enzymes can be in the open or closed conformation. If you want your model in the closed conformation, then start with a template that is in the closed conformation. Other factors can be: with or without ligand, with an activating/inhibiting ligand bound in the ligand binding pocket, etc. So, always first investigate the different starting templates to see which one fits best to your needs.
Hirschsprung disease is a disease that is caused by mutations in a GPCR.
Question 22
Using the "Literature" option of the "search" module try to find a human protein that is known to be involved in Hirschsprung disease.
Question 23
Go to the "protein detail page" of P24530. In 3DM you can simply click on the protein name. At this page you can see a "models" tab where you can create a structural model of a protein. Do you understand the different numbering options here?
Select a template and make a model. Make sure you have the GPCR class A alignment selected. Use Yasara or Pymol to open it. Making the model may take a few minutes. If 3DM wasn't able to model parts of the sequence, these parts will be missing in your model. These missing positions are indicated as purple dots in Yasara and as lines in Pymol. Usually this is because the alignment between the sequence and the template cannot reliably be made (often the parts outside the core) due to very low sequence similarities. Realize that those parts cannot be modeled reliably using the selected template, because the sequence similarity is so low that the two proteins will likely fold differently in those parts. Sometimes it helps making a model choosing a different template (if available), but usually this means that those parts can simply not be modeled reliably.
Question 24
Find positions where mutations can cause Hirschsprung disease using the "Literature hotspots" option of the 3DM menu in Yasara or Pymol. How many positions can you find?
Visualize data in structures
To visualize data in a structure you can use the "visualize" option. Here you can select multiple structures. All structures are superimposed. You can select one or more of the templates used to generate the alignment from, when "Show only templates" is selected from the quick filter menu. When activating the advanced setting, a PDB type can be selected. The "aligned" setting gives any of the other structures that could be superimposed on the templates can be selected. The "non-aligned" setting contains structures that were detected by BLAST but 3DM could not superimpose them on the templates. These structures usually are not belonging to the superfamily or are structurally too distantly related. Models you have generated with 3DM can be selected from the "model" setting. The name of a model indicates which protein is modeled and between brackets you see which structure was used to make the model. Do you see the one you made? From the "compounds" menu you can select any co-crystalized compound. These are also superimposed, so usually they will be in the pocket of the structures. You can select any combination of protein structure and ligand, so you can insert any ligand into any structure.
Question 25
Select a couple of structures and ligands. Then add positions, select the Correlated mutations tab and select the top five correlated mutations. From the "Conservation" select all residues >90% conserved. From the "Contacts" tab select all positions make at least 100 contacts with a ligand. Click on "visualize selection in Yasara" or "visualize selection in Pymol" and open the resulting file. Did you get back what you expected?
Click on "alignment statistics" in 3DM . Make sure you have the GPCR class A family selected and consult the "Human variation/Position" histogram. You can compare this histogram with other data types using the "compare with" option above the histogram. Choose here "Amino acid conservation". Set the cut-off for conservation on 50% and the SNP cut-off on 0,3.
Question 26
Do you find more SNPs at Conserved regions? Does this finding make any sense?
Question 27
Now compare to ligand contacts. Do you see a common SNPs at a ligand contacting position?
Question 28
Reset the SNP graph by clicking "reset all". Move the cut-off to 1% using the slider bar and export the result to Yasara using the button left to the slider bar
. Which positions were exported to the structure? Which of these very common SNPs is located near the ligand binding pocket?Note that the
button can be found at many places in 3DM for direct visualization of data in Yasara.F: Hotspot Baskets
There are other export buttons too, such as the "export to hotspot basket" button. Let's see how this works. Put the cut-off of the SNP graph again on 1 and open the hotspot basket tool by clicking on "HOTSPOTS" at the right corner of 3DM. At the SNP histogram this sign will appear:
. This icon is for the insertion of selected positions in a hotspot basket. Then upload the 7 SNP positions into a new hotspot basket by clicking on the icon, give the basket a name, and save the basket.Question 29
Go back to Yasara or Pymol and use the 3DM → Hotspot Baskets → Load a Hotspot Basket and open the basket you just created. In Pymol you need to define which hotspots you want to visualize from the basket. Simply select all 7. Is the result the same as you had before?
A hotspot basket is nothing more than a selection of alignment positions. You can generate hotspots for different protein features (e.g. correlated mutations, specificity hotspots, thermostability hotspots, etc) and those can be found using 3DM. At a later stage you can open the basket in different 3DM tools. Let's see how this works.
Question 30
Go to the "correlated mutations" option in 3DM and use the keyword specificity in the "Literature & Mutation" window with a keyword cut-off of 1. Make sure you have the GPCR class A selected. Open the hotspot basket tool and select a NEW hotspot basket. Select the 9 positions of the sub-network that contains mutations reported to effect specificity upon mutation. Click on the "add to hotspots" button
to store these specificity related mutations in a NEW basket. Give it a logical name and save it. We will use it later on.
A good trick for increasing thermostability of proteins is to make mutations at flexible regions. Go to the "Alignment Statistics" module of 3DM. Here you will find two histograms that display flexible positions. The RMSD (a measure for how tight the structures superimposed) and the average B-factor. The B-factor is a measure for how sharp the X-ray diffraction was for the atoms of a residue. A very sharp X-ray indicates that the amino acid is tightly positioned in the structure. The average B-factor is the average over all amino acids from all structures at an alignment position. Often these two plots show a similar pattern. If at a position both plots are high usually this is a good hotspot for changing thermostability.
Question 31
Make sure you have the hotspot basket window open. Increase the B-factor cut-off using the slider bar until you have <20 positions selected and store these in a new basket. Open Yasara or Pymol and open this basket in any GPCR you prefer. Does the result make any sense? Do you think these are positions that are likely good candidates to improve thermostability of GPCRs or is there an artifact in the B-factor method in GPCRs? Can you think of a way to solve this? Use the "literature hotspot" option (keyword thermostability) in Yasara or Pymol to find mutations from the literature that have an effect on thermostability. Do these overlap with the B-factor hotspots? Do you think this is a better way of finding hotspots for thermostability in this protein family? Right click on some of these positions and choose "3DM" → "amino acid distribution". Do any of these positions have a different residue than the consensus? Would it be smart to make the consensus if you want to increase thermostability? Is there a way of checking which mutations might be beneficial?
Panel design
The idea of the panel design tool is to select sequences from the alignment such that the selected sequences are maximally distributed over the superfamily. This is done in two steps: First the sequences of the superfamily are grouped. This can simply be based on sequence similarity (similar sequences are within the same group), but groups can also be based on sequence motifs found at user selected positions. The last option is used to group sequences based on a protein feature. For instance, the user can pick positions important for specificity. The idea is that sequences that have the exact same residues (the same motif) at those positions they are likely to have the same specificity. Both methods can be combined. In the second step a user defined number of sequences (usually one or two) are selected from each group. The selection step contains all kinds of options to maximize the chance that these proteins are likely to express. Let's see how it works.
Select the "panel design" option in 3DM. First we will divide the super-family based on sequence motifs. Because we want maximize the specificity range in the panel, we will use the "specificity hotspot" basket you have generated in question 30. This way all sequences with the same motif at our specificity hotspots (thus will likely have the same specificity) will be in one group. Use the "add hotspots" button to select the hotspot basket you made in Q30 (it should contain 9 positions).
Once they are selected click on "show groups".
Question 32
Based on these criteria, into how many groups is the alignment divided?
Question 33
By default the tool only selects groups with at least two members and deletes sequences that have a gap in one the selected positions. Do you understand the difference between "motifs" and "total groups"?
You can also divide the alignment phylogenetically. This separation can be combined to the motif grouping. In the box under "phylogenetic groups" you can give a number. The superfamily will then be divided into this number of groups purely based on phylogenetic distances. Reload the panel design tool (refresh the page) and type 10 in the box and click again on "show groups".
Question 34
How many groups are added by this exercise?
After defining the groups you have to select from each group the sequences you want in the panel. There are several selection options that can be used to pick sequences from each group. First you select to number of sequences per group with the "proteins per group" option. Then several options can be used to determine which sequences are selected. These options are there to maximize the chance that selected sequences can be expressed. If there is literature available for a sequence, for instance, or if there is a structure available, then the chance that this protein can be expressed is higher since someone else has done it before. The different selection options can be combined using a "must have" or a "prefer" options. "must have" will result in a smaller number of groups, because not all groups will have a structure available, for instance. Usually it is a good idea to start with a larger number of groups that you want to have in your panel and delete some groups with these different options. Now play with the different options and see if you understand the result. To see the effect that the different options have you have to click on "select proteins". Any surprises? Can you make a panel of approximately 96 sequences for which you require swiss-prott, literature- and structure available.
Note that you can make the total panel smaller by clicking the "panel size" option. This will remove sequences by ensuring maximum diversity in the remaining panel sequences.
Note that sometimes you need to make a panel of a subset of sequences. For instance, say you want to find the most active enzyme with a certain specificity. Then you should first make a subset that contains only sequences that have this specificity. This sounds simple, but due to wrong notation of proteins is tricky. The best way to do this is to first do a keyword search to find enzymes likely to have the correct specificity and make a subset of this set of sequences. Then use this subset to make a motif with 4 to 7 amino acids that is specific for this subset. It is best to use the "subset specific residues" plot (consult the OAH questions about this plot). Make a new subset that contains all sequences that have this motif. With this approach you will not only find sequences with the correct specificity but that are annotated as "hypothetical protein", but you will also delete the sequences which are wrongly annotated. This approach doesn't make sure you have all sequences with the correct specificity, but it does maximise the chance that the ones that are in your subset all indeed have the correct specificity.
Literature on thermostability
Select "Alignment statistics" on the left of 3DM and scroll down to the "keyword mutation" histogram. Use the keyword "thermostable". You will find 24 positions that are related to this keyword. Open the hotspot basket tool and insert these 24 positions in a new hotspot basket, give it the name "thermostable" and save it. You will use it later.
At position 30 there is one thermostable mutation found in the literature. You can click on the bar at position 30 in the histogram. This will link to a page showing the corresponding paper. From here you can download the paper if you have access. If you don't have access you can also find the paper here.
Question 35
Can you find back the mutation at position 30? Is R the most common residue there, or can you easily mutate the R to something else based on the amino acid distribution. What is the most common residue at this position? Would Y be the first residue to try?
The R124Y mutation is in the human Endothelin type B receptor (EDNRB_HUMAN).
Question 36
Can you find a structure of this protein in 3DM?
Question 37
Go to the protein detail page of P24530, make a model of this sequence and open it in Yasara. Look at R124 → use F3 to see all the atoms. Make the R124Y mutation (right click residue → swap → residue). Do you understand why this mutation might be stabilizing? Investigate all other mutations that are described in the abstract of the paper. Look at conservation and structural location to see if you would have picked these positions/mutations in order to change thermostability.
Patent analysis tool
3DM contains an advanced tool for the analysis of patent data for complete protein families. How the tool works is explained in the Patent analysis presentation. Just go through the slides and then try out the tool yourself in the GPCR protein family via the "patents" option in the 3DM menu.
3DM also contains a tool for analyzing patent data in a target sequence. You upload a target sequence to 3DM via system → custom proteins → add custom protein
Designing mutation experiments with the de-convolution tool
The de-convolution tool is a protein-engineering tool that is made for designing mutation experiments for a specific target sequence. With this tool many positions can be targeted at once without having to screen a large number of clones. The results of the measurements can be uploaded to the system and the system will calculate which mutations have the biggest effect on the feature(s) (e.g. activity or thermostability). The positive mutations can be combined to make the best performing sequence that can subsequently be used for a next round of engineering. Let's see how the tool works.
Go to Genomes (genomes.bio-prodict.nl) and login with these details:
Login: course_demo@bio-prodict.nl
Password: K7XlQAK23QvE8Bl%1i^L5#Ax460lC3
Once logged in you will see that we have set-up the de-convolution tool for one GPCR protein (P07550). At the right top you see an Erlenmeyer flask.
Click on the Erlenmeyer to open the tool
Click on "create new project"
Give the project a name
Select the target protein as the template of this protein engineering experiment
Create the project
Click on this project once it is available (sometimes the page needs a refresh to make it visible)
Click on "add round" to start designing the first round of mutations
Give it the name "round 1" and save it
Select "use the round 0 starting sequence (Note that if you already have uploaded experiment data in a previous round, you can use a sequence from that round)
Click on "save base sequence"
Now we can select mutations to introduce to the sequence. Click on "Mutations" on the left
We will use the hotspot basket generated previously. Select the "hotspots" tab and select the "thermostable" basket.
Click on "select all" to select all positions and select "Add all with most frequent as variant type" This option ensures that each hotspot is mutated to the most common amino acid according to the alignment. If the target sequence already has the most common amino acid it selects the second most common residue.
Note that the left box is a tool for the selection of mutations. Once you have the desired mutations you can drag them into the box on the right. Now the system automatically generated the mutations for all of the hotspots based on amino acid occurrences, but you can also manually add mutations.
Select the "Manual" tab in the left box and type V117C. To add this mutation to the experiment, simply drag this mutation into the right box.
Save the mutations by clicking "save" in the right box.
Click on "Sequences" on the left.
At "Maximum # of mutations per sequence" select 4 and select "fill up each sequence to contain maximum # of mutations". This will ensure that all sequences will have 4 mutations. If you don't use this option then sequences can have single, double, and triple mutations. Be sure to check the option “create demo measurement data”.
Select 96 sequences to generate and set the minimum number of observations to 2 and click on "convolute mutations".
This convolution step calculates the best sequences to measure. The tool maximally distributes the mutations over the 96 sequences. In real life the 96 sequences are ordered at a gene synthesis company and the thermostability of these 96 clones are then measured in the lab. These measurements must be stored in an excel file which can be uploaded to the system. You can download the template of this file at the bottom of the sequence list (export CSV). In this file you should enter the measured values and then upload the file via the "measurements" option on the left.
Because you are doing the course and we, of course, cannot ask you to make the measurements in the lab we generate random measurement values for your 96 constructs and you can right away jump to the "analysis" option on the left.
Click on "analysis".
To calculate the contributions of each mutation to the observed thermostability, click on 'Deconvolute measurements'. When done, the results will appear in a chart. There is a lot of information in these charts, but for now, let's keep it simple. In the top chart, unselect all colored boxes except the 'lm: single mutations" option. This will leave only the contributions calculated from a simple linear model visible. For more details about the various models available, see box 1.
Question 38
What is the best performing mutation?
Are there mutations that have a destabilising effect?
clicking on the point in the chart will cause a box with details about that mutation to load on the right side of the chart.
select the top-performing mutations. You can do this by dragging a box around them in the chart. Note that the selected mutations appear below the chart as 'selected mutations'. Click on 'add to selection'. This makes the selected mutations available for further use. After clicking, they end up in a box on the right of the chart. Here you can visualise them in the structure. Select a structure and make a visualisation and review it in yasara.
structure visualisations combined with detailed information about individual variants allow you to get a very thorough understanding about what might be happening in your protein.
Let's build a novel round.
in the left bar, add a novel round by clicking on the 'add round' link.
give your round a nice name and click on 'Create round'.
Now you would like to use a different base sequence than what you used in the first round. It makes sense to pick the best performing sequence from the previous round as the starting point for the new round. Let's do that.
click 'Select the best performing sequence from a previous Round based on a metric', select the first round and pick the thermostability metric.
then, hit the 'save base sequence' button below
Now it is time to pick the mutations that you want to test in the next round.
click on 'mutations' in the bar on the left.
click on the 'selections' tab
Here you can see the mutations that you selected in the previous round. These were the best performing mutations from your previous round. It makes sense to include them in this round.
click on 'add all' to move them to the right.
Question 39
You might see warnings like this: 'wildtype of mutation M279V did not match round base sequence'. How can this happen?
Your list of mutations for the next round is not complete yet. What you would do now in a real life scenario is to use all the insights you gained by analysing the mutations you previously made to select novel ones. Which positions worked, which didn't, etc. When your set of mutations is complete, you'd generate novel sequences, measure them in the lab, analyse the individual contributions and build novel rounds. Iterate you way towards a great protein!
Box 1. Detailed descriptions of the statistical models used in ‘classic’ deconvolution Linear model: single mutations mutation is used in a "split" in the tree, together with the position of the split in the tree, where higher-up splits are more important. |