Virus-X 3DM tutorial 2019

Welcome to the 3DM tutorial. A 3DM system is a data integration platform that collects and integrates data about a protein superfamily.

Registration (skip this step if you already have a 3DM account)

Most importantly, to go through this walkthrough you will need a 3DM account.

Go to 3dm.bio-prodict.com and click on the SIGN-UP button (in the 'Get 3DM' section). You'll be redirected to a login page, where you need to click on the Sign-up button.

 

Once you submit the registration form you will receive an e-mail with an activation link. Follow the link to activate your account, and you're ready to use the 3DM services.

Login (once you have a 3DM account)

Go to 3dm.bio-prodict.com and click on the OPEN 3DM button (in the upper right corner). You'll be redirected to a login page, where you need to click on the Sign-in button.

Leave the 'Two-factor authentication code' field empty if you did not enable two-factor authentication.

3DM dashboard

You should now be at the 3DM dashboard website which (depending on the user) looks more or less like this:

 

 

From here you can either go directly to a 3DM system of your choice or, if you don’t know yet which system you would like to use, find the right one by following the Search by sequence / Search by keyword links in the menu on the left.

Systems listed under My 3DM systems are manually curated systems that were made specifically for the Virus-X project. PDB-wide 3DM systems are automatically created systems; most of them are much smaller than their manually curated equivalents, but together they cover many more protein families (almost all SCOP families and PDB clusters). Public 3DM systems are systems that are available to all 3DM users.

 

Once you start using 3DM, the dashboard will also show your recently accessed proteins and the proteins that you have marked as favourites.


Project:

Increasing DNA polymerase activity

What mutations do we need to introduce to increase DNA polymerase activity of our protein of interest?


Sequence:
>M813_3RPJ35_ACHX_2 | Family A DNA Polymerase v_polI_1

MYQLINNLPDLPRDKTLFIDTETTDLYGDIVLLQLYQENMKDVLIVNAKNIPKSTILQYL
KSFKHIVGYNLQYDWEVLGATFDDVKERYFYDDLYFASKIIYYDQESFNLYDILNSVLKL
DIKIDKKKMQKQGFGGLFFTKEQLEYASTDVLYLPKLYKAILNLEPNFFIRNRVYRMDLF
VSKMMLDIHKIGLKVNKKKLNERKSELEAKLKEFNFSFNPLSPQQVARVLNTLKSDKEVL
LDLAYKGNEIAKQILEYRKIAKLLNFIAKFNKDRVYGKFNVVGAKSGRMTCSNENIQQIP
RELRSVFGFTDDEDKVYVVADFPQIELRLAALIWQEENMIKAFKEGIDLHKYTASVIYNK
DIEAVDKTERQISKSANFGLLYGMSGKAFAKYVYTNAGIVLSEDEGEFIKYKWLETYPMI
ARKHMQVKEKLYSSQYFESSTILGRKYRTQNFNEALNLQIQGSGAELLKQTLINLKQKYH
SLNIANVIHDEIIIECNKDEAQDIANALKQEMEQAWDTICSKAKIPIKYFKLEVEQPEVL
KSIAKA

 

 

Question 1. Which 3DM system are we going to use?

 

Since we already have the exact sequence of our protein of interest we can use the Search by sequence functionality from the dashboard.

 

 

 

Here, you can simply paste your sequence (either FASTA or plain sequence) and click the SEARCH button.

We have to go through a lot of data to find the right systems, thus the search might take up to 2 minutes.

 

This is the system search result page for our DNA polymerase:

 

Let’s now have a look around this page. First of all, the two colourful bars right below represent the Pfam domains found in this sequence - in this case, the blue bar is the 3'-5' exonuclease domain and the orange bar is the DNA polymerase family A domain (you can see more detailed information if you mouse over the domain bars).

This gives us important information - since we want to increase DNA polymerase activity, we will be mostly interested in the region marked in orange, the DNA polymerase domain (residues 188-519 in our sequence).

When choosing the right 3DM system we have to make sure that the part of the sequence we’re interested in is actually covered by this system - this is indicated by the grey bars in the Query sequence column:
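If you want to work with just this region outside 3DM, the 1-based, inclusive residue range can be turned into a slice of the query sequence. The following is a minimal Python sketch; the helper names `read_fasta_sequence` and `subsequence` are made up for this example and are not part of 3DM.

```python
FASTA = """>M813_3RPJ35_ACHX_2 | Family A DNA Polymerase v_polI_1
MYQLINNLPDLPRDKTLFIDTETTDLYGDIVLLQLYQENMKDVLIVNAKNIPKSTILQYL
KSFKHIVGYNLQYDWEVLGATFDDVKERYFYDDLYFASKIIYYDQESFNLYDILNSVLKL
DIKIDKKKMQKQGFGGLFFTKEQLEYASTDVLYLPKLYKAILNLEPNFFIRNRVYRMDLF
VSKMMLDIHKIGLKVNKKKLNERKSELEAKLKEFNFSFNPLSPQQVARVLNTLKSDKEVL
LDLAYKGNEIAKQILEYRKIAKLLNFIAKFNKDRVYGKFNVVGAKSGRMTCSNENIQQIP
RELRSVFGFTDDEDKVYVVADFPQIELRLAALIWQEENMIKAFKEGIDLHKYTASVIYNK
DIEAVDKTERQISKSANFGLLYGMSGKAFAKYVYTNAGIVLSEDEGEFIKYKWLETYPMI
ARKHMQVKEKLYSSQYFESSTILGRKYRTQNFNEALNLQIQGSGAELLKQTLINLKQKYH
SLNIANVIHDEIIIECNKDEAQDIANALKQEMEQAWDTICSKAKIPIKYFKLEVEQPEVL
KSIAKA
"""


def read_fasta_sequence(fasta_text):
    """Return the sequence of a single-record FASTA string (header dropped)."""
    lines = [line.strip() for line in fasta_text.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def subsequence(sequence, start, end):
    """Residues start..end, numbered from 1 and inclusive on both sides."""
    return sequence[start - 1:end]


sequence = read_fasta_sequence(FASTA)
polymerase_domain = subsequence(sequence, 188, 519)
print(len(sequence), len(polymerase_domain))
```

Note the `start - 1`: residue numbers start at 1, while Python string indices start at 0.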

 

If part of the bar is light grey, that part is not covered by the system. Here, for example, the 3rd system on the list doesn’t contain the 3'-5' exonuclease domain and contains only part of the DNA polymerase domain (according to Pfam).

 

The next things to look at when choosing a system are the BLAST E-values and the system sizes.

The last system on the list contains only 611 sequences, so we won’t take it into account. The first two systems both have a lot of sequences. It might seem that we should choose the first one because it has far more sequences, but here it is better to focus on the system whose sequences are closest to our query sequence - and the BLAST E-value is better (lower) for the hit found in the second system, DNA Polymerase (2016).

If you click on the downward pointing arrows on the right side you can unfold the system view to see some more information about this particular system. Since our sequence of interest doesn’t have an exact match in this system (the closest sequence is 40% identical), we would like to add it as a 'private protein'. There is no need to do that now: I’ve already added this sequence to the system and shared it with all Virus-X users, so you’ll be able to use it from your accounts.

 

System search summary

Important things to look for when choosing the right system for your project:

  • does it contain the region that you’re interested in

  • system size (number of sequences and mutations in the system)

  • how close is the best hit to your query sequence (Blast E-value, Blast identity)
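The three criteria above can also be written down as a small filter-and-sort step. This is only a sketch of the reasoning, not something 3DM exposes: the `Candidate` class, the system names, the covered ranges, the E-values and the size cut-off are all hypothetical placeholders (only the 611 sequences of the last system come from the result page).

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    covered: tuple      # (start, end) of the query covered by the system, 1-based inclusive
    n_sequences: int
    best_evalue: float  # BLAST E-value of the best hit; lower is better


def covers(candidate, region):
    """True if the system covers the whole region of interest."""
    start, end = region
    return candidate.covered[0] <= start and candidate.covered[1] >= end


def shortlist(candidates, region, min_sequences):
    """Keep systems that cover the region and are large enough, best hit first."""
    usable = [c for c in candidates
              if covers(c, region) and c.n_sequences >= min_sequences]
    return sorted(usable, key=lambda c: c.best_evalue)


# Hypothetical values for illustration only.
candidates = [
    Candidate("system A", (1, 546), 50000, 1e-60),
    Candidate("system B", (1, 546), 20000, 1e-90),
    Candidate("system C", (250, 546), 5000, 1e-40),
    Candidate("system D", (1, 546), 611, 1e-30),
]
ranked = shortlist(candidates, region=(188, 519), min_sequences=1000)
print([c.name for c in ranked])
```

In practice you make this judgement by eye on the search result page; the sketch only makes the order of the checks explicit.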

 

View your sequence in 3DM system

 

Since our sequence is already added to the system, we can now go directly to the system. You can open the 3DM system either by clicking on its name or on the 3DM SYSTEM button right beneath the word cloud.

To find our sequence click on Proteins and then on Private proteins in the menu on the left (if you're lost follow this link).

 

Private proteins

Proteins added by the users after the system was created, visible only to them and people who they shared them with.

The table that you see now is a list of all private sequences in this system that were either added by you or shared with you by other Virus-X users.

The sequence that we’re looking for is M813_3RPJ35_ACHX_2. Click on its accession to go to the protein detail page.

Here you can see all kinds of information about the sequence (however, for private proteins fields like Organism, Gene name, etc. are missing unless the user has filled them in).

Right below the sequence accession at the top there’s a row of tabs. We are currently in the information tab; the other tabs are models (for homology modelling), sequence, sequence projection (to see superfamily data mapped onto your sequence), alignment quality check (advanced, to check how well your sequence is aligned in this system), and other 3DM systems (to access other systems to which this sequence is connected).

Numbering scheme

The first thing we want to do here is create a numbering scheme based on our sequence. This means that all alignment positions will be translated to the residue numbers of our sequence: for example, whenever you see alignment position 5, it will also mean position 5 in your sequence.

To create a numbering scheme, click on the green CREATE NUMBERING SCHEME button and, once you’re done, switch to your new numbering scheme with the button at the top of the page.
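Conceptually, a numbering scheme is a lookup from alignment positions to the residue numbers of one reference sequence. The sketch below illustrates that idea on a made-up aligned row; it is not how 3DM builds its schemes internally, and the function name `numbering_scheme` is an assumption for this example.

```python
def numbering_scheme(aligned_row):
    """Map alignment position (1-based column) -> residue number in the reference.

    Gap columns ('-') hold no residue of the reference and are left out.
    """
    mapping = {}
    residue_number = 0
    for column, symbol in enumerate(aligned_row, start=1):
        if symbol != "-":
            residue_number += 1
            mapping[column] = residue_number
    return mapping


# Toy aligned row (made up): gaps mark columns where this sequence has no residue.
scheme = numbering_scheme("MY-QL--INN")
print(scheme[4])        # alignment column 4 holds the 3rd residue (Q)
print(scheme.get(3))    # column 3 is a gap in this sequence, so there is no number
```

Columns that are gaps in your sequence have no residue number, which is why data can only be transferred onto the aligned residues.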

Let’s now proceed to investigating our sequence.


Finding residues of interest (hotspots)

 

Question 2. What residues are important for the DNA polymerase activity?

  • Which residues are responsible for DNA binding?

  • What are the residues mentioned in literature in the context of activity?

  • Which residues are highly conserved or involved in the correlated mutations network?

We can find answers to these questions in the SEQUENCE PROJECTION tab.

Once we’re in the sequence projection tab we should see something like this:

Sequence projection lets you project all sorts of superfamily data onto your protein.

The underlined residues are the residues that are aligned in our system - in practice, that means that these are the residues onto which we’re able to transfer data from other superfamily members.

Let’s now try to answer our questions using this tool.

Which residues are responsible for DNA binding?

To do that, click on the round green plus button below Visualizations on the left and choose 'Contact' from the menu. By default it shows 'Ligand contacts', but we can switch to 'DNA/RNA contacts' with the radio buttons.

 

Now you see that some of the residues are highlighted in green - the darker the colour, the more structures have DNA/RNA contacts at this position. For example, if you mouse over Arginine 288 you can see that there are 392 DNA/RNA contacts at this position. That’s quite a clear indication that this residue might be important for DNA binding.

Which residues are mentioned in the context of activity in the literature?

To answer this question we need another visualization from sequence projection, the so-called 'keyword mutations'. It shows all mutations that we found in articles in the same sentence as the keyword that you provide. The keyword can be any word or phrase you want.

In this case we will use the keyword 'activity'.

Again, click the round green plus button and choose Keyword mutations. In the block that appears, type in 'activity', change the min. keyword occurrence value to 1 and click RETRIEVE.

The keyword mutations should now be highlighted on our sequence in orange, like this:

 

Now we can see multiple interesting residues that have both a lot of contacts and a lot of keyword mutations, for example Aspartate 490 - if we click on this residue we will see some more detailed information:

The first thing this tab shows us is the amino acid conservation at this position. Here aspartate is conserved in almost 100% of the sequences - that’s a red flag, we probably should not mutate this residue. But let’s also have a look at the KEYWORD MUTATIONS tab:

 

This table lists all the keyword mutations together with the articles in which they were found (to go to an article, follow the PubMed link on the right). The Accession field tells you onto which protein the mutation was mapped, and the mutation score indicates how certain we are that the mutation is mapped correctly. Mutations in the 'Mutation' column use the numbering of the sequence onto which they were mapped, so if you want to look up the first mutation in the article, for example, you should look for residue 882 (and not 490).

We had a look at one of the articles (“A carboxylate triad is essential for the polymerase activity of Escherichia coli DNA polymerase I (Klenow fragment). Presence of two functional triads at the catalytic center.”) and we found this information:
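Moving between the two numberings is a simple renumbering of the mutation string. Here is a minimal sketch, assuming mutations are written in the usual one-letter form (wild type, position, mutant); the mutation `D882A` and the function `renumber` are illustrative only, and the single mapping entry 882 -> 490 is the pair of positions discussed above.

```python
import re

# One-letter point mutation: wild-type residue, position, mutant residue.
MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def renumber(mutation, position_map):
    """Rewrite a point mutation such as 'D882A' in the numbering of another sequence."""
    match = MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"not a point mutation: {mutation!r}")
    wild_type, position, mutant = match.groups()
    return f"{wild_type}{position_map[int(position)]}{mutant}"


# Position 882 in the protein from the article corresponds to 490 in our sequence.
position_map = {882: 490}
print(renumber("D882A", position_map))
```

The sketch keeps the wild-type letter unchanged, which is only correct when both proteins have the same residue at the aligned position (as is the case for this conserved aspartate).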

Of the four carboxylate residues at the polymerase active site of E. coli DNA polymerase I, two aspartates at positions 705 and 882 are known to be absolutely essential for catalysis.

That is yet another hint that we should not be mutating this residue if we want to increase polymerase activity.

 

We know from the literature that metal binding can affect polymerase activity, so we also want to see which residues have contacts with metal ions. Click on the 'Contact | DNA/RNA' block on the right and switch to Ion contacts.

There’s a block of residues in the middle that have both a lot of keyword mutations and ion contacts - let’s have a look there.

Click on the first residue from this block (Valine 316) and have a look in the keyword mutations tab.

There’s a link to an article, “DNA polymerase active site is highly mutable: evolutionary consequences”.

And that's finally what we were looking for! The article reports a 20-50% increase in activity from a single point mutation (Leucine to Arginine) at this position. Although our protein has a Valine at this position, there’s a high chance that a Valine to Arginine mutation will have a similar effect. Here’s a table from the article that shows the mutations that were introduced at each position and the activity relative to the wild-type protein:

We also see from that table that mutating an Arginine a couple of residues further on to a Phenylalanine resulted in a 100% increase in activity. This position is also an Arginine in our protein, so it is definitely worth adding to our list of hotspots.

 

In a similar fashion you can continue looking for more hotspots by combining 3DM data with reading the articles.

What residues mutated together throughout evolution? (advanced)

Now let’s try another visualization - correlated mutations. You can again choose it with the plus button on the right.

Since we already know from the previous paper that the block of residues in the middle plays a significant role in polymerase activity, we’ll check whether these residues correlate with any residues located further away. Let’s have a look at Isoleucine 325. The correlated mutations tab says that it is highly correlated with the residue at position 461 (Glutamine).

The chart on the right shows the most frequently seen combinations of these two residues in this superfamily. For example, 59% of the proteins in this family have the combination Isoleucine at position 325 and Glutamine at position 461, 29% have Leucine at position 325 and Threonine at position 461, and so on. These residues are clearly up to something together! If you look at the contact data for both of them you can see that there are contacts with a sodium cation at both positions - maybe they're both involved in coordinating the ion?
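Percentages like these come from counting which residue pairs occur together at the two alignment positions. The sketch below shows the counting on toy data: two made-up alignment columns for 100 sequences, built so that the Ile/Gln and Leu/Thr pairs reproduce the 59% and 29% quoted above, while the remaining 12 pairs are invented filler. It is an illustration of pair counting, not 3DM's correlated mutation analysis.

```python
from collections import Counter


def pair_frequencies(column_a, column_b):
    """Percentage of sequences showing each residue pair at two alignment positions."""
    pairs = Counter(zip(column_a, column_b))
    total = sum(pairs.values())
    return {pair: 100 * count / total for pair, count in pairs.most_common()}


# Toy columns (one letter per sequence); the V/S pairs are made-up filler.
column_325 = "I" * 59 + "L" * 29 + "V" * 12
column_461 = "Q" * 59 + "T" * 29 + "S" * 12

for pair, percent in pair_frequencies(column_325, column_461).items():
    print(pair, percent)
```

When a few pairs dominate like this, the two positions tend to change together rather than independently, which is what the correlated mutations tab highlights.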

We can have a look at the structure to check if our ion hypothesis is plausible. With the menu on the left switch to the Visualize page. In the POSITIONS block click ADD POSITIONS, and enter your positions (325 and 461) in the 'Custom positions' field, then click the VISUALIZE button.

Wait a bit (this might take some time) until a YASARA scene is downloaded.

Open it in YASARA, and zoom in on the residues coloured in yellow - they aren’t very far from each other (8 Å) so it is plausible they do indeed cooperate with each other.
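If you export atom coordinates from the structure, the same distance check is a one-line Euclidean distance. The sketch below uses hypothetical coordinates (in Å) and a hypothetical 10 Å cut-off purely for illustration; in YASARA you would normally just measure the distance in the scene.

```python
import math


def distance(atom_a, atom_b):
    """Euclidean distance between two (x, y, z) coordinates, in the same unit."""
    return math.dist(atom_a, atom_b)


def within(atom_a, atom_b, cutoff):
    """True if the two atoms are no further apart than the cutoff."""
    return distance(atom_a, atom_b) <= cutoff


# Hypothetical coordinates in Å, not taken from the downloaded scene.
residue_325_atom = (0.0, 0.0, 0.0)
residue_461_atom = (4.0, 4.0, 5.6)

print(round(distance(residue_325_atom, residue_461_atom), 2))
print(within(residue_325_atom, residue_461_atom, cutoff=10.0))
```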