Mutator: automated extraction of mutation data from the literature

Background

Mutator automatically scans thousands of scientific articles to identify the protein and DNA mutations (both coding and non-coding) they mention. The resulting mutation dataset is available within 3DM and as a standalone product. The first version of Mutator was published in 2010 (https://doi.org/10.1002/humu.21317).

 

Mutator finds many references for each (human) mutation

Mutator scans hundreds of thousands of documents, so it finds many mutations and, for each mutation, many mentions across different articles.

 

Here we show an overview of the number of scientific papers identified for 3 different genes:

 

Number of articles identified by HGMD vs. Mutator for three genes.

 

Mutator identifies many more references containing mutations for human proteins/genes than HGMD does.

Of the distinct mutations retrieved from the literature, 20-30% are not present in the HGMD.

Mutation types

Mutator supports the retrieval of both amino acid (AA) mutations and DNA mutations. For DNA mutations both coding and non-coding mutations are supported.

  • AA mutations

  • CDS mutations

  • Intronic mutations

  • 5′ & 3′ UTR mutations

  • Up/down-stream mutations
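
To make the mutation types above concrete, here is a minimal Python sketch that sorts mutation mentions into those types. It assumes mentions are written in a simplified, HGVS-like notation (for example p.Arg97Cys or c.76A>T). The patterns, the PATTERNS table and the classify_mutation function are illustrative assumptions, not the patterns Mutator itself uses, and they cover single substitutions only.

```python
import re

# Simplified, HGVS-inspired patterns (assumptions for illustration only).
# Order matters: the more specific DNA patterns are tried before plain CDS.
PATTERNS = [
    ("AA", re.compile(r"^p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2}$")),   # p.Arg97Cys
    ("5' UTR", re.compile(r"^c\.-\d+[ACGT]>[ACGT]$")),            # c.-14C>T
    ("3' UTR", re.compile(r"^c\.\*\d+[ACGT]>[ACGT]$")),           # c.*46T>A
    ("Intronic", re.compile(r"^c\.\d+[+-]\d+[ACGT]>[ACGT]$")),    # c.88+2T>G
    ("CDS", re.compile(r"^c\.\d+[ACGT]>[ACGT]$")),                # c.76A>T
]


def classify_mutation(mention: str) -> str:
    """Return the mutation type for a mention, or 'unrecognized'."""
    for label, pattern in PATTERNS:
        if pattern.match(mention):
            return label
    return "unrecognized"


for example in ["p.Arg97Cys", "c.76A>T", "c.88+2T>G", "c.-14C>T", "c.*46T>A"]:
    print(example, "->", classify_mutation(example))
```

Up/down-stream mutations are left out of this sketch because telling them apart from UTR mutations generally requires the transcript boundaries, not just the notation.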

How to interpret Mutator scores for mutations found in the literature?

When a mutation is found in a paper, it must be assigned to a protein or gene record. A mutation can fit in many protein/gene sequences, so an algorithm is needed to assign it to the correct record.

Mutator uses a scoring metric to link mutations identified in text to their corresponding protein or gene records from UniProt/GenBank. A mutation that fits in multiple proteins/genes is assigned to the highest-scoring protein or gene. The features below contribute to this score:

  • Species mentioned in the text (required)

  • Mutation fits in the sequence (required)

  • Other mutations encountered in the paper also fit in the sequence

    • if 20 mutations are mentioned and they all fit in the sequence, each one raises the score of the others.

  • When the paper reports a DNA mutation together with its amino acid counterpart, the counterpart also fits in the sequence of this protein/gene

  • Gene names (or their synonyms) mentioned in the paper

    • extra points if these are in the title or in the sentence of the mutation

  • Protein/gene identifiers mentioned in the paper

    • extra points if these are in the title or in the sentence of the mutation

  • Protein descriptions mentioned in the paper

  • Related disease keywords mentioned in the paper

  • etc.

 

The score thus depends on the information available in the publication!
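
The idea of scoring candidate records and assigning a mutation to the highest-scoring one can be sketched as follows. This is a minimal Python illustration under stated assumptions, not Mutator's actual algorithm: the Candidate class, the function names, the record accessions REC_A and REC_B, the gene names and all weights (1.0 and 0.5) are hypothetical, and only a few of the listed features are modelled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical stand-in for a UniProt/GenBank record."""
    accession: str
    species: str
    sequence: str
    gene_names: tuple = ()


def fits(sequence, wild_type, position):
    """True if the wild-type residue sits at the 1-based position."""
    return 1 <= position <= len(sequence) and sequence[position - 1] == wild_type


def score_candidate(candidate, mutation, paper_mutations, title, text, species_in_text):
    """Return an illustrative score, or None if a required feature is missing."""
    wild_type, position = mutation
    # Required features: species mentioned and mutation fits the sequence.
    if candidate.species not in species_in_text:
        return None
    if not fits(candidate.sequence, wild_type, position):
        return None
    total = 1.0  # made-up base score
    # Other mutations from the same paper that also fit support this record.
    for other in paper_mutations:
        if other != mutation and fits(candidate.sequence, other[0], other[1]):
            total += 0.5
    # Gene names in the text, with extra points when they are in the title.
    for name in candidate.gene_names:
        if name in text:
            total += 1.0
        if name in title:
            total += 1.0
    return total


def best_candidate(candidates, mutation, paper_mutations, title, text, species_in_text):
    """Pick the highest-scoring candidate, or None if none qualifies."""
    scored = []
    for candidate in candidates:
        value = score_candidate(candidate, mutation, paper_mutations,
                                title, text, species_in_text)
        if value is not None:
            scored.append((value, candidate))
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[0])[1]


rec_a = Candidate("REC_A", "Homo sapiens", "MKTAYIAKQR", ("GENE1",))
rec_b = Candidate("REC_B", "Homo sapiens", "MKLVVAGSTQ", ("GENE2",))
mutations = [("K", 2), ("A", 4)]  # (wild-type residue, 1-based position)
title = "Novel variants in GENE1"
text = "We identified K2N and A4T in GENE1 in two patients."
best = best_candidate([rec_a, rec_b], ("K", 2), mutations,
                      title, text, {"Homo sapiens"})
print(best.accession)
```

In this toy example K2 fits both sequences, but only REC_A also fits A4 and has its gene name in the title and text, so it wins. The same paper with less context would produce a lower score for the same assignment.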

 

False discovery rate (FDR) analysis

Any automated approach will occasionally make a mistake. We have evaluated the false discovery rate (FDR) of Mutator for different Mutator scores. As shown below, mutations with a score higher than 7 are most likely correct, and the large majority of the mutations we find score higher than 20.

 

A false discovery rate (FDR) analysis of the mutations for three human proteins/genes resulted in the following:

  • Score 3.5–5: 10% correct

  • Score 5–7: 70% correct

  • Score 7–9: 92% correct

  • Score >9: 99% correct

 

Most mutations score higher than 20

 

The 3.5–5 bin contains many false positives; above that, we observe a more consistent frequency of mutations that have also been reported by the HGMD.
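
As an illustration of how these bins could be used when filtering results, the following Python sketch looks up the reported fraction of correct mutations for a given score. The bin values come from the analysis above. The function names, the min_fraction cut-off and the handling of boundary values (each bin includes its lower bound) are assumptions, and scores below 3.5 return None because no figure is reported for them.

```python
# (lower bound, upper bound, fraction correct) per score bin.
# Treating bins as half-open intervals [low, high) is an assumption.
BINS = [
    (3.5, 5.0, 0.10),
    (5.0, 7.0, 0.70),
    (7.0, 9.0, 0.92),
    (9.0, float("inf"), 0.99),
]


def expected_correct(score):
    """Return the reported fraction correct for a score, or None if unreported."""
    for low, high, fraction in BINS:
        if low <= score < high:
            return fraction
    return None


def keep_confident(scored_mutations, min_fraction=0.9):
    """Keep (mutation, score) pairs whose bin meets the hypothetical cut-off."""
    kept = []
    for mutation, score in scored_mutations:
        fraction = expected_correct(score)
        if fraction is not None and fraction >= min_fraction:
            kept.append((mutation, score))
    return kept


hits = [("R97C", 4.2), ("G12D", 8.1), ("V600E", 23.0)]
print(keep_confident(hits))
```

With the default cut-off, only mutations scoring 7 or higher are kept, which matches the observation that such mutations are most likely correct.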